Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2794
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Peter Kemper William H. Sanders (Eds.)
Computer Performance Evaluation Modelling Techniques and Tools 13th International Conference, TOOLS 2003 Urbana, IL, USA, September 2-5, 2003 Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Peter Kemper
Universität Dortmund, FB Informatik
44221 Dortmund, Germany
E-mail: [email protected]

William H. Sanders
University of Illinois at Urbana-Champaign
Coordinated Science Laboratory
Electrical and Computer Engineering Dept.
1308 West Main St., Urbana, IL 61801-2307, USA
E-mail: [email protected]
Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
CR Subject Classification (1998): C.4, D.2.8, D.2.2, I.6 ISSN 0302-9743 ISBN 3-540-40814-2 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP Berlin GmbH Printed on acid-free paper SPIN 10931875 06/3142 543210
Preface
We are pleased to present the proceedings of Performance TOOLS 2003, the 13th International Conference on Modelling Techniques and Tools for Computer Performance Evaluation. The series of TOOLS conferences has provided a forum for our community of performance engineers with all their diverse interests. TOOLS 2003, held in Urbana, Illinois during September 2–5, 2003, was the most recent meeting of the series, which in the past has been held in the following cities:

1984 Paris
1985 Sophia-Antipolis
1987 Paris
1988 Palma
1991 Turin
1992 Edinburgh
1994 Vienna
1995 Heidelberg
1997 Saint Malo
1998 Palma
2000 Chicago
2002 London
2003 Urbana
The proceedings of the TOOLS conferences have been published by Springer-Verlag in its LNCS series since 1994. TOOLS 2003 was the second conference in the series to be held in the state of Illinois, USA. It was one of four component conferences that met together under the umbrella of the 2003 Illinois Multiconference on Measurement, Modelling, and Evaluation of Computer-Communication Systems. Other conferences held in conjunction with TOOLS 2003 were the 10th International Workshop on Petri Nets and Performance Models (PNPM 2003), the International Conference on the Numerical Solution of Markov Chains (NSMC 2003), and the 6th International Workshop on Performability Modeling of Computer and Communication Systems (PMCCS-6). The format allowed for a number of joint components in the programs: the three keynote speakers, the tool demonstrations, the tutorials, and the social events were all shared by the participants of the multiconference. Moreover, the PNPM, TOOLS, and NSMC tracks of the multiconference ran concurrently, so that attendees could choose to attend whichever sessions of those component conferences they wished. For TOOLS 2003, the program committee consisted of 37 members, each of whom reviewed at least four papers to ensure a rigorous and fair selection process. From 37 submissions, 17 high-quality papers were selected as regular papers. The range of topics gave rise to sessions on tools for measuring, benchmarking, and online control, on tools for evaluation of stochastic models, on queueing models, on Markov arrival processes and phase-type distributions, and on tools for supporting model-based design of systems. In addition to the regular paper sessions, the multiconference included a session with brief presentations of tools (which were accepted by the tools chair) and two sessions with demonstrations of the tools. We were pleased to have Prof. 
David Nicol present his paper, co-authored with Michael Liljenstam and Jason Liu, entitled “Multiscale Modeling and Simulation of Worm Effects on the Internet Routing Infrastructure” as
the TOOLS 2003 keynote address. The three keynote addresses of the multiconference, including Prof. Nicol's talk and the presentations of Valeriy A. Naumov for NSMC 2003 and Jean Peccoud for PNPM 2003, were clearly highlights of the conference.

It is our pleasure to acknowledge the help of the many people who made this conference a successful event. We are grateful to the members of the Program Committee and the outside reviewers who gave in-depth reviews in the short time we all had. In particular, we would like to thank the PC members who actively participated in the PC meeting held at Schloss Dagstuhl in Germany; we believe that many of them will remember the unique atmosphere of the setting, which turned out to make the meeting very productive. More thanks are due to Tod Courtney, for managing the Web-based review process; to Jenny Applequist, for handling local arrangements; to Falko Bause, for arranging the tool presentations and demonstrations; and to Aad van Moorsel, for assembling a series of four excellent tutorials. Finally, we would like to thank the University of Illinois at Urbana-Champaign and its Coordinated Science Laboratory for hosting the conference and providing technical and financial support.

We are very pleased with the program that resulted from our preparations, and hope that you will find the papers in this volume interesting and thought-provoking.

June 2003
Peter Kemper Program Co-chair William H. Sanders General Chair and Program Co-chair
Organization
Chairs
General chair: William H. Sanders (UIUC, USA)
Program chairs: Peter Kemper (U Dortmund, DE), William H. Sanders (UIUC, USA)
Tutorials chair: Aad van Moorsel (HP Labs, USA)
Tools chair: Falko Bause (U Dortmund, DE)
Local arrangements chair: Jenny Applequist (UIUC, USA)
Steering Committee
Heinz Beilner (DE), Peter Harrison (UK), Boudewijn Haverkort (NL), Raymond Marie (FR), Ramon Puigjaner (ES)
Program Committee
Gianfranco Balbo (IT), Heinz Beilner (DE), Henrik Bohnenkamp (NL), Peter Buchholz (DE), Maria Carla Calzarossa (IT), Gianfranco Ciardo (USA), Adrian Conway (USA), Dan Deavours (USA), Susanna Donatelli (IT), Tony Field (UK), Reinhard German (DE), Günter Haring (AT), Peter Harrison (UK), Boudewijn Haverkort (NL), Jane Hillston (UK), Ravi Iyer (USA), Joost-Pieter Katoen (NL), Pieter Kritzinger (SA), Christoph Lindemann (DE), Raymond Marie (FR), Daniel Menasce (USA), Bruno Müller-Clostermann (DE), Brigitte Plateau (FR), Rob Pooley (UK), Ramon Puigjaner (ES), Jerome Rolia (USA), Gerardo Rubino (FR), Herb Schwetman (USA), Giuseppe Serazzi (IT), Markus Siegle (DE), Evgenia Smirni (USA), Connie Smith (USA), William J. Stewart (USA), Miklos Telek (HU), Kishor S. Trivedi (USA), Aad van Moorsel (USA), Murray Woodside (CA)
External Reviewers
Simona Bernardi, Matthias Beyer, Dongyan Chen, Shuo Chen, Paolo Cremonesi, Marco Gribaudo, Carlos Guerrero, Armin Heindl, Holger Hermanns, Kai-Steffen Hielscher, Andras Horvath, Gabor Horvath, William Knottenbelt, Matthias Kuntz, Christian Kurz, Kai Lampka, Luisa Massari, Andriy Panchenko, Theo C. Ruys, Matteo Sereno, Dave Thornley, Axel Thümmler, Shelley Unger, Wei Xie
Table of Contents
Keynote Presentation

Multiscale Modeling and Simulation of Worm Effects on the Internet Routing Infrastructure . . . 1
D.M. Nicol, M. Liljenstam, J. Liu

Tools for Measuring, Benchmarking, and Online Control

A Low-Cost Infrastructure for High Precision High Volume Performance Measurements of Web Clusters . . . 11
K.-S.J. Hielscher, R. German

MIBA: A Micro-Benchmark Suite for Evaluating InfiniBand Architecture Implementations . . . 29
B. Chandrasekaran, P. Wyckoff, D.K. Panda

WebAppLoader: A Simulation Tool Set for Evaluating Web Application Performance . . . 47
K. Wolter, K. Kasprowicz

A Comprehensive Toolset for Workload Characterization, Performance Modeling, and Online Control . . . 63
L. Zhang, Z. Liu, A. Riabov, M. Schulman, C. Xia, F. Zhang

Tools for Evaluation of Stochastic Models

Logical and Stochastic Modeling with SmArT . . . 78
G. Ciardo, R.L. Jones, A.S. Miner, R. Siminiceanu

The Peps Software Tool . . . 98
A. Benoit, L. Brenner, P. Fernandes, B. Plateau, W.J. Stewart

The Modest Modeling Tool and Its Implementation . . . 116
H. Bohnenkamp, H. Hermanns, J.-P. Katoen, R. Klaren

Queueing Models

An M/G/1 Queuing System with Multiple Vacations to Assess the Performance of a Simplified Deficit Round Robin Model . . . 134
L. Lenzini, B. Meini, E. Mingozzi, G. Stea

Queueing Models with Maxima of Service Times . . . 152
P. Harrison, S. Zertal
Heuristic Optimization of Scheduling and Allocation for Distributed Systems with Soft Deadlines . . . 169
T. Zheng, M. Woodside

Markovian Arrival Processes and Phase-Type Distributions

Necessary and Sufficient Conditions for Representing General Distributions by Coxians . . . 182
T. Osogami, M. Harchol-Balter

A Closed-Form Solution for Mapping General Distributions to Minimal PH Distributions . . . 200
T. Osogami, M. Harchol-Balter

An EM-Algorithm for MAP Fitting from Real Traffic Data . . . 218
P. Buchholz

The Correlation Region of Second-Order MAPs with Application to Queueing Network Decomposition . . . 237
A. Heindl, K. Mitchell, A. van de Liefvoort

Supporting Model-Based Design of Systems

EvalVid – A Framework for Video Transmission and Quality Evaluation . . . 255
J. Klaue, B. Rathke, A. Wolisz

A Class-Based Least-Recently Used Caching Algorithm for World-Wide Web Proxies . . . 273
B.R. Haverkort, R. El Abdouni Khayari, R. Sadre

Performance Analysis of a Software Design Using the UML Profile for Schedulability, Performance, and Time . . . 291
J. Xu, M. Woodside, D. Petriu
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Multiscale Modeling and Simulation of Worm Effects on the Internet Routing Infrastructure

David M. Nicol, Michael Liljenstam, and Jason Liu
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Abstract. An unexpected consequence of recent worm attacks on the Internet was that the routing infrastructure showed evidence of increased BGP announcement churn. As worm propagation dynamics are a function of the topology of a very large-scale network, a faithful simulation model must capture salient features at a variety of resolution scales. This paper describes our efforts to model worm propagation and its effect on routers and application traffic. Using our implementations of the Scalable Simulation Framework (SSF) API, we model worm propagation, its effect on the routing infrastructure, and its effect on application traffic using multiscale traffic models.
1 Introduction
The last two years have seen wide-scale worm infestations across the Internet, e.g., Code Red [1] (July 2001), Nimda [2] (September 2001), and SQL-Slammer [3] (January 2003). Worms of these types use an activity called scanning in order to propagate [6]. An infected host enters a loop in which it repeatedly samples an IP address at random, and either attempts to open a connection with a device at that address (success at which leads to a further attempt to infect the device, using another packet), or simply sends a packet which, if accepted by a susceptible and as-yet-uninfected device, infects it. Publicity surrounding these events focused on the effects on hosts and (in the case of Slammer) ATM machines. What is not commonly known is that the worm spread also affected devices that execute the protocol which determines how traffic is routed through the Internet. This protocol, the Border Gateway Protocol [7] (BGP), operates by having routers exchange prospective paths to every point in the Internet. Every message sent in the course of executing BGP is in principle the result of a failure or closing of a communication session—somewhere—between two routers. One such failure may cause a cascade of messages to propagate through the BGP routing infrastructure. So-called "withdrawal" messages are of particular interest, as these are proclamations that the sender no longer knows of any path to the subnetwork named in the withdrawal. Analysis of BGP message traffic across the Internet shows that the global BGP system generated an abnormally high number of messages while the worms scanned, and in some cases an abnormally high number of withdrawal messages.

P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 1–10, 2003.
© Springer-Verlag Berlin Heidelberg 2003
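The scanning loop described above can be illustrated with a toy simulation. The sketch below is our own illustration, not code from the paper or from SSFNet: the address-space size (2^16 instead of 2^32), the vulnerable-population size, and the per-step scan rate are all assumed values chosen only to make the dynamics visible at small scale.

```python
import random

def scan_step(infected, vulnerable, address_space, scans_per_step, rng):
    """One time-step of random scanning: every infected host fires
    scans_per_step probes at addresses drawn uniformly from the (toy)
    address space; a probe landing on a vulnerable, as-yet-uninfected
    address infects it."""
    newly = set()
    for _ in range(len(infected) * scans_per_step):
        target = rng.randrange(address_space)
        if target in vulnerable and target not in infected:
            newly.add(target)
    infected |= newly
    return len(newly)

rng = random.Random(42)
space = 1 << 16                       # assumed toy address space
vulnerable = set(rng.sample(range(space), 500))
infected = {next(iter(vulnerable))}   # patient zero
for _ in range(30):
    scan_step(infected, vulnerable, space, scans_per_step=20, rng=rng)
print(len(infected))
```

Run long enough, such a simulation traces the familiar exponential-growth-with-tail-off curve, since scans increasingly land on already-infected or invulnerable addresses.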
In the wake of the Code Red and Nimda events we became quite interested in understanding and modeling the relationship between the worm scans and BGP's behavior. In addition to the direct evidence of increased BGP activity, we received anecdotal evidence of increased router CPU utilization, and reports of router crashes. Router failures can explain the observed BGP behavior [5]. Routers run real-time operating systems, and it is known that excessive CPU utilization at the highest OS priority level frequently precedes router failure, as essential background services become starved. One of the explanations we developed—eventually corroborated by advisories from router manufacturers—is that cache thrashing was a leading contributor to router failure. All routers acquire a packet's forwarding port by lookup in a forwarding table. In high-end routers the entire forwarding table is in CPU-speed memory. Routers at the lower end of the price/performance spectrum use a high-speed cache to hold forwarding directions for recently seen destinations, and schedule a significantly slower lookup from slow memory—at high CPU priority—when a cache miss occurs. The scanning behavior of worms, when intense, destroys the locality of reference needed for any caching system; traffic diversity causes a high rate of cache misses. Address Resolution Protocol (ARP) behavior was likewise affected, as local area networks received worm packets whose addresses were legitimate to the subnetwork level, but whose full IP addresses failed to match any device in the subnetwork.

We initially developed simulation models of worm propagation, router dynamics, and BGP behavior with the goal of experimenting with hypothesized causes of the observed behavior. As those hypotheses were validated we expanded the effort to include other models of traffic, including application traffic critical to the support of network-wide data gathering and distribution. 
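The cache-thrashing effect is easy to reproduce in miniature. The sketch below (our own illustration, with assumed cache and traffic sizes, not code from the simulator) compares the miss rate of a small LRU forwarding cache under ordinary traffic with locality of reference against traffic whose destinations, like worm scans, are drawn uniformly from a huge space.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Toy forwarding cache: on a miss, the slow-path lookup result is
    inserted and the least-recently-used entry is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, dest):
        if dest in self.entries:
            self.entries.move_to_end(dest)   # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)
            self.entries[dest] = object()    # stand-in for a forwarding entry

    def miss_rate(self):
        return self.misses / (self.hits + self.misses)

rng = random.Random(1)
normal, scanning = LRUCache(256), LRUCache(256)
popular = list(range(300))                   # assumed small set of hot destinations
for _ in range(50_000):
    normal.lookup(rng.choice(popular))       # locality of reference
    scanning.lookup(rng.randrange(1 << 24))  # scan-like, no locality

print(normal.miss_rate(), scanning.miss_rate())
```

Under the scan-like workload nearly every lookup misses, so nearly every packet takes the slow, high-priority path; under the local workload the cache absorbs most lookups.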
This effort has resulted in a new package for the SSFNet[8] simulation system. In this paper we describe how this package models traffic at different temporal and physical scales, all within the same simulation system. We illustrate the need for multiscale modeling in an application that sought to assess the effectiveness of worm defense mechanisms in a large-scale (but private) network.
2 Application: Effectiveness of Worm Defenses
The worms seen to date have heightened awareness of their threat, and raised interest in deploying defensive mechanisms for detecting and containing them. Simulation and modeling provide a critically needed approach for assessing the effectiveness of contemplated measures. We consider an example of an international-scale enterprise network (meaning that the network is isolated from the rest of the Internet). The network owner is able to deploy specialized hardware or software to detect worm scans, and to react to them by quarantining subnetworks suspected of being infected. The sorts of questions the owning organization asks include:

– Can the counter-measures be effective at stopping worm spread? If so, how does one optimize the placement and parameters of those counter-measures?
– What are the effects on critical network applications when worms attack, and what impact do the counter-measures have on those applications?
– What are the tradeoffs between the cost of defense and the risk of having no defense?

We consider two specific mechanisms for detecting worm scans. One of them uses a slight modification to router software which sends a "blind-copy" ICMP control message to a repository when the router is presented with a packet whose destination cannot be reached [4]. The idea here is that random scans will select IP addresses that don't exist, producing so-called "back-scatter". If one knows what fraction of the IP space in a subnet is unreachable, and assumes some sampling distribution for the scans, then measured misses can be used to estimate the scanning intensity. This estimated intensity can be thresholded, and reaction taken when suspicion of scans is high. Another mechanism requires specialized hardware. Some fraction of the packets at a network access point is diverted to the device, where analysis is done (e.g., on source/destination pairs and hash functions of packet content) to produce one or more signatures for a packet, which are put into a cache. The idea is that packets from a common infestation have a great deal of structural similarity (e.g., the infection payload), so that detection of an abnormally high number of similar packets may signal the presence of a worm. Further sophistication is possible when these network devices fuse information from back-scatter with common-content signals, and analyze global propagation growth patterns to provide early warning of a worm's advance. In order to answer the questions listed earlier we will need to address pressing modeling issues. 
Worm spread is dependent on the distribution of vulnerable hosts throughout the enterprise, the nature of the mechanisms it uses to spread, the effect the worm has on the network infrastructure, how the infrastructure reacts to the impact the worm has on it, and the topology of that infrastructure. Thus we see that to accurately capture worm dynamics we will have to model network-wide topology in sufficient detail to capture the interactions between worm and infrastructure. The size of the network forces us to model worm scanning traffic at a fairly coarse time-scale. At another scaling extreme, we need to model the impact that worm traffic has on the caching behavior of individual devices. Our interest spans the ISO stack model as well. We are interested in the behavior at the highest layer (the application layer), as well as behavior at the next-to-lowest layer (the data-link layer).
3 Multiscale Models
To execute multiscale network simulation models it is necessary to construct specialized models of routing devices, designed to simultaneously manage traffic flows with different characteristics, at different levels of abstraction, while properly interacting with each other.
3.1 Traffic
The approach we take is to first differentiate between individual packets (or frames) and abstracted flows. Packets and frames are handled individually. A "flow" gives a higher level of abstraction: at any point in simulation time, at any point in the network, a flow is characterized by the rate at which bits are moving. Flows may have other characteristics as well:

– fixed update epoch—a fixed time-step after which a flow's rate is updated;
– dynamic changes—some flows may alter rates dynamically;
– interactive—a flow's rate change is a function of dynamic quantities in the simulation;
– target subnet(s)—a set of IP prefixes may be specified as the targeted recipients of the flow. This general form allows specification of a single destination IP address, or a number of entire subnets, as the destination;
– scanning behavior—some flows may represent worm scans.

The first distinguishing characteristic of a flow is whether changes to its rate happen at fixed epochs, or dynamically as a function of the simulation state. While time-stepped updates are the norm, we have also developed discrete-event fluid formulations of UDP and TCP which allow for arbitrary spacing between update instants. A second characteristic is whether a rate update depends on the model state or not. It is legitimate to model some fraction of background traffic with rate functions that are completely specified before the simulation is run, and which are not altered while the simulation is running. These flows help provide context for the evolution of other flows whose behavior is of more interest. Interactive flows have rate updates that affect, and are affected by, other elements of the model state. A third characteristic of a flow is its destination set; we allow a flow to have an arbitrary number of destination prefixes. A fourth characteristic is scanning behavior. We describe the topology of a scan as a one-to-many flow. The destination set is described by a set of IP prefixes (subnets). 
This allows us to model a worm that has a target list, as well as a worm whose scans are purely random. Scanning flows are notable in that the flow splits at a router, with the destination prefixes and incoming flow rate being partitioned among outgoing links. The fraction of the incoming scan flow carried along an outgoing link is identically the fraction of the incoming scan flow's target IP addresses that are reached over that link. Standard multicast can be modeled with a minor variation which does not split the incoming flow rates, but duplicates them on outgoing links.

3.2 Routers
Our implementation of SSFNet contains routers that handle diverse traffic models concurrently, modeling their interactions. Packet flows may be part of this mix. The state of the fluid flows in the router is used to estimate packet loss probability, queueing delay, and bandwidth allocation. The packet streams affect the interactive fluid flows by having a fluid representation whose rates are based on observed packet arrival behavior.
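The scan-flow splitting rule from Sect. 3.1 can be sketched concretely. In the fragment below the prefix sizes, link names, and dict-based routing table are hypothetical stand-ins for the simulator's internal structures; only the proportional-partitioning rule itself comes from the text.

```python
def split_scan_flow(rate, target_prefixes, next_hop_of):
    """Partition an incoming scan flow's rate across outgoing links in
    proportion to the number of targeted addresses reachable over each
    link, as the fluid scan-flow model prescribes.

    target_prefixes: dict mapping prefix -> number of addresses it covers
    next_hop_of:     dict mapping prefix -> outgoing link (routing decision)
    """
    total = sum(target_prefixes.values())
    out = {}
    for prefix, size in target_prefixes.items():
        link = next_hop_of[prefix]
        out[link] = out.get(link, 0.0) + rate * size / total
    return out

# Hypothetical example: a /16 and two /24s routed over two links.
prefixes = {"10.1.0.0/16": 1 << 16, "10.2.3.0/24": 1 << 8, "10.2.4.0/24": 1 << 8}
routes = {"10.1.0.0/16": "link-A", "10.2.3.0/24": "link-B", "10.2.4.0/24": "link-B"}
shares = split_scan_flow(1000.0, prefixes, routes)   # 1000 scans/s entering
```

The /16 covers 256 times as many target addresses as each /24, so link-A carries nearly all of the scan rate; multicast, as the text notes, would instead copy the full rate onto every outgoing link.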
BGP-speaking routers are modeled to capture the interaction between worm scan traffic and router behavior. This model is also multiscale. It contains a detailed model of BGP operations, with BGP speakers communicating using full packets over TCP. The processing time of BGP messages is governed by an estimate of background CPU utilization. CPU utilization goes up as the intensity of scan traffic coming in or going out increases. BGP memory utilization is modeled too, as a function of the number of flows and the rate of scan traffic. Our model of a BGP speaker has an artificial layer in the stack model which triggers router failure. The CPU and memory utilization states are checked periodically (e.g., every simulated second), and a decision is made randomly whether to fail the router. The probability of failure is a function of the CPU and memory utilizations, naturally being monotone non-decreasing as either of those utilizations increases. Upon failure a router remains inoperative for a down time, typically measured in tens of minutes. This simple model is intended to capture the complex dynamics, within a router, of the effects that scanning traffic has on it, from both the ingress and egress sides. BGP speakers that share sessions are required to send each other messages (if only "I'm alive") every 30 seconds. The rest of the BGP system notices a failed router by sensing that it is no longer sending messages. This in turn triggers BGP announcements concerning subnetworks whose accepted paths passed through the failed router. New paths for those subnetworks are announced if possible; otherwise withdrawals are announced. Thus there is a causality chain, where scan traffic intensity affects utilization, which affects BGP processing costs and a failure model, both of which affect BGP behavior. In turn, BGP affects scan intensities, because as BGP modifies forwarding tables it affects the paths that scan flows take.
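A minimal sketch of such a failure model follows. The text requires only that the failure probability be monotone non-decreasing in the utilizations; the linear-above-threshold form, the 0.7 threshold, the per-check maximum of 0.05, and the 30-minute down time below are illustrative assumptions, not values from the paper.

```python
import random

def failure_probability(cpu_util, mem_util, p_max=0.05, threshold=0.7):
    """Per-check failure probability: zero below an assumed utilization
    threshold, rising linearly to p_max as the worse of the two
    utilizations approaches 1.0 (monotone non-decreasing in both)."""
    stress = max(cpu_util, mem_util)
    if stress <= threshold:
        return 0.0
    return p_max * (stress - threshold) / (1.0 - threshold)

def check_router(router, now, rng, down_time=1800.0):
    """Called once per simulated second; may transition the router to a
    failed state lasting down_time seconds (tens of minutes).
    Returns True if the router is up after the check."""
    if router.get("up_again", 0.0) > now:
        return False                       # still down
    if rng.random() < failure_probability(router["cpu"], router["mem"]):
        router["up_again"] = now + down_time
        return False
    return True

router = {"cpu": 0.95, "mem": 0.40}
up = check_router(router, now=0.0, rng=random.Random(3))
```

During the down time the router sends no keepalives, so after the 30-second hold time its peers withdraw or re-announce the affected routes, reproducing the causality chain described above.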
3.3 Worm Infection
Our worm infection model takes an abstract view of subnetworks (unless the modeler specifies a subnetwork in detail). For most global purposes the key attributes of a subnetwork are how many vulnerable devices it contains and how many are infected (with the possible addition of state models describing how many devices are in each state of a finite-state-machine description of an infected host's behavior). We advance propagation dynamics in a time-stepped fashion, at each time-step calculating the incoming scan intensity to a subnetwork. This intensity is used, in conjunction with knowledge of the size of the IP space being scanned (which contains the subnetwork's address space) and the number of vulnerable-but-uninfected devices in that subnetwork, to randomly choose the number of devices which newly become infected. One might also associate an infection duration with a device, to model detection of the infection and removal (either of the worm or of the device). An important point to consider is that the total scanning rate into a subnetwork is a function of all infected subnetworks throughout the entire network. This single fact forces us to model infection behavior at a high level of abstraction.
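One plausible realization of this time-stepped update is sketched below. If n scans in a step fall uniformly on a scanned space of size S, each vulnerable-but-uninfected host is hit at least once with probability 1 − (1 − 1/S)^n, and the number of new infections is a binomial draw over the vulnerable population. The per-host scan rate, population sizes, and one-second step are illustrative assumptions; the paper does not specify the sampling distribution.

```python
import random

def new_infections(scan_rate, dt, scanned_space, vulnerable, rng):
    """One time-step of the abstract subnet model: scan_rate * dt scans
    fall uniformly on an address space of size scanned_space; each of
    the `vulnerable` uninfected hosts is hit (and infected) with
    probability 1 - (1 - 1/scanned_space)**n_scans. The binomial draw
    is done host-by-host, which suffices at this level of abstraction."""
    n_scans = scan_rate * dt
    p_hit = 1.0 - (1.0 - 1.0 / scanned_space) ** n_scans
    return sum(1 for _ in range(vulnerable) if rng.random() < p_hit)

rng = random.Random(7)
infected, vulnerable = 6, 10_000
for _ in range(100):                      # 100 one-second time-steps
    # In the full model the scan intensity aggregates contributions from
    # ALL infected subnets; here one pool scans at an assumed 4000 probes/s
    # per infected host over the whole IPv4 space.
    rate = infected * 4000.0
    k = new_infections(rate, 1.0, 1 << 32, vulnerable, rng)
    infected, vulnerable = infected + k, vulnerable - k
print(infected)
```

Because the incoming intensity depends on every infected subnet in the network, this per-subnet update has to be driven by a global aggregation step, which is exactly the coupling the text identifies.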
3.4 Worm Detection
As we have described earlier, worm scans may be detected by the presence of common data payloads and by presence of backscatter. Our models allow the representation of such devices, and the communication between them, with the effect that when certain thresholds are reached a subnetwork that is suspected of being infected quarantines itself. When this happens no traffic enters or leaves the subnetwork. The idea behind quarantine is to contain the worm in those subnetworks it has infected, leaving the uninfected networks fully operational. Just as the flow intensities of scan traffic affect the state of a router and so affect BGP, those intensities help to determine the rate of backscatter detection, and the rate of common content detection. The estimates of those rates are thresholded to trigger communication between detection devices, and eventually quarantine of some subnetworks.
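A thresholded back-scatter detector of the kind described above can be sketched as follows. The unbiased estimator (miss rate divided by the fraction of unused address space) follows from the uniform-sampling assumption in Sect. 2; the EWMA smoothing, the threshold value, and the quarantine flag are our own illustrative assumptions.

```python
class BackscatterDetector:
    """Estimates scan intensity from ICMP unreachable ('back-scatter')
    reports. If a fraction f of the subnet's address space is unused and
    scans sample addresses uniformly, observed misses/second is about
    f * (scans/second), so dividing the miss rate by f recovers the
    scan rate."""
    def __init__(self, unused_fraction, threshold, alpha=0.3):
        self.f = unused_fraction      # fraction of IP space that is unreachable
        self.threshold = threshold    # scans/s above which we react (assumed)
        self.alpha = alpha            # EWMA smoothing factor (assumed)
        self.estimate = 0.0
        self.quarantined = False

    def observe(self, misses, dt):
        raw = misses / (dt * self.f)              # unbiased scan-rate estimate
        self.estimate += self.alpha * (raw - self.estimate)
        if self.estimate > self.threshold:
            self.quarantined = True               # cut all traffic in and out
        return self.estimate

det = BackscatterDetector(unused_fraction=0.25, threshold=500.0)
for misses in [2, 3, 1, 400, 450, 500]:           # misses per 1-second interval
    det.observe(misses, dt=1.0)
```

With these numbers the estimate stays near zero under the trickle of ordinary misdirected packets, then crosses the threshold within two intervals of the scan onset, triggering quarantine. In the full model such triggers are also communicated between detection devices before a subnetwork is isolated.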
4 Network State and Application Behavior
We have considered two kinds of network applications, both modeling functionality that provides a Common Operational Picture to the enterprise (that is, the availability of all data, to any place in the enterprise). Three types of application traffic are modeled. The first is point-to-point traffic, as would be used to browse web sites. We also represent multi-source to single destination traffic, which models the convergence of information to some critical decision maker in the enterprise. We also represent single source to multiple destination traffic, to model hot spots in data provisioning. The network state affects networked applications through its impact on the bandwidth they receive, the end-to-end latency, and packet loss. These variables all are affected by the intensity and placement of worm scans, the failure of routers, and access to IP addresses representing application traffic sources and sinks.
5 Example
We have used the concepts described in this paper to evaluate the impact of a fast-moving worm attack on a large network modeled loosely on the NIPRNet. Our model involves 130 Autonomous Systems (AS), 3233 routers (of which 559 are BGP speakers), and represents 163 LANs. The worm dynamics are modeled after Slammer, which essentially infected all vulnerable machines in a few minutes. The experiment begins with 6 infected hosts in one AS, latent until time 300 seconds, after which they begin to scan (the first 300 seconds are used in the simulation to allow BGP to initialize and converge on forwarding tables. In this experiment we use a variety of traffic models. The background traffic was generated off-line using fluid models of TCP where the background flows interacted with each other. Time series of bandwidth use during these experiments were recorded, and are used as non-interactive background traffic in the worm experiments. The time-step for background traffic rate updates is approximately 5
Multiscale Modeling and Simulation of Worm Effects 100000
7
unprotected defended
Number of Infected Hosts
10000
1000
100
10
1 0
100
200
300
400
500
600
700
800
900
1000
Fig. 1. Number of infected hosts as a function of time, with and without quarantine defenses
seconds. Application traffic is fluid-based, but is interactive. Rate updates occur every second. BGP message traffic is packet oriented, and is handled discretely rather than continuously. This model uses a variety of resolutions for devices as well. BGP speaking routers are modeled in detail. Non-BGP speaking routers are represented more in terms of the effects they have on traffic bandwidth and latency than they are in terms of actual forwarding tables. Non-BGP routers within an AS are assumed to use the OSPF protocol, which essentially maintains shortest-path information within an AS. Device failures cause our simulation to recompute the shortest path information. LANs are represented very abstractly, with just enough detail and state information to capture worm propagation dynamics. We assume that the outbound link between the LAN and the rest of the Internet is the bottleneck on flows entering and leaving the LAN, and so do not explicitly model the interior of the LAN (although this would certainly be possible, as would a mixture of abstract and concrete LAN representations). Figure 1 shows the number of infected hosts (on a log scale) as a function of time, for both the situation where we deploy worm defenses and when we don’t. The unprotected case shows the characteristic exponential growth curve with tail-off we expected of such worms. We see that the worm defenses detect the worm just over a minute after it begins scanning, and effectively isolates the infected networks. Figure 2 illustrates the aggregate bandwidth consumed by all applications. The offered load is shown; the other two curves describe the behavior with and without defenses. The variation in application throughput is largely due to vari-
8
D.M. Nicol, M. Liljenstam, and J. Liu Application Traffic 400 350
offered load throughput/unprotected throughput/defended
Transfer Rate in Mbps
300 250 200 150 100 50 0 0
100
200
300
400
500
600
700
800
900
1000
Fig. 2. Aggregate delivered bandwidth as a function of time, with and without quarantine defenses
ation in non-interactive background traffic. These curves are identical until the time (around 380 seconds) when the detection mechanisms quarantine infected networks. For a short time after this, the aggregate application throughput of the unprotected case is larger than that of the protected case, which dropped when the quarantines were established. As the worm spreads it consumes most bandwidth at bottleneck points where LANs attach to the Internet, so the bandwidth available to application traffic decreases. A significant drop occurs around time 930 seconds, when a router fails and isolates an important subnetwork.
Figure 3 shows the network performance from the viewpoint of an IP address that is the destination for a large set of concurrent transfers. The IP address is one generated by our simulation; any resemblance to an actual network of that name is purely coincidental! The y-axis plots the fraction of those transfers that are live as a function of time (on-off cycling explains why we don't observe 100% activity prior to the worm's spread at time 300 seconds). The unprotected and defended curves track each other until infected subnetworks are quarantined, at which point the defended curve drops slightly but appears to stabilize, while the unprotected curve declines, again showing a significant drop shortly after time 930, where we have already seen other effects of the router failure. Figure 4 likewise focuses on performance from the perspective of a single device, in this case the source of a multicast video stream (again with an artificial IP address). Here we plot on the y-axis the fraction of packets lost among all of the streams it pushes. Once again we see that after quarantine the loss rate stabilizes, while in the system with no defenses the loss rate tends to grow, with a significant jump after the router failure.
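The characteristic exponential growth curve with tail-off mentioned above can be reproduced with a simple discrete-time SI (susceptible-infected) model of random scanning. The sketch below is an illustration of that dynamic only, not the model used in our simulation; all names and parameter values are hypothetical.

```c
/* Discrete-time SI ("susceptible-infected") sketch of a random-scanning
 * worm: each infected host scans `rate` addresses per tick in an address
 * space of size `space`; hits on still-susceptible hosts infect them.
 * All parameter values are illustrative, not taken from the paper. */
double si_step(double infected, double susceptible, double rate, double space) {
    double hits = infected * rate * (susceptible / space); /* expected new infections */
    return hits > susceptible ? susceptible : hits;
}

/* Run the epidemic for `ticks` steps starting from i0 infected hosts
 * among n vulnerable ones; returns the final infected count. */
double si_run(double i0, double n, double rate, double space, int ticks) {
    double infected = i0, susceptible = n - i0;
    for (int t = 0; t < ticks; t++) {
        double d = si_step(infected, susceptible, rate, space);
        infected += d;
        susceptible -= d;
    }
    return infected; /* grows exponentially, then tails off as susceptibles run out */
}
```

With a small address space the curve saturates quickly; with a realistic 2^32 space and a low scan rate, growth is correspondingly slower, which is the qualitative shape seen in the unprotected case.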
Multiscale Modeling and Simulation of Worm Effects
[Plot "Data Transfers to 9.96.0.0": percentage of connections alive (0%-90%) versus time (0-1000 s); curves: unprotected, defended.]
Fig. 3. Fraction of working transfers to 9.96.0.0 with and without quarantine defenses
[Plot "Video Stream from 9.128.0.0": packet loss rate (0-1) versus time (0-1000 s); curves: unprotected, defended.]
Fig. 4. Packet loss rate on streams from 9.128.0.0 with and without quarantine defenses
6
Conclusions
Multiscale modeling of network traffic and device behavior is essential if one is to capture the detailed effects that a large-scale Internet event such as a worm attack may have. This paper sketches our present approach in the context of the SSFNet simulation system. We illustrate the concepts using an example where one wishes to assess the effectiveness of worm detection and defense mechanisms. The network considered is very large, yet through aggressive modeling techniques the whole simulation model can be handled on a laptop-class computer.
Acknowledgements. The authors thank BJ Premore for all the help he has given supporting our work with BGP, and for his development of the BGP implementation in SSFNet. This research was supported in part by DARPA Contract N66001-96-C-8530, NSF Grant ANI-98-08964, NSF Grant EIA-98-02068, Dept. of Justice contract 2000-CX-K001, and Department of Energy contract DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
A Low-Cost Infrastructure for High Precision High Volume Performance Measurements of Web Clusters

Kai-Steffen Jens Hielscher and Reinhard German

Department of Computer Science 7 (Computer Networks and Communication Systems), University of Erlangen-Nürnberg, Martensstraße 3, D-91058 Erlangen, Germany
{ksjh,german}@informatik.uni-erlangen.de

Abstract. We present a software monitoring infrastructure for distributed web servers that performs precise measurements of one-way delays of TCP packets in order to parameterize, validate and verify simulation and state-space based models of the system. Frequency synchronization of the clocks alone is not sufficient; both frequency and phase (offset) of all clocks have to be in sync to correctly estimate distributions of the involved delays. We present a cost-effective combination of standard methods with our own additions and improvements to achieve this goal. The solution we developed uses only one GPS receiver in combination with NTP and the PPS-API for the time synchronization of all object systems' clocks, with a precision in the range of one microsecond. The timestamping and the generation of the event trace are done in a Linux kernel module. In addition, example measurements generated with our infrastructure are presented.
1 Introduction
Despite the economic problems in the computer and telecommunication markets today, the user base of the Internet is constantly growing. In parallel, prices for off-the-shelf PC products keep dropping while their performance increases. One way to cope with the high demand for Internet services is to set up a cluster of commodity PC products as a distributed web server. We built a cluster-based web server in our laboratory to gain insight into the timing mechanisms of distributed web servers using both measurement and modeling. The measurements enable us to do the input modeling for simulation and analytical models, to calibrate the models and to validate their results. For this purpose we need one-way measurements of packet delays with fine granularity in the range of microseconds.
One possibility is to use hardware monitoring [8,5], where a specialized hardware device with a built-in clock generates timestamps for signals observed at some interface of the system under test. While this solution provides precise timestamps, it can only be applied to systems located in one place, and it is hard to obtain all the information needed to identify individual packets at different nodes of the system (e.g. TCP sequence numbers). Currently available systems have very limited capacity for storing the event trace, so long-time, high-volume measurements are impossible. These disadvantages are shared by the hybrid monitoring approach, where the event recognition is done in software and the timestamping is done in hardware.
When using software monitoring, the object system generates the necessary timestamps in software. But for determining one-way delays in a distributed system, it is necessary that all systems share a common time base, either by using a global clock or by synchronizing the local clocks. We evaluated the possibility of obtaining timestamps from a global, GPS-controlled clock, but it took several microseconds to get one single timestamp from the clock system, a value too high to be tolerable in our application. So software monitoring with disciplined local clocks of the object system is the method of choice in the work described in this paper. For this purpose we use a combination of the NTP system [11], the PPS-API, which is specified in RFC 2783 [13] and mainly used for connecting primary reference time sources to stratum-1 NTP servers, and a GPS receiver system. The PPS-API kernel patch for the Linux operating system [20] not only provides a mechanism for detecting external PPS pulses but also improves the timing resolution from one microsecond to one nanosecond using the timestamp counter (TSC) that is available in all modern PC processors. Since we instrumented the netfilter framework [16] part of the Linux kernel, timestamping can be done in kernel mode without context switching. For this purpose we allocate kernel-space memory as a buffer for the event trace. This buffer implements a character device that can be read by a user-space daemon; this reading process is not time-critical.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 11-28, 2003. © Springer-Verlag Berlin Heidelberg 2003
We took this approach because most other time synchronization projects concentrate mainly on synchronizing the frequency of the clocks and not the phase. But for determining the distribution parameters of one-way delays for input modeling, it is crucial to synchronize the offset of the clocks with high accuracy. Furthermore, experiments with a number of different system architectures showed that although it is possible to compensate for the long-term frequency drift of the clocks [17], large time offset differences of the local clocks can be observed when just the systematic drift of the oscillators is compensated and the clocks are running undisciplined, even when the systems are located next to each other in an air-conditioned, temperature-controlled server room. This shows the necessity of using some form of synchronization of the clocks based on the current frequency or offset of the local clocks when high-precision timestamps have to be obtained.
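Compensating the systematic (long-term) drift amounts to a least-squares fit of measured clock offsets against time and subtracting the fitted line. A minimal sketch under that assumption, with names of our own choosing; it removes only the mean frequency error, not the short-term phase wander discussed above:

```c
/* Least-squares fit of offset(t) ~ a + b*t over n samples; b estimates
 * the systematic frequency error (drift) of the local clock relative to
 * the reference clock.  Illustrative sketch only. */
typedef struct { double a, b; } drift_fit;

drift_fit fit_drift(const double *t, const double *offset, int n) {
    double st = 0, so = 0, stt = 0, sto = 0;
    for (int i = 0; i < n; i++) {
        st += t[i]; so += offset[i];
        stt += t[i] * t[i]; sto += t[i] * offset[i];
    }
    drift_fit f;
    f.b = (n * sto - st * so) / (n * stt - st * st); /* drift in s/s */
    f.a = (so - f.b * st) / n;                       /* initial offset */
    return f;
}

/* Residual offset after removing the fitted systematic drift. */
double detrended(double t, double offset, drift_fit f) {
    return offset - (f.a + f.b * t);
}
```

With undisciplined clocks, the residuals returned by `detrended()` are exactly the offset differences that remain visible in practice and that motivate phase synchronization.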
2 The Web Cluster
A distributed web server needs at least one load-balancing node that distributes incoming user requests to several nodes that process the requests, denoted here as real servers. The most common approach to load balancing is the DNS-based load-balancing mechanism, where the hostname of the server is resolved to different IP
Fig. 1. Distributed Web Server Architecture
addresses belonging to different machines. The drawback of this method, known as round-robin DNS, is that the time-to-live entry for the DNS record must be small to avoid asymmetrically balanced load. For this reason, the entry is only cached for a short time and frequent name resolution processes are needed [4].
2.1 The Linux Virtual Server System
In our solution we use a routing-based approach that is better suited for local load balancing, where all servers are located in geographical proximity. The Linux Virtual Server [10] system is an Open Source project that supports load-balancing of various IP-based services and supports out-of-band transmission (e.g. for FTP) and persistent connections (e.g. for SSL). It is a layer-4 switching system where routing decisions are based on the fields of TCP or UDP headers, such as port numbers. The whole distributed web server carries a single IP address called the Virtual IP Address (VIP). Requests sent to this address are balanced among the real servers carrying the Real IP Addresses (RIPi). Three mechanisms for load-balancing are available:
• Network Address Translation,
• IP Tunneling and
• Direct Routing.
Network Address Translation (NAT) is a method specified in RFC 1631 [6] for mapping a group of n IP addresses with their TCP/UDP ports to a group of m different IP addresses (n-to-m NAT). When used for load balancing, the VIP is assigned to the load balancer only. This node receives all incoming packets, selects the IP address of a real server according to a chosen scheduling algorithm,
14
K.-S.J. Hielscher and R. German
creates an entry in a connection table, changes the destination address (and optionally the port) of the packet to the chosen RIPi and routes it to the selected real server. The connection table is used to forward packets of the same client to the same real server (an HTTP request can consist of several packets) and the answer packets to the right client. The load balancer is used as the standard gateway for the real servers. When packets belonging to replies arrive, the source address is changed to the VIP and the packet is forwarded to the client via the Internet. NAT involves rewriting the packets twice and is limited to geographically close nodes, since the load balancer has to be the gateway for all real servers.
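The bookkeeping just described can be sketched as a round-robin scheduler backed by a connection table, so that every packet of one client connection is rewritten to the same real server. This is a toy user-space model with our own names and sizes; the real LVS keeps this state in a kernel hash table:

```c
/* Toy model of the load balancer's NAT bookkeeping: round-robin
 * scheduling plus a connection table keyed by (client IP, client port).
 * Sizes and the linear lookup are illustrative simplifications. */
#define MAX_CONN 1024

typedef struct {
    unsigned int   client_ip;
    unsigned short client_port;
    int            real_server;   /* index of the chosen RIP */
} conn_entry;

static conn_entry table[MAX_CONN];
static int n_conn = 0;
static int rr_next = 0;

/* Return the real server for this connection, creating a table entry
 * via round-robin on the first packet of the connection. */
int schedule(unsigned int client_ip, unsigned short client_port, int n_servers) {
    for (int i = 0; i < n_conn; i++)
        if (table[i].client_ip == client_ip && table[i].client_port == client_port)
            return table[i].real_server;          /* existing connection */
    int rs = rr_next;                             /* new connection: next RIP in turn */
    rr_next = (rr_next + 1) % n_servers;
    if (n_conn < MAX_CONN) {
        table[n_conn].client_ip = client_ip;
        table[n_conn].client_port = client_port;
        table[n_conn].real_server = rs;
        n_conn++;
    }
    return rs;
}
```

The same table, consulted in the reverse direction, tells the balancer which reply packets get their source address rewritten back to the VIP.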
Fig. 2. Load Balancing via NAT
Tunneling and Direct Routing cause less overhead because the packets sent by the real servers do not have to pass the load balancer. Since our load balancer does not reach saturation even with the NAT approach, we did most of our measurements with NAT. Details about the other two methods can be found in [10].
The Linux Virtual Server system offers different scheduling algorithms: Round Robin, Weighted Round Robin, Least Connection, Weighted Least Connection, Locality-based Least Connection, Locality-based Least Connection with Replication, Destination Hashing and Source Hashing scheduling. While the first four algorithms can be used for any IP-based service, the latter four are intended for cluster-based caching proxy servers.
The system is implemented as a Linux kernel patch that is integrated into the netfilter framework. This framework is used for the manipulation of IP packets for firewalling and NAT. The kernel part can be configured using the user mode tool ipvsadm. Only the load balancer needs to run the Linux operating system; the real servers can operate under any OS that supports the necessary features, like IP-IP encapsulation for Tunneling or non-ARPing interfaces for Direct Routing. In addition to monitoring the state of the real servers and removing them from the scheduling in case of an error, there are different software add-ons that can be used to implement a fail-over solution for the load balancer for high availability [10].
An identical configuration of all machines simplifies the development of a time synchronization solution. Therefore we use Linux with a 2.4.x kernel version on all machines for our measurements. While other operating systems can create non-ARP interfaces without any modification, a special hidden-patch for Linux [1] is needed for the real servers with Direct Routing. Our measurements were done serving static content with the Apache webserver.
2.2 Hardware Setup
The hardware we use in our project consists of two load balancers (for fail-over) with the following main components:
• SMP mainboard with ServerWorks ServerSet III LE chipset with 64-bit PCI bus,
• two Intel Pentium III processors with 1 GHz each,
• 512 MB SD-RAM PC133 memory,
• two 1000-Base-SX network interface cards with Alteon AceNIC chipset with 64-bit PCI interface,
• on-board 100-Base-TX NIC with Intel chipset for management purposes.
We use ten real servers and one NTP server with identical hardware:
• mainboard with VIA Apollo KT133 chipset (VT8363A north bridge and VT82C686B south bridge),
• AMD Athlon Thunderbird processor with 900 MHz,
• 256 MB SD-RAM PC133 memory,
• two 3Com 100-Base-TX PCI network interface cards.
We use a 24-port Cisco Catalyst 3500XL switch with two 1000-Base-SX GBIC modules to interconnect the Internet router, the load balancers and the real servers. It supports the use of SNMP and RMON for monitoring the switch internals. The 100-Base-TX NICs used for management purposes are connected to another switch to minimize the influence of management traffic on our measurements.
3 Measurement Infrastructure
Our measurement solution is based on using standard time synchronization tools such as NTP with a GPS receiver and the PPS-API. We will describe modifications of the standard components that can improve timekeeping accuracy, such
as dynamic PPS ECHO feedback and a new Linux driver for PPS recognition using the parallel port. Furthermore, we show how we instrumented the TCP stack in the kernel and the user mode load generation and web server software to generate high-resolution timestamps derived from the timestamp counters (TSC) available in most modern processor architectures.
In addition to the hardware described in the previous section, we can equip our nodes with four Meinberg GPS 167PCI GPS receivers. All receivers can be connected to a roof-mounted GPS antenna through a 4-port antenna splitter. The usual solution for synchronizing the clocks of a number of nodes is to use NTP over a network connection [11]. The accuracy of this method is limited to a few milliseconds. Another possibility would be to equip each of our nodes with one GPS receiver. In our scenario, this would involve using at least 12 GPS receiver cards and three roof-mounted antennas, since the number of receivers that can be connected to one antenna is limited by the low signal strength of the GPS transmissions. Besides the high costs caused by this solution, a synchronization of the system clocks would also be necessary in this setup, since our measurements have shown that directly reading the time from GPS receiver cards takes a considerable amount of time and disturbs the measurements by causing additional I/O load (PCI bus cycles) on the object systems. The precision obtained by synchronizing the system clock is generally lower than the specified precision of the internal oscillator of the reference clock. Earlier experiments have also shown that the precision achieved by estimating the clock skew from network delay measurements without a GPS reference clock, as in [17,14,18,7], is not sufficient for determining the distributions of the delays in our system.
3.1 PPS-API
We use the PPS-API as specified in RFC 2783 [13] to avoid these difficulties. As shown in figure 3, we need only one GPS receiver for the whole system using this approach. The PPS-API provides a facility to timestamp external events delivered to a system with high resolution. It is intended for connecting external time sources like GPS receivers to a stratum-1 NTP server. Our GPS cards have a PPS signal output that is documented to deliver a TTL pulse that marks the beginning of a second with an uncertainty below 500 ns with respect to UTC. The Linux PPS-API kernel patch [20] modifies the serial port driver to detect and timestamp signals delivered to the DCD pin of the serial port. In addition to PPS recognition, the Linux kernel patch extends the timekeeping resolution of the Linux kernel to one nanosecond by utilizing the timestamp counter (TSC), which is present in most modern CPUs and is incremented with every tick of the internal clock. Since the signal levels of TTL are different from those of RS-232, we built a 5V-powered level converter using Maxim MAX3225 chips. These chips were selected because of their relatively low propagation delay. One chip can convert two TTL signals to RS-232 levels, so we used seven chips connected on the TTL
Fig. 3. Synchronization System
side to deliver the PPS signal to all twelve nodes of our cluster plus the NTP server.
The timestamps of the PPS pulses can be used in two ways to discipline the kernel clock: either by using the hardpps() kernel consumer or by using the user-level NTP daemon. Both of them make use of the kernel model for precision timekeeping as specified in RFC 1589 [12] and estimate the frequency error of the local oscillator by averaging the measured time interval between successive PPS pulses. Since the PPS pulse just marks the beginning of an arbitrary second, but does not contain information on the absolute time (second, minute, hour, day, month, year), all clocks of the cluster nodes must be set to have offsets of less than 500 ms. We achieved this by using a standard NTP server on the network and running the ntpdate command before starting PPS synchronization, or by using the NTP daemon with a configuration file that contains two time sources, the PPS clock driver and the NTP server.
3.2 PPS Pulse Latency
In Linux the recognition of the PPS pulse is done by instructing the hardware to generate an interrupt in case of a signal transition on the DCD pin of the serial port. The interrupt handling routine in the serial port driver is modified by the patch to timestamp every invocation. The PPS-API can generate an ECHO signal on the DSR pin of the serial port to make it possible to estimate the delay between the PPS pulse and the timestamping using an external clock. This delay d_echo is composed of the hardware propagation delay for the incoming PPS pulse d_hwi, the interrupt latency d_lat, a delay between the timestamping and the generation
of the echo signal d_ts, and the hardware propagation delay for the outgoing echo signal d_hwo. While the other delays remain more or less constant and can be compensated for, d_lat depends on the state of the system at the time of the signal reception. Thus, if the time of the generation of the n-th PPS pulse is t(n), the time of timestamping this event is

t_ts(n) = t(n) + d_hwi(n) + d_lat(n)

and the time the echo pulse is observable as an external signal transition is

t_echo(n) = t(n) + d_echo(n) = t(n) + d_hwi(n) + d_lat(n) + d_ts(n) + d_hwo(n)

By recording the PPS pulse and the resulting ECHO with an external clock, the value of d_echo can be determined. The time of the local clock at the n-th echo generation, t_loc,echo(n), is the timestamp generated by the PPS-API driver. The time of the local clock at the n-th external PPS signal, t_loc,pps(n), is t_loc,echo(n) − d_ts(n). Thus the difference between two nodes i and k at the time of the n-th PPS pulse can be calculated as

∆t_i,k(n) = (t_loc,echo,i(n) − d_ts,i(n)) − (t_loc,echo,k(n) − d_ts,k(n))

The delay d_ts is not directly observable, but since d_ts(n) and d_hwo(n) can be viewed as constant across different nodes and time, a reasonable approximation for ∆t_i,k(n) can be calculated as

∆t̃_i,k(n) = (t_loc,echo,i(n) − d_echo,i(n)) − (t_loc,echo,k(n) − d_echo,k(n)) = ∆t_i,k(n) + d_ts,k(n) − d_ts,i(n) + d_hwo,k(n) − d_hwo,i(n)

The assumption that d_hwo is constant for all systems is justified because all signal level converters use the same hardware with low propagation delay and share the same ambient temperature. The generation of the echo signal inside an interrupt handler, with other interrupts disabled, and the identical hardware on all real servers make the assumption of a constant d_ts also reasonable. The PPS serial port driver is implemented such that the serial port remains usable for general communication besides PPS recognition.
Due to this fact, there are several instructions executed between the timestamping and the echo generation. So we decided to implement a driver for the parallel port for exclusive use for PPS signal recognition. This also enabled us to avoid the use of signal level converters, since the parallel port uses TTL signals as provided by the GPS receiver. First experiments with our driver showed a slight reduction in the jitter of the timestamps.
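The approximation of the pairwise offset from the externally measured echo delays reduces to simple arithmetic per pulse. A minimal sketch under the assumptions stated above (names are ours, units are seconds):

```c
/* Approximate clock offset between nodes i and k at one PPS pulse:
 * each node's echo timestamp is corrected by its externally measured
 * echo delay d_echo before the two local clocks are compared. */
double pairwise_offset(double t_echo_i, double d_echo_i,
                       double t_echo_k, double d_echo_k) {
    return (t_echo_i - d_echo_i) - (t_echo_k - d_echo_k);
}

/* Mean approximate offset over n PPS pulses. */
double mean_offset(const double *t_echo_i, const double *d_echo_i,
                   const double *t_echo_k, const double *d_echo_k, int n) {
    double sum = 0;
    for (int j = 0; j < n; j++)
        sum += pairwise_offset(t_echo_i[j], d_echo_i[j],
                               t_echo_k[j], d_echo_k[j]);
    return sum / n;
}
```

The residual error is the (assumed constant) d_ts and d_hwo differences between the nodes, as derived in the text.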
3.3 ECHO Feedback
The interrupt latency d_lat occurs not only in our setup, but in every system that uses an external reference clock. It does not matter if the clock is connected to a serial or parallel port or a system bus like PCI. The calculation of ∆t_i,k led us to the idea of dynamic PPS ECHO feedback: by measuring the time between the PPS pulse and the generated ECHO for each pulse with an external clock, we can compensate for the interrupt latency by subtracting d_echo(n) from the timestamp t_loc,echo(n). The resulting timestamp

t_loc,echo(n) − d_echo(n) = t_loc,pps(n) − d_ts(n) − d_hwo(n)

is lower than the desired t_loc,pps by d_ts(n) + d_hwo(n), but does not depend on the interrupt latency any more. An estimator for the time dispersion due to frequency variance is the time deviation

σ_x(τ) = sqrt( (τ²/3) · mod σ_y²(τ) ),

where

mod σ_y²(τ) = (1/(2τ²)) · [ (1/n) Σ_{i=1..n} (x_{i+2n} − 2x_{i+n} + x_i) ]²

is the modified Allan variance for the averaging time τ = nτ0 and the successive time differences x_i between UTC, as realized by the PPS pulse, and the system clocks [9,19,3].
[Plot: time deviation σ_x(τ) in s (log scale, 5e−07 to 5e−05) versus averaging interval τ in s (log scale, 1 to 5000); curves: PPS, ECHO Feedback.]
Fig. 4. Time Deviation
Figure 4 was produced by plotting the time deviation for the raw PPS pulses as received by the PPS API with an undisciplined local clock and for the PPS
pulses corrected by dynamic PPS ECHO feedback, against the averaging interval τ. Both axes are scaled logarithmically. The measurement process took 24 hours. All timestamps were compensated for the systematic frequency errors by using linear regression over the whole measurement period. While using the ECHO feedback improves the time deviation for small averaging periods to some extent, a careful selection of the averaging interval τ is crucial to the accuracy of the system. Our measurements imply an optimum value of 32 seconds, but NTP bases the choice of τ on the Allan intercept point, a minimum of the Allan variance. This emphasizes the synchronization of frequencies, but the resulting averaging interval is larger than the optimal choice of τ for synchronization of the phase (clock offset). Furthermore, standard NTP and hardpps implementations impose a lower limit of 1024 seconds on τ.
The PPS-API patch also increases the resolution of the do_clock_gettime() system call to one nanosecond. When using this system call from kernel space, there is no context switch involved. The measured mean execution time of one call on our real server nodes is 70 ns. This measurement was done by allocating a buffer in kernel space and writing the results of successive calls of do_clock_gettime() to that buffer. The content of the buffer was read by a user mode tool in which we calculated the differences of successive timestamps. This system call is utilized in a special kernel module that implements a character device with a buffer for events and corresponding timestamps.
To obtain an optimal synchronization, we are currently working on a solution that uses the TSC of the CPU to timestamp both the events and the PPS pulses, leaving the system clocks completely unsynchronized. The synchronization is done offline after the measurements take place. This enables us to use an optimized τ found by analyzing the time deviation.
This approach is also applicable in small embedded systems where an online synchronization would be too time-consuming. When using configurable hardware, it is even possible to latch the current TSC reading in hardware at every PPS pulse. This latched TSC can be read in the interrupt service handler for the PPS pulse and used in an offline synchronization process, which completely avoids the negative impact of the interrupt latency.
3.4 Instrumentation
The Linux TCP/IP stack contains a filter framework for packet filtering and mangling. The latest implementation in 2.4.x kernels is known as the netfilter framework. It provides hooks at several places in the kernel stack where one can register code that is executed when an incoming or outgoing IP packet is processed. We instrumented the code there to timestamp all relevant packets and log the client IP address, client port number, packet type (incoming/outgoing, SYN/ACK/FIN) and the TCP sequence number. Together with our 64-bit timestamps, one entry occupies 19 bytes of buffer space. The size of the kernel event trace buffer can be configured; the standard size we used in the measurements presented in this paper was 12 MByte. In addition we use a user mode program that communicates via IOCTL commands with the logging
device to read the recorded event trace and to reset and clear the buffer. The event trace is written to disk in binary form for offline analysis.
In addition to the kernel-level timestamping of TCP packets, we instrumented the Apache webserver software to obtain application-level timestamps. Apache's C API provides the possibility to implement handlers for certain stages of processing an HTTP request. The first handler is the post-read-request handler. We use an external Apache module that registers a handler there to write a timestamp, along with the client IP address, the client port and the URI of the request, to a file for each incoming request. These entries can be found in two structures defined in the Apache C API: request_rec and conn_rec.
3.5 Load Generation
On the client side an HTTP load generator is needed. httperf, a load generator developed by David Mosberger [15], is able to generate load that overloads a web server. Unlike most other load generators it does not try to simulate a certain number of users. The number of requests and the rate of the requests can be specified on the command line. Since the test client PC has certain limitations on the number of TCP connections that can be held open simultaneously, httperf supports parallel execution on a number of client machines. This load generator is ideal for studying the behaviour of a web server in extreme situations and for finding the system's limits.
SURGE [2], on the other hand, emulates the behaviour of a configurable number of users as observed by its author, Paul Barford, in analyzing the log files of web servers. For this purpose the relative percentage of the number of accesses per file, embedded references, temporal locality of references and inactive periods of the user are determined by an analytical model derived from empirical observations. Some of the probability density functions used in the model are heavy-tailed. The on-off processes which are used to model the users generate bursts and self-similar traffic, as observed in recent studies of real-world traffic on the Internet. It is also usable in a distributed environment with a number of load-generating nodes.
We use both load generators and instrumented their request generation phase to timestamp each request on the HTTP layer.
Offline Analysis of the Traces
Matching the TCP sequence numbers, packet types and client IP addresses of the traces collected on different nodes of the cluster in an offline analysis process enables us to reconstruct the way of a packet belonging to a client request or server reply from the client through the different different nodes of our cluster based web server system. The one-way delays between the computers can be calculated from the timestamps in the log files if the local clocks of the machines are synchronized with high accuracy. On the other hand the end-to-end delay on HTTP level between the client and the real server can be evaluated by looking at the traces of the load generator
and the Apache web server. This requires matching the request URL, the client IP address and the client port. Since the calculated delays are used for estimating the parameters of a distribution function of the one-way delays in the input modeling process, low phase jitter in time synchronization is the crucial point.
4 Example Measurements
Five real servers and one load balancer were used in a NAT environment with round-robin scheduling for the following measurements. The unused second load-balancing node functioned as a test client running the httperf load generator. All slave nodes were connected to the switch with their 100-Base-TX interfaces. One 1000-Base-SX interface of the load generator was connected to the switch, the other one to an interface of the load balancer using a cross-over fiber-optic link. Time synchronization was done via the serial port using NTP (no hardpps kernel consumer) with a fixed polling interval of 64 seconds and the PPS-API without dynamic PPS ECHO feedback. An important step is to wait for the time synchronization to stabilize prior to generating requests for the system.
For this example we generated 10000 HTTP/1.0 requests for a binary file with a size of 1024 bytes. This resulted in a request size of 65 bytes. The web server software we used was our instrumented version of Apache 1.3.22. The web server added 244 bytes of header information, so the resulting replies had a size of 1268 bytes. Since this is smaller than the maximum segment size used (1500 bytes), all replies consisted of exactly one TCP segment.
Figure 5 provides an illustration of the 27 individual delays in the exchange of TCP segments that contribute to the total processing time of the HTTP request, drawn as colored vertical bars. Time advances along the vertical axis from top to bottom. The horizontal position shows where the delays are caused: either by the load generator (LG), the network channel between the load generator and the load balancer (C1), the load balancer (LB), the network between the load balancer and the real servers (C2), or by one of the real servers (RS).
The delays in channels C1 and C2 include not only the physical propagation delay and the store-and-forward delay of the Cisco switch, but also the time between the reception of the packet at the cluster node and the beginning of packet processing in the TCP/IP stack of the operating system. The segments sent during delays 11, 12, and 16 are due to TCP protocol mechanisms (fast ACK) and do not mark a state change in the HTTP protocol. Figure 6 shows the individual delays illustrated in Figure 5 plotted against the time of the measurement. Histograms of the delays are shown in Figure 7. The gathered data enables us to estimate theoretical distributions, or to use an empirical distribution, to parameterize either a simulation or a state-space based model of our distributed web server, and to calibrate and validate the model. Care must be taken when using these probability distributions in models of the system, because most delays are clearly correlated (see, e.g., Delay 23 in Figure 6).
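Before fitting independent distributions in the input-modeling step, the correlation warning above can be checked numerically. The following sketch uses made-up sample values, not the measured data.

```python
# Synthetic illustration: before fitting independent distributions to the
# measured delays, check for correlation between delay series; strongly
# correlated delays (as observed for Delay 23) should not be modeled as
# independent. The sample values below are made up.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

delay_a = [105.0, 110.0, 120.0, 98.0, 131.0]  # one delay series [us]
delay_b = [106.1, 111.4, 119.0, 99.2, 130.5]  # a suspiciously similar one
print(pearson(delay_a, delay_b) > 0.9)  # True: model these jointly
```

A correlation coefficient near 1 indicates the two delays should be modeled jointly (or their sum modeled directly) rather than drawn independently.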
A Low-Cost Infrastructure
Fig. 5. Illustration of the Delays
K.-S.J. Hielscher and R. German
[Figure 6: 27 scatterplot panels, one per delay (Delay 1 through Delay 27), each plotting Delay [µs] against Time [s]; only axis ticks and labels survived text extraction.]

Fig. 6. Scatterplots of Observed Delays
[Figure 7: 27 histogram panels, one per delay (Delay 1 through Delay 27), each plotting Density against Delay [µs]; only axis ticks and labels survived text extraction.]

Fig. 7. Histograms of Observed Delays
Table 1. Summary Statistics for Individual Delays in Microseconds

Delay   Minimum   25% Q.   Median     Mean   75% Q.  99.5% Q.  Maximum
 1       54.232   82.925  112.395  112.407  141.611   170.936  297.031
 2        0.690    0.741    0.754    0.798    0.772     1.443   10.826
 3       51.444   54.239   54.940   55.249   55.696    72.246  390.614
 4        4.953    6.235    6.621    7.066    7.060    27.508  452.295
 5       59.225   95.070  124.531  121.236  149.065   174.716  241.577
 6        1.057    1.103    1.115    1.128    1.128     1.594    3.866
 7       43.285   83.976  106.386  107.930  134.612   158.010  164.803
 8       18.820   21.064   22.011   24.486   24.499    46.814   67.133
 9       62.210  102.091  140.350  129.141  156.637   172.649  180.240
10        0.541    0.555    0.575    0.595    0.626     0.721    3.119
11       61.695   67.195   73.835   72.093   75.145    84.464   99.207
12      254.489  302.826  334.700  333.525  344.924   665.231  741.078
13      264.395  290.171  319.325  320.875  351.589   381.352  799.464
14        1.075    1.117    1.127    1.146    1.141     1.402   10.921
15       61.550   87.235  109.060  111.885  137.273   174.859  181.055
16       30.548   33.947   35.398   39.115   39.263    71.260  115.535
17       59.064  101.631  133.435  124.597  147.522   166.668  503.900
18        0.581    0.636    0.672    0.681    0.713     0.893    3.170
19       50.005   53.415   59.244   57.611   61.039    70.231   84.364
20       11.379   12.844   16.005   15.216   16.853    26.047   36.828
21       58.028   93.349  122.280  118.104  142.922   173.169  206.766
22        1.074    1.121    1.139    1.146    1.158     1.490    3.520
23       42.674   83.178  105.752  107.220  134.197   156.626  163.368
24        4.211    4.811    4.993    5.298    5.306     8.631   19.907
25       51.173   66.844   96.091  102.336  139.416   166.659  204.041
26        0.605    0.649    0.660    0.661    0.670     0.791    3.125
27       50.649   53.487   54.186   54.300   54.965    58.845   98.571
Note that the tails of the distributions have been cut off at the 99.5% quantile for better visualization in both the scatterplots and the histograms. The few longer delays are mainly caused by other system tasks and interrupt handlers being active during the reception of the packets. Summary statistics including the 99.5% quantile are given in Table 1. The minimum of the delays that occur as differences between a starting point on one node and an end point on another node (delays with odd numbers) is longer than 42 µs. This suggests that a synchronization precision on the order of one microsecond is sufficient for our purposes. Some of the other delays (delays with even numbers) are much shorter, but since each is measured on a single node of the cluster, they are not affected by phase synchronization errors.
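The columns of Table 1 can be computed from a delay trace as sketched below. The quantile definition (linear interpolation) is an assumption, since the paper does not state which definition was used; the sample data is made up.

```python
# Sketch of how the columns of Table 1 (minimum, quartiles, median, mean,
# 99.5% quantile, maximum) can be computed from a trace of delays in
# microseconds. Quantiles use linear interpolation (an assumption).

def quantile(sorted_xs, q):
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def summary(delays_us):
    xs = sorted(delays_us)
    return {
        "min": xs[0],
        "q25": quantile(xs, 0.25),
        "median": quantile(xs, 0.50),
        "mean": sum(xs) / len(xs),
        "q75": quantile(xs, 0.75),
        "q995": quantile(xs, 0.995),
        "max": xs[-1],
    }

stats = summary([54.2, 82.9, 112.4, 141.6, 297.0])
print(stats["min"], stats["median"], stats["max"])  # 54.2 112.4 297.0
```

Truncating plots at the 99.5% quantile, as done in Figures 6 and 7, simply means clipping at `stats["q995"]` before plotting.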
5 Conclusions
We have presented an inexpensive software monitoring framework for obtaining high precision timestamps for high volume measurements of distributed web
servers using GPS, NTP, the PPS-API with a single PPS signal distributed to all nodes of our cluster, and our own extensions: dynamic PPS ECHO feedback, a PPS driver for the parallel port, a kernel buffer for the event trace, and an instrumentation of the netfilter code of the Linux kernel. The example measurements show how it is used to obtain a detailed view of the network delays in our distributed web server. The accuracy achieved is suitable for this purpose. The observed individual delays can be used in creating a detailed model of the system. We also mentioned that a simple compensation of the systematic drift of the clocks of the object system with a simple skew model is not sufficient for precise one-way delay measurements. We learned from our experiments that an optimal selection of the averaging interval τ in NTP can improve timekeeping accuracy. In our future work we will record raw TSC values as timestamps for the events and use an offline analysis process to calculate real-time clock readings from those timestamps with the help of recorded TSC timestamps for the PPS pulses. A simple modification of the NTP code to lower the averaging interval does not seem possible, as this might lead to instability of the feedback loop used. Our infrastructure is also useful for other applications that require precise one-way delay measurements. We are currently evaluating a similar approach for measurements in a distributed setup of several embedded systems.
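The offline analysis proposed as future work could, under the stated assumptions, amount to a piecewise-linear interpolation between PPS-anchored (TSC, UTC) pairs. The following sketch is an illustration with invented numbers, not the authors' implementation.

```python
# Sketch of the proposed offline analysis: raw TSC counter readings for
# events are converted to real-time clock values by linear interpolation
# between recorded (TSC, UTC) pairs captured at the PPS pulses. All
# numbers below are invented for illustration.
import bisect

def tsc_to_time(tsc, pps_tsc, pps_utc):
    """pps_tsc: sorted TSC readings at the PPS pulses; pps_utc: the
    matching UTC seconds. Interpolates linearly between neighboring
    pulses (and extrapolates at the edges)."""
    i = bisect.bisect_right(pps_tsc, tsc)
    i = max(1, min(i, len(pps_tsc) - 1))
    t0, t1 = pps_tsc[i - 1], pps_tsc[i]
    u0, u1 = pps_utc[i - 1], pps_utc[i]
    return u0 + (tsc - t0) * (u1 - u0) / (t1 - t0)

# One PPS pulse per second; a roughly 1 GHz TSC with slight drift.
pps_tsc = [0, 1_000_000_000, 2_000_000_100]
pps_utc = [0.0, 1.0, 2.0]
print(tsc_to_time(500_000_000, pps_tsc, pps_utc))  # 0.5
```

Because each one-second interval gets its own slope, short-term frequency variations between pulses are compensated, which a single global skew model cannot do.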
References

[1] J. Anastasov. Patches for solving ARP problems. http://www.linuxvirtualserver.org/~julian/.
[2] P. Barford and M. Crovella. Generating representative Web workloads for network and server performance evaluation. ACM SIGMETRICS Performance Evaluation Review, 26(1):151–160, 1998.
[3] S. Bregni. Fast Algorithms for TVAR and MTIE Computation in Characterization of Network Synchronization Performance. In G. Antoniou, N. Mastorakis, and O. Panfilov, editors, Advances in Signal Processing and Computer Technologies. WSES Press, 2001.
[4] V. Cardellini, M. Colajanni, and P.S. Yu. Dynamic load balancing on Web-server systems. IEEE Internet Computing, 3(3):28–39, May–June 1999.
[5] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M. Siegle, and F. Sötz. ZM4/SIMPLE: a General Approach to Performance-Measurement and Evaluation of Distributed Systems. In T.L. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, chapter 6, pages 286–309. IEEE Computer Society Press, Los Alamitos, California, Jan 1994.
[6] K. Egevang and P. Francis. The IP Network Address Translator (NAT). Request for Comments RFC-1631, Internet Engineering Task Force, May 1994.
[7] R. Hofmann and U. Hilgers. Theory and Tool for Estimating Global Time in Parallel and Distributed Systems. In Proc. of the Sixth Euromicro Workshop on Parallel and Distributed Processing PDP'98, pages 173–179, Los Alamitos, January 21–23 1998. Euromicro, IEEE Computer Society.
[8] R. Klar, P. Dauphin, F. Hartleb, R. Hofmann, B. Mohr, A. Quick, and M. Siegle. Messung und Modellierung paralleler und verteilter Rechensysteme. Teubner-Verlag, Stuttgart, 1995.
[9] J. Levine. Introduction to time and frequency metrology. Rev. Sci. Instrum., 70:2567–2596, 1999.
[10] Linux Virtual Server Project. http://www.linuxvirtualserver.org/.
[11] D. Mills. Internet time synchronization: the Network Time Protocol. IEEE Trans. Communications, 39(10):1482–1493, October 1991.
[12] D. Mills. A Kernel Model for Precision Timekeeping. Request for Comments RFC-1589, Internet Engineering Task Force, March 1994.
[13] J. Mogul, D. Mills, J. Brittenson, J. Stone, and U. Windl. Pulse-per-second API for Unix-like operating systems, Version 1. Request for Comments RFC-2783, Internet Engineering Task Force, March 2000.
[14] S.B. Moon, P. Skelly, and D. Towsley. Estimation and Removal of Clock Skew from Network Delay Measurements. In Proceedings of IEEE INFOCOM '99, March 1999.
[15] D. Mosberger and T. Jin. httperf: A Tool for Measuring Web Server Performance. In First Workshop on Internet Server Performance, pages 59–67. ACM, June 1998.
[16] Netfilter/IPtables home. http://www.netfilter.org/.
[17] A. Pásztor and D. Veitch. PC based precision timing without GPS. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 1–10. ACM Press, 2002.
[18] V. Paxson. On Calibrating Measurements of Packet Transit Times. In Measurement and Modeling of Computer Systems, pages 11–21, 1998.
[19] D.B. Sullivan, D.W. Allan, D.A. Howe, and F.L. Walls. Characterization of Clocks and Oscillators. Technical Note 1337, National Institute of Standards and Technology, 1990.
[20] U. Windl. PPSKit. ftp://ftp.kernel.org/pub/linux/daemons/ntp/PPS/.
MIBA: A Micro-Benchmark Suite for Evaluating InfiniBand Architecture Implementations

B. Chandrasekaran (1), Pete Wyckoff (2), and Dhabaleswar K. Panda (1)

(1) Department of Computer and Information Sciences, The Ohio State University, Columbus, OH 43201
{chandrab,panda}@cis.ohio-state.edu
(2) Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, OH 43212
[email protected]
Abstract. Recently, InfiniBand Architecture (IBA) has been proposed as the next generation interconnect for I/O and inter-process communication. The main idea behind this industry standard is to use a scalable switched fabric to design the next generation of clusters and servers with high performance and scalability. The architecture provides various new mechanisms and services (such as multiple transport services, RDMA and atomic operations, multicast support, service levels, and virtual channels). These services are provided by components (such as queue pairs, completion queues, and virtual-to-physical address translation) and their attributes. Different implementation choices for IBA may lead to different design strategies for the efficient implementation of higher-level communication layers/libraries (such as the Message Passing Interface (MPI), sockets, and distributed shared memory), and they also have an impact on the performance of applications. Currently there is no framework for evaluating different design choices and for obtaining insights into the design choices made in a particular implementation of IBA. In this paper we address these issues by proposing a new micro-benchmark suite (MIBA) to evaluate InfiniBand architecture implementations. MIBA consists of several micro-benchmarks divided into two major categories: non-data-transfer-related micro-benchmarks and data-transfer-related micro-benchmarks. Using the new micro-benchmark suite, the performance of IBA implementations can be evaluated under different communication scenarios, as well as with respect to the implementation of different components and attributes of IBA. We demonstrate the use of MIBA to evaluate the second-generation IBA adapters from Mellanox Technologies.
1 Introduction
Emerging distributed and high performance applications require large computational power as well as low latency, high bandwidth and scalable communication
This research is supported in part by Sandia National Laboratory’s contract #30505, Department of Energy’s Grant #DE-FC02-01ER25506, and National Science Foundation’s grants #EIA-9986052 and #CCR-0204429.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 29–46, 2003. c Springer-Verlag Berlin Heidelberg 2003
subsystems for data exchange and synchronization operations. In the past few years, the computational power of desktop and server computers has been doubling every eighteen months. The raw bandwidth of network hardware has also increased to the order of Gigabits per second. During the past few years, the research and industry communities have proposed and implemented many user-level communication systems such as AM [20], VMMC [7], FM [14], EMP [17,18], U-Net [19,21], and LAPI [16] to address some of the problems associated with traditional networking protocols. In these systems, the involvement of the operating system kernel is minimized and the number of data copies is reduced. As a result, they can provide much higher communication performance to the application layer. More recently, the InfiniBand Architecture [10] has been proposed as the next generation interconnect for I/O and inter-process communication. In InfiniBand, computing nodes and I/O nodes are connected to the switched fabric through Channel Adapters. InfiniBand provides a Verbs interface, which is a superset of VIA [9,8]. This interface is used by host systems to communicate with Host Channel Adapters. InfiniBand provides many novel features: three kinds of communication operations (send/receive, RDMA, and atomic), multiple transport services (such as reliable connection (RC), unreliable datagram (UD), and reliable datagram (RD)), and different mechanisms for QoS (such as service levels and virtual lanes). In addition to providing scalability and high performance, InfiniBand also aims to meet applications' need for Reliability, Availability, and Serviceability (RAS). Recently, several companies have started shipping InfiniBand hardware. It is now a challenging task to report the performance of InfiniBand architectures accurately and comprehensively.
The standard tests such as ping-pong latency and bandwidth give very little insight into the implementation of the various components of the architecture. They do not evaluate the system under various communication scenarios and therefore do not capture all the characteristics of a real-life application. Hence there is a need to study the various components involved in communication. For example, different design choices in the implementation of virtual-to-physical address translation may lead to different performance results. The InfiniBand architecture specification offers a wide range of features and services. This is a motivating factor for computer architects to develop highly efficient implementations of higher-level programming model layers such as MPI [12,11], sockets [4], and distributed shared memory [13]. The architecture also provides a promising, efficient communication subsystem for applications such as web servers and data centers. The various features and services offered by the InfiniBand architecture increase the number of design choices for implementing such programming models and applications. Hence there is a need for a framework to evaluate these design choices. The hardware products for the InfiniBand Architecture are still in their early stages but are developing rapidly. More features and still better hardware performance are expected in the near future. A systematic and in-depth study of
the various components by such a framework would provide valuable guidelines to hardware vendors, helping them identify the strengths and weaknesses of their implementations and bring out better releases of their InfiniBand products. The requirements of such a framework are:
1. To evaluate various implementations of the InfiniBand architecture and compare their strengths and weaknesses in a standardized manner.
2. To evaluate the system under various communication scenarios.
3. To provide insights to developers of programming model layers and applications and to guide them in adopting appropriate and efficient strategies in their implementations.
4. To give valuable guidelines to InfiniBand hardware vendors about their implementations so that they can be optimized.
Traditional models of computation and communication are not sufficient to address the requirements listed above. We take on the challenge of designing a micro-benchmark suite to comprehensively evaluate the InfiniBand Architecture. The suite is divided into two major categories: non-data-transfer-related and data-transfer-related micro-benchmarks. Under the first category, we include micro-benchmarks for measuring the cost of several basic non-data-transfer operations: creating and destroying Queue Pairs, creating and destroying Completion Queues, and memory registration and deregistration. The cost of each operation is evaluated by varying the various parameters associated with it. The second category consists of several data-transfer-related micro-benchmarks. The main objective here is to isolate different components (such as virtual-to-physical address translation, multiple data segments, and event handling) and study them by varying their attribute values. This clearly brings out the importance of each component in the critical path of communication and also helps us evaluate such components across different implementations of InfiniBand.
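Per-operation cost measurements of the kind described for the first category generally follow a common timing pattern, which could look like the generic sketch below. This is not MIBA's actual code; in the real suite the timed operation would be a VAPI call such as queue-pair creation, and a cheap stand-in computation is used here instead.

```python
# Generic timing-harness sketch (not MIBA's code) for per-operation cost
# measurements: warm up, then time many iterations of an operation and
# report robust statistics (minimum and median).
import statistics
import time

def measure(operation, warmup=100, iterations=1000):
    for _ in range(warmup):                 # warm caches and code paths
        operation()
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - t0)
    return min(samples), statistics.median(samples)

best, typical = measure(lambda: sum(range(100)))
print(best <= typical)  # True: the minimum never exceeds the median
```

Reporting the minimum alongside the median helps separate the intrinsic cost of an operation from interference by other system activity, the same effect that produces the long tails discussed for Table 1 of the previous paper.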
The micro-benchmark suite would also provide valuable insights to developers of high-performance parallel applications and data-center enterprise applications. The micro-benchmarks are evaluated on a Linux-based InfiniBand cluster. The benchmark suite evaluates the Verbs Application Programmer Interface (VAPI) over InfiniHost(TM) MT23108 Dual Port 4X Host Channel Adapter (HCA) cards provided by Mellanox Technologies [1]. The rest of the paper is organized in the following manner. Section 2 gives an overview of the IBA architecture. Sections 3 and 4 describe the Mellanox HCAs and their Verbs API interface. Section 5 describes the benchmark tests in detail. In Section 6 we present the results. Related work, conclusions, and future work are presented in Sections 7 and 8.
2 InfiniBand Architecture Overview
The InfiniBand Architecture defines a System Area Network (SAN) for interconnecting processing nodes and I/O nodes. Figure 1 provides an overview of
the InfiniBand architecture. It provides the communication and management infrastructure for inter-processor communication and I/O. The main idea is to use a switched, channel-based interconnection fabric. The switched fabric of the InfiniBand Architecture provides much higher aggregate bandwidth; it can also avoid a single point of failure and thus provide more reliability. The InfiniBand Architecture also has built-in QoS mechanisms, which provide virtual lanes on each link and define service levels for each packet.
Fig. 1. Illustrating a typical system configuration with the InfiniBand Architecture (Courtesy InfiniBand Trade Association)
In an InfiniBand network, processing nodes and I/O nodes are connected to the fabric by Channel Adapters (CA). Channel Adapters usually have programmable DMA engines with protection features. They generate and consume IBA packets. There are two kinds of Channel Adapters: Host Channel Adapters (HCA) and Target Channel Adapters (TCA). HCAs sit on processing nodes. Their semantic interface to consumers is specified in the form of InfiniBand Verbs. Unlike traditional network interface cards, Host Channel Adapters are connected directly to the system controller. TCAs connect I/O nodes to the fabric. Their interface to consumers is usually implementation-specific and thus not defined in the InfiniBand specification. The InfiniBand communication stack consists of different layers. The interface presented by Channel Adapters to consumers belongs to the transport layer. A queue-based model is used in this interface. A Queue Pair in the InfiniBand Architecture consists of two queues: a send queue and a receive queue. The send
queue holds instructions to transmit data and the receive queue holds instructions that describe where received data is to be placed. Communication operations are described in Work Queue Requests (WQR) and submitted to the work queue. Once submitted, a Work Queue Request becomes a Work Queue Element (WQE). WQEs are executed by Channel Adapters. The completion of work queue elements is reported through Completion Queues (CQ). Once a work queue element has finished, a completion queue entry is placed in the associated completion queue. Applications can check the completion queue to see whether any work queue request has finished.
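The WQR-to-WQE-to-completion flow just described can be mimicked with a toy model. This is an illustration only; the class and method names are invented and are not part of any IBA API.

```python
# Toy model of the IBA queue-based interface: a work queue request
# becomes a work queue element on the send queue, a stand-in for the
# channel adapter executes it, and a completion entry appears on the
# completion queue for the consumer to poll.
from collections import deque

class QueuePair:
    def __init__(self, cq):
        self.send_queue = deque()   # holds WQEs awaiting execution
        self.cq = cq                # associated completion queue

    def post_send(self, wqr):       # consumer submits a work queue request
        self.send_queue.append(wqr)

    def hca_execute_all(self):      # stand-in for the channel adapter
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.cq.append(("complete", wqe))

cq = deque()
qp = QueuePair(cq)
qp.post_send("send 1KB to peer")
qp.hca_execute_all()
print(cq.popleft())  # ('complete', 'send 1KB to peer')
```

Note how the consumer never touches the send queue after posting; it only polls the completion queue, which is exactly the asynchronous decoupling the Verbs interface provides.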
3 Mellanox Hardware Architecture
Our InfiniBand platform consists of several InfiniHost HCAs and an InfiniScale switch from Mellanox [1]. In this section we give a brief introduction to both the HCA and the switch. InfiniScale is a full wire-speed switch with eight 4X or 1X InfiniBand ports. These ports have an integrated 2.5 Gb/s physical-layer serializer/deserializer and feature auto-negotiation between 1X and 4X links. There is also support for eight Virtual Lanes (VLs) in addition to a Dedicated Management Lane (VL15), as well as support for link packet buffering, inbound and outbound partition checking, and auto-negotiation of link speed. Finally, the switch has an embedded RISC processor for exception handling, out-of-band data management support, and performance monitoring counter support. The InfiniHost MT23108 dual 4X-ported HCA/TCA allows for a bandwidth of up to 10 Gbit/s over its ports. It can potentially support up to 2^24 QPs, End-to-End Contexts, and CQs. Memory protection along with address translation is implemented in hardware itself. PCI-X support along with DDR memory allows portions of host memory to be configured as part of system memory using a transparent PCI bridge, allowing the host to place HCA-related data directly without going over the PCI-X bus. The DDR memory allows the mapping of different queue entries, namely work queue entries (WQEs) and execution queue entries, to different portions of the system space transparently. At its heart, the HCA picks WQEs in a round-robin fashion (the scheduler is flexible and supports more complex scheduling, including weighted round-robin with priority levels) and posts them to execution queues, allowing for the implementation of QoS at a process level. Different WQEs specify how the completion notification should be generated. In the following section, we discuss the software interface to InfiniBand.
4 InfiniBand Software Interface
Unlike other specifications such as VIA, the InfiniBand Architecture doesn't specify an API. Instead, it defines the functionality provided by HCAs to operating systems in terms of Verbs [10]. The Verbs interface specifies such functionality as
transport resource management, multicast, work request processing, and event handling. Although in theory APIs for InfiniBand can be quite different from the Verbs interface, in practice many existing APIs have followed the Verbs semantics. One such example is the VAPI interface [1] from Mellanox Technologies. Many VAPI functions are mapped directly from corresponding Verbs functionality. This approach has several advantages. First, since the interface is very similar to the Verbs, the effort needed to implement it on top of the HCA is reduced. Second, because the Verbs interface is specified as a standard in the InfiniBand Architecture, it is much easier to port applications from one InfiniBand API to another if they are both derived from Verbs. As we have mentioned earlier, communication in Verbs is based on queue pairs. InfiniBand supports both channel (send/receive) and memory (RDMA) semantics. These operations are specified in work queue requests and posted to send or receive queues for execution. The completion of work queue requests is reported through completion queues (CQs). Note that all communication memory must be registered first. This step is necessary because the HCA uses DMA operations to send from or receive into host communication buffers. These buffers must be pinned in memory, and the HCA must have the necessary address information in order to carry out the DMA operations.
5 Micro-Benchmark Suite for InfiniBand
In this section we discuss the MIBA micro-benchmark suite. Besides quantifying the performance seen by the user under different circumstances, MIBA is also useful for identifying the time spent in each component during communication. The micro-benchmark tests can be categorized into two major groups: non-data-transfer-related micro-benchmarks and data-transfer-related micro-benchmarks. These categories are discussed in detail in the rest of the section. Note that not all features supported by the IBA specification are available in current implementations. We have evaluated most of the components that are available and plan to extend the micro-benchmark suite as more features become available.

5.1 Non-Data Transfer Operations
In this category we measure the costs of the following operations: Create, Modify, and Destroy Work Queues: A Work Queue (or Queue Pair) is the virtual interface that the hardware provides to an IBA consumer; communication takes place between a source QP and a destination QP. IBA supports various transport services through these QPs. To establish a reliable connection, the QP must transition through several states, which is accomplished by appropriate modify operations on the QP, performed as per the IBA specification [10]. Here we measure the cost of setting up and tearing down the connection: the modify operation represents the setting up of the connection and the destroy operation represents
the tearing down of the connection. A QP connection does not correlate directly with a TCP connection because of protection and other requirements. Note that the cost of such an operation depends on parameters like the maximum number of WQEs supported by the QP. Create and Destroy Completion Queues: Completion Queues (CQ) serve as the notification mechanism for Work Request completions. A CQ can be used to multiplex work completions from multiple work requests across queue pairs on the same HCA. We measure the cost to create and destroy CQs. Again, this cost depends on the attributes of the CQ. Memory Registration and Deregistration: The IBA architecture provides sophisticated high-performance operations like RDMA and user-mode I/O. To manage these, appropriate memory management mechanisms are specified. The Memory Registration operation allows consumers to describe a set of virtually contiguous memory locations that can be accessed by the HCA for communication. We measure the cost of registering and deregistering memory. Work Request Processing Operations: Work Requests are used to submit units of work to the Channel Interface. Some types of work requests are Send/Receive, RDMA read/write, and Atomic operations. A work request usually triggers communication between the participating nodes. The result of a Work Request operation is placed in a Completion Queue Entry, which can be retrieved by polling the completion queue. We measure the cost of work request operations, polling on completed work request operations, and polling on pending work request operations (empty CQs). This cost indicates the host overhead involved in communication; the lower it is, the more CPU cycles can be allocated to other computation.

5.2 Data Transfer Operations
In this category, the basic operations used for data transfer are evaluated under different scenarios. The rest of the section describes them in detail. 5.2.1 Basic Tests. These micro-benchmarks are used to find the latency, unidirectional bandwidth, bi-directional bandwidth, and CPU utilization for our base configuration. The base configuration has the following properties: 100% buffer reuse, one data segment, polling on the Completion Queue, one connection, and no notify mechanism. These properties are described in more detail later in this section. Latency Test: Latency measures the time taken for a message of a given size to reach a designated node from the source (sender) node. For measuring the latency, the standard ping-pong test is used. We calculate the latency for both synchronous (Send/Receive on RC) and asynchronous (RDMA on RC) operations. The ping side posts two work requests: one for send and another for receive. It then polls for the completion of the receive request. The pong side posts a receive request, waits for it to complete, and then posts a send work request. This entire process is repeated a sufficient number of times (so that the timing error is negligible), from which an average round-trip time is produced,
then it is divided by two to estimate the one-way latency. This test is repeated for different message sizes. Bandwidth Test: The objective of the bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. To measure the bandwidth, messages are sent repeatedly from the sender node to the receiver node, and the sender then waits for the last message to be acknowledged. The time for sending these back-to-back messages is measured, and the timer is stopped when the acknowledgment for the last message is received. The number of messages sent is kept large enough to make the transmission time of the last acknowledgment negligible in comparison with the total time. In order to avoid overloading the HCA, we use the concept of a window size w. Initially, w messages are posted, after which the sender waits for the send completion of w/2 messages. Upon completion, another w/2 messages are posted. This pattern of waiting for w/2 completions and posting w/2 messages is repeated a sufficient number of times. Since there are always at least w/2 outstanding messages, sustained data movement on the network is ensured. However, if the HCA dispatches incoming work requests faster than the host posts them, there might not be any change in the results for various window sizes. Bi-directional Bandwidth Test: The networking layer in IBA, like other modern interconnects, supports traffic in both directions simultaneously. The aim of this test is to determine the maximum sustained data rate that can be achieved at the network level in both directions. To measure the bidirectional bandwidth, messages are sent out repeatedly from both the sender and the receiver, and both wait on the completion of the last receive. The time for sending these back-to-back messages is measured. As in the bandwidth test, we use a window size here.
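The windowing scheme just described can be sketched as follows. Completions are simulated here; the real test would poll the completion queue instead, and the function is an illustration rather than MIBA's code.

```python
# Sketch of the bandwidth test's windowing logic: post w sends, then
# repeatedly wait for w/2 completions and post w/2 more, so that about
# w/2 messages stay outstanding. Completions are simulated.

def run_window_test(total_msgs, w):
    posted = min(w, total_msgs)     # initial burst of w posts
    completed = 0
    max_outstanding = 0
    while completed < total_msgs:
        max_outstanding = max(max_outstanding, posted - completed)
        completed += min(w // 2, posted - completed)   # wait for w/2
        posted += min(w // 2, total_msgs - posted)     # post w/2 more
    return max_outstanding

print(run_window_test(total_msgs=1000, w=32))  # 32: window never exceeded
```

The invariant is that at most w work requests are ever outstanding, which is what keeps the HCA's queues from being overloaded while still keeping the link busy.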
CPU Utilization Test: Higher-level applications usually involve a computation cycle followed by a communication cycle. If the time spent on communication is small, the valuable CPU cycles can be allocated to useful computation. This raises an important question: how many CPU cycles are available for computation while communication is performed in tandem? The CPU utilization test is similar to the bi-directional bandwidth test, with computation gradually inserted. Each iteration of the measurement loop includes four steps: post receive work requests for expected incoming messages, initiate sends, perform computational work, and finally wait for message transmission to complete. As the amount of work increases, the host CPU fraction available for message passing decreases. 5.2.2 Address Translation. A very important component of any user-level communication system is the virtual-to-physical address translation. In InfiniBand, the HCA provides the address translation [21]. In the basic setup, messages are sent from only one buffer. Hardware implementations usually cache the physical address of this buffer, and hence the cost of virtual-to-physical address translation is not reflected in the latency or bandwidth tests. However, by varying the percentage of buffer reuse, one can see a significant difference in the
MIBA: A Micro-Benchmark Suite
basic test results. Studying the impact of virtual-to-physical address translation can help higher-level developers optimize buffer-pool and memory-management implementations. To capture the cost of address translation and the effectiveness of the physical address cache, we have devised two schemes. In Scheme 1, if P is the fraction (or percentage) of buffer reuse, then 1/P buffers are used by the test, and accesses to these buffers are evenly distributed across the basic tests (latency and bandwidth). Here we try to evaluate the effectiveness of the caching scheme: if the cache is large enough to hold the addresses of all 1/P buffers, there should be no variation in the results. In Scheme 2, if P is the fraction (or percentage) of buffer reuse and n is the total number of messages communicated between the two sides, then nP messages use the same buffer while n(1 − P) messages use different buffers. Again, the different buffer accesses are evenly distributed across the test. Here we try to evaluate the cost of virtual-to-physical address translation: as the percentage of buffer reuse decreases, more and more new buffers are accessed. Illustration: Assume that we have ten buffers numbered 0 to 9 and the buffer reuse percentage is 25%. In Scheme 1, the buffer access sequence would be 0, 1, 2, 3, 0, 1, 2, 3, ..., and so on. If the cache is big enough to fit all the buffers, there will be no change in the latency and bandwidth numbers. In Scheme 2, the access sequence would be 0, 1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, ..., and so on. Buffer 0 is reused 25% of the time, and the rest of the time different buffers, which are not in the cache, are used. 5.2.3 Multiple Queue Pairs. The IBA architecture specification supports 2^24 QPs. For connection-oriented transport services like RC, a QP is bound exclusively to one connection. Hence, as the number of connections increases, the number of active QPs increases.
Therefore, it is important to see whether the number of active QPs has any effect on the basic performance. This information is important for applications that run on many nodes and need to establish reliable connections between the nodes. This benchmark thus provides valuable information regarding the scalability of the InfiniBand Architecture for large-scale systems. 5.2.4 Multiple Data Segments. IBA supports scatter and gather operations. Many high-level communication libraries, such as MPI, which support gather and scatter operations can use this feature directly. Therefore it is necessary to study the impact of the number of gather and scatter data segments on the basic performance. 5.2.5 Maximum Transfer Unit Size. The maximum payload size supported by a particular connection may take any of the following values: 256, 512, 1024, 2048, or 4096 bytes. A smaller maximum transfer unit (MTU) may improve the latency for small messages, while a larger MTU may increase the bandwidth for large messages due to the smaller overhead per payload. Hence, depending on the MTU, the results of the base tests may vary, and higher-level communication library and application developers must be aware of such variations. We measure the performance through the basic tests by varying the MTU. 5.2.6 Maximum Scatter and Gather Entries. The maximum number of scatter/gather entries (SGE) supported by a QP is specified during creation of that QP. A larger SGE may potentially increase the size of a work request posted to the HCA. On the other hand, a QP with a smaller SGE may not be flexible enough if the application frequently uses scatter and gather operations with large data segments. Hence it is important that application developers be aware of this trade-off. We measure the performance by varying the SGE values. 5.2.7 Event Handling. IBA also supports event notification. On completion of a work request, a consumer-defined event handler is invoked, which performs the required functions. In our micro-benchmark suite, the main thread waits on a semaphore, while the event handler signals the semaphore upon completion of the work request operations. Event handling is preferred to polling in scenarios where the application is better off performing other computation rather than busy-waiting in a polling loop. We evaluate the performance of the basic tests when event handling is used instead of polling. 5.2.8 Impact of Load at the HCA. In all the basic tests, only two nodes communicate with each other and the HCA is used exclusively by the corresponding nodes. An interesting challenge is to evaluate the performance of the system when an HCA is involved in more than one communication, causing contention for HCA resources. The objective here is similar to that of the CPU utilization test. The test is carefully designed to avoid contention at the host processors or at the PCI bus and to create contention only at the HCA. Two nodes (sender and receiver) run the bandwidth test described previously.
Other nodes try to load the HCA of the sender by sending RDMA messages of negligible size. RDMA messages are used because they cause no contention at the sender's host processor. The message size is chosen to be small (4 bytes in this case) so that contention at the sender-side PCI bus (and at the switch and wire) is minimal. We measure the results for the basic test while varying the number of other nodes sending RDMA messages to the sender.
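The event-handling scheme of section 5.2.7 — an event handler signalling a semaphore on which the main thread waits — can be sketched in Python. This is a schematic stand-in: the suite itself uses the VAPI completion-event callback, and the completion is simulated here by a separate thread:

```python
import threading

done = threading.Semaphore(0)
completions = []

def completion_handler(wr_id):
    """Consumer-defined handler invoked on work-request completion
    (simulated here): record the completion, then signal the waiter."""
    completions.append(wr_id)
    done.release()

# Simulate the HCA firing the completion event from another thread.
threading.Thread(target=completion_handler, args=(42,)).start()
done.acquire()   # main thread blocks until the handler signals
```

The extra thread switch and semaphore operation in this path are the source of the higher latency that event notification shows compared to polling (section 6.3.7).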
6 Performance Evaluation and Discussion

In this section we evaluate VAPI over the Mellanox HCA, the currently available implementation of IBA.

6.1 Experimental Testbed
Our experimental testbed consists of a cluster system of 8 SuperMicro SUPER P4DL6 nodes. Each node has dual Intel Xeon 2.40 GHz processors with a 512KB
L2 cache and a 400 MHz front-side bus. The machines are connected by Mellanox InfiniHost MT23108 DualPort 4X HCA adapters through an InfiniScale MT43132 eight-port 4X InfiniBand switch. The HCA adapters work under the PCI-X 64-bit 133 MHz interface. The Mellanox InfiniHost HCA SDK build id is thcax86-0.2.0-build-001. The adapter firmware build id is fw-23108-rel-1 18 0000.

6.2 Non-data Transfer Operations
The results obtained for the non-data transfer benchmarks are presented in Table 1, Figure 2, Figure 3(a), and Figure 3(b). Table 1 summarizes the cost of connection management and work request operations. A connection is established by the modify QP operation and destroyed by the destroy QP operation, as described in section 5.1. It is observed that creating and tearing down a connection is costly: when a reliable connection is created or destroyed, the resources for that connection must be allocated or freed. This provides valuable information to developers of applications that require dynamic creation of connections; such a developer may choose to use Reliable Datagram (RD) instead of Reliable Connection. Note that RD is not supported in the currently available IBA implementation but is expected soon. The cost of posting a work request is low, implying that the CPU overhead for communication is small for Mellanox HCAs. Figure 2 shows the cost of memory registration and deregistration. The memory registration cost increases steeply beyond 1 MB and is around 100 milliseconds for 64 MB. Figures 3(a) and 3(b) show the cost of CQ and QP operations with respect to the maximum number of outstanding requests expected on that queue. Note that the QP operations here do not involve setting up connections; hence the cost of the QP destroy operation shown in Figure 3(b) is not as high as the cost of the destroy QP operation indicated in Table 1.

Table 1. Non-Data Transfer Micro-Benchmarks

Operation                                  Time in µs
Creating a Connection (modify QP)               195.5
Tearing Down a Connection (destroy QP)          218.2
Posting a Receive Work Request                    0.6
Posting a Send Work Request                       0.7
Polling on a Complete Queue                       1.0
Polling on an Empty Queue                         0.3
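Using the measured costs in Table 1, a quick back-of-the-envelope calculation shows when connection setup and teardown dominate the send-side overhead (illustrative arithmetic only; the function name is ours):

```python
# Measured costs from Table 1, in microseconds.
CREATE_CONN  = 195.5   # modify QP
DESTROY_CONN = 218.2   # destroy QP
POST_SEND    = 0.7

def overhead_fraction(n_messages):
    """Fraction of the total send-side cost spent on connection
    setup and teardown when a connection carries n_messages sends."""
    setup = CREATE_CONN + DESTROY_CONN
    return setup / (setup + n_messages * POST_SEND)
```

For a single message the connection cost is essentially all of the overhead (about 99.8%); only after roughly 60,000 sends does it drop below 1%. This is why creating connections dynamically per transfer is expensive and RD may be attractive.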
6.3 Data Transfer Operations

In this section we present the data-transfer-related benchmark results. All tests use Send-Receive primitives unless explicitly specified as RDMA.
Fig. 2. Cost of Memory Operations (cost in µs of memory register and deregister operations for buffer sizes from 4 bytes to 64 KB).
Fig. 3. Cost of CQ and QP operations (cost in µs of CQ create/destroy vs. the number of CQEs, and of QP create/destroy vs. the number of QP entries, for 1000 to 10000 entries): (a) Completion Queue operations; (b) Queue Pair operations.
6.3.1 Basic Tests. Here we present the results for the base settings described in section 5.2.1. The latency and bandwidth results are shown in Figure 4(a) and Figure 4(b). The one-way RDMA latency is 5.7 µs and the peak unidirectional bandwidth is around 840 MB/s. The currently available PCI-X bus supports a bandwidth of around 1 GB/s; this and chipset limitations are the reason why the bi-directional bandwidth (Figure 4(c)) is not twice the unidirectional bandwidth. There is no variation across window sizes for either the bandwidth or the bi-directional bandwidth test. Figure 4(d) shows the CPU utilization. The peak bi-directional bandwidth when no computation is involved is around 900 MB/s. We increase the computation gradually to see how the communication is affected. From the graph we can see that the bandwidth falls only after more than 96% of the CPU cycles are allocated to computation; the peak bandwidth is still achieved when 96% of the CPU cycles are used for computation. This shows that communication requires very little CPU.
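The low CPU cost of communication can be made plausible with the per-operation costs from Table 1: since the HCA moves the data by DMA, the host essentially only posts work requests and polls completions. A rough model (our own estimate, not a measured quantity):

```python
# Host CPU cost per message, from Table 1 (µs):
# post send + post receive + one poll of the completion queue.
HOST_US_PER_MSG = 0.7 + 0.6 + 1.0

def cpu_fraction_needed(bandwidth_mb_s, msg_bytes):
    """Approximate host-CPU fraction needed to sustain a bandwidth,
    assuming the HCA performs the data movement via DMA."""
    msgs_per_s = bandwidth_mb_s * 1e6 / msg_bytes
    return msgs_per_s * HOST_US_PER_MSG / 1e6
```

Sustaining 900 MB/s with 64 KB messages needs only about 3% of a CPU in this model, consistent with the observation that peak bandwidth survives up to 96% of the CPU being devoted to computation.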
Fig. 4. Basic Tests: (a) Latency (time in µs vs. message size, 1 to 4096 bytes, Send-Receive and RDMA); (b) Bandwidth (MB/s vs. message size, 4 KB to 1 MB); (c) Bi-directional Bandwidth (MB/s vs. message size); (d) CPU Utilization (bi-directional bandwidth vs. % of CPU cycles dedicated to other computation).
6.3.2 Address Translation. Figure 5 shows the impact of virtual-to-physical address translation for the two schemes described in section 5.2.2. Scheme 1 shows no decrease in performance for buffer reuse percentages down to 25% (Figure 5(a)), because of the effective caching mechanism of the Mellanox HCAs. Figure 5(b) shows the cost of address translation: as the percentage of buffer reuse is decreased, more and more address translations have to be performed. For large messages, we notice a drop in the bandwidth values. This is because a larger message occupies more pages and hence requires more entries in the cache, increasing the probability of cache misses.
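The two buffer access patterns of section 5.2.2 can be generated deterministically. The sketch below (function names ours) reproduces the illustration with ten buffers and 25% reuse:

```python
def scheme1(n_msgs, p):
    """Scheme 1: with reuse fraction p, cycle round-robin
    over the 1/p buffers used by the test."""
    n_buffers = round(1 / p)
    return [i % n_buffers for i in range(n_msgs)]

def scheme2(n_msgs, p):
    """Scheme 2: buffer 0 is reused for a fraction p of the messages;
    every other message touches a fresh buffer."""
    period = round(1 / p)            # one access to buffer 0 per period
    seq, fresh = [], 1
    for i in range(n_msgs):
        if i % period == 0:
            seq.append(0)            # the reused buffer
        else:
            seq.append(fresh)        # a new, uncached buffer
            fresh += 1
    return seq
```

With 25% reuse, scheme1 yields 0, 1, 2, 3, 0, 1, 2, 3, ... (a fixed working set of four buffers), while scheme2 yields 0, 1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, ..., matching the illustration: one cached buffer plus a stream of new buffers that miss in the address cache.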
Fig. 5. Impact of Virtual-to-Physical Address Translation (bandwidth in MB/s vs. message size, 4 KB to 1 MB, for buffer reuse of 100%, 75%, 50%, 25%, and 0%): (a) Bandwidth for Scheme 1; (b) Bandwidth for Scheme 2.

6.3.3 Multiple Queue Pairs. This benchmark shows that there is no difference in the latency and bandwidth numbers as we vary the number of connections established by a node. We varied the number of QP connections up to 64 and the latency and bandwidth numbers remained the same. This shows the excellent scalability of the Mellanox HCAs.
6.3.4 Multiple Data Segments. This benchmark evaluates the performance of data transfer when multiple data segments are used, as described in section 5.2.4. It is observed that the latency increases with the number of segments. Figure 6 shows the latency for different numbers of segments, each of equal size; the total message size (the sum of the sizes of all segments) is plotted on the x-axis and the time taken for the latency test on the y-axis. Note that each data segment has to be copied to the HCA through DMA, so as the number of segments increases, the number of DMAs increases. Therefore the performance of the PCI bus and the corresponding chipset is also a major factor in the impact of multiple data segments.
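The segment layout used by this test can be sketched as follows (an illustrative helper, not MIBA code); each (offset, length) entry corresponds to one gather element and hence one DMA by the HCA:

```python
def gather_list(total_bytes, n_segments):
    """Build (offset, length) gather entries for a message split
    into n_segments equal-sized segments."""
    assert total_bytes % n_segments == 0
    seg = total_bytes // n_segments
    return [(i * seg, seg) for i in range(n_segments)]
```

For a 4 KB message split into 8 segments, the work request carries 8 entries of 512 bytes each, so the HCA performs 8 DMAs instead of 1, which is the extra cost visible in Figure 6.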
Fig. 6. Impact of Multiple Data Segments (latency in µs vs. total message size, 64 to 4096 bytes, for 1, 2, 4, 8, 16, and 32 segments).
Fig. 7. Impact of MTU (for MTU values 256, 512, 1024, and 2048 bytes): (a) Latency (time in µs vs. message size, 1 to 4096 bytes); (b) Bandwidth (MB/s vs. message size, 4 KB to 1 MB).
6.3.5 Impact of Maximum Transfer Unit Size (MTU). This benchmark evaluates the performance of data transfer when the MTU value is varied, as described in section 5.2.5. Figure 7 shows that smaller MTU values yield lower latency for small messages, but their bandwidth is significantly lower, because larger MTU packets have less overhead per packet. An MTU of 1024 bytes performs better than an MTU of 2048 bytes in the bandwidth test; this may be due to more effective pipelining with the smaller MTU. 6.3.6 Maximum Scatter and Gather Entries. This benchmark evaluates the performance of data transfer when the maximum SGE supported by a QP is varied. Figure 8(a) shows the impact on the latency as the maximum number of scatter/gather entries is varied. No significant difference is observed for the bandwidth test. 6.3.7 Event Handling. Figure 8(b) shows the impact of event notification as compared to polling. The latency is significantly higher for event notification, due to the cost of invoking the event handler upon work completion and the subsequent operation on the semaphore to notify the main thread. However, event notification may help certain applications, and hence it is important for the developers of such applications to be aware of this cost. No significant difference is observed for the bandwidth test. 6.3.8 Impact of Load at the HCA. Figure 9 shows the impact of contention for HCA resources from other communication. The graph is plotted by varying the number of contending nodes, which try to load the HCA of the sender node in the basic test as described in section 5.2.8. We can see
Fig. 8. Impact of SGE and Event Notification on Latency (time in µs vs. message size, 1 to 4096 bytes): (a) Maximum SGE (for maximum SGE values 1, 10, 20, 30, 40, and 50); (b) Event Notification (polling vs. event notification).
that as the number of contending nodes increases, the bandwidth drops, but not significantly. This shows the scalability of the HCA with respect to the number of contending nodes.
Fig. 9. Impact of contention from other communication on bandwidth (bandwidth in MB/s vs. message size, 4 KB to 1 MB, for 0, 1, 2, and 4 contending nodes).
7 Related Work
To the best of our knowledge this is the first attempt to comprehensively evaluate the InfiniBand Architecture using a micro-benchmark suite. Our benchmark is based on the VIBe micro-benchmark [5], developed earlier in our group for the VIA architecture. Bell et al. [6] used a variant of the LogGP [2] model to evaluate several current-generation high-performance networks, such as the Cray T3E, the IBM SP,
Quadrics, Myrinet 2000, and Gigabit Ethernet. They also compared the performance of the MPI layer on these networks. The NPB benchmarks [3] are application-level benchmarks that evaluate the performance of a system using MPI. Saavedra et al. [15] developed a micro-benchmark to evaluate the memory subsystem of the KSR1 architecture. Our micro-benchmark is a more in-depth evaluation at a lower-layer API, with a focus on IBA.
8 Conclusions and Future Work
In this paper we have proposed a new micro-benchmark suite for evaluating InfiniBand Architecture implementations. In addition to the standard latency and bandwidth tests, we have presented several tests that help in obtaining a clear understanding of the implementation details of the components involved in the InfiniBand Architecture. The suite provides valuable insights for the developers of higher layers and applications over IBA. IBA products are rapidly maturing, and this tool will help hardware vendors identify the strengths and weaknesses of their releases. As products are released, more and more features of the InfiniBand Architecture will become available. Some of these features are service levels, virtual lane to service level mapping, reliable datagram, partitioning, and atomic operations. These features are important for large systems such as cluster-based data centers and also for higher-level communication libraries such as the Message Passing Interface (MPI) and distributed shared memory. This micro-benchmark suite would then provide guidelines for making design choices in the implementation of such systems and libraries. We are planning to extend the micro-benchmark suite in tandem with the development of IBA products.

MIBA Software Distribution. The code for the benchmark suite described in this paper is available. If you are interested, please contact Prof. D. K. Panda ([email protected]).

Acknowledgments. We would like to thank Jiuxing Liu, Sushmitha Prabhakar Kini, and Jiesheng Wu for their help with the experiments. Our appreciation is also extended to Jeff Kirk and Kevin Deierling from Mellanox Technologies for their insight and technical support on their InfiniBand hardware and software.
References

1. Mellanox Technologies. http://www.mellanox.com.
2. A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing, 44(1):71–79, 1997.
3. D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputer Applications, 5(3):63–73, Fall 1991.
4. P. Balaji, P. Shivam, P. Wyckoff, and D. K. Panda. High Performance User Level Sockets over Gigabit Ethernet. In Cluster Computing, September 2002.
5. M. Banikazemi, J. Liu, S. Kutlug, A. Ramakrishna, P. Sadayappan, H. Sah, and D. K. Panda. VIBe: A micro-benchmark suite for evaluating virtual interface architecture implementations. In Int'l Parallel and Distributed Processing Symposium (IPDPS), April 2001.
6. C. Bell, D. Bonachea, Y. Cote, J. Duell, P. Hargrove, P. Husbands, C. Iancu, M. Welcome, and K. Yelick. An evaluation of current high-performance networks. In International Parallel and Distributed Processing Symposium (IPDPS'03), 2003.
7. M. Blumrich, C. Dubnicki, E. W. Felten, K. Li, and M. R. Mesarina. Virtual-Memory-Mapped Network Interfaces. IEEE Micro, pages 21–28, February 1995.
8. Compaq, Intel, and Microsoft. VI Architecture Specification V1.0, December 1997.
9. D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. M. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, pages 66–76, March/April 1998.
10. InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.0, October 24, 2000.
11. J. Liu, J. Wu, S. P. Kini, D. Buntinas, W. Yu, B. Chandrasekaran, R. Noronha, P. Wyckoff, and D. K. Panda. MPI over InfiniBand: Early Experiences. Technical Report OSU-CISRC-10/02-TR25, Computer and Information Science Department, The Ohio State University, January 2003.
12. J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In International Conference on Supercomputing, June 2003.
13. R. Noronha and D. K. Panda. Implementing TreadMarks over GM on Myrinet: Challenges, Design Experience, and Performance Evaluation. In Workshop on Communication Architecture for Clusters (CAC'03), held in conjunction with IPDPS '03, April 2003.
14. S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM). In Proceedings of Supercomputing '95, 1995.
15. R. H. Saavedra, R. S. Gaines, and M. J. Carlton. Micro benchmark analysis of the KSR1. In Supercomputing, pages 202–213, 1993.
16. G. Shah, J. Nieplocha, J. Mirza, C. Kim, R. Harrison, R. K. Govindaraju, K. Gildea, P. DiNicola, and C. Bender. Performance and experience with LAPI – a new high-performance communication library for the IBM RS/6000 SP. In International Parallel Processing Symposium, March 1998.
17. P. Shivam, P. Wyckoff, and D. K. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet message passing. In Proceedings of SC '01, Denver, CO, November 2001.
18. P. Shivam, P. Wyckoff, and D. K. Panda. Can user level protocols take advantage of multi-CPU NICs? In Proceedings of IPDPS '02, Ft. Lauderdale, FL, April 2002.
19. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-level Network Interface for Parallel and Distributed Computing. In ACM Symposium on Operating Systems Principles, 1995.
20. T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In International Symposium on Computer Architecture, pages 256–266, 1992.
21. M. Welsh, A. Basu, and T. von Eicken. Incorporating Memory Management into User-Level Network Interfaces. In Proceedings of Hot Interconnects V, August 1997.
WebAppLoader: A Simulation Tool Set for Evaluating Web Application Performance

Katinka Wolter¹ and Kristian Kasprowicz²

¹ Humboldt-Universität zu Berlin, Institut für Informatik, Unter den Linden 6, 10099 Berlin, Germany, [email protected]
² VIVEX GmbH, Lietzenburger Straße 107, 10707 Berlin, [email protected]
Abstract. In this paper we present WebAppLoader, a set of tools for analysing web application performance as it is perceived by the user. This is done by creating 'virtual users' through tracing of web sessions and then simulating groups of virtual users as sets of replications of the earlier traced samples. The groups are defined based on a trace, and the number of users performing a trace as well as their start times can be adapted. Measurements of the transaction times, broken up into several segments, are taken on the client side. The resulting tool suite has three main components: one for tracing a web session, one for simulating repetitions of that trace, and one for taking and evaluating measurements.
1 Introduction
As the use of the Internet has become an integral part of everyday life, its good performance has become essential. E-commerce and various kinds of web applications in particular have become a crucial element of many people's daily lives, and consequently the performance and reliability of web applications have become an important issue. The work presented in this paper was initiated by Novedia AG [Nov]. Novedia provides its customers with web solutions, including the selection of all necessary equipment. In order to find out how many servers a customer will need and how powerful they should be, Novedia wants a tool for testing the performance of a prototype system. The customers of Novedia typically know estimates of how many visitors to expect at their sites. The objective in building WebAppLoader was to be able to answer the question of whether a web application will perform well on a given set of computers under the expected load, generated by possibly large numbers of transactions of different types.
During this project both authors were still affiliated with the real-time and robotics group at the Technical University Berlin.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 47–62, 2003.
© Springer-Verlag Berlin Heidelberg 2003
WebAppLoader has several components, which together serve the purpose outlined above: a recording tool for tracking and writing traces of web application sessions, a simulation tool for executing several instances of sets of the recorded traces, and a simple evaluation tool for taking measurements and showing results of the simulation runs. The request sequences are generated by manually carrying out each sequence once and tracing it with the recording tool. A trace is then stored in XML format. One or more of these sequences are needed as profiles of a 'virtual user' for the simulator. The XML files can easily be modified in an editor. By modifying the XML file of a user session one can, for instance, speed up the transaction or vary the time different users spend looking at a particular page. In consequence, different virtual users follow the same path at a different pace. To make a simulation of virtual users more realistic, several groups of users can be defined and included in the simulation simultaneously. In order to load a web server more and more over time, the group sizes can increase; they can do so to a different degree and at different times for each group. As a result, WebAppLoader is a very flexible tool for generating traffic and studying web server responsiveness.
There is a vast amount of related work. A number of studies on how to characterize Internet traffic as well as web application traffic have been carried out by taking measurements of real traffic or creating traffic with tools such as SURGE [BC98] and WAGON [LNJV99]. The Profit tool, presented in [PJA+00], takes a different angle: it measures server-side response times and breaks them down into the response times of the different types of servers (web server, application server, database server) and the various transactions carried out on those servers. A widely used tool for the analysis of web application performance is httperf [MJ98], developed at Hewlett-Packard.
httperf is very similar to WebAppLoader in that it also sends requests as they are specified, instead of creating request sequences analytically to match given traces, as for instance SURGE does. The tool presented in this paper is more powerful than httperf in several ways: the simulator can issue transactions not only to one given address, but can perform full sessions consisting of several transactions, each specified by a web address and a think time. It can represent not only one user, but a group of users. During a simulation run the size of the groups can increase, in order to increase the server load over time. A simulation can consist of several groups, possibly using different user profiles. The remainder of this paper is organized as follows: a short section presents the architecture of WebAppLoader; then we present the components of the tool set in the sequence in which they are used. In Section 3 we present the tracing component, in Section 4 the simulation tool is described, in Section 5 results obtained with the tool are presented, and in Section 6 we conclude this paper.
2 The Architecture of WebAppLoader
WebAppLoader is built from several components, as shown in Figure 1: a tracing tool, a simulator, a logger and an evaluation tool. The tracing component records a user session and creates an XML trace file for each user session. If necessary, the XML file can be edited with regular text editors outside WebAppLoader.

Fig. 1. The architecture of the tool suite (the tracing component produces XML user profiles for the simulator; the simulator's output goes to the logger, which writes a TXT user log for the evaluation component, which in turn produces tables and plots).
Fig. 2. The architecture of the tracing tool (the tracing component acts as a proxy server between the web browser and the web application; over HTTP interfaces, a request travels from browser to proxy (step 1) and on to the web application (step 2), and the response returns via the proxy (steps 3 and 4)).
The simulator then takes the XML trace files as input and generates virtual users. Note that we implemented WebAppLoader in Java. Each virtual user is represented by a thread in the Java simulator executing a user profile. The logger logs all output of the simulator. It collects measurements for various metrics, and writes the collected data to a text file. This text file is used by the evaluation component of the tool to generate statistics.
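The thread-per-virtual-user idea can be sketched as follows. WebAppLoader itself is written in Java; this Python version with a stubbed-out fetch function and a shortened, hypothetical profile only illustrates how a group of virtual users replays a trace and logs client-side timings:

```python
import threading, time

def run_virtual_user(profile, fetch, log):
    """Replay one traced session: request each URL, log the response
    time, then 'think' for the recorded duration."""
    for url, think_time in profile:
        t0 = time.perf_counter()
        fetch(url)
        log.append((url, time.perf_counter() - t0))
        time.sleep(think_time)

# A group of three virtual users sharing one (shortened) profile.
profile = [("/htmlfiles/", 0.01), ("/htmlfiles/teaching/", 0.01)]
log = []
users = [threading.Thread(target=run_virtual_user,
                          args=(profile, lambda url: None, log))
         for _ in range(3)]
for u in users: u.start()
for u in users: u.join()
```

Scaling the group over time then amounts to starting additional such threads at the configured points of a simulation run.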
The most interesting component from a software architecture perspective is the tracing tool; its architecture is shown in Figure 2. The tracing tool implements a proxy, which is plugged in between the web browser and the Internet. Each request that is sent to the web application first passes through the proxy, where it is scanned and time-stamped to generate the above-mentioned XML files. We now explain the functionalities offered by the various components of WebAppLoader in detail.
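The proxy's core bookkeeping — turning time-stamped requests into a profile of (URL, think time) pairs — can be sketched as follows (our reconstruction of the idea, not the tool's actual code or XML layout):

```python
def build_trace(events):
    """Derive a user profile from time-stamped requests: the think
    time of a page is the gap until the next request.
    `events` is a chronological list of (timestamp_sec, url) pairs."""
    trace = [(url, t_next - t)
             for (t, url), (t_next, _) in zip(events, events[1:])]
    if events:
        trace.append((events[-1][1], 0.0))   # last page has no successor
    return trace
```

For example, requests at 0 s, 16 s, and 26 s yield think times of 16 s and 10 s for the first two pages, which is exactly the kind of per-page timing that ends up in the XML user profile.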
3 The Tracing Component
The first step in testing a web server's performance with WebAppLoader is to create virtual user sessions as they will be performed by real users.

Fig. 3. Tracing a web session (the interface of the tracing tool).

A web application can offer a number of different sessions, which real users might carry out. If all sessions shall later be simulated, each of them is carried out once by the tool user and traced in the meantime. This is done by choosing the component 'user profile'¹ (called Benutzerprofil in the tool) from the main menu.
¹ The tool set is property of Novedia and therefore most of the labeling is in German.
Any web browser can be chosen for creating the traces. In order to make the web browser communicate with the tracing component of WebAppLoader, the browser has to be configured to use the WebAppLoader proxy on port 8081. The proxy in WebAppLoader is started when the tracing component is active. Figure 3 shows the interface of the tracing tool. As soon as the button 'start recording' (Aufzeichnen in the tool) is pushed, the tool starts tracing which web page is being loaded and how much time the test person spends looking at each individual page. Pushing 'stop recording' (labelled Aufzeichnen beenden) ends the session. The session can then be saved to a file in XML format. This file can later be edited to add attributes that might be beneficial for simulating, or to change recorded values, if desired. We will describe this in more detail in Section 4.

Table 4. The URLs of the web session traces

A   /htmlfiles/
B   /htmlfiles/teaching/
C   /htmlfiles/teaching/seminarWS02/
D1  /htmlfiles/teaching/seminarWS02/restart1.pdf
D2  /htmlfiles/teaching/seminarWS02/kotzAcampus.pdf
E   /
F   /htmlfiles/seminarWS02.html
G   /cgi-bin/hello
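A scripted client can be pointed at the WebAppLoader proxy in the same way the browser is configured above. A minimal sketch using Python's standard library (the proxy address is the one described in the text; the client code itself is ours, not part of the tool set):

```python
import urllib.request

# Route all HTTP traffic through the WebAppLoader proxy on port 8081,
# mirroring the browser proxy configuration described above.
proxy = urllib.request.ProxyHandler({"http": "http://localhost:8081"})
opener = urllib.request.build_opener(proxy)

# Requests issued via this opener would pass through (and be traced by)
# the proxy, e.g.: opener.open("http://localhost:8080/htmlfiles/")
print(type(opener).__name__)
```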
K. Wolter and K. Kasprowicz

To illustrate how to use the tool set, we have traced three very simple web sessions. All our experiments are carried out on an HP Omnibook xt6200 laptop (1.6 GHz, 256 MB RAM, 20 GB disk). We decided to install the Apache V1.3 web server and perform all experiments locally on this machine, showing how the server performance can be measured without dealing with network delays. We note that the pages we load are small and serve the purpose of illustration only. Two of our web sessions consist of loading an HTML page and a paper, respectively. The third one is a simple CGI script designed to load the server more and more as it is executed by more virtual users. The script performs some simple mathematical computations. Table 2 lists the three recorded sessions and Table 1 gives the abbreviated web locations. The think time, which is the time spent on each of the pages, is listed in Table 2 in the row following the transaction sequence. For example, the first session consists of first loading page A, where the URL abbreviated by A can be found in Table 1. The user spends 16 seconds looking at page A. Page A is loaded by issuing 6 different requests for loading the text and the included graphics. Page A is followed by pages B, C, and D1, respectively. Pages D1 and D2 are different documents in the same location. Some of the visited pages need several requests since they include a number of graphics that are loaded separately. The number of requests needed to load a page is given in the third row for each session in Table 2. For the trace seminar1, the XML file can be found in Appendix A.

Table 2. The web session traces

session 1: seminar1
  URL sequence      A   B   C   D1
  think time (sec)  16  10  15  10
  no. of requests   6   1   7   1

session 2: seminar2
  URL sequence      E   A   F   D2
  think time (sec)  6   4   25  35
  no. of requests   2   6   1   1

session 3: helloworld
  URL sequence      G   E
  think time (sec)  22  1
  no. of requests   1   1
In the next section we will show how to use traces for creating a set of virtual users that put load on the web server. We will use the three traces to compose a mixture of groups of potential users.
4 The Simulation Component
The simulation component executes given XML files in as many instances as the user of the tool specifies. We now briefly describe the structure of the XML trace files, an example of which is given in Appendix A. An XML trace file starts with some header definitions, e.g. the document type definition (DTD) used. Then follows a <browsertrace> ... </browsertrace> statement. The browsertrace can have an attribute called ActiveCookieBrowser, which indicates that each user using the profile will be treated as a new user, not sending any cookies and hence receiving new ones from the web server. The default value is SimplePlaybackBrowser, which keeps all cookies and reuses them for new virtual users. A browsertrace consists of several <page> ... </page> statements. A page can have two attributes: thinktime, the time that is spent looking at this page, and thinktimemode, which can be either fixed or random. Thinktimemode fixed indicates that each virtual user in the simulation should spend this exact time on a page, while thinktimemode random means that the time each virtual user spends on a page is to be chosen randomly. Each page is then composed of a number of <request> ... </request> statements that load the different elements of a web page. Usually the elements are text, pictures, or icons. A request has a method attribute that can take the values GET and POST, where the former means that a page is to be loaded, whereas the latter indicates posting of a page to some location. The default value is GET.
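As a sketch of how such a trace might be consumed programmatically (the element and attribute names are taken from the description above and from Appendix A; the parsing code itself is ours, not part of WebAppLoader):

```python
import xml.etree.ElementTree as ET

# A minimal browsertrace in the format described above (abridged).
trace_xml = """
<browsertrace>
  <page thinktime="16">
    <request method="GET" url="/htmlfiles/" version="HTTP/1.0"/>
    <request method="GET" url="/icons/blank.gif" version="HTTP/1.0"/>
  </page>
  <page thinktime="10">
    <request method="GET" url="/htmlfiles/teaching/" version="HTTP/1.0"/>
  </page>
</browsertrace>
"""

root = ET.fromstring(trace_xml)
pages = []
for page in root.findall("page"):
    urls = [req.get("url") for req in page.findall("request")]
    pages.append((int(page.get("thinktime")), urls))

print(pages)
```

A simulator would replay the requests of each page and then sleep for the recorded think time before moving to the next page.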
Fig. 4. The simulation tool

In addition to the attributes given in the XML file representing the session, the simulation tool accepts a set of parameters that are specified in the simulation interface for each group of users. A group is a set of virtual users using the same profile, possibly running concurrently. In our tool implementation each virtual user is assigned a thread. This allows us to run several hundred virtual users concurrently. For each group we set the speed of the network connection, which can range from a slow modem connection up to a very fast Ethernet connection. In our experiments we chose the speed of a good modem connection, 56 Kbit/sec. The interface has fields for the initial number of users of each group, the number of users that is added each time the group size increases, and the time interval in seconds after which the group size is supposed to increase. Pushing the button 'start simulation' starts a simulation run; pushing the button 'stop simulation' ends it. We specify a scenario in Figure 4 and Table 3. We generated traffic for approximately 1 hour and 20 minutes. A logger attached to the simulator takes a number of time stamps and writes them to an output file, which will be studied in the next section.

Table 3. The parameters for the simulation

Profile          connection speed  initial #users  inc. users  inc. time interval  max. users
helloworld.xml   56Kbps            1               1           5 seconds           15
seminar1.xml     56Kbps            2               2           3 seconds           10
seminar2.xml     56Kbps            5               5           5 seconds           10
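The ramp-up behaviour described above (start with an initial group size, then add users at fixed intervals up to a maximum) can be sketched with one thread per virtual user. The times below are scaled down and the replay logic is reduced to a stub; WebAppLoader's actual implementation is not public, so this is only an illustration of the scheme:

```python
import threading
import time

def virtual_user(profile, stop):
    # Stub: a real virtual user would replay the requests of its XML
    # profile, sleeping for the recorded think times, until stopped.
    stop.wait()

def run_group(profile, initial, increment, interval, maximum):
    """Grow a group of virtual users from `initial` to `maximum`,
    adding `increment` users every `interval` seconds."""
    stop = threading.Event()
    threads = []
    count = initial
    while True:
        while len(threads) < min(count, maximum):
            t = threading.Thread(target=virtual_user, args=(profile, stop))
            t.start()
            threads.append(t)
        if count >= maximum:
            break
        time.sleep(interval)
        count += increment
    stop.set()
    for t in threads:
        t.join()
    return len(threads)

# Scaled-down version of the helloworld group in Table 3:
# 1 initial user, +1 user per interval, up to 15 users.
started = run_group("helloworld.xml", 1, 1, 0.01, 15)
print(started)
```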
5 Results
In this section we first look at what measurements are being taken by the tool, and then we analyze the data to show the kind of questions that can be answered by using WebAppLoader. The tool uses 5 different time stamps (t0, ..., t4), as shown in Figure 5. The first measurement point is the initiation time of a transaction, the second is the time at which a connection with the server has been established, the third is when the request has been sent out completely, the fourth measurement point is after receipt of the first byte of the answer from the server, and the last measurement is taken when the answer has been obtained completely. From these measurement points we can derive time metrics as shown in Table 4.

Fig. 5. Measurement points (t0: initiate transaction; t1: connection established; t2: request sent; t3: first byte received; t4: reply received completely)
The transaction time is the total sum of all measured times, and the response time is the time to first byte (TTFB) plus the receive time. Ignoring the network delay, the response time corresponds to the time the server needs to process and transmit a page. All measurements are taken in the proxy; we instrument neither the server nor the network.

Table 4. The measurement intervals

connect time (CT)          t1 - t0
send time (ST)             t2 - t1
time to first byte (TTFB)  t3 - t2
receive time (RT)          t4 - t3
response time (ReT)        t4 - t2
transaction time (TT)      t4 - t1

Figure 6 shows the results given by WebAppLoader in the menu 'statistics' (labelled Statistik) after loading the file userlog.txt. Clicking on root provides the list of groups that have been included in the simulation run. In the lower part of the window the average number of transactions per second and the average number of bytes transmitted per second are shown.

Fig. 6. First level results output

As the user chooses a group, the pages loaded for that group are shown in the rightmost column (see Figure 7). In the bottom part of the window the average, minimum, and maximum transaction time in milliseconds for that page are shown, as well as the average bytes loaded per second while requesting that page. As one clicks on page i in the right section of the upper part of the window (shown in Figure 8), the lower part of the window shows the list of requests that are sent out for loading the page. For each request the average, maximum, and minimum transaction time are shown, as well as the average number of bytes transmitted and the status code. The status code field indicates what percentage of requests achieved HTTP status code 200, which means successful completion.
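The derived intervals in Table 4 are simple differences of the five time stamps. A small sketch (the timestamp values below are made up for illustration):

```python
def derive_metrics(t0, t1, t2, t3, t4):
    """Derive the Table 4 metrics from the five measurement points."""
    return {
        "connect_time": t1 - t0,
        "send_time": t2 - t1,
        "time_to_first_byte": t3 - t2,
        "receive_time": t4 - t3,
        "response_time": t4 - t2,      # TTFB + receive time
        "transaction_time": t4 - t1,   # send time + TTFB + receive time
    }

# Hypothetical timestamps in milliseconds:
m = derive_metrics(0, 5, 10, 4010, 4200)
print(m["response_time"], m["transaction_time"])
```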
Fig. 7. Statistics per group of users
Fig. 8. Statistics for each loaded page

Other values of the status code indicate that the pages have not been found, or were still cached, so they were not reloaded. For the purpose of analyzing the web server performance we only consider requests with status code 200. At the bottom of the statistics window there is a button 'generate intervals' (labelled Intervalle generieren). Pushing this button generates averages over time intervals, in seconds, as specified in the three fields next to the button. The size of the time intervals can be chosen arbitrarily. We have computed 10-, 30-, and 60-second averages. In the discussion below we always use the data from the 60-second averages if not stated otherwise. From the transaction times and the loaded bytes per second we can infer which pages are big in size (they take a long time to load even at a high rate of bytes per second). Using our tool, we answer some questions about the reasons for delays by looking at the output file WebAppLoader generates. The answers we give are specific to our scenario, but they show the kind of answers that can be obtained by using WebAppLoader in an arbitrary setting to study a system's behavior. The output file gives all measurements (see Figure 5) for all groups, individual virtual users in a group, pages per virtual user, and requests per page. All plots in this paper are generated from the output file of WebAppLoader. Figure 10 shows the observations for the average transaction time, the receive time, the time to the first byte, the average number of bytes transmitted per second, and the average number of active virtual users. All averages are taken over 60-second intervals, while all times are measured in milliseconds. To make the curves all fit into one plot, some of the data has been scaled as labeled in the legend. For example, the curve showing the number of bytes transmitted has to be multiplied by 10 to obtain the real numbers, while the number of active users has to be divided by 250. Figures 9, 11 and 12 show selected curves taken from Figure 10 for easier interpretation. The curves in Figure 9 indicate at least the following trends: i) the transaction time increases with time, ii) the receive time remains constant over time, and iii) the time to the first byte increases over time. Note that the transaction time is the sum of the time to the first byte and the receive time, since all other contributing metrics equal zero in our measurements. So the trend in the transaction time is dominated by the trend in the time to the first byte. We also see from Figure 10 that the number of virtual users increases over time, and an obvious conclusion is that the time to the first byte increases because more and more virtual users execute the CGI script, carrying out computations on the web server. Figure 11 therefore shows the TTFB versus the number of virtual users in the original data. Note that the initial number of virtual users was 8, which very soon increased to 20 and more. Obviously, there is a strong correlation between the two metrics, which we do not try to characterize formally in this paper.
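The 'generate intervals' feature described above amounts to bucketing measurements by time and averaging per bucket. A minimal sketch (the record format here is invented for illustration; the actual layout of userlog.txt is not specified in the text):

```python
from collections import defaultdict

def interval_averages(records, width):
    """Average measured values over consecutive time intervals.

    records: iterable of (timestamp_seconds, value) pairs
    width:   interval width in seconds (e.g. 10, 30, or 60)
    """
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[int(ts // width)].append(value)
    return {k * width: sum(v) / len(v) for k, v in sorted(buckets.items())}

# Hypothetical (timestamp, transaction time in ms) records:
log = [(3, 100), (45, 140), (70, 300), (110, 500)]
print(interval_averages(log, 60))
```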
Fig. 9. Addition of time periods in 60 second intervals (average receive time, average time to first byte, average transaction time)
Fig. 10. Averages over 60 second intervals (average receive time, average time to first byte, average transaction time, average bytes per second [×10], average number of active users [÷250])
Fig. 11. Average TTFB versus number of users

In order to make sure that the delay is not due to slower file transmission, Figure 12 shows the average number of transmitted bytes per second and the average number of virtual users. After an initial increase, the average number of transmitted bytes shows no trend over time, while the number of virtual users increases.

Fig. 12. Transmitted bytes per second and number of users in 60 second intervals
It can be concluded from the above that the bottleneck in this web application is the computing power of the web server. If we consider a response time of up to 8 seconds acceptable, which corresponds to a TTFB of roughly 4 seconds (see Figure 9), then the system can deal with up to 28 users (see Figure 11). If we want to allow for more users, we should add a new server, possibly an application server to do the computations. If estimated network delays, as real users will experience them, are taken into consideration, the acceptable transaction time and the number of users will be even smaller. We do not carry out a rigorous analysis in this paper but merely demonstrate how WebAppLoader can help gain the desired insights for planning and dimensioning web applications.
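The capacity argument above can be mechanized: given measured (number of users, average TTFB) pairs, find the largest user count whose TTFB stays below the acceptable threshold. A sketch with made-up sample data (the real measurements are in Figures 9 and 11):

```python
def max_users_within(samples, ttfb_limit_ms):
    """Largest observed user count whose average TTFB is within the limit."""
    ok = [users for users, ttfb in samples if ttfb <= ttfb_limit_ms]
    return max(ok) if ok else 0

# Hypothetical (active users, average TTFB in ms) measurements,
# loosely shaped like the trend in Figure 11:
samples = [(8, 600), (15, 1800), (22, 3100), (28, 3900), (32, 5200)]
print(max_users_within(samples, 4000))
```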
6 Conclusions
In this paper we have presented WebAppLoader, a tool for planning and dimensioning web services. The interesting features of the tool are the sophisticated means by which user sessions can be specified and mixed. We have shown that the tool set allows quantifying a priori the capacity of a web application that is to be newly installed, as well as finding bottlenecks in the performance of an existing web application. In future work, data analysis components should be added to automatically extract the important information from the measurements. In addition, we plan to implement the restart technique [MH01] for reducing response times in Internet transactions and to use the tool to study the method's impact.
References

[BC98] P. Barford and M. Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. In ACM SIGMETRICS Performance Evaluation Review, Proc. of the ACM SIGMETRICS Joint Intl. Conf. on Measurement and Modeling of Computer Systems, volume 26(1), pages 151–160, June 1998.
[LNJV99] Z. Liu, N. Niclausse, and C. Jalpa-Villanueva. System Performance Evaluation: Methodologies and Applications, chapter Web Server Benchmarking and Web Traffic Modeling. CRC Press, 1999.
[MH01] S. M. Maurer and B. A. Huberman. Restart strategies and Internet congestion. Journal of Economic Dynamics & Control, 25:641–654, 2001.
[MJ98] D. Mosberger and T. Jin. httperf: A Tool for Measuring Web Server Performance. In First Workshop on Internet Server Performance, pages 59–67, Madison, WI, USA, June 1998. ACM.
[Nov] http://www.novedia.de.
[PJA+00] G. T. Paixão, W. Meira Jr., V. A. F. Almeida, D. A. Menasce, and A. M. Pereira. Design and Implementation of a Tool for Measuring the Performance of Complex E-Commerce Sites. In Proc. 11th Int. Conf. on Computer Performance Evaluation: Modelling Techniques and Tools, LNCS 1786, pages 309–323, Schaumburg, IL, USA, March 2000.
A The seminar1.xml User Profile
<page thinktime="16">
  <request method="GET" url="/htmlfiles/" version="HTTP/1.0">
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    gzip
    en
    iso-8859-1,*,utf-8
  <request method="GET" url="/icons/blank.gif" version="HTTP/1.0">
    http://localhost:8080/htmlfiles/
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png
    gzip
    en
    iso-8859-1,*,utf-8
  <request method="GET" url="/icons/back.gif" version="HTTP/1.0">
  <request method="GET" url="/icons/text.gif" version="HTTP/1.0">
  <request method="GET" url="/icons/folder.gif" version="HTTP/1.0">
  <request method="GET" url="/icons/unknown.gif" version="HTTP/1.0">
<page thinktime="10">
  <request method="GET" url="/htmlfiles/teaching/" version="HTTP/1.0">
    http://localhost:8080/htmlfiles/
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    gzip
    en
    iso-8859-1,*,utf-8
<page thinktime="15">
  <request method="GET" url="/htmlfiles/teaching/seminarWS02/" version="HTTP/1.0">
    http://localhost:8080/htmlfiles/teaching/
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    gzip
    en
    iso-8859-1,*,utf-8
  <request method="GET" url="/icons/layout.gif" version="HTTP/1.0">
    http://localhost:8080/htmlfiles/teaching/seminarWS02/
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png
    gzip
    en
    iso-8859-1,*,utf-8
    . . . .
<page thinktime="10">
  <request method="GET" url="/htmlfiles/teaching/seminarWS02/restart1.pdf" version="HTTP/1.0">
    http://localhost:8080/htmlfiles/teaching/seminarWS02/
    <user-agent>Mozilla/4.77 [en] (X11; U; Linux 2.4.19 i686)
    localhost:8080
    image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    gzip
    en
    iso-8859-1,*,utf-8
A Comprehensive Toolset for Workload Characterization, Performance Modeling, and Online Control

Li Zhang, Zhen Liu, Anton Riabov, Monty Schulman, Cathy Xia, and Fan Zhang

IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598
{zhangli,zhenl,riabov,schulman,cathyx,fzhang}@us.ibm.com
Abstract. With the advances of computer hardware and software technologies, electronic businesses are moving towards the on-demand era, where services and applications can be deployed or accommodated in a dynamic and autonomic fashion. This leads to a more flexible and efficient way to manage various system resources. For on-demand services and applications, performance modeling and analysis play key roles in many aspects of such an autonomic system. In this paper, we present a comprehensive toolset developed for workload characterization, performance modeling and analysis, and on-line control. The development of the toolset is based on state-of-the-art techniques in statistical analysis, queueing theory, scheduling techniques, and on-line control methodologies. Built on a flexible software architecture, this toolset provides significant value for key business processes. These include capacity planning, performance prediction, performance engineering, and on-line control of system resources. Keywords: Performance analysis, performance prediction, capacity planning, Web service modeling, queueing networks, on-line control.
1 Introduction
As e-businesses evolve and are adopted by more and more industries, increasing portions of business processes are being handled by computers, through Web interfaces and Web services. Complex business logic is built into these enterprise systems with routers, Web servers, authentication servers, application servers, back-end databases, etc. These enterprise systems, controlled by sophisticated software, perform a variety of business functions including authentication/verification, ordering, approval, billing, and account management. In order for such complex systems to perform critical business functions reliably and efficiently, system administrators need to be able to monitor and manage the whole system effectively. There are many challenging issues in the analysis and management of these large distributed systems. Here, we present a rich set of performance modeling and analysis tools called COMPASS, to assist in dynamic capacity planning and in the efficient management of highly accessed, commercial Web systems. COMPASS stands for Control and Optimization based on Modeling, Prediction and AnalySiS. The main components of this set of tools are workload characterization, system and application modeling and analysis, and on-line optimal control. Each component can function as an independent module. More importantly, these components also work in coordination to provide better understanding and management of the underlying system.

In order to better manage such complex service systems, we need to first understand what the requests to the system are, how these requests arrive, what their key characteristics are, and how the requests will change over time. Workload characterization is mainly concerned with the analysis of the request arrival processes. It also includes on-line monitoring and prediction services for the request arrival and system usage measurements. We next need to understand how various requests are served by the system. We need to be able to build models for the system architecture, specify the service components for each type of request, and quantify the speed and overhead for all types of requests at each service component. These details form system and application modeling and analysis. The systems, which contain many different types of resources and may have extra capacity, typically have various resource control and scheduling mechanisms. Administrators can tune these mechanisms to achieve more efficient system usage, lower system cost, and increased overall profit. The on-line optimal control component provides efficient algorithms to dynamically adjust the control policies on-line, based on the recently observed arrival processes, the system and application models, and the specified control objective function. Based on advanced statistics, stochastic processes, queueing, control, and optimization theories, the use of the COMPASS tools can lead to significantly improved solutions for a range of mission-critical business processes including capacity planning, performance engineering, life cycle management, and business process management.

P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 63–77, 2003.
© Springer-Verlag Berlin Heidelberg 2003
For example, a number of fundamental questions arise for capacity planning, performance prediction, and service level agreement provisioning: What is the capacity of the current system? What is the current request traffic volume? What level of response times are users experiencing? What can be done to improve the system's performance? Where is the potential bottleneck in the system? When will the servers run out of capacity? The answers are often obtained through benchmarking, on-line monitoring, and system modeling and analysis. The COMPASS tools apply sophisticated statistics and modeling techniques to analyze the request arrival patterns, forecast how these arrivals will change over time, construct system models for the request service processes, and trigger appropriate control actions. The use of the COMPASS tools will lead the current system toward a better operating state. In general, the systems will be managed more efficiently, in an autonomic, on-demand fashion. The rest of the paper is organized as follows. In Section 2, we present the overall architecture and interfaces of the toolset. Sections 3 through 5 present the main functional modules of the toolset. We conclude with a summary and discussion.
2 The Overall Architecture

In this section we first describe the technologies used in the COMPASS implementation. Next, we list the high-level components of our implementation and explain how these components interact in different typical usage scenarios. In this section, we provide only general descriptions of algorithm families, without describing the details of the methods used for modeling, analyzing, and controlling target systems. We leave all relevant detailed descriptions for the following sections, and focus on the overall COMPASS toolkit architecture.

The implementation of the COMPASS tools and algorithms is based on the Java 2 Standard Edition platform¹ [18]. Java was chosen as the implementation language because it satisfied our requirements of portability, a short development cycle, development of an extensive GUI, and compatibility with other performance analysis tools. We used platform version 1.4.1 in our development and testing. At least version 1.4.1 is required to run COMPASS. In our implementation, we make use of several API sets included in the Java 2 SE platform. All user interface code is based on the portable and lightweight Swing library. JDBC is used for operations with large data arrays, which can be stored in a JDBC-compatible relational database. The DOM parser provided by the Java API for XML Processing (JAXP) is used to process XML files. We use the Remote Method Invocation (RMI) API in some of our implementations of measurement and control components, in order to communicate with the systems that are being monitored or controlled by COMPASS. Finally, the Java Native Interface (JNI) is used on the target system to invoke kernel control code written in C. We also use two modules that are not included in the standard Java 2 Platform. We make use of advanced mathematical functions included in the freely available open-source JSci library [6]. To achieve compatibility with the Agent Building and Learning Environment (ABLE), we utilize the open-source ABLE API [1,2]. Figure 1 presents an overview of the software components comprising the current COMPASS toolkit.
From the end-user point of view, there are three options for invoking the COMPASS tools: end-system configuration mode, on-line measurement, analysis and control mode, and off-line analysis mode. Back-end algorithms and methodologies support the two analysis modes and the GUI elements. The GUI elements are responsible for the configuration of algorithm parameters. These algorithms and methodologies are often shared between the on-line and off-line implementations, with minor differences in the parts responsible for control flow. The core part of the COMPASS toolkit consists of several families of analysis and prediction algorithms (shown at the top of Figure 1). Methods used for system performance analysis include simulation and approximation. Workload (traffic) analysis is based on session identification and pattern classification algorithms. Most algorithms can be used in both on-line and off-line modes, depending on the user-specified configuration. The most trivial example of this is when the traffic analysis algorithms are used in off-line mode to infer traffic model parameters based on web server logs. In on-line mode, performance analysis algorithms can initiate control actions for improving user-perceived system response time or for maximizing profit based on a service level agreement contract. A possible action can be raising alarms to warn about potential system malfunctions, or about unusually high loads. More sophisticated control algorithms use the abstract system control interface (listed in the infrastructure section of Figure 1) in order to change configuration parameters of the target system at runtime. Algorithm components are designed to conform to a specified interface, and are made to be easily interchangeable and interconnectable, in order to achieve maximal code re-use in our implementation. Our algorithms are based on a set of mathematical models, which are used to describe system and traffic behavior. Initial model parameters and the target system configuration are specified with the assistance of the COMPASS GUI. Afterwards, system or traffic analysis algorithms can update model parameters based on observed system behavior. The implementation of the COMPASS analysis and optimization algorithms is supported by a set of utility classes. Mathematical utility packages include the home-grown discrete event queueing network simulation library, BlueQueue, and statistical analysis tools, which include probability distribution fitting algorithms. Algorithm parameters, time series of measurements, control actions, and predictions can be persisted to a relational database via the JDBC interface, or to custom binary files. The persistence utilities also provide an XML serialization infrastructure for Java objects. Measurements and control actions are taken through a set of interfaces that form a target system abstraction layer. This layer, consisting of measurement and control interfaces, allows COMPASS to interact with a variety of platforms and software packages, which makes it easier to add support for new systems. The Java Swing-based graphical user interface is essential to the COMPASS toolkit. Most COMPASS operations can be performed or monitored via the graphical interface. These operations include system configuration setup, on-line system and workload monitoring, and off-line what-if analysis. COMPASS uses custom GUI components for displaying charts and plots, which allow the efficient display of easy-to-read representations of system status, forecasts, and other types of data.

¹ Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

Fig. 1. COMPASS Components (analysis and prediction algorithms: performance analysis via simulation and approximation, workload analysis via session identification and pattern classification; models: system models (queueing model), traffic models (EWMA, periodic); infrastructure: target systems interface with measurement and control components, persistence via XML, JDBC, and custom binary formats, GUI with model specification, monitoring and alarms, and what-if analysis; mathematical tools: statistical analysis and distribution fitting, BlueQueue (simulation))
In on-line analysis mode, the algorithm components of COMPASS are connected through a flexible data interface. This interface supports automatic recording of time series data, raising alarms based on user-specified sets of predicates, and allows GUI components to receive data for monitoring. Monitoring, saving time series data, and alarms are optional. They may be enabled or disabled, according to user specifications, at any particular data transfer point. Figure 2 illustrates the data flow between COMPASS components when COMPASS is started in on-line analysis mode. Measurements are taken from the target system via system-specific components that conform to the COMPASS measurement interface. Measurements of traffic intensity are used to analyze the workload and make predictions of future workload. System performance parameters, such as CPU utilization and memory, are measured and analyzed. The resulting system performance model, together with the workload predictions, is used to make predictions about future system performance, make decisions about possible control actions, and raise alarms. Control actions, such as re-assigning system resources allocated to different types of tasks in the target system, are taken through the unified control interface. This interface allows easy customization for particular software configurations. The same sequence of algorithms can be used in training mode to infer system model parameters. This would be the case if, instead of connecting to a production system driven by web users, the COMPASS components are connected to a test system driven by a workload generator. In this scenario, the workload generator can also be controlled by COMPASS, and used to supply additional information to the traffic modeling algorithms.

Fig. 2. Data flow in an on-line mode (measurements flow from the target system through the measurement interface to workload analysis and to performance analysis and prediction; control algorithms act on the target system through the control interface)
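Figure 1 lists EWMA among the traffic models. As a hedged illustration of that model family (not COMPASS's actual code), a one-step-ahead exponentially weighted moving average predictor can be sketched as:

```python
def ewma_forecast(series, alpha=0.5):
    """One-step-ahead EWMA forecast: s_t = alpha*x_t + (1-alpha)*s_{t-1}.

    series: past traffic-intensity measurements, oldest first
    alpha:  smoothing weight in (0, 1]; higher reacts faster to change
    """
    s = series[0]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

# Hypothetical requests-per-minute measurements:
arrivals = [100, 120, 110, 130]
print(ewma_forecast(arrivals, alpha=0.5))
```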
3 Workload Characterization The workload characterization of COMPASS consists of three key components: traffic monitoring and analysis, workload profiling and classification, and traffic prediction. Request arrival information is collected and passed to the workload characterization module through a flexible data flow interface. The analysis module analyzes arrival information and builds models for the arrival process. For example, such models for Web site usages may describe user arrivals in sessions. Within each session, users may visit
68
L. Zhang et al.
multiple pages, with think times in between page views. When a page is loaded, a series of requests for embedded images is initiated. Algorithms from [10] are implemented to identify session and page-view (or click) information. Figure 3 shows the analysis results for a customer workload. The top plot shows the number of user session arrivals per minute over the specified time window. The middle plot shows the distribution of the number of clicks in each session. The bottom plot shows the inter-click time distribution. The analysis module also provides a distribution fitting function. This
Fig. 3. Example of a Customer Workload
function is used to calculate the best estimates of the distribution parameters for the session arrival process, the inter-page-view time distribution, and the number-of-page-views-per-session distribution. Users can select the form and type of mixture of distributions to use for the fitting. The output can be stored in a given file; this file also includes the goodness-of-fit measures used to assess the fitting.
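As an illustration of this kind of fitting (and not COMPASS's actual algorithm), the following Python sketch fits an exponential distribution to synthetic inter-click times by maximum likelihood, and computes a Kolmogorov-Smirnov statistic as a simple goodness-of-fit measure. The sample data and function names are hypothetical.

```python
import math
import random

def fit_exponential(samples):
    # MLE for an exponential distribution: rate = 1 / sample mean
    return len(samples) / sum(samples)

def ks_statistic(samples, cdf):
    # Kolmogorov-Smirnov distance between the empirical CDF and a fitted CDF
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

random.seed(1)
# synthetic inter-click times, drawn here from a known exponential for the demo
inter_click = [random.expovariate(2.0) for _ in range(5000)]
rate = fit_exponential(inter_click)
d = ks_statistic(inter_click, lambda x: 1.0 - math.exp(-rate * x))
print(rate, d)
```

A small KS distance here indicates that the fitted exponential matches the empirical distribution well; a real tool would compare several candidate (mixture) families this way and report the measures alongside the fitted parameters.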
A Comprehensive Toolset for Workload Characterization
69
Key characteristics of the arrival process that have strong impacts on the server's performance are also extracted by the workload characterization module. Studies [15,16,17] have shown that correlation characteristics, such as short-range and long-range dependencies, have a significant impact on user response times. Burstiness characteristics, such as the variability and heavy-tailedness of the request distributions, also have a significant impact. User response times, under long-range dependent and heavy-tailed request processes, can degrade by orders of magnitude and have fundamentally different decay rates when compared with the traditional Poisson models. These key parameters include correlation factors, marginal distributions, identified user access patterns, page visit sequences, think times, and various matching distribution parameters. They are calculated by the workload characterization module to establish a complete workload profile, which is used as input to the other modeling modules. The profiling and classification module further builds a profile of the arrival process for each customer system, representing its arrival patterns over time. The clustering engine constructs groups of customer systems that have similar arrival patterns. For a new customer system, its pattern can be mapped into the most similar group by the classification engine. The clustering and classification algorithms are explained in detail in [12]. The left figure in Figure 4 is the customer view panel. It shows, for a given customer, the normalized request patterns for different days of the week and their corresponding classes. The right figure in Figure 4 is the cluster view. It shows, for a given class, the common request pattern and the list of class members. Each member of the class corresponds to a one-day pattern for a customer site.
Fig. 4. Example of Customer View and Cluster View in Profiling
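The grouping of one-day patterns can be sketched with a plain k-means clustering over normalized daily request counts. This is only an illustration under simplified assumptions (three coarse "hours" per day, hypothetical counts), not the algorithm of [12].

```python
def normalize(pattern):
    # compare patterns by shape, not by total volume
    total = sum(pattern)
    return [x / total for x in pattern]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(patterns, centroids, iters=10):
    # plain k-means: assign each pattern to the nearest centroid, then update
    k = len(centroids)
    labels = [0] * len(patterns)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in patterns]
        for c in range(k):
            members = [p for p, l in zip(patterns, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# hypothetical request counts for four one-day patterns of two shapes
days = [normalize(p) for p in (
    [10, 80, 10], [12, 75, 13],   # midday-peaked days
    [40, 20, 40], [45, 15, 40],   # bimodal days
)]
labels = kmeans(days, centroids=[days[0][:], days[2][:]])
print(labels)
```

With these seeds the two midday-peaked days fall into one class and the two bimodal days into the other; a classification engine would assign a new day to the nearest resulting centroid.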
The prediction module makes predictions of the request arrival process using time series models. Its predictions are based on past access patterns and the constantly changing volume. Figure 5 provides the prediction results for two anonymous customer
Web systems, based on one of our adaptive, template-based algorithms. The top plot in Figure 5 shows the measurements and the predictions; the closer the measurement and prediction lines, the better the prediction. The prediction errors are also analyzed and plotted in the middle plot of Figure 5 to demonstrate the effectiveness of the prediction algorithm. This plot shows, for each relative-error value, the fraction of time during which the relative error is below that value; the lower the curve, the better the prediction. One can also implement other prediction algorithms under our flexible framework and compare them with the collection of prediction algorithms in the repository. Figure 6 shows the prediction accuracy for a subset of available customer
Fig. 5. Predictions for Two Customer Web Sites
Web sites. The algorithm achieves above eighty percent accuracy over eighty percent of the time for most of the customer Web sites. The request volume usually changes over time for a live running system. Given the changing system load resulting from the changing arrival volume, the system may need to take certain actions to better accommodate the changing situation. A change point detection algorithm is provided to detect, on-line, the changing state of the arrival process. Within each state, the arrival process remains relatively unchanged. The vertical lines in the bottom plots of Figure 5 illustrate the change points as detected by our algorithm.
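A minimal on-line change point detector in this spirit can be sketched with a one-sided CUSUM test. The drift and threshold parameters, and the per-minute arrival counts, are hypothetical, and this is not necessarily the algorithm COMPASS uses.

```python
def cusum_change_points(xs, drift=0.5, threshold=5.0):
    # one-sided CUSUM: flag an index when the cumulative deviation of the
    # arrival counts above their running mean exceeds a threshold,
    # then restart the statistics in the new state
    points, s, mean, n = [], 0.0, 0.0, 0
    for i, x in enumerate(xs):
        n += 1
        mean += (x - mean) / n          # running mean of the current state
        s = max(0.0, s + (x - mean) - drift)
        if s > threshold:
            points.append(i)
            s, mean, n = 0.0, 0.0, 0    # restart in the new state
    return points

# hypothetical per-minute arrival counts: the volume jumps at index 30
counts = [10] * 30 + [20] * 30
print(cusum_change_points(counts))
```

A symmetric test in the other direction would catch volume drops; production detectors would also tune drift and threshold to the noise level of the measured counts.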
Fig. 6. Prediction Accuracy
A centralized repository is set up so that all of the measurement data, analysis results, and prediction algorithms can be stored for later re-use and comparison. One can also replay the historical data from the repository in order to drive another system for benchmarking, and to evaluate the prediction and change-point detection algorithms. The workload models in the repository can be used to scale up the traffic volume and generate synthetic, realistic workloads. A set of clients can then generate web requests according to these synthetic workloads. An example that uses this scenario is a realistic benchmarking tool named Bluestone, which is currently being developed.
4 System and Application Modeling and Analysis
Queueing network models are commonly used to model the request serving process in many service systems, including manufacturing and Web server systems [9,13]. The growing functionality provided by Web service infrastructures has resulted in more complex systems, as well as more complicated user access patterns. To model how user requests, or transactions, are served by such complex systems, a single type of request stream feeding into a single black box is far from adequate. The system and application module constructs flexible queueing network models to capture the Web serving process. Each of the multiple components within the server system can be represented as a queue, or as a more complex sub-queueing network. For example, one can use a queue to model the network component within a system, a single-server queue to model a database, etc. Different routing mechanisms, such as round robin and probabilistic routing, can be used to approximate the load balancing schemes in the system [7]. Common service policies, such as processor sharing or priority policies, can be used for each queueing or server component within the model to mimic the
components' service behavior. Users can be categorized into multiple classes based on their access behaviors. For given routing and service parameters of such a queueing system, the system and application modeling module readily obtains performance-related measures, such as throughput, utilization, and response times, via simulation and queueing network theory. The workload profile feeding into the system can be the original profile as well as the forecasted profile from the workload models. Figure 7 shows a sample system architecture. The two types of sources on the left represent two classes of arrival streams to a queue. The splitter represents a load balancing router, which spreads these jobs to a single server and a complex server station. The server station in turn has two servers, with a splitter to balance the load. After finishing service at either the top-level server or the server station, the jobs complete their service at the system and exit. The system editor supports many different kinds of queueing
Fig. 7. The System Model Editor
disciplines, load-balancing algorithms, and server processing policies. The supported queueing disciplines include first-come-first-served (or FIFO), priority (or HOL), and time-dependent priority (or TD) [9]. The supported load balancing algorithms include round robin (RR), class-dependent probabilistic routings (PRV, DET, and PRM in Figure 7), and send-to-first-available (STF). Each server can serve one job at a time (Single Job), or can serve multiple jobs concurrently according to the processor sharing,
discriminatory processor sharing [4], or weighted fair queueing policies [3]. Figure 8 shows an example of the screen for server configuration. This panel allows users to specify the class-dependent service time distributions at the server, the maximum number of jobs the server can serve at any given time, and the blocking policies for new jobs arriving to a busy server.
Fig. 8. The Server Specification Screen
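A server specification of the kind shown in Figure 8 could be represented roughly as follows. The class names, service distributions, and the `Server` type are hypothetical illustrations, not COMPASS's actual data model.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Server:
    # hypothetical server specification: per-class service-time samplers,
    # a concurrency limit, and a blocking policy for arrivals to a busy server
    service: dict                 # class name -> function drawing a service time
    max_jobs: int = 1
    block_policy: str = "drop"    # "drop" or "queue"
    active: int = 0
    queue: list = field(default_factory=list)

    def arrive(self, cls):
        if self.active < self.max_jobs:
            self.active += 1
            return ("serve", self.service[cls]())   # admit and draw service time
        if self.block_policy == "queue":
            self.queue.append(cls)                  # hold the blocked job
            return ("queued", None)
        return ("dropped", None)                    # reject the blocked job

random.seed(0)
srv = Server(service={"static": lambda: random.expovariate(10.0),
                      "cgi": lambda: random.expovariate(1.0)},
             max_jobs=2, block_policy="queue")
print(srv.arrive("static")[0], srv.arrive("cgi")[0], srv.arrive("static")[0])
```

The first two arrivals are served concurrently (the limit is two jobs), and the third is queued rather than dropped because of the chosen blocking policy.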
The specified system is analyzed by running regenerative discrete-event simulations, using our BlueQueue queueing network simulation class library. BlueQueue provides a set of components that facilitate the modeling of a queueing network, using a set of queueing, service, and load balancing policies. Service and inter-arrival times can be modeled as random variables following one of several commonly used families of probability distributions. BlueQueue models allow the user to specify pre-determined or random routings for multiple classes of jobs. Simulated parameters, such as response time, can be analyzed for individual components or arbitrary parts of the queueing network, using statistics collection objects. The statistics objects collect per-class response time mean, standard deviation, and distribution information for each checked component (including queues and servers)
in the system. The system begins a regeneration cycle when a job arrives to an empty system. The confidence intervals are calculated at the end of each regeneration cycle. The simulation stops when the specified confidence levels are reached for all of the checked components. The total running time of the simulation depends on the load of the simulated system: a heavily utilized system does not become empty very often, so observing the same number of regeneration cycles, and hence the simulation itself, takes longer than for an under-utilized system. During the simulation run, the per-class queue lengths and server utilizations are displayed and periodically refreshed, as shown in Figure 7. The simulation can be interrupted in the middle of a run, or it can continue until the desired confidence is reached. The detailed simulation results are shown in a display window and can be saved in text format. Analytical techniques can also be applied for throughput and utilization analysis. Given the growth rate for each type of request, the analytical analysis module identifies the potential bottleneck components and the time horizon for these bottlenecks to reach their limits. While drag-and-drop editing makes the COMPASS tools user friendly, the XML import and export functions for model specification make the tools open and extensible. Using the extensible markup language (XML), the constructed models are easily described and customized for many different platforms in heterogeneous environments. Models can be built and saved in native binary format, or in XML format. We can load stored models, or import models from their XML descriptions. This makes it easier for users to communicate and share their work.
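The regenerative scheme described above can be sketched for a single M/M/1 queue: each busy period that starts when a job arrives to an empty system is one regeneration cycle, and utilization is estimated as a ratio of busy time to total time over cycles. This is a simplified stand-in for BlueQueue with hypothetical parameters, not its actual implementation.

```python
import random

def regenerative_mm1_utilization(lam, mu, cycles=2000, seed=42):
    # simulate M/M/1 regeneration cycles: an idle gap followed by a busy period
    rng = random.Random(seed)
    busy_time, cycle_time = 0.0, 0.0
    for _ in range(cycles):
        idle = rng.expovariate(lam)   # system is empty until the next arrival
        jobs, b = 1, 0.0              # busy period starts with one job
        while jobs > 0:
            ta = rng.expovariate(lam)     # time to next arrival
            ts = rng.expovariate(mu)      # time to next departure
            if ta < ts:
                b += ta
                jobs += 1                 # arrival first (service is memoryless)
            else:
                b += ts
                jobs -= 1                 # departure first
        busy_time += b
        cycle_time += idle + b
    return busy_time / cycle_time         # ratio estimator of utilization

rho = regenerative_mm1_utilization(lam=0.5, mu=1.0)
print(rho)
```

For these rates the true utilization is 0.5, so the cycle-based ratio estimator should land close to it; a full implementation would also compute per-cycle confidence intervals and stop once the requested precision is reached, which is exactly why heavily loaded systems (long, rare busy periods) take longer to simulate.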
After the system model has been built, we can further feed in the workload predictions from the workload characterization component of Section 3 and obtain the predicted performance measures. Combining the predicted performance measures with the bottleneck analysis results, we can then obtain a deeper understanding of how the system functions. In particular, many capacity-planning issues can be addressed. For example, the report includes the projected performance measures, such as user response times and system utilizations, for a projected load or a projected time horizon. The report also includes the projected time for the system to experience performance problems, identification of the overloaded component, and recommendations for the best actions, such as adding front-end or back-end servers, so that the system can deliver the expected performance.
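A toy version of such a bottleneck-horizon projection, assuming geometric traffic growth and fixed per-device service demands (all device names and numbers are hypothetical):

```python
import math

def time_to_saturation(lam0, demands, growth):
    # utilization of a device = arrival rate x its service demand;
    # with the rate growing by `growth` per period, a device saturates when
    # lam0 * growth**t * demand >= 1, i.e. t = log(1/util0) / log(growth)
    horizon = {}
    for dev, d in demands.items():
        util0 = lam0 * d
        horizon[dev] = 0.0 if util0 >= 1.0 else \
            math.log(1.0 / util0) / math.log(growth)
    bottleneck = min(horizon, key=horizon.get)   # first device to saturate
    return bottleneck, horizon

lam0 = 10.0                                          # requests/sec today
demands = {"cpu": 0.05, "disk": 0.08, "net": 0.02}   # sec of service/request
dev, horizon = time_to_saturation(lam0, demands, growth=1.1)
print(dev, horizon[dev])
```

With 10% growth per period, the disk (already at 80% utilization) saturates first, after roughly two and a half periods; a capacity-planning report would attach such horizons to its upgrade recommendations.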
5 Online Optimal Control
Given appropriate data of sufficient detail and accuracy, we can construct workload, system, and performance models automatically. Based on these models and the measurements from the live system, various control actions can be taken. Many systems are capable of performing real-time resource management and scheduling functions [5]. These systems can be configured to adjust system resources, such as network bandwidth and CPU, allocated to different types of jobs based on the user's class or the type of requested service. These control mechanisms are used to achieve a certain QoS objective, as well as to minimize the overall system operating cost.
The optimization functions in the modeling modules map the given QoS requirements into the most cost- and operation-efficient hardware and software configurations. The on-line optimal control module activates a controller, using RMI, to dynamically change the scheduling and resource allocation policies within the servers, based on the system and performance prediction models. Results of these actions are reflected in changes in the monitored performance measures. These performance measures, together with the changing workload, will again influence the control decisions. With all of these functions in place, the system is then empowered with self-managing capabilities. A sample test system has been set up for this on-line optimal control module. The system consists of a front-end web server and a back-end database. A set of client machines is used to generate three classes of user requests: static images, CPU-intensive CGI (Common Gateway Interface) scripts, and database queries. A proportional share scheduler is activated on the web server to provide proportional CPU allocation to the three classes of requests, based on the weight parameters assigned to them. These weights can be adjusted in real time by the control module. Measurement agents collect throughput and response time information from the web server, and then send the collected information to the control module. The control module calculates the best parameter setting for the proportional share scheduler, in order to minimize a weighted sum of the response times. These control parameters are then passed on to the scheduler and become active immediately. Figure 9 shows a sample control panel for the live system. The top two plots in Figure 9 provide the response time and request volume information for each class of request. The lower left plot shows the sharing proportions among the request classes for the system resources.
The goal is for the system to minimize a weighted sum of the response times; this objective function over time is plotted in the lower right plot. The delay penalty weights, which are used to distinguish the importance of different requests, can be specified and changed on screen. The automatic control actions can likewise be activated or deactivated on screen. An improvement in the overall system performance has been observed in our system testbed as a result of the control actions.
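The weight-setting problem can be sketched by approximating each request class as an M/M/1 queue whose capacity scales with its CPU share, and grid-searching the weights for the smallest weighted delay. The delay model and all parameters are illustrative assumptions, not the control module's actual algorithm.

```python
from itertools import product

def weighted_delay(weights, lams, mus, costs):
    # approximate each class as M/M/1 with capacity scaled by its CPU share,
    # and sum the cost-weighted delays (delay x arrival rate x penalty)
    total = 0.0
    for w, lam, mu, c in zip(weights, lams, mus, costs):
        if w * mu <= lam:
            return float("inf")            # this class would be unstable
        total += c * lam / (w * mu - lam)
    return total

def best_weights(lams, mus, costs, step=0.05):
    # exhaustive grid search over shares that sum to one
    grid = [i * step for i in range(1, int(round(1 / step)))]
    best, best_w = float("inf"), None
    for w in product(grid, repeat=len(lams) - 1):
        last = 1.0 - sum(w)
        if last <= 0:
            continue
        ws = list(w) + [last]
        d = weighted_delay(ws, lams, mus, costs)
        if d < best:
            best, best_w = d, ws
    return best_w, best

# hypothetical classes: static images, CGI scripts, database queries
ws, obj = best_weights(lams=[2.0, 1.0, 0.5], mus=[20.0, 4.0, 3.0],
                       costs=[1.0, 3.0, 2.0])
equal = weighted_delay([1/3, 1/3, 1/3], [2.0, 1.0, 0.5],
                       [20.0, 4.0, 3.0], [1.0, 3.0, 2.0])
print(ws, obj < equal)
```

The search gives the expensive, heavily loaded CGI class a larger CPU share than an equal split would, lowering the weighted objective; an on-line controller would repeat such an optimization as the measured arrival rates change.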
6 Summary
We have presented a comprehensive toolset for workload characterization, performance modeling and analysis, and on-line control, based on advanced techniques in statistical analysis, queueing theory, and dynamic scheduling and control methodologies. Workload characterization provides on-line monitoring and analysis of requests to the system. It also provides agents to collect and display system usage measurements. This module conducts profiling, traffic analysis, and prediction for the incoming workload. The workload models can be input to the system and application modeling module; they are also used to generate realistic benchmarks for testing purposes. The system and application modeling and analysis module builds multi-class queueing network models for the request serving process. It provides a flexible way to build models of the general system architecture, specify the service components, and quantify the speed and overhead for each type of request at each service component. Simulation
Fig. 9. The Control Panel
and analytical solutions are used to analyze the given system and provide valuable throughput, response time, and bottleneck analyses. The on-line optimal control component provides efficient algorithms to dynamically adjust resource control policies on-line, based on recently observed arrival processes, the system and application models, and the control objective function. Built on top of a flexible architecture, with a rich set of functional components working in coordination, this toolset provides significant value for key business processes, including capacity planning, performance prediction, performance engineering, and on-line control of system resources.
References
1. J. P. Bigus and J. Bigus. Constructing Intelligent Agents with Java: A Programmer's Guide to Smarter Applications. John Wiley & Sons, Book and CD-ROM edition, December 1997. ISBN 0471191353.
2. J. P. Bigus, D. Schlosnagle, et al. Agent Building and Learning Environment. http://www.alphaworks.ibm.com/tech/able/
3. A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. Internetworking: Research and Experience, Vol. 1, 1990.
4. G. Fayolle, I. Mitrani, and R. Iasnogorodski. Sharing a Processor Among Many Job Classes. J. ACM, Vol. 27, No. 3, 1980.
5. L. L. Fong, M. H. Kalantar, D. P. Pazel, G. Goldszmidt, K. Appleby, T. Eilam, S. A. Fakhouri, S. M. Krishnakumar, S. Miller, and J. A. Pershing. Dynamic Resource Management in an eUtility. In Proceedings of NOMS 2002, IEEE/IFIP Network Operations and Management Symposium, Piscataway, NJ, pp. 727-740, April 2002.
6. M. Hale, et al. JSci - A science API for Java. http://jsci.sourceforge.net/
7. G. Hunt, G. Goldszmidt, R. King, and R. Mukherjee. Network dispatcher: A connection router for scalable internet services. In Proceedings of the 7th International World Wide Web Conference, April 1998.
8. A. K. Iyengar, M. S. Squillante, and L. Zhang. Analysis and characterization of large-scale web server access patterns and performance. World Wide Web, 2, June 1999.
9. L. Kleinrock. Queueing Systems, Volume II: Computer Applications. John Wiley and Sons, 1976.
10. Z. Liu, N. Niclausse, and C. Jalpa-Villanueva. Web traffic modeling and performance comparison between HTTP 1.0 and HTTP 1.1. In E. Gelenbe, editor, Systems Performance Evaluation: Methodologies and Applications, pages 177-189. CRC Press, 2000.
11. Z. Liu, M. S. Squillante, C. H. Xia, and L. Zhang. Preliminary analysis of various SurfAid customers. Technical report, IBM Research Division, July 2000. Revised, December 2000.
12. Z. Liu, M. S. Squillante, C. H. Xia, S. Yu, and L. Zhang. Web Traffic Profiling, Clustering and Classification for Commercial Web Sites. In The 10th International Conference on Telecommunication Systems, Modeling and Analysis (ICTSM10), 2002.
13. D. A. Menasce and V. A. F. Almeida. Capacity Planning for Web Performance: Metrics, Models, and Methods. Prentice Hall, 1998.
14. A. K. Parekh and R. G. Gallager.
A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case. IEEE/ACM Transactions on Networking, Vol. 1, No. 3, 1993.
15. M. S. Squillante, B. Woo, and L. Zhang. Analysis of queues with dependent arrival processes and general service processes. Technical report, IBM Research Division, 2000.
16. M. S. Squillante, D. D. Yao, and L. Zhang. Web traffic modeling and web server performance analysis. In Proceedings of the IEEE Conference on Decision and Control, December 1999.
17. M. S. Squillante, D. D. Yao, and L. Zhang. Internet traffic: Periodicity, tail behavior and performance implications. In E. Gelenbe, editor, Systems Performance Evaluation: Methodologies and Applications. CRC Press, 2000.
18. Sun Microsystems, Inc. Java 2 Platform, Standard Edition (J2SE). http://java.sun.com/j2se/
19. R. W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice Hall, 1989.
20. L. Zhang, C. H. Xia, M. S. Squillante, and W. N. Mills III. Workload service requirements analysis: A queueing network optimization approach. In Tenth IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2002.
Logical and Stochastic Modeling with SmArT
G. Ciardo (1), R.L. Jones (2), A.S. Miner (3), and R. Siminiceanu (1)
(1) Department of Computer Science, College of William and Mary
(2) ASRC Aerospace Corporation
(3) Department of Computer Science, Iowa State University
Abstract. We describe the main features of SmArT, a software package providing a seamless environment for the logic and probabilistic analysis of complex systems. SmArT can combine different formalisms in the same modeling study. For the analysis of logical behavior, both explicit and symbolic state-space generation techniques, as well as symbolic CTL model-checking algorithms, are available. For the study of stochastic and timing behavior, both sparse-storage and Kronecker numerical solution approaches are available when the underlying process is a Markov chain. In addition, discrete-event simulation is always applicable regardless of the stochastic nature of the process, but certain classes of non-Markov models can still be solved numerically. Finally, since SmArT targets both the classroom and realistic industrial settings as a learning, research, and application tool, it is written in a modular way that allows for easy integration of new formalisms and solution algorithms.
1 Introduction
Complex discrete-state systems such as computer and communication networks, distributed software, and factory assembly lines are increasingly engineered and placed in service in environments where their correct logical and timing behavior is essential. Thus, both their verification and their performance and reliability analysis are important tasks in the design and dimensioning of such systems. Logical verification is usually concerned with the absence of design errors such as deadlocks and similar "untimed" and "non-stochastic" properties, while both performance and reliability analysis are usually concerned with timing behavior in a stochastic setting. The two are rarely considered at the same time. Individually, both model checking tools, such as NuSMV [14], and performability (i.e., combined performance and reliability) tools, such as UltraSAN [16] and Möbius [17], can be very useful to discover potential design flaws and bottlenecks in a system. Recently, however, there has been a clear trend toward combining these two aspects into a single unifying framework, as done in PRISM [22] (it is worth mentioning that there are also tools for verification of real-time systems,
This work was partially supported by the National Aeronautics and Space Administration under NASA Contracts NAG-1-2168, NAG-1-02095, and NAS-1-99124; by the National Science Foundation under grants CCR-0219745 and ACI-0203971; by a joint STTR project with Genoa Software Systems, Inc., for the Army Research Office; and by the Virginia Center for Innovative Technology under grant FED-95-011.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 78–97, 2003. c Springer-Verlag Berlin Heidelberg 2003
such as KRONOS [29] and UPPAAL [2]; however, while these tools combine logic and timing, they do not target probabilistic aspects). We believe that there are two main reasons for this integration trend. First and foremost, in many systems it might be necessary to discuss correctness in light of timing behavior (e.g., it must not be possible for events a and b to occur within x time units of each other), and to accept a probabilistic, rather than an absolute, view of correctness (e.g., the probability of non-termination due to entering a deadlock should be less than 10^-9). A second reason is that the data structures and techniques required to carry out either type of analysis have much in common, and can help each other. It is on this second aspect that we focus our presentation. We introduce the tool SmArT, whose development began in 1995. Designed as a powerful stochastic environment that integrates multiple modeling formalisms for use both in the classroom and in industrial applications, SmArT has evolved to include logical analysis as well. Currently, it employs some of the most efficient data structures and algorithms known for the analysis of discrete-state systems.
2 Overview of SmArT
At the heart of SmArT lies the ability to define parametric models for which a variety of measures can be computed. A modeler may use multiple models to study different aspects of a system. Indeed, each model can be expressed in the most appropriate formalism, although, currently, the only high-level formalism is Petri nets [26], with a software-oriented formalism being planned, while the other available formalisms, discrete-time and continuous-time Markov chains (DTMCs and CTMCs), are rather low-level. Models can interact by exchanging data: a measure computed in a model can be an input parameter for another model. For logical analysis, SmArT can generate the (reachable) state space of a model using highly optimized algorithms ranging from explicit ones based on “flat” hashing or search trees or “structured” trees of trees [10], to symbolic ones based on multi-valued decision diagrams (MDDs) [21,24]. A particularly innovative aspect in SmArT is the symbolic encoding of the transition relation N , which specifies which states can be reached from each state in one step. Unlike traditional symbolic approaches where a “double-height” MDD encodes the “from” and “to” states in the relation, SmArT uses either a boolean sum of Kronecker products of (small) boolean matrices [3], or a matrix diagram [11,23]. For mostly asynchronous systems, not only is this encoding enormously more efficient in terms of memory, but it also allows us to exploit the inherent event locality to greatly improve the runtimes as well [7,8,9]. For stochastic timing analysis, both numerical solutions and simulation are available. The former are feasible if the underlying process is a DTMC, CTMC, or semi-regenerative process satisfying certain conditions [19]. 
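The Kronecker-product building block used in the encoding of the transition relation described above can be sketched for boolean matrices. This toy version ignores SmArT's actual data structures; it only shows how the local effect of a synchronizing event on each submodel composes into a relation over the global state space.

```python
def bool_kron(a, b):
    # Kronecker product of two boolean matrices:
    # (A (x) B)[i][j] = A[i // rb][j // cb] and B[i % rb][j % cb]
    ra, rb = len(a), len(b)
    ca, cb = len(a[0]), len(b[0])
    return [[a[i // rb][j // cb] and b[i % rb][j % cb]
             for j in range(ca * cb)]
            for i in range(ra * rb)]

# local effect of one synchronizing event on two 2-state submodels:
n1 = [[False, True], [False, False]]   # submodel 1 moves 0 -> 1
n2 = [[False, False], [True, False]]   # submodel 2 moves 1 -> 0
n = bool_kron(n1, n2)                  # 4x4 relation on the composed space
print(n[1][2])                         # global state (0,1) -> (1,0)
```

The two small 2x2 matrices (8 bits of storage) describe a 4x4 global relation; for K mostly asynchronous submodels the same idea stores K small matrices per event instead of one matrix over the product space, which is the memory advantage noted above.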
For DTMCs and CTMCs, an explicit solution approach requiring O(η(P)) or O(η(Q)) memory is available, where η(·) is the number of nonzero entries in the argument and P and Q are the transition probability or infinitesimal generator matrices, respectively.
In the semi-regenerative case, an embedded process, a DTMC, must be recognized and built. The expected holding times in each state and the state-to-state transition probabilities in this process are obtained by solving a (usually large) set of subordinate processes: currently, SmArT can compute them if these subordinate processes are CTMCs, but, in the future, we plan to extend the approach to more general cases, even to models where the solution of some of the subordinate processes may require the use of discrete-event simulation. Recognizing the embedded, or regeneration, instants is a fundamental capability for this approach. In SmArT, the same capability is now exploited to recognize such instants when regenerative simulation is being used as the overall solution algorithm. Being able to classify the type of stochastic process underlying a given model is of course highly desirable [6]. If the distributions appearing in a model are all geometric or all exponential, one can immediately conclude that the underlying process is a DTMC or a CTMC, respectively (the converse is not true). Semi-regenerative processes, however, are much harder to recognize a priori: if we allow for the possibility of unbounded models and general distributions, the question is indeed undecidable, since it is equivalent to deciding whether two events in the model can ever be enabled concurrently and asynchronously. Thus, in SmArT we have focused instead on efficient on-line classification of the underlying stochastic process during its generation. If the model is not clearly Markov, SmArT attempts to generate its embedded and subordinate processes in the hope that it falls in the class of semi-regenerative processes it can solve. Currently, SmArT uses an explicit data structure to store the embedded state space it generates during this attempt.
However, we plan to devise an approach that uses symbolic data structures analogous to the ones employed for logical analysis, except that the notion of state needs to be extended to incorporate relative ordering information for the enabled events. In summary, the interface of SmArT presents the user with a unified syntax to specify a wide variety of systems and ask questions of their models. Internally, the numerous logical, numerical, and simulation solution algorithms are tightly integrated, helping each other by sharing important data structures that are efficiently implemented, and, most importantly, are available regardless of the modeling formalism employed by the user.
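The easy half of this classification (all exponential implies CTMC, all geometric implies DTMC, otherwise fall back to deeper analysis) can be sketched as a trivial check over the distributions attached to a model's timed events; the distribution labels here are hypothetical, and the hard semi-regenerative case is exactly what this check cannot decide.

```python
def classify_model(distributions):
    # cheap a-priori classification from the timed events' distribution kinds
    kinds = set(distributions)
    if kinds == {"expo"}:
        return "CTMC"
    if kinds == {"geom"}:
        return "DTMC"
    return "semi-regenerative analysis needed"

print(classify_model(["expo", "expo", "expo"]))
print(classify_model(["expo", "det"]))
```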
3 SmArT Language
SmArT uses a strongly-typed computation-on-demand language with five types of basic statements and two compound statements, which can be arbitrarily nested:
Declaration statements are used to declare functions over some set of arguments. As a special case, the set of arguments for a function can be empty, thus the function is constant. In this sense, however, "constant" should not be confused with "non-random" (i.e., deterministic). To ensure strict typechecking, the type of the function and of its arguments must be defined.
Definition statements are used to declare functions in the same way declaration statements do, but they also specify how to compute their value.
Logical and Stochastic Modeling with SmArT
81
Model statements are used to define models. Like functions, they have arguments, but instead of returning a value they specify a block with declarations, specifications, and measures, the latter being accessible outside the model.
Expression statements are used to print values, although function calls appearing in the expression may have side-effects, such as redirecting the output or displaying additional information.
Option statements are used to modify the behavior of SmArT. For example, there are options to control the numerical solution algorithms (such as the precision or the maximum number of iterations), the verbosity level, etc. Option statements appear on a single line beginning with "#".
Compound for statements define arrays or repeatedly evaluate parametric expressions. This is particularly useful for studies that explore how a result is affected by a change in the modeling assumptions, such as the rate of an event or the maximum size of a buffer.
Compound converge statements specify fixed-point iterations such as those employed to carry out approximate performance or reliability studies. They cannot appear within the declaration of a model.
The following basic predefined types are available in SmArT:
bool: the values true or false. bool c := 3 - 2 > 0;
int: integers (machine-dependent). int i := -12;
bigint: arbitrary-size integers. bigint i := 12345678901234567890 * 2;
real: floating-point values (machine-dependent). real x := sqrt(2.3);
string: character-array values. string s := "Monday";
In addition, composite types can be defined using the concepts of:
aggregate: analogous to the Pascal "record" or C "struct". p:t:3
set: collection of homogeneous objects. {1..8,10,25,50}
array: homogeneous objects indexed by set elements, see Sect. 3.2.
A type can be further modified by the following natures, which describe stochastic characteristics:
const: (the default) a non-stochastic quantity.
ph:   a random variable with a discrete or continuous phase-type distribution.
rand: a random variable with an arbitrary distribution.
proc: a random variable that depends on the state of a model at a given time.

Finally, predefined formalism types can be used to define stochastic processes evolving over time, see Sect. 3.5.

3.1 Function Declarations
Objects in SmArT are functions, possibly recursive, that can be overloaded:

real pi := 3.14; // an argument-less function
bool close(real a, real b) := abs(a-b) < 0.00001;
int  pow(int b, int e)  := cond(e==1, b, b*pow(b,e-1));
real pow(real b, int e) := cond(e==1, b, b*pow(b,e-1));
pow(5,3); pow(0.5,3); // prints 125, integer, and 0.125, real
G. Ciardo et al.

3.2 Arrays
Arrays are declared using a for statement; their dimensionality is determined by the enclosing iterators. Since the indices along each dimension belong to a finite set, we can define arrays with real indices. For example,

for (int i in {1..5}, real r in {1..i..0.5}) {
  real res[i][r] := MyModel(i,r).out1;
}
fills array res with the value of measure out1 for the parametric model MyModel, when the first input parameter, i, ranges from one to five and the second one, r, ranges from one to the value of the first parameter, with a step of 1/2. Note that res is therefore not a "rectangular" array of values.

3.3 Fixed-Point Iterations
The approximate solution of a model is often based on a heuristic decomposition, where (sub)models are solved in a fixed-point iteration [13]. This can be specified with the converge statement:

converge {
  real x guess 1.0; // initial guess for fixed-point iteration
  real y guess 2.0;
  real y := ModelA(x, y).measureY;
  real x := ModelB(x, y).measureX;
}
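As a rough analogue, the converge loop above can be sketched in a general-purpose language; in this illustrative Python sketch, model_a and model_b are hypothetical stand-ins for the submodel solutions, and the tolerance and end-of-iteration update policy mirror the options SmArT exposes.

```python
# Minimal sketch of a converge-style fixed-point loop (illustrative only;
# model_a / model_b are hypothetical stand-ins for the submodel solutions).
def converge(model_a, model_b, x0=1.0, y0=2.0, eps=1e-6, max_iter=1000):
    x, y = x0, y0                      # initial guesses
    for _ in range(max_iter):
        new_y = model_a(x, y)          # updates applied at end of iteration,
        new_x = model_b(x, y)          # so both calls see the old values
        if abs(new_x - x) < eps and abs(new_y - y) < eps:
            return new_x, new_y
        x, y = new_x, new_y
    return x, y

# A contraction whose fixed point is x = y = 0:
x, y = converge(lambda x, y: 0.5 * x, lambda x, y: 0.5 * y)
```

Updating both variables only at the end of each pass corresponds to one of the two updating criteria mentioned below; updating immediately would amount to a Gauss-Seidel-style sweep.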
The converge statement iterates until the values of the variables declared in it differ by less than ε, in either relative or absolute terms, from one iteration to the next. Note that the values of these variables, x and y in our example, are updated either immediately or at the end of each iteration. The user can control both ε and the updating criterion using the appropriate option statements.

3.4 Random Variables
SmArT can manipulate discrete and continuous phase-type distributions, of which the exponential (expo), geometric (geom), and non-negative integer constants are special cases. Combining ph types produces another ph type if phase-type distributions are closed under that operation:

ph int  X := 2*geom(0.1);
ph int  Y := equilikely(1,3); // Y = 1, 2, or 3 with probability 1/3
ph int  A := min(X,Y);
ph int  B := 3*X+Y;
ph int  C := choose(0.4:X,0.6:4*Y);
ph real x := expo(3.2);
ph real y := erlang(4,5);
ph real a := min(3*x,y);
Fig. 1. The internal phase-type representation in SmArT: the absorbing DTMCs for X := 2*geom(0.1), Y := equilikely(1,3), and A := min(X,Y).
Internally, SmArT uses an absorbing DTMC or CTMC to represent a ph int or a ph real, respectively. These representations are actually built only if needed for a numerical computation. For example, if the expression avg(A) needs to be evaluated, SmArT builds the internal representations of the DTMCs corresponding to X, Y, and A shown in Fig. 1, then computes the mean time to absorption in the third DTMC, 1.96667. The number of states in the representation of a phase-type distribution obtained through operators such as max and min can grow very rapidly. Mixing ph int and ph real, or performing operations not guaranteed to result in a phase-type distribution, forces SmArT to consider the resulting type as generally distributed (such general rand random variables can be manipulated only via Monte Carlo methods, currently under development):

rand int  D := X-Y;
rand int  F := X*Y;
rand real E := x+X;
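The quoted mean avg(A) = 1.96667 can be checked by elementary probability. This verification sketch assumes geom(0.1) counts trials until success, so that X = 2*geom(0.1) has support {2, 4, ...} with P(X = 2) = 0.1; since Y never exceeds 3, only the event X = 2 versus X >= 4 matters.

```python
from fractions import Fraction

# Direct check of E[min(X, Y)] for X = 2*geom(0.1) and Y = equilikely(1, 3),
# under the assumption P(X = 2) = 1/10 and X >= 2 always.
p2 = Fraction(1, 10)                 # P(X = 2)
e = Fraction(0)
for y in (1, 2, 3):
    if y <= 2:
        m = Fraction(y)              # X >= 2, so min(X, y) = y
    else:                            # y = 3: min is 2 if X = 2, else 3
        m = 2 * p2 + 3 * (1 - p2)
    e += Fraction(1, 3) * m

print(float(e))                      # 59/30 = 1.9666..., matching avg(A)
```

The exact value is 59/30, which rounds to the 1.96667 reported by SmArT's mean-time-to-absorption computation.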
3.5 Modeling Formalisms
Components of a model are declared using formalism-specific types (e.g., the places of a Petri net). The model structure is specified using formalism-specific functions (e.g., to set the initial number of tokens in a place of a Petri net). Measures are user-defined functions that specify some constant quantity of interest (e.g., the expected number of tokens in a given place in steady-state), and are the only components accessible outside of the model definition block. The design of SmArT allows for relatively easy addition of new model formalisms. Currently, the dtmc, ctmc, and spn formalisms are implemented. For the spn formalism, the type of the underlying stochastic process is determined by the distributions specified for the transitions. For example, the model shown on the right of Fig. 2 is defined by the spn Net on the left of the same figure, where place and trans are formalism-specific types; arcs, firing, and init are formalism-specific functions; measures such as n_s and speed are accessed outside the model using "." notation. Since all firing time distributions are specified as expo, the underlying process is automatically recognized to be a CTMC. The statements after the model produce the output
n=1:  5 states,   8 arcs, throughput = 0.292547
n=2: 14 states,  34 arcs, throughput = 0.456948
n=3: 30 states,  88 arcs, throughput = 0.553456
n=4: 55 states, 180 arcs, throughput = 0.612828
n=1: E[tk(p5,p4,p3,p2,p1)] = (0.265952,0.469362,0.264686,0.208963,0.525085)
where the measure calls cause SmArT to perform the appropriate analysis, steady-state solution of the underlying CTMC in this case. The five-state CTMC for n=1 is shown at the bottom of the same figure: states are identified by the number of tokens in places p5, p4, p3, p2, and p1, in order, and arcs are labeled with the rate of the transition responsible for each marking change. For example, the steady-state probability that p5 contains a token is the same as the probability of occupying state (10000) in the CTMC, 0.265952; multiplying it by the rate of transition a gives 0.2925472, the throughput of a, as shown in the print output.
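The last step is a one-line arithmetic check of the numbers quoted above:

```python
# Throughput of a = P[state (10000)] * rate of transition a, as in the text.
prob_10000 = 0.265952
rate_a = 1.1
throughput = prob_10000 * rate_a
print(round(throughput, 7))   # 0.2925472
```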
spn Net(int n) := {
  place p5, p4, p3, p2, p1;
  trans a, b, c, d, e;
  arcs(p5:a, a:p4, a:p2, p4:c, c:p3, p3:b,
       b:p4, p2:d, d:p1, p1:e, p3:e, e:p5);
  firing(a:expo(1.1), b:expo(1.2),
         c:expo(1.3), d:expo(1.4), e:expo(1.5));
  init(p5:n);
  bigint n_s := num_states(false);
  bigint n_a := num_arcs(false);
  real speed := avg_ss(rate(a));
  real e1 := avg_ss(tk(p1));
  real e2 := avg_ss(tk(p2));
  real e3 := avg_ss(tk(p3));
  real e4 := avg_ss(tk(p4));
  real e5 := avg_ss(tk(p5));
};
for (int n in {1..4}) {
  print("n=",n,": ",Net(n).n_s:2," states, ",
        Net(n).n_a:3," arcs, throughput = ",Net(n).speed,"\n");
}
print("n=1: E[tk(p5,p4,p3,p2,p1)] = (", Net(1).e5,",", Net(1).e4,",",
      Net(1).e3,",",Net(1).e2,",", Net(1).e1,")\n");
Fig. 2. SmArT input file for the Petri net on the right, and underlying CTMC (for n=1).
4 Advanced Features
In addition to the features already illustrated in the previous section, SmArT possesses several more that make it truly unique and powerful, while relieving the user from having to know about details of the internal solution algorithms. One important feature of model definition is the ability to define arrays of objects in a model. For example, consider a local-area network consisting of n workstations and one bus-like communication network, where each workstation i executes locally for an average amount of time 1/λ, then it attempts to access the network. If the network is not in use, workstation i is granted access and sends data on the network for an average amount of time 1/µ. Otherwise, if some other workstation j is already using the network, i goes into a “backoff” state for an average amount of time 1/δ, then it attempts again to access the network. If the network is again busy (either because j is still using it, or because some other workstation k has gained access to it in the meantime), i waits for another backoff period, and so on, until it is finally granted access to it. Assuming that all times are exponentially distributed, we can model this system as a CTMC. Since n is unknown, our model is parametric in the number of CTMC states; this is achieved by declaring arrays of places, as shown in Fig. 3, where we compute the steady-state expectation of the number of workstations in the backoff state and the probability that no workstation is in the backoff state for µ = 1, δ = 0.1, λ = 0.04, and n = 2, 3, . . . , 20.
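Before turning to the SmArT model, the same chain can be built and solved explicitly in a few lines. This is an illustrative sketch with a textbook dense linear solve, not SmArT's solver; the integer state encoding (2i for idle[i], 2i+1 for busy[i]) is our own choice.

```python
# Explicit construction and steady-state solution of the backoff CTMC
# described above (illustrative; the state encoding is ours).
def lan_steady_state(n, lam, mu, delta):
    m = 2 * n                                 # states: 2i = idle[i], 2i+1 = busy[i]
    Q = [[0.0] * m for _ in range(m)]
    def arc(a, b, r):                         # add rate r from state a to state b
        Q[a][b] += r
        Q[a][a] -= r
    for i in range(n):
        arc(2 * i, 2 * i + 1, (n - i) * lam)          # idle[i] -> busy[i]
        arc(2 * i + 1, 2 * i, mu)                     # busy[i] -> idle[i]
    for i in range(1, n):
        arc(2 * i, 2 * (i - 1) + 1, i * delta)        # idle[i] -> busy[i-1]
        arc(2 * (i - 1) + 1, 2 * i + 1, (n - i) * lam)  # busy[i-1] -> busy[i]
    # Solve pi Q = 0 with sum(pi) = 1: transpose Q, replace one balance
    # equation by the normalization, then use Gaussian elimination.
    A = [[Q[j][i] for j in range(m)] for i in range(m)]
    A[-1] = [1.0] * m
    b = [0.0] * m
    b[-1] = 1.0
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, m):
            f = A[r][c] / A[c][c]
            for k in range(c, m):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    pi = [0.0] * m
    for r in range(m - 1, -1, -1):
        pi[r] = (b[r] - sum(A[r][k] * pi[k] for k in range(r + 1, m))) / A[r][r]
    return pi

pi = lan_steady_state(2, 0.04, 1.0, 0.1)
p_no_backoff = pi[0] + pi[1]   # probability mass of idle[0] and busy[0]
```

For the n in the thousands, of course, building the generator explicitly is exactly what the parametric SmArT model of Fig. 3 avoids.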
4.1 Efficiency Considerations
One defining aspect of the SmArT language is its computation-on-demand operational semantics. An input file is parsed and type-checked as it is read, but no computation takes place until it is required to produce output. Thus, for example, an spn model M may define transient measures tm1 and tm2 and steady-state measures sm1 and sm2, but none of them will be computed when the model is parsed. An expression statement "M.sm1;" causes the spn to be solved in steady-state and the value of M.sm1 to be computed and printed. However, a subsequent request to print M.sm2 should not require the steady-state solution of M again. Thus, SmArT employs a set of heuristics to decide an appropriate grouping of measures. In our example, asking for either M.sm1 or M.sm2 would cause the computation of both, but not that of the transient quantities tm1 and tm2. A similar issue arises when a model is solved repeatedly with different parameters. For example, the spn Net in Fig. 2 requires a new generation of the state space and of the underlying CTMC if repeatedly called with a different value of the parameter n. However, if it also had a parameter lambda for the rate of transition a, only the underlying CTMC, but not the state space, would have to be recomputed. Thus, SmArT remembers the last set of parameters and measures a model has been solved for, and recomputes only what is needed when the same model is exercised again.
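The caching effect can be imitated with a small cache keyed by the analysis performed. This is a toy sketch of the idea, not SmArT's internals; Model, sm1, and sm2 are hypothetical names.

```python
# Toy sketch of solve-on-demand with measure grouping: asking for any
# steady-state measure triggers one solution; later requests reuse it.
class Model:
    def __init__(self, n):
        self.n = n
        self.solves = 0
        self._cache = {}                  # analysis kind -> measure values

    def _solve(self, kind):
        if kind not in self._cache:
            self.solves += 1              # stands in for an expensive solution
            self._cache[kind] = {"sm1": 0.1 * self.n, "sm2": 0.2 * self.n}
        return self._cache[kind]

    def measure(self, name, kind="steady-state"):
        return self._solve(kind)[name]

m = Model(3)
m.measure("sm1")                          # first query: solves the model
m.measure("sm2")                          # reuses the cached solution
print(m.solves)                           # 1
```

A cache keyed by (parameters, analysis) likewise captures the "recompute only what changed" behavior for repeated parametric calls.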
ctmc Lan(int n, real lambda, real mu, real delta) := {
  for (int i in {0..n-1}) {
    state idle[i], busy[i]; // "i" counts the workstations in backoff
  }
  init(idle[0]:1.0);
  for (int i in {0..n-1}) {
    arcs(idle[i]:busy[i]:(n-i)*lambda, busy[i]:idle[i]:mu);
  }
  for (int i in {1..n-1}) {
    arcs(idle[i]:busy[i-1]:i*delta, busy[i-1]:busy[i]:(n-i)*lambda);
    real aux[i] := prob_ss(in_state(idle[i])|in_state(busy[i]));
    real a[i] := cond(i>1, a[i-1]+i*aux[i], aux[1]);
  }
  real p := prob_ss(in_state(idle[0])|in_state(busy[0]));
};
real mu := 1.0;
real delta := 0.1;
real lambda := 0.04;
for (int n in {2..20}) {
  print("Prob[no backoff|n=",n,"]=",Lan(n,lambda,mu,delta).p,"\n");
  print("Avg[backoff|n=",n,"]=",Lan(n,lambda,mu,delta).a[n-1],"\n");
}

Fig. 3. A CTMC model with a parametric number of states.
4.2 State-Space Generation and Storage
The generation and storage of the state space of a model is a key component of any state-space-based solution technique, and it is an integral part of model checking. SmArT implements a wide variety of techniques for constructing state spaces. There are explicit techniques that store each state individually and implicit techniques that employ multi-valued decision diagrams (MDDs) to store sets of states symbolically. This choice is governed by the #StateStorage option. Explicit algorithms, based on AVL trees, splay trees, or hash tables (option values AVL, SPLAY, HASHING), impose no restrictions on the model, but they require time and memory at least linear in the number of reachable states. Symbolic algorithms, instead, require a partition of the model in order to exploit the system structure, but are normally much more efficient. A partition of the model into K submodels implies that its (global) state can be written as the concatenation of K (local) states. In particular, the partition is Kronecker-consistent if the global model behavior can be expressed as a functional product of local behaviors for each submodel. For example, from a logical point of view, an event in the model is (globally) enabled iff it is (locally) enabled in each of the K submodels while, from a stochastic point of view, its (global) rate must be obtainable by multiplying some reference rate with the
Fig. 4. An MDD example for a system partitioned into K = 4 components.
product of K dimensionless values, the kth one depending only on the state of the kth submodel. SmArT automatically checks whether a partition specified by the user is Kronecker-consistent prior to attempting an analysis method that requires it (this check can be performed quite efficiently). For Petri nets, a partition can be specified by directly assigning class indices (contiguous, strictly positive numbers) to places:

partition(2:p); partition(1:r); partition(1:t, 2:q, 1:s);

or by simply enumerating (without index information) the places in each class:

partition(p:q, r:s:t);

In the latter case, the class indices are assigned automatically, in decreasing parsing order, thus the two examples shown have the same effect. A common characteristic of symbolic approaches in SmArT is the use of MDDs to encode sets of states. MDDs are a natural extension of the classic binary decision diagrams [4] to boolean functions of K multi-valued variables, f : {0,...,n_K - 1} x ... x {0,...,n_1 - 1} -> {0,1}. In particular, we employ a quasi-reduced version of MDDs, where arcs can only connect nodes in adjacent levels. Given a partition of a model into K submodels, the local state space of the kth submodel, S_k, can be generated with explicit methods. The resulting local states are indexed {0,...,n_k - 1}, for a finite value of n_k = |S_k|, hence a set of global states can be encoded as an MDD over the domain S_K x ... x S_1, called the potential state space. An example of an MDD encoding is shown in Fig. 4 (nodes labeled with 0 encode the empty set and are not explicitly stored). MDD-based algorithms are further distinguished by the iteration strategy employed for state-space exploration. One can choose a variant of breadth-first search, MDD_LOCAL_PREGEN, locality-driven heuristics, MDD_FORWARD_PREGEN and MDD_UPSTREAM_PREGEN [7], or a saturation approach, MDD_SATURATION and MDD_SATURATION_PREGEN [8]. A unique feature of all symbolic approaches in SmArT is the use of a Kronecker encoding of the transition relation between states. Unlike the BDD encoding of traditional symbolic techniques for binary relations, our Kronecker representation facilitates the detection and exploitation of event locality, a property inherently present in all asynchronous systems. The most efficient iteration strategy in SmArT is saturation, which exhaustively fires, in an MDD node at level k, all enabled events that affect only levels k and below, until the set encoded by the node reaches a fixed point.
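A toy quasi-reduced MDD over K levels can be sketched with plain tuples, where equal subtrees compare equal and are shared by construction. This is illustrative only; SmArT's nodes additionally support the Kronecker next-state encoding and the saturation iteration.

```python
# Toy quasi-reduced MDD over K variables: a node at level k is a tuple of
# children, one per local state; string terminals encode empty set / accept.
ZERO, ONE = "0", "1"

def singleton(tup, sizes):
    """MDD encoding the set {tup}; sizes[k] bounds variable k (top first)."""
    if not tup:
        return ONE
    children = [ZERO] * sizes[0]
    children[tup[0]] = singleton(tup[1:], sizes[1:])
    return tuple(children)

def union(a, b):
    if a == ZERO: return b
    if b == ZERO: return a
    if a == ONE or b == ONE: return ONE   # terminals only meet at level 0
    return tuple(union(x, y) for x, y in zip(a, b))

def contains(node, tup):
    for v in tup:
        if node == ZERO:
            return False
        node = node[v]
    return node == ONE

def card(node):                           # number of encoded states
    if node == ZERO: return 0
    if node == ONE: return 1
    return sum(card(c) for c in node)

sizes = (2, 3)                            # K = 2 levels, n_2 = 2, n_1 = 3
s = union(singleton((0, 1), sizes), singleton((1, 2), sizes))
print(card(s))                            # 2
```

Because every path visits every level, the structure is quasi-reduced in the sense used above; a hash-consed node table would make sharing explicit rather than relying on structural equality.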
Subsequently, all the nodes below are also saturated, which means that the sub-MDD need not be explored again, resulting in enormous reductions in computation
time. Since only saturated nodes are stored, there is also a reduction in the peak storage requirements: often, the peak and final numbers of nodes are almost the same. This improves upon traditional approaches, in which the number of nodes can explode during generation, before "contracting" to the final representation. There are two types of symbolic state-space generation approaches in SmArT: the "_PREGEN" ones, which explicitly generate the local state spaces in isolation a priori, and the rest, which intertwine (explicit) local and (symbolic) global state-space generation. The former are slightly more efficient, but can be applied only when the submodels are "well-behaved" in isolation. The latter, instead, free the user from worrying about the behavior of submodels in isolation, and can indeed be applied whenever the overall state space is finite [9]. Several predefined functions in SmArT can be used to define measures that refer to the state space. For example, num_states returns the size of the state space and num_arcs returns the number of arcs in the reachability graph (both, optionally, can print the states, or the arcs, as a side effect), while reachable returns the state space itself, for further manipulation (see Section 4.3). Compared to other decision-diagram-based tools, SmArT excels at state-space generation for globally-asynchronous locally-synchronous systems, such as parallel and distributed protocols, where event locality is quite common. Fig. 5 shows the SmArT input file describing the dining philosophers protocol (with deadlock) and a table reporting the runtime and memory consumption for state-space generation. The model can be scaled up to a huge number of philosophers, while requiring under four megabytes of memory and six minutes of runtime.

4.3 CTL Model Checking
Model checking is concerned with verifying temporal logic properties of discrete-state systems evolving in time. SmArT implements the branching-time Computation Tree Logic (CTL), widely used in practice due to its simple yet expressive syntax [15]. In CTL, operators occur in pairs: the path quantifier, either A (on all future paths) or E (there exists a path), is followed by the tense operator, one of X (next), F (future, finally), G (globally, generally), and U (until). CTL model-checking queries are available in SmArT via a set of model-dependent measures with type stateset. Each stateset, a set of states satisfying a given CTL formula, is stored as an MDD, and all MDDs for a model instance are stored in one MDD forest, sharing common nodes for efficiency. All model-checking algorithms in SmArT use symbolic techniques, and thus require a user-defined model partition. Four categories of model-checking functions exist:

Atom builders: nostates returns the empty set; initialstate returns the initial state or states of the model; reachable returns the set of reachable states in the model; potential(e) returns the states of the potential state space S satisfying condition e (S depends on the model partition and the initial state).

Set operators: union(P, Q) returns P ∪ Q; intersection(P, Q) returns P ∩ Q; complement(P) returns S \ P; difference(P, Q) returns P \ Q; includes(P, Q) returns true iff P ⊇ Q; eq(P, Q) returns true iff P = Q;
spn phils(int N) := {
  for (int i in {0..N-1}) {
    place idle[i], waitL[i], waitR[i], hasL[i], hasR[i], fork[i];
    partition(1+div(i,2):idle[i]:waitL[i]:waitR[i]:hasL[i]:hasR[i]:fork[i]);
    init(idle[i]:1, fork[i]:1);
    trans Go[i], GetL[i], GetR[i], Stop[i];
    firing(Go[i]:expo(1), GetL[i]:expo(1), GetR[i]:expo(1), Stop[i]:expo(1));
  }
  for (int i in {0..N-1}) {
    arcs(idle[i]:Go[i], Go[i]:waitL[i], Go[i]:waitR[i],
         waitL[i]:GetL[i], waitR[i]:GetR[i], fork[i]:GetL[i],
         fork[mod(i+1,N)]:GetR[i], GetL[i]:hasL[i], GetR[i]:hasR[i],
         hasL[i]:Stop[i], hasR[i]:Stop[i], Stop[i]:idle[i],
         Stop[i]:fork[i], Stop[i]:fork[mod(i+1,N)]);
  }
  bigint n_s := num_states(false);
};
# StateStorage MDD_SATURATION
print("The model has ", phils(read_int("N")).n_s, " states.\n");

Number of      States             MDD nodes          Memory (bytes)         CPU
philosophers   |S|              Final     Peak     Final       Peak       (secs)
100            4.97 x 10^62       197      246     30,732      38,376      0.04
1,000          9.18 x 10^626    1,997    2,496    311,532     389,376      0.45
10,000         4.26 x 10^6269  19,997   24,496  3,119,532   3,821,376    314.13

Fig. 5. SmArT code and computational requirements for the dining philosophers.
Temporal logic operators: the CTL operators EX(P), AX(P), EF(P), AF(P), EG(P), AG(P), EU(P, Q), AU(P, Q), and their dual counterparts in the past, EXbar(P), AXbar(P), EFbar(P), AFbar(P), EGbar(P), AGbar(P), EUbar(P, Q), AUbar(P, Q).

Execution trace output: EFtrace(R, P) and EGtrace(R, P) print a witness for EF(P) and EG(P), respectively, starting from a state in R; EUtrace(R, P, Q) prints a witness for EU(P, Q) starting from a state in R; dist(P, Q) returns the length of a shortest path from any state in P to any state in Q.

Utility functions: card(P) returns the number of states in P (as a bigint); printset(P) prints the states in P (up to a given maximum).

The example in Fig. 6 shows a way to check for deadlocks in a generic model. The set D contains all reachable states that have no successors: if it is not empty, MyModel.deadlock is true, and evaluating the expression MyModel.dead causes all such deadlock states to be printed. It is also shown how to define a set of states T from which there is an infinite path (i.e., a path with cycles) of states in P. Finally, the expression MyModel.trace can be used to generate a witness for EG(P). The counterexample/witness feature in SmArT allows the user to inspect execution traces for debugging purposes. The computation of traces in SmArT is available for the fixed-point existential-type CTL operators: EF, EU, and EG.
spn MyModel := {
  ...
  stateset R := reachable;
  stateset S := EX(potential(true)); // states with a successor
  stateset D := difference(R,S);     // reachable deadlock states
  stateset P := potential(e_p);
  stateset T := EG(P);               // infinite run of P states
  bool deadlock := neq(D,nostates);
  bool dead := printset(D);
  bool trace := EGtrace(initialstate,P);
}

Fig. 6. CTL queries on a generic model.
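The fixed-point semantics of these operators can be made concrete on a tiny explicit graph. This is an explicit-state teaching sketch; SmArT evaluates the same fixed points symbolically on MDDs.

```python
# Explicit-state fixed points for EF and EG on a small graph; 'edges'
# maps each state to its successors (illustrative, not SMART's algorithm).
def EF(edges, P):
    """Least fixed point: states from which some path reaches P."""
    result = set(P)
    changed = True
    while changed:
        changed = False
        for s, succ in edges.items():
            if s not in result and any(t in result for t in succ):
                result.add(s)
                changed = True
    return result

def EG(edges, P):
    """Greatest fixed point: states with an infinite path staying in P."""
    result = set(P)
    changed = True
    while changed:
        changed = False
        for s in list(result):
            if not any(t in result for t in edges[s]):
                result.remove(s)
                changed = True
    return result

edges = {0: [1], 1: [2], 2: [2], 3: []}
print(EF(edges, {2}))      # {0, 1, 2}
print(EG(edges, {1, 2}))   # {1, 2}: the self-loop on 2 sustains the run
```

Note that, on a finite graph, an infinite path within P must eventually enter a cycle, which is exactly what the greatest fixed point retains; the deadlock check of Fig. 6 corresponds to removing EX(true) states from the reachable set.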
They are subject to several enhancing techniques that make SmArT unique. The computation of EF traces can fully exploit the saturation algorithm, in conjunction with either a specific type of edge-valued decision diagrams (called EV+MDDs [12]), multi-terminal (or algebraic) decision diagrams, or simply forests of MDDs. These alternatives are controlled by the option #TraceStorage. An approach that only partially exploits saturation is available for EU, while EG traces are implemented using traditional breadth-first search (hence they are not as efficient as the other two, but they still benefit from the Kronecker representation of the transition relation and from event locality).

4.4 Markov and Non-Markov Models
SmArT provides standard numerical solutions to Markov models: the power method and uniformization for transient analysis, and iterative methods (Jacobi, Gauss-Seidel, SOR) for steady-state analysis. When all firing times are ph int, the underlying stochastic process is a DTMC, but on an expanded state space that encodes the distribution of the remaining firing time (the phase) of each enabled transition along with the structural state (the marking, for spn models). The resulting potential phase space is the Kronecker product of the state spaces of the absorbing DTMCs describing the firing time distributions. With ph real timing delays, the underlying process is a CTMC and the phase space is the Kronecker sum of the absorbing CTMCs describing the firing time distributions. Mixing ph int and ph real timing delays in a model greatly complicates matters, but an spn model may still enjoy the Markov property if the phase advancements of the ph int transitions are synchronized. With the remaining firing time distributions already encoded in the state, the resulting stochastic process is semi-regenerative. As such, a single embedded DTMC arises separately from, but interacting with, many subordinate CTMCs corresponding to ph real firings. Reducing this otherwise difficult problem to Markov chain solutions allows it to be tackled efficiently using standard numerical methods: SmArT can compute steady-state measures of mixed ph models if it determines (automatically, while generating the state space) that the underlying process is semi-regenerative. For example, consider the spn Net in Fig. 2 with n=1 but with firing delays

firing(a:expo(1.1), b:2, c:1, d:equilikely(1,2), e:expo(1.5));
so that both ph real and ph int firing delays are present. The phases of transition b are taken from {2, 1, 0, •}, where “•” means that the transition is disabled. Upon enabling, b starts in phase 2, delays one unit of time, then moves to phase 1, where it delays again until phase 0, where it fires. Transition d behaves similarly over the same set of phases except it starts either in phase 1 or 2 with equal probability. The phase of expo transitions, one of {1, 0, •}, does not need to be recorded: it is 1 if the transition is enabled, • if it is disabled, and 0 only for the instantaneous amount of time just before firing. Transitions b and c can be enabled concurrently and asynchronously with d. However, these transitions are “(phase-)synchronized”: c and d become simultaneously enabled when a fires and the alternation of the enablings of c and b preserves the phase synchronization with respect to d. This is a unique feature of our class of “synchronous spn” models, which have an underlying stochastic semi-regenerative process. To solve a semi-regenerative process, we must first build its embedded DTMC. We observe the state of the model after each ph int phase change, if at least one such transition is enabled. Otherwise, we observe the state after each ph real phase change. Unlike ph int phase changes, which occur every unit of time, the time between ph real phase changes is an expo random variable that depends on the combined phase-change rate for all enabled ph real transitions, but it is unnecessary to explicitly record this time, thanks to the expo memoryless property. Fig. 7 shows the resulting embedded DTMC for Net(1) (with the firing time distributions just mentioned). States are identified by the number of tokens in p5 , p4 , p3 , p2 , and p1 , followed by the phase of transitions b, c, and d. 
The graph of the DTMC shows that expo transitions can fire from any state that enables them, while each enabled ph int transition must first go through a "phased delay" until it reaches phase 0, when it immediately fires. The embedded state transitions require some explanation, however. State (10000,•••), which enables transition a, can be exited in two ways, each with probability 0.5, since the firing of a enables transition d, which starts its phased delay from one of two phases with equal probability. The sojourn time of this embedded state has the same distribution as the firing delay of transition a: expo(1.1). Other state transitions have probability one, except for those associated with the parameter α. These depend on whether the expo transition e fires before the ph int transition b moves out of phase 1 or phase 2 in discrete time. As ph int phase changes occur every time unit, we can compute α, the probability that transition e does not fire in the interval [0, 1), by solving a two-state subordinate CTMC that starts in state 1, to indicate that transition e is in phase 1 while delaying an expo(1.5) time before transitioning to phase 0 and firing. Such a small CTMC has an easy solution: α = exp(−1.5), or 0.22313. Since the embedded DTMC is irreducible and aperiodic, SmArT computes a unique stationary probability vector, which, paired with the expected sojourn times in each state obtained from the subordinate CTMCs, is used to compute measures defined on the original semi-regenerative process. Our example produces the output

n=1: 7 states, 10 arcs, throughput = 0.31062
n=1: E[tk(p5,p4,p3,p2,p1)] = (0.282382,0.355228,0.36239,0.46593,0.251688)
Fig. 7. Embedded DTMC underlying the synchronous spn derived from Fig. 2.
SmArT also employs embedding with elimination [18,20]: certain embedded states can be eliminated either when only ph int or only ph real transitions are enabled, or when both types are enabled but the ph int transitions require long phase evolutions before firing. In such cases, the embedded states become good candidates for elimination when their effect on the defined measures can just as easily be reformulated for inclusion within neighboring embedded states. In Fig. 7, state (00110,1•2) serves no purpose other than imposing a ph int delay (no ph real delays and no transition firings). Thus, we can eliminate it and its transitions, bypass it with a new transition from (01010,•11) directly to (00101,2••), still with probability one, set the sojourn time in (01010,•11) to 2 instead of 1, and, finally, "distill" into state (01010,•11) any information about the eliminated state that is pertinent to the measures. In the above synchronous spn example, we chose n=1 deliberately because, for n>1, the spn becomes asynchronous: ph real transition a can enable ph int transition c when ph int transition d is enabled, but in a way that causes their phases to be unsynchronized. Indeed, even with n=1, we can achieve asynchronous behavior by changing the firing delay of transition b to ph real. For asynchronous models, a separate clock is required in principle for each "group of synchronous phases", but including these clocks within the state of the model results in a general state space Markov chain (GSSMC) with a continuous state space. SmArT is able to recognize a class of asynchronous models that exhibit "regenerative regions" on these continuous state spaces. For these models, SmArT implements a novel regenerative simulation method for steady-state analysis, based on identifying hidden regenerative regions within the underlying GSSMC.
Identifying regeneration points requires that we fix a state (marking and phase) known to be visited infinitely often and observe visits to it that also have clock readings with a distinguished probability distribution, fixed and independent of the past. Through a specialized data structure, the simulation algorithm is efficient and able to identify a broad variety of regenerative regions [20].

4.5 Kronecker Encoding of the Markov Chain Matrix
For spn models where all transitions have expo distributions (or are immediate, i.e., can fire as soon as they are enabled), SmArT provides advanced solution methods based on a Kronecker encoding of the transition rate matrix R of the underlying CTMC. These methods are quite effective in reducing the memory requirements and, just like MDD methods, they require a user-specified partition. Indeed, both MDD and Kronecker techniques can be employed for the
spn kanban(int N) := {
  for (int k in {1..4}) {
    place m[k], bad[k], kan[k], out[k];
    init(kan[k]:N);
    partition(m[k]:bad[k]:kan[k]:out[k]);
    trans redo[k], ok[k], back[k];
    firing(back[k]:expo(0.3));
    arcs(m[k]:redo[k], redo[k]:bad[k], m[k]:ok[k], ok[k]:out[k],
         bad[k]:back[k], back[k]:m[k]);
    real e[k] := avg_ss(tk(m[k])+tk(bad[k])+tk(out[k]));
  }
  trans t0, s1_23, s23_4, t4;
  firing(t0:expo(1.0), redo[1]:expo(0.36), ok[1]:expo(0.84), s1_23:expo(0.4),
         redo[2]:expo(0.42), ok[2]:expo(0.98), redo[3]:expo(0.39), ok[3]:expo(0.91),
         s23_4:expo(0.5), redo[4]:expo(0.33), ok[4]:expo(0.77), t4:expo(0.9));
  arcs(kan[1]:t0, t0:m[1], out[1]:s1_23, s1_23:kan[1], kan[2]:s1_23, s1_23:m[2],
       kan[3]:s1_23, s1_23:m[3], out[2]:s23_4, s23_4:kan[2], out[3]:s23_4,
       s23_4:kan[3], kan[4]:s23_4, s23_4:m[4], out[4]:t4, t4:kan[4]);
};
kanban(read_int("N")).e[4];

                                 Matrix diagram           Kronecker                         Sparse
N  States |S|  Nonzeros η(R)   Mem.     GS              Mem.     GS           JOR           GS
                               (bytes)  its  sec/it     (bytes)  its  sec/it  its  sec/it   its  sec/it
5   2,546,432   24,460,016     21,667   139      73      9,486   214     148  527      74   214      19
6  11,261,376  115,708,992     32,702   185     336     14,106   289     723  713     359     —       —
7  41,644,800  450,455,040     46,678   238   1,290     20,388   374   2,923    —       —     —       —

Fig. 8. Kanban system.
solution process, to store S and R, respectively. This allows SmArT to use probability vectors of size |S|, which are then the only objects whose memory requirements effectively limit the applicability of a numerical solution technique. The technique used to store the transition rate matrix R, and the multiplication algorithm used in the numerical iterations, are specified with the #MarkovStorage option. Explicit SPARSE storage is the most general, and the default, option. Algorithms based on the potential state space Ŝ or the actual state space S [5], for use with a Kronecker representation, can be specified (POTENTIAL KRONECKER and KRONECKER). Kronecker-based vector-matrix multiplication algorithms can be used with Jacobi by forcing SmArT to store the

94

G. Ciardo et al.

representation “by rows” (set option #MatrixByRows to true); otherwise, SmArT uses a Jacobi iteration that, like Gauss-Seidel, accesses the matrix by columns. SmArT also offers a particularly efficient data structure that combines the idea of decision diagrams with that of Kronecker algebra: matrix diagrams [11]. Selecting the option value MATRIX DIAGRAM KRONECKER causes SmArT to transform the Kronecker representation into a matrix diagram, which has a significantly lower computational overhead during the numerical solution.
For example, the model of Fig. 8 [28] describes an assembly line where “kanbans” are used to limit access to portions of the line, as a form of flow control. There are four assembly stations, and we accordingly use a partition into four submodels. The model is parameterized by the number of kanban tokens, N, initially assigned to each station, in p1, p2, p3, and p4. The figure also shows the SmArT input file and a table reporting the time and memory requirements for a numerical solution. The number of states |S| and of nonzero entries η(R) grow rapidly. However, with the Kronecker option, SmArT can solve models having over one order of magnitude more states than with traditional sparse-storage approaches. Furthermore, from the columns showing the time required (seconds per iteration), we can conclude that the time overhead imposed by the Kronecker approach is smallest when using matrix diagrams.
As discussed in [5], there is a memory/time trade-off when employing a Kronecker encoding of the transition rate matrices: the Gauss-Seidel method uses less memory and converges in fewer iterations, but it requires more time per iteration because it must use an algorithm to multiply a vector by a single matrix column; the Jacobi method (in this case, using a relaxation parameter ω = 0.9) can use a more efficient matrix-vector multiplication algorithm, but requires more iterations for convergence.
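The Gauss-Seidel/Jacobi trade-off can be illustrated on a tiny CTMC solved explicitly (a sketch in Python with a made-up generator, not SmArT's Kronecker-encoded implementation):

```python
import numpy as np

# Toy 3-state CTMC generator (rows sum to zero); a stand-in for the
# much larger, Kronecker-encoded rate matrices discussed in the text.
Q = np.array([[-2.0, 1.0, 1.0],
              [ 3.0,-4.0, 1.0],
              [ 1.0, 1.0,-2.0]])

def jacobi(Q, iters=200):
    n = Q.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        # every update uses only the vector from the previous sweep
        new = np.array([sum(pi[i] * Q[i, j] for i in range(n) if i != j) / -Q[j, j]
                        for j in range(n)])
        pi = new / new.sum()
    return pi

def gauss_seidel(Q, iters=200):
    n = Q.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        for j in range(n):   # reuses entries already updated in this sweep
            pi[j] = sum(pi[i] * Q[i, j] for i in range(n) if i != j) / -Q[j, j]
        pi /= pi.sum()
    return pi

# Both converge to the stationary vector pi satisfying pi @ Q = 0.
assert np.allclose(jacobi(Q) @ Q, 0.0, atol=1e-8)
assert np.allclose(gauss_seidel(Q) @ Q, 0.0, atol=1e-8)
```

Gauss-Seidel consumes the matrix one column at a time, which is why a column-access algorithm is needed in the Kronecker setting, while Jacobi can use a whole-matrix multiplication per sweep.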
Employing matrix diagrams, we attain the best of both worlds: both the number of iterations and the time per iteration are smaller than with any other Kronecker-based approach. Finally, the ordinary sparse-storage implementation is quite efficient, but only as long as the entries of the transition rate matrix can all fit in main memory; thus it is not competitive for large models (a “—” indicates that the model could not be solved due to excessive memory requirements).

4.6 Approximations
The use of MDDs for S and a Kronecker encoding for R allows us to compactly represent enormous CTMCs. For numerical solution, the only remaining memory bottleneck is the solution vector. When S is too large, even the solution vector alone requires excessive memory, and other techniques, such as discrete-event simulation or approximations, must be used. Currently, SmArT provides a novel approximation technique [25] for the stationary analysis of models with an underlying structured CTMC, which can still rely on complete knowledge of S and R, thanks to their extremely compact MDD and matrix diagram encodings. The technique performs K approximate aggregations, where each aggregation is based on the structure of the MDD representing S: states whose path in the MDD passes through the same downward pointer at level k are grouped together for aggregation k. The rate of transition
Logical and Stochastic Modeling with SmArT

95

 N    |S|            Worst relative error                          CPU (sec)
                     Average number of tokens  Average firing rate
 5    2.55 × 10^6        +2.557%                  −0.074%               0.84
 6    1.13 × 10^7        +2.262%                  −0.099%               1.38
 7    4.16 × 10^7        +2.032%                  −0.097%               2.19
 30   4.99 × 10^13          —                        —                462.48
 66   1.99 × 10^17          —                        —             13,424.50
Fig. 9. Accuracy and CPU time of the approximation for the Kanban system
between two groups of states can be determined efficiently from the Kronecker structure for R; however, for some transitions, it is necessary to know the relative probability of a state within its group. Since the rates of each aggregated CTMC may depend on the probabilities computed for the other aggregations, fixed-point iterations are used to break cyclic dependencies. The results can be quite accurate. For example, Fig. 9 shows the worst relative error (where an exact comparison is possible) when computing the average number of tokens in each place and the average firing rate of each transition for the Kanban system of Fig. 8. The total CPU time required by the approximation is also shown.
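The aggregation step sketched above (rates between groups, weighted by within-group conditional probabilities) can be illustrated on a toy chain; the partition, the rates, and the use of the exact stationary vector are illustrative assumptions, not SmArT's actual MDD-based procedure:

```python
import numpy as np

# Toy 4-state CTMC generator (rows sum to zero) and a hypothetical 2-group partition.
Q = np.array([[-3.0, 2.0, 1.0, 0.0],
              [ 1.0,-2.0, 0.0, 1.0],
              [ 2.0, 0.0,-3.0, 1.0],
              [ 0.0, 2.0, 2.0,-4.0]])
groups = [[0, 1], [2, 3]]

def aggregate(Q, groups, cond):
    """cond[i] = P(state i | its group); returns the aggregated generator."""
    K = len(groups)
    A = np.zeros((K, K))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            if a != b:
                A[a, b] = sum(cond[i] * Q[i, j] for i in ga for j in gb)
        A[a, a] = -A[a].sum()
    return A

# Exact stationary distribution of the full chain (solve pi Q = 0, sum(pi) = 1).
pi = np.linalg.lstsq(np.vstack([Q.T, np.ones(4)]),
                     np.r_[np.zeros(4), 1.0], rcond=None)[0]
group_of = {i: a for a, g in enumerate(groups) for i in g}
cond = np.array([pi[i] / pi[groups[group_of[i]]].sum() for i in range(4)])

# With the exact conditional probabilities, the group-level probabilities are
# stationary for the aggregated generator; the approximation instead iterates
# toward such a fixed point without ever computing pi exactly.
Pi = np.array([pi[g].sum() for g in groups])
assert np.allclose(Pi @ aggregate(Q, groups, cond), 0.0)
```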
5 Conclusions and Future Directions
We have presented the main features of SmArT, a software tool for the logical and stochastic modeling of discrete-state systems. One of its goals is the ability to solve truly large models, as this is needed to tackle practical systems. The data structures and solution algorithms implemented in SmArT reflect years of research advancements, and are the means to make progress toward this goal. The SmArT User Manual, examples, related publications, and instructions on how to obtain a copy are available at http://www.cs.wm.edu/~ciardo/SMART/. Nevertheless, much work is still required, in both research and implementation. A tighter integration of the logical and stochastic analysis capabilities of SmArT is desirable (e.g., CSL [1]). Research items on our “to-do list” have been mentioned throughout the paper. Implementation items include the spin-off of libraries for some of the most advanced functionalities, in order to make them available to the research community, and the creation of a model database which will contain both benchmark-type and realistic modeling examples, along with a detailed description, sample SmArT input code, and resource requirements.
References

[1] C. Baier, B. R. Haverkort, H. Hermanns, and J.-P. Katoen. Model checking continuous-time Markov chains by transient analysis. In Proc. CAV, LNCS 1855, pp. 358–372, July 2000. Springer.
[2] J. Bengtsson et al. New generation of Uppaal. In Int. Workshop on Software Tools for Technology Transfer, June 1998.
[3] J. W. Brewer. Kronecker products and matrix calculus in system theory. IEEE Trans. Circ. and Syst., CAS-25:772–781, Sept. 1978.
[4] R. E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comp., 35(8):677–691, Aug. 1986.
[5] P. Buchholz, G. Ciardo, S. Donatelli, and P. Kemper. Complexity of memory-efficient Kronecker operations with applications to the solution of Markov models. INFORMS J. Comp., 12(3):203–222, 2000.
[6] G. Ciardo, R. German, and C. Lindemann. A characterization of the stochastic process underlying a stochastic Petri net. IEEE TSE, 20(7):506–515, July 1994.
[7] G. Ciardo, G. Luettgen, and R. Siminiceanu. Efficient symbolic state-space construction for asynchronous systems. Proc. ICATPN, LNCS 1825, pp. 103–122, June 2000. Springer-Verlag.
[8] G. Ciardo, G. Luettgen, and R. Siminiceanu. Saturation: an efficient iteration strategy for symbolic state space generation. Proc. TACAS, LNCS 2031, pp. 328–342, Apr. 2001. Springer-Verlag.
[9] G. Ciardo, R. Marmorstein, and R. Siminiceanu. Saturation unbound. In Proc. TACAS, Apr. 2003. Springer-Verlag. To appear.
[10] G. Ciardo and A. S. Miner. Storage alternatives for large structured state spaces. Proc. Int. Conf. on Modeling Techniques and Tools for Computer Performance Evaluation, LNCS 1245, pp. 44–57, June 1997. Springer-Verlag.
[11] G. Ciardo and A. S. Miner. A data structure for the efficient Kronecker solution of GSPNs. Proc. PNPM, pp. 22–31, Sept. 1999. IEEE Comp. Soc. Press.
[12] G. Ciardo and R. Siminiceanu. Using edge-valued decision diagrams for symbolic generation of shortest paths. Proc. FMCAD, LNCS 2517, pp. 256–273, Nov. 2002.
[13] G. Ciardo and K. S. Trivedi. A decomposition approach for stochastic reward net models. Perf. Eval., 18(1):37–59, 1993.
[14] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri. NuSMV: a new Symbolic Model Verifier. Proc. CAV, LNCS 1633, pp. 495–499, July 1999. Springer.
[15] E. M. Clarke and E. A. Emerson. Design and synthesis of synchronization skeletons using branching time temporal logic. Proc. IBM Workshop on Logics of Programs, pp. 52–71, 1981.
[16] J. A. Couvillion et al. Performability modeling with UltraSAN. IEEE Software, pp. 69–80, Sept. 1991.
[17] D. Daly, D. D. Deavours, J. M. Doyle, P. G. Webster, and W. H. Sanders. Möbius: an extensible tool for performance and dependability modeling. LNCS 1786, pp. 332–336, 2000. Springer-Verlag.
[18] R. L. Jones. Analysis of Phase-Type Stochastic Petri Nets with Discrete and Continuous Timing. Tech. Rep. CR-2000-210296, NASA Langley, June 2000.
[19] R. L. Jones and G. Ciardo. On phased delay stochastic Petri nets: definition and an application. Proc. PNPM, pp. 165–174, Sept. 2001. IEEE Comp. Soc. Press.
[20] R. L. Jones III. Simulation and Numerical Solution of Stochastic Petri Nets with Discrete and Continuous Timing. Ph.D. thesis, College of William and Mary, Department of Computer Science, Williamsburg, VA, 2002.
[21] T. Kam, T. Villa, R. Brayton, and A. Sangiovanni-Vincentelli. Multi-valued decision diagrams: theory and applications. Multiple-Valued Logic, 4(1–2):9–62, 1998.
[22] M. Z. Kwiatkowska, G. Norman, and D. Parker. PRISM: Probabilistic Symbolic Model Checker. In Proc. Comp. Perf. Eval. / TOOLS, pp. 200–204, Apr. 2003.
[23] A. S. Miner. Efficient state space generation of GSPNs using decision diagrams. In Proc. DSN, pp. 637–646, Washington, DC, June 2002.
[24] A. S. Miner and G. Ciardo. Efficient reachability set generation and storage using decision diagrams. Proc. ICATPN, LNCS 1639, pp. 6–25, June 1999. Springer.
[25] A. S. Miner, G. Ciardo, and S. Donatelli. Using the exact state space of a Markov model to compute approximate stationary measures. Proc. ACM SIGMETRICS, pp. 207–216, June 2000. ACM Press.
[26] T. Murata. Petri nets: properties, analysis and applications. Proc. of the IEEE, 77(4):541–579, Apr. 1989.
[27] E. Pastor, O. Roig, J. Cortadella, and R. Badia. Petri net analysis using Boolean manipulation. Proc. ICATPN, LNCS 815, pp. 416–435, June 1994. Springer.
[28] M. Tilgner, Y. Takahashi, and G. Ciardo. SNS 1.0: Synchronized Network Solver. In Int. Workshop on Manufacturing and Petri Nets, pp. 215–234, June 1996.
[29] S. Yovine. Model checking timed automata. In European Educational Forum: School on Embedded Systems, pp. 114–152, 1996.
The Peps Software Tool

Anne Benoit¹, Leonardo Brenner², Paulo Fernandes², Brigitte Plateau¹, and William J. Stewart³

¹ Laboratoire ID, CNRS-INRIA-INPG-UJF, 51, av. Jean Kuntzmann, 38330 Montbonnot Saint-Martin, France
  {Anne.Benoit, Brigitte.Plateau}@imag.fr
² PUCRS, Faculdade de Informática, Av. Ipiranga, 6681, 90619-900 Porto Alegre, Brazil
  {lbrenner, paulof}@inf.pucrs.br
³ NCSU, Computer Science Department, Raleigh, NC 27695-8206, USA†
  [email protected]
Abstract. Peps is a software package for solving very large Markov models expressed as Stochastic Automata Networks (SAN). The SAN formalism defines a compact storage scheme for the transition matrix of the Markov chain, and it uses tensor algebra to handle the basic vector-matrix multiplications. Among the diverse application areas to which Peps may be applied, we cite computer and communication performance modeling, distributed and parallel systems, and finite-capacity queueing networks. This paper presents the numerical techniques included in version 2003 of the Peps software, the basics of its interface, and three practical examples.
1 Introduction
Parallel and distributed systems can be modeled as sets of interacting components. Their behavior is usually hard to understand, and formal techniques are necessary to check their correctness and predict their performance. A Stochastic Automata Network (SAN) [6,12] is a formalism that facilitates the modular description of such systems and allows the automatic derivation of the underlying Markov chain, which represents their temporal behavior. Solving this Markov chain for transient or steady-state probabilities allows us to derive performance indices. The main difficulties in this process are the complexity of the model and the size of the generated Markov chain. Several other high-level formalisms have been proposed to help model very large and complex continuous-time Markov chains in a compact and structured manner. For example, queueing networks [9], generalized stochastic Petri nets [1], stochastic reward nets [11] and stochastic activity nets [17] are, thanks to their extensive modeling capabilities, widely used in diverse application domains, and notably in the areas of parallel and distributed systems. The pioneering work on the use of Kronecker algebra for solving large Markov chains has been conducted in a SAN context. The modular structure of a SAN
† Research supported in part by NSF grant ITR-105682
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 98–115, 2003. c Springer-Verlag Berlin Heidelberg 2003
The Peps Software Tool
99
model has an impact on the mathematical structure of the Markov chain, in that it induces a product form represented by a tensor product. Other formalisms have used this Kronecker technique, e.g., stochastic Petri nets [5] and process algebras [10]. The basic idea is to represent the matrix of the Markov chain by means of a tensor (Kronecker) formula, called the descriptor [2]. This formulation allows very compact storage of the matrix. Moreover, computations can be conducted using only this formulation, thereby saving considerable amounts of memory (as compared to an extensive generation of the matrix). Recently, other formats which considerably reduce the storage cost, such as matrix diagrams [4], have been proposed. They basically follow the same idea: components of the model have independent behaviors and are synchronized at some instants; when they behave independently, their properties are stored only once, whatever the state of the rest of the system. Using tensor products, a single small matrix is all that is necessary to describe a large number of transitions. Using matrix diagrams (a representation of the transition matrix as a graph), transitions with the same rate are represented by a single arc. At this time, SAN algorithms use only Kronecker technology, but a SAN model could also be solved using matrix diagrams. A particular SAN feature is the use of functional rates and probabilities [7]. These are basically state-dependent rates, but even if a rate is local to a component (or a subset of components) of the SAN, the functional rate can depend on the entire state of the SAN. It is important to notice that this concept is more general than the usual state-dependent concept in queueing networks, where a state-dependent service rate depends only on the state of the queue itself.
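The descriptor idea, storing only small per-component matrices and recovering global transitions through Kronecker products, can be sketched with NumPy (the matrices A and B below are made-up local rate matrices, not taken from the paper):

```python
import numpy as np

# Hypothetical local rate matrices for two components with 2 and 3 states.
A = np.array([[0.0, 2.0],
              [1.0, 0.0]])
B = np.array([[0.0, 3.0, 0.0],
              [0.0, 0.0, 4.0],
              [5.0, 0.0, 0.0]])

# A synchronizing event acting on both components is a single Kronecker term:
# its 6x6 global effect is fully described by the 4 + 9 entries of A and B.
R_sync = np.kron(A, B)
assert R_sync.shape == (6, 6)
assert A.size + B.size < R_sync.size   # compact descriptor vs. explicit matrix

# A local event in the first component leaves the second one untouched,
# which in tensor form is a Kronecker product with an identity factor.
R_local = np.kron(A, np.eye(3))
```

The memory gap widens quickly: with N components of n states each, the descriptor stores N matrices of n² entries while the explicit matrix has n^(2N).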
The basic technique to solve SAN models is the so-called shuffle algorithm [13], which makes use of the Kronecker product structure and has acceptable complexity [14]. Since the initial introduction of this algorithm, many improvements have been developed, including:

– a reduction in the cost of function evaluation by a different implementation of the shuffle algorithm,
– the reduction of the model product state space and the complexity of the descriptor formula by the automatic grouping of automata,
– the acceleration of the convergence of iterative methods by using preconditioning techniques and projection methods.

These improvements were implemented in the previous version of the software dedicated to solving SAN models, called Peps (Performance Evaluation of Parallel Systems), namely the version Peps 2000 [2,6,7]. The objective of this paper is to present the performance of the new version, Peps 2003. It is based on the previous version, and offers new features and better performance. Furthermore, the interface has been modified to allow the notion of replication:

– replicas of states when they have the same behavior (typically the inner states of a queue), and
– replicas of automata (typically, identical components of a system).
100
A. Benoit et al.
Another improvement of the Peps 2003 version concerns the function evaluation itself: in previous versions, functions were interpreted, while in the new version they are compiled. This paper gives an experimental assessment of this modification. Finally, the shuffle algorithm has been modified in order to work with vectors of the size of the reachable state space [3]. Previously, the shuffle algorithm's main drawback was its use of vectors of the size of the product state space. When the reachable state space is small compared to the product state space (basically, less than 50%), the new version of the shuffle algorithm uses less memory and is more efficient. This paper is divided into sections as follows. Section 2 presents the SAN formalism and the Peps syntax by means of an example. Section 3 briefly presents the algorithms implemented in Peps. Section 4 presents examples which are used to compute some numerical results showing the performance of Peps 2003. This improved performance is discussed in the conclusions.
2 A Formalism for Stochastic Automata Networks
In a SAN [7,13] a system is described as a collection of interacting subsystems. Each subsystem is modeled by a stochastic automaton, and the interaction among automata is described by firing rules for the transitions inside each automaton. SAN models can be defined on a continuous-time or discrete-time scale. In this paper, attention is focused only on continuous-time models, and therefore the occurrence of transitions is described by a rate of occurrence. The concepts presented in this paper can be generalized to discrete-time models, since the theoretical basis of such SAN models has already been established [2]. In this section an informal description of a SAN is presented, and then the textual format used to input SAN models into the Peps 2003 tool is briefly described.

2.1 Informal Description
Each automaton is composed of states, called local states, and transitions among them. Transitions on each automaton are labeled with a list of the events that may trigger them. Each event is denoted by its name and its rate (only the name is indicated in the graphical representation of the model). When the occurrence of the same event can lead to different arrival states, a probability of occurrence is assigned to each possible transition. The label on the transition is given as evt(prob), where evt is the event name, and prob is the probability of occurrence. When not explicitly specified, this probability is set to 1. There are basically two ways in which stochastic automata interact. First, the rate at which an event may occur can be a function of the state of other automata. Such rates are called functional rates. Rates that are not functional are said to be constant rates. The probabilities of occurrence of events can also be functional or constant. Second, an event may involve more than one automaton: the occurrence of such an event triggers transitions in two or more automata at the same time. Such events are called synchronizing events. They may have
constant or functional rates. An event which involves only one automaton is said to be a local event.
Consider a SAN model with N automata and E events. It is an N-component Markov chain whose components are not necessarily independent (due to the possible presence of functional rates and synchronizing events). A local state of the i-th automaton (A(i), where i = 1 ... N) is denoted x(i), while the complete set of states for this automaton is denoted S(i), and the cardinality of S(i) is denoted by ni. A global state for the SAN model is a vector x = (x(1), ..., x(N)). Ŝ = S(1) × ··· × S(N) is called the product state space, and its cardinality is equal to n1 × ··· × nN. The reachable state space of the SAN model is denoted by S; it is generally smaller than the product state space, since synchronizing events and functional rates may prevent some states in Ŝ from being reachable.
The set of automata involved in a (local or synchronizing) event e is denoted by Oe. The event e can occur if, and only if, all the automata in Oe are in a local state from which a transition labeled by e can be triggered. When it occurs, all the corresponding transitions are triggered. Notice that for a local event e, Oe is reduced to the single automaton involved in this event, and that only one transition is triggered.
Fig. 1 presents an example. The first automaton A(1) has three states x(1), y(1), and z(1); the second automaton A(2) has two states x(2) and y(2). The events of this model are:

– e1, e2 and e3: local events involving only A(1), with constant rates respectively equal to λ1, λ2 and λ3;
– e4: a synchronizing event involving A(1) and A(2), with a constant rate λ4;
– e5: a local event involving A(2), with a functional rate f:
  – f = µ1, if A(1) is in state x(1);
  – f = 0, if A(1) is in state y(1);
  – f = µ2, if A(1) is in state z(1).
When the SAN is in state (z (1) , y (2) ), the event e4 can occur at rate λ4 , and the resulting state of the SAN can be either (y (1) , x(2) ) with probability π or (x(1) , x(2) ) with probability 1 − π.
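A minimal Python transcription of this running example, covering the product state space and the functional rate f; the numeric values of mu1 and mu2 are assumptions, since the paper keeps them symbolic:

```python
from itertools import product

# Local state spaces of the two automata of the example.
S1 = ["x1", "y1", "z1"]          # states of A(1)
S2 = ["x2", "y2"]                # states of A(2)

# Product state space S-hat: all combinations, |S-hat| = 3 * 2 = 6.
S_hat = list(product(S1, S2))
assert len(S_hat) == 6

mu1, mu2 = 1.5, 0.8              # illustrative values, symbolic in the paper

def f(global_state):
    """Functional rate of e5: local to A(2), but depends on the state of A(1)."""
    x1, _ = global_state
    return {"x1": mu1, "y1": 0.0, "z1": mu2}[x1]

assert f(("y1", "y2")) == 0.0    # e5 cannot fire while A(1) is in y(1)
assert f(("x1", "x2")) == mu1
assert f(("z1", "y2")) == mu2
```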
Fig. 1. Very Simple SAN model example
We see then that a SAN model is described as a set of automata (each automaton containing nodes, edges and labels). These may be used to generate
the transition matrix of the Markov chain representing its dynamic behavior using only elementary matrices. This formulation of the transition matrix is called the SAN descriptor.

2.2 Peps 2003 Textual Format
A textual formalism for describing models is proposed, and it keeps the key feature of the SAN formalism: its modular specification. Peps 2003 incorporates a graph-based approach which is close to the model semantics. In this approach each automaton is represented by a graph, in which the nodes are the states and the arcs represent the occurrence of events. This textual description has been kept simple, extensible and flexible:

– Simple, because there are few reserved words, just enough to delimit the different levels of modularity;
– Extensible, because the definition of a SAN model is performed hierarchically;
– Flexible, because of the inclusion of replication structures, which allow the reuse of identical automata and the construction of automata having repeated state blocks with the same behavior, such as found in queueing models.

This section describes the Peps 2003 textual formalism used to describe SAN models. To be compatible with Peps 2003, any file describing a SAN should have the suffix .san. Fig. 2 shows an overview of the Peps input structure. A SAN description is composed of five blocks (Fig. 2), which are easily located with their delimiters¹ (in bold). The other reserved words in the Peps input language are indicated with an italic font. The symbols “<” and “>” indicate mandatory information to be defined by the user. The symbols “{” and “}” indicate optional information.
The first block, identifiers, contains all declarations of parameters: numerical values, functions, or sets of indices (domains) to be used for replicas in the model definition. An identifier (<id name>) can be any string of alphanumeric characters. The numerical values and functions are defined according to a C-like syntax. In general, the expressions are similar to common mathematical expressions, with logical and arithmetic operators. The arguments of these expressions can be constant input numbers (input parameters of the model), automata identifiers or state identifiers.
In this last case, the expressions are functions defined on the SAN model state space. For example, “the number of automata in state n0” (which gives an integer result) can be expressed as “nb n0”. A function that returns the value 4 if two automata (A1 and A2) are in different states, and the value 0 otherwise, is expressed as “(st A1 != st A2) * 4”. Comparison
¹ The word “delimiters” is used to indicate necessary symbols, having a fixed position in the file.
identifiers
    <id name> = <exp> ;
    <id name> = [domain] ;

events
    // without replication
    loc <evt name> <rate> <automaton>
    syn <evt name> <rate> <automata>
    // with replication
    loc <evt name> [domain] <rate> <automata> [domain]
    syn <evt name> [domain] <rate> <automata> [exp-domain]

{partial} reachability = <exp> ;

network <net name> (<type>)
    aut <aut name> {[domain]}
        stt <stt name> {[domain]}
            to( <stt name> ) <evt name> {<prob>}
        stt <stt name>
            to( <stt name> ) <evt name> {<prob>}

results
    <res name> = <exp> ;
Fig. 2. Modular structure of SAN textual format
operators return the value “1” for a true result and the value “0” for a false result².
Sets of indices are useful for defining numbers of events, automata, or states that can be described as replications. A group of replicated automata of A with the index set [0..2, 5, 8..10] defines the set containing the automata A[0], A[1], A[2], A[5], A[8], A[9], and A[10].
The events block defines each event of the model, giving:
– its type (local or synchronizing);
– its name (an identifier);
– its firing rate (a constant or function previously defined in the identifiers block);
– its Oe set, i.e., the automaton (for local events) or the automata (for synchronizing events) concerned by this event.
Additionally, events can be replicated using the sets of indices (domains). This facility can be used when events with the same rate appear in a set of automata.
The reachability block contains a function defining the reachable state space of the SAN model. Usually, this is a Boolean function, returning a nonzero value for states of Ŝ that belong to S. A model where all the states are reachable has its reachability function defined as any constant different from zero, e.g., the value 1. Optionally, a partial reachability function can be defined by adding the reserved word “partial”. In this case, only a subset of S is defined, and the overall S will be computed by Peps 2003.
² In the Peps 2003 user manual, a full description of the possible functions and the full grammatical definition of the textual format of SAN models is available [15].
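The semantics of these functional expressions (comparisons yielding 0/1 integers, “nb” counting automata in a given state) can be mimicked in plain Python; the helper names below are hypothetical, not the Peps API:

```python
# "(st A1 != st A2) * 4": comparisons evaluate to 0 or 1, so multiplying by 4
# yields 4 when the two automata differ and 0 otherwise.  Python's bool
# participates in arithmetic the same way.
def rate(st_A1, st_A2):
    return (st_A1 != st_A2) * 4

assert rate("n0", "n1") == 4
assert rate("n0", "n0") == 0

# "nb n0", the number of automata currently in state n0, is a plain count
# over the global state (here a list with one local state per automaton).
def nb(state_name, global_state):
    return sum(1 for s in global_state if s == state_name)

assert nb("n0", ["n0", "n1", "n0"]) == 2
```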
The network block is the major component of the SAN description and has a hierarchical structure: a network is composed of a set of automata; each automaton is composed of a set of states; each state is connected to a set of output arcs; and each arc has a set of labels identifying events (local or synchronizing) that may trigger the transition. The first level, named “network”, includes general information such as the name of the model “<net name>” and the type of time scale of the model, namely “continuous” or “discrete”. Currently, only “continuous” model analysis is available in Peps 2003.
The delimiter of an automaton is the reserved word “aut”, and “<aut name>” is the automaton identifier. Optionally, a [domain] definition can be used to replicate it, i.e., to create a number of copies of this automaton. In this case, if i is a valid index of the defined [domain] and A the name of the replicated automaton, A[i] is the identifier of one automaton.
The stt section defines a local state of the automaton. “<stt name>” is the state identifier, and [domain] can be used to create replicas of the state. A description of each output transition from this state is given by the definition of a “to()” section. The identifier “<stt name>” inside the parentheses indicates the output state of this transition. Inside a group of replicated states, references to the other states of the group can be made positionally: to the current state (==), the previous one (--) or the successor (++). Larger jumps, e.g., two states ahead (+2), can also be defined, but any positional reference pointing to a non-existing state or to a state outside the replicated group is ignored. Finally, for each transition defined, the set of events (local and synchronizing) that can trigger the transition is expressed by their names (“<evt name>”) and, optionally (if different from 1), the probability of occurrence.
The from section is quite similar to the stt section, but it cannot define local states. It is commonly used to define additional transitions which cannot be defined in the stt section. A typical use of the from section is to define a transition leaving from only one state of a group of replicated states to a state outside the group; e.g., a queue with particular initial or final states may need this kind of transition definition. The functions used to compute performance indices of the SAN model are defined in the results block. The results obtained by Peps are the mean values of these functions, computed using the stationary probability solution of the model.
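Since a result is the mean value of a function under the stationary distribution, it amounts to an expectation; a toy sketch with assumed states and probabilities (not Peps syntax):

```python
# Hypothetical reachable global states and an assumed stationary vector pi.
states = [("idle", 0), ("busy", 1), ("busy", 2)]
pi = [0.5, 0.3, 0.2]

def tokens(state):
    """A user-defined result function, e.g. the number of tokens in a queue."""
    return state[1]

# The "result" is the expectation of the function under pi.
mean_tokens = sum(p * tokens(s) for p, s in zip(pi, states))
assert abs(mean_tokens - 0.7) < 1e-12   # 0.5*0 + 0.3*1 + 0.2*2
```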
3 The Peps 2003 Software Tool
Peps is implemented using the C++ programming language, and although the source code is quite standard, only Linux and Solaris versions have been tested. The main features of version 2000 are [7]:
– Textual description of continuous-time SAN models (without replicas);
– Stationary solution of models using the Arnoldi, GMRES and Power iterative methods [16,19];
– Numerical optimizations regarding functional dependencies, diagonal pre-computation, preconditioning and algebraic aggregation of automata [6]; and
– Results evaluation.
Peps 2003 includes some bug corrections and three new features:
– Compact textual description of continuous-time SAN models;
– Numerical solution using probability vectors of the size of the reachable state space; and
– Fast(er) function evaluation.
The previous section presented the new textual format used in Peps 2003, which allows compact descriptions mostly due to the idea of replication of automata, states and events. The next two sections (3.1 and 3.2) present the other new features of version 2003. This paper will not present details on how to install and operate Peps; to learn how to do so, the reader is invited to read the user manual available at the Peps home page [15].

3.1 Probability Vector Handling
SANs allow Markov chain models to be described in a memory-efficient manner because their storage is based on a tensor structure (the descriptor). However, the use of independent components connected via synchronizations and functions may produce a representation with many unreachable states (|S| ≪ |Ŝ|). Within this tensor (Kronecker) framework, a number of algorithms have been proposed to compute the product of a probability vector and the descriptor. The first, and perhaps best-known, is the shuffle algorithm [3,6,7], which computes the product but never needs the matrix explicitly. However, this algorithm needs to use “extended” vectors π̂ whose size is equal to that of Ŝ. This algorithm is denoted E-Sh, for extended shuffle, and it was the only one available in Peps 2000.
However, when there are many unreachable states (|S| ≪ |Ŝ|), E-Sh is not efficient, because of its use of extended vectors. The probability vector will have many zero elements, since only reachable states have nonzero probability. Moreover, computations are carried out for all the elements of the vector, even those corresponding to unreachable states. Therefore, the computational gain obtained by exploiting the tensor formalism may be negated if many useless computations are performed. Furthermore, memory is used for states whose probability is always zero.
The use of reduced vectors (vectors π which contain entries only for reachable states, i.e., vectors of size |S|) allows a reduction in memory needs, and some unneeded computations are avoided. This leads to significant memory gains when using iterative methods such as Arnoldi or GMRES, which may require many probability vectors. A modification of the E-Sh shuffle algorithm permits the use of such vectors. However, to obtain good performance at the computation-time level, some intermediate vectors of size |Ŝ| are also used.
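The core of the shuffle algorithm, multiplying a vector by a Kronecker product without ever building the product matrix, can be sketched in NumPy (an illustrative reimplementation of the idea behind E-Sh, not the Peps code):

```python
import numpy as np
from functools import reduce

def shuffle_multiply(v, mats):
    """Compute v @ (mats[0] ⊗ mats[1] ⊗ ... ⊗ mats[-1]) without forming
    the Kronecker product: apply each small matrix along its own axis."""
    dims = [A.shape[0] for A in mats]
    x = v.reshape(dims)                          # one axis per automaton
    for k, A in enumerate(mats):
        x = np.tensordot(x, A, axes=([k], [0]))  # contract axis k with A
        x = np.moveaxis(x, -1, k)                # restore the axis ordering
    return x.reshape(-1)

# Check against the explicit Kronecker product on a small example.
rng = np.random.default_rng(0)
mats = [rng.random((n, n)) for n in (2, 3, 2)]
v = np.arange(12, dtype=float)                   # extended vector, size 2*3*2
full = reduce(np.kron, mats)                     # what we avoid building
assert np.allclose(shuffle_multiply(v, mats), v @ full)
```

The work is O(|Ŝ| · Σ nk) instead of O(|Ŝ|²), but the vector itself still has the size of the product state space, which is exactly the drawback the reduced-vector variants address.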
An algorithm described in [3] and implemented in Peps 2003 allows us to save computations by taking into account the fact that probabilities corresponding to non-reachable states are always zero in the resulting vector. This partially reduced computation corresponds to the algorithm called PR-Sh. However,
A. Benoit et al.
the savings in memory turn out to be somewhat insignificant for the shuffle algorithm itself. A final version of the shuffle algorithm concentrates on the amount of memory used, and allows us to handle even more complex models. In this algorithm, all the intermediate data structures are stored in reduced format. This fully reduced computation corresponds to the algorithm called FR-Sh. This algorithm is also implemented in Peps 2003, and is described in [3].

3.2 Fast Function Evaluation
One of the most important improvements in the new version of the Peps software is the inclusion of fast function evaluation. The use of functions is one of the key features of Peps: it is an intuitive and compact way to represent interdependencies among automata. An efficient numerical solution of SAN models with functions was published in [7]. That work described properties of the tensor operations and algorithms that reduce the number of function evaluations performed during vector multiplications. In the previous version of Peps (version 2000), functions were represented in reverse Polish notation, interpreted at execution time, and evaluated using a stack implemented as a regular data structure. This implementation is quite flexible, and it allows the transformation of functions during the compilation of a SAN model³. However, the interpretation of functions may cost a great amount of time during execution. Typically, for a functional rate in automaton i, denoting by nj the number of local states of the j-th automaton and by N the number of automata, each functional element in a matrix corresponding to the i-th automaton can be evaluated up to ∏_{j=1, j≠i}^{N} nj times. Additionally, since Peps solves the models using iterative methods, all these evaluations must be carried out during each and every iteration. The efficient algorithm presented in [7] can take advantage of the particular dependencies of the functional elements to reduce this number of evaluations, but very often this number remains quite large. The natural alternative to function interpretation is to compile the functions into executable code. Since each SAN model has its own functions, the functional elements are compiled into C++ code by a system call to the gcc compiler, which generates a dynamic library that is then called by the Peps software. This technique is similar to the just-in-time code generation employed in Java environments [20].
In the case of Peps, however, the purpose of such real-time code generation is not to provide machine independence, but to provide tailor-made function evaluation for each SAN model. The gains obtained by the new function evaluation are substantial during the iterative solution of SAN models, but the flexibility of interpreted evaluation still justifies its use during the compilation of the model and the normalization of the descriptor. A version of Peps 2003 without compiled function evaluation is also available, to run Peps on platforms where the gcc compiler is not available.

³ The functions defined in the identifiers section of the model textual description are rarely used exactly as stated. The generation of the internal representation of a model, i.e., the operand matrices of the Markovian descriptor, usually creates new functions. This is done, e.g., for the normalization of the descriptor, where all matrix elements must be divided by a normalizing factor. In the case of automata algebraic grouping, even more new functions must be created, due to the symbolic evaluation of matrix elements that must be added (in the tensor-sum local part) or multiplied (in the tensor-product synchronized part).
4 Examples of SAN Models in Peps 2003
In this section, three examples are presented to illustrate the modeling power and the computational effectiveness of Peps 2003. For each example, the generic SAN model is described and then numerical results are computed for some specific cases. The machine used to run the examples is an IBM-PC running Linux (Mandrake distribution, v.8.0), with 1.5 Gbytes of RAM and a 2.0 GHz Pentium IV processor. Peps 2003 was compiled using gcc with the optimization option -O3. The indicated processing times do not take system time into account, i.e., they refer only to the user time spent for one iteration.

4.1 Example: A Model of Resource Sharing
The first example is a traditional resource sharing model, in which N distinguishable processes share a certain number (R) of indistinguishable resources. Each of these processes alternates between a sleeping state and a resource-using state. When a process wishing to move from the sleeping state to the using state finds R processes already using the resources, that process fails to access the resource and returns to the sleeping state. Notice that when R = 1 this model reduces to the usual mutual exclusion problem. Analogously, when R = N all the processes are independent and there is no restriction on access to the resources. We shall let λi be the rate at which process i awakes from the sleeping state wishing to access a resource, and µi the rate at which this same process releases the resource.
Fig. 3. Resource Sharing Model - version 1
In our SAN representation (Fig. 3), each process is modeled by a two-state automaton A(i), the two states being sleeping and using. We shall let st A(i) denote the current state of automaton A(i). Also, we introduce the function
f = δ( Σ_{i=1}^{N} δ(st A(i) == using) < R ),
where δ(b) is an integer function that has the value 1 if the Boolean b is true, and the value 0 otherwise. Thus the function f has the value 1 when access to the resource is permitted, and the value 0 otherwise. Fig. 3 provides a graphical illustration of this model, called RS1. In this representation each automaton A(i) has two local events:
– Gi, which corresponds to the i-th process getting a resource, with rate λi f;
– Ri, which corresponds to the i-th process releasing a resource, with rate µi.
The textual .san file describing this model is:

//===================== RS model version 1 ==================
//   N=4, R=2
//===========================================================
identifiers
  R       = 2;  // amount of resources
  mu1     = 6;  // rate for leaving a resource for process 1
  lambda1 = 3;  // rate for requesting a resource for process 1
  f1      = lambda1 * (nb using < R);
  mu2     = 5;  // rate for leaving a resource for process 2
  lambda2 = 4;  // rate for requesting a resource for process 2
  f2      = lambda2 * (nb using < R);
  mu3     = 4;  // rate for leaving a resource for process 3
  lambda3 = 6;  // rate for requesting a resource for process 3
  f3      = lambda3 * (nb using < R);
  mu4     = 3;  // rate for leaving a resource for process 4
  lambda4 = 5;  // rate for requesting a resource for process 4
  f4      = lambda4 * (nb using < R);

events
  loc G1 (f1)  P1  // local event G1 has rate f1 and appears in automaton P1
  loc R1 (mu1) P1  // local event R1 has rate mu1 and appears in automaton P1
  loc G2 (f2)  P2  // local event G2 has rate f2 and appears in automaton P2
  loc R2 (mu2) P2  // local event R2 has rate mu2 and appears in automaton P2
  loc G3 (f3)  P3  // local event G3 has rate f3 and appears in automaton P3
  loc R3 (mu3) P3  // local event R3 has rate mu3 and appears in automaton P3
  loc G4 (f4)  P4  // local event G4 has rate f4 and appears in automaton P4
  loc R4 (mu4) P4  // local event R4 has rate mu4 and appears in automaton P4

reachability = (nb using <= R);  // only the states where at most R
                                 // resources are being used are reachable

network rs1 (continuous)
  aut P1 stt sleeping to(using) G1
         stt using    to(sleeping) R1
  aut P2 stt sleeping to(using) G2
         stt using    to(sleeping) R2
  aut P3 stt sleeping to(using) G3
         stt using    to(sleeping) R3
  aut P4 stt sleeping to(using) G4
         stt using    to(sleeping) R4

results
  full    = nb using == R;   // probability of all resources being used
  empty   = nb using == 0;   // probability of all resources being available
  use1    = st P1 == using;  // probability that the first process uses the resource
  average = nb using;        // average number of occupied resources
It was not possible to use replicators to define all four automata in this example. In fact, the use of replications is only possible if all automata are identical, which is not the case here, since each automaton has different events (with different rates). If all the processes had the same acquiring (λ) and releasing (µ) rates, this example could be represented more simply as:

//========== RS model version 1 with same rates =============
//   N=4, R=2
//===========================================================
identifiers
  N      = [1..4];  // amount (and identifier) of processes
  R      = 2;       // amount of resources
  mu     = 6;       // rate for leaving a resource for all processes
  lambda = 3;       // rate for requesting a resource for all processes
  f      = lambda * (nb using < R);

events
  loc G (f)  P  // local event G has rate f and appears in all automata P
  loc R (mu) P  // local event R has rate mu and appears in all automata P

reachability = (nb using <= R);  // only the states where at most R
                                 // resources are being used are reachable

network rs1 (continuous)
  aut P[proc] stt sleeping to(using) G
              stt using    to(sleeping) R

results
  full    = nb using == R;     // probability of all resources being used
  empty   = nb using == 0;     // probability of all resources being available
  use1    = st P[1] == using;  // probability that the first process uses the resource
  average = nb using;          // average number of occupied resources
We wish to point out that the Peps documentation includes a number of variants of this model, to show that it is possible, with only simple modifications, to introduce a complete set of related models. Within the scope of this paper, it is interesting to describe a specific variation of this model that describes exactly the same problem, but which does not use functions to represent the resource contention. In fact, in this case, and in many others, synchronizing events can be used to generate an equivalent model without functional rates (frequently with many more automata, states, and/or synchronizing events). Fig. 4 presents this new model, in which an automaton is introduced to represent the resource pool. The resource allocations (events Gi, rate λi) and releases (events Ri, rate µi) that were formerly described as local events are now synchronizing events that increment the number of occupied resources at each allocation, and decrement it at each release. The resource contention is modeled by the impossibility of a process passing to the using state when all resources are occupied, i.e., when the automaton representing the resource pool is in its last state (where only release events can happen). The Peps 2003 textual format of this model is as follows:

//===================== RS model version 2 ==================
//   N=4, R=2
//===========================================================
identifiers
  R        = 2;       // amount of resources
  mu1      = 6;       // rate for leaving a resource for process 1
  lambda1  = 3;       // rate for requesting a resource for process 1
  mu2      = 5;       // rate for leaving a resource for process 2
  lambda2  = 4;       // rate for requesting a resource for process 2
  mu3      = 4;       // rate for leaving a resource for process 3
  lambda3  = 6;       // rate for requesting a resource for process 3
  mu4      = 3;       // rate for leaving a resource for process 4
  lambda4  = 5;       // rate for requesting a resource for process 4
  res_pool = [0..R];  // domain to describe the available resources pool

events
  syn G1 (f1)  P1 RP  // event G1 has rate f1 and appears in automata P1 and RP
  syn R1 (mu1) P1 RP  // event R1 has rate mu1 and appears in automata P1 and RP
  syn G2 (f2)  P2 RP  // event G2 has rate f2 and appears in automata P2 and RP
  syn R2 (mu2) P2 RP  // event R2 has rate mu2 and appears in automata P2 and RP
  syn G3 (f3)  P3 RP  // event G3 has rate f3 and appears in automata P3 and RP
  syn R3 (mu3) P3 RP  // event R3 has rate mu3 and appears in automata P3 and RP
  syn G4 (f4)  P4 RP  // event G4 has rate f4 and appears in automata P4 and RP
  syn R4 (mu4) P4 RP  // event R4 has rate mu4 and appears in automata P4 and RP

reachability = (nb [P1..P4] using == st RP);
  // the number of processes using resources must be equal to the number
  // of occupied resources in the resource pool RP

network rs2 (continuous)
  aut P1 stt sleeping to(using) G1
         stt using    to(sleeping) R1
  aut P2 stt sleeping to(using) G2
         stt using    to(sleeping) R2
  aut P3 stt sleeping to(using) G3
         stt using    to(sleeping) R3
  aut P4 stt sleeping to(using) G4
         stt using    to(sleeping) R4
  aut RP stt n[res_pool] to(++) G1 G2 G3 G4
                         to(--) R1 R2 R3 R4

results
  full    = st RP == n[R];   // probability of all resources being used
  empty   = st RP == n[0];   // probability of all resources being available
  use1    = st P1 == using;  // probability of the first process using the resource
  average = st RP;           // average number of occupied resources

Fig. 4. Resource Sharing Model without functions - version 2
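To make the model concrete, the underlying CTMC of the RS1 model (N=4, R=2, with the rates from the listings above) can be built directly and solved for its stationary vector. This is a brute-force sketch for illustration only, and not how Peps works internally: Peps never builds the full transition matrix.

```python
from itertools import product
import numpy as np

# RS1 model: N = 4 processes, R = 2 resources, rates from the listing.
N, R = 4, 2
lam = [3.0, 4.0, 6.0, 5.0]
mu = [6.0, 5.0, 4.0, 3.0]

states = list(product((0, 1), repeat=N))  # 1 = using, 0 = sleeping
idx = {s: k for k, s in enumerate(states)}
Q = np.zeros((len(states), len(states)))
for s in states:
    for i in range(N):
        if s[i] == 0 and sum(s) < R:  # event Gi, rate lam[i] * (nb using < R)
            Q[idx[s], idx[s[:i] + (1,) + s[i + 1:]]] += lam[i]
        elif s[i] == 1:               # event Ri, rate mu[i]
            Q[idx[s], idx[s[:i] + (0,) + s[i + 1:]]] += mu[i]
np.fill_diagonal(Q, -Q.sum(axis=1))

# Stationary distribution: solve pi Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(len(states))])
b = np.zeros(len(states) + 1)
b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]

full = sum(pi[idx[s]] for s in states if sum(s) == R)  # result "full"
average = sum(pi[idx[s]] * sum(s) for s in states)     # result "average"
print(round(full, 4), round(average, 4))
```

Note that the states with more than R processes using resources receive zero stationary probability, which is exactly what the reachability function encodes.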
4.2 Example: First Server Available Queue
The second example considers a queue with a common exponential arrival stream and a finite number (C) of distinguishable and ordered servers (Ci, i = 1...C). When a client arrives, it is served by the first available server: if C1 is available, the client is served by it; otherwise, if C2 is available, the client is served by it; and so on. This queue's behavior is not monotonic, so, as far as we can ascertain, there is no product-form solution for this model. The basic technique to model this queue is to consider each server as a two-state automaton (states idle and busy). The arrival at each server is expressed by a local event (called Li) with a functional rate that is equal to λ if all preceding servers are busy, and zero otherwise. At any given moment, only one server, the first available, has a nonzero arrival rate. The end of service at each server is simply a local event (Di) with constant rate µi. The same model can also be expressed as a SAN without functions. In this case, each function is replaced by a synchronizing event that synchronizes the automaton representing the server accepting a client with all previous automata in the busy state. The Peps 2003 textual format for the original model (with functions) is as follows:

//================== FSA model ==============
//   (with functions) C=4
//===========================================
identifiers
  lambda = 5;
  mu1 = 6;
  f1  = lambda;
  mu2 = 5;
  f2  = (st C1 == busy) * lambda;
  mu3 = 4;
  f3  = (nb [C1..C2] busy == 2) * lambda;
  mu4 = 3;
  f4  = (nb [C1..C3] busy == 3) * lambda;

events
  loc L1 (f1)  C1
  loc D1 (mu1) C1
  loc L2 (f2)  C2
  loc D2 (mu2) C2
  loc L3 (f3)  C3
  loc D3 (mu3) C3
  loc L4 (f4)  C4
  loc D4 (mu4) C4

reachability = 1;

network fsa (continuous)
  aut C1 stt idle to(busy) L1
         stt busy to(idle) D1
  aut C2 stt idle to(busy) L2
         stt busy to(idle) D2
  aut C3 stt idle to(busy) L3
         stt busy to(idle) D3
  aut C4 stt idle to(busy) L4
         stt busy to(idle) D4

results
  full    = nb busy == C;
  empty   = nb busy == 0;
  use1    = st C1 == busy;
  average = nb busy;
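The claim that at any given moment only the first available server sees a nonzero arrival rate can be checked with a few lines. This is a sketch of the guard logic behind f1..f4 combined with the idle-state enabling condition; the helper name `arrival_rate` is ours, and this is not Peps code.

```python
# Effective arrival rate at server i in the first-server-available queue
# (C = 4, lambda = 5 as in the listings): server i can accept the arrival
# only if it is idle and all preceding servers are busy.
lam = 5.0

def arrival_rate(i, busy):
    """busy: tuple with one 0/1 entry per server (1 = busy)."""
    return lam if busy[i] == 0 and all(busy[:i]) else 0.0

state = (1, 1, 0, 0)  # C1 and C2 busy; C3 and C4 idle
print([arrival_rate(i, state) for i in range(4)])  # [0.0, 0.0, 5.0, 0.0]
```

In every state, at most one entry of the resulting rate list is nonzero, which is precisely the non-monotonic routing rule described above.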
The alternative model (with synchronizing events) is:

//============ FSA alternative model =======
//   (with synchronizing events) C=4
//==========================================
identifiers
  lambda = 5;
  mu1    = 6;
  mu2    = 5;
  mu3    = 4;
  mu4    = 3;

events
  loc L1 (lambda) C1
  loc D1 (mu1)    C1
  syn L2 (lambda) C1 C2
  loc D2 (mu2)    C2
  syn L3 (lambda) C1 C2 C3
  loc D3 (mu3)    C3
  syn L4 (lambda) C1 C2 C3 C4
  loc D4 (mu4)    C4

reachability = 1;

network fsa2 (continuous)
  aut C1 stt idle to(busy) L1
         stt busy to(idle) D1
                  to(busy) L2 L3 L4
  aut C2 stt idle to(busy) L2
         stt busy to(idle) D2
                  to(busy) L3 L4
  aut C3 stt idle to(busy) L3
         stt busy to(idle) D3
                  to(busy) L4
  aut C4 stt idle to(busy) L4
         stt busy to(idle) D4

results
  full    = nb busy == C;
  empty   = nb busy == 0;
  use1    = st C1 == busy;
  average = nb busy;

4.3 Example: A Mixed Queueing Network
The final example is a mixed queueing network (Fig. 5), in which customers of class 1 arrive and eventually depart (i.e., an open class), while customers of class 2 circulate forever in the network (i.e., a closed class). This quite complex example is presented to stress the descriptive power of Peps 2003, and to provide a really large SAN model. Due to its size, the equivalent SAN model is not presented as a figure. However, the construction technique does not differ significantly from the technique employed for the previous models. In this model, each queue visited only by the first class of customers (Queues 2 and 3) is represented by one automaton (A(2¹) and A(3¹), respectively). Queues visited by two classes of customers are represented by two automata (one for each class), and the total number of customers in a queue is the sum of the customers (of both classes) represented in each automaton. The size of this model depends on the maximum capacity of each queue, denoted Ki for queue i. For the second class of customers (the closed system) it is also necessary to define the number of customers in the system (N2). In this example, all queues block when the destination queue is full, even though other behavior, e.g., loss, could easily be modeled with the SAN formalism.

[Figure: queues 1–5 connected with routing probabilities π1, ..., π6]
Fig. 5. Mixed queueing network
The equivalent SAN model for this example has eight automata (A(1¹), A(1²), A(2¹), A(3¹), A(4¹), A(4²), A(5¹), A(5²)), representing each possible (queue, customer class) pair. The model has two local events (arrival and departure of class 1 customers) and nine synchronizing events (the routing paths for customers of both classes). Functional transitions are used to represent the capacity restriction on admission to queues accepting both classes of customers. The reachability function of the SAN model representing this queueing network must take into account both the unreachable states due to the use of two automata to represent a queue accepting two classes of customers, and the fixed number of customers of class 2. Assuming queue capacities K1 = 10, K2 = 5, K3 = 5, K4 = 8, K5 = 8, and a total population of class 2 customers N2 equal to 10, the reachability function for this model is:

reachability = st(A(1¹)) + st(A(1²)) <= 10  and
               st(A(4¹)) + st(A(4²)) <= 8   and
               st(A(5¹)) + st(A(5²)) <= 8   and
               st(A(1²)) + st(A(4²)) + st(A(5²)) == 10

The equivalent SAN model has a product state space containing 28,579,716 states, of which only 402,732 are reachable. The complete description and .san file for this model can be obtained from the Peps web page [15].
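Both counts can be reproduced directly from the capacities and the reachability function. The following brute-force enumeration is illustrative only (not Peps code); it uses the fact that queues 2 and 3 are unconstrained by the reachability function, so they contribute a plain multiplicative factor.

```python
from itertools import product

# Queue capacities and closed-class population from the example.
K = {1: 10, 2: 5, 3: 5, 4: 8, 5: 8}
N2 = 10

# Product state space: one automaton per (queue, class) pair, each with
# K[i] + 1 local states; queues 2 and 3 serve only class 1.
pss = (K[1] + 1) ** 2 * (K[2] + 1) * (K[3] + 1) * (K[4] + 1) ** 2 * (K[5] + 1) ** 2
print(pss)  # 28579716

# Reachable state space: apply the reachability function to the automata
# of queues 1, 4, 5, then multiply by the free factor for queues 2 and 3.
rss = 0
for a1c1, a1c2, a4c1, a4c2, a5c1, a5c2 in product(
        range(K[1] + 1), range(K[1] + 1),
        range(K[4] + 1), range(K[4] + 1),
        range(K[5] + 1), range(K[5] + 1)):
    if (a1c1 + a1c2 <= K[1] and a4c1 + a4c2 <= K[4]
            and a5c1 + a5c2 <= K[5] and a1c2 + a4c2 + a5c2 == N2):
        rss += 1
rss *= (K[2] + 1) * (K[3] + 1)
print(rss)  # 402732
```

The enumeration confirms the figures quoted above: a reachable fraction of under 2% of the product state space.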
5 Numerical Results
Table 1 shows some numerical results obtained with Peps 2003. The columns indicate the model, its product state space (pss) and reachable state space (rss) sizes, and the time for a single vector-descriptor multiplication with Peps 2000 and with Peps 2003 using the extended (E-Sh) and the fully reduced (FR-Sh) shuffle algorithms.

Table 1. Peps 2003 numerical results

                                                           1 iteration time
Model                              pss         rss         PEPS2000  PEPS2003 E-Sh  PEPS2003 FR-Sh
RS with functions     N=20, R=1    1,048,576   21          23.1      14.5            1.5
RS with functions     N=20, R=5    1,048,576   21,700      23.1      14.5            2.0
RS with functions     N=20, R=10   1,048,576   616,666     23.1      14.5           15.8
RS with functions     N=20, R=15   1,048,576   1,042,380   23.2      14.6           28.0
RS with functions     N=20, R=19   1,048,576   1,048,575   23.2      14.6           29.2
RS with functions     N=20, R=20   1,048,576   1,048,576    1.0       1.1           17.7
RS with synch. events N=20, R=1    2,097,152   21           4.8       7.7            4.6
RS with synch. events N=20, R=5    6,291,456   21,700      22.2      22.8            7.1
RS with synch. events N=20, R=10   11,534,336  616,666     43.3      41.0           28.8
RS with synch. events N=20, R=15   16,777,216  1,042,380   61.1      60.7           52.9
RS with synch. events N=20, R=19   20,971,520  1,048,575   76.0      74.7           59.3
RS with synch. events N=20, R=20   22,020,096  1,048,576   77.4      76.7           59.6
FSA with functions C=25            33,554,432  33,554,432  245.2     154.9           -
FSA with synch. events C=25        33,554,432  33,554,432  339.1     328.4           -
Mixed Queueing Network             28,579,716  402,732      79.1      52.8          71.4
Examining the results of the "resource sharing" model with functions⁴, the values obtained for Peps 2000 show that the cost of function evaluation is high (when N = R, Peps automatically removes the functions). The small variations come from the fact that the number of function evaluations changes according to R. This cost is decreased by a factor of 2/3 when using compiled functions in Peps 2003. The use of the FR-Sh algorithm gives better times when the size of the reachable state space is less than 50% of the product state space. The curve plotted in the first graph of Fig. 6 shows this more clearly.

⁴ No distinction between the RS1 models with the same or different rates is made, since their numerical results are identical.
[Fig. 6: two plots of the one-iteration time versus the number of resources (0–20) for the RS models with N=20 — left: the model with functions; right: the model with synchronizing events — each comparing PEPS2000, PEPS2003 E-Sh, and PEPS2003 FR-Sh.]
Fig. 6. Computational time for RS models
The results of the resource sharing model with synchronizing events are similar for Peps 2000 and Peps 2003, as there are no functions in the model. On the other hand, the use of the FR-Sh algorithm is beneficial even when the ratio between the reachable state space and the product state space is high, as demonstrated by the curve in the second graph of Fig. 6. For the "first available server" example, the reachable state space is equal to the product state space, so the FR-Sh algorithm does not bring any benefit. The model with functions has better times in Peps 2000 than the model without functions because, even with the same state spaces, the descriptor is more complex with synchronizing events than with functions. Moreover, for the model with functions, Peps 2003 provides better performance still. This is a case in which the benefits of using functional rates are clear. For the "mixed queueing network" example, a curious phenomenon appears. The rather good ratio (less than 2%) between reachable and product state space would suggest a much better performance from the sparse vector techniques. However, the complexity of the SAN Markovian descriptor of this model seems to make them unsuitable here. In this model, almost all the transition rates are associated with synchronizing events, and very few are associated with local events. The FR-Sh algorithm incurs an overhead for descriptor parts that refer to synchronizing events, because intermediate computation vectors may leave the reachable state space. In any case, a more complete study of the benefits, and possibly some improvements, of FR-Sh sparse vector algorithms applied to such models is a natural topic for future work, now facilitated by the availability of Peps 2003. A natural extension of this work includes studies aimed at finding the best techniques for each class of problem modeled by SAN.
Also, the growth of interest in structural representations and the rapid evolution of numerically efficient methods suggest further versions of Peps. The authors' current work will provide, in the near future, a new version of Peps that will include new algorithms to perform automatic lumping of models with replicas, and that will offer simulation algorithms to solve SAN models.
The Peps Software Tool
References
1. M. Ajmone-Marsan, G. Balbo, G. Conte, S. Donatelli, G. Franceschinis. Modelling with Generalized Stochastic Petri Nets. John Wiley, 1995.
2. K. Atif and B. Plateau. Stochastic Automata Networks for Modeling Parallel Systems. IEEE Transactions on Software Engineering, v.17, n.10, pp. 1093–1108, 1991.
3. A. Benoit, B. Plateau and W.J. Stewart. Memory-efficient Kronecker algorithms with applications to the modelling of parallel systems. To appear in PMEO-PDS'03, 2003.
4. G. Ciardo, A.S. Miner. A Data Structure for the Efficient Kronecker Solution of GSPNs. In: Proc. 8th International Workshop on Petri Nets and Performance Models, 1999.
5. S. Donatelli. Superposed Stochastic Automata: a Class of Stochastic Petri Nets with Parallel Solution and Distributed State Space. Performance Evaluation, v.18, pp. 21–36, 1993.
6. P. Fernandes. Méthodes Numériques pour la Solution de Systèmes Markoviens à Grand Espace d'États. Thèse de doctorat, Institut National Polytechnique de Grenoble, France, 1998.
7. P. Fernandes, B. Plateau and W.J. Stewart. Efficient Descriptor-Vector Multiplication in Stochastic Automata Networks. Journal of the ACM, v.45, n.3, pp. 381–414, 1998.
8. J. Fourneau, B. Plateau. A Methodology for Solving Markov Models of Parallel Systems. Journal of Parallel and Distributed Computing, v.12, pp. 370–387, 1991.
9. E. Gelenbe, G. Pujolle. Introduction to Queueing Networks. John Wiley, 1997.
10. J. Hillston. A Compositional Approach to Performance Modelling. Ph.D. Thesis, University of Edinburgh, United Kingdom, 1994.
11. J.K. Muppala, G. Ciardo, K.S. Trivedi. Stochastic Reward Nets for Reliability Prediction. Communications in Reliability, Maintainability and Serviceability, v.1, n.2, pp. 9–20, 1994.
12. B. Plateau. De l'Évaluation du Parallélisme et de la Synchronisation. Thèse de Doctorat d'État, Université Paris-Sud, Orsay, France, 1984.
13. B. Plateau. On the Stochastic Structure of Parallelism and Synchronization Models for Distributed Algorithms. In: Proc. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, Austin, Texas, 1985.
14. B. Plateau, J.M. Fourneau, K. Lee. PEPS: A Package for Solving Complex Markov Models of Parallel Systems. In: R. Puigjaner, D. Potier, eds., Modelling Techniques and Tools for Computer Performance Evaluation, 1988.
15. Peps team. Peps 2003 Software Tool. On-line document available at http://www-apache.imag.fr/software/peps, visited Feb. 14th, 2003.
16. Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1995.
17. W.H. Sanders, J.F. Meyer. A Unified Approach for Specifying Measures of Performance, Dependability, and Performability. Dependable Computing for Critical Applications, v.4, pp. 215–238, 1991.
18. W.J. Stewart. MARCA: Markov Chain Analyzer. IEEE Computer Repository No. R76 232, 1976.
19. W.J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.
20. Sun Microsystems. The JIT Compiler Interface Specification. On-line document available at http://java.sun.com/docs/jit_interface.html, visited Feb. 14th, 2003.
The Modest Modeling Tool and Its Implementation

Henrik Bohnenkamp¹, Holger Hermanns¹,², Joost-Pieter Katoen¹, and Ric Klaren¹

¹ Formal Methods and Tools Group, Department of Computer Science, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
² Department of Computer Science, Saarland University, D-66123 Saarbrücken, Germany
Abstract. This paper is about the tool suite Motor, which supports the modeling and analysis of Modest specifications. In particular, we discuss its tool architecture and the implementation details of the tool components that already exist: the parser, the SOS implementation, an interactive simulator, and a state-space generator. As the expressiveness of Modest goes beyond existing notations for real-time as well as probabilistic systems, the implementation of these tool components has a non-trivial intrinsic complexity.
1 Introduction
Contrary to traditional software engineering, where correctness issues prevail, non-functional aspects such as reliability and performance are of paramount importance for embedded software design [14,23]. Nowadays, specification languages are used for the description of an embedded system's behaviour at the various design stages. Some of these languages provide support for the representation of quantitative aspects, and a few, such as extensions of the Corba IDL [29,10,15], have proved useful in a very pragmatic system engineering context, where they provide guidelines for run-time adaptation to ensure certain non-functional properties. These languages, however, lack a precise semantical meaning, and therefore do not support quantitative analysis at design time. Rigorous specification formalisms, on the other hand, such as stochastic process algebras [21,19,6,12], do have such precise semantics, but their learning curve is typically too steep from a practitioner's perspective. Recently, we have developed the specification language Modest, which covers a wide spectrum of modeling concepts, possesses a rigid, process-algebra style semantics, and yet provides modern and flexible specification constructs [11]. Modest specifications constitute a coherent starting point for analysing distinct system characteristics with various techniques, e.g., model checking to assess functional correctness and discrete-event simulation to establish the system's reliability. Analysis results thus refer to the same system specification, rather than to different (and potentially incompatible) specifications of system perspectives

P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 116–133, 2003.
© Springer-Verlag Berlin Heidelberg 2003
like in the UML. Modest is a modeling language that has a rigid formal basis and incorporates several ingredients from light-weight notations (e.g., SDL and the UML), such as exception handling. Modest has a stochastic process algebra 'core' and contains features such as simple data types, structuring mechanisms like composition and abstraction, atomic statements, non-deterministic and probabilistic branching, and timing. With Modest, we take a single-formalism, multi-solution approach. Our view is to have a single specification that addresses various aspects of the system under consideration. Analysis takes place by extracting simpler models from Modest specifications that are tailored to the specific properties of interest. For instance, for checking reachability properties, a possible strategy is to "distill" an automaton from the Modest specification and feed it into an existing model checker such as the one provided by Cadp [16]. On the other hand, for carrying out an evaluation of the stochastic process underlying a Modest specification, a discrete-event simulator can be used. Viewed differently, one may consider Modest as an overarching notation for a spectrum of models, ranging from ordinary finite-state automata, to timed automata [7], to discrete-event stochastic processes such as stochastic automata [12] and Markov decision processes [28]. Our Contribution. In this paper, we focus on the tool Motor (Modest Tool enviRonment), which aims to provide the means to analyse and evaluate Modest specifications. The tool is written in the C++ programming language. The reason for this choice was the good trade-off between speed of the implementation, modularity of the code, and easy extensibility. We first discuss the architectural issues that arose in the planning phase of Motor. Since Modest is capable of describing many different aspects of a system design, a tool supporting such a language has a non-trivial intrinsic complexity.
The second topic and main contribution of this paper is the implementation of the structural operational semantics of Modest. Although tools for process algebraic languages have been around for some time (several of them are mentioned below), only a few are written in object-oriented languages like C++ or Java, and, to our knowledge, their implementation details have not been described in the literature. Our paper is therefore a practical contribution to tool development for process-algebraic specification languages like Modest in an object-oriented programming environment. Related Work. Multiformalism Approaches. In our approach, a single formalism is used to describe the various aspects of the system under study. Alternatively, the "multiformalism multisolution" approach in Möbius [13] supports several modeling formalisms and various solution methods. Earlier work in this direction has been reported for the tool Sharpe [30]. The APNN-ToolBox [3,2] also supports the integration of multiple modeling formalisms into a single software tool environment, as does GreatSPN [9]. Open Tool Architectures. Cadp [8,17] is a widespread toolkit for the design and verification of complex systems. The toolbox is designed as an open platform
118
H. Bohnenkamp et al.
for the integration of other specification, verification and analysis techniques. Similar ideas are implemented in the Acf prototype [24], which provides Apis for standard state-space exploration functions, and they can also be found in Möbius [13].
Stochastic Process Algebra Tools. Various software tools exist that support the modeling and analysis of stochastic process algebras (SPAs). The PEPA Workbench [18], the TIPPtool [20], and TwoTowers [5] (which supports EMPA) are some prominent examples of such tools. The Modest language incorporates most functionality of the SPAs supported by these tools, apart from priorities.
2 The Modest Modeling Language
In this section, we highlight the most important features of Modest, as far as they are needed to understand the rest of the paper. For a complete overview, we refer the reader to [11].

2.1 The Modest Concepts in a Nutshell
Modest is a process algebra, enhanced with some additional features. The most basic activity is an action. Similar to process algebras like CSP, ACP and CCS, Modest actions are combined by means of operators, building processes: alt and do describe a choice between processes (terminating and looping, respectively), par describes parallel composition, and hide and relabel are the operators for hiding and relabelling actions. Modest allows probabilistic choice to be described by means of the palt construct: after the execution of an action, a successor state can be chosen probabilistically. It has a notion of guard (a boolean condition describing when an action is allowed to execute) and deadline (a boolean condition describing when an action must fire at the latest), described by the when and urgent operators, respectively. To make guards and deadlines work, simple data structures are incorporated, and a notion of clock is used: a variable-like entity that changes its value linearly and continuously with time. Data and clocks can be evaluated in guards and deadlines, and can be manipulated in palt constructs.

2.2 Stochastic Timed Automata
The formal semantics of Modest has been defined in [11]. We briefly present the semantic objects, called stochastic timed automata (Sta), that are derived from Modest specifications by means of a structural operational semantics. An Sta consists of a set L of (control) locations, and a set of labelled transitions of the form s −(a,g,d)→ P, where s is a location, a an action, g a guard, and d a deadline. P is a discrete probability distribution on pairs of the form ⟨s′, A⟩, where s′ is a location and A is a (possibly empty) sequence of assignments. The probability distribution is used to express discrete probabilistic branching. We denote by P(⟨s′, A⟩) the probability that the pair ⟨s′, A⟩ is chosen.
The Modest Modeling Tool and Its Implementation
119
An Sta is a very abstract representation of a Modest model. It describes the potential moves between locations. A more concrete description is obtained if, besides locations, we consider the valuation of data variables and clocks. A valuation assigns to each variable its value. A location together with a valuation is called a state. For a given valuation, an Sta transition is interpreted as follows. The system is allowed to fire a transition s −(a,g,d)→ P whenever it is at location s and the guard g is true under the current valuation. As soon as the deadline d becomes true (say, at time t), the system is obliged to fire the transition at time t without any delay, if it has not been fired before. If the guard is false when the deadline becomes true, the whole system locks: the semantics of Stas does not allow time to progress anymore. This situation is called a timelock. The system is thus allowed to wait in location s as long as no deadline of any of its outgoing transitions becomes true. Once the transition s −(a,g,d)→ P is executed, the system moves with probability P(⟨s′, A⟩) to location s′, assigning values to variables (i.e., changing the valuation) according to A in an atomic manner.
3 The Motor Tool Architecture
The Modest language is very expressive and has been designed without a focus on a particular analysis technique covering the entire language spectrum. In fact, many analysis problems will be undecidable, or at least much too complex to be tackled effectively. Therefore, to analyse and evaluate Modest specifications, it is necessary to have sound and well-designed means to project specifications, so that the projected model can be analysed with a particular technique while still providing meaningful information about the original model. From the perspective of a tool developer, it appears worthwhile to design a Modest tool that (i) provides interfacing capabilities for connection to existing tools for specific projected models, but (ii) also provides means for enhancement by native algorithms for the analysis of (classes of) Modest specifications. These requirements induce a natural architectural structure for Motor, as described in the next section.

3.1 Overview
The philosophy behind Modest as described earlier suggests a specific architectural setup. The principal idea is to hide the core functionality, i.e., the language-specific parts that are common to several subsequent analysis and postprocessing functions, behind a set of well-designed software interfaces. These interfaces enable the connection of the Motor core (called the core module) to several other, external tools. To facilitate easy extendibility, the Motor core is equipped with a modular structure, cf. Figure 1. In the following, we describe the core module and the interfaces that serve as docking points for other, so-called satellite modules.
[Figure 1: a Modest specification (ASCII file) enters the Modest core module, which exposes the First-state/Next-state Interface (FNS) and the AST Interface; satellite modules attach to both interfaces and connect to external tools.]

Fig. 1. The Motor architecture
3.2 The Core Module
The core module takes as input a Modest specification, which has been composed with an ordinary text editor. The core module consists of two parts: the language parser, which produces a parser abstract syntax tree (PAST) from a Modest specification, and an implementation of the operational semantics (SOS). Two programming interfaces are offered to the user: the first/next-state interface (Fns), and the AST (abstract syntax tree) interface.
The First/Next-State API. This interface provides access to the global state space of the Modest specification at hand and is similar to the OPEN/Cæsar interface of the Cadp tool environment [16], or the concepts behind the Analyser Component Framework (Acf) [24]. The user of the interface (which can be another software component) can query the initial and the current state of the Modest specification, can derive the transitions that are enabled from a given state, and can derive successor states by firing transitions. The mechanisms behind the Fns only keep information about the state that was reached by firing the last transition. The Fns does, however, provide external representations of the current state; these are created and destroyed dynamically, and are used by the user to reset the internal state of the core module to the one represented by the external state. Components that make use of the Fns include state-space generators, test-case generators such as TorX [4], and discrete-event simulators.
Hierarchical Semantics. The Fns is a first step towards realising our aims behind Modest. The described interface provides access to Sta states and transitions. The Sta is the most abstract representation of the behaviour of a Modest specification. To enable the use of different analysis algorithms, it must be possible to express more concrete representations of behaviour. A very simple example of such a concretisation is obtained by taking data into account (cf. Section 2.2): an Sta does not have a notion of memory or store.
The guards and assignments on Sta transitions remain completely uninterpreted. A concretisation of the Sta
would then be to enhance the definition of state by a notion of valuation, a function that keeps track of the values assigned to variables. The concretisation of a transition then comprises the interpretation of guards and deadlines according to the valuation of the source state, and the modification of valuations according to assignments. To describe concretisations of Sta, we will use the layered semantics approach [1], which allows one to define (more concrete) transition systems in the deductive style of a structural operational semantics, based on abstract transition systems. This approach allows hierarchies of semantical concretisations to be defined. From an implementor's point of view, the layered semantics approach is best realised by means of class inheritance: the states of a concrete semantical layer are realised as subclasses of the class representing abstract states. Accordingly, more concrete transitions are described by subclasses of the classes describing abstract transitions.
AST Interface. The second API is the AST interface, which provides access to the abstract syntactic description of a Modest specification. Since some of the tools that we are aiming to connect to have a high-level input formalism, it is more convenient to translate the Modest specification directly into that language. This can usually be performed by recursive traversals of the abstract syntax tree of the specification.
Satellite Modules. The satellite modules provide the real functionality of Motor. They implement either adaptor modules, which bridge the gaps between the core and other, external tools, or ‘native’ modules, which implement analysis algorithms within the Motor framework. Satellite modules make use of the Fns or the AST interface. In the former case, a satellite module might also implement a concretisation of the Fns, providing itself a more concrete Fns.

3.3 The Current State of Implementation
Core Module. The core module has been implemented and currently provides the AST interface. The Fns API is also available. Its implementation realises data access and manipulation. Therefore, the notion of state and transition provided by the Fns is more concrete than that of an Sta. It is utilised by a state-space generator and an interactive user interface (see below). The Fns, as it is implemented now, does, however, not yet have the flexibility needed to write (layered) satellite modules on top of it.
Available Satellite Modules. Interactive Simulator. To test the implementation of the Modest SOS, we have implemented a simple interactive simulator which allows the user to examine the state of Modest specifications, derive possible transitions, check these transitions for enabledness, and execute them. This simulator basically provides a simple textual user interface to the implemented Fns API.
State-Space Generator. We have implemented a simple state-space generator which makes use of the implemented Fns. The generator can produce .dot
files [22], .aut files (a textual description of labelled transition systems), and .bcg files, a compact file format used by the Cadp tool environment. The latter format makes it possible to bridge to the verification and visualisation components of Cadp, which are mostly focused on functional behaviour, as well as to the on-the-fly test-generation tool TorX.
Möbius. Recently we finished the implementation of a satellite module that translates Modest specifications into the AFI of Möbius [13]. Modest is incorporated as a new atomic model specification formalism into Möbius. Naturally, the module makes use of the Ast interface.
Concluding Remarks. Note that we have not addressed the issue of how to specify the properties to be analysed by target tools, and how to get the results back. This is currently an open issue. The diversity of tools we plan Motor to connect to makes it difficult to develop a clean concept a priori. However, we believe the approach we have chosen is flexible enough to support all kinds of solutions to this problem, either by providing dedicated user interfaces or preprocessors for Motor and the target tool, or by letting Motor vanish completely under the hood of the target tool (as is the case with Möbius).
4 Implementation of the Core Module
In this section, we describe the implementation of the Motor core module. The core module currently comprises about 23,000 lines of code. For code generation we used the latest version of the GNU g++ compiler. The core module comprises a language parser and the implementation of the Modest SOS. Motor has a simple command-line user interface and takes a plain ASCII file with the Modest specification as input.

4.1 The Parser
The parser module converts a concrete-syntax representation of a Modest specification into a PAST (parser abstract syntax tree). During the parse run, a first set of semantic and syntactic checks is performed. Emphasis has been put on providing informative feedback to the user about errors in the input, ranging from simple syntactic errors to typing errors. The parser module relies heavily on the parser construction tool ANTLR [25]. ANTLR is also used in other modules to generate tree walkers used for the conversion of a PAST into more specific representations.

4.2 The SOS
The SOS module uses the AST API, and provides the Fns API. As described in Section 3.3, we implemented a notion of state and transition more concrete than Sta. A state consists of a location, which is a syntactic entity, and a
valuation of all variables and clocks that are defined and visible (i.e., relevant) in the current location. The SOS module encapsulates this information. Our main design objectives for the implementation of the SOS module were: (i) it should follow the structure of the SOS rules [11] as closely as possible, such that correctness of the implementation can be ensured to a considerable extent by simple code inspection; (ii) it should support full Modest, in particular allowing arbitrary nesting of operators. This means that processes can be created and terminated dynamically. The design of the necessary data structures is thus highly challenging.
Representing Locations. A structural operational semantics (SOS) typically defines a transition system describing the behaviour of a syntactic entity [27]. The deductive nature of SOS suggests a recursive algorithm to derive the transitions (and therefore, the locations) of the underlying transition system. An SOS rule has the following form:

    P1 −l1→ P1′   · · ·   Pn −ln→ Pn′
    ─────────────────────────────────   (side condition),
        op(P1, . . . , Pn) −l→ P
which means that the outgoing transition(s) of the location op(P1, . . . , Pn) (where op is an n-ary language operator) are defined in terms of the outgoing transition(s) of the locations P1, . . . , Pn. A transition for op(P1, . . . , Pn) can be derived provided that the side condition is true, and each of the preconditions Pi −li→ Pi′ is satisfied. As the behaviour of each language operator is described by its SOS rules, the behaviour (in terms of outgoing transitions) of a complete specification can be determined by recursively computing the outgoing transitions of its sub-processes. This recursive scheme has been implemented. We do not use syntactic term rewriting to transform locations into successor locations, since this would be too inefficient in time as well as in space. We instead decided to use a shared data structure to represent locations. Since all locations that can be derived by means of the SOS are actually (combinations of) sub-expressions of the initial location, we use a single AST that describes the whole Modest specification at hand, and add attributes to the AST to denote the current location to be represented. Since sub-trees of an AST correspond to sub-expressions on the concrete syntactic level, such an attributed AST can indeed express all possible sub-expressions (i.e., locations) of a Modest specification. The inner nodes of the AST correspond to the language operators of Modest and have an arity n ≥ 1, like do, par, alt, etc. The leafs correspond to atomic processes, i.e., plain actions, exceptions, break, etc. These can be seen as language operators with arity 0. The attributes we added to the inner nodes express which of the respective sub-trees currently contributes to the current location, i.e., is currently active.
Example. Figure 2 depicts two ASTs of the Modest process alt { :: a; e :: b; c; d }.
An alt node represents a nondeterministic choice between its two sub-trees, while a seq node represents the sequential composition of its sub-trees (in this case, just leafs representing actions). Nodes are attributed: alt nodes have an attribute chosen indicating which of their sub-trees has executed an action first, while seq nodes have an attribute current denoting which of their sub-trees is currently allowed to execute actions. Nodes are either active (gray) or inactive (white).

[Figure 2 shows two attributed ASTs for this process: (a) the initial location, (b) the location reached after executing b.]

Fig. 2. Example for attributed AST

Figure 2 (a) denotes the initial location of the process: the alt node has not chosen a branch yet (chosen = ↑), and both seq nodes activate their respective first sub-tree (current = 1), the atomic processes a and b, respectively. Figure 2 (b) denotes the location reached after the b action has been executed. Attribute chosen equals 2, since the right sub-tree has executed the b action. The left sub-tree of the alt is deactivated, as it cannot exhibit any behaviour anymore. For the right seq node, current = 2, since the first sub-tree was the b action, which has been executed already. Therefore, the second action is active now.
Representing locations by means of ASTs has an interesting property that we exploit in our implementation: a location is uniquely determined by the set of active leafs; the ancestor nodes of all active leafs are also active, and no other nodes are. Therefore, determining the active leafs suffices for establishing the active inner nodes and their attributes.
Deriving Transitions. The primary purpose of the SOS implementation is to derive the outgoing transitions from the current location. This is accomplished by an algorithm that, for a given node of the AST, returns all the transitions that the process represented by this node can perform. An outline of this algorithm for a node that represents the operator op is:
Step 1: Get the transitions of all sub-trees that are active according to the attributes of the node (this step is obsolete for atomic processes).
Step 2: Combine the thus derived transitions into new transitions according to the SOS rules defined for op. For atomic processes, a single, constant transition is returned.
The implementation of the different SOS rules is straightforward and, for lack of space, not further described here.
Firing Transitions. The second purpose of our SOS implementation is to administer location changes, i.e., to fire transitions. To explain how this is implemented, we must first explain how we represent transitions. Firing a transition actually means executing actions. In our implementation, a transition contains a collection of references to the leafs in the AST that represent the actions to be executed. If a transition contains more than one of these references, all the referenced leafs/actions take part in a synchronisation. All of these leafs/actions have to be ‘executed’, and the ‘execution’ of a leaf/action is a two-step process:
Step 1: A recursive bottom-up algorithm propagates to all nodes on the path from the current leaf to the root that an action has been executed. This information is important for do and alt nodes: based on it, they will deactivate all their respective sub-trees that represent the branches of the choice that have “lost” the competition (unless, of course, the choice has been decided earlier already, in which case nothing happens).
Step 2: A second recursive algorithm runs bottom-up from the leafs to indicate termination: it informs the respective parent node that the sub-process of the calling node has terminated. The parent node then has to interpret this information relative to its own attributes. For example, a seq node that is informed that one of its sub-trees has terminated will activate the next sub-tree and update its own attributes. In case there is no next sub-tree, the seq node will inform its own parent node that it has terminated itself. Both steps are actually combined into one, but are described here separately for reasons of clarity.
Process Instantiation and Recursion. An important concept in Modest is that of process definitions and process instantiations. In particular, this allows us to specify recursive processes. In our implementation, a process definition is a mapping from a process name (a string) to an AST, as described above.
A process instantiation in the concrete syntax is interpreted as a unary operator with one parameter: the name of the process to be instantiated. Instantiations are therefore represented in the AST by nodes of type inst, which have an attribute denoting the process name. An inst node has (among others) the following tasks to perform: (1) it has to serve as a place holder for a process; (2) once it is enabled by a parent node, it has to make a copy of the AST of the process that it represents, and make itself the root of this copy. This means that during run time the AST of the overall process is dynamically extended. Once the sub-tree of an inst node has been generated, the node behaves neutrally with respect to transition generation and firing. Other tasks of inst will be discussed later.
Probabilistic Branching. The palt operator has several tasks in the Modest language. It describes a transition from one location to another and expresses probabilistic branching. Each branch can have assignments. Its behaviour can be viewed as a weighted hyperedge connecting a single source location and possibly several target locations. In our implementation, the palt construct is represented by a palt node, which always has an atomic process as a sub-tree (the palt action), and zero or more sub-trees that denote the possible successor processes. The probabilistic choice itself, which is described by a palt process, is
not explicitly implemented. Instead, the user has to decide what to do in case of a probabilistic branching. A state-space generator might just want to traverse all possible branches, without taking probabilities into account. A stochastic simulator, on the other hand, might want to choose the next branch randomly, according to the given probability distribution. A palt node has an attribute denoting the branch that is going to be activated after the palt action has been executed. This attribute has to be set externally by the user.
From Locations to States. We will now describe our approach to accessing and modifying variables. Recall that the state of a Modest specification is determined by its control location together with the valuation of all data variables and clocks that are defined and visible in the current location. The most involved part of our implementation is obtaining states from locations. Note that the implementation of the data part in the core module is already a concretisation of the Modest SOS. The current implementation of the data part, being in an early prototype state, does not yet comply with the architectural ideas we discussed in Section 3, but is rather a monolithic approach. Data is manipulated in the palt construct by means of assignments. Data is accessed in when constructs, which describe enabling conditions for transitions, and in urgent constructs, which describe when a transition must fire at the latest.
Fun with Boolean Functions. Before we present the details of our approach to handling data, we first describe the evaluation of guards and deadlines. For the sake of simplicity, we refer to both as conditions. Conditions are Boolean expressions, and one way to deal with them is simply to evaluate them, i.e., to find out whether the expression evaluates to true or to false.
In Modest, this approach is not sufficient, since we have to distinguish between two types of conditions: static conditions and dynamic conditions. Static conditions are those that do not refer to clocks; dynamic conditions are those that do. The important difference is that static conditions never change their evaluation over time, unless the valuation of the variables they refer to is changed (which can only happen with the firing of a transition). For dynamic conditions this is different: since they depend on clocks, their evaluation can change over time. Therefore, dynamic conditions implicitly define Boolean functions b : ℝ+ → {true, false} that depend on one continuous parameter: time. In our implementation, we have incorporated an explicit representation of these Boolean functions, since it must be possible to derive the earliest time at which a guard or deadline becomes true. In the following, we will denote these functions as BoolFun.
Declarations and Data Blocks. Assignments, guards and deadlines refer to data, and thus it is necessary to make the data values accessible. Declarations are the syntactic entities in a Modest specification that describe the structure of the data to be used. There are two levels of scope in Modest: the global scope, in
which access to global variables is possible, and local scopes, which are delimited by the body of a process definition. A variable needs a certain memory space to which its values can be written and from which they can be read. Therefore, in our implementation, we have to reserve memory for declared variables. Variables that are defined in the same scope can be grouped together, i.e., we allocate memory blocks which are large enough to hold the values of all variables that are declared in the same scope. How does the data fit into the concepts we have introduced so far? Observe that every instantiation of a process with variable declarations opens a new scope and forces memory to be allocated for these variables upon activation of the process. To that purpose, we introduced an attribute for inst nodes that points to the memory block storing the values of the variables declared in the process definition represented by the inst node. Allocation and deallocation of this data block is done during the generation of the sub-tree of the inst node.
Accessing and Modifying Data. There are two concepts in Modest which access data. Guards and deadlines access variables without modifying them. Assignments, on the other hand, may also modify them. There are two principally different styles of implementing the handling of guards, deadlines and assignments: the first is interpreter-oriented, the second compiler-oriented. In the first approach, a software component must be implemented that interprets the (abstract) syntactic description of conditions and assignments and takes the necessary steps to provide, for example, Boolean values that describe the enabledness of conditions, or to modify the data according to an assignment. In the second, compiler-oriented approach, a software component is generated at execution time of the tool, which is then compiled and dynamically linked back into the running tool. We have implemented a combination of both approaches.
We interpret all composed conditions, i.e., all conditions that are composed by means of the logical and, or, or not operators. We call conditions that are not composed primitive. The tool restricts primitive conditions to be of the form c ∼ expr or expr ∼ expr, where c is a clock, expr is an arbitrary arithmetic expression not referring to clocks, and ∼ ∈ {<, ≤, >, ≥}. For primitive conditions and assignments we rely on a compiler-oriented approach: for both concepts, C functions are generated. This relieves us of the need to also interpret the arithmetic expressions that occur in assignments or conditions. Although an interpreter-oriented approach is straightforward to implement, a compiler-oriented approach is much more efficient, since expensive traversals of the syntax tree of the expressions and table look-ups for data access are not necessary. For each primitive condition, two C functions are generated. The first function evaluates the guard and returns true or false, depending on the data that is passed as a parameter to the function. The second C function returns the BoolFun for the atomic condition, i.e., the partially evaluated condition which has time as its only remaining parameter. These two functions are provided regardless of whether the condition is dynamic or static. In the latter case, conditions can be represented as BoolFuns that are constantly true or false.
Additionally, each assignment is translated into a C function. The generated C file contains all the generated functions and is then compiled automatically. The resulting object file is dynamically linked into the running program, and the names of the generated functions are resolved to function pointers. These function pointers are then assigned to attributes in the respective nodes of the AST: the pointers to assignment functions are assigned to attributes in the corresponding palt node, and the pointers to the condition-related functions are assigned to attributes in the when and urgent nodes.
Parameter Passing. Process definitions can have parameters. Parameters are treated as local variables in the process definition. The difference from “normal” local variables is that they have to be initialised when a process is instantiated. For this reason, a parameter-passing C function is also generated, which takes care of the initialisation of the respective variables.
External State Representation. In our implementation, the state of a Modest process is defined implicitly by the attributes of the AST and the allocated memory blocks holding the data. In order to do an exhaustive state-space generation, it is, however, necessary to have an explicit state representation which allows state comparisons. This process is often called state-matching, and makes it possible to determine whether a state has been visited before or not. Clearly, it is much too inefficient to copy the complete AST and all memory blocks whenever a state change has occurred. Here, our earlier observation helps: the current location of a Modest process is already defined by the set of active leafs in the AST. The active leafs define a set of paths through the AST to the root of the AST. So our external, explicit state representation exhibits information about the active leafs. Additionally, an external state representation has to know about the valuations of the variables of the currently active processes.
As indicated earlier, each process instantiation has its own private memory block which contains the data. In the external state representation, we store the pointers to these memory blocks. This is combined with a copy-on-write mechanism to implement state changes caused by assignments. While executing a transition, assignments may modify data, and this requires copying the data space of the state. By using pointers, however, copying the data means just copying these pointers, except for memory blocks which are subject to modification due to assignments. Only these memory blocks are actually copied, and the necessary changes are then applied to the copy, as specified by the assignment. Relative to a naïve copying of entire data spaces, this approach has the following advantages:
– the copying of states is much faster, since only pointers are copied.
– if memory has to be copied, only the memory block that really changes is copied.
– since memory is shared, memory utilisation improves.
– state-matching is faster, since it can utilise pointer comparison: if we have to compare the memory blocks of two process instantiations, a comparison of
The Modest Modeling Tool and Its Implementation
129
the respective pointers in the external state representation might show that they point to the same memory block. If they are identical, then it is assured that both process instantiations in both states hold identical data. Only if they differ is it necessary to compare the memory blocks themselves. Below, we describe state matching in greater detail.

In the following, we assume that the nodes of an AST are uniquely numbered. These node numbers are used for the external state representation. More precisely, our external state representation comprises the following ingredients:
1. the set of node numbers of the active leaves in the AST. This defines implicitly and uniquely all relevant paths to the root of the AST, all active inner nodes, and therefore the current location of the state;
2. a map from node numbers to memory pointers. The domain of this map is the set of node numbers of those inst nodes that lie on one of the paths from the active leaves to the root of the AST. The node number of an inst node is mapped to the pointer of the memory block that is maintained in this inst node.

State Matching. State-space generators have to keep track of states that have been reached already in order to detect loops. The standard generic container data structures of the C++ Standard Template Library (set, list, stack, queue, etc.) expect the definition of equality (==) and a strict order (<) on element classes. In the following, we define equality and the order on states, and therefore refrain from the pointer comparison described above. Using pointer comparisons is an implementation detail for efficiency, which we can exploit thanks to the copy-on-write mechanism described above; including it in the definition would only complicate things unnecessarily. For illustration purposes, we assume two states, s and s′, and denote their sets of leaf numbers as s.l and s′.l, respectively.
By s.m we denote the mapping from node numbers to pointers. Then dom(s.m) is the domain of the map, and, for n ∈ dom(s.m), s.m(n) denotes the pointer of the memory block mapped to. Abusing C/C++ notation, we denote by ∗s.m(n) the memory block that is pointed to by s.m(n). Equality is then defined as follows:

s == s′, if s.l = s′.l ∧ ∀n ∈ dom(s.m): ∗s.m(n) == ∗s′.m(n).

In order to define <, we assume that there is a strict order ≺ on sets of integers. We also assume w.l.o.g. that, if dom(s.m) = {n₁, …, n_k}, then n₁ < n₂ < ⋯ < n_k. Then we define s < s′ iff

s.l ≺ s′.l ∨ (s.l = s′.l ∧ [∗s.m(n₁) < ∗s′.m(n₁) ∨ (∗s.m(n₁) = ∗s′.m(n₁) ∧ [∗s.m(n₂) < ∗s′.m(n₂) ∨ ⋯ ∨ (∗s.m(n_{k−1}) = ∗s′.m(n_{k−1}) ∧ ∗s.m(n_k) < ∗s′.m(n_k)) ⋯ ])]).

We use brackets to emphasise the nesting of the formula.
H. Bohnenkamp et al.

4.3 Classes
In this section, we briefly describe the classes that are used to implement the concepts described in the previous section.

BoolFun. The class BoolFun implements the BoolFuns. The BoolFuns make use of a class Flank of so-called flanks. Flanks denote the changes of a BoolFun from false to true or vice versa. A Flank object is basically a triple (d, u, c), where d denotes the position of the flank on the time scale, u denotes whether it is a change from false to true (UP) or from true to false (DOWN), and c denotes whether the following interval is left-closed (CLOSED) or not (OPEN). The class BoolFun has the following important methods:
Flank getFirst(): returns the first flank of the function from false to true.
bool operator()(double t): truth value of the BoolFun at time t.

Guard. A Guard object represents a condition. Guard is an abstract class with several concrete subclasses: AndGuard, OrGuard, and NotGuard are guards that represent Boolean combinations of guards. The class FuncGuard represents a primitive static condition, the class TimedGuard a primitive dynamic one. The important method of all these classes is:
BoolFun getBoolFun(): returns the BoolFun defined by the Guard object.

State. The State class implements the external state representation as described before. There are two external operators defined:
bool operator==(const State& lhs, const State& rhs): checks equality of states.
bool operator<(const State& lhs, const State& rhs): checks the strict order as defined earlier.

Transition. The Transition class represents Sta transitions. A Transition object has an action name, a guard, a deadline, and an abstract notion of the probabilistic branching. The important methods of Transition are:
bool isNeverEnabled(): returns true if the guard of the transition is constantly false, independent of the time that may pass.
Guard* getGuard(), Guard* getDeadline(): return the guard and the deadline of the transition.
void Fire(coordinate_t& c, double offset): executes the transition, i.e., it initiates the state changes in the AST that lead to the internal representation of the transition's destination state. The Transition class provides methods that return a description of the possible branches and their probabilities. A branch is identified by an object of type coordinate_t, and the parameter c denotes the branch that the caller has chosen. The parameter offset is a time offset: it determines the delay before the transition actually fires. Part of the state change is then to advance the valuation of each clock (that is not being reset) by offset.
ModestModel. The ModestModel is the topmost entity in the SOS implementation. It is constructed during a traversal of the PAST. It contains all information about the specification at hand, i.e., all process definitions, and it harbours the AST used to represent states. The ModestModel provides part of the Fns. The provided methods are:
State* getInitialState(): returns a pointer to the initial state.
State* getCurrentState(): returns a pointer to the current state.
transition_list& getTransitions(State* state): returns a list of Sta transitions outgoing from state.

Process. The nodes of the AST are all subclasses of the abstract class Process. The Process class has the following methods:
transition_list& getTransitions(): returns all transitions that can be derived from the process node in the current state.
void reset(): resets the process and all its sub-processes to their initial state.
void disable(): disables the process and all its sub-processes.

An important, direct subclass of Process is ComposedProcess, the class that represents all inner nodes of the AST. It has the following additional methods:
propagateActivity(): implements the propagation mechanism described in Section 4.2.
childDone(): implements the termination mechanism described in Section 4.2.
5 Future Directions and Concluding Remarks
In this paper we have presented the design and architecture of the Modest tool environment Motor, and discussed the implementation of the Motor core module, the central component of the tool prototype. We have also briefly commented on the current state of the implementation and the existing tools to which we have already connected Motor. Our next effort will be to modify the Fns API implementation such that it is possible to implement hierarchies of semantics on top of the Sta semantics (cf. Section 3). Our second short-term aim is to connect Motor to UPPAAL [26], a model checker for timed systems. Modest is capable of expressing many different known formalisms (for an overview, see [11]), like labelled transition systems, timed automata, CTMCs, GSMPs, etc. Currently, Motor only supports simulation of GSMPs with Möbius, and allows the derivation of LTSs with a simple state-space generator. The expressiveness of Modest comes at a price. Some of the underlying model classes support concepts that cannot be treated properly in other classes. The most prominent example is nondeterminism, which usually cannot be dealt with in a stochastic analysis (be it numerical, or a simulation). Exceptions are Markov decision processes [28], for which at least bounds on the measures of interest can
be obtained. One of the problems that arises is to check whether a given Sta fulfils the restrictions dictated by a certain target formalism. For example, although Modest can express CTMCs, it is not trivial to check that the Markov property is not violated by a given Sta. It is also very difficult to detect nondeterminism in a stochastic model without doing an exhaustive state-space analysis. All these problems have to be addressed to make Modest/Motor a success, and they make the whole project very challenging.

Acknowledgements. This work has taken place in the context of the HaaST project (Verification of Hard and Soft Real-Time Systems), which is supported by the Dutch Technical Foundation (STW). We thank Pedro D'Argenio (Universidad Córdoba) for his contributions to the Modest modeling language, Theo C. Ruys for sharing his views on the tool implementation, and Michel Rosien for his work on an initial state-space generator. William H. Sanders (U. Illinois at Urbana-Champaign) is thanked for facilitating and supporting the interfacing to the Möbius framework.
References
1. P. America and J.J.M.M. Rutten. A layered semantics for a parallel object-oriented language. Formal Aspects of Computing, 4(4):376–408, 1992.
2. F. Bause, P. Buchholz, and P. Kemper. A toolbox for functional and quantitative analysis of DEDS. In R. Puigjaner et al., editors, Computer Performance Evaluation: Modelling Techniques and Tools, volume 1496 of LNCS, pages 356–359. Springer-Verlag, 1998.
3. F. Bause, P. Kemper, and P. Kritzinger. Abstract Petri nets notation. Petri Net Newsletter, 49:9–27, 1995.
4. A. Belinfante, J. Feenstra, R.G. de Vries, J. Tretmans, N. Goga, L. Feijs, S. Mauw, and L. Heerink. Formal test automation: A simple experiment. In G. Csopaki, S. Dibuz, and K. Tarnay, editors, 12th Int. Workshop on Testing of Communicating Systems, pages 179–196. Kluwer, 1999.
5. M. Bernardo, R. Cleaveland, S. Sims, and W. Stewart. TwoTowers: A tool integrating functional and performance analysis of concurrent systems. In Formal Description Techniques, pages 457–467, 1998.
6. M. Bernardo and R. Gorrieri. A tutorial on EMPA: A theory of concurrent processes with nondeterminism, priorities, probabilities and time. Theor. Comp. Sci., 202:1–54, 1998.
7. S. Bornot and J. Sifakis. An algebraic framework for urgency. Inf. and Comp., 163(1):172–202, 2000.
8. M. Bozga, J.-C. Fernandez, A. Kerbrat, and L. Mounier. Protocol verification with the Aldébaran toolset. Int. J. Softw. Tools for Techn. Transf., 1(1/2):166–184, 1997.
9. G. Chiola, G. Franceschinis, R. Gaeta, and M. Ribaudo. GreatSPN 1.7: Graphical editor and analyzer for timed and stochastic Petri nets. Perf. Ev., 24(1&2):47–68, 1995.
10. M. Cukier, J. Ren, C. Sabnis, W.H. Sanders, D.E. Bakken, M.E. Berman, D.A. Karr, and R.E. Schantz. AQuA: An adaptive architecture that provides dependable distributed objects. In SRDS, pages 245–253, 1998.
11. P.R. D'Argenio, H. Hermanns, J.-P. Katoen, and R. Klaren. MoDeST – a modelling and description language for stochastic timed systems. In L. de Alfaro and S. Gilmore, editors, Process Algebra and Probabilistic Methods, volume 2165 of LNCS, pages 87–104, 2001.
12. P.R. D'Argenio, J.-P. Katoen, and E. Brinksma. An algebraic approach to the specification of stochastic systems (extended abstract). In D. Gries and W.-P. de Roever, editors, Programming Concepts and Methods, pages 126–147. Chapman & Hall, 1998.
13. D. Deavours, G. Clark, T. Courtney, D. Daly, S. Derasavi, J. Doyle, W.H. Sanders, and P. Webster. The Möbius framework and its implementation. IEEE Trans. on Softw. Eng., 28(10):956–970, 2002.
14. S. Edwards, L. Lavagno, E.A. Lee, and A. Sangiovanni-Vincentelli. Design of embedded systems: Formal models, validation and synthesis. Proc. of the IEEE, 85(3):366–390, 1997.
15. S. Frolund and J. Koistinen. Quality-of-service specifications in distributed object systems. Distr. Sys. Eng., 5:179–202, 1998.
16. H. Garavel. Open/Cæsar: An open software architecture for verification, simulation, and testing. In Tools and Algorithms for the Construction and Analysis of Systems, volume 1384 of LNCS, pages 68–84, 1998.
17. H. Garavel, F. Lang, and R. Mateescu. An overview of CADP 2001. EASST Newsletter, 4:13–24, 2002.
18. S. Gilmore and J. Hillston. The PEPA Workbench: A tool to support a process algebra-based approach to performance modelling. In Computer Performance Evaluation, volume 794 of LNCS, pages 353–368, 1994.
19. H. Hermanns. Interactive Markov Chains, volume 2428 of LNCS. Springer-Verlag, 2002.
20. H. Hermanns, U. Herzog, U. Klehmet, V. Mertsiotakis, and M. Siegle. Compositional performance modelling with the TIPPtool. Perf. Ev., 39(1-4):5–35, 2000.
21. J. Hillston. A Compositional Approach to Performance Modelling. Cambridge University Press, 1996.
22. http://www.research.att.com/sw/tools/graphviz/.
23. E.A. Lee. Embedded software. In M. Zelkowitz, editor, Advances in Computers, volume 56. Academic Press, 2002.
24. J. Lilius. The analyzer component framework version 0.1 – a tutorial. http://aiken.cs.abo.fi/acf/ACF.pdf, Feb 2000. Draft: Revision 1.2.
25. T. Parr and R. Quong. ANTLR: A predicated-LL(k) parser generator. Journal of Software Practice and Experience, 25(7):789–810, July 1995.
26. P. Pettersson and K.G. Larsen. Uppaal2k. Bulletin of the European Association for Theoretical Computer Science, 70:40–44, February 2000.
27. G. Plotkin. A structural approach to operational semantics. Report DAIMI FN-19, Computer Science Department, Aarhus University, September 1981.
28. M.L. Puterman. Markov Decision Processes. John Wiley & Sons, 1994.
29. C. Rodrigues, J.P. Loyall, and R.E. Schantz. Quality Objects (QuO): Adaptive management and control middleware for end-to-end QoS. In OMG's First Workshop on Real-Time and Embedded Distributed Object Computing, 2000.
30. R.A. Sahner, K.S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer, 1996.
An M/G/1 Queuing System with Multiple Vacations to Assess the Performance of a Simplified Deficit Round Robin Model

L. Lenzini¹, B. Meini², E. Mingozzi¹, and G. Stea¹

¹ Department of Information Engineering, University of Pisa, Via Diotisalvi 2, 56126 Pisa, Italy
{llenzini,emingozzi,gstea}@ing.unipi.it
² Department of Mathematics, University of Pisa, Via Buonarroti 2, 56127 Pisa, Italy
meini@dm.unipi.it
Abstract. Deficit Round-Robin (DRR) is a packet scheduling algorithm devised for providing fair queuing in the presence of variable-length packets. Upper bounds on the buffer occupancy and scheduling delay of a leaky-bucket-regulated flow have been proved to hold under DRR. However, performance bounds are important for real-time traffic such as video or voice, whereas for data traffic average performance indices are meaningful in most cases. In this paper we propose and solve a specific worst-case model that enables us to calculate quantiles of the queue length distribution at any time (and hence average delays) as a function of the offered load, when the arrival process is Poisson. The model proposed is a discrete-time discrete-state Markov chain of M/G/1-type, and hence we used the matrix analytic methodology to solve it. The structure of the blocks belonging to the transition probability matrix is fully exploited. As a result of this exploitation, an effective algorithm for computing the matrix G is proposed. The algorithm consists of diagonalizing suitable matrix functions by means of the Discrete Fourier Transform and of applying Newton's method.
1 Introduction

Multi-service packet networks are required to carry traffic pertaining to different applications, such as e-mail or file transfer, which do not require pre-specified service guarantees, and real-time video or telephony, which require performance guarantees. Therefore, multi-service packet networks need to enable Quality of Service (QoS) provisioning. A key component of QoS-enabling networks is the scheduling algorithm, which selects which packet to transmit next, and when, on the basis of some
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794. © Springer-Verlag Berlin Heidelberg 2003
expected performance metrics. During the last decade, this research area has been widely investigated, as shown by the abundance of existing literature (see [3,4], for example, and the works referred to therein). Deficit Round Robin (DRR, [6]) has been devised in order to allow flows¹ with variable packet lengths to share the link bandwidth fairly. It is currently implemented in IP routers (e.g., the Cisco GSR²) and in operating systems (e.g., Windows XP). It has been proved that DRR belongs to the category of Latency-Rate servers [4,5]³. This implies that it is possible to guarantee a minimum transmission rate to a flow, irrespective of the behavior of the other flows. If the flow's traffic is leaky-bucket shaped, minimum rate guarantees straightforwardly translate into maximum delay guarantees. More recent research related to the DiffServ architecture [8] shows that DRR provides packet scale rate guarantees [10], and it is thus feasible for implementing the Expedited Forwarding Per-Hop Behavior [9,10]. The above-mentioned research shows that DRR can be used to schedule traffic requiring performance guarantees, e.g., real-time traffic. However, the vast majority of today's network traffic is still originated by data applications, which require a best-effort service such as that provided by the current Internet architecture. For this type of traffic, average performance metrics are more important than performance bounds [3]. In this paper we propose and solve a specific worst-case model that enables us to calculate quantiles of the queue length distribution at packet departure time (and hence average delays) as a function of the offered load when the arrival process is Poisson. The model proposed is a discrete-time discrete-state Markov chain of M/G/1-type, and hence we used a matrix analytic methodology to solve it. Exploiting the structure of the blocks belonging to the transition probability matrix considerably reduces the computational costs.
The remaining part of the paper is organized as follows. Section 2 outlines the Deficit Round Robin scheduling algorithm, while Section 3 describes the worst-case model. Section 4 describes the Markov chain which captures the behavior of the worst-case model, while Section 5 focuses both on the specification of the A_k and B_k matrices and on the stability analysis. Sections 6 and 7 describe the effective method proposed for solving this type of M/G/1-type Markov chain. Our results are discussed in Section 8, and conclusions are drawn in Section 9.
2 DRR Scheduling Algorithm

Deficit Round Robin (DRR, [6]) is a variation of Weighted Round-Robin that allows flows with variable packet lengths to share the link bandwidth fairly. In this section we briefly recall the DRR aspects which are relevant for our analysis, and its implementation as proposed in [6]. Let us assume that N flows contend for the bandwidth of an output link. Each flow i, i = 1, …, N, is characterized by a quantum h(i), which measures the quantity of
¹ By flow we mean a distinguishable stream of packets, to which a scheduler applies the same forwarding treatment. Note that such a definition is general enough to match both IntServ flows [7] and DiffServ behavior aggregates [8].
² A slightly different version, called Modified DRR (MDRR), is in fact implemented therein.
³ A tight bound on the DRR latency has been computed in [2].
traffic that flow i should "ideally" transmit during a round, and by a deficit variable DC(i). When a backlogged flow is serviced, it is allowed to transmit a burst of packets of an overall length not exceeding h(i) + DC(i). When a flow is not able to send a packet in a round because the packet is too large, the fraction of the quantum which could not be used for transmission in the current round is saved in the flow's deficit variable, and is therefore made available to the same flow in the next round. More specifically, the deficit variable is managed as follows:
• it is reset to zero when the flow is not backlogged;
• it is increased by h(i) when the flow is selected for service during a round;
• it is decreased by the packet length when a packet is transmitted.
Let us denote by segment the unit of length of a packet (e.g., one or more bytes), and let us assume that quanta are expressed as an integer number of segments. Let L(i) be the maximum length of a packet for flow i. An example of a DRR schedule is reported in Fig. 1.

Fig. 1. Example of DRR schedule
In the figure, flows are numbered according to their order in the active list. The quanta are h(1) = h(3) = 4 and h(2) = 3. At the beginning of round j, every flow has a null deficit. During round j, flow 1 is able to fit into its quantum only the head-of-line 3-segment packet. It then carries over a deficit DC(1) = 1 onto the next round. Then, flow 2 transmits one 1-segment packet and carries over a deficit DC(2) = 2, and flow 3 transmits four 1-segment packets and carries over a null deficit. In [6] the following inequality has been proved to hold right after a flow has been serviced during a round:

0 ≤ DC(i) ≤ L(i) − 1,
meaning that a flow's deficit never reaches the maximum packet length for that flow. From an implementation standpoint, DRR exploits a FIFO list, called the active list, to store references to the backlogged flows. When an idle flow becomes backlogged, its reference is added at the tail of the list. Cyclically, if the list is not empty, the flow at the head of the list is dequeued and serviced. If the flow is still backlogged after being serviced, its reference is added back at the tail of the list. The authors of [6] show that, if

h(i) ≥ L(i),   i = 1, …, N,   (1)

then the number of operations required for enqueueing and dequeuing a packet in the worst case is constant, i.e., independent of the number of scheduled flows.
3 DRR Worst-Case Model Description

The performance analysis of the DRR scheduling algorithm via an analytical model is a very difficult, if not impossible, task. Simplifying assumptions therefore have to be made in order to obtain analytically tractable solutions. In the following we specify a simplified model of DRR (worst-case⁴ model) which can be analytically solved and yet still provides useful information to study the quality of service a given (although arbitrary) flow can rely upon. Specifically, the worst-case model focuses on a specific flow (the tagged flow), and assumes that all the others (the vacation flows) operate in asymptotic conditions (i.e., all the flows, except the tagged one, always have packets ready for transmission). According to this approach, the tagged flow is modeled as a single-server queuing system with server vacations (hereafter "system" for short). The server vacation time, which in general varies from round to round, represents the time interval between the server's departure from the tagged flow and its subsequent arrival at the same flow. Thus, the vacation time is generated by the aggregate of the vacation flows operating in asymptotic conditions. Let us denote by h the quantum associated to the tagged flow⁵ and by ĥ (in general different from h) the quantum associated to each of the vacation flows. The number of packets that can be held in the tagged queue (associated to the tagged flow) and in the vacation queues (associated to the vacation flows) is assumed to be infinite. At the n-th round, when the server visits the tagged queue it can provide contiguous service for at most DC_n segments; i.e., the server provides service until either:
• the system is emptied;
• a burst of packets for a total of exactly DC_n segments has been serviced, i.e., DC_n = 0;
• a burst of packets for a total of less than DC_n segments has been serviced and the next packet length exceeds the number of DC_n segments the server is still entitled to transmit,
whichever of the above events occurs first. The server then goes on vacation before returning to service the tagged queue again. If the server returns from a vacation to find no packets waiting, it begins another vacation immediately, and continues in this manner (multiple vacations) until one packet joins the empty tagged queue. This event occurs while the server is on vacation and is therefore servicing one vacation queue. To model the behavior of the active list facility, the server behaves as follows:
• it completes the service at the vacation queue being visited at the time of the packet arrival;
• it services all the other vacation queues, for a time V − ĥ;
• it services the tagged queue.
As we said before, vacations generally vary from round to round. However, in order to simplify the model we assume that each vacation has a constant length. Specifically,

⁴ Although we call it a worst-case model, there is no theorem which proves it to be the worst-case model. Therefore, we use this expression to mean that it belongs to the class of worst-case models.
⁵ In the rest of the paper we omit the flow index when referring to the tagged flow.
we assume that there are M ≥ 1 vacation queues, and therefore V = Mĥ. This is clearly an approximation, which will be validated by simulation in Section 8. Each packet is modeled as consisting of a random number of segments. Let the n-th packet size be denoted by X_n (measured in segments). Then {X_n, n ≥ 1} forms a sequence of independent identically distributed (i.i.d.) random variables. The probability mass function and the average of the packet length are denoted by p_k = P{X_n = k}, k = 1, 2, …, L < ∞, and E[X], respectively. Packets arrive at the tagged queue according to a homogeneous Poisson process of rate λ packets/segment; i.e., a_k(l) = e^{−λl}(λl)^k / k! is the probability of k ≥ 0 packet arrivals over l ≥ 0 segments. Due to the quantum associated to the tagged queue, our DRR worst-case model can be regarded as belonging to the class of M/G/1 queuing systems with exhaustive limited (E-limited) service and multiple (constant) server vacations [15].
4 Markov Chain Description

This section focuses on the specification of the Markov chain which describes the behavior of the worst-case DRR model. The M/G/1 system with E-limited service and multiple vacations (M/G/1 system for short) is observed at the packet departure epochs (embedding points), i.e., at the epochs of the successive packet service completions. Specifically, immediately after the n-th packet departure the M/G/1 system state can be specified by the couple {Q, DC}_n of random variables, where Q_n ≥ 0 is the number of packets in the queue, while DC_n, with 0 ≤ DC_n ≤ h − 1, is the forward deficit counter⁶, which represents the remaining number of segments the server is entitled to use for successive packet transmissions. We want to highlight that during the evolution of the M/G/1 system the deficit counter can assume values up to h + L − 1. However, at the embedding points the following inequalities hold: 0 ≤ DC_n ≤ h − 1. We refer to the time interval between two embedding points as an embedding interval. Note that the embedding interval only takes on integer values, since the service times and the vacation times are integer-valued. With the assumptions made for the DRR worst-case model⁷, it can be shown that {Q, DC}_n is a spatially homogeneous Markov chain on the state space E = {(k, Ph): k ≥ 0, 0 ≤ Ph ≤ h − 1}, where, according to Markov chain terminology, the first component (k) is called the level and the second (Ph) the server phase. Since at each embedding point the level decreases by at most one unit, it follows that {Q, DC}_n is a Markov chain of M/G/1-type, characterized by the following transition probability block matrix P:

⁶ For ease of representation, hereafter the sub-index in the DC refers to the embedding point only.
⁷ Poissonian packet arrivals, {X_n, n ≥ 1} sequence of i.i.d. random variables, vacations of constant duration.
      ⎡ B̄₀  B₁  B₂  B₃  ⋯ ⎤
      ⎢ Ā₀  A₁  A₂  A₃  ⋯ ⎥
P =   ⎢ 0   A₀  A₁  A₂  ⋯ ⎥
      ⎢ 0   0   A₀  A₁  ⋯ ⎥
      ⎣ ⋮   ⋮   ⋮   ⋱     ⎦

where, for all k ≥ 0, A_k ∈ ℝ^{h×h}, Ā₀ = A₀ · e ∈ ℝ^{h×1}, B_k ∈ ℝ^{1×h}, and B̄₀ = B₀ · e ∈ ℝ (e = [1, …, 1]^T is a vector with h components all equal to one) are matrices of mass functions defined by
• A_k = [f_{ij}(k)]: the probability that, given a packet departure which left at least one packet in the system and the server in phase i, the next packet departure leaves the server in phase j, and during that inter-departure time there were k packet arrivals;
• B_k = [b_j(k)]: the probability that, given a packet departure which left the system empty (and therefore the server in phase 0), the next packet departure after a vacation leaves the server in phase j and k packets in the system.
The entries in the i-th row of matrix P represent transitions from states with i − 1 packets to some other state. For example, entries in the first row correspond to transitions from states with zero packets, and therefore phase zero, to other states. In particular, the entry B̄₀ ∈ ℝ represents the only transition from the state with zero packets and phase 0 to another state with the same number of packets and phase. Similarly, the entry B₂ = [b_j(2)] corresponds to transitions from the state with zero packets and phase zero to states with 2 packets and phase j.
5 Structure of the A_k and B_k Matrices

The structure of the A_k and B_k matrices depends upon the relation between h and the maximum packet length L. Two cases must be distinguished, namely h ≥ L and h ≤ L, which are developed in Subsections 5.1 and 5.2, respectively.

5.1 Case h ≥ L

The transition probabilities related to A_k = [f_{ij}(k)] and B_k = [b_j(k)], k ≥ 0, can be readily derived:

f_{ij}(k) = p_{h+i−j} · a_k(h + i − j + V),   if 0 ≤ i ≤ L − 1 and h − (L − i) ≤ j ≤ h − 1;
f_{ij}(k) = p_{i−j} · a_k(i − j),   if 1 ≤ i ≤ L − 1 and 0 ≤ j ≤ i − 1, or L ≤ i ≤ h − 1 and i − L ≤ j ≤ i − 1;
f_{ij}(k) = 0,   otherwise.   (2)
b_j(k) = Σ_{m=1}^{k+1} γ_m · a_{k−m+1}(V − ĥ + (h − j)) · p_{h−j},   if h − L ≤ j ≤ h − 1;
b_j(k) = 0,   otherwise,   (3)

where γ_m = ν_m / (1 − ν₀), with ν_m = e^{−λĥ}(λĥ)^m / m!, is the probability of m ≥ 1 packet arrivals at the tagged queue while the server was visiting a vacation queue (characterized by the quantum ĥ), given that at least one packet arrives during ĥ. By means of (2) and (3) it can be verified that P is stochastic, i.e.,

Σ_{k=0}^{∞} B_k · e = Σ_{k=0}^{∞} A_k · e = e.
To get the stochastic interpretation of the A_k = [a_{ij}(k)] and B_k = [b_j(k)] matrices it is necessary to distinguish between the following classes of events:
1. {Q_n > 0, DC_n = i ≥ l = i − j}, where l (with probability p_l) is the length of the next packet to be transmitted. In this case the (n+1)-th packet is served immediately after the previous packet's departure. Assuming that k packets arrived during the packet service, the probability that {Q_{n+1}, DC_{n+1}} = {Q_n + k − 1, DC_n − l = j} is p_{i−j} · a_k(i − j).
2. {Q_n > 0, DC_n = i < l = h − j + i}, where l (with probability p_l) is the length of the next packet to be transmitted. In this case the server has to take a vacation before serving the (n+1)-th packet. Specifically, at the end of the vacation the station gets the right to transmit h + i segments, which thus enables the station to transmit the packet and therefore to move to phase j. Assuming that k packets arrived during the vacation (V) followed by the (n+1)-th packet service (l = h + i − j), the probability that {Q_{n+1}, DC_{n+1}} = {Q_n + k − 1, DC_n + h − l = j} is given by the expression p_{h+i−j} · a_k(h + i − j + V).
3. {Q_n = 1} and no packet arrives during the (n+1)-th packet service. In this case the queue size becomes zero at the (n+1)-th packet departure epoch. Furthermore, at this departure epoch the phase is set to zero (which enables the station to transmit h segments immediately after a vacation during which packet arrivals occur), independently of the phase reached after the transmission of the packet which left the system empty. Thus, the probability of such an occurrence is given by A_0.
4. {Q_n = 0}. In this case, the server takes vacations until the (n+1)-th packet has arrived. If this packet leaves behind k > 0 packets in the system upon departure, there must be 1 ≤ m ≤ k + 1 packet arrivals during the service of the vacation queue being visited at the time the packet joined the empty tagged queue, and k − m + 1 packet arrivals during V − h̄ segments plus the (n+1)-th packet service time. Thus, if the packet leaves the system in phase j, we have k − m + 1 packet arrivals in a time interval equal to V − h̄ + (h − j) segments (see Fig. 2). It is easy to verify that the probability of having {Q_{n+1}, DC_{n+1}} = {k, j} is the entry b_j(k) of the probability vector B_k, for k > 0. Thus, b_j(k) is computed from the convolution between the probability γ_m that m ≥ 1 packets arrive during h̄, and the probability a_{k−m+1}(V − h̄ + (h − j)) that k − m + 1 packets arrive during V − h̄ and the service time of the first packet, of length h − j, which arrived during h̄. In case the (n+1)-th packet leaves behind an empty system, the phase is set to zero and thus
An M/G/1 Queueing System
the probability for this event is B_0. As outlined in Fig. 2, when a packet arrives at an empty tagged queue, the server experiences a vacation which includes the residual service time of the vacation queue being visited plus the V − h̄ service time of the other vacation queues. This is due to the active list facility described in Sections 2 and 3.

[Figure 2 (timeline): a packet P arrives at the tagged queue, empty since the last departure, while a vacation queue is being visited (quantum h̄); after the remaining vacations (V − h̄, no arrivals) the tagged queue is visited by the server; P leaves the system with k packets behind and in phase j, having seen k − m + 1 further arrivals.]
Fig. 2. Events starting from an empty tagged queue
From (2) it follows that the transition probabilities a_{ij}(k) only depend upon the difference i − j and therefore the matrices A_k, k ≥ 0, are Toeplitz. Thus, if we define

φ_m(k) = p_m a_k(m), φ̃_m(k) = p_m a_k(m + V), m = 1, 2, ..., L,

then, for i, j = 0, 1, ..., h − 1,

a_{ij}(k) = φ_{i−j}(k) if 1 ≤ i − j ≤ L; φ̃_{i−j+h}(k) if −h + 1 ≤ i − j ≤ −h + L; 0 elsewhere,

and therefore the matrices A_k, k ≥ 0, are banded Toeplitz matrices: the entries φ_1(k), ..., φ_L(k) lie on the first L diagonals below the main diagonal, the wrapped entries φ̃_1(k), ..., φ̃_L(k) occupy the upper-right corner, and all other entries are zero.
Depending on the packet length distribution and on the value of h, the matrix A = Σ_{k=0}^∞ A_k can be either irreducible or reducible. In the following we choose the above distribution and h in such a way that A is irreducible.
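Summing the A_k over k removes the Poisson arrival factor (Σ_k a_k(t) = 1 for any t), so A = Σ_k A_k is simply the h × h circulant generated by the packet length distribution, A_{ij} = p_{(i−j) mod h}, which is why A is doubly stochastic with π_j = 1/h. A minimal sketch, with parameter values that are illustrative assumptions and not taken from the paper:

```python
import numpy as np

# Assumed example parameters for the case h >= L.
h, L = 6, 3
p = {1: 0.3, 2: 0.4, 3: 0.3}      # packet length distribution p_m, m = 1..L

# A = sum_k A_k is the circulant generated by p: A[i, j] = p_{(i-j) mod h}.
A = np.zeros((h, h))
for i in range(h):
    for j in range(h):
        A[i, j] = p.get((i - j) % h, 0.0)

# A is doubly stochastic, so pi_j = 1/h is its invariant probability vector.
pi = np.full(h, 1.0 / h)
print(np.allclose(A.sum(axis=1), 1.0), np.allclose(pi @ A, pi))  # True True
```

Irreducibility can then be checked on this small matrix directly (e.g. by inspecting the reachability of phases under the supports of p and h).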
L. Lenzini et al.
5.1.1 Stability Analysis
It can be readily verified that when the matrix A = Σ_{k=0}^∞ A_k is irreducible there exists a unique (positive) vector π (invariant probability vector) which satisfies π A = π; π · e = 1. Since A is stochastic circulant and therefore doubly stochastic, the h components of the vector π coincide and are given by π_j = 1/h. Define the vector β = Σ_{k=1}^∞ k A_k e. When the matrix A is irreducible, the minimal nonnegative solution G of the matrix equation Y = Σ_{k=0}^∞ A_k Y^k is stochastic if and only if the inner product ρ = π · β ≤ 1. Moreover, if the matrix A is irreducible, the imbedded Markov chain is positive recurrent if and only if ρ = π · β < 1 and Σ_{k=1}^∞ k B_k e is finite. We present below a stochastic interpretation of the conditions for the positive recurrence of the embedded Markov chain. After some matrix algebraic manipulation, the inequality ρ = π · β < 1 leads to λ(h + V)E[L] < h. Since λ(h + V) is the average number of packet arrivals in a round of maximum length (segments) while E[L] is the average packet length (segments), the inequality ρ = π · β < 1 means that positive recurrence of the embedded Markov chain requires that, when the system is not empty, the average number of segments arriving in a round of maximum length must be less than the server quantum h. Since
Σ_{k=1}^∞ k B_k e = λ ( h̄/(1 − e^{−λh̄}) + V − h̄ + E[L] ) − 1,   (4)
the positive recurrence of the embedded Markov chain requires also that, starting from an empty system, the expected number of packet arrivals during:
• the vacation experienced by the server before visiting the tagged queue, from the time the tagged queue begins a busy period;
• the transmission time of the above packet,
should be finite. Minus one in (4) represents the packet which arrived during the vacation which began the busy period and left the system at the first embedding point after the vacation itself.

5.2 Case h ≤ L
The difference with respect to the previous case (i.e. h ≥ L) is that the server may take several vacations (see r_max(m) defined below) before getting the right to transmit a packet. According to the DRR scheduling algorithm, at the end of each vacation, if the system is not empty, the DC variable is incremented by the quantum h. Similarly to the case h ≥ L, it can be shown that the matrices A_k, k ≥ 0, are Toeplitz. Thus, if we define
θ_m(k) = Σ_{l=1}^{r_max(m)} p_{lh+m} · [λ(l(h+V) + m + V)]^k / k! · e^{−λ(l(h+V) + m + V)},

θ̃_m(k) = Σ_{l=1}^{r_max(m)} p_{lh+m} · [λ(l(h+V) + m)]^k / k! · e^{−λ(l(h+V) + m)},

where

r_max(m) = ⌈(L − m)/h⌉,

then
A_{ij}(k) = θ̃_{i−j}(k) if 0 ≤ i − j ≤ h − 1; θ_{i−j+h}(k) if −h + 1 ≤ i − j ≤ −1; 0 elsewhere,

and therefore the matrices A_k, k ≥ 0, can be written as the h × h matrices with θ̃_0(k), θ̃_1(k), ..., θ̃_{h−1}(k) on and below the main diagonal and the wrapped entries θ_1(k), ..., θ_{h−1}(k) above it.
As in the previous case (h ≥ L), depending on the packet length distribution and on the value of h, the matrix A = Σ_{k=0}^∞ A_k can be either irreducible or reducible. In the following we choose the above distribution and the value of h in such a way that A is irreducible. Furthermore,

b_j(k) = Σ_{m=1}^{k+1} γ_m · Σ_{l=1}^{⌈(L+j)/h⌉} p_{lh−j} · [λ((V − h̄) + (l−1)V + (lh − j))]^{k−m+1} / (k−m+1)! · e^{−λ((V − h̄) + (l−1)V + (lh − j))},  0 ≤ j ≤ h − 1.

It can be readily verified that when h = L the matrices A_k and B_k, k ≥ 0, developed for h ≥ L and h ≤ L coincide.
5.2.1 Stability Analysis
It can be readily verified that when the matrix A = Σ_{k=0}^∞ A_k is irreducible there exists a unique (positive) vector π (invariant probability vector) which satisfies π A = π; π · e = 1. Since A is stochastic circulant and therefore doubly stochastic, the h components of the vector π coincide and are given by π_j = 1/h. We present below a stochastic interpretation of the conditions for the positive recurrence of the embedded Markov chain. After some thorough matrix algebraic manipulation, the inequality ρ = π · β < 1 leads to λ(E[L] + V · E[N_V]) < 1, where N_V is a random variable representing the number of vacations needed by the server to get the right to transmit a packet. Since λ(E[L] + V · E[N_V]) is the average number of packet arrivals in N_V consecutive rounds, each one of length V, plus the following packet transmission, while E[L] is the average packet length, the inequality ρ = π · β < 1 means that positive recurrence of the embedded Markov chain requires that, when the system is not empty, the average number of packets arriving in a period including the N_V vacations and the following packet transmission must be less than one. Since
Σ_{k=1}^∞ k B_k e = λ ( h̄/(1 − e^{−λh̄}) + V − h̄ + V · E[N_V] + E[L] ) − 1,   (5)

the positive recurrence of the embedded Markov chain requires also that, starting from an empty system, the expected number of packet arrivals during:
• the vacation experienced by the server before visiting the tagged queue, from the time the tagged queue begins a busy period;
• the N_V vacations required by the server to get the right to transmit the above packet;
• the transmission time of the above packet,
should be finite. Minus one in (5) represents the packet which arrived during the vacation which began the busy period and left the system at the embedding point.
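The positive-recurrence conditions reduce to a one-line numerical check. The sketch below uses the h ≥ L form λ(h + V)E[L] < h with values echoing the numerical section (h = 500 ≥ L = 375, V = 3375 segments, E[L] ≈ 258.4 segments); the arrival rate λ is an assumed example value, not one used in the paper:

```python
# Positive-recurrence check, case h >= L: lambda * (h + V) * E[L] < h.
lam = 3e-4                 # packet arrivals per segment time (assumed example)
h, V, EL = 500, 3375, 258.4

rho = lam * (h + V) * EL / h
print(f"rho = {rho:.4f} ->", "positive recurrent" if rho < 1 else "not positive recurrent")
```

With these numbers ρ ≈ 0.6008, so the embedded chain is positive recurrent; doubling λ would push ρ above 1.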
6 Computation of the G Matrix
Using the matrix analytic methodology [11], the steady-state vector can be computed by a stable recursive formula (Ramaswami's formula [12]) once the minimal nonnegative solution G of the matrix equation Y = Σ_{k=0}^∞ A_k Y^k has been computed. Several general methods are available in the literature. However, due to the strong properties of the matrices A_k, we propose a specific algorithm which fully exploits the structure of G. More specifically, we show that G is similar to a diagonal matrix and its computation reduces to approximating the zeros of h − 1 scalar functions. The structure of the matrix G relies on the property that the matrix function A(z) = Σ_{k=0}^∞ A_k z^k is a β(z)-circulant matrix, that is, it has the following structure:
A(z) = Σ_{k=0}^∞ A_k z^k is the β(z)-circulant matrix generated by α_0(z), α_1(z), ..., α_{h−1}(z): its (i, j) entry equals α_{i−j}(z) when i ≥ j, and β(z) α_{i−j+h}(z) when i < j (the entries that wrap around the main diagonal carry the factor β(z)).

In the following we show the structure of A(z) = Σ_{k=0}^∞ A_k z^k in the two cases h ≥ L and h ≤ L.

6.1 Case h ≥ L
Observe that:

Σ_{k=0}^∞ φ_m(k) z^k = p_m e^{−λm(1−z)},

Σ_{k=0}^∞ φ̃_m(k) z^k = p_m e^{−λV(1−z)} e^{−λm(1−z)}.

Thus

α_m(z) = 0 if m = 0 and m = L + 1, ..., h − 1,
α_m(z) = Σ_{k=0}^∞ φ_m(k) z^k = p_m e^{−λm(1−z)} if m = 1, 2, ..., L,

β(z) = e^{−λV(1−z)}.
6.2 Case h ≤ L
Observe that:

Σ_{k=0}^∞ θ̃_m(k) z^k = e^{−λm(1−z)} Σ_{l=1}^{r_max(m)} p_{lh+m} e^{−λl(h+V)(1−z)},

Σ_{k=0}^∞ θ_m(k) z^k = β(z) e^{−λm(1−z)} Σ_{l=1}^{r_max(m)} p_{lh+m} e^{−λl(h+V)(1−z)}.

Hence,

α_0(z) = e^{−λ(h+V)(1−z)} Σ_{l=1}^{r_max(0)} p_{lh+h} e^{−λl(h+V)(1−z)},

α_m(z) = e^{−λm(1−z)} Σ_{l=1}^{r_max(m)} p_{lh+m} e^{−λl(h+V)(1−z)} if m = 1, 2, ..., h − 1,
β(z) = e^{−λV(1−z)}.

Let us denote by D(z) the diagonal matrix whose diagonal entries are 1, β(z)^{1/h}, β(z)^{2/h}, ..., β(z)^{(h−1)/h}, and let Ω be the h × h Fourier matrix, (Ω)_{ij} = ω^{ij}/√h for i, j = 0, 1, ..., h − 1, where ω = cos(2π/h) + i sin(2π/h). Then, by denoting with A^H the conjugate transpose of the matrix A, from the properties of circulant matrices [1] one has that:

Ω D(z) A(z) D(z)^{−1} Ω^H = diag( q_0(z), q_1(z), ..., q_{h−1}(z) ),   (6)

where

q_j(z) = Σ_{m=0}^{h−1} α_m(z) β(z)^{m/h} ω^{mj},  j = 0, 1, ..., h − 1.

Thus,
• when h > L, for j = 0, 1, ..., h − 1,
q_j(z) = Σ_{m=1}^{L} p_m e^{−λm(1−z)(1+V/h)} ω^{mj};
• when h ≤ L, for j = 0, 1, ..., h − 1,

q_j(z) = e^{−λ(h+V)(1−z)} Σ_{l=1}^{r_max(0)} p_{lh+h} e^{−λl(h+V)(1−z)} + Σ_{m=1}^{h−1} e^{−λm(1−z)(1+V/h)} ω^{mj} Σ_{l=1}^{r_max(m)} p_{lh+m} e^{−λl(h+V)(1−z)}.
Let us denote f_j(z) = q_j(z) − z, for j = 0, 1, ..., h − 1. Concerning the zeros of the functions f_j(z) one has the following result:

Theorem 1. The function f_j(z) = q_j(z) − z, for j = 1, 2, ..., h − 1, has exactly one zero in the closed unit disk.

Proof: We show that |q_j(z)| < |z| for j = 1, 2, ..., h − 1 and for any z in the complex plane such that |z| = 1. Then the thesis follows from Rouché's theorem. Let us assume h > L. Let z = cos σ + i sin σ, for a fixed real number σ. Then, from the definition of q_j(z), it follows that:

q_j(z) = Σ_{m=1}^{L} r_m e^{iρ_m},

where r_m = p_m e^{−λm(1 − cos σ)(1+V/h)} and ρ_m = λm(1 + V/h) sin σ + 2πmj/h.

If cos σ ≠ 1, i.e. if z ≠ 1, and if p_m ≠ 0, then 0 ≤ r_m < p_m for m = 1, 2, ..., L, and thus |q_j(z)| ≤ Σ_{m=1}^{L} r_m < 1.

If z = 1, then q_j(1) = Σ_{m=1}^{L} p_m ω^{mj}. Since

| Σ_{m=1}^{L} p_m cos(2πmj/h) | < Σ_{m=1}^{L} p_m = 1  and  | Σ_{m=1}^{L} p_m sin(2πmj/h) | < Σ_{m=1}^{L} p_m = 1,
it follows that |q_j(1)| < 1. In the case h ≤ L we proceed analogously. ♦

Since, under the assumption that the Markov chain is positive recurrent, det(A(z) − zI) has exactly h zeros in the closed unit disk, it follows that also f_0(z) has exactly one zero in the closed unit disk, and such zero is equal to 1. Let us denote by ξ_j the unique zero of f_j(z) = q_j(z) − z, j = 0, 1, ..., h − 1, in the closed unit disk.
Moreover, let us denote by e_j, j = 1, 2, ..., h, the j-th column of the h × h identity matrix I. Then, from (6), it follows that

( Ω D(ξ_j) A(ξ_j) D(ξ_j)^{−1} Ω^H − ξ_j I ) e_{j+1} = 0,  j = 0, 1, ..., h − 1,

and thus

( A(ξ_j) − ξ_j I ) v_j = 0,  j = 0, 1, ..., h − 1,

where v_j = D(ξ_j)^{−1} Ω^H e_{j+1}, j = 0, 1, ..., h − 1.

Let Φ be the matrix whose columns are the vectors v_0, ..., v_{h−1}, and let Σ be the diagonal matrix made up of ξ_0, ..., ξ_{h−1}. Since Φ is nonsingular, the matrix G is given by G = Φ Σ Φ^{−1}. Thus the computation of G is reduced to the following steps:
1. set ξ_0 = 1 and compute ξ_1, ..., ξ_{h−1}, say, by applying Newton's method to the function f_j(z), j = 1, ..., h − 1;
2. compute v_j = D(ξ_j)^{−1} Ω^H e_{j+1}, j = 0, ..., h − 1 (this computation simply consists in scaling the (j+1)-th column of Ω^H);
3. compute G = Φ Σ Φ^{−1}.
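The three steps above can be sketched numerically for the case h > L. The parameter values below are illustrative assumptions (not the paper's experimental settings), and, instead of Newton's method, this sketch finds each ξ_j by the fixed-point iteration z ← q_j(z): q_j maps the closed unit disk into itself, and it is a contraction whenever λ(1 + V/h)E[L] < 1, as holds here.

```python
import numpy as np

# Assumed example parameters for the case h > L.
h, L = 4, 3
p = {1: 0.3, 2: 0.4, 3: 0.3}        # packet length distribution p_m
lam, V = 0.1, 6.0                   # Poisson rate and constant vacation (segments)

def beta(z):
    return np.exp(-lam * V * (1 - z))

def q(j, z):
    """Eigenvalue function q_j(z) of the beta(z)-circulant A(z), case h > L."""
    w = np.exp(2j * np.pi * j / h)
    return sum(pm * np.exp(-lam * m * (1 - z) * (1 + V / h)) * w ** m
               for m, pm in p.items())

# Step 1: xi_0 = 1; for j >= 1 find the in-disk zero of f_j(z) = q_j(z) - z
# by the fixed-point iteration z <- q_j(z).
xi = np.ones(h, dtype=complex)
for j in range(1, h):
    z = 0j
    for _ in range(200):
        z = q(j, z)
    xi[j] = z

# Step 2: v_j = D(xi_j)^{-1} Omega^H e_{j+1}, i.e. scale the (j+1)-th column of Omega^H.
idx = np.arange(h)
Omega = np.exp(2j * np.pi * np.outer(idx, idx) / h) / np.sqrt(h)
Phi = np.empty((h, h), dtype=complex)
for j in range(h):
    Phi[:, j] = beta(xi[j]) ** (-idx / h) * Omega.conj().T[:, j]

# Step 3: G = Phi Sigma Phi^{-1}. For stable parameters G comes out real,
# nonnegative, and stochastic (its rows sum to 1).
G = (Phi @ np.diag(xi) @ np.linalg.inv(Phi)).real
print(np.round(G, 4))
```

Since v_0 is the all-ones vector (ξ_0 = 1, β(1) = 1), the construction reproduces G e = e by design; nonnegativity of the entries is a useful independent sanity check.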
7 Computation of the Buffer Occupancy Steady State Probabilities
Once the G matrix is known we can apply Ramaswami's formula to calculate the invariant probability vector x of the queue length following packet service completions of the Markov chain {Q_n, DC_n} described in Section 3. The probability vector x can be partitioned as x = [x_0 x_1 x_2 x_3 ...], where the vectors x_k = [x_{k,0} x_{k,1} ... x_{k,h−1}], k ≥ 1, are of dimension h, and x_0, which corresponds to the state in the boundary level 0, is a scalar. Since in our model the arrival process is Poisson with a constant arrival rate λ (i.e. independent of the number of packets in the system), the stationary probability density {p_k} of the queue length at an arbitrary time is the same as the stationary probability density {x_k} of the queue length following packet service completions [13]. In our problem the blocks B_k, k ≥ 0, are row vectors and B_0 is a scalar; thus Ramaswami's formula [14] reduces to

x_0 = ( 1 + Σ_{i≥1} B̄_i · ( I − Σ_{i≥1} Ā_i )^{−1} · e )^{−1},

x_k = ( x_0 B̄_k + Σ_{j=1}^{k−1} x_j Ā_{k−j+1} ) ( I − Ā_1 )^{−1},  k ≥ 1,

with x_k ∈ R^{1×h}, k ≥ 1; x_0 ∈ R; and Ā_k = Σ_{j≥k} A_j G^{j−k}, B̄_k = Σ_{j≥k} B_j G^{j−k}.
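The recursion for x_k is generic once the censored blocks Ā_k and B̄_k are available. The sketch below implements it for arbitrary block sizes; since the paper's blocks are model-specific, the sanity check uses the scalar (h = 1) chain embedded at departures of an M/M/1 queue, where Ā_k = B̄_k = (ρ/(1+ρ))^k, G = 1, and the queue length at departures is known to be geometric, x_k = (1 − ρ)ρ^k:

```python
import numpy as np

def ramaswami(x0, Bbar, Abar, K):
    """x_k = (x0 * Bbar[k] + sum_{j=1..k-1} x_j @ Abar[k-j+1]) @ inv(I - Abar[1]).

    x0:   scalar boundary probability.
    Bbar: Bbar[k] is the 1 x h row vector sum_{j>=k} B_j G^{j-k}.
    Abar: Abar[k] is the h x h matrix sum_{j>=k} A_j G^{j-k}.
    """
    h = Abar[1].shape[0]
    inv = np.linalg.inv(np.eye(h) - Abar[1])
    x = {}
    for k in range(1, K + 1):
        acc = x0 * Bbar[k]
        for j in range(1, k):
            acc = acc + x[j] @ Abar[k - j + 1]
        x[k] = acc @ inv
    return x

# Sanity check: M/M/1 departure chain (h = 1, G = 1), geometric queue length.
rho, K = 0.6, 20
r = rho / (1 + rho)
Abar = {k: np.array([[r ** k]]) for k in range(1, K + 1)}
x = ramaswami(1 - rho, Abar, Abar, K)
print(abs(x[K][0, 0] - (1 - rho) * rho ** K) < 1e-12)  # True
```

The recursion is numerically stable because every operation involves only nonnegative quantities (no subtractions of probabilities), which is the well-known advantage of Ramaswami's formula.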
8 Numerical Results
In this section we report and comment on the following analytic results:
• probability mass function of the buffer occupancy;
• average delay of a packet, defined as the average time interval between packet arrival and departure,
obtained by solving the DRR worst-case model with two different packet length distributions and for several values of the offered load. In order to validate the assumption of constant vacation made in Section 3, we compare the analytic results with simulative results obtained with the ns-2 simulator⁸ [17]. The algorithm for solving the analytic model is implemented in Fortran (computation of the G matrix) and C++ (Ramaswami's formula). The model is solved by using a segment length of 4 bytes, a maximum packet size (L) of 375 segments, and a vacation length of 3375 segments, corresponding to 9 vacation flows with h̄ = 375 segments. Two different packet length distributions are used. Specifically:
• uniform distribution between 125 and 375 segments;
• empirical distribution, reported in Table 1, which closely mirrors that measured from the Internet [16]. Note that, with this distribution, E[L] ≅ 258.4.

Table 1. Empirical packet length distribution (packet lengths in segments⁹ with their probabilities; the numeric entries are illegible in this copy)
We solve the model with the above settings for h = 150 and h = 500, selecting the arrival rate on the tagged flow so that ρ ranges from 0.01 to 0.99. For sample values of ρ, we simulate an IP router in which 10 flows contend for the bandwidth of a 10 Mbps link. In the simulation, nine "vacation" flows transmit packets whose length is uniformly distributed between 125 and 375 segments; they are kept in asymptotic conditions (i.e. their queues are always backlogged) and they are allocated a quantum of h̄ = 375 segments each, so as to generate a vacation whose average value is 3375 segments. For the tagged flow we select the quantum and the packet length distribution as specified above. We plot the average delay obtained analytically and by simulation as a function of ρ, along with the 99% confidence intervals for the simulation results. Fig. 3 reports the graphs for h = 150, whereas Fig. 4 reports the graphs for h = 500, in the case of the empirical packet length distribution. Graphs related to the uniform packet length distribution show the same behavior and are therefore omitted due to space limitations.

⁸ The DRR code that we used for the experiment was obtained by slightly modifying the one that comes with the distribution, which does not allow for quantum differentiation among flows.
⁹ Packet lengths are assumed to be uniformly distributed within this and the following intervals.
[Fig. 3 plot: average delay (segments) vs. ρ, empirical packet length distribution; model curve and simulation points.]
Fig. 3. Average delay for the case h = 150
[Fig. 4 plot: average delay (segments) vs. ρ, empirical packet length distribution; model curve and simulation points.]
Fig. 4. Average delay for the case h = 500
As the figures clearly show, the analytic results closely match those obtained with the simulator. Also note that, for a given value of the tagged flow quantum and a given ρ, the average delay is almost insensitive (at least in the framework of our experiments) to the packet length distribution chosen. Furthermore, we report the probability mass function of the number of packets in the system for ρ equal to 0.8, both for h = 150 and for h = 500, in Fig. 5. We also show the comparison between the analytic and simulative data (along with 95% confidence intervals) for one of the two quantum values and the uniform packet length distribution in Fig. 6. As the figure shows, analytic data are close to those obtained by simulations. Similar comments can be made for the other three cases (i.e. the other quantum value and/or the empirical packet length distribution), which are not reported due to space limitations.
[Fig. 5 plot: probability mass function vs. number of packets in the system, ρ = 0.8; four curves: h = 150 and h = 500, each with the uniform and the empirical packet length distribution.]
Fig. 5. Probability mass function of the number of packets in the system
[Fig. 6 plot: probability mass function vs. number of packets, ρ = 0.8, uniform packet length distribution; model and simulation curves.]
Fig. 6. Comparison between analytic and simulative results
9 Conclusions
In this paper we propose and solve a specific worst-case model of DRR that enables us to calculate quantiles of the queue length distribution at any time (and hence average delays) as a function of the offered load, when the arrival process is Poissonian. The solution method makes use of the matrix analytic methodology and of an effective algorithm for computing the matrix G. Since the proposed model approximates the behaviour of DRR, simulation was performed to validate the model. At least within the limits of the experiments, the obtained results show that the model approximates the behavior of the real DRR closely.
Extensions of the present paper can proceed in at least the following two directions. First, we can investigate under what conditions the A matrix is reducible, and then try to exploit such reducibility in the computation of the invariant probability vector x. Second, we can analyze the simplified DRR model with more general input processes such as the Markovian Arrival Process (MAP). The rationale behind this is that the MAP is a particularly tractable point process which is, in general, nonrenewal and which includes the Markov Modulated Poisson Process (MMPP), the PH-renewal process, and superpositions of such processes as particular cases. Because of its versatility, it lends itself very well to modeling the bursty arrival processes commonly arising in computer and communications applications.
References
1. P. J. Davis, Circulant Matrices, John Wiley and Sons, New York, 1979.
2. L. Lenzini, E. Mingozzi, and G. Stea, "Aliquem: a Novel DRR Implementation to Achieve Better Latency and Fairness at O(1) Complexity", in Proc. of the 10th International Workshop on Quality of Service (IWQoS), Miami Beach, USA, May 2002.
3. H. Zhang, "Service Disciplines for Guaranteed Performance Service in Packet-Switching Networks", Proceedings of the IEEE, vol. 83, no. 10, pp. 1374–1396, Oct. 1995.
4. D. Stiliadis, A. Varma, "Latency-Rate Servers: A General Model for Analysis of Traffic Scheduling Algorithms", IEEE/ACM Trans. on Networking, vol. 6, pp. 675–689, Oct. 1998.
5. D. Stiliadis, A. Varma, "Latency-Rate Servers: A General Model for Analysis of Traffic Scheduling Algorithms", T.R. CRL-95-38, Univ. of California at Santa Cruz, USA, July 1995.
6. M. Shreedhar and G. Varghese, "Efficient Fair Queueing Using Deficit Round Robin", IEEE/ACM Trans. on Networking, vol. 4, pp. 375–385, June 1996.
7. R. Braden, D. Clark and S. Shenker, "Integrated Services in the Internet Architecture: an Overview", RFC 1633, The Internet Society, June 1994.
8. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, The Internet Society, Dec. 1998.
9. B. Davie et al., "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, The Internet Society, March 2002.
10. A. Charny et al., "Supplemental Information for the New Definition of the EF PHB (Expedited Forwarding Per-Hop Behavior)", RFC 3247, The Internet Society, March 2002.
11. M. F. Neuts, Structured Stochastic Matrices of M/G/1 Type and Their Applications, Marcel Dekker, Inc., 1989.
12. V. Ramaswami, "A stable recursion for the steady state vector in Markov chains of M/G/1 type", Stochastic Models, 4, pp. 183–188, 1988.
13. Ronald W. Wolff, Stochastic Modeling and the Theory of Queues, Prentice-Hall International Editions, 1989.
14. H. Schellhaas, "On Ramaswami's algorithm for the computation of the steady state vector in Markov chains of M/G/1-type", Comm. Statist. Stochastic Models, 6 (1990), no. 3, pp. 541–550.
15. H. Takagi, Queueing Systems, North Holland, Volume 1: Vacations and Priority Systems, Part 1, 1991.
16. Agilent Technology Website.
17. The ns-2 simulator, available at .
Queueing Models with Maxima of Service Times

Peter Harrison¹ and Soraya Zertal²

¹ Imperial College London, South Kensington Campus, London SW7 2AZ, UK
[email protected]
² PRiSM, Université de Versailles, 45, Av. des Etats-Unis, 78000 Versailles, France
[email protected]
Abstract. We develop a queueing model that approximates the effect of synchronisations at parallel service completion instants. We first obtain exact results for the maxima of independent exponential random variables with arbitrary parameters and follow this with an approximation for general random variables, which reduces to the exact result in the exponential case. We use this in a queueing model of RAID (Redundant Array of Independent Disks) systems, in which accesses to multiple disks occur concurrently and complete only when every disk involved has completed. The random variables to be maximised are therefore disk response times which are modelled by the waiting times in an M/G/1 queue. To compute the mean value of their maximum requires the second moment of queueing time and we obtain this in terms of the third moment of disk service time, itself a function of seek time, rotational latency and block transfer time. These quantities are analysed individually in detail. Validation by simulation, with realistic hardware parameters and block sizes, shows generally good agreement at all traffic intensity levels, including the threshold above which performance deteriorates sharply.
1 Introduction
Traditional, e.g. product-form, queueing networks cannot model synchronisations at parallel service completion instants. We approximate this effect by considering the explicit flow of control in a physical system and modelling the contention in each phase using an approach based on the M/G/1 queue. The synchronisation time is then the maximum of a collection of M/G/1 queue sojourn times (also called waiting times or response times), which we assume to be independent and, initially, exponential random variables, as in an M/M/1 queue. We obtain, in section 2, an exact recurrence formula for the Laplace transform of the probability density function of this maximum, from which the mean and higher moments follow. In the special case that all the constituent exponential distributions are identical, the well-known result for the mean value of the maximum in terms of harmonic numbers follows immediately. The recurrence is then generalised to approximate the mean of the maximum of independent generally distributed random variables. This simplifies to the previous exact result when the constituent distributions are exponential but in general requires their P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 152–168, 2003. c Springer-Verlag Berlin Heidelberg 2003
second moments. The accuracy of the approximation is assessed by comparison with simulation results obtained for Erlang and Pareto constituent distributions. In section 3, these results are used in RAID (Redundant Array of Independent Disks) performance models with Poisson external requests but general disk seek, latency and transfer times. We determine the higher moments of the queueing time in the M/G/1 queue by differentiating its Laplace-Stieltjes transform at the origin. The second moment is then given in terms of the third moment of the service time, which is obtained in turn from the assumed distributions of seek time, rotational latency and block transfer time. The assumption of Poisson arrivals has often been found to be robust, especially for modelling external, user-generated, logical requests because such external traffic is usually composed of a number of low intensity streams that behave independently. The superposition of such sparse streams can be shown to approximate a Poisson process under quite mild assumptions. The accuracy of the model is assessed in section 4 by comparing the analytical predictions with a simulation of the actual system at the operational level. The quantitative results are presented as graphs of mean system response time against traffic intensity, showing generally good agreement and hence providing justification for our approach. The paper concludes in section 5.
2 Maximum of Random Variables
Suppose a task forks into a number of subtasks that are processed in parallel independently. The task's completion instant is that of the last subtask to complete processing, whereupon the subtasks combine (join) to re-form the original task. The fork-join time of the task, i.e. the time elapsed between the fork instant and the join instant, is therefore the maximum of the subtasks' processing times. In a Markovian environment, we derive the following:

Proposition 1. The maximum of n independent, negative exponential random variables, with parameters α = (α_1, ..., α_n), has probability density function f_n(α, t) with Laplace transform L_n(α, s) given by the recurrence (for s ≥ 0):

( s + Σ_{j=1}^{m} α_j ) L_m(α, s) = Σ_{j=1}^{m} α_j L_{m−1}(α\j, s)   (1)

for 1 ≤ m ≤ n, where α\j = (α_1, ..., α_{j−1}, α_{j+1}, ..., α_m) and L_0(ε, s) = 1, where ε is the null vector of zero components.

Proof
L_m(α, s) = ∫_0^∞ e^{−st} f_m(α, t) dt = − ∫_0^∞ e^{−st} F̄'_m(α, t) dt = 1 − s ∫_0^∞ e^{−st} F̄_m(α, t) dt   (2)
where F̄_m(α, t) = 1 − F_m(α, t) is the complementary distribution function of the maximum and the prime denotes differentiation with respect to t, so that f_m(α, t) ≡ F'_m(α, t). Now,

F_m(α, t) = ∏_{i=1}^{m} (1 − e^{−α_i t})

and so

F̄'_m(α, t) = − Σ_{j=1}^{m} α_j [ ∏_{i≠j} (1 − e^{−α_i t}) − ∏_{i=1}^{m} (1 − e^{−α_i t}) ]
           = − Σ_{j=1}^{m} α_j [ F_{m−1}(α\j, t) − F_m(α, t) ]
           = Σ_{j=1}^{m} α_j [ F_m(α, t) − F_{m−1}(α\j, t) ].

Thus, by equation (2),

s L_m(α, s) = Σ_{j=1}^{m} α_j [ L_{m−1}(α\j, s) − L_m(α, s) ]
and the result follows. ♠

To start the recurrence, note that the maximum of zero non-negative random variables (m = 0) is zero with probability 1 and so its density function has Laplace transform which is the constant 1. Notice that the following simple probabilistic argument proves this proposition. The maximum of the random variables is the sum of the minimum – i.e. the time up to the instant that the first exponential duration ends – and the maximum of the remaining m − 1 random variables. These two times are independent by the memoryless property of the exponential distribution and, moreover, the remaining times have the same exponential distribution as the corresponding full times, for the same reason. Finally, the i-th random variable is the least with probability α_i/(α_1 + ... + α_m).

2.1 Moments
From Proposition 1, we can immediately obtain the moments of the maximum of a set of exponential random variables.

Corollary 1. The k-th moment M_n(α, k) of the maximum of n ≥ 1 independent, negative exponential random variables with parameters α = (α_1, ..., α_n) is defined by the recurrence

M_n(α, k) = ( k M_n(α, k − 1) + Σ_{j=1}^{n} α_j M_{n−1}(α\j, k) ) / Σ_{j=1}^{n} α_j

for n ≥ 1 and M_0(ε, k) = 0, for all k ≥ 1, with M_n(α, 0) = 1 for all n ≥ 0.
Proof. Differentiating equation (1) k times, using Leibnitz's rule for differentiating products, and setting s = 0, we obtain

Σ_{j=1}^{n} α_j M_n(α, k) − k M_n(α, k − 1) = Σ_{j=1}^{n} α_j M_{n−1}(α\j, k)

and the result follows.

Corollary 2. In the special case that all the parameters of the exponential distributions are equal, α_j = α for 1 ≤ j ≤ n, we have:

L_n(α, s) = n! α^n / ∏_{m=1}^{n} (s + mα)   (3)

M_n(α, k) = M_{n−1}(α, k) + (k/(nα)) M_n(α, k − 1) = (k/α) Σ_{m=1}^{n} M_m(α, k − 1)/m   (4)

Note, in particular, from equation (4), we get for the mean of the maximum

M_n(α, 1) = (1/α) Σ_{m=1}^{n} 1/m.
This special case is already well known, relating to the n-th harmonic number; see [10] for a recent application in performance evaluation. Similarly,

M_n(α, 2) = (2/α²) Σ_{m=1}^{n} Σ_{i=1}^{m} 1/(mi) = (2/α²) Σ_{m=1}^{n} Σ_{i=1}^{m−1} 1/(mi) + (2/α²) Σ_{m=1}^{n} 1/m²

and

M_n(α, 1)² = (1/α²) ( Σ_{m=1}^{n} 1/m² + Σ_{m=1}^{n} Σ_{i≠m} 1/(mi) ) = (1/α²) Σ_{m=1}^{n} 1/m² + (2/α²) Σ_{m=1}^{n} Σ_{i=1}^{m−1} 1/(mi).

Hence, the variance of the maximum random variable out of the n is

V_n(α) = (1/α²) Σ_{m=1}^{n} 1/m².
Again, these results follow immediately from the above probabilistic argument. The maximum of n exponential random variables consists of a sum of one exponential random variable with parameter nα and the maximum of n − 1 random variables, which are independent. Consequently, the maximum is a sum of n independent, exponential random variables with parameters nα, (n−1)α, . . . , 1α.
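Corollary 1's recurrence is straightforward to implement directly; the harmonic-number closed forms then serve as a check in the equal-rates case. Exact rational arithmetic makes the comparison exact:

```python
from fractions import Fraction
from functools import lru_cache

def max_exp_moment(alphas, k):
    """k-th moment of the max of independent Exp(alpha_i) r.v.s (Corollary 1)."""
    @lru_cache(maxsize=None)
    def M(a, k):
        if k == 0:
            return Fraction(1)
        if not a:                       # maximum of zero variables is 0
            return Fraction(0)
        cross = sum(a[i] * M(a[:i] + a[i + 1:], k) for i in range(len(a)))
        return (Fraction(k) * M(a, k - 1) + cross) / sum(a)
    return M(tuple(Fraction(x) for x in alphas), k)

# Equal rates alpha = 1: the mean is the harmonic number H_n and the variance is
# sum_{m<=n} 1/m^2, so E[max^2] = H_n^2 + sum 1/m^2 -- reproduced exactly:
n = 4
H = sum(Fraction(1, m) for m in range(1, n + 1))
S2 = sum(Fraction(1, m * m) for m in range(1, n + 1))
print(max_exp_moment([1] * n, 1) == H, max_exp_moment([1] * n, 2) == H * H + S2)
```

The recursion explores subsets of the rate vector, so it is intended for the small n arising in fork-join models (e.g. stripes of a few disks), not for large n.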
Notice that, as n → ∞, V_n(α) converges (to π²/(6α²)) while the mean grows like (ln n)/α, i.e. the maximum of n independent exponential random variables becomes relatively more consistent as n → ∞. This is not surprising since the maximum value will not change unless a newly included random variable happens to be the biggest. This will occur with probability 1/n, which approaches zero. Of course, the same argument applies to any n random variables provided they are independent – the specific distributions do not matter.

2.2
Clusters of Identical Exponential Random Variables

For a large RAID system (for example) composed of disks with different arrival and/or service rates, the computational cost of the above expressions can be very high. However, we may be able to regroup the disks into a small number of clusters with a similar arrival rate within each cluster. Suppose a set of exponential distributions forms c clusters with n_i exponential random variables having parameter α_i in cluster i, 1 ≤ i ≤ c. Then we immediately have:

Proposition 2. The maximum of Σ_{a=1}^{c} n_a independent, negative exponential random variables in c clusters, with rate α_a in cluster a (1 ≤ a ≤ c), has probability density function with Laplace transform L(n, s) given by (for s ≥ 0):

( s + Σ_{a=1}^{c} n_a α_a ) L(n, s) = Σ_{a: n_a > 0} n_a α_a L(n_{a−}, s)   (5)

for n ≠ 0 and L(0, s) = 1, where n = (n_1, ..., n_c), n_{a−} = (n_1, ..., n_{a−1}, n_a − 1, n_{a+1}, ..., n_c), 0 = (0, ..., 0). The k-th moment M(n, k) is given by

( Σ_{a=1}^{c} n_a α_a ) M(n, k) = k M(n, k − 1) + Σ_{a: n_a > 0} n_a α_a M(n_{a−}, k)   (6)

for n ≠ 0 and M(0, k) = 0, for all k ≥ 1, with M(n, 0) = 1.

2.3
Mean of the Maximum of General Random Variables
We now derive an approximation for the mean value of the maximum of a set of independent, non-exponential random variables. First consider T = max(T1 , T2 ) for non-negative random variables T1 , T2 with distribution functions F1 (t), F2 (t) having LSTs F1∗ (θ), F2∗ (θ) respectively. Then, E[T ] = E[T1 ] + P (T2 > T1 )E[T2 − T1 |T2 > T1 ] ∞ F1 (t)dF2 (t).E[T2 − T1 |T2 > T1 ] = E[T1 ] + 0
In the special case that T2 is exponential, with parameter α2 say, we get

E[T] = m1 + F1*(α2) · E[T2 − T1 | T2 > T1]
Queueing Models with Maxima of Service Times
where m1 is the mean of T1. We make the approximating assumption that, at the time instant T1 < T2, the random observer property holds with respect to T2. This assumption is, of course, valid in the special case that T1 is exponential. Then we have

E[T2 − T1 | T2 > T1] = M2 / (2 m2)

where m2, M2 are the mean and second moment of T2 respectively. This is α2⁻¹ when T2 is exponential. We end up with the approximation

E[T] = m1 + (M2 / (2 m2)) F1*(1/m2)    (7)
By construction, this result is exact if both T1 and T2 are exponential. Otherwise, in general, we need to approximate the Laplace transform of the density of the maximum of k − 1 random variables when considering the maximum of k. To do this we use equation (1), which is the correct Laplace transform if the maximised random variables are all exponential. We immediately obtain the following approximation: Let I(n, α, M) be the expected value of the maximum of n non-negative random variables with means m = (m1, ..., mn), α = (1/m1, ..., 1/mn) and second moments M = (M1, ..., Mn). Then I(n, α, M) is defined by the recurrence, for k = 2, ..., n,

I(k, α, M) = (1/k) Σ_{i=1}^{k} [ I(k − 1, α\i, M\i) + αi Mi L_{k−1}(α\i, αi)/2 ]    (8)
with I(1, α1, M1) = 1/α1. Again, by construction (an easy inductive proof), the result is exact if all the random variables are exponential. Notice that, when exact, all the summands give the same result. When approximate, the result is the average of picking each of the k random variables in turn as the last in the sequence, and maximizing this and the maximum of the rest. Finally, we look at the special case where all the parameters are equal, say αi = α and Mi = M for 1 ≤ i ≤ n. We then have, from Corollary 2, L_{k−1}(α, α) = 1/k, so that

I(k, α, M) = I(k − 1, α, M) + Mα/(2k)

and hence

I(k, α, M) = 1/α + (Mα/2) Σ_{i=2}^{k} 1/i
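For the identical-parameter case, the closed form above is a one-liner; a sketch with our own naming, exact for exponentials, where it reduces to the harmonic number of Corollary 1:

```python
def mean_max_identical(k, alpha, M):
    """Approximate mean of the max of k i.i.d. non-negative variables
    with rate alpha = 1/mean and second moment M (exact if exponential):
    I(k) = 1/alpha + (M*alpha/2) * sum_{i=2..k} 1/i."""
    return 1.0 / alpha + 0.5 * M * alpha * sum(1.0 / i for i in range(2, k + 1))
```

With α = 1 and M = 2 (a unit-mean exponential) this is the harmonic number H_k; with M = 1.5 (a unit-mean Erlang-2) and k = 2 it gives 1.375.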
2.4 Accuracy of the Approximation
A pilot assessment of the accuracy of the approximation described in the previous section compared it against simulations of the maxima of a number N of identical random variables of two types: Erlang and Pareto. The simulations were run 100,000 times, giving 98% confidence bands of the order of 0.01. Each test distribution was standardised to have unit mean, so that the approximate mean-maximum is determined solely by the second moment. Notice that, even when the variance is zero, the second moment is the square of the mean, viz. 1. Consequently, the approximation's estimate will always diverge as the number of parallel random variables maximised increases. Thus, for N deterministic random variables, here each equal to 1 with probability 1, the exact mean-maximum is 1 whereas the approximation diverges to infinity with N. Thus, the approximation is not appropriate for small variances.

This is illustrated in Table 1, where the approximation is tested for Erlang-2, Erlang-3 and Erlang-4 distributions. The mean of a k-phase Erlang-k distribution with parameter λ is k/λ, and so we choose λ = k. The variance is therefore k/λ² = 1/k, which tends to zero as k → ∞. Thus the approximation deteriorates at larger k, as we see from Table 1. The second moment of the k-phase Erlang is 1 + 1/k, and we see a 36% error for 16 parallel Erlang-4 random variables. Each of these has variance 0.25, and so we see poor agreement at moderately small variances for more than 8 parallel random variables – all overestimates, as expected. However, for up to 4 in parallel, the accuracy is quite acceptable; this case arises in reads from mirrored disks and RAID accesses with small numbers of blocks. Also included in each row of the table is the mean of the maximum of N parallel exponential random variables, each with unit parameter.
By Corollary 1, this is just the Nth harmonic number, and it can be seen that it overestimates seriously; more than double the error in its best case of 16 parallel Erlang-4 distributions.

Table 1. Comparison with Erlang (low-variance)

  N  Exp-1   Erlang-2              Erlang-3              Erlang-4
            Mod    Sim    % err    Mod    Sim    % err    Mod    Sim    % err
  1  1.000  1.000  1.003  -0.334   1.000  0.999   0.062   1.000  0.999   0.060
  2  1.500  1.375  1.373   0.135   1.313  1.271   3.281   1.281  1.195   7.207
  4  2.083  1.813  1.772   2.265   1.677  1.546   8.448   1.609  1.380  16.64
  8  2.718  2.288  2.182   4.881   2.074  1.806  14.84    1.966  1.555  26.43
 16  3.381  2.786  2.588   7.648   2.488  2.061  20.74    2.339  1.716  36.30
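The simulated Erlang columns can be reproduced with a simple Monte Carlo sketch (our own code, not the authors' simulator); a unit-mean Erlang-k variable is the sum of k exponentials of rate k:

```python
import random

def sim_mean_max_erlang(k_phases, n_par, reps=100_000, seed=1):
    """Monte Carlo estimate of the mean of the maximum of n_par i.i.d.
    unit-mean Erlang-k variables (each the sum of k exponentials of rate k)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(
            sum(rng.expovariate(k_phases) for _ in range(k_phases))
            for _ in range(n_par)
        )
    return total / reps
```

With k_phases = 2 and n_par = 2, the estimate lands near the 1.373 simulation entry of Table 1.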
However, in practice, waiting times in queues tend not to have very low variance – they would perhaps be easier to predict if they did. Consequently, we tested the accuracy of the approximation near the opposite extreme, against high-variance, heavy-tailed Pareto distributions. Again these were chosen to have unit mean and zero distribution function at the origin. The form of the distributions chosen is FP(x) = 1 − α(x + γ)^−β, where β > 2 for the first two moments to be finite. In order to pass through the origin and have unit mean, we require
α = γ^β and γ = β − 1. This gives a second moment M2 = 2 + 2/(β − 2), which we use to parameterise the approximation. We call a Pareto distribution with these properties Pareto-β and compare our approximation with simulation for the mean-maximum of Pareto-4 and Pareto-5 random variables; see Table 2.

Table 2. Comparison with Pareto (high-variance)

  N  Exp-1   Pareto-4              Pareto-5
            Mod    Sim    % err    Mod    Sim    % err
  1  1.000  1.000  1.004  -0.381   1.000  0.994  0.614
  2  1.500  1.750  1.579  10.82    1.667  1.567  6.350
  4  2.083  2.625  2.327  12.81    2.444  2.269  7.744
  8  2.718  3.577  3.261   9.698   3.290  3.129  5.173
 16  3.381  4.571  4.394   4.027   4.174  4.153  0.512
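The Pareto-β family above can be sampled by inverting its distribution function; the sketch below (helper names are ours) draws unit-mean samples and exposes the stated second-moment formula M2 = 2 + 2/(β − 2):

```python
import random

def pareto_beta_sample(beta, rng):
    """One draw from F(x) = 1 - gamma^beta * (x + gamma)^(-beta) with
    gamma = beta - 1: unit mean, F(0) = 0.  Inverse-CDF method."""
    g = beta - 1.0
    u = 1.0 - rng.random()      # u in (0, 1] plays the role of 1 - F(x)
    return g * (u ** (-1.0 / beta) - 1.0)

def pareto_second_moment(beta):
    """Second moment M2 = 2 + 2/(beta - 2), finite for beta > 2."""
    return 2.0 + 2.0 / (beta - 2.0)
```

For Pareto-4 this gives M2 = 3 and for Pareto-5 M2 = 8/3, and sample means converge to 1.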
It can be seen that the agreement is much better here than for the low-variance cases. In fact the approximation is at its worst for moderately small numbers in parallel (N), improving as N reaches 16. As expected, the approximation improves as the parameter β increases, giving a lower variance closer to that of the exponential, viz. 1. The exponential mean-maximum values are repeated in this table and show underestimates, again as expected since the Pareto second moments are greater than that of an exponential random variable with mean 1, viz. 2. This indicates a degree of flexibility in the new approximation. This preliminary validation of the approximation suggests at the very least that many mean-maxima of waiting times will be well approximated by the recurrence of the previous section. Recall too that, when these waiting times are exponential, the recurrence is exact. Indeed, if the waiting times are phase-type, the mean of their maximum can also be computed exactly, the maximum also being phase-type. This calculation has exponential complexity in N, but an efficient polynomial approximation was obtained in [2]. This could be used in cases where ours is too inaccurate, for example low-variance Erlang distributions, a special case of phase-type.
3 RAID Storage System

3.1 General Description
A RAID storage system consists of a disk system manager and a collection (array) of independent disks. The disk system manager is a software component of the RAID controller. It receives requests from the multiple system users; these are considered logical requests, and may arrive from different users at various rates. The disk system manager subdivides the data into blocks called stripe units and distributes them across the collection of disks. Consequently, for each logical request, it generates a number of physical requests and sends them to the associated disks. Each disk of the array receives requests at a different rate λi. Finally, the disk system manager waits for the (physical) responses from each
requested disk to construct the (logical) response to each logical request, which it then sends to the corresponding user. The request subdivision-distribution process is performed according to the data/redundancy pattern over the disks. There are various data placement schemes [5,6] which, added to the requests' independent executions on such asynchronous disks, introduce fork-join problems of the type we have described and analysed. We just apply our model here to RAID0-1, as an application of our mean-maximum approach to synchronisation. RAID3 and RAID5 can be modelled similarly. The model we develop uses the notation in Table 3. Table 3. Notation for the parameters of the RAID models (j = 0-1 here)
Parameter  Description

N        The number of disks in the storage system.
C        The number of cylinders on a disk.
B        The logical request size in terms of transfer blocks.
Qi       The waiting or queuing time at disk i.
Si       The seek time on a disk i.
Ri       The rotational latency to move from a random point to the target block.
RMAX     The full disk rotation time.
t        The transfer time of one disk data-block from or to a cylinder.
T        The bus transfer time of one disk data-block.
λ        The logical request arrival rate to the storage system.
pi       The probability that disk i is used by a given physical request.
λi       The physical request arrival rate to disk i.
λRj      The physical request arrival rate to the RAIDj area.
λiRj     The physical request arrival rate to a RAIDj area on disk i.
praidj   The proportion of the RAIDj area in the whole storage system space.
Zr(i)    The response time for a read request on disk i.
Zw(i)    The response time for a write request on disk i.
Zr       The mean response time for read requests in the storage system.
Zw       The mean response time for write requests in the storage system.
Z        The mean response time for any request in the storage system.
pw       The probability that a request is a write.
pr       The probability that a request is a read.
ps       The probability that a request's access is sequential.

3.2 Mean Response Times
Each disk is modelled by an M/G/1 queue of physical requests. It serves tasks comprising both read/write requests and parity pre-read/update requests. Each physical request relates to one block of data and leads to a disk access in read
or write mode. The response time of each physical request is composed of four components: the waiting or queueing time in the disk queue (Q), the seek time (S), the rotational latency (R) and the transfer time, which we separate into two components (T and t).

– Queueing time, Qi
Since each disk is modelled by an M/G/1 queue, the mean queueing time is calculated using the Pollaczek-Khinchine formula [8], extended to handle multiple classes:

E[Qi] = Σ_{j=1,5} λiRj E[X²iRj] / (2(1 − ρi))    (9)
where, referring to Table 3,
• Xi = Si + Ri is the head displacement time random variable for a RAID request on disk i, with mean E[Xi] and second moment E[Xi²];
• ρi = λi E[Xi] is the traffic intensity on disk i.
These quantities depend on the moments of Xi and hence on those of Si and Ri. They are determined in the next subsection. Notice that a write generates extra traffic because of the additional I-O transfers (mirror writes), which is reflected in the value of λi.

– Seek time, Si
The seek time depends on the distance D between the current position of the device's read/write head and the target position. It is commonly calculated – for the hardware considered – according to [4] as follows:
S = 0 if D = 0;  S = a + b√D otherwise    (10)
where a, b are hardware-related constants. We assume that the incoming logical requests' addresses are independent random variables, uniformly distributed over the disk-address space. The distance D can then be well approximated by a continuous random variable with density function

fD(x) = ps δ(x) + (1 − ps) · 2(C − x)/(C − 1)²    (0 ≤ x ≤ C − 1)    (11)
The term 2(C − x)/(C − 1)² is the probability density function of the difference between two uniform random variables on [0, C − 1], where C is the number of cylinders on disk i and δ(x) is the Dirac delta-function (unit impulse). For simplicity, we have assumed that all disks have the same hardware parameters, including a, b, C. However, it would be easy to extend our model to heterogeneous devices. The quantity ps is the probability that a given physical request addresses the same track as the previous one, i.e. requires no seek. It is a workload parameter that we estimate by 1/C, consistent with our assumption of uniformity. Its effect is therefore negligible here.
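A quick way to check the seek-time mean implied by density (11) is Monte Carlo: sample D as the absolute difference of two uniform cylinder positions (the triangular law behind the density, ignoring the ps atom at zero and the small C versus C − 1 end effect) and compare with the closed form stated later in the text. This sketch uses our own function names:

```python
import math
import random

def mean_seek_formula(a, b, C, ps=0.0):
    """Closed form from the text: (1 - ps) * (a + 8*b*sqrt(C-1)/15)."""
    return (1.0 - ps) * (a + 8.0 * b * math.sqrt(C - 1) / 15.0)

def mean_seek_mc(a, b, C, reps=100_000, seed=3):
    """Monte Carlo: D = |difference of two uniform cylinder positions|,
    seek time S = a + b*sqrt(D); the no-seek atom (prob. ps) is omitted."""
    rng = random.Random(seed)
    tot = 0.0
    for _ in range(reps):
        d = abs(rng.uniform(0.0, C - 1.0) - rng.uniform(0.0, C - 1.0))
        tot += a + b * math.sqrt(d)
    return tot / reps
```

With the disk parameters used in section 4 (a = 3, b = 0.5, C = 1200) both agree at roughly 12.2 ms.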
– Rotational latency, Ri
The rotational latency is assumed to be a random variable with uniform distribution on the interval [0, RMAX], with density function

fR(x) = 1/RMAX    (0 ≤ x ≤ RMAX)    (12)
for all disks i.

– Single block bus transfer time, T
Assuming negligible contention on buses, the block transfer time is a constant T, denoting the sum of the transfer time of the device (from the disk buffer to the bus) and the bus transfer time (on the bus connecting the disk to the disk system manager).

– Single block disk transfer time, t
This is the time it takes to transfer one block to or from a cylinder. It is very small, of the order of one tenth of a millisecond.¹

Mean values. From the above probability density functions, we calculate the following by direct integration:

E[Xi] = E[Si] + E[Ri] + E[Ki] t
E[Si] = (1 − ps)(a + (8b/15)√(C − 1))
E[Ri] = RMAX/2

where Ki is the random variable denoting the number of blocks in a physical request, a workload parameter. Since the quantity t is so small, the contribution of Ki is negligible and we take it to be 1 with probability 1. A more sophisticated approximation could be obtained by workload profiling, and would be relevant for very large block sizes.

Higher moments. We denote the nth moment of a random variable X by E[Xⁿ], cf. the mean values used above. To approximate the mean of the maximum of non-exponential random variables by the method of section 2.3 requires the second moment of the queueing time, E[Qi²]. As we shall see, this requires the third moments of Si and Ri. In an M/G/1 queue with arrival rate Λ, service time random variable X with distribution function X(t) and Laplace-Stieltjes transform (LST) X*(θ), the queueing time Q has distribution function with LST given by (see [8], for example):

(θ − Λ(1 − X*(θ))) Q*(θ) = (1 − ρ)θ
¹ This is the case for 5400 rpm disks; it is smaller still for modern disks of around 15000 rpm.
Differentiating twice with respect to θ and setting θ = 0 gives E[Q] = Λ E[X²]/(2(1 − ρ)), leading to equation (9). Differentiating thrice at θ = 0 gives the required second moment:

E[Q²] = Λ² E[X²]² / (2(1 − ρ)²) + Λ E[X³] / (3(1 − ρ))

We already have the mean value E[Xi] and now need to calculate the corresponding results for the second and third moments. First we consider the random variable Xi = Si + Ri and calculate (using the independence of Si and Ri):

E[Xi²] = E[Si²] + E[Ri²] + 2 E[Si] E[Ri]
E[Xi³] = E[Si³] + E[Ri³] + 3 E[Si²] E[Ri] + 3 E[Si] E[Ri²]

It remains to calculate the second and third moments of Ri and Si, the first moments being as given in the previous subsection. For the uniform random variable Ri ∈ [0, RMAX], we have

E[Ri²] = RMAX²/3,  E[Ri³] = RMAX³/4
The nth moment of Si is (2(1 − ps)/(C − 1)²) ∫₀^{C−1} (C − x)(a + b x^{1/2})ⁿ dx, giving after some manipulation:

E[Si²] = (1 − ps)[a² + (16/15) ab √(C − 1) + (1/3) b²(C − 1)]
E[Si³] = (1 − ps)[a³ + (8/5) a²b √(C − 1) + ab²(C − 1) + (8/35) b³(C − 1)√(C − 1)]
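The two M/G/1 queueing-time moments used above can be packaged as a helper (naming is ours). As a sanity check, an M/M/1 queue with Λ = 0.5 and unit-rate exponential service (E[X²] = 2, E[X³] = 6) gives E[Q] = 1 and E[Q²] = 4:

```python
def mg1_queueing_moments(lam, x1, x2, x3):
    """First two moments of the M/G/1 queueing (waiting) time:
       E[Q]   = lam*x2 / (2*(1 - rho))
       E[Q^2] = lam^2*x2^2 / (2*(1 - rho)^2) + lam*x3 / (3*(1 - rho))
    where x1, x2, x3 are the first three service-time moments and
    rho = lam*x1 must be below 1 for stability."""
    rho = lam * x1
    if rho >= 1.0:
        raise ValueError("unstable queue: rho >= 1")
    q1 = lam * x2 / (2.0 * (1.0 - rho))
    q2 = lam**2 * x2**2 / (2.0 * (1.0 - rho)**2) + lam * x3 / (3.0 * (1.0 - rho))
    return q1, q2
```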
3.3 Mean Response Time on RAID0-1
In the RAID0-1 organisation, both shadowing (full redundancy) and striping are used. The disk collection is divided into two groups: native disks and mirror disks, which are both subdivided into stripe units. All data is duplicated and distributed on both the native disks and the mirror disks. A read physical request is sent to the native or to the mirror disk, while a write physical request is sent to both of them in order to maintain the native and mirror data coherency.

One-block read requests. The choice of the target disk can be made according to one of many policies, such as random, the shortest queue or the smallest seek. In our study, we consider random disk selection. The response time and its average for a read request on disk i among the N comprising the RAID are formulated as:

Zr(i) = Qi + Si + Ri + T    (13)

E[Zr(i)] = Σ_{j=1,5} λiRj E[X²iRj] / (2(1 − ρi)) + (1 − ps)[a + (8b/15)√(C − 1)] + RMAX/2 + T    (14)
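Equation (14) can be evaluated directly; the sketch below (our helper, with E[Qi] passed in rather than computed) plugs in the section-4 disk parameters (C = 1200, a = 3 ms, b = 0.5, RMAX = 16.7 ms, T = 1.34 ms) to give a zero-load one-block read time of roughly 21.9 ms:

```python
import math

def one_block_read_mean(EQ, a, b, C, RMAX, T, ps=None):
    """Equation (14): E[Q_i] + mean seek + mean rotational latency + T.
    ps defaults to the text's uniformity estimate 1/C."""
    if ps is None:
        ps = 1.0 / C
    seek = (1.0 - ps) * (a + 8.0 * b * math.sqrt(C - 1) / 15.0)
    return EQ + seek + RMAX / 2.0 + T
```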
Multiple-block read requests. The response time of a multiple-block logical read is the maximum of the physical requests' response times on each disk in the set requested. A logical read is achieved when its last associated physical request is finished. The response time of such a B-block read request is therefore estimated by the expression:

Zr = max_{i=1}^{k} (Qi + Si + Ri) + nb bloc × T

where nb bloc = B/k and k = B if B < N, k = N otherwise. This assumes that each disk transfers the same number of blocks; the error is less than a one-block transfer time and the relative error approaches zero at large B. The mean response time may now be approximated, using the method of section 2.3, by:

Zr = I(k, α, M) + nb bloc × T

where, for 1 ≤ i ≤ k,

αi = 1/(E[Qi] + E[Si] + E[Ri]);  Mi = E[Qi²] + E[Si²] + E[Ri²] + 2E[Qi]E[Si] + 2E[Si]E[Ri] + 2E[Ri]E[Qi]
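For identical disks, the B-block read estimate combines the identical-parameter form of I(k, α, M) with the block count; a sketch with our own naming (we take nb_bloc as the ceiling of B/k, an assumption for non-divisible B):

```python
def multi_block_read_mean(B, N, alpha, M, T):
    """Mean B-block read time for identical disks:
    k = B if B < N else N; nb_bloc = ceil(B/k) (ceiling assumed here);
    Zr = I(k, alpha, M) + nb_bloc*T, with the identical-parameter
    approximation I(k) = 1/alpha + (M*alpha/2) * sum_{i=2..k} 1/i."""
    k = B if B < N else N
    nb_bloc = -(-B // k)          # ceil(B / k)
    I = 1.0 / alpha + 0.5 * M * alpha * sum(1.0 / i for i in range(2, k + 1))
    return I + nb_bloc * T
```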
This approximation is exact for exponential response times in the individual queues, as in the very special case of M/M/1 queues. The general shape of the distribution could be expected to approximate an exponential in some systems, there being 'more small than large blocks'. However, the exponential result (Corollary 1) was found to give serious overestimations, and the use of the above approximation caused a dramatic improvement. This is because the coefficients of variation of disk seek time, rotational latency and, hence, queueing time are significantly less than unity. As we saw in section 2.4, Table 1, the overestimate of the exponential assumption is reduced by more than half by the approximation that takes the second moment into account. This approximation is therefore what we used in our empirical studies of section 4.

One-block write requests. The disk system manager sends the one-block write request to both the native data disk and the mirror disk, as shown in figure 1. The response time of the one-block logical write is the maximum of the physical requests' response times on each disk of the pair, i.e.

Zw = max_{i=1,2} (Qi + Si + Ri) + T
The average response time is then estimated by:

Zw = I(2, α, M) + T

Multiple-block write requests. According to the execution schedule shown in the task graph (figure 1), the response time of a B-block write request is:

Zw = max_{i=1}^{k} (Qi + Si + Ri) + nb bloc × T
with mean value
Zw = I(k, α, M) + nb bloc × T

where nb bloc = 2B/N and k = 2B if B ≤ N/2, k = N otherwise. The mean response time
Fig. 1. RAID0-1 write request tasks graph
for both reads and writes depends on the probability pi and λiR1 (referring to Table 3): pi = 1/N and λiR1 = B λR1 pi (pr + 2pw), where λR1 = λ praid1. We notice that only the first physical request out of the nb bloc on any disk i leads to a seek and a rotational latency. Consequently, we redefine as follows:

Sraid1,i := Si / nb bloc  and  Rraid1,i := Ri / nb bloc    (15)

Hence we obtain the mean overall response time:

Z = pr Zr + pw Zw
4 Results and Discussion
In order to validate our model and assess its accuracy, we have developed a detailed event-driven simulator. This simulator is written in C and composed of three main parts. The first part is a logical request generator, which uses standard random number generation functions to produce inter-request arrival times with arbitrary probability distributions. The second part is a logical to physical mapping, which contains all the physical request generation functions, and the third part is the simulation engine, which schedules the execution of physical
requests on (operational abstractions of) the disks, as specified in section 3, and manages synchronisation. We obtained the hardware parameters from a library, which we separated from the execution routines in order to enhance the flexibility and the scalability of the simulator. We generated workloads with different mean logical request sizes (measured in blocks of 4KB each), using sizes of 1 and 4 blocks to represent small and medium requests. It would also be interesting to use bigger sizes (going up to 250 blocks) to represent medium and large requests; in fact, the upper bound is 1MB for the large requests observed in image applications. Concerning the balance between reads and writes in the workload, we generated model inputs with three ratios: 25% of reads for write-oriented workloads, 75% of reads for read-oriented workloads and 100% of reads for exclusively read workloads. Last, for the results presented in this paper, we used an array of 16 disks. The characteristics of the disks we used are: number of cylinders C = 1200; full rotation time RMAX = 16.7 ms; number of blocks per track (bpt) = 12; acceleration time a = 3 ms; seek factor b = 0.5; and one-block transfer time T = 1.34 ms. We chose this parameterisation in order to compare our results with those in [4]. Any modifications needed for testing more modern disks are straightforward. Notice that we simulate the operation of a real RAID system, not the analytical model as detailed in the previous section. In particular, all service times are taken from the operational characteristics of the system, but we do assume external Poisson arrivals of the logical requests.
[Figure 2 plots response time (ms), from 0 to 250 ms, against the logical arrival rate λ (req/s), from 0 to 800 req/s, for N = 16 disks, RAID0-1, B = 1; it shows analytical (ANA) and simulation (SIM) curves for read probabilities pr = 1, 0.75 and 0.25.]
Fig. 2. Analytical vs simulation response time (small requests, RAID0-1)
Fig. 3. Analytical vs simulation response time (medium requests, RAID0-1)
Simulations were run for a warm-up period of 300,000 logical requests to allow the system to reach a stable state. They were then run for a further 700,000 logical requests, during which the measurements concerning response time were gathered. The confidence bands are not shown here, but the regions of good and bad agreement between the simulation and the analytical model will be apparent. Figures 2 and 3 compare the mean response time predicted by the analytical model with that obtained by simulation for small and medium block sizes respectively. These figures are chosen to highlight the impact of a specific parameter (read/write ratio, request size) on the response time. Figure 2 illustrates the effect of the workload's read/write ratio in a small-request environment; the mean logical request size is one block (B = 1). The model and the simulation response times show good agreement. Comparing with figure 3, we can see how the system behaves in small vs. medium request size RAID0-1 environments. We deduce that the workload thresholds decrease considerably with the increase in the mean request size. In fact, the same effect is observed for RAID5.
5 Conclusion
We have developed quite intricate analytical models, based on queueing theory, that take into account the detailed principles of operation of synchronised fork-join operations. We applied our methodology to the RAID0-1 disk storage management system, one of the main ones currently in popular use. Analytical
results were compared with simulation at a very fine level of abstraction and showed good agreement at low-medium loads. In addition, the model mostly predicted the onset of saturation well, i.e. the loading above which response time grows rapidly to unacceptable levels where poor quality of service ensues. In the calculation of the mean of the maximum of an independent set of random variables, in general, the parameters αi, Mi are distinct. In this study we assumed equal parameters, giving a simple non-recursive result, but this requires a controlled experiment to ensure that all the workload parameters are the same at every disk. In fact, pi is particularly sensitive to workload variations, influencing the arrival rate at disk i. An optimisation of the general calculation is the subject of work in progress; equation (6) provides a starting point when it is possible to group a large collection of disks into subsets which are almost homogeneous in terms of both hardware specification and loading. It is also important to further consider the extent of the error in the mean-maximum calculation. This could be done by further simulation for various constituent distributions but, especially for low variances, a comparison against exact results in the phase-type case could be carried out, using the method of [2], cf. section 2.4. Finally, we are extending the study to a complete dynamic and heterogeneous storage system, dealing with the layout schemes and reconfiguration for a RAID scheme that adapts to its incoming workload. We will then be able to evaluate the overheads of the related data migration and communications.
References

1. The RAID Advisory Board. The RAIDbook: A Source Book for RAID Technology. Lino Lakes, MN, June 1993.
2. H. Bohnenkamp and B. Haverkort. The mean value of the maximum. In Proc. PAPM/PROBMIV 2002, Lecture Notes in Computer Science 2399, Springer-Verlag, pp. 37–56, 2002.
3. S. Chen. Design, modeling and evaluation of high performance. Ph.D. thesis, University of Massachusetts, USA, September 1992.
4. S. Chen and D. Towsley. A performance evaluation of RAID architecture. IEEE Transactions on Computers, January 1997.
5. D. A. Patterson, G. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the SIGMOD Conference, June 1988.
6. D. A. Patterson, P. M. Chen, G. Gibson, and R. H. Katz. Introduction to redundant arrays of inexpensive disks (RAID). In IEEE COMPCON, 1989.
7. A. Gravey. A simple construction of an upper bound for the mean of the maximum of n identically distributed random variables. Journal of Applied Probability, 22:844–865, 1985.
8. Ng Chee Hock. Queueing Modelling Fundamentals. John Wiley, 1996.
9. E. K. Lee and R. H. Katz. An analytic performance model of disk arrays and its application. Technical Report UCB/CSD 92/660, November 1991.
10. M. Hamilton, I. Mitrani, and P. McKee. Distributed systems with different degrees of multicasting. In Proceedings of WOSP 2002, 3rd International Workshop on Software and Performance, July 2002.
11. S. Zertal. Dynamic redundancy mechanisms for storage customisation on multi-disk storage systems. Ph.D. thesis, University of Versailles, France, January 2000.
Heuristic Optimization of Scheduling and Allocation for Distributed Systems with Soft Deadlines

Tao Zheng and Murray Woodside

Dept. of Systems and Computer Engineering, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, Canada K1S 5B6
{zhengtao,cmw}@sce.carleton.ca

Abstract. This paper studies optimal deployment and priorities for a class of distributed real-time systems which have complex server tasks, many concurrent scenarios, operations with deterministic or stochastic execution demands, arbitrary precedence between operations, and hard or soft deadline requirements. The soft deadlines take the form of a required percentage of responses falling within the deadline. This work improves on an earlier optimization approach which was only applied to hard deadlines. As before, heuristic measures derived from solutions of layered queueing models are used to guide step-by-step improvement of the priorities and allocation, searching for a feasible solution which meets the soft deadline requirements. Effectiveness is demonstrated on a range of examples including thousands of individual cases.
1 Introduction

Performance specifications in many systems take the form of a soft deadline for each class s of responses, which execute a scenario also labeled s. A soft deadline is defined here as a requirement that the response time Rs for scenario s should satisfy a deadline Ds with some probability αs. This may be written as:

Ps = Prob(Rs > Ds) ≤ 1 − αs
(1)
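Reading requirement (1) as a bound on the miss probability, a direct empirical check on measured or simulated response times is straightforward (an illustrative helper, not part of the paper's toolchain):

```python
def meets_soft_deadline(samples, D, alpha):
    """Check Prob(R > D) <= 1 - alpha on a list of response times.
    alpha = 1 corresponds to a hard deadline (no misses allowed).
    Returns (feasible, observed miss probability)."""
    misses = sum(1 for r in samples if r > D)
    p_miss = misses / len(samples)
    return p_miss <= 1.0 - alpha, p_miss
```

For instance, 3 misses out of 100 responses satisfies a 95% requirement but not a 98% one.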
A zero value for 1 − αs defines a hard deadline; a value greater than zero, a soft deadline. This work describes a method for adjusting the priorities and allocations of the tasks in a distributed system to satisfy such requirements. Unlike most work that simultaneously addresses priorities, allocation and deadlines, this work considers systems with some degree of randomness in their CPU demands, and with a complex task structure as described in [4]. The structure can include layers of servers to satisfy parts of the responses, with contention for service. Such systems and requirements are common in telecommunications systems, and in business systems. Examples include directory servers and proxy servers with layered structure, and e-commerce servers. A great deal of work has been done on systems with hard deadlines and a flat task structure. Priority assignment and task (process) allocation were identified as two important issues in meeting deadline requirements for hard real-time systems deployed on multiprocessors. Unfortunately, it has been shown that the problem of assigning

P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 169–181, 2003.
© Springer-Verlag Berlin Heidelberg 2003
priority to end-to-end tasks [1], which have a chain of subtasks in a distributed system, and the problem of allocating tasks to processors [11] are both NP-hard problems. Efficient optimal solutions are not likely available, and heuristic algorithms must be created to find feasible solutions. In previous related work, Tindell, Burns and Wellings used simulated annealing to find priority assignments and task allocations at the same time [19]. Garcia and Gonzalez Harbour proposed a Heuristic Optimized Priority Assignment (HOPA) algorithm which schedules transactions consisting of a chain of actions (subtasks) in a distributed system, to meet end-to-end hard deadlines [8]. Peng, Shin et al. proposed two branch-and-bound algorithms to allocate periodic tasks with precedence constraints to minimize the maximum response time [13]. Hou and Shin proposed an algorithm to find an assignment which maximizes the probability of meeting deadlines, and also to schedule the tasks, called the Module Allocation Algorithm [10]. To evaluate Ps for soft deadlines and stochastic execution patterns, the distribution of the response time Rs must be found. Dingle, Harrison and Knottenbelt presented a technique for the numerical determination of response time densities in Generalized Stochastic Petri Net (GSPN) models [2]. Simulation methods can also be used, although they are more expensive in run-time. El-Sayed et al in [4] considered the same problem, but for hard deadlines only. That paper described a heuristic optimization technique called the Planner, which used measures computed from a layered performance model to identify promising moves in the priority values and the allocations. The Planner met or surpassed other algorithms on test cases from the literature, and solved a large number of randomly generated problems of moderate size (with 16 tasks in four layers). The Planner, like this work, used simulation to determine Ps.
Its application to soft deadlines was proposed in [4] but was not attempted. The present work improves on the Planner in several ways, to be more effective on hard deadlines. The new version, called Planner2, is then evaluated on many small and large systems with hard and soft deadlines. On a large set of randomly generated stochastic systems, it was used to discover a property or characterization which indicates cases which are feasible for soft-deadline schedulability. The characterization takes account of the average processor utilizations, a latency factor, and the coefficient of variation of the execution demands of the tasks. The latency requirement increases with the variance of the demands. This paper is organized as follows. Section 2 briefly introduces the Layered Queueing Networks (LQN) model which is used as the performance model in this paper. Section 3 provides the optimization algorithm of the Planner2 based on the LQN simulation results and shows how it works. Section 4 applies the optimization approach to soft real-time systems with stochastic execution demands. Section 5 gives the conclusions.
2 The Layered Queueing Networks (LQN) Model The layered queueing networks (LQN) model, presented by Woodside et al and others, is a performance model for systems with distributed software servers
Heuristic Optimization of Scheduling and Allocation for Distributed Systems
[5][6][17][20][22]. It extends queueing networks to model software servers and logical resources in a canonical way, including hardware devices, software processes, nested services, precedence constraints and multithreaded tasks.
[Figure 1: the example LQN model. Environment tasks Env1 and Env2 (on processors EnvP1 and EnvP2) drive scenarios through tasks TaskA and TaskB on CPU1 and TaskC, TaskD and TaskE on CPU2, via entries en_a through en_e. The legend distinguishes tasks (processes in execution), entries (service ports), activities (units of execution), CPUs, and synchronous, asynchronous and forwarding calls; tables in the figure give the execution demands, periods and deadlines, initial priorities, and network delay. A note on the figure states that the deadlines cannot be met by priority adjustment alone, but are met when TaskE is reallocated to CPU1.]

Fig. 1. An Example of an LQN model
In an LQN model, a task represents a software or hardware object which may execute concurrently, such as the tasks in a real-time system. A task has one or more entries which define its different services and are equivalent to classes in a queueing network. When a single-threaded task is busy serving a request to one entry, it cannot serve any other requests. An entry consists of some activities or phases which are the
T. Zheng and M. Woodside
smallest execution units. Activities may have arbitrary precedence relationships (e.g. AND fork or join, OR fork or join) [7]. LQN tasks, entries, activities and interactions provide a description which is quite close to software architecture models. Figure 1 shows an example LQN model, with parallelograms representing tasks, rectangles on their interfaces for entries, internal rectangles for activities and their precedence, and arrows to denote interactions. Three types of interactions are described in LQNs: a synchronous call (shown in diagrams by a solid line with a filled arrowhead), an asynchronous call (a solid line with an open arrowhead), and a forwarding call (a dashed line with a filled arrowhead). Both analytic and simulation solvers may be useful. In this paper, all the results are produced by an LQN simulation tool. The 95% confidence interval for every result is no wider than ±10%, meaning that all the results should be accurate within ±10% with 95% confidence. In most cases the actual results are much more accurate, of the order of 1%. There are several ways to create LQN models. Petriu and Shen proposed a method to derive an LQN model from a Unified Modelling Language (UML) design model of the software, using the UML Profile for Schedulability, Performance and Time [16]. Petriu and Woodside described an algorithm to transform Use Case Maps (UCM) scenario models into LQN performance models [15]. El-Sayed generated LQN models and also guided the optimization from a model in a proprietary scenario language which is no longer available [3] (the lack of the inputs needed to use the Planner was one of the motivations behind this work). Figure 1 is an example LQN model with two scenarios, both of which require 95% of the responses to meet the deadline. There are five tasks running on two CPUs, joined by a network (which is not shown). The activities and the network delays are stochastic, with exponentially distributed delays. TaskE is initially allocated on CPU2.
The initial allocation and priorities are infeasible. Because there is higher communication cost between TaskA (on CPU1) and TaskE (3 calls per request) than between TaskD (on CPU2) and TaskE (1 call per request), the Planner2 reallocates TaskE to CPU1, giving a feasible solution.
3 The Optimization Algorithm

The goal of Planner2 can be stated as the minimization of a penalty function V(A,P) over a set A of task allocations and a set P of priorities for the tasks:

    V(A,P) = Σs Criticalitys · fs(Ps)        (2)

    where fs(Ps) = 0 for Ps ≤ αs, and fs(Ps) = 1 for Ps > αs,

in which Ps is the probability of missing the deadline in scenario s and αs is its permitted miss rate.
V(A,P) is defined to be zero if the performance goal is met for all scenarios, and positive otherwise. Briefly, A and P are initialized using heuristic algorithms, and the model of the system is solved. A set of scoring functions is computed to rank the scenarios, tasks and processors according to how they will contribute to improvement. These scores
are used to select a change in priority or allocation. The search succeeds if V(A,P) = 0, or fails if V(A,P) > 0 and no step can be found that gives an improvement. The original Planner is described in [3][4]; Planner2 follows the same outline with changes in important details. For reasons of space, only the modified Planner2 algorithm and its changes will be described.
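As a concrete sketch of Eq. (2), the penalty can be computed as a weighted unit-step sum (our illustration, not the authors' code; it assumes per-scenario permitted miss rates αs and optional Criticality weights are given as lists):

```python
def penalty(miss_rates, thresholds, criticality=None):
    """Unit-step penalty V(A,P): each scenario s whose observed miss
    probability Ps exceeds its permitted miss rate alpha_s contributes
    its Criticality_s weight; a feasible configuration scores zero."""
    if criticality is None:
        criticality = [1.0] * len(miss_rates)
    return sum(c for c, p, a in zip(criticality, miss_rates, thresholds)
               if p > a)
```

For example, a configuration with Ps = (0.0067, 0.0495) against αs = 0.05 for both scenarios is feasible and scores 0; one with Ps = (0.0929, 0.0444) scores 1.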
Algorithm
1. Initialize the configuration:
   a) Use the Multifit-Com algorithm [21] for the initial allocation;
   b) Give higher initial priority to tasks with fewer threads, on the same processor. Break ties by the Proportional-Deadline-Monotonic algorithm [18].
2. Check termination: use the V(A,P) metric defined by Eq. (2) to estimate the solution quality. If it is zero, go to 5. Otherwise go to 3.
3. Use the TaskWaitingMetric defined by Eq. (3) below to select the task with the largest value for a priority increase. Estimate the solution quality. If V(A,P) is zero, go to 5. Otherwise go to 4.
4. Restore the configuration which has the smallest V(A,P) metric so far. Use the ComGainMetric defined in Eq. (4) below to select the task with the largest value, and a new CPU for it to move to. Estimate the solution quality. If V(A,P) is zero, go to 5. If the failure conditions (on maximum total steps or on running out of alternatives to try) are satisfied, go to 6. Otherwise, go to 3.
5. Stop. A feasible solution has been found.
6. Failed.

The improvements over the Planner in [4] include:
• A different method for initializing the priorities, based on the number of threads (higher priority to a task with fewer threads), with ties broken by the Proportional-Deadline-Monotonic (PDM) algorithm. The original used PDM entirely. The change was found to improve the success rate of the optimization.
• The V(A,P) metric in Planner used e^Ps if Ps > αs, giving a step at Ps = αs whose height depends on αs; here it is always a unit step. This normalizes the components of V(A,P) and balances the importance of violations in all scenarios.
• Planner used a single scoring function TaskMetric to identify the best task for changes; it was the sum of two functions which are used separately in Planner2:

    TaskWaitingMetric(t) = (1/U(t)) Σs Criticalitys · W(s,t)        (3)
where U(t) is the utilization by task t of its processor, and W(s,t) is the total waiting time of task t in scenario s (these values are obtained from the LQN simulation report), and

    ComGainMetric(t, c) = Σ∀m∈nonLocalMsgs(c) CommOV(m) − Σ∀m∈LocalMsgs(t) CommOV(m)        (4)
where nonLocalMsgs(c) is the set of non-local physical messages of task t between CPU c and the local CPU of task t, LocalMsgs(t) is the set of local physical messages of task t, and CommOV(m) is the overhead caused by the message m, plus the change in network delay. In Planner2 the first function is used for priority adjustment, and the second for task re-allocation. • The priority levels of tasks are all forced to be distinct in Planner2, while equal priorities were allowed in [4]. Distinct priority levels decrease the sources of uncertainty in the response time. • The priority adjustment strategy is simple in the Planner: the priority of the task with the worst TaskMetric is raised one level. If the failure condition is met (i.e. the maximum step is reached), the priority adjustment stops. This strategy can hang up in a loop and fail to reach a better solution when one is available. To avoid this problem, the new priority adjustment strategy raises the priority level of the task with the largest TaskWaitingMetric by one level. If this priority combination occurred before (i.e., if a loop is found), then the priority level of this task is instead raised to the highest priority level among all the tasks. If the new priority combination occurred twice before, then the priority adjustment stops and Planner2 considers task reallocation. This strategy is heuristic, but it worked well in the evaluations of Section 4.1. The table shown in Figure 2 follows the steps in the algorithm for the model shown in Figure 1. Rs, Ps and V(A,P) are given for the two scenarios.
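The loop-avoiding priority-adjustment strategy just described can be sketched as follows (a simplified illustration under assumed interfaces: waiting_metric stands in for TaskWaitingMetric, and evaluate returns the V(A,P) penalty for a priority order on one processor):

```python
def adjust_priorities(order, waiting_metric, evaluate, max_steps=50):
    """Sketch of the Planner2 priority-adjustment strategy.
    `order` lists the tasks on one processor from highest to lowest
    priority; the raise-one-level move is tried first, a repeated
    configuration triggers a jump to the top priority, and a twice-
    repeated configuration ends the phase (task reallocation follows)."""
    order = list(order)
    history = [tuple(order)]
    for _ in range(max_steps):
        if evaluate(order) == 0:
            return order                       # feasible configuration found
        worst = max(order, key=waiting_metric)
        i = order.index(worst)
        if i == 0:
            return None                        # already at top priority
        candidate = order[:]
        candidate[i - 1], candidate[i] = candidate[i], candidate[i - 1]
        if tuple(candidate) in history:        # loop detected: jump to top
            candidate = order[:]
            candidate.insert(0, candidate.pop(i))
        if history.count(tuple(candidate)) >= 2:
            return None                        # looped twice: reallocate instead
        history.append(tuple(candidate))
        order = candidate
    return None
```

With two tasks where the second one has the larger waiting metric and feasibility requires it on top, the sketch promotes it and terminates in one step.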
4 Evaluation and Demonstration

Three sets of results will be described to demonstrate the success of the new version.

4.1 Evaluation of the New Priority Adjustment Strategies for Hard Deadlines

The new priority adjustment strategies were evolved partly to improve the optimization for hard deadlines. The improvement compared to [4] was evaluated by a battery of randomly generated cases with between two and eight independent periodic tasks on one processor. These test cases satisfy the requirements for rate monotonic scheduling [12], so the results could be compared with an exactly optimal solution. The fraction of the cases feasible under the exactly optimal solution for which Planner or Planner2 also finds a feasible solution is called its "success ratio" in the results shown in Figure 3 below. Fourteen sets of cases were constructed with different combinations of the number of tasks and the total utilization (the utilization is the sum, over the tasks, of CPU time divided by the period). 50 cases were generated for each combination, giving 700 cases in total. To increase the difficulty of these cases, the initial priorities of the tasks were set in the reverse of the order assigned by the rate monotonic algorithm (thus, a task with a shorter period was given a lower priority). The results from the original Planner algorithm [4] and the new priority adjustment strategies are compared in Figure 3.
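For the single-processor comparison above, exact feasibility of a fixed-priority assignment can be checked with the standard response-time iteration for independent periodic tasks with deadlines equal to their periods (classical fixed-priority analysis, not part of Planner2 itself):

```python
import math

def response_times(tasks):
    """Exact fixed-priority response-time analysis on one processor.
    `tasks` is a list of (C, T) pairs, highest priority first, where C is
    the CPU demand and T the period (== deadline).  Iterates
    R_i = C_i + sum_j ceil(R_i / T_j) * C_j over the higher-priority tasks
    until a fixed point; returns the response times, or None if any task
    misses its deadline."""
    R = []
    for i, (C, T) in enumerate(tasks):
        r, prev = C, 0.0
        while r != prev:
            prev = r
            r = C + sum(math.ceil(prev / Tj) * Cj for Cj, Tj in tasks[:i])
            if r > T:
                return None                 # deadline miss: unschedulable
        R.append(r)
    return R
```

For example, tasks (C, T) = (1, 4) and (2, 6) in rate-monotonic order yield response times 1 and 3, i.e. a feasible assignment, while (3, 4) and (3, 6) are unschedulable.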
Step | Candidate Task | Actions and States | Rs | Ps | V(A,P)
0 | – | Initial: CPU1: TaskA > TaskB; CPU2: TaskC > TaskD > TaskE | 33.781, 23.405 | 0.092917, 0.0444 | 1.903615
1 | TaskE | Raise priority of TaskE. CPU1: TaskA > TaskB; CPU2: TaskC > TaskE > TaskD | 27.953, 24.165 | 0.0294, 0.05095 | 1.014352
2 | TaskD | The priority combination occurred before; raise priority of TaskD to the highest level. CPU1: TaskA > TaskB; CPU2: TaskD > TaskC > TaskE | 33.801, 23.436 | 0.093402, 0.04505 | 1.917514
3 | TaskE | Raise priority of TaskE. CPU1: TaskA > TaskB; CPU2: TaskD > TaskE > TaskC | 29.015, 24.168 | 0.044317, 0.051117 | 1.016896
4 | TaskE | Raise priority of TaskE. CPU1: TaskA > TaskB; CPU2: TaskE > TaskD > TaskC | 25.031, 24.728 | 0.014767, 0.054333 | 1.067153
5 | TaskD, TaskE | The priority combination occurred before, and TaskD already has the highest priority, so it cannot be raised further. The best configuration so far (step 1) is restored; TaskE gains the most from reallocation to CPU1 with the highest priority. CPU1: TaskE > TaskA > TaskB; CPU2: TaskC > TaskD | 18.138, 25.225 | 0.0066833, 0.049467 | 0

Fig. 2. Optimization steps of LQN model in Figure 1

Number of Tasks | Utilization | Success Ratio % (Original [4]) | Success Ratio % (New)
2 | 0.82 | 100 | 100
2 | 0.90 | 100 | 100
3 | 0.77 | 100 | 100
3 | 0.90 | 100 | 100
4 | 0.75 | 98 | 100
4 | 0.90 | 91.67 | 100
5 | 0.74 | 94 | 100
5 | 0.90 | 80 | 100
6 | 0.73 | 98 | 100
6 | 0.90 | 40 | 100
7 | 0.72 | 96 | 100
7 | 0.90 | 42.86 | 100
8 | 0.72 | 94 | 100
8 | 0.90 | 33.33 | 100

Fig. 3. Test cases and results
The new strategies are a definite improvement. They found a feasible solution in every case which is feasible under rate monotonic scheduling, whereas the original algorithm had a significant number of failures, especially with larger task numbers and utilizations.

4.2 Evaluation of Planner2 with Soft Deadlines

The Planner2 was equally successful with random CPU demands and soft deadlines, and on more complex systems with layered servers. Two evaluations are described here: first a large set of randomly generated layered systems, and second an application case study with a realistic architecture.

4.2.1 Layered Randomly-Generated Cases with Stochastic Demands

Figure 4 shows an LQN introduced in [4] to demonstrate the robustness of optimization on hard real-time applications with deterministic CPU demands that were selected randomly. The evaluation is extended here to soft deadlines and stochastic execution demands. The main characteristics of the randomly generated parameters are:
• Every scenario has a fixed period and deadline, with

    Deadline = Demand × L        (6)

where L is the laxity factor, taking values between 1.9 and 6.
• The deadline requirement is that the deadline miss rate is no more than 5%.
• The average utilization of all the processors is adjusted to take a selected fixed value for a given case, chosen between 0.4 and 0.8.
• The coefficient of variation CV of the execution demand was fixed for all the tasks, taking values 0.0 (deterministic), 0.1, 0.5 or 1.0 (exponential).
There were altogether 240 different combinations of the coefficient of variation, laxity factor and utilization. 50 cases were generated for each combination, and all 12000 cases were optimized by the Planner2. Figure 5 shows some of the results. For each combination of utilization and CV, Figure 5 shows the minimum laxity factor value (Lmin) for which a feasible allocation and priority solution was found for all 50 cases.
We observe that:
• The required minimum laxity factor increases as the coefficient of variation increases, and as the utilization increases.
• The minimum laxity factor with a large coefficient of variation (e.g. 1.0) increases much faster with utilization than that with a small coefficient of variation (e.g. 0.0). This indicates the extreme difficulty of meeting deadline requirements with a large coefficient of variation and high utilization.
For laxity values below Lmin, the fraction of cases found to be infeasible is positive and increases as the laxity value decreases. The figure can be interpreted as a heuristic guideline for the feasibility of a set of soft deadlines based on a system's utilization and coefficient of variation. For example, with an average utilization of 0.6 and CV = 1.0, the laxity factor for all tasks should be at least 3.5. However, this guideline is only for a 95% success rate and provides no guarantees.
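To see why the required laxity grows with CV, even the contention-free lower bound on the miss rate already shows the effect: if the demand X has mean µ and the deadline is L·µ, then P(X > L·µ) bounds the achievable miss rate from below. A Monte Carlo sketch (our illustration, not the authors' experiment; it models the demand as a gamma variate with the chosen CV > 0 and ignores all queueing delay):

```python
import random

def miss_rate(cv, laxity, mean=1.0, n=100_000, seed=1):
    """Estimate P(demand > laxity * mean) for a gamma-distributed execution
    demand with the given coefficient of variation (cv > 0).  Queueing
    delay is ignored, so this is an optimistic lower bound on Ps."""
    rng = random.Random(seed)
    shape = 1.0 / cv**2          # gamma shape/scale giving E[X]=mean, CV=cv
    scale = mean * cv**2
    deadline = laxity * mean
    misses = sum(rng.gammavariate(shape, scale) > deadline for _ in range(n))
    return misses / n
```

With CV = 1.0 (exponential demand) and L = 3 the bound is already e^-3 ≈ 5%, right at the 5% requirement, whereas with CV = 0.1 it is negligible; this is consistent with the observation that large CV demands a much larger laxity factor.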
[Figure 4: the randomly generated models have four driver tasks, each with its own period and deadline, invoking a layered structure of sixteen tasks.]
Fig. 4. Random statistical models
Fig. 5. Minimum laxity factor value providing feasibility for different combinations of the coefficient of variation and the processor utilization
4.2.2 RADS Bookstore Model
This example is a simplified e-commerce site described in [14] by Petriu and Woodside, called the RADS Bookstore model. The model describes a 3-tier client-server system (client, application and database tiers) with stochastic behaviour. The customer has seven scenarios: browsing the stock, viewing a detailed item description, adding items to or removing items from the shopping cart, checking out the items in the shopping cart, registering, and logging into the RADS bookstore. The administrator can update the inventory and fill the outstanding back orders. Figure 6 is the simplified LQN model of the RADS bookstore, derived from a diagram in [14]. The model has been adapted as follows:
• The set of Customers represented by the Customer task have a random think time between requests to the system, and a probability for making each type of request. There is one Administrator task.
• Each scenario for the Customer and Administrator is governed by a pseudo task in the second layer, running on a pseudo processor ScenarioProc. The pseudo tasks are used to collect response times and to set deadlines.
• The scenario deadlines for Customer are all set to 500 ms, and the scenario deadlines for Administrator are set to 6000 ms.
• The deadline miss rates for scenarios are required to be no more than 10%.
The model was analyzed with 50, 100, 150, 200, 250, 300 and 350 customers. Because there is only one processor in the application tier, there is no task reallocation, and priority adjustment is the only optimization option. In the baseline model all the tasks on processor BookstoreProc and processor DatabaseProc are scheduled by the FIFO discipline (i.e. all the tasks are assigned the same priority). The optimized model will be compared to the baseline model. It turns out that the Customer scenarios easily meet their deadlines, so the experimental results in Figure 7 show only the miss rates for the two Administrator scenarios.
These are greatly improved by the optimization. In the baseline model, the miss rates for the two administrator scenarios increase rapidly as the number of customers increases, and the deadline requirements could not be met with 200 or more customers. In the optimized model, the miss rates of the two Administrator scenarios are held roughly constant, and the deadline requirements are met in all cases. This case study shows how the optimization approach can be usefully applied to find a runtime configuration for a complex hierarchical client-server system with soft deadlines and stochastic behaviour. The performance of the result is comparable to the redesign proposed in [15], which was determined with considerable analysis and required restructuring the database subsystem. This is an outstanding success for an automated procedure.
5 Conclusions An improved method for optimizing the configuration of a layered real-time system has been described. It is intended to be useful to software designers who wish to evaluate a software design “at its best”, without the effort of manually tuning the deployment. It adjusts the priorities of tasks competing for a processor, and the
allocation of tasks to processors, searching for a feasible configuration (meaning one that meets soft deadlines on percentiles of responses). It can equally be used to configure systems with hard deadlines, to be met by 100% of responses. The percentiles can be different for different scenarios.
[Figure 6: the Customer task on CustomerProc and the Administrator task on AdminProc call per-scenario pseudo tasks (CustBrowse, CustView, CustAdd, CustRemove, CustLogin, CustCheckout, CustRegister, AdmUpdate, AdmBackorder) on ScenarioProc; these in turn call the RADSbookstore, Server, ShoppingCart, BackorderMgr, InventoryMgr and CustomerAccount tasks on BookstoreProc, which use the Database and CustomerDB tasks on DatabaseProc.]
Fig. 6. Simplified LQN model of RADS Bookstore
The Planner2 is significantly better than its predecessor at finding feasible configurations for hard deadlines. It successfully configured thousands of cases with soft deadlines as well, with complex task structures. It successfully configured a realistic task system for e-commerce, without intervention.
[Figure 7 plots miss rate against the number of users, with curves for AdmBackorder (baseline and optimized) and AdmUpdate (baseline and optimized).]
Fig. 7. Miss rates of baseline and optimized model for scenario AdmBackorder and AdmUpdate
Acknowledgements. This work was supported by the Natural Sciences and Engineering Research Council of Canada through its program of Discovery Grants, and by the Ontario government through its OGSST program of scholarships.
References

[1] R. Bettati, "End-to-end scheduling to meet deadlines in distributed systems", Ph.D. thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA, March 1994.
[2] N.J. Dingle, P.G. Harrison, W.J. Knottenbelt, "Response time densities in Generalized Stochastic Petri net models", in Proceedings of the Third Workshop on Software and Performance, Rome, July 2002.
[3] H.M. El-Sayed, D. Cameron, and C.M. Woodside, "Automated performance modeling from scenarios and SDL designs of distributed systems", in Proc. of the Int. Symposium on Software Engineering for Parallel and Distributed Systems (PDSE'98), Kyoto, April 1998.
[4] H.M. El-Sayed, D. Cameron, C.M. Woodside, "Automation Support for Software Performance Engineering", in Proc. Joint Int. Conf. on Measurement and Modeling of Computer Systems (Sigmetrics 2001/Performance 2001), Cambridge, MA, 2001, pp. 301–311.
[5] G. Franks, A. Hubbard, S. Majumdar, D. Petriu, J. Rolia, and C.M. Woodside, "A toolset for performance engineering and software design of client-server systems", Performance Evaluation, 24(1–2):117–135, November 1995.
[6] G. Franks, S. Majumdar, J. Neilson, D. Petriu, J. Rolia, and M. Woodside, "Performance analysis of distributed server systems", in Proceedings of the Sixth International Conference on Software Quality, Ottawa, Canada, October 28–30, 1996, pp. 15–26.
[7] G. Franks, "Performance analysis of distributed server systems", Ph.D. thesis, Dept. of Systems and Comp. Eng., Carleton University, Dec. 1999.
[8] J.J.G. Garcia and M. Gonzalez Harbour, "Optimized priority assignment for tasks and messages in distributed hard real-time systems", in Proc. IEEE Workshop on Parallel and Distributed Real-Time Systems, California, pp. 124–132, April 1995.
[9] M.K. Gardner, "Probabilistic Analysis and Scheduling of Critical Soft Real-Time Systems", Ph.D. thesis, Computer Science, University of Illinois at Urbana-Champaign, 1999.
[10] C.J. Hou and K.G. Shin, "Allocation of periodic task modules with precedence and deadline constraints in distributed real-time systems", in Proc. of the Real-Time Systems Symposium, pp. 146–155, 1992.
[11] J.Y.T. Leung, J. Whitehead, "On the complexity of fixed-priority scheduling of periodic real-time tasks", Performance Evaluation, 2(4), pp. 237–250, Dec. 1982.
[12] C.L. Liu, J.W. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment", J. Assoc. Computing Mach., vol. 20, pp. 46–61, 1973.
[13] D.T. Peng and K.G. Shin, "Static allocation of periodic tasks with precedence constraints in distributed real-time systems", in Proc. of the 9th Intl. Conf. on Distributed Computing Systems, pp. 190–198, 1989.
[14] D. Petriu, M. Woodside, "Analysing Software Requirements Specifications for Performance", in Third Int. Workshop on Software and Performance, Rome, July 2002.
[15] D.C. Petriu, M. Woodside, "Software Performance Models from System Scenarios in Use Case Maps", in International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 141–158, 2002.
[16] D.C. Petriu, H. Shen, "Applying the UML Performance Profile: Graph Grammar-Based Derivation of LQN Models from UML Specifications", in International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 159–177, 2002.
[17] J.R. Rolia and K. Sevcik, "The method of layers", IEEE Transactions on Software Engineering, vol. 21, no. 8, pp. 689–700, 1995.
[18] J. Sun, "Fixed Periodic Scheduling of Periodic Tasks with End-To-End Deadlines", Ph.D. thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA, March 1996.
[19] K.W. Tindel, A. Burns, and A.J. Wellings, "Allocating hard real-time tasks: an NP-hard problem made easy", Real-Time Systems, 4(2):145–165, June 1992.
[20] C.M. Woodside, "Throughput calculation for basic stochastic rendezvous networks", Performance Evaluation, vol. 9, no. 2, pp. 143–160, 1989.
[21] C.M. Woodside and G.M. Monforton, "Fast allocation of processes in distributed and parallel systems", IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 2, Feb. 1993.
[22] C.M. Woodside, J.E. Neilson, D.C. Petriu, and S. Majumdar, "The stochastic rendezvous network model for performance of synchronous client-server-like distributed software", IEEE Transactions on Computers, 44(1):20–34, January 1995.
Necessary and Sufficient Conditions for Representing General Distributions by Coxians Takayuki Osogami and Mor Harchol-Balter Department of Computer Science, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA {osogami, harchol}@cs.cmu.edu
Abstract. A common analytical technique involves using a Coxian distribution to model a general distribution G, where the Coxian distribution agrees with G on the first three moments. This technique is motivated by the analytical tractability of the Coxian distribution. Algorithms for mapping an input distribution G to a Coxian distribution largely hinge on knowing a priori the necessary and sufficient number of phases in the representative Coxian distribution. In this paper, we formally characterize the set of distributions G which are well-represented by an n-phase Coxian distribution, in the sense that the Coxian distribution matches the first three moments of G. We also discuss a few common, practical examples.
1 Introduction
Background. Approximating general distributions by phase-type (PH) distributions has significant application in the analysis of stochastic processes. Many fundamental problems in queueing theory are hard to solve when general distributions are allowed as inputs. For example, the waiting time for an M/G/c queue has no nice closed formula when c > 1, while the waiting time for an M/M/c queue is trivially solved. Tractability of M/M/c queues is attributed to the memoryless property of the exponential distribution. A popular approach to analyzing queueing systems involving a general distribution G is to approximate G by a PH distribution. A PH distribution is a very general mixture of exponential distributions, as shown in Figure 1 [21]. The Markovian nature of the PH distribution frequently allows a Markov chain representation of the queueing system. Once the system is represented by a Markov chain, this chain can often be solved by matrix-analytic methods [18,21], or other means. When fitting a general distribution G to a PH distribution, it is common to look for a PH distribution which matches the first three moments of G. In this paper, we say that: Definition 1. A distribution G is well-represented by a distribution F if F and G agree on their first three moments. We choose to limit our discussion in this paper to three-moment matching, because matching the first three moments of an input distribution has been shown P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 182–199, 2003. c Springer-Verlag Berlin Heidelberg 2003
Fig. 1. A PH distribution is the distribution of the absorption time in a finite state continuous time Markov chain. The figure shows a 4-phase PH distribution. There are n = 4 states, where the ith state has exponentially-distributed sojourn time with rate λi. With probability p0i we start in the ith state, and the next state is state j with probability pij. Each state i has probability pi5 that the next state will be the absorbing state. The absorption time is the sum of the times spent in each of the states.
Fig. 2. An n-phase Coxian distribution is a particular n-phase PH distribution whose underlying Markov chain is of the form in the figure, where 0 ≤ pi ≤ 1 and λi > 0 for all 1 ≤ i ≤ n.
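For intuition, a Coxian sample can be drawn directly from this description: spend an Exp(λi) sojourn in phase i, then continue to phase i+1 with the corresponding probability or absorb. The sampler below is our illustration, not from the paper; it assumes p1 = 1, i.e. a Coxian+ distribution with no probability mass at zero:

```python
import random

def sample_coxian(rates, probs, rng=random):
    """Draw one sample from an n-phase Coxian+ distribution.
    `rates` holds lambda_1..lambda_n; `probs[i]` is the probability of
    continuing from phase i+1 to phase i+2 (len(probs) == len(rates)-1);
    otherwise the chain absorbs and the accumulated time is returned."""
    total = rng.expovariate(rates[0])          # phase 1 is always visited
    for lam, p in zip(rates[1:], probs):
        if rng.random() >= p:                  # absorb before next phase
            break
        total += rng.expovariate(lam)
    return total
```

For example, rates (1, 1) with continuation probability 1 give an Erlang-2 with mean 2, and continuation probability 0 degenerates to an exponential with mean 1.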
to be effective in predicting mean performance for a variety of computer system models [7,10,23,29,33]. Clearly, however, three moments might not always suffice for every problem, and we leave the problem of matching more moments to future work. Most existing algorithms for fitting a general distribution G to a PH distribution restrict their attention to a subset of PH distributions, since general PH distributions have so many parameters that it is difficult to find time-efficient algorithms for fitting to the general PH distributions [14,15,20,27,32]. The most commonly chosen subset is the class of Coxian distributions, shown in Figure 2. Coxian distributions have the advantage of being much simpler than general PH distributions, while including a large subset of PH distributions without needing additional phases. For example, for any acyclic PH distribution Pn, there exists a Coxian distribution Cn with the same number of phases such that Pn and Cn have the same distribution function [5]. In this paper we will restrict our attention to Coxian distributions. Motivation and Goal. When finding a Coxian distribution C which well-represents a given distribution G, it is desirable that C be minimal, i.e., that the number of phases in C be as small as possible. This is important because it minimizes the additional states necessary in the resulting Markov chain for the
queueing system. Unfortunately, it is not known what the minimal number of phases necessary to well-represent a given distribution G by a Coxian distribution is. This makes it difficult to evaluate the effectiveness of different algorithms and also makes the design of fitting algorithms open-ended. The primary goal of this paper is to characterize the set of distributions which are well-represented by an n-phase Coxian distribution, for each n = 1, 2, 3, . . .

Definition 2. Let S(n) denote the set of distributions that are well-represented by an n-phase Coxian distribution for positive integer n.

Our characterization of {S(n), n ≥ 1} will allow one to determine, for any distribution G, the minimal number of phases that are needed to well-represent G by a Coxian distribution.¹ Such a characterization will be a useful guideline for designing algorithms which fit general distributions to Coxian distributions. Another application of this characterization is that some existing fitting algorithms, such as Johnson and Taaffe's nonlinear programming approach [15], require knowing the number of phases n in the minimal Coxian distribution. The current approach involves simply iterating over all choices for n [15], whereas our characterization would immediately specify n. Providing necessary and sufficient conditions for a distribution to be in S(n) does not always immediately give one a sense of which distributions satisfy those conditions, or of the magnitude of the set of distributions which satisfy the condition. A secondary goal of this paper is to provide examples of common distributions which are included in S(n) for particular integers n. In finding simple characterizations of S(n), it will be very helpful to start by defining an alternative to the standard moments, which we refer to as normalized moments.

Definition 3. Let µ_k^F be the k-th moment of a distribution F for k = 1, 2, 3. The normalized k-th moment m_k^F of F, for k = 2, 3, is defined to be

    m_2^F = µ_2^F / (µ_1^F)²   and   m_3^F = µ_3^F / (µ_1^F µ_2^F).

Notice the correspondence to the coefficient of variability C_F of F: m_2^F = C_F² + 1 and m_3^F = ν_F √(m_2^F), where ν_F = µ_3^F / (µ_2^F)^{3/2}. (Notice also the correspondence between ν_F and the skewness γ_F of F, where γ_F = µ̄_3^F / (µ̄_2^F)^{3/2} and µ̄_k^F is the centralized k-th moment of F for k = 2, 3.)
¹ One might initially argue that S(2), the set of distributions well-represented by a two-phase Coxian distribution, should include all distributions, since a 2-phase Coxian distribution has four parameters (p1, p2, λ1, λ2), whereas we only need to match three moments of G. A simple counterexample shows this argument to be false. Let G be a distribution whose first three moments are 1, 2, and 12. The system of equations for matching G to a 2-phase Coxian distribution with three parameters (λ1, λ2, p) results in either λ1 or λ2 being negative.
Necessary and Sufficient Conditions for Representing General Distributions
185
Relevant Previous Work. All prior work on characterizing S^(n) has focused on characterizing S^(2)*, where S^(2)* is the set of distributions which are well-represented by a 2-phase Coxian+ distribution, and a Coxian+ distribution is simply a Coxian distribution with no probability mass at zero, i.e. p_1 = 1. Observe that S^(2)* ⊂ S^(2). Altiok [2] showed a sufficient condition for a distribution G to be in S^(2)*. More recently, Telek and Heindl [31] expanded Altiok's condition and proved a necessary and sufficient condition for a distribution G to be in S^(2)*. While neither Altiok nor Telek and Heindl expressed these conditions in terms of normalized moments, the results can be expressed more simply with our normalized moments, as shown in Theorem 1. In this paper, we will characterize S^(2), as well as S^(n) for all integers n ≥ 2.

Our Results. While the goal of the paper is to characterize the set S^(n), this characterization turns out to be ugly. One of the key ideas in the paper is that there is a set SV^(n) ⊂ S^(n) which is very close to S^(n) in size, and which has a very simple specification via normalized moments. Thus, much of the proof effort in this paper revolves around SV^(n).

Definition 4. For integers n ≥ 2, let SV^(n) denote the set of distributions F with the following property on their normalized moments:

    m_2^F > n/(n−1)    and    m_3^F ≥ ((n+2)/(n+1)) m_2^F.    (1)
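Condition (1) is easy to test directly. The following sketch (ours, not part of the paper) checks membership in SV^(n) and searches for the smallest such n; by Lemma 2 below, that many phases suffice for a Coxian representation.

```python
def in_SV(n, m2, m3):
    """Condition (1): is a distribution with normalized moments (m2, m3)
    in SV^(n), for integer n >= 2?"""
    return m2 > n / (n - 1) and m3 >= (n + 2) / (n + 1) * m2

def min_SV_phases(m2, m3, n_max=1000):
    """Smallest n >= 2 with (m2, m3) in SV^(n), or None if n_max is hit
    (e.g. for nearly deterministic distributions with m2 close to 1)."""
    for n in range(2, n_max + 1):
        if in_SV(n, m2, m3):
            return n
    return None

# Erlang-2 (m2 = 3/2, m3 = 2) first enters SV^(n) at n = 4, consistent
# with E^(2) sitting exactly on the boundary of SV^(3) (see Theorem 3):
print(min_SV_phases(1.5, 2.0))  # 4
```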
The main contribution of this paper is a derivation of the nested relationship between SV^(n) and S^(n) for all n ≥ 2. This relationship is illustrated in Figure 3 and proven in Section 3. There are three points to observe: (i) S^(n) is a proper subset of S^(n+1) for all integers n ≥ 2, and likewise SV^(n) is a proper subset of SV^(n+1); (ii) SV^(n) is contained in S^(n) and close to S^(n) in size, thus providing a simple characterization for S^(n); (iii) S^(n) is almost contained in SV^(n+1) for all integers n ≥ 2 (more precisely, we will show S^(n) ⊂ SV^(n+1) ∪ E^(n), where E^(n) is the set of distributions well-represented by an Erlang-n distribution). This result yields a necessary number and a sufficient number of phases for a given distribution to be well-represented by a Coxian distribution. Additional contributions of the paper are described below.

With respect to the set S^(2), we derive the exact necessary and sufficient condition for a distribution G to be in S^(2) as a function of the normalized moments of G. This complements the results of Telek and Heindl, who analyzed S^(2)*, which is a subset of S^(2) (see Section 2).

Lastly, we provide a few examples of common, practical distributions included in the set SV^(n) ⊂ S^(n). All distributions we consider have finite third moment. The Pareto distribution and the Bounded Pareto distribution (as defined in [8]) have been shown to fit many recent measurements of job service requirements in computing systems, including the file size requested by HTTP requests [3,4], the CPU requirements of UNIX jobs [9,19], and the duration of FTP transfers [24]. We show that a large subset of Bounded Pareto distributions is in SV^(2).
186
T. Osogami and M. Harchol-Balter
Fig. 3. The main contribution of this paper: a simple characterization of S^(n) by SV^(n). Solid lines delineate S^(n) (which is irregular) and dashed lines delineate SV^(n) (which is regular, i.e., has a simple specification). Observe the nested structure of S^(n) and SV^(n): SV^(n) is close to S^(n) in size and is contained in S^(n), and S^(n) is almost contained in SV^(n+1).
We also provide conditions under which the Pareto and uniform distributions are in SV^(n) for each n ≥ 2 (see Section 4).^2

2 Full Characterization of S^(2)
The Telek and Heindl [31] result may be expressed in terms of normalized moments as follows:

Theorem 1 (Telek, Heindl). G ∈ S^(2)* iff G is in the following union of sets:

      { F | (9 m_2^F − 12 + 3√2 (2 − m_2^F)^{3/2}) / m_2^F ≤ m_3^F ≤ 6(m_2^F − 1)/m_2^F  and  3/2 ≤ m_2^F < 2 }
    ∪ { F | m_3^F = 3  and  m_2^F = 2 }
    ∪ { F | (3/2) m_2^F < m_3^F  and  2 < m_2^F }.
We now show a simple characterization for S^(2):

Theorem 2. G ∈ S^(2) iff G is in the following union of sets:

    { F | (4/3) m_2^F ≤ m_3^F ≤ 6(m_2^F − 1)/m_2^F  and  3/2 ≤ m_2^F ≤ 2 } ∪ SV^(2),    (2)

where recall that SV^(2) is the set { F | (4/3) m_2^F ≤ m_3^F  and  2 < m_2^F }.
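Theorem 2 can be turned into a direct membership test, sketched below (our code, under our reading of (2)). The moments 1, 2, 12 from the earlier counterexample fail the test, and the normalized moments show why: m_2 = 2 forces m_3 ≤ 3, but here m_3 = 6.

```python
def in_S2(m2, m3):
    """Theorem 2: are (m2, m3) the normalized moments of some distribution
    well-represented by a 2-phase Coxian distribution?"""
    in_first = 1.5 <= m2 <= 2 and (4/3) * m2 <= m3 <= 6 * (m2 - 1) / m2
    in_SV2 = m2 > 2 and m3 >= (4/3) * m2
    return in_first or in_SV2

# Counterexample distribution with raw moments 1, 2, 12:
mu1, mu2, mu3 = 1.0, 2.0, 12.0
m2, m3 = mu2 / mu1**2, mu3 / (mu1 * mu2)
print(m2, m3, in_S2(m2, m3))  # 2.0 6.0 False
# An exponential (m2 = 2, m3 = 3) is trivially in S^(2):
print(in_S2(2.0, 3.0))  # True
```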
^2 Our results show that the first three moments of the Bounded Pareto distribution and the Pareto distribution are matched by a Coxian distribution with a small number of phases. Note, however, that this does not necessarily imply that the shape of these distributions is well-matched by a Coxian distribution with few phases, since the tail of these distributions is not exponential. Fitting the shape of heavy-tailed distributions by phase-type (PH) distributions is studied in several recent papers [6,11,12,17,26,30].
A summary of Theorems 1 and 2 is shown in Figure 4. Figure 4(a) illustrates how close S^(2) and SV^(2) are in size. Figure 4(b) shows the distributions which are in S^(2) but not in S^(2)*.
Fig. 4. (a) The thick solid lines delineate S^(2). The striped region shows SV^(2) ⊂ S^(2). (b) Again, the thick solid lines delineate S^(2). The shaded region shows S^(2) \ S^(2)*.
Proof (Theorem 2). The theorem will be proved by reducing S^(2) to S^(2)* and employing Theorem 1. The proof hinges on the following observation: an arbitrary distribution G ∈ S^(2) iff G is well-represented by some distribution^3 Z(·) = X(·)p + 1 − p for some X ∈ S^(2)*. It therefore suffices to show that Z is in the set defined in (2).

By Theorem 1, since X ∈ S^(2)*, X is in the following union of sets:

      { F | (9 m_2^F − 12 + 3√2 (2 − m_2^F)^{3/2}) / m_2^F ≤ m_3^F ≤ 6(m_2^F − 1)/m_2^F  and  3/2 ≤ m_2^F < 2 }
    ∪ { F | m_3^F = 3  and  m_2^F = 2 }
    ∪ { F | (3/2) m_2^F < m_3^F  and  2 < m_2^F }.

Observe that m_k^Z = m_k^X / p for k = 2, 3. Thus, Z is in the following union of sets:

      { F | ∃p, (9p m_2^F − 12 + 3√2 (2 − p m_2^F)^{3/2}) / (p^2 m_2^F) ≤ m_3^F ≤ 6(p m_2^F − 1)/(p^2 m_2^F)  and  3/(2p) ≤ m_2^F < 2/p }
    ∪ { F | ∃p, m_3^F = 3/p  and  m_2^F = 2/p }
    ∪ { F | ∃p, (3/2) m_2^F < m_3^F  and  2/p < m_2^F }.    (3)
^3 To shed light on this expression, consider a random variable V_X whose distribution is X. Then the random variable

    V_Z = V_X with probability p,  and  V_Z = 0 with probability 1 − p,

has distribution Z, since Pr(V_Z < t) = p Pr(V_X < t) + (1 − p).
We want to show that Z is in the set defined in (2). To do this, we rewrite the set defined in (2) as:

      { F | (4/3) m_2^F ≤ m_3^F ≤ 6(m_2^F − 1)/m_2^F  and  3/2 ≤ m_2^F ≤ 2 }
    ∪ { F | (4/3) m_2^F ≤ m_3^F ≤ (3/2) m_2^F  and  2 < m_2^F }
    ∪ { F | (3/2) m_2^F < m_3^F  and  2 < m_2^F }.    (4)

Observe that (3) and (4) are now in similar forms. We now prove that the set defined in (3) is a subset of the set defined in (4), and that the set defined in (4) is a subset of the set defined in (3). The technical details are postponed to Appendix A, Lemma 3.
3 A Characterization of S^(n)
In this section, we prove that SV^(n) is contained in S^(n), where SV^(n) is the set of distributions whose normalized moments satisfy (1), and that S^(n) is almost contained in SV^(n+1). Figure 5 provides a graphical view of the SV^(n) sets with respect to the normalized moments. Figure 5 illuminates several points. First, there is a nested relationship between SV^(n) and SV^(n−1). This makes intuitive sense, since an n-phase Coxian can represent at least as many distributions as an (n−1)-phase Coxian. Next, observe that as either m_2^G or m_3^G decreases, more phases are needed to well-represent G. The intuition behind this is that lower normalized moments m_2 and m_3 correspond to moving towards a deterministic distribution (which has the minimum possible values of m_2 and m_3), and a deterministic distribution is well known to require an infinite number of phases. On the flip side, for distributions with sufficiently high m_2 and m_3, two phases are all that is needed, since high m_2 and m_3 can be achieved by mixing two exponentials with very different rates.

We prove the following theorem:

Theorem 3. SV^(n) ⊂ S^(n) ⊂ SV^(n+1) ∪ E^(n), where E^(n) is the set of distributions that are well-represented by an Erlang-n distribution, for integers n ≥ 2.

An Erlang-n distribution refers to the distribution of a random variable which is equal to the sum of n i.i.d. exponential random variables. Notice that the normalized moments of distributions in E^(n), m_2^{E^(n)} and m_3^{E^(n)}, satisfy the following conditions:

    m_2^{E^(n)} = (n+1)/n    and    m_3^{E^(n)} = (n+2)/n.    (5)

Theorem 3 tells us that S^(n) is "sandwiched between" SV^(n) and SV^(n+1). From Figure 5, we see that SV^(n) and SV^(n+1) are quite close for high n. Thus we have a very accurate representation of S^(n). Theorem 3 follows from the next two lemmas:

Lemma 1. S^(n) ⊂ SV^(n+1) ∪ E^(n).
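Condition (5) follows from the raw moments of an Erlang-n with per-phase rate λ: μ_1 = n/λ, μ_2 = n(n+1)/λ^2, μ_3 = n(n+1)(n+2)/λ^3. A quick sketch (ours) confirms that the normalized moments are independent of λ:

```python
def erlang_normalized_moments(n, lam=1.0):
    """Normalized moments of an Erlang-n with rate lam per phase."""
    mu1 = n / lam
    mu2 = n * (n + 1) / lam**2
    mu3 = n * (n + 1) * (n + 2) / lam**3
    return mu2 / mu1**2, mu3 / (mu1 * mu2)

for n in (2, 5, 32):
    m2, m3 = erlang_normalized_moments(n, lam=3.7)
    assert abs(m2 - (n + 1) / n) < 1e-12 and abs(m3 - (n + 2) / n) < 1e-12
print("equation (5) verified")
```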
Fig. 5. Depiction of the SV^(n) sets for n = 2, 3, 4, 32 as a function of the normalized moments. Observe that all possible nonnegative distributions lie within the region delineated by the two dotted lines: m_2 ≥ 1 and m_3 ≥ m_2 [16]. SV^(n) for n = 2, 3, 4, 32 are delineated by solid lines, which include the border, and dashed lines, which do not include the border.
Lemma 2. SV^(n) ⊂ S^(n).

Proof (Lemma 1). The proof proceeds by induction. When n = 2, the lemma follows from (1), (5), and Theorem 2. Next, assume that S^(n) ⊂ SV^(n+1) ∪ E^(n) for n ≤ k − 1. Consider an arbitrary distribution G ∈ S^(k). Let Z(·) = (X(·) ⊗ Y(·))p + 1 − p, where X is an exponential distribution and Y is a (k−1)-phase Coxian distribution.^4 Observe that for any arbitrary distribution G ∈ S^(k), there exists some such Z which well-represents G. By the inductive assumption, Y ∈ SV^(k) ∪ E^(k−1). We prove that (i) if Y ∈ SV^(k), then Z ∈ SV^(k+1), and (ii) if Y ∈ E^(k−1), then Z ∈ SV^(k+1) ∪ E^(k). Without loss of generality, we can set the first moment of X to 1. To see why this is possible, observe that Z is composed of k exponential phases, and the normalized second and third moments of Z, m_2^Z and m_3^Z, are both invariant to multiplying all the rates of the exponential phases in Z by the same constant. Thus, if the first moment of X equals μ_1^X ≠ 1, then the rates of all the phases in Z may be multiplied by μ_1^X to bring the first moment of X to 1.
^4 To shed light on this expression, consider random variables V_X and V_Y whose distributions are X and Y, respectively. Then the random variable

    V_Z = V_X + V_Y with probability p,  and  V_Z = 0 with probability 1 − p,

has distribution Z, since Pr(V_Z < t) = p Pr(V_X + V_Y < t) + (1 − p).
(i) Suppose Y ∈ SV^(k): We first prove that m_2^Z > (k+1)/k. Observe that

    m_2^Z = (2 + 2μ_1^Y + μ_2^Y) / (p (1 + μ_1^Y)^2) > (2 + 2μ_1^Y + (k/(k−1)) (μ_1^Y)^2) / (p (1 + μ_1^Y)^2),

where the inequality follows from Y ∈ SV^(k). The right-hand side is minimized when μ_1^Y = k − 1. Thus, m_2^Z > (k+1)/(pk) ≥ (k+1)/k. Next, we prove that m_3^Z / m_2^Z ≥ (k+3)/(k+2) for all m_2^Z > (k+1)/k. Notice that m_3^Z / m_2^Z is independent of p:

    m_3^Z / m_2^Z = (6 + 6μ_1^Y + 3μ_2^Y + μ_3^Y)(1 + μ_1^Y) / (2 + 2μ_1^Y + μ_2^Y)^2.

Since m_3^Z / m_2^Z is an increasing function of μ_3^Y, it is minimized at μ_3^Y = ((k+2)/(k+1)) (μ_2^Y)^2 / μ_1^Y, since Y ∈ SV^(k). Thus,

    m_3^Z / m_2^Z ≥ (1 + μ_1^Y)(6(k+1)μ_1^Y + 6(k+1)(μ_1^Y)^2 + 3(k+1)μ_1^Y μ_2^Y + (k+2)(μ_2^Y)^2) / ((k+1) μ_1^Y (2 + 2μ_1^Y + μ_2^Y)^2).

The infimum of the right-hand side occurs at

    μ_2^Y = max( 6(k+1)μ_1^Y(1 + μ_1^Y) / (4 + 4μ_1^Y + (k+1)(4 + μ_1^Y)),  (k/(k−1)) (μ_1^Y)^2 ).

By evaluating m_3^Z / m_2^Z at μ_2^Y = (k/(k−1)) (μ_1^Y)^2, we have

    m_3^Z / m_2^Z ≥ (1 + μ_1^Y)(6(k+1)(k−1)^2(1 + μ_1^Y) + 3k(k^2−1)(μ_1^Y)^2 + k^2(k+2)(μ_1^Y)^3) / ((k+1)[2(k−1) + 2(k−1)μ_1^Y + k(μ_1^Y)^2]^2).

By Lemma 4 in Appendix A, we have m_3^Z / m_2^Z ≥ (k+3)/(k+2). By evaluating m_3^Z / m_2^Z at

    μ_2^Y = 6(k+1)μ_1^Y(1 + μ_1^Y) / (4 + 4μ_1^Y + (k+1)(4 + μ_1^Y)),

we have

    m_3^Z / m_2^Z ≥ 3(8(1 + μ_1^Y) + (k+1)(8 + 5μ_1^Y)) / (16(2 + k)(1 + μ_1^Y)) ≥ (k+3)/(k+2),

where the last inequality holds iff μ_1^Y ≤ 8k/(k+9). However, μ_1^Y ≤ 8k/(k+9) holds whenever

    6(k+1)μ_1^Y(1 + μ_1^Y) / (4 + 4μ_1^Y + (k+1)(4 + μ_1^Y)) > (k/(k−1)) (μ_1^Y)^2.

(ii) Suppose Y ∈ E^(k−1): We will prove that (a) if μ_1^Y = k − 1 and p = 1, then Z ∈ E^(k), and (b) if μ_1^Y ≠ k − 1 or p < 1, then Z ∈ SV^(k+1). For part (a), observe that if Y ∈ E^(k−1), μ_1^Y = k − 1, and p = 1, then we have already seen in part (i) that m_2^Z = (k+1)/k. It is also easy to see that m_3^Z = (k+2)/k, and hence Z ∈ E^(k). For part (b), if μ_1^Y ≠ k − 1 or p < 1, then first notice that m_2^Z > (k+1)/k, since m_2^Z is minimized when μ_1^Y = k − 1 and p = 1. Also, since m_3^Y = (k+1)/(k−1) > ((k+2)/(k+1)) m_2^Y, we have m_3^Z / m_2^Z ≥ (k+3)/(k+2) by part (i), and hence Z ∈ SV^(k+1).
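A numeric sanity check of part (i) (our sketch, not part of the paper): for Y drawn from a grid of moment triples inside SV^(k), with X exponential of mean 1, the composed Z always satisfies m_2^Z > (k+1)/k and m_3^Z/m_2^Z ≥ (k+3)/(k+2). The grid points (the slack factors, the μ_1^Y values, and p) are our choices for illustration.

```python
import itertools

def z_moments(mu1Y, mu2Y, mu3Y, p):
    """Normalized m2 and the ratio m3/m2 of Z = (X + Y) w.p. p,
    where X is exponential with mean 1 (formulas from part (i))."""
    m2 = (2 + 2 * mu1Y + mu2Y) / (p * (1 + mu1Y) ** 2)
    ratio = ((6 + 6 * mu1Y + 3 * mu2Y + mu3Y) * (1 + mu1Y)
             / (2 + 2 * mu1Y + mu2Y) ** 2)
    return m2, ratio

for k in range(2, 7):
    for mu1Y, slack, p in itertools.product((0.5, 1.0, k - 1, 10.0),
                                            (1.001, 1.5), (0.7, 1.0)):
        # Y in SV^(k): m2^Y > k/(k-1) and m3^Y >= ((k+2)/(k+1)) m2^Y
        mu2Y = slack * k / (k - 1) * mu1Y ** 2
        mu3Y = slack * (k + 2) / (k + 1) * mu2Y ** 2 / mu1Y
        m2, ratio = z_moments(mu1Y, mu2Y, mu3Y, p)
        assert m2 > (k + 1) / k and ratio >= (k + 3) / (k + 2) - 1e-9
print("part (i) bounds hold on the grid")
```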
Proof (Lemma 2). When n = 2, the lemma follows from Theorem 2. The remainder of the proof assumes n ≥ 3. We prove that for an arbitrary distribution G ∈ SV^(n), there exists an n-phase Coxian distribution Z such that the normalized moments of G and Z agree. Notice that the first moment of Z is easily matched to G by scaling, without changing the normalized moments of Z. The proof consists of two parts: (i) the case when the normalized moments of G satisfy m_3^G > 2m_2^G − 1; (ii) the case when the normalized moments of G satisfy m_3^G ≤ 2m_2^G − 1.

(i) Suppose G ∈ SV^(n) and m_3^G > 2m_2^G − 1: We need to show that G is well-represented by some n-phase Coxian distribution. We will prove something stronger: G is well-represented by a distribution Z where Z = X + Y, X is a particular two-phase Coxian distribution with no probability mass at zero, and Y is a particular Erlang-(n−2) distribution. (For the intuition behind this particular way of representing G, please refer to [22].) The normalized moments of X are chosen as follows:

    m_2^X = (m_2^G (n−3) − (n−2)) / (m_2^G (n−2) − (n−1));
    m_3^X = ( ((n−1)m_2^X − (n−2)) ((n−2)m_2^X − (n−3))^2 m_3^G
              − (n−2)(m_2^X − 1)( n(n−1)(m_2^X)^2 − n(2n−5)m_2^X + (n−1)(n−3) ) ) / m_2^X.

The first moment of Y is chosen as follows: μ_1^Y = (n−2)(m_2^X − 1) μ_1^X. It is easy to see that the normalized moments of G and Z agree:

    m_2^Z = (m_2^X + 2y + m_2^Y y^2) / (1 + y)^2 = m_2^G;
    m_3^Z = (m_2^X m_3^X + 3 m_2^X y + 3 m_2^Y y^2 + m_2^Y m_3^Y y^3) / ((m_2^X + 2y + m_2^Y y^2)(1 + y)) = m_3^G,

where m_2^Y = (n−1)/(n−2) and m_3^Y = n/(n−2) are the normalized moments of Y, and y = μ_1^Y / μ_1^X. Finally, we will show that there exists a two-phase Coxian distribution with no probability mass at zero with normalized moments m_2^X and m_3^X. By Theorem 1, it suffices to show that m_2^X > 2 and m_3^X > (3/2) m_2^X. The first condition, m_2^X > 2, can be shown using n/(n−1) < m_2^G, which follows from G ∈ SV^(n). It can also be shown that m_3^X > 2m_2^X − 1 ≥ (3/2) m_2^X, using n/(n−1) < m_2^G and m_3^G > 2m_2^G − 1, which is the assumption that we made at the beginning of (i).

(ii) Suppose G ∈ SV^(n) and m_3^G ≤ 2m_2^G − 1: We again must show that G is well-represented by an n-phase Coxian distribution. We will show that G is well-represented by a distribution Z(·) = U(·)p + 1 − p (see Section 2 for an explanation of Z), where p = 1/(2m_2^G − m_3^G) and the normalized moments of U satisfy m_2^U = p m_2^G and m_3^U = p m_3^G. It is easy to see that the normalized moments of G and Z agree. Therefore, it suffices to show that U is well-represented by an n-phase Coxian distribution W, since then G is well-represented by the n-phase Coxian distribution Z(·) = W(·)p + 1 − p (see Section 2 for an explanation of
Z). We will prove that U is well-represented by an n-phase Coxian distribution W, where W = X + Y, X is a two-phase Coxian distribution with no probability mass at zero, and Y is an Erlang-(n−2) distribution. The normalized moments of X are chosen as follows:

    m_2^X = (m_2^U (n−3) − (n−2)) / (m_2^U (n−2) − (n−1))    and    m_3^X = 2m_2^X − 1;

the first moment of Y is chosen as follows: μ_1^Y = (n−2)(m_2^X − 1) μ_1^X. It is easy to see that the normalized moments of U and W agree:

    m_2^W = (m_2^X + 2y + m_2^Y y^2) / (1 + y)^2 = m_2^U;
    m_3^W = (m_2^X m_3^X + 3 m_2^X y + 3 m_2^Y y^2 + m_2^Y m_3^Y y^3) / ((m_2^X + 2y + m_2^Y y^2)(1 + y)) = 2m_2^U − 1 = m_3^U,

where m_2^Y = (n−1)/(n−2) and m_3^Y = n/(n−2) are the normalized moments of Y, and y = μ_1^Y / μ_1^X. Finally, we will show that there exists a two-phase Coxian distribution X with normalized moments m_2^X and m_3^X. By Theorem 2, it suffices to show that 3/2 ≤ m_2^X, since

    (4/3) m_2^X ≤ m_3^X = 2m_2^X − 1 ≤ 6(m_2^X − 1)/m_2^X,

where the first inequality holds when m_2^X ≥ 3/2 and the second inequality holds when 3/2 ≤ m_2^X ≤ 2. Since G ∈ SV^(n), m_3^G ≥ ((n+2)/(n+1)) m_2^G. Thus,

    m_2^U = m_2^G / (2m_2^G − m_3^G) ≥ m_2^G / (2m_2^G − ((n+2)/(n+1)) m_2^G) = (n+1)/n.

Finally, m_2^X ≥ 3/2 follows from m_2^U ≥ (n+1)/n.
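The case (i) construction above can be checked numerically (our sketch; the target normalized moments m_2^G = 1.4, m_3^G = 2.0 with n = 4 are our choice, picked to satisfy G ∈ SV^(4) and m_3^G > 2m_2^G − 1):

```python
def lemma2_case_i(n, m2G, m3G):
    """Build the normalized-moment parameters of Z = X + Y used in case (i)
    of Lemma 2, and return the normalized moments of Z, which should
    reproduce (m2G, m3G)."""
    m2X = (m2G * (n - 3) - (n - 2)) / (m2G * (n - 2) - (n - 1))
    m3X = (((n - 1) * m2X - (n - 2)) * ((n - 2) * m2X - (n - 3)) ** 2 * m3G
           - (n - 2) * (m2X - 1)
           * (n * (n - 1) * m2X**2 - n * (2 * n - 5) * m2X
              + (n - 1) * (n - 3))) / m2X
    y = (n - 2) * (m2X - 1)                     # mu1^Y / mu1^X
    m2Y, m3Y = (n - 1) / (n - 2), n / (n - 2)   # Erlang-(n-2) moments
    m2Z = (m2X + 2 * y + m2Y * y**2) / (1 + y) ** 2
    m3Z = (m2X * m3X + 3 * m2X * y + 3 * m2Y * y**2 + m2Y * m3Y * y**3) \
          / ((m2X + 2 * y + m2Y * y**2) * (1 + y))
    return m2X, m3X, m2Z, m3Z

m2X, m3X, m2Z, m3Z = lemma2_case_i(4, 1.4, 2.0)
assert abs(m2Z - 1.4) < 1e-9 and abs(m3Z - 2.0) < 1e-9   # moments of G recovered
assert m2X > 2 and m3X > 1.5 * m2X   # X is a valid 2-phase Coxian+ (Theorem 1)
print(round(m2X, 6), round(m3X, 6))  # 3.0 16.666667
```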
4 Examples of Some Common Distributions in S^(n)
In this section, we give examples of distributions that are well-represented by an n-phase Coxian distribution. In particular, we discuss Bounded Pareto distributions, uniform distributions, symmetric triangular distributions, and Pareto distributions, and derive the necessary and sufficient condition for these distributions to be in SV^(n) ⊂ S^(n). A summary is shown in Figure 6.

We first discuss the set of Bounded Pareto distributions. A Bounded Pareto distribution has density function

    f(x) = α l^α x^{−α−1} / (1 − (l/u)^α)

for l ≤ x ≤ u and 0 elsewhere, where 0 < α < 2 [8]. Bounded Pareto distributions have been empirically shown to fit many recent measurements of computing workloads. These include Unix process CPU requirements measured at Bellcore: 1 ≤ α ≤ 1.25 [19]; Unix process CPU requirements measured at UC Berkeley: α ≈ 1 [9]; sizes of files transferred through the Web: 1.1 ≤ α ≤ 1.3 [3,4]; sizes of files stored in Unix filesystems [13]; I/O times [25]; sizes of FTP transfers in the
Fig. 6. A summary of the results in Section 4. A few particular classes of distributions are shown in relation to SV^(n). BP* refers to the subset of Bounded Pareto distributions contained in SV^(2). UNIFORM refers to the class of all uniform distributions described in Definition 5; we find that the larger the support of the uniform distribution, the fewer the number of phases that suffice. TRIANGULAR refers to the set of symmetric triangular distributions described in Definition 5; interestingly, these have the same behavior as the uniform distributions. Finally, PARETO refers to the class of Pareto distributions with finite third moment, described in Definition 5; for this class, we find that the lower the value of the α-parameter, the fewer the number of phases that are needed.
Internet: 0.9 ≤ α ≤ 1.1 [24]; and Pittsburgh Supercomputing Center workloads for distributed servers consisting of Cray C90 and Cray J90 machines [28].

The normalized moments of a Bounded Pareto distribution F are

    m_2^F = (r − 1)^2 / (r (log r)^2)    and    m_3^F = (r − 1)(r + 1) / (2r log r)

when α = 1, and

    m_2^F = ((1 − α)^2 / (α(2 − α))) · (r^α − 1)(r^2 − r^α) / (r − r^α)^2;
    m_3^F = ((1 − α)(2 − α) / (α(3 − α))) · (r^α − 1)(r^3 − r^α) / ((r − r^α)(r^2 − r^α))

when 0 < α < 1 or 1 < α < 2, where r = u/l. Not all Bounded Pareto distributions are in SV^(2); however, a large subset of the Bounded Pareto distributions resides in SV^(2). Figure 7 shows the necessary and sufficient condition on r as a function of α for a Bounded Pareto distribution to be in SV^(2). Specifically, a Bounded Pareto distribution is in SV^(2) if and only if r = u/l is above the two lines shown in Figure 7. We use BP* to denote the subset of Bounded Pareto distributions which are contained in SV^(2).

Next, we discuss uniform distributions, symmetric triangular distributions, and Pareto distributions, and derive the necessary and sufficient condition for these distributions to be in SV^(n). We use the following definitions:

Definition 5. UNIFORM refers to the set of distributions having density function f(x) = 1/(u − l) for l ≤ x ≤ u and 0 elsewhere, for some 0 ≤ l < u.
Fig. 7. The maximum of the two lines illustrates the lower bound needed on r ≡ u/l in the definition of the BP* distributions. These lines are derived from the conditions m_2^F > 2 and m_3^F ≥ (4/3) m_2^F.
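The Bounded Pareto moment formulas above, together with the BP* condition from Figure 7 (m_2 > 2 and m_3 ≥ (4/3)m_2), can be sketched as follows (our code, with r = u/l; the α ≈ 1 threshold inside the function is our numerical guard, not part of the paper):

```python
import math

def bounded_pareto_normalized_moments(alpha, r):
    """Normalized moments of a Bounded Pareto(alpha, l, u) with r = u/l > 1
    and 0 < alpha < 2."""
    if abs(alpha - 1) < 1e-9:
        m2 = (r - 1) ** 2 / (r * math.log(r) ** 2)
        m3 = (r - 1) * (r + 1) / (2 * r * math.log(r))
    else:
        ra = r ** alpha
        m2 = ((1 - alpha) ** 2 / (alpha * (2 - alpha))
              * (ra - 1) * (r**2 - ra) / (r - ra) ** 2)
        m3 = ((1 - alpha) * (2 - alpha) / (alpha * (3 - alpha))
              * (ra - 1) * (r**3 - ra) / ((r - ra) * (r**2 - ra)))
    return m2, m3

def in_BP_star(alpha, r):
    m2, m3 = bounded_pareto_normalized_moments(alpha, r)
    return m2 > 2 and m3 >= (4/3) * m2

# The two branches agree as alpha -> 1:
m2a, m3a = bounded_pareto_normalized_moments(1.0, 50.0)
m2b, m3b = bounded_pareto_normalized_moments(1.000001, 50.0)
assert abs(m2a - m2b) < 1e-3 and abs(m3a - m3b) < 1e-3
# For alpha = 1, only a large enough r puts the distribution in BP*:
print(in_BP_star(1.0, 5.0), in_BP_star(1.0, 200.0))  # False True
```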
TRIANGULAR is the set of distributions with density function

    f(x) = (2/(u − l))^2 (x − l)     if l ≤ x ≤ (l + u)/2,
    f(x) = −(2/(u − l))^2 (x − u)    if (l + u)/2 ≤ x ≤ u,
    f(x) = 0                         otherwise,

for some 0 ≤ l < u. PARETO is the set of distributions with density function f(x) = α k^α x^{−α−1} for x ≥ k and 0 elsewhere, for some α > 3 and k > 0.

Let F_U ∈ UNIFORM and F_T ∈ TRIANGULAR with parameters l and u, and let F_P ∈ PARETO with parameters α and k. The normalized moments of F_U, F_T, and F_P are:

    m_2^{F_U} = (4/3) (1 + r + r^2) / (1 + r)^2;      m_3^{F_U} = (3/2) (1 + r^2) / (1 + r + r^2);
    m_2^{F_T} = (7 + 10r + 7r^2) / (6(1 + r)^2);      m_3^{F_T} = 3(3 + 2r + 3r^2) / (7 + 10r + 7r^2);
    m_2^{F_P} = (α − 1)^2 / (α(α − 2));               m_3^{F_P} = (α − 1)(α − 2) / (α(α − 3)),

where r = l/u. Note that m_2^{F_P} and m_3^{F_P} are independent of k. Therefore, the three distribution classes are formally characterized as follows:
Theorem 4. For all F ∈ UNIFORM, 1 ≤ m_2^F ≤ 4/3 and m_3^F = 3 − 2/m_2^F, for all 0 ≤ l < u.
For all F ∈ TRIANGULAR, 1 ≤ m_2^F ≤ 7/6 and m_3^F = 3 − 2/m_2^F, for all 0 ≤ l < u.
For all F ∈ PARETO,

    1 < m_2^F < 4/3    and    m_3^F = ( −2(m_2^F)^2 + 3m_2^F + 2(m_2^F − 1)√(m_2^F (m_2^F − 1)) ) / (4 − 3m_2^F),

for all α > 3.
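The identity m_3 = 3 − 2/m_2 shared by the uniform and triangular classes is easy to confirm from the moment formulas above (our sketch, with r = l/u):

```python
def uniform_nm(r):
    """Normalized moments of a uniform distribution on [l, u], r = l/u."""
    m2 = 4 * (1 + r + r**2) / (3 * (1 + r) ** 2)
    m3 = 3 * (1 + r**2) / (2 * (1 + r + r**2))
    return m2, m3

def triangular_nm(r):
    """Normalized moments of a symmetric triangular distribution, r = l/u."""
    m2 = (7 + 10 * r + 7 * r**2) / (6 * (1 + r) ** 2)
    m3 = 3 * (3 + 2 * r + 3 * r**2) / (7 + 10 * r + 7 * r**2)
    return m2, m3

for r in (0.0, 0.25, 0.5, 0.9):
    for m2, m3 in (uniform_nm(r), triangular_nm(r)):
        assert abs(m3 - (3 - 2 / m2)) < 1e-12
print("m3 = 3 - 2/m2 holds for both classes")
```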
Simple consequences of the theorem are:

Corollary 1. Let F ∈ UNIFORM with parameters l and u. Then F ∈ SV^(n) if and only if

    n ≥ (7 + 14r + 30r^2 + 14r^3 + 7r^4) / ((1 − r)^2 (1 + 4r + r^2)),

where r = l/u. In particular, for all values of u, the minimal such n is 7 if l = 0, and n > 7 whenever l > 0.
Let F ∈ TRIANGULAR with parameters l and u. Then F ∈ SV^(n) if and only if

    n ≥ 4(11 + 34r + 54r^2 + 34r^3 + 11r^4) / ((1 − r)^2 (5 + 14r + 5r^2)),

where r = l/u. In particular, for all values of l and u, n ≥ 9.
Let F ∈ PARETO with parameters α and k. Then F ∈ SV^(n) if and only if n > (α − 1)^2, for all values of k. In particular, n > 4 for all α > 3 and k.
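The phase counts in Corollary 1 can be evaluated directly (our sketch, not part of the paper); for example, at r = 0 the uniform bound is exactly 7 and the triangular bound is 44/5 = 8.8, hence n ≥ 9.

```python
import math

def uniform_min_n(r):
    """Smallest n with a uniform(l, u) distribution (r = l/u) in SV^(n)."""
    bound = ((7 + 14*r + 30*r**2 + 14*r**3 + 7*r**4)
             / ((1 - r)**2 * (1 + 4*r + r**2)))
    return math.ceil(bound)

def triangular_min_n(r):
    """Smallest n with a symmetric triangular distribution in SV^(n)."""
    bound = (4 * (11 + 34*r + 54*r**2 + 34*r**3 + 11*r**4)
             / ((1 - r)**2 * (5 + 14*r + 5*r**2)))
    return math.ceil(bound)

def pareto_min_n(alpha):
    """Smallest integer n with n > (alpha - 1)^2, for alpha > 3."""
    return math.floor((alpha - 1) ** 2) + 1

print(uniform_min_n(0.0), triangular_min_n(0.0), pareto_min_n(3.5))
# 7 9 7   (for the Pareto case: (3.5 - 1)^2 = 6.25, so n = 7)
```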
5 Conclusion
The contribution of this paper is a characterization of the set S^(n) of distributions G which are well-represented by an n-phase Coxian distribution. We introduce several ideas which help in creating a simple formulation of S^(n). The first is the concept of normalized moments. The second is the notion of SV^(n), a nearly complete subset of S^(n) with an extremely simple representation. The arguments required in proving the above results have an elegant structure which repeatedly makes use of the recursive nature of the Coxian distributions.

Our characterization of S^(n) provides a necessary number of phases and a sufficient number of phases for a given distribution to be well-represented by a Coxian distribution, and these bounds are nearly tight. This result has several practical uses. First, in designing algorithms which fit general distributions to Coxian distributions (fitting algorithms), it is desirable to find a minimal (fewest number of phases) Coxian distribution. Our characterization allows algorithm designers to determine how close their Coxian distribution is to the minimal Coxian distribution, and provides intuition for coming up with improved algorithms. We have ourselves benefitted from exactly this point: in a companion paper [22], we develop an algorithm for finding a minimal Coxian distribution that well-represents a given distribution, and we find that the simple characterization of S^(n) provided herein is very useful in this task. Our results are also useful as an input to some existing fitting algorithms, such as Johnson and Taaffe's nonlinear programming approach [15], which require knowing a priori the number of phases n in the minimal Coxian distribution. Furthermore, we classify a few examples of common and practical distributions as subsets of S^(n) for some n.

Future work includes a simple characterization of the set of distributions that are well-represented by general n-phase PH distributions. If we were to follow the approach in this paper, we would start by specifying the lower bounds for the second and third normalized moments of general n-phase PH distributions. However, this seems to be nontrivial: although the lower bound on the normalized second moment is known [1], the lower bound on the normalized third moment of n-phase PH distributions is not known.

Acknowledgement. We would like to thank Miklos Telek for his help in improving the presentation and quality of this paper.
References

1. D. Aldous and L. Shepp. The least variable phase type distribution is Erlang. Communications in Statistics – Stochastic Models, 3:467–473, 1987.
2. T. Altiok. On the phase-type approximations of general distributions. IIE Transactions, 17:110–116, 1985.
3. M. E. Crovella and A. Bestavros. Self-similarity in World Wide Web traffic: Evidence and possible causes. IEEE/ACM Transactions on Networking, 5(6):835–846, December 1997.
4. M. E. Crovella, M. S. Taqqu, and A. Bestavros. Heavy-tailed probability distributions in the World Wide Web. In A Practical Guide To Heavy Tails, chapter 1, pages 1–23. Chapman & Hall, New York, 1998.
5. A. Cumani. On the canonical representation of homogeneous Markov processes modeling failure-time distributions. Microelectronics and Reliability, 22:583–602, 1982.
6. A. Feldmann and W. Whitt. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Performance Evaluation, 32:245–279, 1998.
7. H. Franke, J. Jann, J. Moreira, P. Pattnaik, and M. Jette. An evaluation of parallel job scheduling for ASCI Blue-Pacific. In Proceedings of Supercomputing '99, pages 679–691, November 1999.
8. M. Harchol-Balter. Task assignment with unknown duration. Journal of the ACM, 49(2), 2002.
9. M. Harchol-Balter and A. Downey. Exploiting process lifetime distributions for dynamic load balancing. In Proceedings of SIGMETRICS '96, pages 13–24, 1996.
10. M. Harchol-Balter, C. Li, T. Osogami, A. Scheller-Wolf, and M. Squillante. Task assignment with cycle stealing under central queue. In Proceedings of ICDCS '03, pages 628–637, May 2003.
11. A. Horváth and M. Telek. Approximating heavy tailed behavior with phase type distributions. In Advances in Matrix-Analytic Methods for Stochastic Models, pages 191–214. Notable Publications, July 2000.
12. A. Horváth and M. Telek. PhFit: A general phase-type fitting tool. In Proceedings of Performance TOOLS 2002, pages 82–91, April 2002.
13. G. Irlam. Unix file size survey – 1993. Available at http://www.base.com/gordoni/ufs93.html, September 1994.
14. M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Density function shapes. Communications in Statistics – Stochastic Models, 6:283–306, 1990.
15. M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Nonlinear programming approaches. Communications in Statistics – Stochastic Models, 6:259–281, 1990.
16. S. Karlin and W. Studden. Tchebycheff Systems: With Applications in Analysis and Statistics. John Wiley and Sons, 1966.
17. R. E. A. Khayari, R. Sadre, and B. Haverkort. Fitting world-wide web request traces with the EM-algorithm. Performance Evaluation, 52:175–191, 2003.
18. G. Latouche and V. Ramaswami. Introduction to Matrix Analytic Methods in Stochastic Modeling. ASA-SIAM, Philadelphia, 1999.
19. W. E. Leland and T. J. Ott. Load-balancing heuristics and process behavior. In Proceedings of Performance and ACM Sigmetrics, pages 54–69, 1986.
20. R. Marie. Calculating equilibrium probabilities for λ(n)/C_k/1/N queues. In Proceedings of Performance 1980, pages 117–125, 1980.
21. M. F. Neuts. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. The Johns Hopkins University Press, 1981.
22. T. Osogami and M. Harchol-Balter. A closed-form solution for mapping general distributions to minimal PH distributions. In Proceedings of TOOLS 2003, September 2003.
23. T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. Analysis of cycle stealing with switching cost. In Proceedings of SIGMETRICS '03, pages 184–195, June 2003.
24. V. Paxson and S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, pages 226–244, June 1995.
25. D. L. Peterson and D. B. Adams. Fractal patterns in DASD I/O traffic. In CMG Proceedings, December 1995.
26. A. Riska, V. Diev, and E. Smirni. Efficient fitting of long-tailed data sets into PH distributions. Performance Evaluation, 2003 (to appear).
27. C. Sauer and K. Chandy. Approximate analysis of central server models. IBM Journal of Research and Development, 19:301–313, 1975.
28. B. Schroeder and M. Harchol-Balter. Evaluation of task assignment policies for supercomputing servers: The case for load unbalancing and fairness. In Proceedings of HPDC 2000, pages 211–219, 2000.
29. M. Squillante. Matrix-analytic methods in stochastic parallel-server scheduling models. In Advances in Matrix-Analytic Methods for Stochastic Models. Notable Publications, July 1998.
30. D. Starobinski and M. Sidi. Modeling and analysis of power-tail distributions via classical teletraffic methods. Queueing Systems, 36:243–267, 2000.
31. M. Telek and A. Heindl. Matching moments for acyclic discrete and continuous phase-type distributions of second order. International Journal of Simulation, 3:47–57, 2003.
32. W. Whitt. Approximating a point process by a renewal process: Two basic methods. Operations Research, 30:125–147, 1982.
33. Y. Zhang, H. Franke, J. Moreira, and A. Sivasubramaniam. An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration. IEEE Transactions on Parallel and Distributed Systems, 14:236–247, 2003.
A Technical Lemmas
Lemma 3. The set defined in (3) and the set defined in (4) are equivalent sets.

Proof. Recall that the set defined in (3) is the union of the following three sets:

    A1 = { F | ∃p, (9p m_2^F − 12 + 3√2 (2 − p m_2^F)^{3/2}) / (p^2 m_2^F) ≤ m_3^F ≤ 6(p m_2^F − 1)/(p^2 m_2^F)  and  3/(2p) ≤ m_2^F < 2/p },
    A2 = { F | ∃p, m_3^F = 3/p  and  m_2^F = 2/p },
    A3 = { F | ∃p, (3/2) m_2^F < m_3^F  and  2/p < m_2^F };

the set defined in (4) is the union of the following three sets:
    B1 = { F | (4/3) m_2^F ≤ m_3^F ≤ 6(m_2^F − 1)/m_2^F  and  3/2 ≤ m_2^F ≤ 2 },
    B2 = { F | (4/3) m_2^F ≤ m_3^F ≤ (3/2) m_2^F  and  2 < m_2^F },
    B3 = { F | (3/2) m_2^F < m_3^F  and  2 < m_2^F }.

It suffices to prove that (i) A1 = B1 ∪ B2, (ii) A2 ⊂ B1 ∪ B2, and (iii) A3 = B3. (ii) and (iii) are immediate from the definitions. To prove (i), we prove that A1 ⊂ B1 ∪ B2 and B1 ∪ B2 ⊂ A1.

Consider a distribution F ∈ A1. We first show that F ∈ B1 ∪ B2. Let u(p) and l(p) be the upper and lower bounds on m_3^F, respectively:

    l(p) = (9p m_2^F − 12 + 3√2 (2 − p m_2^F)^{3/2}) / (p^2 m_2^F);    u(p) = 6(p m_2^F − 1) / (p^2 m_2^F).

Then u(p) and l(p) are both continuous and increasing functions of p for p ≤ 2/m_2^F. When m_2^F ≤ 2, the range of p is 3/(2m_2^F) ≤ p ≤ 1. Thus,

    (4/3) m_2^F = l(3/(2m_2^F)) ≤ m_3^F ≤ u(1) = 6(m_2^F − 1)/m_2^F,

and hence F ∈ B1. When 2 < m_2^F, the range of p is 3/(2m_2^F) ≤ p ≤ 2/m_2^F. Thus,

    (4/3) m_2^F = l(3/(2m_2^F)) ≤ m_3^F ≤ u(2/m_2^F) = (3/2) m_2^F,

and hence F ∈ B2. Therefore, A1 ⊂ B1 ∪ B2. Conversely, since u(p) and l(p) are continuous functions of p, m_3^F can take any value between the lower and upper bounds as p ranges over the above intervals. Therefore, B1 ∪ B2 ⊂ A1.
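The boundary evaluations used in this proof are easy to confirm numerically (our sketch): l(3/(2m_2)) = (4/3)m_2, u(1) = 6(m_2 − 1)/m_2, and u(2/m_2) = (3/2)m_2.

```python
import math

def l_bound(p, m2):
    """Lower bound l(p) on m3 from the set A1."""
    return (9 * p * m2 - 12 + 3 * math.sqrt(2) * (2 - p * m2) ** 1.5) / (p**2 * m2)

def u_bound(p, m2):
    """Upper bound u(p) on m3 from the set A1."""
    return 6 * (p * m2 - 1) / (p**2 * m2)

for m2 in (1.6, 2.0, 3.0, 10.0):
    assert abs(l_bound(3 / (2 * m2), m2) - (4/3) * m2) < 1e-9
    assert abs(u_bound(2 / m2, m2) - 1.5 * m2) < 1e-9
    if m2 <= 2:
        assert abs(u_bound(1.0, m2) - 6 * (m2 - 1) / m2) < 1e-9
print("boundary values of l(p), u(p) check out")
```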
Lemma 4. Let $y \ge 0$ and $k \ge 1$. Then,

$$\frac{(1+y)\left[6(k+1)(k-1)^2(1+y) + 3k(k^2-1)y^2 + k^2(k+2)y^3\right]}{(k+1)\left[2(k-1) + 2(k-1)y + ky^2\right]^2} \ge \frac{k+3}{k+2}.$$

Proof. Let

$$g(y,k) = (1+y)\left[6(k+1)(k-1)^2(1+y) + 3k(k^2-1)y^2 + k^2(k+2)y^3\right](k+2) - (k+1)\left[2(k-1) + 2(k-1)y + ky^2\right]^2(k+3)$$
$$= (2+4y+y^2)k^4 - 2(1+2y+4y^2+y^3)k^3 - (2+4y+y^2-5y^3-y^4)k^2 + 2(1+y)(1+y+3y^2)k.$$

We prove that $g(y,k) \ge 0$. Let $h(y,k) = \frac{g(y,k)}{k}$. It suffices to prove $h(y,k) \ge 0$. Observe that $\frac{\partial h(y,k)}{\partial k} = 0$ iff $k = \frac{2+4y+8y^2+2y^3 \pm \sqrt{d(y)}}{3(2+4y+y^2)}$, where

$$d(y) = 16 + 64y + 108y^2 + 66y^3 + 17y^4 + 5y^5 + y^6.$$
Necessary and Sufficient Conditions for Representing General Distributions
199
Notice that $d(y) \ge (4+8y+y^2+y^3)^2$. Thus,

$$\frac{2+4y+8y^2+2y^3 + \sqrt{d(y)}}{3(2+4y+y^2)} \ge \frac{2+4y+8y^2+2y^3 + (4+8y+y^2+y^3)}{3(2+4y+y^2)} \ge 1$$

for $y \ge 0$. Therefore, $h(y,k)$ is minimized when $k = \frac{2+4y+8y^2+2y^3+\sqrt{d(y)}}{3(2+4y+y^2)}$. Let

$$s(y) = h\!\left(y,\ \frac{2+4y+8y^2+2y^3+\sqrt{d(y)}}{3(2+4y+y^2)}\right) = \frac{2\left((28+83y+16y^2+y^3)\,d(y) - d(y)^{3/2}\right)}{27(2+4y+y^2)^2} - \frac{12(64+456y+1260y^2+1655y^3+889y^4+147y^5)}{27(2+4y+y^2)^2}.$$

It suffices to prove $s(y) \ge 0$. Let $t(y) = 27(2+4y+y^2)^2 s(y)$. It suffices to prove $t(y) \ge 0$. Notice that $t(0) = 0$. Thus, it suffices to prove $t'(y) \ge 0$ for $y \ge 0$. However, $t'(y) = \frac{3}{\sqrt{d(y)}}\,v(y)$, where

$$v(y) = 2(128+688y+1922y^2+3216y^3+3055y^4+1562y^5+420y^6+56y^7+3y^8)\sqrt{d(y)} - (64+216y+198y^2+68y^3+25y^4+6y^5)\,d(y)$$
$$\ge 2(128+688y+1922y^2+3216y^3+3055y^4+1562y^5+420y^6+56y^7+3y^8)\cdot(4+8y+y^2+y^3) - (64+216y+198y^2+68y^3+25y^4+6y^5)\,d(y)$$
$$= 3y^2(912+5600y+13212y^2+15184y^3+9604y^4+3914y^5+1175y^6+235y^7+21y^8) \ge 0.$$
A Closed-Form Solution for Mapping General Distributions to Minimal PH Distributions Takayuki Osogami and Mor Harchol-Balter Department of Computer Science, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA {osogami, harchol}@cs.cmu.edu
Abstract. Approximating general distributions by phase-type (PH) distributions is a popular technique in queueing analysis, since the Markovian property of PH distributions often allows analytical tractability. This paper proposes an algorithm for mapping a general distribution G to a PH distribution, where the goal is to find a PH distribution which matches the first three moments of G. Since efficiency of the algorithm is of primary importance, we first define a particular subset of the PH distributions, which we refer to as EC distributions. The class of EC distributions has very few free parameters, which narrows down the search space, making the algorithm efficient; in fact, we provide a closed-form solution for the parameters of the EC distribution. Our solution is general in that it applies to any distribution whose first three moments can be matched by a PH distribution. Also, our resulting EC distribution requires a nearly minimal number of phases, always within one of the minimal number of phases required by any acyclic PH distribution. Lastly, we discuss the numerical stability of our solution.
1 Introduction
Motivation. There is a very large body of literature on the topic of approximating general distributions by phase-type (PH) distributions, whose Markovian properties make them far more analytically tractable. Much of this research has focused on the specific problem of finding an algorithm which maps any general distribution, G, to a PH distribution, P, where P and G agree on the first three moments. Throughout this paper we say that G is well-represented by P if P and G agree on their first three moments. We choose to limit our discussion in this paper to three-moment matching, because matching the first three moments of an input distribution has been shown to be effective in predicting mean performance for a variety of computer system models [4,5,19,23,27]. Clearly, however, three moments might not always suffice for every problem, and we leave the problem of matching more moments to future work. Moment-matching algorithms are evaluated along four different measures: The number of moments matched – in general, matching more moments is more desirable.

P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 200–217, 2003. © Springer-Verlag Berlin Heidelberg 2003
The computational efficiency of the algorithm – it is desirable that the algorithm have a short running time. Ideally, one would like a closed-form solution for the parameters of the matching PH distribution. The generality of the solution – ideally, the algorithm should work for as broad a class of distributions as possible. The minimality of the number of phases – it is desirable that the matching PH distribution, P, have very few phases. Recall that the goal is to find P which can replace the input distribution G in some queueing model, allowing a Markov chain representation of the problem. Since it is desirable that the state space of this resulting Markov chain be kept small, we want to keep the number of phases in P low. This paper proposes a moment-matching algorithm which performs very well along all four of these measures. Our solution matches three moments, provides a closed-form representation of the parameters of the matching PH distribution, applies to all distributions which can be well-represented by a PH distribution, and is nearly minimal in the number of phases required. The general approach in designing moment-matching algorithms in the literature is to start by defining a subset S of the PH distributions, and then match each input distribution G to a distribution in S. The reason for limiting the solution to a distribution in S is that this narrows the search space and thus improves the computational efficiency of the algorithm. Observe that n-phase PH distributions have Θ(n²) free parameters [16] (see Figure 1), while S can be defined to have far fewer free parameters. For all computationally efficient algorithms in the literature, S was chosen to be some subset of the acyclic PH distributions, where an acyclic PH distribution is a PH distribution whose underlying continuous-time Markov chain has no transition from state i to state j for all i > j. One has to be careful in how one defines the subset S, however.
If S is too small, it may limit the space of distributions which can be well-represented.¹ Also, if S is too small, it may exclude solutions with a minimal number of phases. In this paper we define a subset of the PH distributions, which we call EC distributions. EC distributions have only six free parameters, which allows us to derive a closed-form solution for these parameters in terms of the input distribution G. The set of EC distributions is general enough, however, that for all distributions G that can be well-represented by a PH distribution, there exists an EC distribution, E, such that G is well-represented by E. Furthermore, the class of EC distributions is broad enough such that for any distribution G that is well-represented by an n-phase acyclic PH distribution, there exists an EC distribution E with at most n + 1 phases, such that G is well-represented by E.²

¹ For example, let G be a distribution whose first three moments are 1, 2, and 12. The system of equations for matching G to a two-phase Coxian+ distribution (see Figure 2) with three parameters (λ1, λ2, p) results in either λ1 or λ2 being negative. As another example, it can be shown that the generalized Erlang distribution is not general enough to well-represent all the distributions with low variability (see [17]).
² Ideally, one would like to evaluate the number of phases with respect to the minimal (possibly-cyclic) PH distribution, i.e., the PH distribution is not restricted to be acyclic. However, the necessary and sufficient number of phases required to well-represent a given distribution by a (possibly-cyclic) PH distribution is unknown.
Fig. 1. A PH distribution is the distribution of the absorption time in a finite-state continuous-time Markov chain. The figure shows a 4-phase PH distribution. There are n = 4 states, where the ith state has exponentially-distributed sojourn time with rate λi. With probability p0i we start in the ith state, and the next state is state j with probability pij. Each state i has probability pi5 that the next state will be the absorbing state. The absorption time is the sum of the times spent in each of the states.
Preliminary Definitions. Formally, we will use the following definitions:

Definition 1. A distribution G is well-represented by a distribution F if F and G agree on their first three moments.

The normalized moments, introduced in [18], help provide a simple representation and analysis of our closed-form solution. These are defined as follows:

Definition 2. Let $\mu_k^F$ be the $k$-th moment of a distribution $F$ for $k = 1, 2, 3$. The normalized $k$-th moment $m_k^F$ of $F$ for $k = 2, 3$ is defined to be

$$m_2^F = \frac{\mu_2^F}{(\mu_1^F)^2} \qquad \text{and} \qquad m_3^F = \frac{\mu_3^F}{\mu_1^F \mu_2^F}.$$

Notice the correspondence to the coefficient of variability $C_F$ and the skewness $\gamma_F$ of $F$: $m_2^F = C_F^2 + 1$ and $m_3^F = \nu_F \sqrt{m_2^F}$, where $\nu_F = \frac{\mu_3^F}{(\mu_2^F)^{3/2}}$. ($\nu_F$ and $\gamma_F$ are closely related, since $\gamma_F = \frac{\bar{\mu}_3^F}{(\bar{\mu}_2^F)^{3/2}}$, where $\bar{\mu}_k^F$ is the centralized $k$-th moment of $F$ for $k = 2, 3$.)
Definition 3. $\mathrm{PH}_3$ refers to the set of distributions that are well-represented by a PH distribution.

It is known that a distribution $G$ is in $\mathrm{PH}_3$ iff its normalized moments satisfy $m_3^G > m_2^G > 1$ [10]. Since any nonnegative distribution $G$ satisfies $m_3^G \ge m_2^G \ge 1$ [13], almost all the nonnegative distributions are in $\mathrm{PH}_3$.
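For concreteness, the normalized moments of Definition 2 are straightforward to compute from raw moments; for example, every exponential distribution has $(m_2, m_3) = (2, 3)$ (independent of its rate) and is therefore in $\mathrm{PH}_3$ by the criterion above. A minimal sketch (the function name is ours):

```python
def normalized_moments(mu1, mu2, mu3):
    # Definition 2: m2 = mu2 / mu1^2,  m3 = mu3 / (mu1 * mu2)
    return mu2 / mu1**2, mu3 / (mu1 * mu2)

# Exp(lambda) has mu_k = k!/lambda^k, so (m2, m3) = (2, 3) for every rate:
exp_m2, exp_m3 = normalized_moments(1.0, 2.0, 6.0)
# Erlang-2 with rate 2 (mu1 = 1, mu2 = 3/2, mu3 = 3) gives (m2, m3) = (3/2, 2):
erl_m2, erl_m3 = normalized_moments(1.0, 1.5, 3.0)
```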
Fig. 2. An n-phase Coxian distribution is a particular n-phase PH distribution whose underlying Markov chain is of the form in the figure, where 0 ≤ pi ≤ 1 and λi > 0 for all 1 ≤ i ≤ n. An n-phase Coxian+ distribution is a particular n-phase Coxian distribution with p1 = 1.
Definition 4. OPT(G) is defined to be the minimum number of necessary phases for a distribution G to be well-represented by an acyclic PH distribution.³

Previous Work. Prior work has contributed a very large number of moment-matching algorithms. While all of these algorithms excel with respect to some of the four measures mentioned earlier (number of moments matched; generality of the solution; computational efficiency of the algorithm; and minimality of the number of phases), they all are deficient in at least one of these measures, as explained below. In cases where matching only two moments suffices, it is possible to achieve solutions which perform very well along all the other three measures. Sauer and Chandy [21] provide a closed-form solution for matching two moments of a general distribution in PH3. They use a two-branch hyper-exponential distribution for matching distributions with squared coefficient of variability C² > 1 and a generalized Erlang distribution for matching distributions with C² < 1. Marie [15] provides a closed-form solution for matching two moments of a general distribution in PH3. He uses a two-phase Coxian+ distribution⁴ for distributions with C² > 1 and a generalized Erlang distribution for distributions with C² < 1. If one is willing to match only a subset of distributions, then again it is possible to achieve solutions which perform very well along the remaining three measures. Whitt [26] and Altiok [2] focus on the set of distributions with C² > 1 and sufficiently high third moment. They obtain a closed-form solution for matching three moments of any distribution in this set. Whitt matches to a two-branch hyper-exponential distribution and Altiok matches to a two-phase Coxian+ distribution. Telek and Heindl [25] focus on the set of distributions with C² ≥ 1/2 and various constraints on the third moment.
They obtain a closed-form solution for matching three moments of any distribution in this set, by using a two-phase Coxian+ distribution. Johnson and Taaffe [10,9] come closest to achieving all four measures. They provide a closed-form solution for matching the first three moments of any distribution G ∈ PH3. They use a mixed Erlang distribution with common order.

³ The number of necessary phases in general PH distributions is not known. As shown in the next section, all the previous work on computationally efficient algorithms for mapping general distributions concentrates on a subset of acyclic PH distributions.
⁴ Coxian+ and Coxian distributions are particular PH distributions shown in Figure 2.
Unfortunately, this mixed Erlang distribution does not produce a minimal solution. Their solution requires 2 OPT(G) + 2 phases in the worst case. In complementary work, Johnson and Taaffe [12,11] again look at the problem of matching the first three moments of any distribution G ∈ PH3, this time using three types of PH distributions: a mixture of two Erlang distributions, a Coxian+ distribution, and a general PH distribution. Their solution is nearly minimal in that it requires at most OPT(G) + 2 phases. Unfortunately, their algorithm requires solving a nonlinear programming problem and hence is very computationally inefficient. Above we have described the prior work focusing on moment-matching algorithms (three moments), which is the focus of this paper. There is also a large body of work focusing on fitting the shape of an input distribution using a PH distribution. Of particular recent interest has been work on fitting heavy-tailed distributions to PH distributions; see for example the work of [3,6,7,14,20,24]. There is also work which combines the goals of moment matching with the goal of fitting the shape of the distribution; see for example the work of [8,22]. The work above is clearly broader in its goals than simply matching three moments. Unfortunately, there is a tradeoff: obtaining a more precise fit requires many more phases. Additionally, it can sometimes be very computationally inefficient [8,22]. The Idea behind the EC Distribution. In all the prior work on computationally efficient moment-matching algorithms, the approach was to match a general input distribution G to some subset S of the PH distributions. In this paper, we show that by using the set of EC distributions as our subset S, we achieve a solution which excels in all four desirable measures mentioned earlier. We define the EC distributions as follows: Definition 5.
An n-phase EC (Erlang-Coxian) distribution is a particular PH distribution whose underlying Markov chain is of the form in Figure 3.
Fig. 3. The Markov chain underlying an EC distribution, where the first box above depicts the underlying continuous time Markov chain in an N -phase Erlang distribution, where N = n − 2, and the second box depicts the underlying continuous time Markov chain in a two-phase Coxian+ distribution. Notice that the rates in the first box are the same for all states.
We now provide some intuition behind the creation of the EC distribution. Recall that a Coxian distribution is very good for approximating any distribution with high variability. In particular, a two-phase Coxian distribution is known to
well-represent any distribution that has high second and third moments (any distribution $G$ that satisfies $m_2^G > 2$ and $m_3^G > \frac{3}{2} m_2^G$) [18]. However, a Coxian distribution requires many more phases for approximating distributions with lower second and third moments. (For example, a Coxian distribution requires at least $n$ phases to well-represent a distribution $G$ with $m_2^G \le \frac{n+1}{n}$ for integers $n \ge 1$ [18].) The large number of phases needed implies that many free parameters must be determined, which implies that any algorithm that tries to well-represent an arbitrary distribution using a minimal number of phases is likely to suffer from computational inefficiency. By contrast, an n-phase Erlang distribution has only two free parameters and is also known to have the least normalized second moment among all the n-phase PH distributions [1]. However, the Erlang distribution is obviously limited in the set of distributions which it can well-represent. Our approach is therefore to combine the Erlang distribution with the two-phase Coxian distribution, allowing us to represent distributions with all ranges of variability, while using only a small number of phases. Furthermore, the fact that the EC distribution has very few free parameters allows us to obtain closed-form expressions for the parameters (n, p, λY, λX1, λX2, pX) of the EC distribution that well-represents any given distribution in PH3. Outline of Paper. We begin in Section 2 by characterizing the EC distribution in terms of normalized moments. We find that for the purpose of moment matching it suffices to narrow down the set of EC distributions further from six free parameters to five free parameters, by optimally fixing one of the parameters. We next present three variants of closed-form solutions for the remaining free parameters of the EC distribution, each of which achieves slightly different goals.
The first closed-form solution provided, which we refer to as the simple solution (see Section 3), has the advantage of simplicity and readability; however, it does not work for all distributions in PH3 (although it works for almost all). This solution requires at most OPT(G) + 2 phases. The second closed-form solution provided, which we refer to as the improved solution (see Section 4.1), is defined for all the input distributions in PH3 and uses at most OPT(G) + 1 phases. This solution is only lacking in numerical stability. The third closed-form solution provided, which we refer to as the numerically stable solution (see Section 4.2), again is defined for all input distributions in PH3. It uses at most OPT(G) + 2 phases and is numerically stable in that the moments of the EC distribution are insensitive to a small perturbation in its parameters.
2 EC Distribution: Motivation and Properties
The purpose of this section is twofold: to provide a detailed characterization of the EC distribution, and to discuss a narrowed-down subset of the EC distributions with only five free parameters (λY is fixed) which we will use in our moment-matching method. Both of these results are summarized in Theorem 1. To motivate the theorem in this section, consider the following story. Suppose one is trying to match the first three moments of a given distribution G to a
distribution P which consists of a generalized Erlang distribution (in a generalized Erlang distribution the rates of the exponential phases may differ) followed by a two-phase Coxian+ distribution. If the distribution G has sufficiently high second and third moments, then a two-phase Coxian+ distribution alone suffices and we need zero phases of the generalized Erlang distribution. If the variability of G is lower, however, we might try appending a single-phase generalized Erlang distribution to the two-phase Coxian+ distribution. If that doesn't suffice, we might append a two-phase generalized Erlang distribution to the two-phase Coxian+ distribution. If our distribution G has very low variability we might be forced to use many phases of the generalized Erlang distribution to get the variability of P to be low enough. Therefore, to minimize the number of phases in P, it seems desirable to choose the rates of the generalized Erlang distribution so that the overall variability of P is minimized. Continuing with our story, one could express the appending of each additional phase of the generalized Erlang distribution as a "function" whose goal is to reduce the variability of P yet further. We call this "function φ."

Definition 6. Let X be an arbitrary distribution. Function φ maps X to φ(X) such that φ(X) = Y ∗ X, where Y is an exponential distribution with rate λY independent of X, Y ∗ X is the convolution of Y and X, and λY is chosen so that the normalized second moment of φ(X) is minimized. Also, φ^l(X) = φ(φ^{l−1}(X)) refers to the distribution obtained by applying function φ to φ^{l−1}(X) for integers l ≥ 1, where φ⁰(X) = X.

Observe that, when X is a k-phase PH distribution, φ(X) is a (k + 1)-phase PH distribution whose underlying Markov chain can be obtained by appending a state with rate λY to the Markov chain underlying X, where λY is chosen so that $m_2^{\phi(X)}$ is minimized. In theory, function φ allows each successive exponential distribution which is appended to have a different first moment. The following theorem shows that if the exponential distribution Y being appended by function φ is chosen so as to minimize the normalized second moment of φ(X) (as specified by the definition), then the first moment of each successive Y is always the same and is defined by the simple formula shown in (1). The theorem below further characterizes the normalized moments of φ^l(X).

Theorem 1. Let $\phi^l(X) = Y_l * \phi^{l-1}(X)$ and let $\lambda_{Y_l} = 1/\mu_1^{Y_l}$ for $l = 1, \ldots, N$. Then,

$$\lambda_{Y_l} = \frac{1}{(m_2^X - 1)\,\mu_1^X} \qquad (1)$$

for $l = 1, \ldots, N$. The normalized moments of $Z_N = \phi^N(X)$ are:

$$m_2^{Z_N} = \frac{(m_2^X - 1)(N+1) + 1}{(m_2^X - 1)N + 1}; \qquad (2)$$

$$m_3^{Z_N} = \frac{m_2^X m_3^X + (m_2^X - 1)N\left[3m_2^X + (m_2^X - 1)(m_2^X + 2)(N+1) + (m_2^X - 1)^2(N+1)^2\right]}{\left((m_2^X - 1)(N+1) + 1\right)\left((m_2^X - 1)N + 1\right)^2}. \qquad (3)$$
Observe that, when X is a k-phase PH distribution, φ^N(X) is a (k + N)-phase PH distribution whose underlying Markov chain can be obtained by appending N states with rate λY to the Markov chain underlying X, where λY is chosen so that $m_2^{\phi(X)}$ is minimized. The remainder of this section will prove the above theorem and a corollary.

Proof (Theorem 1). We first characterize Z = φ(X) = Y ∗ X, where X is an arbitrary distribution with a finite third moment and Y is an exponential distribution. The normalized second moment of Z is

$$m_2^Z = \frac{m_2^X + 2y + 2y^2}{(1+y)^2}, \qquad \text{where } y = \frac{\mu_1^Y}{\mu_1^X}.$$

Observe that $m_2^Z$ is minimized when $y = m_2^X - 1$, namely,

$$\mu_1^Y = (m_2^X - 1)\,\mu_1^X. \qquad (4)$$

Observe that when equation (4) is satisfied, the normalized second moment of Z satisfies

$$m_2^Z = 2 - \frac{1}{m_2^X}, \qquad (5)$$

and the normalized third moment of Z satisfies

$$m_3^Z = \frac{m_3^X}{m_2^X(2m_2^X - 1)} + \frac{3(m_2^X - 1)}{m_2^X}. \qquad (6)$$

We next characterize $Z_l = \phi^l(X) = Y_l * \phi^{l-1}(X)$ for $2 \le l \le N$: by (5) and (6), (2) and (3) follow from solving the following recursive formulas (where we use $b_l$ to denote $m_2^{\phi^l(X)}$ and $B_l$ to denote $m_3^{\phi^l(X)}$):

$$b_{l+1} = 2 - \frac{1}{b_l}; \qquad (7)$$
$$B_{l+1} = \frac{B_l}{b_l(2b_l - 1)} + \frac{3(b_l - 1)}{b_l}. \qquad (8)$$

The solution for (7) is given by

$$b_l = \frac{(b_1 - 1)l + 1}{(b_1 - 1)(l-1) + 1} \qquad (9)$$

for all $l \ge 1$, and the solution for (8) is given by

$$B_l = \frac{b_1 B_1 + (b_1 - 1)(l-1)\left[3b_1 + (b_1 - 1)(b_1 + 2)l + (b_1 - 1)^2 l^2\right]}{\left((b_1 - 1)l + 1\right)\left((b_1 - 1)(l-1) + 1\right)^2} \qquad (10)$$

for all $l \ge 1$. Equations (9) and (10) can be easily verified by substitution into (7) and (8), respectively. This completes the proof of (2) and (3).

The proof of (1) proceeds by induction. When l = 1, (1) follows from (4). Assume that (1) holds for l = 1, ..., t. Let $Z_t = \phi^t(X)$. By (2), which is proved above, $m_2^{Z_t} = \frac{(m_2^X - 1)(t+1) + 1}{(m_2^X - 1)t + 1}$. Thus, by (4),

$$\mu_1^{Y_{t+1}} = (m_2^{Z_t} - 1)\,\mu_1^{Z_t} = (m_2^X - 1)\,\mu_1^X.$$
Corollary 1. Let $Z_N = \phi^N(X)$. If $X \in \{F \mid 2 < m_2^F\}$, then

$$Z_N \in \left\{ F \,\Big|\, \frac{N+2}{N+1} < m_2^F < \frac{N+1}{N} \right\}.$$

Corollary 1 suggests the number N of times that function φ must be applied to X to bring $m_2^{Z_N}$ into the desired range, given the value of $m_2^X$. Observe that the two-phase Coxian+ distributions X used in our solution lie in $\{F \mid 2 < m_2^F\}$.

Proof (Corollary 1). By (2), $m_2^{Z_N}$ is a continuous and monotonically increasing function of $m_2^X$. Thus, the infimum and the supremum of $m_2^{Z_N}$ are given by evaluating $m_2^{Z_N}$ at the infimum and the supremum, respectively, of $m_2^X$. When $m_2^X \to 2$, $m_2^{Z_N} \to \frac{N+2}{N+1}$. When $m_2^X \to \infty$, $m_2^{Z_N} \to \frac{N+1}{N}$.
3 A Simple Closed-Form Solution
Theorem 1 implies that the parameter λY of the EC distribution can be fixed without excluding the distributions of lowest variability from the set of EC distributions. In the rest of the paper, we constrain λY as follows:

$$\lambda_Y = \frac{1}{(m_2^X - 1)\,\mu_1^X}, \qquad (11)$$

and derive closed-form representations of the remaining free parameters (n, p, λX1, λX2, pX), where these free parameters will determine $m_2^X$ and $\mu_1^X$ in (11). Obviously, at least three degrees of freedom are necessary to match three moments. As we will see, the additional degrees of freedom allow us to accept all input distributions in PH3, use a smaller number of phases, and achieve numerical stability. We introduce the following sets of distributions to describe the closed-form solutions compactly:

Definition 7. Let $U_i$, $M_i$, and $L$ be the sets of distributions defined as follows:

$$U_0 = \left\{ F \mid m_2^F > 2 \text{ and } m_3^F > 2m_2^F - 1 \right\}, \qquad U_i = \left\{ F \,\Big|\, \frac{i+2}{i+1} < m_2^F < \frac{i+1}{i} \text{ and } m_3^F > 2m_2^F - 1 \right\},$$
$$M_0 = \left\{ F \mid m_2^F > 2 \text{ and } m_3^F = 2m_2^F - 1 \right\}, \qquad M_i = \left\{ F \,\Big|\, \frac{i+2}{i+1} < m_2^F < \frac{i+1}{i} \text{ and } m_3^F = 2m_2^F - 1 \right\},$$
$$L = \left\{ F \mid m_2^F > 1 \text{ and } m_2^F < m_3^F < 2m_2^F - 1 \right\},$$

for positive integers i. Also, let $U^+ = \cup_{i=1}^{\infty} U_i$, $M^+ = \cup_{i=1}^{\infty} M_i$, $U = U_0 \cup U^+$, and $M = M_0 \cup M^+$.
These sets are illustrated in Figure 4. The next lemma provides the intuition behind the sets U, M, and L; namely, for all distributions X, the distributions X and φ(X) are in the same classification region (Figure 4).
A Closed-Form Solution for Mapping General Distributions
209
Fig. 4. A classification of distributions. The dotted lines delineate the set of all nonnegative distributions G ($m_3^G \ge m_2^G \ge 1$).
Lemma 1. Let $Z_N = \phi^N(X)$ for integers $N \ge 1$. If $X \in U$ (respectively, $X \in M$, $X \in L$), then $Z_N \in U$ (respectively, $Z_N \in M$, $Z_N \in L$) for all $N \ge 1$.

Proof. We prove the case when N = 1. The lemma then follows by induction. Let Z = φ(X). By (2), $m_2^X = \frac{1}{2 - m_2^Z}$, and by (6),

$$m_3^Z \;=\; \text{(respectively, } <, \text{ and } >\text{)} \;\; \frac{2m_2^X - 1}{m_2^X(2m_2^X - 1)} + \frac{3(m_2^X - 1)}{m_2^X} \;=\; 2m_2^Z - 1,$$

where the relation in the first step substitutes $m_3^X =$ (respectively, $<$, and $>$) $2m_2^X - 1$ into (6), and the last equality follows from $m_2^X = \frac{1}{2 - m_2^Z}$.
By Corollary 1 and Lemma 1, it follows that:

Corollary 2. Let $Z_N = \phi^N(X)$ for $N \ge 0$. If $X \in U_0$ (respectively, $X \in M_0$), then $Z_N \in U_N$ (respectively, $Z_N \in M_N$).

The corollary implies that for all G ∈ U_N ∪ M_N, G can be well-represented by an (N + 2)-phase EC distribution with no mass probability at zero (p = 1), since, for all F ∈ U0 ∪ M0, F can be well-represented by a two-phase Coxian+ distribution, and $Z_N = \phi^N(X)$ can be well-represented by an (2 + N)-phase EC distribution. It can also be easily shown that for all G ∈ L_N, G can be well-represented by an (N + 2)-phase EC distribution with nonzero mass probability at zero (p < 1). From these properties of $\phi^N(X)$, it is relatively easy to provide a closed-form solution for the parameters (n, p, λX1, λX2, pX) of an EC distribution Z so that a given distribution G is well-represented by Z. Essentially, one just needs to find an appropriate N and solve $Z = \phi^N(X)$ for X in terms of normalized moments, which is immediate since N is given by Corollary 1 and the normalized moments of X can be obtained from Theorem 1. A little more effort is necessary to minimize the number of phases and to guarantee numerical stability. In this section, we give a simple solution, which assumes the following condition on the input distribution G: G ∈ PH3⁻, where PH3⁻ = U ∪ M ∪ L. Observe that PH3⁻ includes almost all distributions in PH3; only the borders between the Ui's are not included. We also analyze the number of necessary phases and prove the following theorem:
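Lemma 1 can be checked numerically by iterating the one-step moment maps (5) and (6) and re-classifying after each application of φ. The sketch below is ours; the classification follows the boundary $m_3 = 2m_2 - 1$ from Definition 7, with a small tolerance for the equality case:

```python
def phi_moments(m2, m3):
    # One application of phi: normalized moments of phi(X), from (5) and (6)
    return 2 - 1/m2, m3/(m2*(2*m2 - 1)) + 3*(m2 - 1)/m2

def region(m2, m3, tol=1e-9):
    # Classification by the sign of m3 - (2 m2 - 1), per Definition 7
    if abs(m3 - (2*m2 - 1)) < tol:
        return "M"
    return "U" if m3 > 2*m2 - 1 else "L"
```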
Theorem 2. Under the simple solution, the number of phases needed to well-represent any distribution G by an EC distribution is at most OPT(G) + 2.
The Closed-Form Solution: The solution differs according to the classification of the input distribution G. When G ∈ U_0 ∪ M_0, a two-phase Coxian+ distribution suffices to match the first three moments. When G ∈ U^+ ∪ M^+, G is well-represented by an EC distribution with p = 1. When G ∈ L, G is well-represented by an EC distribution with p < 1. In all cases, the parameters (n, p, λ_{X1}, λ_{X2}, p_X) are given by simple closed formulas.

(i) If G ∈ U_0 ∪ M_0, then a two-phase Coxian+ distribution suffices to match the first three moments, i.e., p = 1 and n = 2 (N = 0). The parameters (λ_{X1}, λ_{X2}, p_X) of the two-phase Coxian+ distribution are chosen as follows [25,18]:

λ_{X1} = (u + √(u² − 4v)) / (2µ_1^G),   λ_{X2} = (u − √(u² − 4v)) / (2µ_1^G),
p_X = λ_{X2} µ_1^G (λ_{X1} µ_1^G − 1) / (λ_{X1} µ_1^G),

where u = (6 − 2m_3^G) / (3m_2^G − 2m_3^G)  and  v = (12 − 6m_2^G) / (m_2^G (3m_2^G − 2m_3^G)).
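For illustration (our Python sketch, not part of the paper), the case (i) formulas can be exercised by feeding them the moments of a known two-phase Coxian+ distribution and checking that its parameters are recovered:

```python
from math import sqrt, isclose

def two_phase_coxian(mu1, m2, m3):
    """Match (mu1, m2, m3) with a two-phase Coxian+ distribution,
    following the closed formulas of case (i); valid for G in U0 or M0."""
    u = (6 - 2 * m3) / (3 * m2 - 2 * m3)
    v = (12 - 6 * m2) / (m2 * (3 * m2 - 2 * m3))
    d = sqrt(u * u - 4 * v)
    lam1 = (u + d) / (2 * mu1)
    lam2 = (u - d) / (2 * mu1)
    px = lam2 * mu1 * (lam1 * mu1 - 1) / (lam1 * mu1)
    return lam1, lam2, px

# Moments of the Coxian+ with rates (3, 1) and branching probability 0.5:
# mu1 = 1/3 + 0.5 = 5/6, mu2 = 14/9, mu3 = 41/9.
mu1, mu2, mu3 = 5/6, 14/9, 41/9
m2, m3 = mu2 / mu1**2, mu3 / (mu1 * mu2)   # normalized moments
lam1, lam2, px = two_phase_coxian(mu1, m2, m3)
assert isclose(lam1, 3) and isclose(lam2, 1) and isclose(px, 0.5)
```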
(ii) If G ∈ U^+ ∪ M^+, Corollary 1 specifies the number, n, of phases needed:

n = min{ k | m_2^G > k/(k − 1) } = ⌊ m_2^G / (m_2^G − 1) ⌋ + 1,    (12)

(i.e., N = ⌊ m_2^G / (m_2^G − 1) ⌋ − 1). Next, we find the two-phase Coxian+ distribution X ∈ U_0 ∪ M_0 such that G is well-represented by Z, where Z(·) = Y^{(n−2)*}(·) ∗ X(·), Y is an exponential distribution satisfying (1), Y^{(n−2)*} is the (n − 2)-th convolution of Y, and Y^{(n−2)*} ∗ X is the convolution of Y^{(n−2)*} and X. To shed light on this expression, consider i.i.d. random variables V_1, ..., V_k (k = n − 2) whose distribution is Y, and a random variable V_{k+1} whose distribution is X. Then the random variable Σ_{t=1}^{k+1} V_t has distribution Z. By Theorem 1, this can be achieved by setting

m_2^X = ((n−3)m_2^G − (n−2)) / ((n−2)m_2^G − (n−1));
µ_1^X = µ_1^G / ((n−2)m_2^X − (n−3));
m_3^X = (β m_3^G − α) / m_2^X;    (13)

where

α = (n−2)(m_2^X − 1) [ n(n−1)(m_2^X)² − n(2n−5)m_2^X + (n−1)(n−3) ],
β = [ (n−1)m_2^X − (n−2) ] [ (n−2)m_2^X − (n−3) ]².
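The prescription (12)-(13) can be checked numerically. In the following Python sketch (our illustration), the mean of each exponential stage Y is taken to be (m_2^X − 1)µ_1^X, matching λ_Y in the closed-form solution; convolving n − 2 such stages with X recovers the first three moments of G:

```python
from math import floor, isclose

def case_ii_parameters(mu1G, m2G, m3G):
    """Compute n and the moments of X per (12)-(13)."""
    n = floor(m2G / (m2G - 1)) + 1
    m2X = ((n - 3) * m2G - (n - 2)) / ((n - 2) * m2G - (n - 1))
    mu1X = mu1G / ((n - 2) * m2X - (n - 3))
    alpha = (n - 2) * (m2X - 1) * (n * (n - 1) * m2X**2
            - n * (2 * n - 5) * m2X + (n - 1) * (n - 3))
    beta = ((n - 1) * m2X - (n - 2)) * ((n - 2) * m2X - (n - 3))**2
    m3X = (beta * m3G - alpha) / m2X
    return n, mu1X, m2X, m3X

def convolve(a, b):
    """Raw moments (m1, m2, m3) of the sum of two independent variables."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    return (a1 + b1, a2 + 2*a1*b1 + b2, a3 + 3*a2*b1 + 3*a1*b2 + b3)

mu1G, m2G, m3G = 1.0, 1.3, 1.6           # an example with m3 = 2*m2 - 1
n, mu1X, m2X, m3X = case_ii_parameters(mu1G, m2G, m3G)
muY = (m2X - 1) * mu1X                   # exponential stage mean (see Fig. 6)
Z = (mu1X, m2X * mu1X**2, m3X * mu1X * m2X * mu1X**2)   # raw moments of X
for _ in range(n - 2):                   # append the n-2 exponential stages
    Z = convolve(Z, (muY, 2 * muY**2, 6 * muY**3))
assert n == 5
assert isclose(Z[0], mu1G)
assert isclose(Z[1] / Z[0]**2, m2G)              # second normalized moment
assert isclose(Z[2] / (Z[0] * Z[1]), m3G)        # third normalized moment
```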
Thus, we set p = 1, and the parameters (λ_{X1}, λ_{X2}, p_X) of X are given by case (i), using the first moment and the normalized moments of X specified by (13).

(iii) If G ∈ L, then let

p = 1 / (2m_2^G − m_3^G),   m_2^W = p m_2^G,   m_3^W = p m_3^G,   and   µ_1^W = µ_1^G / p.    (14)

G is then well-represented by distribution Z, where Z(·) = W(·)p + 1 − p. To shed light on this expression, consider a random variable V_1 whose distribution is W,
A Closed-Form Solution for Mapping General Distributions
211
where W is an EC distribution whose first moment and normalized moments are specified by (14). Then

V_2 = { V_1 with probability p;  0 with probability 1 − p }

has distribution Z, since Pr(V_2 < t) = p Pr(V_1 < t) + (1 − p). Observe that p satisfies 0 ≤ p < 1 and W satisfies W ∈ M. If W ∈ M_0, the parameters of W are provided by case (i), using the normalized moments specified by (14). If W ∈ M^+, the parameters of W are provided by case (ii), using the normalized moments specified by (14). Figure 5 shows a graphical representation of the simple solution.
Fig. 5. A graphical representation of the simple solution. Let G be the input distribution. (i) If G ∈ U_0 ∪ M_0, G is well-represented by a two-phase Coxian+ distribution X. (ii) If G ∈ U^+ ∪ M^+, G is well-represented by A^N(X), where X is a two-phase Coxian+ distribution. (iii) If G ∈ L, G is well-represented by Z, where Z is W = A^N(X) with probability p and 0 with probability 1 − p, and X is a two-phase Coxian+ distribution.
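A small Python check (our illustration) confirms two facts about the case (iii) transform (14): it preserves all three moments of G, and it always places W on the line m_3 = 2m_2 − 1, i.e., W ∈ M:

```python
from math import isclose

def case_iii_parameters(mu1G, m2G, m3G):
    """Mass at zero per (14): Z = W with prob. p, 0 with prob. 1-p."""
    p = 1 / (2 * m2G - m3G)
    return p, mu1G / p, p * m2G, p * m3G   # p, mu1W, m2W, m3W

mu1G, m2G, m3G = 1.0, 2.0, 2.5           # an example with m3 < 2*m2 - 1 (set L)
p, mu1W, m2W, m3W = case_iii_parameters(mu1G, m2G, m3G)
assert 0 < p < 1
assert isclose(m3W, 2 * m2W - 1)         # W always lands in M

# Raw moments of Z = (W w.p. p, 0 w.p. 1-p) are p times the raw moments
# of W, so the normalized moments of G are recovered exactly:
mu1Z = p * mu1W
mu2Z = p * (m2W * mu1W**2)
mu3Z = p * (m3W * mu1W * m2W * mu1W**2)
assert isclose(mu1Z, mu1G)
assert isclose(mu2Z / mu1Z**2, m2G)
assert isclose(mu3Z / (mu1Z * mu2Z), m3G)
```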
Analyzing the Number of Phases Required. The proof of Theorem 2 relies on the following theorem:

Theorem 3. [18] Let S^(n) denote the set of distributions that are well-represented by an n-phase acyclic PH distribution. Let SV^(n) and E^(n) be the sets defined by

SV^(n) = { F | m_2^F > (n+1)/n  and  m_3^F ≥ ((n+3)/(n+2)) m_2^F },
E^(n)  = { F | m_2^F = (n+1)/n  and  m_3^F = (n+2)/n }

for integers n ≥ 2. Then S^(n) ⊂ SV^(n) ∪ E^(n) for integers n ≥ 2.

Proof (Theorem 2). We will show that (i) if a distribution G is in SV^(l) ∩ (U ∪ M), then at most l + 1 phases are used, and (ii) if a distribution G is in SV^(l) ∩ L, then at most l + 2 phases are used. Since S^(l) ⊂ SV^(l) ∪ E^(l) by Theorem 3,
this completes the proof. Notice that the simple solution is not defined when G ∈ E^(l).

(i) Suppose G ∈ U ∪ M. If G ∈ SV^(l), then by (12) the EC distribution provided by the simple solution has at most l + 1 phases. (ii) Suppose G ∈ L. If G ∈ SV^(l), then m_2^W = m_2^G / (2m_2^G − m_3^G) > (l+2)/(l+1). By (12), the EC distribution provided by the simple solution has at most l + 2 phases.
4 Variants of Closed-Form Solutions
In this section, we present two refinements of the simple solution (Section 3), which we refer to as the improved solution and the numerically stable solution.

4.1 An Improved Closed-Form Solution
We first describe the properties that the improved solution satisfies. We then describe the high-level ideas behind the construction of the improved solution. Figure 6 is an implementation of the improved solution. See [17] for details on how the high-level ideas described above are realized in the improved solution.

Properties of the Improved Solution. This solution is defined for all input distributions G ∈ PH_3 and uses a smaller number of phases than the simple solution. Specifically, the number of phases required in the improved solution is characterized by the following theorem:

Theorem 4. Under the improved solution, the number of phases needed to well-represent any distribution G by an EC distribution is at most OPT(G) + 1.

For a proof of the theorem, see [17].

High-Level Ideas. Consider an arbitrary distribution G ∈ PH_3. Our approach consists of two steps, the first of which involves constructing a baseline EC distribution, and the second of which involves reducing the number of phases in this baseline solution. If G ∈ PH_3^−, then the baseline solution used is simply given by the simple solution (Section 3). If G ∉ PH_3^−, then to obtain the baseline EC distribution we first find a distribution W ∈ PH_3^− such that

m_2^W < m_2^G   and   m_3^W / m_2^W = m_3^G / m_2^G,

and then set p such that G is well-represented by distribution Z, where Z(·) = W(·)p + 1 − p. (See Section 3 for an explanation of Z.) The parameters of the EC distribution that well-represents W are then obtained by the simple solution (Section 3).

Next, we describe an idea to reduce the number of phases used in the baseline EC distribution. The simple solution (Section 3) is based on the fact that a distribution X is well-represented by a two-phase Coxian distribution when X ∈ U_0 ∪ M_0. In fact, a wider range of distributions is well-represented by the set of two-phase Coxian distributions. In particular, if

X ∈ { F | 3/2 ≤ m_2^F ≤ 2  and  m_3^F = 2m_2^F − 1 },
then X is well-represented by a two-phase Coxian distribution. In fact, the above solution can be improved upon yet further; however, for readability, we postpone this to [17].⁵

(n, p, λ_Y, λ_{X1}, λ_{X2}, p_X) = Improved(µ_1^G, µ_2^G, µ_3^G)
Input: the first three moments of a distribution G: µ_1^G, µ_2^G, and µ_3^G.
Output: parameters of the EC distribution, (n, p, λ_Y, λ_{X1}, λ_{X2}, p_X).

1. m_2^G = µ_2^G / (µ_1^G)²;  m_3^G = µ_3^G / (µ_1^G µ_2^G).
2. p = ((m_2^G)² + 2m_2^G − 1) / (2(m_2^G)²)  if m_3^G > 2m_2^G − 1 and 1/(m_2^G − 1) is an integer;
   p = 1 / (2m_2^G − m_3^G)  if m_3^G < 2m_2^G − 1;
   p = 1  otherwise.
3. µ_1^W = µ_1^G / p;  m_2^W = p m_2^G;  m_3^W = p m_3^G.
4. n = ⌊ m_2^W / (m_2^W − 1) ⌋  if m_3^W = 2m_2^W − 1 and m_2^W ≤ 2;
   n = ⌊ m_2^W / (m_2^W − 1) ⌋ + 1  otherwise.
5. m_2^X = ((n−3)m_2^W − (n−2)) / ((n−2)m_2^W − (n−1));  µ_1^X = µ_1^W / ((n−2)m_2^X − (n−3)).
6. α = (n−2)(m_2^X − 1) [ n(n−1)(m_2^X)² − n(2n−5)m_2^X + (n−1)(n−3) ].
7. β = [ (n−1)m_2^X − (n−2) ] [ (n−2)m_2^X − (n−3) ]².
8. m_3^X = (β m_3^W − α) / m_2^X.
9. u = 1 and v = 0  if 3m_2^X = 2m_3^X;  otherwise
   u = (6 − 2m_3^X) / (3m_2^X − 2m_3^X)  and  v = (12 − 6m_2^X) / (m_2^X (3m_2^X − 2m_3^X)).
10. λ_{X1} = (u + √(u² − 4v)) / (2µ_1^X);  λ_{X2} = (u − √(u² − 4v)) / (2µ_1^X);
    p_X = λ_{X2} µ_1^X (λ_{X1} µ_1^X − 1) / (λ_{X1} µ_1^X);  λ_Y = 1 / ((m_2^X − 1) µ_1^X).

Fig. 6. An implementation of the improved closed-form solution.
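The steps of Figure 6 can be transcribed directly into runnable code. The following Python sketch is our illustration (not the authors' code): the exact-equality tests in steps 2, 4, and 9 are replaced by floating-point tolerance checks, and the moment check at the end confirms that the returned EC distribution matches the input:

```python
from math import floor, sqrt, isclose

def improved(mu1G, mu2G, mu3G):
    """Sketch of the improved closed-form solution (Figure 6).
    Returns the EC parameters (n, p, lamY, lamX1, lamX2, pX)."""
    m2G = mu2G / mu1G**2                       # step 1: normalized moments
    m3G = mu3G / (mu1G * mu2G)
    # Step 2: mass probability p (isclose guards the borderline cases).
    inv = 1 / (m2G - 1)
    on_line = isclose(m3G, 2 * m2G - 1)
    if not on_line and m3G > 2 * m2G - 1 and isclose(inv, round(inv)):
        p = (m2G**2 + 2 * m2G - 1) / (2 * m2G**2)
    elif not on_line and m3G < 2 * m2G - 1:
        p = 1 / (2 * m2G - m3G)
    else:
        p = 1.0
    mu1W, m2W, m3W = mu1G / p, p * m2G, p * m3G            # step 3
    n = floor(m2W / (m2W - 1))                             # step 4
    if not (isclose(m3W, 2 * m2W - 1) and m2W <= 2):
        n += 1
    # Steps 5-8: first moment and normalized moments of the Coxian part X.
    m2X = ((n - 3) * m2W - (n - 2)) / ((n - 2) * m2W - (n - 1))
    mu1X = mu1W / ((n - 2) * m2X - (n - 3))
    alpha = (n - 2) * (m2X - 1) * (n * (n - 1) * m2X**2
            - n * (2 * n - 5) * m2X + (n - 1) * (n - 3))
    beta = ((n - 1) * m2X - (n - 2)) * ((n - 2) * m2X - (n - 3))**2
    m3X = (beta * m3W - alpha) / m2X
    # Steps 9-10: rates and branching probability.
    if isclose(3 * m2X, 2 * m3X):
        u, v = 1.0, 0.0
    else:
        u = (6 - 2 * m3X) / (3 * m2X - 2 * m3X)
        v = (12 - 6 * m2X) / (m2X * (3 * m2X - 2 * m3X))
    d = sqrt(u * u - 4 * v)
    lamX1 = (u + d) / (2 * mu1X)
    lamX2 = (u - d) / (2 * mu1X)
    pX = lamX2 * mu1X * (lamX1 * mu1X - 1) / (lamX1 * mu1X)
    lamY = 1 / ((m2X - 1) * mu1X)
    return n, p, lamY, lamX1, lamX2, pX

def ec_moments(n, p, lamY, lamX1, lamX2, pX):
    """First three raw moments of the resulting EC distribution."""
    def conv(a, b):   # raw moments of a sum of independent variables
        return (a[0] + b[0],
                a[1] + 2 * a[0] * b[0] + b[1],
                a[2] + 3 * a[1] * b[0] + 3 * a[0] * b[1] + b[2])
    x = (1/lamX1 + pX/lamX2,
         2/lamX1**2 + 2*pX/(lamX1*lamX2) + 2*pX/lamX2**2,
         6/lamX1**3 + 6*pX*(1/(lamX1**2*lamX2) + 1/(lamX1*lamX2**2) + 1/lamX2**3))
    for _ in range(n - 2):                     # the n-2 Erlang stages
        x = conv(x, (1/lamY, 2/lamY**2, 6/lamY**3))
    return p * x[0], p * x[1], p * x[2]        # mass 1-p at zero scales by p

# G with mu1 = 1, m2 = 1.3, m3 = 1.6: four phases suffice (vs. five via (12)).
params = improved(1.0, 1.3, 2.08)
assert params[0] == 4
mu1, mu2, mu3 = ec_moments(*params)
assert isclose(mu1, 1.0) and isclose(mu2, 1.3) and isclose(mu3, 2.08)
```

Note how sensitive the branches in steps 2, 4, and 9 are to rounding in the computed normalized moments; this is precisely the motivation for the numerically stable variant of Section 4.2.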
4.2 A Numerically Stable Closed-Form Solution

The improved solution (Section 4.1) is not numerically stable when G ∈ U and m_2^G is close to (l+1)/l for integers l ≥ 1, i.e., on the borders between the U_i's. In this section, we present a numerically stable solution. We first describe the properties that the numerically stable solution satisfies. We then describe the high-level ideas behind the construction of the numerically stable solution. Figure 7 is an implementation of the numerically stable solution. See [17] for details on how the high-level ideas described above are realized in the numerically stable solution.

Properties of the Numerically Stable Solution. The numerically stable solution uses at most one more phase than the improved solution and is defined

⁵ While this further improvement reduces the number of necessary phases by one for many distributions, it does not improve the worst-case performance.
(n, p, λ_Y, λ_{X1}, λ_{X2}, p_X) = Stable(µ_1^G, µ_2^G, µ_3^G)
If m_3^G ≤ 2m_2^G − 1, use Improved. Otherwise, replace steps 2-4 of Improved as follows:

2. n = ⌈ (3m_2^G − 2 + √((m_2^G)² − 2m_2^G + 2)) / (2(m_2^G − 1)) ⌉.
3. p = ( n/(n−1) + (n−1)/(n−2) ) / (2m_2^G).
4. µ_1^W = µ_1^G / p;  m_2^W = p m_2^G;  m_3^W = p m_3^G.

Fig. 7. An implementation of the numerically stable closed-form solution.
for all the input distributions in PH_3. Specifically, the number of phases required in the numerically stable solution is characterized by the following theorem:

Theorem 5. Under the numerically stable solution, the number of phases needed to well-represent any distribution G by an EC distribution is at most OPT(G) + 2.

A proof of Theorem 5 is given in [17]. The EC distribution, Z, that is provided by the numerically stable solution is numerically stable in the following sense:

Proposition 1. Let Z be the EC distribution provided by the numerically stable solution, where the input distribution G is well-represented by Z. Let (n, p, λ_Y, λ_{X1}, λ_{X2}, p_X) be the parameters of Z. Suppose that each parameter p, λ_Y, λ_{X1}, λ_{X2}, and p_X has an error ∆p, ∆λ_Y, ∆λ_{X1}, ∆λ_{X2}, and ∆p_X, respectively, in absolute value. Let ∆µ_1^Z = |µ_1^Z − µ_1^G| be the error of the first moment of Z and let ∆m_i^Z = |m_i^Z − m_i^G| be the error of the i-th normalized moment of Z for i = 2, 3. If ∆p/p, ∆λ_Y/λ_Y, ∆λ_{X1}/λ_{X1}, ∆λ_{X2}/λ_{X2}, and ∆p_X/p_X < ε = 10⁻⁵ (respectively, ε = 10⁻⁹), then ∆µ_1^Z/µ_1^Z < 0.01 and ∆m_i^Z/m_i^Z < 0.01 for i = 2, 3, provided that the normalized moments of G satisfy the condition in Figure 8(a) (respectively, (b)).

In Proposition 1, ε was chosen to be 10⁻⁵ and 10⁻⁹, respectively; these correspond to the precisions of the float (six decimal digits) and double (ten decimal digits) data types in C. In Figure 8(b), it is impossible to distinguish the set of all non-negative distributions from the set of distributions for which the stability guarantee of Proposition 1 holds. Closed-form formulas for the curves in Figure 8 and a proof of Proposition 1 are given in [17].

High-Level Ideas. Achieving numerical stability is based on the same idea as treating input distributions that are not in PH_3^−. Namely, we first find an EC distribution W such that m_3^W/m_2^W = m_3^G/m_2^G and m_2^W < m_2^G, so that the solution is numerically stable for W, and then set p such that G is well-represented by Z(·) = W(·)p + 1 − p. (See Section 3 for an explanation of Z.)
[Figure 8: two panels, "Stability region (err = 10⁻⁵)" (a) and "Stability region (err = 10⁻⁹)" (b), each plotting m₃ against m₂ over the range 1 to 4.]
Fig. 8. If the normalized moments of G lie between the two solid lines, then the normalized moments of the EC distribution Z, provided by the numerically stable solution, are insensitive to the small change (ε = 10⁻⁵ for (a) and ε = 10⁻⁹ for (b)) in the parameters of Z. The dotted lines delineate the set of all nonnegative distributions (m_3^G ≥ m_2^G ≥ 1).
5 Conclusion
In this paper, we propose a closed-form solution for the parameters of a PH distribution, P, that well-represents a given distribution G. Our solution is the first that achieves all of the following goals: (i) the first three moments of G and P agree, (ii) any distribution G that is well-represented by a PH distribution (i.e., G ∈ PH_3) can be well-represented by P, (iii) the number of phases used in P is at most OPT(G) + c, where c is a small constant, and (iv) the solution is expressed in closed form. Also, the numerical stability of the solution is discussed. The key idea is the definition and use of EC distributions, a subset of PH distributions. The set of EC distributions is defined so that it includes minimal PH distributions, in the sense that for any distribution, G, that is well-represented by an n-phase acyclic PH distribution, there exists an EC distribution, E, with at most n + 1 phases such that G is well-represented by E. This property of the set of EC distributions is the key to achieving the above goals (i), (ii), and (iii). Also, the EC distribution is defined so that it has a small number (six) of free parameters. This property of the EC distribution is the key to achieving the above goal (iv). The same ideas are applied to further reduce the degrees of freedom of the EC distribution. That is, we constrain one of the six parameters of the EC distribution without excluding minimal PH distributions from the set of EC distributions. We provide a complete characterization of the EC distribution with respect to the normalized moments; the characterization is enabled by the simple definition of the EC distribution. The analysis is an elegant induction based on the recursive definition of the EC distribution; the inductive analysis is enabled by a solution
to a nontrivial recursive formula. Based on the characterization, we provide three variants of closed-form solutions for the parameters of the EC distribution that well-represents any input distribution, G, that can be well-represented by a PH distribution (G ∈ PH3 ). One take-home lesson from this paper is that the moment-matching problem is better solved with respect to the above four goals by sewing together two or more types of distributions, so that one can gain the best properties of both. The EC distribution sews the two-phase Coxian distribution and the Erlang distribution. The point is that these two distributions provide several different and complementary desirable properties. Future work includes assessing the minimality of our solution with respect to general (cyclic) PH distributions. If our solution is not close to minimal, then finding a minimal cyclic PH distribution that well-represents any given distribution G is also important. While acyclic PH distributions are well characterized in [18], the minimum number of phases required for a general (cyclic) PH distribution to well-represent a given distribution is not known. Acknowledgement. We would like to thank Miklos Telek for his help in improving the presentation and quality of this paper.
References

1. D. Aldous and L. Shepp. The least variable phase type distribution is Erlang. Communications in Statistics – Stochastic Models, 3:467–473, 1987.
2. T. Altiok. On the phase-type approximations of general distributions. IIE Transactions, 17:110–116, 1985.
3. A. Feldmann and W. Whitt. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Performance Evaluation, 32:245–279, 1998.
4. H. Franke, J. Jann, J. Moreira, P. Pattnaik, and M. Jette. An evaluation of parallel job scheduling for ASCI Blue-Pacific. In Proceedings of Supercomputing '99, pages 679–691, November 1999.
5. M. Harchol-Balter, C. Li, T. Osogami, A. Scheller-Wolf, and M. S. Squillante. Analysis of task assignment with cycle stealing under central queue. In Proceedings of ICDCS '03, pages 628–637, May 2003.
6. A. Horváth and M. Telek. Approximating heavy tailed behavior with phase type distributions. In Advances in Matrix-Analytic Methods for Stochastic Models, pages 191–214. Notable Publications, July 2000.
7. A. Horváth and M. Telek. PhFit: A general phase-type fitting tool. In Proceedings of Performance TOOLS 2002, pages 82–91, April 2002.
8. M. A. Johnson. Selecting parameters of phase distributions: Combining nonlinear programming, heuristics, and Erlang distributions. ORSA Journal on Computing, 5:69–83, 1993.
9. M. A. Johnson and M. F. Taaffe. An investigation of phase-distribution moment-matching algorithms for use in queueing models. Queueing Systems, 8:129–147, 1991.
10. M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Mixtures of Erlang distributions of common order. Communications in Statistics – Stochastic Models, 5:711–743, 1989.
11. M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Density function shapes. Communications in Statistics – Stochastic Models, 6:283–306, 1990.
12. M. A. Johnson and M. R. Taaffe. Matching moments to phase distributions: Nonlinear programming approaches. Communications in Statistics – Stochastic Models, 6:259–281, 1990.
13. S. Karlin and W. Studden. Tchebycheff Systems: With Applications in Analysis and Statistics. John Wiley and Sons, 1966.
14. R. E. A. Khayari, R. Sadre, and B. Haverkort. Fitting world-wide web request traces with the EM-algorithm. Performance Evaluation, 52:175–191, 2003.
15. R. Marie. Calculating equilibrium probabilities for λ(n)/Ck/1/N queues. In Proceedings of Performance 1980, pages 117–125, 1980.
16. M. F. Neuts. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. The Johns Hopkins University Press, 1981.
17. T. Osogami and M. Harchol-Balter. A closed-form solution for mapping general distributions to minimal PH distributions. Technical Report CMU-CS-03-114, School of Computer Science, Carnegie Mellon University, 2003.
18. T. Osogami and M. Harchol-Balter. Necessary and sufficient conditions for representing general distributions by Coxians. In Proceedings of TOOLS '03, September 2003.
19. T. Osogami, M. Harchol-Balter, and A. Scheller-Wolf. Analysis of cycle stealing with switching cost. In Proceedings of Sigmetrics '03, pages 184–195, June 2003.
20. A. Riska, V. Diev, and E. Smirni. Efficient fitting of long-tailed data sets into PH distributions. Performance Evaluation, 2003 (to appear).
21. C. Sauer and K. Chandy. Approximate analysis of central server models. IBM Journal of Research and Development, 19:301–313, 1975.
22. L. Schmickler. MEDA: Mixed Erlang distributions as phase-type representations of empirical distribution functions. Communications in Statistics – Stochastic Models, 8:131–156, 1992.
23. M. Squillante. Matrix-analytic methods in stochastic parallel-server scheduling models. In Advances in Matrix-Analytic Methods for Stochastic Models. Notable Publications, July 1998.
24. D. Starobinski and M. Sidi. Modeling and analysis of power-tail distributions via classical teletraffic methods. Queueing Systems, 36:243–267, 2000.
25. M. Telek and A. Heindl. Matching moments for acyclic discrete and continuous phase-type distributions of second order. International Journal of Simulation, 3:47–57, 2003.
26. W. Whitt. Approximating a point process by a renewal process: Two basic methods. Operations Research, 30:125–147, 1982.
27. Y. Zhang, H. Franke, J. Moreira, and A. Sivasubramaniam. An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration. IEEE Transactions on Parallel and Distributed Systems, 14:236–247, 2003.
An EM-Algorithm for MAP Fitting from Real Traffic Data

Peter Buchholz
Fakultät für Informatik, TU Dresden, D-01062 Dresden, Germany
[email protected]
Abstract. For the model-based analysis of computer and telecommunication systems, an appropriate representation of arrival and service processes is very important. Especially useful are representations that can be employed in analytical or numerical solution approaches, like phase-type (PH) distributions or Markovian arrival processes (MAPs). This paper presents an algorithm to fit the parameters of a MAP to measured data. The proposed algorithm is of the expectation-maximization (EM) type and extends known approaches for the parameter fitting of PH-distributions and hidden Markov chains. It is shown that the algorithm generates MAPs which approximate traces very well and, in particular, capture the autocorrelation in the trace. Furthermore, the approach can be combined with other, more efficient but less accurate fitting techniques by computing initial MAPs with those techniques and improving them with the approach presented in this paper.
Keywords: Markovian Arrival Process, EM-Algorithm, Data Fitting
1 Introduction
Traffic measurements show that real network traffic exhibits high variability and a non-negligible autocovariance over large time lags. Different studies observed properties like self-similarity, fractality, and long-range dependency combined with heavy-tailed distributions [11,13]. The adequate capturing of these properties is one of the key aspects in modeling telecommunication systems, since the approximation of real traffic by a Poisson process can result in a significant underestimation of response times and blocking probabilities. Consequently, a large number of traffic models have been developed in the last decade. Of particular interest are Markovian models for traffic description, since they can be used in analytical models and are easy to integrate into simulation models. Although Markovian models cannot describe the non-exponential asymptotic behavior that has sometimes been observed, it seems that they can approximate this behavior arbitrarily well over huge time scales and are therefore often an adequate model for the description of interarrival or service times. Furthermore, Markovian arrival processes (MAPs) can be used to describe and model correlated data streams, which often appear in practice. The major problem in the practical use
This research is partially supported by DFG, SFB 358
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 218–236, 2003. c Springer-Verlag Berlin Heidelberg 2003
of phase-type distributions and MAPs is the appropriate parameterization of the distributions to match some measured traffic stream. Unfortunately, the parameterization of general phase-type distributions results in a nonlinear optimization problem, such that usually only a few traffic characteristics are approximated by the phase-type distribution. In particular, the autocorrelation structure that is present in many real traffic streams is not considered in phase-type approximations. In this paper we present a new heuristic approach to fit the parameters of a MAP to some measured data stream. The proposed approach uses an EM-algorithm and is an extension of an approach that has been published recently for the fitting of parameters of hidden Markov chains [19]. The paper is structured as follows: In the next section, basic definitions and notations are introduced; MAPs and their analysis are considered; and a brief overview of basic fitting approaches for phase-type distributions is given. Afterwards, in Section 3, the new fitting procedure is introduced and its effort is analyzed. In Section 4, the method is validated with different examples.
2 Basic Definitions and Notations

2.1 Basic Notation
Vectors and matrices are denoted by bold-faced small and capital letters, respectively. Elements are referenced using brackets. Vectors are row vectors unless explicitly stated otherwise. a^T and A^T are the transposed vector a and matrix A, respectively. I is the identity matrix, e is the vector with all elements equal to 1, and e_x is the vector with 1 in position x and 0 elsewhere. Sets are denoted by calligraphic letters, except for the sets of (non-negative) real numbers R (R+) and natural numbers N. |S| denotes the number of elements in set S.
2.2 Data Sets and Their Characteristics
We consider the fitting of traces resulting from measured interarrival or service times. A trace T is defined as a sequence of m times t_i > 0 (i = 1, ..., m). The i-th moment of the trace and the variance are estimated as

  μ_i = (1/m) Σ_{j=1}^{m} (t_j)^i   and   σ² = (1/(m−1)) Σ_{j=1}^{m} (t_j − μ_1)².

The autocorrelation of lag k is estimated from

  ρ_k = (1/((m−k−1)σ²)) Σ_{j=1}^{m−k} (t_j − μ_1)(t_{j+k} − μ_1).

The distribution function of a trace is a step function with m steps whose values are defined as

  F_T(t) = (1/m) Σ_{j=1}^{m} δ(t_j ≤ t),

where δ(b) for a boolean expression b equals 1 if the expression is true and 0 otherwise.
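To make the estimators concrete, the following sketch (our own illustration; the function names are not from the paper) computes μ_i, σ², ρ_k, and the empirical distribution function of a trace:

```python
def trace_moment(trace, i):
    # i-th empirical moment: mu_i = (1/m) * sum_j (t_j)^i
    return sum(t ** i for t in trace) / len(trace)

def trace_variance(trace):
    # unbiased variance: (1/(m-1)) * sum_j (t_j - mu_1)^2
    m = len(trace)
    mu1 = trace_moment(trace, 1)
    return sum((t - mu1) ** 2 for t in trace) / (m - 1)

def trace_autocorrelation(trace, k):
    # lag-k autocorrelation:
    # (1/((m-k-1)*sigma^2)) * sum_{j=1}^{m-k} (t_j - mu_1)(t_{j+k} - mu_1)
    m = len(trace)
    mu1 = trace_moment(trace, 1)
    var = trace_variance(trace)
    s = sum((trace[j] - mu1) * (trace[j + k] - mu1) for j in range(m - k))
    return s / ((m - k - 1) * var)

def trace_cdf(trace, t):
    # empirical distribution function: fraction of samples <= t
    return sum(1 for tj in trace if tj <= t) / len(trace)
```

These estimators are exactly the quantities matched (or compared) by the fitting procedures discussed later.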
2.3 Markovian Arrival Processes (MAPs)
Let Z(t) be an irreducible Markov chain with generator matrix Q and state space S = {0, ..., n−1}. Furthermore, let D0 and D1 be two matrices such that D1 is non-negative, Q(x,y) − D1(x,y) ≥ 0 for x ≠ y, and D0 = Q − D1. D0 is the generator matrix of an absorbing Markov chain. Matrix D0 includes as off-diagonal elements the transition rates of the Markov chain that do not correspond to an arrival, and in the main diagonal the negative sum of all transition rates out of a state is collected. Matrix D1 contains the transition rates that are accompanied by an arrival. Consequently, we define a MAP as M = (D0, D1). The class of MAPs contains as subclasses all the different distributions of phase type, like hyperexponential, Erlang, Cox, and PH-distributions, and it also contains MMPPs, which are MAPs where D1 is a diagonal matrix.

The stationary distribution of a MAP is defined as the solution of πQ = 0 with πe^T = 1.0. From the stationary distribution, the distribution immediately after an arrival can be computed as π̃ = πD1/(πD1 e^T). The column vector m^(i) of the i-th conditional moments is given as the solution of the set of equations

  D0 m^(i) = −i · m^(i−1)   with m^(0) = e^T,

i.e., if the MAP is in state x, then m^(i)(x) is the i-th moment of the time to the next arrival. The absolute moments are then given by E[T^(i)] = π̃ m^(i). The autocorrelation of order or lag k (k ≥ 1) is computed from [12]

  E[ρ_k] = ( E[T^(1)] π ((−D0)^{−1} D1)^k (m^(1) − e^T E[T^(1)]) ) / ( E[T^(2)] − (E[T^(1)])² ).

The value of the complementary distribution function at time t ≥ 0 results from F̄_T(t) = π̃ D0[t] e^T with D0[t] = exp(D0 t), which is usually computed using the randomization technique [18] with the relation

  D0[t] = Σ_{k=0}^{∞} β(k, αt) (P0)^k,

where P0 = D0/α + I, P1 = D1/α, α ≥ max_{x∈S} |D0(x,x)|, and β(k, αt) = exp(−αt)(αt)^k/k! is the probability of k jumps of a Poisson process with rate α in the interval [0, t). For a practical implementation, the infinite sum is truncated from the left by starting the summation at some value l ≥ 0 and from the right by ending the summation at some value r < ∞, such that the Poisson probabilities are computed up to machine precision [5]. The number of required iterations is in O(αt) for large values of αt. Thus, in randomization, probabilities are computed from a discrete-time Markov chain and a Poisson process. This feature will be exploited for the development of an EM-algorithm for MAP-parameter fitting.
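For small MAPs, the above quantities can be computed with a few linear solves. The sketch below is our own illustration (function names are assumptions, not the paper's code); it implements the stationary vector, the conditional moment equations, and the lag-k autocorrelation formula:

```python
import numpy as np

def stationary(Q):
    # solve pi Q = 0 with pi e^T = 1 (Q is a generator matrix)
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def map_characteristics(D0, D1, lags=(1,)):
    # moments E[T^(1)], E[T^(2)] and lag-k autocorrelations of a MAP (D0, D1)
    n = D0.shape[0]
    e = np.ones(n)
    pi = stationary(D0 + D1)
    pit = (pi @ D1) / (pi @ D1 @ e)          # distribution right after an arrival
    m1 = np.linalg.solve(-D0, e)             # D0 m^(1) = -1 * m^(0), m^(0) = e^T
    m2 = np.linalg.solve(-D0, 2.0 * m1)      # D0 m^(2) = -2 * m^(1)
    ET1, ET2 = pit @ m1, pit @ m2
    P = np.linalg.solve(-D0, D1)             # ((-D0)^-1) D1, a stochastic matrix
    rho = {k: ET1 * (pi @ np.linalg.matrix_power(P, k) @ (m1 - e * ET1))
              / (ET2 - ET1 ** 2) for k in lags}
    return ET1, ET2, rho
```

For a Poisson process (a one-state MAP) the autocorrelations are zero and the moments are those of the exponential distribution, which gives a quick sanity check for the implementation.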
2.4 Fitting Procedures for Phase-Type Distributions and MAPs
A large number of methods exist to fit PH-distributions according to measured data [6,10]. The methods can be distinguished by whether they perform the fitting based
on the complete trace or by using some information extracted from the trace. Most known methods belong to the second class and approximate some quantities of the data by an appropriate phase-type distribution. However, if only some quantities are matched, it is not clear which quantities of a trace are important and which are not. Some methods exist that use the whole trace for parameter fitting; most of them belong to the class of EM-methods, like the approach of this paper. Examples of such approaches are [3], where hyperexponential distributions are fitted to approximate Pareto or Weibull distributions, and [8], where the former approach is extended by fitting hyperexponential distributions directly from measured data. Another EM-approach, which fits phase-type distributions by minimizing the Kullback-Leibler distance between the observed empirical distribution and a phase-type distribution, is given in [2]. In [14], an EM-algorithm for PH fitting is presented which first decomposes traces into subsets to improve efficiency. The fitting of MAPs can in principle be performed similarly to the fitting of PH-distributions. Unfortunately, MAP-fitting seems to be much more complex than PH-fitting. Nevertheless, some older methods are available to fit specific MAPs or MMPPs with few phases [4,16]. More recently, EM-algorithms for use in performance analysis have been developed. The approaches published in [20,17] describe packet losses in communication systems by DTMCs and estimate the parameters of the DTMC by an EM-algorithm. In [9,15], EM-algorithms are presented to fit a batch Markovian arrival process (BMAP) from measured data. The approach presented in [9] is similar to ours because it also uses randomization in conjunction with an EM-algorithm, but the EM-algorithm used differs slightly from our approach. Our technique is mainly influenced by [19], where an approach is described to model delays in communication systems by continuous-time hidden Markov chains.
In this approach, first the parameters of a discrete-time hidden Markov chain are estimated with an EM-algorithm, and in a second step the discrete-time chain is transformed into a continuous-time hidden Markov process. However, in contrast to [19], we directly estimate the parameters of the continuous-time process, which avoids the often crucial step of transforming a discrete-time into a continuous-time model. Furthermore, the result of our fitting procedure is a MAP and not a hidden Markov chain, which is much more natural if times have to be described.
3 EM-Fitting of MAPs

3.1 The Likelihood of an Observed Sequence
In [19], conditional state probabilities for being in a specific state after observing an initial part of the trace, and probabilities of observing the remaining part of a trace depending on the current state, are given for discrete-time hidden Markov chains, where the elements of the trace are defined by observable states. Our approach has to consider transitions instead of observable states, and we apply the approach to the matrices of the continuous-time Markov chain after using randomization. Therefore, we compute some form of joint densities or likelihoods
rather than probabilities. Let a^(i) be a row vector including in position x the likelihood of state x immediately after observing t_1, ..., t_i (i ≤ m):

  a^(i) = a^(i−1) D0[t_i] P1   with a^(0) = π̃.   (1)

If vector a^(i) is normalized to 1.0, then it includes the conditional state probabilities after the i-th observation. Similarly, we define the backward likelihood in a column vector b^(i):

  b^(i) = D0[t_i] P1 b^(i+1)   with b^(m) = e^T.   (2)

The normalized vector b^(i) includes in position b^(i)(x) the probability of observing the sequence t_i, ..., t_m when starting in state x. The quality of a MAP M for the approximation of a trace T can be measured by the value of the likelihood

  d(T, M) = α^m a^(m) e^T,   (3)

where a^(m) is computed from M as introduced above and α is the rate used for randomization. Observe that the measure is independent of α for a given MAP M. MAP M is a better approximation for trace T than MAP M′ if d(T, M) > d(T, M′). Thus, a procedure for MAP fitting has to maximize d(T, M), which, unfortunately, is a complex, highly non-linear optimization problem.
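For illustration, equations (1) and (3) can be turned into a direct likelihood computation. The following sketch is our own (helper names are hypothetical); it evaluates D0[t] by randomization and accumulates ln d(T, M), rescaling the forward vector in every step to avoid underflow:

```python
import numpy as np

def d0_matrix(P0, alpha, t, tol=1e-12):
    # D0[t] = exp(D0 t) = sum_k beta(k, alpha t) P0^k  (randomization);
    # the sum is right-truncated once the Poisson mass reaches 1 - tol
    n = P0.shape[0]
    rate = alpha * t
    beta = np.exp(-rate)
    M, term, total, k = beta * np.eye(n), np.eye(n), beta, 0
    while total < 1.0 - tol and k < 100000:
        k += 1
        beta *= rate / k
        term = term @ P0
        M += beta * term
        total += beta
    return M

def map_log_likelihood(D0, D1, trace):
    # ln d(T, M) = ln( pi~ * prod_i D0[t_i] D1 * e^T ), cf. equations (1), (3)
    n = D0.shape[0]
    e = np.ones(n)
    alpha = np.max(np.abs(np.diag(D0)))
    P0, P1 = D0 / alpha + np.eye(n), D1 / alpha
    A = np.vstack([(D0 + D1).T, np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    a = (pi @ D1) / (pi @ D1 @ e)            # a^(0) = pi~
    log_like = 0.0
    for t in trace:
        a = a @ d0_matrix(P0, alpha, t) @ P1 * alpha  # one factor alpha per step
        s = a.sum()
        log_like += np.log(s)                # track the scale in the log domain
        a = a / s
    return log_like
```

For a one-state MAP with rate λ, the result reduces to the log-density of independent exponential samples, which provides an easy correctness check.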
3.2 An EM-Algorithm for Parameter Fitting
The basic idea of our EM-algorithm is to compute the transition likelihoods between states, according to transitions with and without arrivals, and to use the normalized likelihoods as estimates for the transition probabilities in the matrices P0 and P1. According to the jumps of the Poisson process, we define a row vector

  v^(i),k = a^(i−1) (P0)^k

describing the forward likelihood after k internal transitions in the i-th interval. Similarly, we define a backward likelihood as a column vector

  w^(i),k = (P0)^k P1 b^(i+1).

From these vectors, two n × n matrices of transition likelihoods can be defined elementwise as

  X0^(i)(x,y) = Σ_{k=l_i}^{r_i} β(k, αt_i) Σ_{l=0}^{k−1} v^(i),l(x) P0(x,y) w^(i),k−l−1(y)   (4)

and

  X1^(i)(x,y) = Σ_{k=l_i}^{r_i} β(k, αt_i) v^(i),k(x) P1(x,y) b^(i+1)(y),   (5)

where l_i and r_i are the left and right truncation points for the computation of the Poisson probabilities in the interval [0, t_i). X0^(i) contains estimates for the likelihood of transitions without arrivals and X1^(i) estimates for the likelihood of transitions with arrivals. The likelihood values are collected in matrices Y0 and Y1 such that

  Y0 = Σ_{i=1}^{m} X0^(i)   and   Y1 = Σ_{i=1}^{m} X1^(i).

The normalized matrices

  Ŷ0(x,y) = Y0(x,y) / ( Σ_{z=0}^{n−1} Y0(x,z) + Σ_{z=0}^{n−1} Y1(x,z) ),
  Ŷ1(x,y) = Y1(x,y) / ( Σ_{z=0}^{n−1} Y0(x,z) + Σ_{z=0}^{n−1} Y1(x,z) )   (6)
can afterwards be used as new estimates for P0 and P1. Observe that Ŷ0 + Ŷ1 is a stochastic matrix. By combining the introduced steps, we obtain an iterative algorithm to fit the parameters of a MAP according to some trace T:

 1. Input: trace T = (t_1, ..., t_m) ;
 2. set basic rate α, e.g., α = (min_{i=1,...,m} t_i)^{−1} ;
 3. choose randomly P0 ≥ 0, P1 ≥ 0 such that P0 + P1 is stochastic ;
 4. repeat
 5.   set Y0 = 0 and Y1 = 0 ;
 6.   compute π̃ and set a^(0) = π̃ ;
 7.   for i = 1 to m−1 do compute a^(i) via (1) ;
 8.   b^(m) = e^T ;
 9.   for i = m−1 downto 1 do
10.     compute b^(i) via (2) ;
11.     compute X0^(i) via (4) and X1^(i) via (5) ;
12.     Y0 = Y0 + X0^(i) and Y1 = Y1 + X1^(i) ;
13.   P0_old = P0 and P1_old = P1 ;
14.   P0 = Ŷ0 and P1 = Ŷ1 ;
15. until max(‖P0 − P0_old‖, ‖P1 − P1_old‖) ≤ ε ;
16. D0 = α (P0 − diag(P0 e^T + P1 e^T)) and D1 = α P1 ;
17. Output: M = (D0, D1) ;
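A compact, unoptimized sketch of one iteration of this scheme may look as follows. This is our own code, not the authors' implementation; the function names are assumptions, left truncation is omitted, and the forward/backward vectors are not rescaled, so the sketch is only usable for short traces:

```python
import numpy as np

def poisson_weights(rate, tol=1e-12):
    # beta(k, rate) = exp(-rate) * rate^k / k!, right-truncated once the
    # accumulated mass reaches 1 - tol (left truncation omitted for clarity)
    beta = np.exp(-rate)
    weights, total, k = [beta], beta, 0
    while total < 1.0 - tol and k < 100000:
        k += 1
        beta *= rate / k
        weights.append(beta)
        total += beta
    return np.array(weights)

def em_step(P0, P1, trace, alpha):
    # One EM iteration: expected transition likelihoods (4), (5) -> estimates (6)
    n, m = P0.shape[0], len(trace)
    e = np.ones(n)
    # stationary vector of P0 + P1 and the distribution right after an arrival
    A = np.vstack([(P0 + P1 - np.eye(n)).T, np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    betas = [poisson_weights(alpha * t) for t in trace]

    def d0t(i):
        # D0[t_i] = sum_k beta(k, alpha t_i) P0^k (randomization)
        M, term = np.zeros((n, n)), np.eye(n)
        for bk in betas[i]:
            M += bk * term
            term = term @ P0
        return M

    a = [(pi @ P1) / (pi @ P1 @ e)]          # a^(0) = pi~
    for i in range(m - 1):
        a.append(a[-1] @ d0t(i) @ P1)        # forward step, equation (1)
    Y0, Y1 = np.zeros((n, n)), np.zeros((n, n))
    b = e.copy()                             # b^(m) = e^T
    for i in range(m - 1, -1, -1):           # backward pass
        K = len(betas[i]) - 1                # right truncation point r_i
        v, w = [a[i]], [P1 @ b]
        for _ in range(K):
            v.append(v[-1] @ P0)             # v^(i),k = a^(i-1) (P0)^k
            w.append(P0 @ w[-1])             # w^(i),k = (P0)^k P1 b^(i+1)
        X0, X1 = np.zeros((n, n)), np.zeros((n, n))
        for k in range(K + 1):
            for l in range(k):
                X0 += betas[i][k] * np.outer(v[l], w[k - 1 - l])
            X1 += betas[i][k] * np.outer(v[k], b)
        Y0 += X0 * P0                        # elementwise factors P0(x,y), P1(x,y)
        Y1 += X1 * P1
        b = d0t(i) @ P1 @ b                  # b^(i) via equation (2)
    tot = (Y0.sum(axis=1) + Y1.sum(axis=1))[:, None]
    return Y0 / tot, Y1 / tot                # normalized estimates, equation (6)
```

Iterating `em_step` until the change in (P0, P1) drops below ε, and then forming D0 and D1 as in step 16, corresponds to the complete fitting loop above.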
3.3 Improvements and Extensions
Since the problem of parameter fitting of MAPs is complicated and results in a nonlinear optimization problem, we cannot expect to find the global optimum in a single run of the algorithm. However, the approach is very flexible, since it can be started with an arbitrary MAP as input, and the quality of the approximation can be measured by computing d(T, M). Thus, it is usually a good idea to start the algorithm with different MAPs and to perform the optimization for the whole trace or a subsequence of the trace, observing the values d(T, M) for the resulting MAPs. The best MAP (or, if different MAPs with similar d-values are generated, these MAPs) can afterwards be used as input for further steps of the algorithm. Alternatively, one may start with some MAP resulting from any other method to approximate the trace. E.g., one may use one of the methods presented in [8,3,6,15,21] to fit some characteristics of the trace first and then use the algorithm to improve the MAP or PH-distribution resulting from this fitting. With the choice of the initial MAP, the number of states is also fixed. Since the number of states determines the complexity of the algorithm and the complexity of the optimization problem, a small dimension of the state space reduces the effort,
but might yield a bad approximation. It is very hard to determine a priori the appropriate dimension of the MAP, and it is not even clear that an increased number of states always yields a better approximation. This has already been noticed in [19]. Additionally, the complexity of the algorithm and also the complexity of the optimization problem depend on the number of non-zero elements in the matrices P0 and P1. If an element P0(x,y) or P1(x,y) becomes 0 during the algorithm, or is set to 0 right from the beginning, then it will remain 0, because the corresponding values X0^(i)(x,y) computed via (4) and X1^(i)(x,y) computed via (5) will remain 0. If the algorithm is started with sparse matrices, the structure is preserved or additional zero elements are introduced. Thus, the approach may as well be used to fit MMPPs instead of MAPs by starting with an MMPP. The effort of the approach and also the quality of the fitting depend, among other parameters, on the choice of α. In the presented algorithm, α is chosen equal to the inverse of the minimum value in the trace. This choice results in a MAP which matches the first moment of the trace, but may yield long run times. A smaller value of α improves the runtime, but may yield a bad approximation of the trace. However, the first moment can always be matched exactly by rescaling α. Assume that the procedure yields a MAP with matrices P0 and P1, and let π be the stationary distribution of the MAP; then the first moment equals E[T^(1)] = 1/(α π P1 e^T). If µ1 is the first moment of the trace, then setting α to α · E[T^(1)]/µ1 yields a new MAP with the same matrices P0 and P1, but different matrices D0 and D1, which matches the first moment of the trace exactly. Although the intention of the algorithm is the fitting of processes with some autocorrelation structure and not the fitting of distributions, the technique can also be used to fit distributions. In this case, a PH-distribution rather than a MAP is generated.
This means that D1 = d1 π̃, where d1 is a column vector of length n such that D0 e^T + d1 = 0. Let p1 = d1/α, such that P0 + p1 · π̃ is a stochastic matrix. Instead of P1, now p1 and π̃ have to be fitted by the algorithm. Since values from a distribution are assumed to be independent, it is not necessary to consider a sequence of times in a trace; instead, one can initialize each a^(i) with π̃ and each b^(i) with e^T and perform with these values the computation of the matrix X0^(i) and of the vectors x1^(i) and z^(i), which are used to compute new values for p1 and π̃. Values are computed via

  x1^(i)(x) = Σ_{k=l_i}^{r_i} β(k, αt_i) v^(i),k(x) p1(x),   (7)

and vector z^(i) is computed via

  z^(i) = Σ_{k=l_i}^{r_i} β(k, αt_i) w^(i),k   where w^(i),k = (P0)^k p1.   (8)

Define the weight of the i-th transition as

  ξ^(i) = ( Σ_{x=0}^{n−1} x1^(i)(x) )^{−1}.   (9)
From the weights ξ^(i), the vectors x1^(i) and z^(i), and the matrix X0^(i), which is generated as shown in (4) using the vectors w^(i),k from (8), a new vector p1 and a new matrix P0 are generated:

  P0(x,y) = ( Σ_{i=1}^{m} ξ^(i) X0^(i)(x,y) ) / ( Σ_{i=1}^{m} Σ_{z=0}^{n−1} ξ^(i) X0^(i)(x,z) + Σ_{i=1}^{m} ξ^(i) x1^(i)(x) )   (10)

and

  p1(x) = ( Σ_{i=1}^{m} ξ^(i) x1^(i)(x) ) / ( Σ_{i=1}^{m} Σ_{z=0}^{n−1} ξ^(i) X0^(i)(x,z) + Σ_{i=1}^{m} ξ^(i) x1^(i)(x) ).   (11)

The new vector π̃ is chosen in a way that the likelihood is maximized:

  π̃(x) = 1 if z^(i)(x) > z^(i)(y) for all y = 0, ..., n−1 with y ≠ x, and π̃(x) = 0 otherwise.   (12)
If z^(i) contains more than one maximal element, one of these elements is chosen. Alternatively, one may fix the vector right from the beginning of the algorithm and generate the PH-distribution according to the fixed vector.
3.4 Effort of the Fitting Procedure
In this subsection we briefly analyze the asymptotic effort per iteration and the memory requirements of the algorithm. We assume that m ≫ n, which is a realistic assumption. Since the vectors a^(i) are computed starting with i = 0 and the vectors b^(i) are computed starting with i = m, one of the two series of vectors has to be precomputed and stored, which requires memory in O(nm). We assume that the vectors a^(i) are computed first. Thus, each iteration starts with the computation of π̃ with an effort in O(n³). For the computation of the vectors a^(i), the randomization approach has to be applied, and O(t_i α) vector-matrix products have to be computed to derive a^(i) from a^(i−1). Let t_av = (1/m) Σ_{i=1}^{m} t_i; then all vectors a^(i) are computed in O(m t_av α n²). Starting with i = m, the vectors b^(i) are computed, and for each b^(i) the matrix elements X0^(i) and X1^(i) are computed. Vector b^(i) is computed from b^(i+1) with O(t_i α) iterations, such that the computation of all vectors b^(i) requires the same effort as the computation of all vectors a^(i). The computation of the matrix elements in step i requires an effort in O(t_i α n²) if l_i = 0. For large values of αt_i, the lower index for summation grows (i.e., l_i > 0) and the effort shrinks. The normalization of the matrix elements requires an effort in O(n²), which is negligible. Consequently, the overall effort for the computation of the matrix elements is in O(m t_av α n²), which is also the effort for one iteration of the complete algorithm. The vectors b^(i) can be stored in the same locations as the a^(i), and apart from the vectors only a few matrices need to be stored, such that the memory requirements of the algorithm remain in O(mn).
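The truncation points l_i and r_i that govern this effort can be determined by scanning the Poisson probabilities. A small sketch of our own (a robust implementation would work in the log domain, cf. [5]):

```python
import math

def truncation_points(rate, eps=1e-10):
    # Return (l, r) such that the Poisson(rate) mass on [l, r] is >= 1 - eps.
    # beta(k) = exp(-rate) * rate^k / k! is computed iteratively; note that
    # exp(-rate) underflows for very large rate, hence the log-domain caveat.
    beta = math.exp(-rate)
    k, cdf, l = 0, beta, None
    while cdf < 1.0 - eps / 2.0 and k < 10 ** 6:
        if l is None and cdf >= eps / 2.0:
            l = k                # mass strictly below k is already < eps/2
        k += 1
        beta *= rate / k
        cdf += beta
    return (0 if l is None else l), k
```

For small rates the left truncation point stays at 0; for large rates (long intervals αt_i) it moves to the right, which is exactly the effect that shrinks the effort mentioned above.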
4 Experimental Validation
Since the proposed approach is a heuristic which is obviously not guaranteed to find the optimal solution, it is important to validate its results by means of examples. We consider three different classes of problems for validation. First, independent data drawn from different distributions is used as input. For these examples, PH-distributions are generated using (10)-(12). Afterwards, data is drawn from MAPs, resulting in correlated samples. From these traces, MAPs are generated. In a last series of experiments, traces resulting from measurements are used as input. Afterwards, we compare the quality of the generated MAPs by analyzing the performance measures for a simple queueing network which is fed with the original and the fitted MAP resulting from the EM-algorithm. We do not present the effort which is required for fitting the different distributions, because the number of iterations is very sensitive to the initial MAP, the choice of α, and the stopping criterion. Small changes in these parameters result in different runtimes of the algorithm. However, it is interesting to note that the resulting MAP seems to be relatively robust with respect to these parameters. Often the resulting MAPs are identical up to the ordering of the states. A good strategy is to choose a small value of α for the first steps of the EM-algorithm, since this results in fewer iterations in the randomization steps. If convergence is observed, such that the value d(T, M) does not change during some iterations, α is increased for a better approximation. This approach often yields faster convergence and converges towards a MAP which is identical to the MAP resulting from the procedure with a large value of α right from the beginning. However, this observation is not always true. For the presented examples, with 10³-10⁴ samples in the trace, the runtimes of the EM-algorithm for ε = 10⁻⁶ are in the range of 5 minutes to 5 hours on a standard PC.
4.1 Independent Samples from Known Distributions
In this subsection we consider the approximation of four different distributions by phase-type distributions with varying numbers of phases. In all cases, the PH-distributions are generated starting from a randomly generated PH-distribution with a dense matrix D0 and a vector d1 where all elements are non-zero. All initial PH-distributions result from the same setting of the random number generator. The first two distributions we approximate are a lognormal distribution with mean 1.0 and standard deviation 1.5, and a uniform distribution on the interval [0.5, 1.5]. From both distributions we draw a sample of size 10⁴ and generate different PH-distributions from these samples. Table 1 contains the first three moments of the samples and the moments of the generated PH-distributions. Furthermore, ln(d(T, M)) is given as a measure of the difference between the sample and the approximating PH-distribution. Observe that a larger value of d(T, M) indicates a better approximation. In Figure 1, the densities of the traces and the corresponding PH-distributions are shown. Densities of traces are approximated by histograms with intervals of width 0.1. For the lognormal distribution, the approximations are good. By
Table 1. Moments and d-values for the lognormal and uniform distribution.

                 Lognormal distribution            Uniform distribution
              Trace      PH 2      PH 5      Trace      PH 5      PH 10     PH 20
E[T^1]       1.004e+0  1.004e+0  1.004e+0  9.960e-1  9.960e-1  9.960e-1  9.960e-1
E[T^2]       4.154e+0  3.308e+0  3.055e+0  1.076e+0  1.488e+0  1.191e+0  1.136e+0
E[T^3]       1.075e+2  2.722e+1  2.027e+1  1.240e+0  2.964e+0  1.664e+0  1.461e+0
ln(d(T,M))      --    -9.491e+3 -9.113e+3     --    -6.556e+3 -3.121e+3 -2.332e+3
inspection of the different densities, it becomes clear that a better approximation of the moments does not necessarily yield a better approximation of the distribution. For the uniform distribution, the approximations are less good, although the values of d(T, M) are relatively large for this distribution. The reason is that continuous phase-type distributions always have infinite support and can therefore not exactly approximate distributions with finite support, like the uniform distribution.
1.20
1.00
1.00
0.80 0.60
0.80 0.60
0.40
0.40
0.20
0.20
0.00 0.00
0.50
1.00 t
1.50
PH with 5 phases PH with 10phases PH with 20phases
1.40
Densities
Densities
Uniform PH with 2 phases PH with 5 phases
2.00
0.00 0.00
0.50
1.00 t
1.50
2.00
Fig. 1. Densities of the traces and the PH-approximations for the first two examples
Next we consider the fitting of two different Weibull distributions. The first has a shape parameter of 5 and a scale parameter of 1, and the second has a shape parameter of 0.66464 and a scale parameter of 0.6. The second distribution, which is also used as an example in [3], has a decreasing density, whereas the density of the first distribution is first increasing and then decreasing. For both distributions, a trace with 10⁴ elements is generated, and afterwards the EM-algorithm is applied. For the first version of the Weibull distribution, we start with randomly generated phase-type distributions with 2, 5, 10, and 20 phases. The results for this distribution are shown in the upper half of Table 2, and the resulting densities are printed on the left side of Figure 2. The PH-approximation becomes better with an increasing number of phases, which can be seen by comparing the measures d(T, M) and by comparing the resulting densities of the phase-type distributions with the empirical density of the trace. The second Weibull distribution has a decreasing density function, such that the efficient fitting methods of [3,8] can be applied. As shown in [3], the Weibull distribution can be matched very well with a hyperexponential distribution with 6 phases. This hyperexponential distribution can be transformed into an
Table 2. Moments and d-values for the Weibull distributions.

              Weibull distribution (5.0, 1.0)
              Trace      PH 2      PH 5      PH 10     PH 20
E[T^1]       9.381e-1  9.381e-1  9.381e-1  9.381e-1  9.381e-1
E[T^2]       9.261e-1  1.320e+0  1.056e+0  9.688e-1  9.437e-1
E[T^3]       9.519e-1  2.476e+0  1.387e+0  1.092e+0  1.011e+0
ln(d(T,M))      --    -5.806e+2 -1.902e+2  1.166e+1  7.534e+1

              Weibull distribution (0.66464, 0.6)
              Trace     Cox FW     Cox 1     Cox 2      PH
E[T^1]       9.498e-1  9.498e-1  9.498e-1  9.498e-1  9.498e-1
E[T^2]       3.227e+0  3.658e+0  3.260e+0  3.221e+0  3.097e+0
E[T^3]       1.920e+1  3.036e+1  2.048e+1  1.922e+1  1.684e+1
f_M(0)          --     2.651e+1  2.551e+1  1.467e+1  1.346e+1
ln(d(T,M))      --    -6.823e+2 -6.801e+2 -6.818e+2 -6.793e+2
Fig. 2. Densities of the traces for the Weibull distributions and PH-approximations (left: Weibull(5.0, 1.0), right: Weibull(0.66464, 0.6)).
equivalent Coxian distribution with 6 phases [7]¹. The corresponding Coxian distribution, where the value of α is scaled to fit the first moment of the trace exactly, is denoted as Cox FW. Distribution Cox 1 results from the use of the EM-algorithm with Cox FW as the initial distribution. Cox 2 is a Coxian distribution which is computed from a randomly generated initial Coxian distribution with 6 phases. About 10 times more iterations are required to generate Cox 2 than Cox 1. The last distribution, denoted as PH, results from a PH-distribution with 6 phases. The densities of the different distributions are shown on the right side of Figure 2. Obviously, all densities are very similar and match the trace very well. The remaining quantities of the distributions are shown in Table 2. The value of d(T, M) is largest for PH, and Cox FW yields the smallest value. However, the differences are small, which shows that the method of [3] gives excellent results for this kind of distribution.
¹ We use Coxian instead of hyperexponential distributions because they can be described more easily using the matrices D0 and D1.
4.2 Fitting of Data Generated from MAPs
If traces are generated from MAPs or MMPPs, then a significant autocorrelation often exists and cannot be neglected. The same holds for many real data sets. In this subsection, we fit one MAP and one MMPP using the proposed method, before the approach is applied to real data in the following subsection. In both examples, the MAP and the MMPP are fitted starting from a randomly generated MAP or MMPP, respectively. First we consider a MAP with 3 states and the following matrices:

       | -3.721  0.500  0.020 |         | 0.200  3.000  0.001 |
  D0 = |  0.100 -1.206  0.005 |,   D1 = | 1.000  0.100  0.001 |
       |  0.001  0.002 -0.031 |         | 0.005  0.003  0.020 |

From this MAP we generate a sample with 5·10³ elements, which is used for fitting an MMPP and a MAP of order 3. Table 3 includes the results of the fitting procedure. The d-value in the column Trace describes the d-value of the original MAP computed for the trace. It is interesting to note that the d-value of the fitted MAP is slightly larger than the d-value of the original MAP, whereas the d-value of the fitted MMPP is smaller. The autocorrelation of lag 2 of the trace overestimates the true value, whereas the lag 1 autocorrelation is underestimated.

Table 3. Moments and d-values for the 3 state MAP.

            orig. MAP    Trace      MAP 3     MMPP 3
E[T^1]      1.020e+0   1.068e+0   1.068e+0   1.068e+0
E[T^2]      2.717e+1   2.729e+1   2.594e+1   2.642e+1
E[T^3]      2.544e+3   2.288e+3   2.014e+3   2.098e+3
rho_1        0.30409    0.24200    0.35475    0.35648
rho_2        0.19824    0.26829    0.27334    0.26806
ln(d(T,M))      --     -2.850e+3  -2.835e+3  -2.924e+3
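Such samples can be drawn from a MAP by simulating the underlying Markov chain. The following sampler is our own minimal sketch, not the code used in the paper; it starts from the stationary distribution immediately after an arrival:

```python
import numpy as np

def sample_map_trace(D0, D1, m, rng):
    # Draw m interarrival times from the MAP (D0, D1)
    n = D0.shape[0]
    A = np.vstack([(D0 + D1).T, np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)       # stationary vector
    start = (pi @ D1) / (pi @ D1 @ np.ones(n))         # distribution after arrival
    state = rng.choice(n, p=start / start.sum())
    trace, t = [], 0.0
    while len(trace) < m:
        rate = -D0[state, state]                       # total event rate in state
        t += rng.exponential(1.0 / rate)
        # pick the next event: first n entries = D0 transitions (no arrival),
        # last n entries = D1 transitions (an arrival is generated)
        weights = np.concatenate([np.clip(D0[state], 0.0, None), D1[state]])
        idx = rng.choice(2 * n, p=weights / weights.sum())
        if idx >= n:
            trace.append(t)                            # record interarrival time
            t = 0.0
            state = idx - n
        else:
            state = idx                                # internal transition
    return trace
```

With enough samples, the empirical mean of such a trace approaches the theoretical first moment of the MAP, which provides a simple consistency check.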
The fitted MAP is described by the matrices

            | -0.873  0.006  0.000 |              | 0.033  0.000  0.834 |
  D0(MAP) = |  0.007 -0.037  0.001 |,   D1(MAP) = | 0.000  0.028  0.001 |
            |  0.000  0.002 -3.653 |              | 3.636  0.008  0.007 |

and the matrices of the fitted MMPP are

             | -2.826  0.000  0.645 |               | 2.181  0.000  0.000 |
  D0(MMPP) = |  0.000 -0.037  0.009 |,   D1(MMPP) = | 0.000  0.028  0.000 |
             |  1.839  0.026 -1.865 |               | 0.000  0.000  0.000 |

In both cases, only values ≥ 0.0005 are printed as non-zero; most values printed as 0 are very small and can be neglected. Figure 3 includes the densities of the original MAP, the two fitted processes, and the approximated density of the trace. Both fitted processes approximate the true density of the
Fig. 3. Densities and autocorrelations of the 3 state MAP and of the fitted processes (left: densities, right: autocorrelations of lag 1 through 20).
MAP and the approximated density of the trace very well, such that no difference between the true and fitted densities is visible in the figure. Furthermore, Figure 3 shows the autocorrelations of lag 1 through 20. The autocorrelations of the MAPs and the MMPP are decreasing functions with a very regular structure, whereas the autocorrelations of the trace are much more irregular. Furthermore, there is a significant difference between the autocorrelations of the MAP and the autocorrelations of the trace generated from this MAP. Consequently, for an accurate representation of the autocorrelation structure, a much larger sample size seems to be necessary.

As a second example, we consider the fitting of an MMPP with 6 states. The MMPP is described by the following matrices:

       | -0.1  0.1  0.0  0.0  0.0  0.0 |
       |  0.2 -1.3  0.1  0.0  0.0  0.0 |
  D0 = |  0.0  0.2 -2.3  0.1  0.0  0.0 |,   D1 = diag(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
       |  0.0  0.0  0.2 -3.3  0.1  0.0 |
       |  0.0  0.0  0.0  0.2 -4.3  0.1 |
       |  0.0  0.0  0.0  0.0  0.2 -5.2 |

For the fitting procedure, we generate a trace with 10⁴ elements from the MMPP and fit the parameters of a MAP and an MMPP, both with 6 states. The resulting MAP is characterized by the following matrices:

            | -2.13  0.00  0.00  0.12  0.00  0.08 |
            |  0.00 -0.10  0.00  0.10  0.00  0.00 |
  D0(MAP) = |  0.20  0.00 -2.89  0.00  0.00  0.00 |
            |  0.02  0.18  0.00 -1.17  0.00  0.00 |
            |  0.00  0.00  0.00  0.00 -4.99  0.13 |
            |  0.00  0.00  0.00  0.00  0.09 -3.64 |

            |  1.86  0.00  0.00  0.07  0.00  0.00 |
            |  0.00  0.00  0.00  0.00  0.00  0.00 |
  D1(MAP) = |  0.03  0.00  0.00  0.00  0.00  2.66 |
            |  0.00  0.00  0.00  0.88  0.00  0.00 |
            |  0.00  0.00  0.00  0.00  4.86  0.00 |
            |  0.00  0.00  3.55  0.00  0.00  0.00 |
An EM-Algorithm for MAP Fitting from Real Traffic Data
231
Although the resulting MAP differs from the original MMPP, one can notice that the resulting structure is similar. For the MMPP, the following matrices are computed.

D0^(MMPP) = [ -1.19   0.08   0.07   0.00   0.00   0.10
               0.12  -2.51   0.00   0.03   0.00   0.00
               0.20   0.00  -1.23   0.00   0.88   0.13
               0.00   0.10   0.00  -4.29   0.00   0.00
               0.00   0.00   0.07   0.00  -0.79   0.72
               0.14   0.00   0.26   0.00   0.33  -0.73 ]  and

D1^(MMPP) = [  1.04   0.00   0.00   0.00   0.00   0.00
               0.00   2.36   0.00   0.00   0.00   0.00
               0.00   0.00   0.02   0.00   0.00   0.00
               0.09   0.00   0.00   4.28   0.00   0.00
               0.00   0.00   0.00   0.00   0.00   0.00
               0.00   0.00   0.00   0.00   0.00   0.00 ]

Table 4. Moments and d-value for the 6 state MMPP.

             E[T^1]    E[T^2]    E[T^3]    ρ1       ρ2       ln(d(T,M))
orig. MMPP   1.105e+0  1.620e+1  5.646e+2  0.07595  0.06884  -4.839e+3
Trace        1.049e+0  1.449e+1  4.884e+2  0.08706  0.05952
MAP 6        1.049e+0  1.471e+1  5.133e+2  0.07807  0.06802  -4.824e+3
MMPP 6       1.049e+0  1.467e+1  5.086e+2  0.07755  0.06154  -4.841e+3
The difference between the fitted MMPP and the original MMPP is in some sense larger than the difference between the fitted MAP and the original MMPP, because the resulting MMPP has three states with significant arrival rates and three states with very small arrival rates. Table 4 contains the measures for the original process, the trace, and the fitted processes. Again one can see that the fitted MAP has a smaller distance to the trace than the original MMPP, whereas the distance of the fitted MMPP is slightly larger. However, the difference between the moments and first autocorrelations of the two fitted processes and the trace is smaller than the difference between these measures of the trace and the original process, which indicates that the fitting procedure reaches a reasonable accuracy. Figure 4 shows the densities and the autocorrelations for the MMPP fitting.
4.3 Fitting of Measured Sequences
For fitting real data we use the LBL-TCP-2 trace from the Internet Traffic Archive [1]. From this trace we consider the first 10^6 interarrival times. For the fitting of a MAP to this sequence we consider a subset of 10^3 consecutive interarrival times. To find an appropriate subsequence for fitting, the trace is divided into 10^3 subsequences of length 10^3. For all subsequences the first two moments and
232
P. Buchholz
[Figure: two panels titled "MMPP-fitting", showing the approximated densities (over t) and the autocorrelations of lag 1 through 20 for the original MMPP, the trace, and the fitted MAP and MMPP.]
Fig. 4. Densities and autocorrelations of the 6 state MMPP and of the fitted processes.
the autocorrelations of lag 1 and 2 are computed. Afterwards, the subsequence which is nearest to the complete trace according to those measures is chosen for fitting.

Table 5. Moments and d-value for the LBL-TCP-2 trace.
           E[T^1]    E[T^2]    E[T^3]    ρ1       ρ2       ln(d(T,M))
Trace      4.198e-3  5.274e-5  1.283e-6  0.15831  0.12172
Subtrace   4.153e-3  4.994e-5  1.044e-6  0.14931  0.12273
MMPP 3     4.153e-3  4.863e-5  9.840e-7  0.16066  0.11835  4.580e+3
MMPP 5     4.153e-3  4.862e-5  9.833e-7  0.15837  0.11578  4.580e+3
MAP 3      4.153e-3  4.968e-5  1.060e-6  0.16532  0.11547  4.581e+3
MAP 5      4.153e-3  4.673e-5  8.541e-7  0.11369  0.07226  4.597e+3
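The selection of the fitting subsequence described above can be sketched as follows (an illustrative reimplementation in Python/NumPy, not the author's code; the equal relative weighting of the four criteria is our assumption):

```python
import numpy as np

def stats(x):
    """First two moments and lag-1/lag-2 autocorrelations of a sequence."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    var = xc.var()
    acf = [float((xc[:-k] * xc[k:]).mean() / var) for k in (1, 2)]
    return np.array([x.mean(), (x ** 2).mean()] + acf)

def pick_subsequence(trace, length=1000):
    """Return the length-`length` subsequence whose moments and lag-1/2
    autocorrelations are closest (relative Euclidean distance) to those
    of the full trace."""
    target = stats(trace)
    best, best_d = None, np.inf
    for i in range(len(trace) // length):
        sub = trace[i * length:(i + 1) * length]
        d = np.linalg.norm((stats(sub) - target) / np.abs(target))
        if d < best_d:
            best, best_d = sub, d
    return best
```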
Four different processes are fitted to the subsequence, namely MAPs of order 3 and 5 and two MMPPs of order 3 and 5. Table 5 contains the moments, the first two autocorrelations and the difference measures for the complete trace, the subsequence used for fitting and the four fitted processes. The results for the two MMPPs and the small MAP are similar with respect to all quantities shown in table 5. The results for the MAP with 5 states are rather different from those of the other processes. First of all, the d-value is larger for this MAP. However, the higher order moments and also the first two autocorrelations are less good approximations of the quantities of the subtrace. Figure 5 shows the approximated densities of the traces and the densities of the computed MAPs and MMPPs. Furthermore, the figure shows the first 20 autocorrelations for the trace and the fitted processes. The densities show a significant difference between the 5 state MAP and the remaining processes. The MMPPs and the 3 state MAP all have decreasing densities, whereas the density of the 5 state MAP starts at 0 and reaches a value of 396.7 at 0.00025. This behavior is not visible in the trace if intervals of width 0.0005 are used for the approximation of the densities, as done for the representation in the figure. However, if this width is reduced to 0.00025, then the density of the subtrace is no longer decreasing, which explains the better approximation of the subtrace by the 5 state MAP,
[Figure: two panels titled "LBL-TCP-2 Trace", showing the approximated densities (over t) and the autocorrelations of lag 1 through 20 for the trace, the subtrace, and the fitted MMPPs and MAPs.]
Fig. 5. Densities/autocorrelations of the LBL trace/subtrace and the fitted processes.
even with a less good approximation of the moments or the autocorrelations. The plot of the autocorrelations of lag 1 through 20 in figure 5 shows that the subtrace is only a good representation of the original trace if autocorrelations of lag 1 and 2 are considered; for higher order autocorrelations, trace and subtrace differ significantly.
4.4 Comparison of Queueing Performance Measures
To evaluate the quality of the approximation of some process by a MAP, apart from measures directly related to the process, also performance measures of some queueing system which is fed with the process may be used. We consider the MAP and the MMPP which have been presented in Section 4.2 as input processes for a queue with finite capacity and exponential service time distribution. Thus, an MMPP/M/1/K and a MAP/M/1/K system are analyzed.
[Figure: two panels titled "MAP/M/1 Model", showing the mean population and ln(p_full) over capacities 10 through 20 for the exact process and the H2, MAP, and MMPP approximations.]
Fig. 6. Results for the MAP/M/1/K system and its approximations.
First we use the 3 state MAP as an input process for a queue with mean service time 0.5 and determine the mean population and the probability that the queue is completely filled (p_full) for capacities varying from 10 through 20. The exact results are compared with the results for a system where the original MAP is substituted by the fitted MAP or MMPP or by a hyperexponential
distribution which is derived by fitting the first two moments. Results are shown in figure 6. If the hyperexponential distribution is used as input process, then the mean population and p_full are overestimated. The MAP and MMPP approximations are much more accurate: p_full is slightly overestimated, whereas the true value for the buffer population lies between both approximations. The approximation by the MAP is slightly better than the MMPP approximation.
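The MAP/M/1/K measures can be reproduced by building the finite quasi-birth-death CTMC and solving it directly. A minimal sketch (standard construction, not the paper's tool; states are (queue level, MAP phase), arrivals at a full buffer are lost):

```python
import numpy as np

def map_m_1_k(D0, D1, mu, K):
    """Mean population and blocking probability p_full of a MAP/M/1/K queue
    with exponential(mu) service."""
    m = D0.shape[0]
    N = (K + 1) * m
    Q = np.zeros((N, N))
    for n in range(K + 1):
        s = slice(n * m, (n + 1) * m)
        Q[s, s] += D0
        if n < K:
            Q[s, (n + 1) * m:(n + 2) * m] += D1      # arrival accepted
        else:
            Q[s, s] += D1                            # arrival lost, phase may change
        if n > 0:
            Q[s, (n - 1) * m:n * m] += mu * np.eye(m)  # service completion
            Q[s, s] -= mu * np.eye(m)
    # stationary distribution: pi Q = 0, pi e = 1
    A = np.vstack([Q.T, np.ones(N)])
    b = np.zeros(N + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    levels = pi.reshape(K + 1, m).sum(axis=1)
    return float(np.arange(K + 1) @ levels), float(levels[K])
```

With a one-state MAP (a Poisson process) the function reduces to the classical M/M/1/K queue, which gives a convenient correctness check.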
[Figure: two panels titled "MMPP/M/1 Model", showing the mean population and ln(p_full) over capacities 10 through 20 for the exact process and the H2, MAP, and MMPP approximations.]
Fig. 7. Results for the MMPP/M/1/K system and its approximations.
The results for the approximation of the 6 state MMPP are shown in figure 7. They are similar to the results of the previous example.
5 Conclusions
In this paper we present an algorithm of the expectation-maximization type to fit the parameters of Markovian arrival processes to measured data. The approach is an extension of known approaches from the area of hidden Markov chains and uses the randomization technique to transform the continuous-time Markovian arrival process into a discrete-time process. The presented algorithm is applied to several example traces, and it is shown that the resulting Markovian arrival processes capture the trace behavior very well. The presented approach is very flexible because it allows one to generate Markovian arrival processes of arbitrary structure, including Markov modulated Poisson processes or, if the correlation is not relevant, phase type distributions. The limiting aspect of the algorithm, like that of most other algorithms of this type, is the relatively high computational effort. As shown, one iteration of the algorithm requires an effort proportional to that of the randomization method applied to a Markov chain of the size of the fitted process over the time intervals of the trace. The algorithm can be applied to fit traces with a few thousand entries, but not with a million entries. However, to capture heavy-tailed distributions or long-range dependencies, very long traces have to be used. Consequently, the major goal for future research is to increase the efficiency of the approach. This can be done by using better initial guesses of the process to reduce the number of iterations or by representing the relevant information of a long trace in a more condensed form.
An EM-Algorithm for MAP Fitting from Real Traffic Data
235
References

1. The Internet Traffic Archive. http://ita.ee.lbl.gov/index.html.
2. S. Asmussen, O. Nerman, and M. Olsson. Fitting phase type distributions via the EM algorithm. Scand. J. Statist., 23:419-441, 1996.
3. A. Feldmann and W. Whitt. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Performance Evaluation, 31:245-258, 1998.
4. W. Fischer and K. Meier-Hellstern. The Markov-modulated Poisson process (MMPP) cookbook. Performance Evaluation, 18:149-171, 1992.
5. B. L. Fox and P. W. Glynn. Computing Poisson probabilities. Communications of the ACM, 31(4):440-445, 1988.
6. A. Horvath and M. Telek. Markovian modeling of real data traffic: Heuristic phase type and MAP fitting of heavy tailed and fractal like samples. In M. C. Calzarossa and S. Tucci, editors, Performance 2002, volume 2459 of LNCS, pages 405-434. Springer, 2002.
7. V. B. Iversen and F. Nielsen. Some properties of Coxian distributions with applications. In N. Abu el Ata, editor, Modeling Techniques and Tools for Performance Analysis, pages 61-66. Elsevier, 1986.
8. R. El Abdouni Khayari, R. Sadre, and B. Haverkort. Fitting world-wide web request traces with the EM-algorithm. Performance Evaluation, 52:175-191, 2003.
9. A. Klemm, C. Lindemann, and M. Lohmann. Modeling IP traffic using the batch Markovian arrival process. Performance Evaluation, to appear, 2003.
10. A. Lang and J. L. Arthur. Parameter approximation for phase-type distributions. In S. R. Chakravarty and A. S. Alfa, editors, Matrix-Analytic Methods in Stochastic Models, Lecture Notes in Pure and Applied Mathematics, pages 151-206. Marcel Dekker, 1996.
11. W. E. Leland, M. Taqqu, W. Willinger, and D. V. Wilson. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on Networking, 2:1-15, 1994.
12. M. Neuts. Algorithmic Probability: A Collection of Problems. Chapman & Hall, 1995.
13. V. Paxson and S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3:226-244, 1995.
14. A. Riska, V. Diev, and E. Smirni. An EM-based technique for approximating long-tailed data sets with PH distributions. Performance Evaluation, to appear, 2003.
15. A. Riska, M. S. Squillante, S. Z. Yu, Z. Liu, and L. Zhang. Matrix-analytic analysis of a MAP/PH/1 queue fitted to web server data. In G. Latouche and P. Taylor, editors, Matrix-Analytic Methods: Theory and Applications, pages 335-356. World Scientific, 2002.
16. T. Ryden. Parameter estimation for Markov modulated Poisson processes. Stochastic Models, 10(4):795-829, 1994.
17. K. Salamatian and S. Vaton. Hidden Markov modelling for network communication channels. In Proc. ACM Sigmetrics, 2001.
18. W. J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1994.
19. W. Wei, B. Wang, and D. Towsley. Continuous-time hidden Markov models for network performance evaluation. Performance Evaluation, 49(1-4):129-146, 2002.
20. M. Yajnik, S. Moon, J. Kurose, and D. Towsley. Measurement and modelling of the temporal dependence in packet loss. In Proc. IEEE Infocom. IEEE CS-Press, 1999.
21. T. Yoshihara, S. Kasahara, and Y. Takahashi. Practical time-scale fitting of self-similar traffic with Markov-modulated Poisson process. Telecommunication Systems, 17:185-211, 2001.
The Correlation Region of Second-Order MAPs with Application to Queueing Network Decomposition

Armin Heindl^1, Ken Mitchell^2, and Appie van de Liefvoort^2

^1 Institut für Technische Informatik und Mikroelektronik, Fakultät IV, TU Berlin, D-10587 Berlin, Germany, [email protected]
^2 School of Computing and Engineering, University of Missouri – Kansas City (UMKC), Kansas City, MO 64110, USA, mitchellke|[email protected]
Abstract. Tools for performance evaluation often require techniques to match moments to continuous distributions or moments and correlation data to correlated processes. With respect to efficiency in applications, one is interested in low-dimensional (matrix) representations. For phase-type distributions (or matrix exponentials) of second order, analytic bounds could be derived, which specify the space of feasible moments. In this paper, we add a correlation parameter to the first three moments of the marginal distribution to construct a Markovian arrival process of second order (MAP(2)). Exploiting the equivalence of correlated matrix-exponential sequences and MAPs in two dimensions, we present an algorithm that decides whether the correlation parameter is feasible with respect to the three moments and – if so – delivers a valid MAP(2) which matches the four parameters. We also investigate the restrictions imposed on the correlation structure by an arbitrary MAP(2). Analytic bounds for this maximal correlation region are given. When there is no need for a MAP(2) representation (as in linear algebraic queueing theory), the proposed procedure serves to check the validity of the constructed correlated matrix-exponential sequence. Numerical examples indicate how these results can be used to efficiently decompose queueing networks.
1 Introduction
Markovian arrival processes (MAPs, see e.g., [1]) are widely used in traffic engineering. Very often, low-dimensional representations of these processes are desired in stochastic modeling in order to expedite computational procedures. In this respect, the Markov-modulated Poisson process of order 2 (MMPP(2)), which simply is a MAP of order 2 with a special structure (and thus more restricted in its modeling power), represents the most striking example. MMPP(2)s
This work was supported in part by US NSF under grant ANI 0106640 and by DFG under grant HE3530/1, while Armin Heindl was with SCE at UMKC.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 237–254, 2003. c Springer-Verlag Berlin Heidelberg 2003
238
A. Heindl, K. Mitchell, and A. van de Liefvoort
also have become popular, because procedures are available to derive the four parameters of the MMPP(2) analytically from (empirical) traffic characteristics (related to moments of the counting function [2,3] or to moments of the marginal and the lag-1 covariance [4]). For higher dimensions and already for the unrestricted MAP(2), the overparameterization of these processes aggravates the solution of this generalized inverse problem. In the setting of matrix-exponential (ME) distributions and correlated sequences, so-called moment- and moment/correlation-canonical forms elegantly solve the corresponding inverse problems at almost no computational cost [5,6, 7]. While ME representations abandon the physical interpretations as observed for phase-type (PH) distributions and MAPs (but still retain a close resemblance with respect to the notation), the cited algorithms effectively deliver minimal representations for given moments of the marginal distribution (and given lag-k autocorrelations). However, when matching 2m − 1 moments and 2m − 3 lag-k autocovariances to a correlated ME sequence of finite order m (m > 1), the outcome is not guaranteed to be a stochastic process unless the given moments and autocorrelations are feasible for the considered dimension. In general, it is an open problem to analytically determine the feasibility of the input parameters. It is well-known that PH distributions and MAPs form proper subsets of ME distributions and sequences, respectively – except for orders m = 1, 2, where the corresponding classes are equivalent. In this paper, we exploit the equivalence of MAP(2)s and ME sequences of order 2 in the following way: If the moment/correlation-canonical form of the ME sequence cannot be transformed into a MAP(2), we know that the considered correlation parameter (e.g., the lag-1 autocovariance) is not feasible (provided that the first three moments of the marginal distribution are). 
In case the transformation succeeds (i.e., the correlation parameter is feasible), we obtain a MAP(2), which matches the given four parameters. In [8], such a MAP(2) is given directly for hyperexponential marginal distributions and nonnegative correlation. This paper goes beyond [8] in that it exhausts the possible combinations of moments and correlation structures for MAP(2)s. A corresponding algorithm has been implemented and is outlined in Sect. 3 after the relevant notation is introduced in Sect. 2. In Sect. 4, we demonstrate how the algorithm may serve to locate the bounds of the correlation parameter experimentally. Due to the changing (as the case arises) and quite large (considering the small dimensions of the involved processes) number of involved inequalities, determining analytic bounds in terms of the first three moments turns out to be very cumbersome and may provide only little insight due to bulky expressions. The problem simplifies when one is interested in general bounds for the correlation parameter for arbitrary MAP(2)s (i.e., for arbitrary, but fixed first two moments only). In Sect. 5, we derive analytical bounds for this correlation region. Section 6 demonstrates an application of these findings to an extremely efficient decomposition of queueing networks based on correlated traffic descriptors.
2 Notation
A matrix-exponential (ME) distribution is a continuous probability distribution whose distribution function can be written as

F(t) = 1 − p_ME e^{−B_ME t} e_ME   for t ≥ 0 ,   (1)
where p_ME, B_ME and e_ME are an m-dimensional row vector, an invertible square matrix and a column vector, respectively. A correlated (stationary) sequence of ME random variables T1, T2, ... can be defined by means of an additional m-dimensional matrix Y_ME, so that the joint probability density over any finite sequence of consecutive interevent times is given by (see [9])

f_{T1,...,Tn}(t1, ..., tn) = p_ME e^{−B_ME t1} B_ME Y_ME e^{−B_ME t2} B_ME Y_ME · · · e^{−B_ME tn} B_ME Y_ME e_ME .

The matrix-exponential distribution naturally generalizes the scalar exponential distribution to a vector process. Parameter m is called the order of the ME distribution or the ME sequence or – in case of a minimal representation – their degree. We require p_ME Y_ME = p_ME and Y_ME e_ME = e_ME to obtain invariant marginals in equilibrium, which are then given by (1). Otherwise, no structural or domain restrictions are imposed on the (real) elements of the components p_ME, B_ME, e_ME and Y_ME, except that F(t) and f_{T1,...,Tn}(t1, ..., tn) must be true distribution or density functions, respectively. For example, the freedom in selecting the parameters allows us to choose B_ME to be a matrix with positive diagonal elements and nonpositive off-diagonal elements, p_ME to be a probability vector p_PH, e_ME to be a vector of ones (e_PH) and Y_ME to be a stochastic matrix so that B_ME Y_ME is nonnegative. Then, with the definitions D0 = −B_ME and D1 = B_ME Y_ME, we obtain a MAP with the following physical interpretation: D0 is a generator matrix of a continuous-time Markov chain on m transient states and D1 is the rate matrix of transitions associated with an observable event. The distribution F(t) is now understood to be that of the time spent in the transient states of the Markov chain until absorption. Often, this phase-type analogy serves to give an intuition to the ME expressions. Performance measures can be computed in the very same manner, e.g., the nth moment of the marginal distribution is given by

E[X^n] = n! p_ME B_ME^{-n} e_ME   or   n! p_PH (−D0)^{-n} e_PH .   (2)
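Formula (2) is easy to check numerically. The sketch below (our own illustration in Python/NumPy, using an Erlang-2 distribution with rate 1 per phase, whose moments E[X] = 2, E[X^2] = 6 and E[X^3] = 24 are known in closed form) evaluates it:

```python
import math
import numpy as np

# Erlang-2 with rate 1 per phase: D0 = [[-1, 1], [0, -1]], so B = -D0
B = np.array([[1.0, -1.0],
              [0.0,  1.0]])
p = np.array([1.0, 0.0])   # start in phase 1
e = np.ones(2)

def me_moment(p, B, e, n):
    """n-th moment via (2): E[X^n] = n! p B^{-n} e."""
    return math.factorial(n) * p @ np.linalg.matrix_power(np.linalg.inv(B), n) @ e
```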
We point out that the power of ME representations lies in their purely algorithmic and global view (i.e., their abandonment of internal physical interpretations), which creates a great deal of freedom in the algebraic manipulation of these processes. For example, similarity transforms can be applied to ME distributions and sequences without affecting the scalar results. For instance, let X be any non-singular matrix of the same dimension as matrix B_ME. Then, the tuple (p_ME X^{-1}, X B_ME X^{-1}, X e_ME; X Y_ME X^{-1}) represents the same correlated ME sequence as (p_ME, B_ME, e_ME; Y_ME). This shows that ME representations are
not unique. In the phase-type domain, such similarity transformations generally destroy the (local) physical interpretability. Linear-algebraic queueing theory (LAQT), which is founded on matrix exponentials, has been covered in many excellent publications (e.g., [10,11]). In the following, we highlight some results – tailored to the two-dimensional situation – as relevant for the understanding of this paper. In [7,5], Mitchell and Van de Liefvoort show how to construct a correlated ME sequence from moment and correlation data. In two dimensions, the resulting moment/correlation-canonical form is given by:

p_ME = (1, 0) ,   B_ME^{-1} = [ r1                 r1
                                h2/r1   (r3 − r1(h2 + r2))/h2 ] ,
e_ME = (1, 0)^T ,   Y_ME = [ 1  0
                             0  γ ] ,   (3)

where r_i = E[X^i]/i!, i = 1, 2, 3, are the first three reduced moments of the marginal distribution. The parameters h2 = r2 − r1^2 and h3 = r1 r3 − r2^2 (see below) are only introduced for notational convenience (despite their more fundamental meaning as Hankel determinants, see [5]). For ME sequences of order 2 the lag-k covariances are found to be

cov[X0, Xk] = E[(X0 − r1)(Xk − r1)] = γ^k h2 ,   k = 1, 2, . . .   (4)
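The canonical form (3) and the covariance formula (4) can be verified numerically. The following sketch (our illustration, with arbitrarily chosen feasible hyperexponential moments) builds (3) and checks that it reproduces the prescribed reduced moments, and that cov[X0, Xk] = p_ME B_ME^{-1} Y_ME^k B_ME^{-1} e_ME − r1^2 (which follows from the joint density above) equals γ^k h2:

```python
import numpy as np

r1, r2, r3, gamma = 1.0, 1.5, 3.0, 0.4   # feasible hyperexponential example
h2 = r2 - r1 ** 2                        # here 0.5
Binv = np.array([[r1, r1],
                 [h2 / r1, (r3 - r1 * (h2 + r2)) / h2]])
p = np.array([1.0, 0.0])
e = np.array([1.0, 0.0])
Y = np.array([[1.0, 0.0],
              [0.0, gamma]])

# reduced moments r_n = p (B^{-1})^n e
red = [p @ np.linalg.matrix_power(Binv, n) @ e for n in (1, 2, 3)]

# lag-k covariances: p B^{-1} Y^k B^{-1} e - r1^2
cov = [p @ Binv @ np.linalg.matrix_power(Y, k) @ Binv @ e - r1 ** 2
       for k in (1, 2, 3)]
```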
Thus, γ can easily be derived from any (odd-k) covariance value and may be interpreted as a decay rate. Note that – due to the central role of the canonical form (3) in this paper – we will assume from now on that p_ME = (1, 0), e_ME = (1, 0)^T (as well as p_PH = (p, 1 − p), e_PH = (1, 1)^T). Two important conclusions can be drawn from the canonical representation (3). Since a marginal ME distribution of second order is fully specified by the first three moments (i.e., they also fix all higher moments) and since at the same time the moment-canonical form delivers the minimal representation based on the given moments, the fact that h2 appears in the denominator of a diagonal element of B_ME^{-1} implies the following: As h2 = 0 ⇔ r2 = r1^2 ⇔ c_v^2 ≡ (2 r2 − r1^2)/r1^2 = 1, where c_v^2 is the squared coefficient of variation, there can be no second-order representation with a squared coefficient of variation equal to one that is not stochastically equivalent to the (scalar) exponential distribution. Due to the equivalence of PH and ME distributions for m = 2, this observation is also true for PH distributions. Obviously, a one-dimensional Poisson process cannot carry any correlations (i.e., γ must be 0), and consequently we ignore the case h2 = 0 in the remainder of this paper. The second conclusion concerns the impact of parameter γ, which appears solely in matrix Y_ME. Therefore, varying only γ allows us to construct different correlated point processes with identical marginal distributions, since these are completely defined in terms of p_ME, B_ME, e_ME. For second-order ME distributions, moment bounds for the first three reduced moments have been established in [6] and naturally coincide with those developed for PH distributions of second order [12]. Thus, the feasibility of r1, r2, r3 can be confirmed beforehand, so that this paper is mainly dedicated to determining the feasibility of the correlation parameter γ. For the sake of comprehensiveness,
Table 1. Bounds for the first three reduced moments of ME(2)/PH(2) distributions

            r1       r2                                   r3
hypoexp.    r1 > 0   (3/4) r1^2 ≤ r2 < r1^2 (⇒ h2 < 0)    r1(2 h2 + r2) + 2(−h2)^{3/2} ≤ r3 ≤ r1(h2 + r2)
hyperexp.   r1 > 0   r1^2 < r2 (⇔ 0 < h2)                 r2^2/r1 < r3 (⇔ 0 < h3)
Table 1 recollects the moment bounds, which differ for the hypo- (h2 < 0) and hyperexponential (h2 > 0) setting. Additionally, it can easily be proved that h3 = r1 r3 − r22 must be negative in the hypoexponential case.
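The bounds of Table 1 translate directly into a feasibility check. The sketch below (a straightforward transcription, with boundary cases accepted or rejected exactly as the strict/non-strict inequalities in the table prescribe; the treatment of h2 = 0 as infeasible follows the exclusion of that case in the text above):

```python
def me2_moments_feasible(r1, r2, r3):
    """Check the ME(2)/PH(2) bounds of Table 1 for reduced moments r1, r2, r3."""
    if r1 <= 0:
        return False
    h2 = r2 - r1 * r1
    if h2 < 0:                                # hypoexponential branch
        if not (0.75 * r1 * r1 <= r2):
            return False
        lo = r1 * (2 * h2 + r2) + 2 * (-h2) ** 1.5
        hi = r1 * (h2 + r2)
        return lo <= r3 <= hi
    if h2 > 0:                                # hyperexponential branch
        return r3 > r2 * r2 / r1              # equivalent to h3 > 0
    return False                              # h2 == 0: exponential case, excluded
```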
3 The Conversion Algorithm
The previous section has already indicated an algorithm to check the feasibility of the parameter γ given the first three (feasible) moments of the marginal distribution and a correlation parameter: starting from the moment-canonical ME sequence of order 2, we find (conditions on the existence of) an invertible matrix X that transforms the ME sequence into a MAP(2). Due to the equivalence of both traffic classes for order 2, the restrictions on the matrix and vector elements of the MAP(2) will permit the existence of such a matrix X if and only if the parameter γ is feasible. Before we outline the general constructive algorithm for X, which on the fly decides on the feasibility of γ, we present two special cases in which the MAP(2) can be given directly (and not via a similarity transform from the canonical representation (3)).
3.1 The Uncorrelated Processes: γ = 0
For γ = 0, all the covariances vanish and we obtain a renewal process. Since the interevent times are then independent and identically distributed, they must each start with the same initial vector p_PH. This vector, as well as the two-dimensional matrix D0, can be computed from the first three reduced (or power) moments as described in [12]^1. The uncorrelated MAP(2) with this marginal distribution is simply given by D0 and D1 = −D0 e_PH p_PH, or in the ME tuple notation (p_PH, −D0, e_PH; e_PH p_PH). Of course, with a PH marginal distribution already given, this result easily generalizes to higher dimensions, just like the following special case.
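In code, the γ = 0 construction is a single outer product once (p_PH, D0) are available. The sketch below (our illustration with an arbitrary acyclic PH(2), not the output of the moment fit in [12]) also confirms that D0 + D1 is a proper generator and that the embedded chain restarts in p_PH after every arrival:

```python
import numpy as np

# an arbitrary acyclic PH(2) marginal (illustrative values)
p_ph = np.array([0.3, 0.7])
D0 = np.array([[-2.0,  1.0],
               [ 0.0, -4.0]])
e_ph = np.ones(2)

# uncorrelated MAP(2): each interarrival restarts with p_PH
D1 = np.outer(-D0 @ e_ph, p_ph)
```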
3.2 Hyperexponential Marginals and γ > 0
In [8], Mitchell demonstrates in the ME setting how to introduce correlations into an arbitrary renewal process with marginal distribution (p, B, e) (not necessarily
^1 Actually, [12] performs moment fitting to a canonical representation (α, T) for acyclic continuous PH distributions of order 2, where we set p_PH = α and D0 = T. Acyclicity does not confine the modeling power of PH distributions of second order.
in moment-canonical form). Choosing Y_γ = (1 − γ) e p + γ I, where I is an identity matrix of appropriate dimension, preserves the marginal distribution throughout the range [−1, 1) for γ. Assuming that (p, B, e) is a PH distribution, such a matrix Y_γ yields a MAP for all γ with 0 ≤ γ < 1 only if B is a (positive) diagonal matrix. In two dimensions, γ is exactly the parameter introduced in the previous section. To construct a MAP(2) with a hyperexponential marginal, this approach can be pursued in the following way: if h2 > 0, the first three (feasible) moments r1, r2, r3 are fitted to an H2 distribution by an algorithm of Whitt [13]. This results in a PH distribution (α, T) with the desired diagonal generator T, so that the ME tuple (p_PH, −T, e_PH; Y_γ) becomes a MAP(2) with D0 = T and D1 = (γ − 1) D0 e_PH p_PH − γ D0, where any γ ∈ [0, 1) is feasible. This special case demonstrates a great flexibility for nonnegative γ in the hyperexponential range. Since finite-dimensional processes – and in particular MAP(2)s – must be short-range dependent, i.e., γ < 1, this parameter may actually take any nonnegative value that is possible a priori. The remainder of this paper will reveal that circumstances are no longer as favorable for negative γ and/or in the hypoexponential case.
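As a numeric illustration of this construction (an H2 we chose by hand, not the output of Whitt's fitting algorithm): with mixing vector p = (0.4, 0.6), rates (1, 5) and γ = 0.5, the formula yields a valid MAP(2):

```python
import numpy as np

p = np.array([0.4, 0.6])       # illustrative H2 mixing probabilities
T = np.diag([-1.0, -5.0])      # diagonal generator of the H2 distribution
e = np.ones(2)
gamma = 0.5

# D1 = (gamma - 1) D0 e p - gamma D0 with D0 = T
D0 = T
D1 = (gamma - 1) * D0 @ np.outer(e, p) - gamma * D0
```

The rates of D1 split into a "restart according to p" part and a "stay in the current phase" part weighted by γ, which is what carries the geometric lag-k correlation.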
3.3 The General Situation γ ≠ 0
In the general situation, we exploit the equivalence of MAP(2)s and ME sequences of second order to determine the feasibility of the correlation parameter γ with respect to the first three (feasible) moments of the marginal distribution. The proposed algorithm finds a suitable invertible matrix X which converts the canonical form (3) into a valid MAP(2) via a similarity transform. It fails, if and only if the value of γ is not feasible. Thus, the benefits of the algorithm are two-fold: – It validates the given canonical form (of order 2) as a stochastic process. In the absence of analytical bounds for γ in the correlated ME sequence, this is a valuable contribution. – In case of a feasible parameter γ, the algorithm readily provides an equivalent MAP(2) representation, whose local physical interpretation may help to gain additional insight. A similarity transform by means of matrix X relates the ME sequence and the equivalent MAP(2) as follows:
e_PH = X e_ME   (5)
p_PH X = p_ME   (6)
Y_PH = X Y_ME X^{-1}   (7)
−D0 = B_PH = X B_ME X^{-1}   (8)
From (5) and our vector definitions, we observe that the first column of X must consist of ones only. In addition, (6) implies that the two elements in the second column have opposite signs. Thus, matrix X is of the form
X = [ 1  −y
      1   x ] ,   (9)
where x and y are either both positive or both negative. Matrix X transforms p_ME into p_PH = (x/(x+y), y/(x+y)) and – together with the requirement that Y_PH in (7) be a stochastic matrix – results in the constraints

max(−x/y, −y/x) ≤ γ ≤ 1 .   (10)

Note that (10) guarantees that γ will be greater than or equal to −1, where γ is increasingly confined in the negative range as x and y diverge. For the transformed ME sequence to be a MAP(2), we also require that the off-diagonal elements of D0 be nonnegative and the diagonal elements be negative and less than or equal to the negated off-diagonal element of the respective row (see (11) to (14)). The conditions mentioned so far do not yet warrant that the elements of D1 = (−D0) Y_PH are nonnegative. This requirement on D1 yields the four additional inequalities (15) to (18):

[−y h2^2 − r1(r3 − r1(h2 + r2))] / (r1 h3) ≤ 0   (11)
[−y^2 h2^2 − y r1(r3 − r1(2 h2 + r2)) + r1^2 h2] / ((x + y) r1 h3) ≥ 0   (12)
[x^2 h2^2 − x r1(r3 − r1(2 h2 + r2)) − r1^2 h2] / ((x + y) r1 h3) ≥ 0   (13)
[x h2^2 − r1(r3 − r1(h2 + r2))] / (r1 h3) ≤ 0   (14)
[x y h2^2 + x r1(r3 − r1(h2 + r2)) + y γ r1^2 h2 + γ r1^2 h2] / ((x + y) r1 h3) ≥ 0   (15)
[y^2 h2^2 + y r1((r3 − r1(h2 + r2)) − γ r1 h2) − γ r1^2 h2] / ((x + y) r1 h3) ≥ 0   (16)
[−x^2 h2^2 + x r1((r3 − r1(h2 + r2)) − γ r1 h2) + γ r1^2 h2] / ((x + y) r1 h3) ≥ 0   (17)
[−x y h2^2 + y r1(r3 − r1(h2 + r2)) + x γ r1^2 h2 − γ r1^2 h2] / ((x + y) r1 h3) ≥ 0   (18)

Note that computing D1 from D0 and Y_PH ensures D1 e_PH = −D0 e_PH, as necessary for the Markov chain generator Q = D0 + D1. With h3 occurring in the denominators, the above inequalities also reflect that r3 cannot attain the lower bound for hyperexponential distributions of second order. If γ is positive, we thus need to solve the above eight nonlinear inequalities for feasible pairs (x, y). If γ is negative, the first inequality of (10) results in two additional constraints on x and y.
Solving this set of nonlinear inequalities is not a standard problem and one cannot expect to find closed-form expressions for the boundaries of the feasible ranges easily. Although the roots of the related (at most quadratic) equations are readily found, their interpretation with respect to the feasible ranges very much depends on the signs of x, y and h2 , h3 . We distinguish four cases in our implementation:
a) hyperexponential and x, y > 0
b) hyperexponential and x, y < 0
c) hypoexponential and x, y > 0
d) hypoexponential and x, y < 0
The complexity of the problem prevents us from stating the implemented algorithm here in full and tedious detail. Instead, we give a rough sketch thereof. First, observe that several inequalities contain only either x or y. This suggests the procedure below, which successively extracts boundaries for x and y from the inequalities. If the algorithm aborts before reaching step 5, γ must be infeasible and the canonical form (3) does not capture a valid stochastic process.

Step 1: Determine absolute lower and upper bounds x_min^(1) and x_max^(1) for x in terms of r1, r2, r3, γ from (13), (14) and (17). Abort if x_max^(1) < x_min^(1).
Step 2: Determine absolute lower and upper bounds y_min^(1) and y_max^(1) for y in terms of r1, r2, r3, γ from (11), (12) and (16). Abort if y_max^(1) < y_min^(1).
Step 3: Determine lower and upper bounds for y relative to x from (15), (18) and – if γ < 0 – from (10).
Step 4: Obtain additional (absolute) lower and upper bounds x_min^(2) and x_max^(2) for x by relating all current (absolute – see y_min^(1) – and relative – see step 3) lower bounds for y to all current (absolute – see y_max^(1) – and relative – see step 3) upper bounds for y, i.e., each lower y-bound must be less than or equal to each upper y-bound at any feasible x. Abort if x_max^(2) < x_min^(2) (otherwise γ is feasible).
Step 5: For an arbitrary feasible x* ∈ [x_min^(2), x_max^(2)], determine the absolute lower and upper bounds y_min^(2) and y_max^(2) for y in terms of r1, r2, r3, γ and x* from (15), (18) and – if γ < 0 – from (10). Choose y* from the interval [y_min^(2), y_max^(2)]; the tuple (x*, y*) completes the specification of matrix X that transforms the ME sequence into a MAP(2).

Note that step 4 guarantees the existence of a feasible y*, after a feasible x* has been found. Apparently, x is not and cannot be unique, because in general a stochastic process that can be captured by a MAP(2) (or a matrix-exponential sequence) has several such (equivalent) representations. Depending on the signs of x, y and h2, h3, the same inequality may contribute no, one, or two relevant bounds. In this respect, it is by no means trivial that the cases a) and b) (as well as c) and d)) always arrive at the same answer for the decision problem, although our experiments showed this to be the case.
When considering a current feasible range for x, bounds for y may have to be interpreted inversely in different subintervals of this range as determined by the actual values of the involved parameters. These may also cause roots to be complex so that either no bounds result from the inequality or no solution exists. In fact, such requirements can be exploited to find necessary conditions for the parameter γ. Generally, however, the host of different options in the course of the above procedure makes a symbolic solution very cumbersome and will most likely provide further insight only in special cases, where the involved expressions simplify. In Sect. 5, we investigate such special cases. In Sect. 4, we demonstrate how the algorithm can be used to experimentally retrieve bounds for γ.
[Figure: left panel – third reduced moment r3 vs. squared coefficient of variation, showing the curves "Marie's r3" and "r3 of Prop. 1" together with the bounds BI, BII and BIII; right panel – correlation parameter γ vs. squared coefficient of variation, showing the feasible correlation region delimited by the "bounds for Marie" (M1, M2) and the "bounds for minimum r3".]
Fig. 1. The feasible ranges of reduced third moment r3 (left-hand side) and of correlation parameter γ (right-hand side)
4 Experimental Bounds of γ for Marie's Distribution
Although the algorithm sketched in the previous section primarily decides whether a specific correlation parameter γ is feasible for a MAP(2) (or an ME sequence of order 2) with respect to given first three moments of the marginal distribution, it can also be used to mark the feasible range of γ for r1, r2, r3, e.g., by using bisection. Two cases with different third moments (with respect to r1 and r2) – once chosen according to Marie's distribution and once (quasi-)minimal – reveal the dependence of the γ-bounds on the third moment. The left-hand side of Fig. 1 plots the employed third-moment curves – Marie's r3 and BI/BII – in the feasible range of r3 (also see bounds in Table 1 or [12]). First, we provide results for Marie's distribution [14], which – unlike most other published generic PH distributions of order 2 – covers the permissible hyper- and hypoexponential domain completely. For a squared coefficient of variation c_v² = (2r2 − r1²)/r1² = (h2 + r2)/r1² ≥ 1/2, Marie's distribution is generated from the two moments r1 and r2:

   pPH = (1, 0) ,   BPH = (1/r1) · [ 2, −1/c_v² ; 0, 1/c_v² ] ,    (19)
with ePH again being the column vector of ones. The third moment is r3 = (1/4)(2c_v⁴ + c_v² + 1) r1³ (see also left-hand side of Fig. 1). In our experiments, we fix r1 = 1 so that we can easily plot the feasible range of γ versus the only independent variable c_v². The solid lines in Fig. 1 (right-hand side) show the resulting graphs. First of all, we notice from Fig. 1 (right-hand side) how restricted the range of γ may be in the hypoexponential domain and/or if γ < 0, as compared to the hyperexponential situation with γ > 0. Furthermore, the case distinctions encountered in the course of the algorithm manifest themselves in piecewise defined boundaries. While these boundaries inevitably undergo redefinitions at the singularity c_v² = 1, the other transition points (e.g., at c_v² = 3/2 in the figure) vary with
the specific moment values. For Marie's distribution with r1 = 1, we can come up with analytic bounds for negative γ:

for 1/2 ≤ c_v² ≤ 3/2 (c_v² ≠ 1): the lower γ-bound M1 is linear in c_v², i.e., γmin = −(c_v² − 1/2).

for 3/2 ≤ c_v²: the lower γ-bound M2 is given by γmin = −1/(2(c_v² − 1)).
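As a numerical cross-check, the following Python sketch builds Marie's PH(2) from (19) (matrix entries as reconstructed above, so treat them as an assumption), recomputes the reduced moments r_k = pPH V^k ePH with V = BPH⁻¹, and encodes the piecewise bounds M1/M2; all function names are ours.

```python
def marie_ph(r1, cv2):
    """Marie's PH(2) representation: p = (1, 0),
    B = (1/r1) * [[2, -1/cv2], [0, 1/cv2]] (entries as read from (19))."""
    assert cv2 >= 0.5
    p = [1.0, 0.0]
    B = [[2.0 / r1, -1.0 / (cv2 * r1)],
         [0.0, 1.0 / (cv2 * r1)]]
    return p, B

def reduced_moments(p, B, kmax=3):
    """Reduced moments r_k = p V^k e (r_k = m_k / k!), with V = B^{-1}."""
    (a, b), (c, d) = B
    det = a * d - b * c
    V = [[d / det, -b / det], [-c / det, a / det]]
    w, out = list(p), []
    for _ in range(kmax):
        w = [w[0] * V[0][0] + w[1] * V[1][0],
             w[0] * V[0][1] + w[1] * V[1][1]]
        out.append(w[0] + w[1])            # multiply by e = (1, 1)^T
    return out

def gamma_min_marie(cv2):
    """Lower gamma-bound for Marie's distribution with r1 = 1 (M1/M2)."""
    if 0.5 <= cv2 <= 1.5:
        return -(cv2 - 0.5)                # M1, linear in cv2
    return -1.0 / (2.0 * (cv2 - 1.0))      # M2

p, B = marie_ph(1.0, 2.0)
print(reduced_moments(p, B))               # [1.0, 1.5, 2.75]
```

For r1 = 1 and c_v² = 2 this reproduces r2 = (c_v² + 1)/2 = 1.5 and r3 = (1/4)(2c_v⁴ + c_v² + 1) = 2.75, and gamma_min_marie(6.0) returns the value −0.1 quoted below for Marie's distribution at c_v² = 6.0.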
In the hypoexponential domain, where r3 is generally confined to a relatively narrow strip (see bounds BII and BIII on the left-hand side of Fig. 1), the depicted γ-range should be somewhat representative for any distribution of order 2 with r1 = 1. In the hyperexponential domain, the flexibility to vary r3 (thus transcending Marie's distribution) allows us to loosen the restriction on γ. For example, at c_v² = 6.0, where the minimal value for γ is −0.1 for Marie's distribution (r3 = 19.75), the range of γ may reach down as far as −0.4, when r3 approaches its minimum value r2²/r1 = 12.25. The dashed-dotted lines in Fig. 1 (right-hand side) delimit the feasible γ-range, as r3 is set to (or at least very close to) its minimum boundary value (dependent on r2 and r1 = 1). The discontinuity at c_v² = 1 for negative γ can be attributed to the fact that different bound functions apply for r3 in the hypo- and hyperexponential domain. In the hypoexponential domain, setting r3 to its minimum value (on bound BII on the left-hand side of Fig. 1) often yields only a negligible gain: at c_v² = 0.6, the minimum value of γ may only be reduced from −0.1 to −0.10557281 (with r3 down to 0.5789 from 0.58 for Marie's distribution). The rule "the lower r3, the larger the feasible γ-range" turns out to be valid consistently only in the hypoexponential domain. For c_v² > 1 – more precisely for 1 < c_v² < 3 – the bound behavior is more involved, as Fig. 1 (right-hand side) indicates. If c_v² ≥ 3, the (virtually) minimal r3 (see bound BI on the left-hand side of Fig. 1) again maximizes the feasible γ-range (for the selected r1). Our experiments suggest the following for c_v² ∈ (1, 3): if we concede absolute freedom of choice in parameter r3, r3 can be chosen in such a way (with respect to c_v²) that γ(c_v²) may range in [−1, 1). For example, for c_v² = 3/2, this r3 should equal (1/4)(2c_v⁴ + c_v² + 1)r1³ (Marie's distribution); for c_v² = 3, r3 ≈ r2²/r1 (lower bound). In fact, the next section will prove that in the considered c_v²-range r3 = r1(2h2 + r2) fulfills the desired properties.
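The bisection search mentioned at the beginning of this section can be sketched generically; the feasibility oracle below is a stand-in that we supply for illustration (the real oracle is the decision algorithm of Sect. 3).

```python
def lowest_feasible_gamma(feasible, lo=-1.0, hi=0.0, tol=1e-6):
    """Bisection for the lower gamma-bound, given a feasibility oracle.
    Assumes feasible(hi) holds and feasible(lo) does not; the oracle
    here is an assumed stand-in, not the paper's decision algorithm."""
    assert feasible(hi) and not feasible(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid          # mid is feasible: boundary lies below
        else:
            lo = mid          # mid is infeasible: boundary lies above
    return hi

# Illustrative oracle: pretend everything above -0.23 is feasible
# (-0.23 is the gamma_min value quoted later for the MAP(2) example).
print(lowest_feasible_gamma(lambda g: g >= -0.23))  # close to -0.23
```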
5 Analytic Bounds for the Maximal Correlation Region
In this section, we give analytic expressions for the absolute bounds of γ for MAP(2)s with arbitrary first three moments of the marginal distribution. We first prove the following proposition.

Proposition 1. For 1 < c_v² < 3, the third (reduced) moment r3 can always be chosen in such a way (with respect to the given first two moments r1, r2) that
a MAP(2) can be constructed such that its marginal distribution matches the moments r1, r2 (and r3) and its correlation parameter γ may take any arbitrary value in [−1, 1).

Proof. We will show that for a specific choice of the third moment we may even select y equal to x (thus simplifying the representation (9) of matrix X) and have these parameters fixed for any γ ∈ [−1, 1). Choose r3 = r3* ≡ r1(2h2 + r2) and set x = y = r1/√h2. Clearly, in the considered range of c_v² (1 < c_v² < 3), r3* is a feasible third (reduced) moment, i.e., r3* > r2²/r1 (see dashed line on left-hand side of Fig. 1 for an illustration). It is easily verified that the proposed settings satisfy all involved inequalities. Equality holds for (12) and (13), while inequalities (10), (11) and (14) to (18) are either trivially fulfilled or true because c_v² < 3 ⇔ r2 < 2r1² ⇔ r1 > √h2 or −1 ≤ γ ≤ 1. As we exclude the degenerate case γ = 1, the proposition is proved.² The MAP(2) constructed from canonical form (3) by similarity transformation with matrix X (where x = y = r1/√h2) matches r1, r2, r3 = r1(2h2 + r2) and an arbitrary γ ∈ [−1, 1).

Following the idea of the algorithm in Sect. 3, other boundaries of the correlation region can also be derived. For example, for c_v² ≥ 3 the lower bound G2 of (negative) γ (see Fig. 2 – left-hand side) is simply given by

   γ > −2/(c_v² − 1) = −r1²/h2    for c_v² ≥ 3 .    (20)
Note that this bound only depends on the squared coefficient of variation. Besides Fig. 2 (left-hand side, solid line), this curve is also depicted in Fig. 1 (right-hand side, dashed-dotted line). In the course of the derivations, limit arguments must be employed, because the lower bound of r3 > r2²/r1 actually cannot be attained. This also explains why γ must be strictly greater than its lower boundary (20). Thus, all extremal boundaries for the correlation region of MAP(2)s are available in analytical form in the hyperexponential case. Bounds for the hypoexponential case may be obtained analogously. However, they no longer depend exclusively on c_v², but also on r1. As an example, we give the lower bound G1 of the correlation region in the hypoexponential domain for r1 ≥ 1:
   γ ≥ −(1 − c_v²)/2 + ((1 − c_v²)/2) · (1/r1) · √(1 − 2(1 − c_v²))    if 0.5 ≤ c_v² < 1 .

For r1 = 1, the bound is plotted in Fig. 2 (left-hand side). Due to the nature of the correlation region in the hypoexponential domain, applications for correlated MAPs of second order with hypoexponential marginal distributions appear quite limited. So, we omit corresponding proofs here.
² Note that γ = 1 causes YPH to become diagonal so that any correlation is eliminated. Actually, we are then dealing with two independent Poisson processes.
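The hyperexponential bound (20) and the identity −2/(c_v² − 1) = −r1²/h2 (via h2 = r2 − r1² and r2 = (c_v² + 1)r1²/2) are easy to check numerically; the helper names are ours.

```python
def g2_bound(cv2):
    """Lower bound G2 of gamma for cv2 >= 3, eq. (20)."""
    assert cv2 >= 3.0
    return -2.0 / (cv2 - 1.0)

def g2_bound_via_h2(r1, cv2):
    """Same bound written as -r1^2 / h2."""
    r2 = (cv2 + 1.0) * r1 * r1 / 2.0   # from cv2 = (2*r2 - r1^2)/r1^2
    h2 = r2 - r1 * r1
    return -r1 * r1 / h2

print(g2_bound(6.0))                   # -0.4, as quoted in Sect. 4
print(g2_bound_via_h2(2.5, 6.0))       # identical, independent of r1
```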
[Figure: left panel – correlation parameter γ vs. squared coefficient of variation, showing the maximal correlation region of MAP(2)s (maximal bounds G1 and G2) and the smaller correlation region of MMPP(2)s; right panel – absolute lag-1 coefficient of correlation vs. squared coefficient of variation.]
Fig. 2. The maximal/envelope range of correlation parameter γ (left-hand side) and a-priori restrictions imposed on lag-1 coefficient of correlation (right-hand side)
The hatched area in the left-hand side of Fig. 2 indicates that in the hyperexponential range (c_v² > 1) with nonnegative correlation structure, one can always construct an MMPP(2) from r1, r2, r3 and cov[X0, X1] for any r3 > r3min = r2²/r1 and 0 ≤ γ = cov[X0, X1]/h2 < 1 (see [4], where, however, these bounds are not as clearly identified). So, in this setting, the four parameters are decoupled as much as possible a priori for a second-order representation, which holds for both MAP(2)s and MMPP(2)s. For nonrenewal MMPP(2)s (i.e., arrival rates λ0 ≠ λ1), γ_MMPP(2) = (1 + ν0/λ0 + ν1/λ1)⁻¹ > 0. Thus, their correlation structure cov[X0, Xk] = γ^k h2 must be nonnegative, as the squared coefficient of variation of MMPPs in general cannot be less than one (so that h2 ≥ 0). We also point out that γ is not to be confused with the lag-1 coefficient of correlation (defined by corr[X0, X1] = cov[X0, X1]/Var[X0] = γ h2/(h2 + r2)). As a MAP(2) requires |γ| ≤ 1 a priori, it can be easily shown that the lag-1 coefficient of correlation is in any case restricted by

   | corr[X0, X1] | ≤ (1/2)(1 − 1/c_v²) ≤ 1/2 .

The right-hand side of Fig. 2 illustrates these bounds on the absolute value of corr[X0, X1], which are further tightened by the maximal γ-region of MAP(2)s (outside the interval 1 < c_v² < 3). Note again the impossibility of constructing a correlated MAP(2) with c_v² = 1. If the algorithm of Sect. 3 fails, the insight provided by the presented bounds may be exploited to enforce feasibility for the correlation parameter γ (at the expense of an adjusted r3). In the next section, we will put such a procedure for constructing a MAP(2) from arbitrary values of r1, r2, r3 and γ to the test. Of course, explicit bounds for γ in terms of any feasible moment set r1, r2, r3 would allow one to determine exactly how much the third moment or γ has to be varied to enter the feasible correlation region.
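The a-priori restriction on the lag-1 coefficient of correlation can likewise be verified numerically (helper names ours):

```python
def lag1_corr(gamma, r1, cv2):
    """corr[X0, X1] = gamma * h2 / (h2 + r2), with h2 = r2 - r1^2."""
    r2 = (cv2 + 1.0) * r1 * r1 / 2.0
    h2 = r2 - r1 * r1
    return gamma * h2 / (h2 + r2)

def corr_bound(cv2):
    """A-priori bound (1/2) * (1 - 1/cv2) <= 1/2, for cv2 >= 1."""
    return 0.5 * (1.0 - 1.0 / cv2)

# |gamma| <= 1 implies |corr| <= bound; equality at gamma = 1:
print(lag1_corr(1.0, 1.0, 2.0), corr_bound(2.0))   # both 0.25
```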
6 Application to Decomposition of Queueing Networks
Parsimonious models are especially important in traffic-based decomposition of queueing networks. In this class of approximate algorithms, queues are analysed in isolation with respect to (approximate) arrival process characterizations derived from the output(s) of upstream queue(s). Recent publications usually rely on MAPs (or subclasses thereof) to describe the internal traffic streams [15,16,17,18,4]. Correlated matrix-exponential sequences have also been used [7]. The efficiency and – for larger networks – even the practicability of traffic-based decomposition heavily depend on the dimensions of the employed MAPs, which often depend multiplicatively on the orders of the arrival process and the (phase-type) service representation. Exceptions are [18], where this MAP dependence is linear, and [4], which maps marginal and correlation data to MMPPs of fixed order 2 (MMPP(2)). The decomposition techniques facilitated by the findings of the current paper will also result in minimal correlated fixed-order representations and generalize the latter method in that they exhaust the potential of these representations (in going from MMPP(2) to MAP(2)). In addition, these techniques prove quite flexible in the interpretation of the correlation parameter. Of course, we do not claim that MAP(2) representations generally yield satisfactory output traffic descriptors, but in many situations their accuracy and parsimony make them very attractive substitutes for unnecessarily complex output models. We will focus on MAP(2)s in queue output approximations as they ensue from the bounds discussed in previous sections.
6.1 Dual Tandem Queue with MMPP(2) Input
First, we demonstrate the flexibility in using MAP(2)s in network decomposition and their accuracy when there is no conflict (for second-order representations) between and among the involved marginal moments and the initial correlation structure. Consider the dual tandem queue in Fig. 3 (top) with MMPP(2) external input and exponential and Erlang-2 services with means 1.0 and 0.8, respectively, also studied in [18]. Table 2 summarizes internal traffic characteristics and the mean queue lengths at the second node, computed using various methods including simulation (top row) and the MAP-based decomposition from [18] (bottom row). The latter approach, which bases the output approximation on a busy period analysis of the first queue, results in MAPs of order mA + 3mS (i.e., 5 in this case). Parameters mA and mS denote the orders of the arrival process and the service representation. The values for the third reduced moments r3 and the lag-i-covariances (i = 1, 2) given in the simulation row were obtained by a numerical analysis of the MAP/PH/1 queue [1] and its departure process [19]. The entry "fitted" means that the respective decomposition technique matches this parameter. All methods capture the first two marginal moments of the true output process (r1 = 2, r2 ≈ 7.34, i.e., c_v² ≈ 2.67), so these are omitted from the table. All other methods in Table 2 are based on decomposition with MAP(2)s, which arise from different settings for the third marginal moment r3 and the
[Figure: two dual tandem queue specifications. Top: MMPP(2) input (mean rate = 0.5, s.c.v. = 4.1; ν0 = 0.9375, ν1 = 0.0625, λ0 = 6.0, λ1 = 0.1333) → exponential(1.0) queue → erlang(0.8,2) queue. Bottom: MAP(2) input (mean rate = 0.5, s.c.v. = 3.25; D0 = [−2, 0; 0, −0.2], D1 = [1, 1; 0.2, 0]) → erlang(1.0,2) queue → erlang(1.0,2) queue.]
Fig. 3. Specifications for the dual tandem queues with MMPP(2) or MAP(2) input
correlation parameter γ (second column) in canonical form (3) – by similarity transform with matrix X using the values of the last column for x and y (also see (9)). Note that the covariance expression cov[X0, Xk] = γ^k h2 (see (4)) allows different methods to find values for γ. For our experiments, we chose
– to match the lag-1-covariance, i.e., γ1 ≡ cov[X0, X1]/h2 (lag-1-dec., r3*-lag-1-dec.).
– to match the correlation decay, i.e., γ2 ≡ cov[X0, X2]/cov[X0, X1] (decay-dec., r3*-decay-dec.).
– to take an average of the above two values, i.e., γ3 ≡ (γ1 + γ2)/2 (av.-γ-dec.).
– an uncorrelated MAP(2) output approximation, i.e., γ0 ≡ 0 (renewal-dec., r3*-renewal-dec.).
The methods with the prefix r3*- differ from their counterparts (in rows 2, 3 and 5) only in their choice of the third reduced moment according to Proposition 1. In the considered dual tandem queue, the methods lag-1-dec. and decay-dec. deliver excellent results for the mean queue lengths at the second node, with relative errors below 1%. Note that these results can often even be improved (depending on the constellation of the involved lag-1- and lag-2-covariances) by an average value for γ (av.-γ-dec.). This value trades off between matching the lag-1-covariance and the decay of the correlation structure. In this example, the MAP(2)-based decomposition has a significantly lower relative error in the mean queue length at the second node compared with the busy-period approach in [18], even though a much more compact MAP representation is employed (order 2 instead of 5). For higher-order arrival processes or service times, this order relation will naturally be much more favorable for the MAP(2)-based decomposition. Since r1, r2 and cov[X0, X1] of the output process define a γ-value in the shaded region of Fig. 2 (left-hand side), we can also obtain an MMPP(2) by means of the fitting method in [4]. This MMPP(2) is of course stochastically equivalent to the MAP(2) resulting from method lag-1-dec. Generally, any freedom in the algorithm of Sect. 3 (step 5, when x* and y* are chosen) is used to construct a diagonal D0 (if possible³), which makes subsequent analyses efficient.
³ In our examples, it is possible. In fact, when c_v² > 1 and γ > 0 (as in the first example), the resulting MAP(2) is identical to the one as constructed in Sect. 3.2.
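The three correlated γ-settings of Table 2 can be reproduced from the simulated output characteristics (r1 = 2, r2 ≈ 7.34, lag-1-cov 2.2308, lag-2-cov 1.5466); since r2 is rounded in the text, the computed values agree with the table entries only to about three decimal places.

```python
r1, r2 = 2.0, 7.34            # first two reduced output moments (r2 rounded)
h2 = r2 - r1 * r1             # h2 = r2 - r1^2
lag1, lag2 = 2.2308, 1.5466   # simulated lag-1/lag-2 covariances (Table 2)

g1 = lag1 / h2                # match the lag-1-covariance
g2 = lag2 / lag1              # match the correlation decay
g3 = 0.5 * (g1 + g2)          # average of the two
print(round(g1, 4), round(g2, 4), round(g3, 4))
```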
Table 2. Mean queue lengths (mql) at second node and internal traffic characteristics for the MMPP(2) dual tandem queue

method          γ       mql     conf./err.  r3        lag-1-cov  lag-2-cov  x/y
simulation      –       0.9479  ± 0.0073    35.8562   2.2308     1.5466     –
lag-1-dec.      0.6684  0.9397  −0.87%      (fitted)  (fitted)   1.4910     2.0041/0.5980
decay-dec.      0.6933  0.9513  +0.36%      (fitted)  2.3140     1.6043     (ditto)
av.-γ-dec.      0.6809  0.9454  −0.26%      (fitted)  2.2724     1.5472     (ditto)
renewal-dec.    0.0     0.7924  −16.4%      (fitted)  0.0        0.0        (ditto)
r3*-lag-1-dec.  0.6684  2.0439  +115.6%     28.0251   (fitted)   1.4910     1.0948
r3*-decay-dec.  0.6933  2.1609  +128.0%     28.0251   2.3140     1.6043     (ditto)
r3*-rnwl-dec.   0.0     1.0816  +14.1%      28.0251   0.0        0.0        (ditto)
busy prds [18]  –       0.9762  +3.0%       (fitted)  2.2842     1.3754     –
The other results given in Table 2 serve to illustrate the impact of the third marginal moment and/or the correlation parameters in the MAP(2)-based decomposition. A renewal approximation with r3 matched exactly (method renewal-dec.) yields a relative error of −16.4%. The proposition of Sect. 5 enables us to avoid the algorithm of Sect. 3 altogether, if 1 < c_v² < 3 – although the third marginal moment is now fixed to r3 ≡ r3* = r1(2h2 + r2) (see proof of the proposition). Then, we set x = y = r1/√h2 = 1.0948. The pertinent MAP(2) output model, with D0 diagonal, shows that performance at the downstream queue is very sensitive to deviations in the third marginal moment. In the hyperexponential setting, lower values for r3 result in higher mean queue lengths (by as much as 128% in our example). Note that ignoring the positive correlations here (r3*-renewal-dec.) leads to deceptive results. The ensuing decrease in mean queue length lets this technique appear to be "better" than the others in the r3*-family of approaches.
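For the first example (r1 = 2, r2 ≈ 7.34), the fixed third moment r3* and the transformation parameters x = y = r1/√h2 of the r3*-methods follow directly; again, the rounding of r2 makes the results agree with Table 2 only approximately.

```python
import math

r1, r2 = 2.0, 7.34                 # output moments (r2 rounded in the text)
h2 = r2 - r1 * r1
r3_star = r1 * (2.0 * h2 + r2)     # fixed third moment from Proposition 1
x = y = r1 / math.sqrt(h2)         # transformation parameters of the proof
print(round(r3_star, 2), round(x, 4))
```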
6.2 Dual Tandem Queue with MAP(2) Input
The situation becomes more intricate when the marginal moments are incompatible with the preferable correlation parameter γ. For example, this may occur when γ is within the maximal correlation range (for arbitrary third moment), but outside the bounds for the specific interevent distribution (as fixed by the first three moments). In our second experiment (see dual tandem queue in Fig. 3 with the lower specifications), we study this situation, where we investigate whether precedence should be given to matching γ over r3, or vice versa. The external MAP(2) input together with the Erlang-2 service distribution (mean 1.0) at the first queue now result in a negative lag-1-covariance and a positive lag-2-covariance in the internal traffic (see Table 3). The conditions for the output approximation are aggravated by the significantly different values obtained for γ1 and γ2 – both less than the minimal value γmin ≈ −0.23 (as obtained from the algorithm of Sect. 3) feasible for r3 = 30.1811. Apart from the techniques used in Table 2, we examine an additional technique, called γmin-r3-dec., which fixes the
Table 3. Mean queue lengths (mql) at second node and internal traffic characteristics for the MAP(2) dual tandem queue

method          γ        mql     conf./err.  r3        lag-1-cov  lag-2-cov  x/y
simulation      –        1.0348  ± 0.0082    30.1811   −0.9076    0.7521     –
lag-1-dec.      −0.3510  1.1176  +8.0%       28.065    (fitted)   0.3186     2.0989/0.7371
decay-dec.      −0.8286  1.1411  +10.3%      24.234    −2.1434    1.7751     1.3563/1.1407
γmin-r3-dec.    −0.23    1.0927  +5.6%       (fitted)  −0.5947    0.1368     2.5918/0.5969
renewal-dec.    0.0      1.1227  +8.5%       (fitted)  0.0        0.0        (ditto)
r3*-lag-1-dec.  −0.3510  1.2582  +21.6%      23.5133   (fitted)   0.3186     1.2438
r3*-decay-dec.  −0.8286  1.1613  +12.2%      23.5133   −2.1424    1.7751     (ditto)
r3*-rnwl-dec.   0.0      1.3875  +34.1%      23.5133   0.0        0.0        (ditto)
busy prds [18]  –        1.1124  +7.5%       (fitted)  −0.3240    0.2356     –
original r3 and sets the correlation parameter to γmin. While for the r3*-methods the choice of r3* ensures any value ≥ −1 for γ (since 1 < c_v² ≈ 2.29 < 3), the γ-values obtained for lag-1-dec. and decay-dec. entail lowering the third marginal moment to the respective maximal value for the specific γ. From the results in Table 3, we observe that in this example r3 is more critical than γ, so that method γmin-r3-dec. performs best with a deviation of +5.6%. This value is actually still better than that yielded by the output approximation according to [18], which matches the first three moments of the busy period of the first queue with a MAP of order 8 (four times as large). Note that the negative lag-1-covariance reduces the mean queue length at the second node compared to the renewal case (check for identical third moments). Thus, as either γ must be chosen closer to zero or r3 below the original value, the mean queue length will necessarily be overestimated here by the MAP(2) techniques. Consequently, r3*-renewal-dec. gives the worst performance. We also remark that, due to the negative lag-1-covariance in this example, no reasonable approximation (apart from renewal) can be achieved here by means of MMPP(2)s as traffic descriptors. Let us also give a MAP(2) representation for the internal traffic explicitly, say for method γmin-r3-dec.:

   D0 = [ −0.8141, 0 ; 0, −0.1869 ]   and   D1 = [ 0.6265, 0.1876 ; 0.18685, 0.00005 ] .
Besides the typical diagonal form of D0, it is interesting to notice how the parameters of the external input MAP(2) are modulated as the traffic traverses the first queue. Once again, we point out that the Markov chain interpretation of this MAP(2) may be considered a nice spin-off product of the presented algorithm. But once the feasibility of the parameters involved in canonical form (3) has been determined, it is more advisable to proceed with this ME representation for subsequent queue analyses (by LAQT techniques) to avoid additional numerical errors.
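The structural MAP(2) conditions from Sect. 3 – nonnegative D1, nonnegative off-diagonal and negative diagonal entries in D0, and zero row sums of D0 + D1 – can be verified mechanically for the representation above (the checker below is ours):

```python
def is_valid_map2(D0, D1, tol=1e-9):
    """Check the structural MAP conditions: nonnegative D1, nonnegative
    off-diagonals and negative diagonals in D0, zero row sums of D0+D1."""
    for i in range(2):
        if D0[i][i] >= 0:
            return False
        for j in range(2):
            if D1[i][j] < 0 or (i != j and D0[i][j] < 0):
                return False
        if abs(sum(D0[i]) + sum(D1[i])) > tol:   # (D0 + D1) e = 0
            return False
    return True

D0 = [[-0.8141, 0.0], [0.0, -0.1869]]
D1 = [[0.6265, 0.1876], [0.18685, 0.00005]]
print(is_valid_map2(D0, D1))   # True
```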
6.3 Discussion
In this paper, we can only highlight some of the issues faced when applying our theoretical considerations to the decomposition of queueing networks. In future work, we will conduct extensive experiments to investigate the behavior of MAP(2)-based decomposition in other settings (varying arrival processes, service times, utilizations, considered measures, etc.). Of course, the limitations discussed in previous sections are obvious and manifold. An additional qualitative constraint applies due to the fact that the lag-2-covariance must be nonnegative in the hyperexponential case (cov[X0, X2] = γ²h2 ≥ 0 for c_v² ≥ 1) and nonpositive in the hypoexponential case (cov[X0, X2] = γ²h2 ≤ 0 for c_v² ≤ 1). Despite these obvious limitations, we regard it as worthwhile to include MAP(2)s/ME(2)s as candidates for traffic description in network decomposition. Feasibility checks can be performed quickly. In case of a positive outcome of this decision problem, accuracy can be expected to be appealing, while the efficiency of network decomposition may be dramatically enhanced by means of these quasi-minimal traffic descriptors. Of course, truncated output MAP approximations as proposed in [15,16,17], which precisely match the first k lag-covariances (k ≥ 2), will provide more accurate results, but have orders of at least mA(1 + kmS), i.e., multiplicative in mA and mS. For k = 2, this leads to MAPs of order 10 for the output approximations of the first queue in the examples (cf. our fixed dimension 2). With increasing arrival and service representations, these MAPs may soon become inefficiently large, especially when superpositions of MAPs are involved in network decomposition.
7 Conclusions
In this paper, we exploited the equivalence of correlated ME sequences of second order and MAP(2)s in order to gain insight into the correlation that can be captured by such two-dimensional processes. The presented algorithm converts the moment/correlation canonical form of the ME process, which is conveniently expressed in terms of the first three moments of the marginal distribution and a correlation parameter, into a valid MAP(2), if the given parameters are feasible. For feasible moments, the algorithm fails if and only if the correlation parameter γ is not permissible. Note that matching γ does not necessarily mean fitting the first coefficient of correlation or the lag-1 covariance. Rather, γ may also be used to adjust the decay of the correlation structure or be interpreted even more generally. Apart from the insight into the behavior of the studied processes, the results of this paper – including several analytic bounds for γ – have obvious practical implications. In many applications – we have looked into network decomposition here – compact and correlated traffic processes are desired for accurate models, while avoiding state space explosion. The presented algorithm aids in the decision whether this goal may already be achieved with MAPs of second order.
References

1. Latouche, G., Ramaswami, V.: Introduction to Matrix-Analytic Methods in Stochastic Modeling. Series on statistics and applied probability. ASA-SIAM (1999)
2. Heffes, H., Lucantoni, D.M.: A Markov-modulated characterization of packetized voice and data traffic and related statistical multiplexer performance. IEEE J. on Selected Areas in Commun. 4 (1986) 856–868
3. Gusella, R.: Characterizing the variability of arrival processes with indexes of dispersion. IEEE J. on Selected Areas in Commun. 9 (1991) 203–211
4. Ferng, H.W., Chang, J.F.: Connection-wise end-to-end performance analysis of queueing networks with MMPP inputs. Performance Evaluation 43 (2001) 39–62
5. van de Liefvoort, A.: The moment problem for continuous distributions. Technical Report WP-CM-1990-02, School of Interdisciplinary Computing and Engineering, University of Missouri – Kansas City, USA (1990)
6. Heindl, A., van de Liefvoort, A.: Matrix-exponential and matrix-geometric distributions of second order: Moment matching and moment bounds. (submitted for publication in 2003)
7. Mitchell, K., van de Liefvoort, A.: Approximation models of feed-forward G/G/1/N queueing networks with correlated arrivals. Performance Evaluation 51 (2003) 137–152
8. Mitchell, K.: Constructing a correlated sequence of matrix exponentials with invariant first-order properties. Operations Research Letters 28 (2001) 27–34
9. Lipsky, L., Fiorini, P., Hsin, W., van de Liefvoort, A.: Auto-correlation of lag-k for customers departing from semi-Markov processes. Technical Report TUM-19506, TU Munich (1995)
10. Lipsky, L.: Queueing Theory: A linear algebraic approach. MacMillan (1992)
11. van de Liefvoort, A.: A note on count processes. Assam Statistical Review 5 (1991) 1–11
12. Telek, M., Heindl, A.: Matching moments for acyclic discrete and continuous phase-type distributions of second order. Intl. Journal of Simulation 3 (2003) 47–57
13. Whitt, W.: Approximating a point process by a renewal process, I: Two basic methods. Operations Research 30 (1982) 125–147
14. Marie, R.A.: Méthodes itératives de résolution de modèles mathématiques de systèmes informatiques. RAIRO Informatique/Comput. Sci. 12 (1978) 107–122
15. Sadre, R., Haverkort, B.: Characterizing traffic streams in networks of MAP/MAP/1 queues. In: Proc. 11th GI/ITG Conference on Measuring, Modelling and Evaluation of Computer and Communication Systems, Aachen, Germany (2001)
16. Green, D.: Lag correlations of approximating departure processes of MAP/PH/1 queues. In: Proc. 3rd Int. Conf. on Matrix-Analytic Methods. (2000) 135–151
17. Bean, N.G., Green, D.A., Taylor, P.G.: Approximations to the output process of MAP/PH/1 queues. In: Proc. 2nd Int. Workshop on Matrix-Analytic Methods. (1998) 151–159
18. Heindl, A., Telek, M.: Output models of MAP/PH/1(/K) queues for an efficient network decomposition. Performance Evaluation 49 (2002) 321–339
19. Ferng, H.W., Chang, J.F.: Departure processes of BMAP/G/1 queues. Queueing Systems 39 (2001) 109–135
EvalVid – A Framework for Video Transmission and Quality Evaluation

Jirka Klaue, Berthold Rathke, and Adam Wolisz

Technical University of Berlin, Telecommunication Networks Group (TKN), Sekr. FT5-2, Einsteinufer 25, 10587 Berlin, Germany, {jklaue,rathke,wolisz}@ee.tu-berlin.de
Abstract. With EvalVid¹ we present a complete framework and tool-set for evaluation of the quality of video transmitted over a real or simulated communication network. Besides measuring QoS parameters of the underlying network, like loss rates, delays, and jitter, we also support a subjective video quality evaluation of the received video based on a frame-by-frame PSNR calculation. The tool-set has a modular construction, making it possible to exchange both the network and the codec. We present here its application to MPEG-4 as an example. EvalVid is targeted at researchers who want to evaluate their network designs or setups in terms of user-perceived video quality. The tool-set is publicly available [11].
1 Introduction
Recently, more and more telecommunication systems have come to support different kinds of real-time transmission, with video transmission being one of the most important applications. This increasing deployment makes the quality of the supported video a major issue. Surprisingly enough, although an impressive number of papers have been devoted to mechanisms supporting QoS in different types of networks, much less has been done to support a unified, comparable assessment of the quality actually achieved by the individual approaches. In fact, many researchers confine themselves to proving that the mechanism under study is able to reduce the packet loss rate, packet delay or packet jitter, considering those measures sufficient to characterize the quality of the resulting video transmission. It is, however, well known that the above-mentioned parameters cannot easily and uniquely be transformed into a quality of the video transmission: such a transformation could be different for every coding scheme, loss concealment scheme and delay/jitter handling. Publicly available tools for video quality evaluation often assume synchronized frames at the sender and the receiver side, which means they cannot calculate the video quality in the case of frame drops or frame decoding errors. Examples are the JNDmetrixIQ software [4] and the AQUAVIT project [5]. Such tools are not meant for evaluation of incompletely received videos; they are only applicable to videos where every frame could be decoded at the receiver side. Other researchers occupied with video quality
¹ This work has been partially supported by the German research funding agency Deutsche Forschungsgemeinschaft under the program "Adaptability in Heterogeneous Communication Networks with Wireless Access" (AKOM).
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 255–272, 2003. c Springer-Verlag Berlin Heidelberg 2003
evaluation of transmission-distorted video, e.g., [20, 21], did not make their software publicly available. To the best knowledge of the authors there is no free tool-set available which satisfies the above-mentioned requirements. In this paper we introduce EvalVid, a framework and a toolkit for a unified assessment of the quality of video transmission. EvalVid has a modular structure, making it possible to exchange at the user's discretion both the underlying transmission system and the codecs, so it is applicable to any kind of coding scheme and may be used both in real experimental set-ups and in simulation experiments. The tools are implemented in pure ISO-C for maximum portability. All interactions with the network are done via two trace files, so it is very easy to integrate EvalVid into any environment. The paper is structured as follows: we start with an overview of the whole framework in Section 2, followed by an explanation of the scope of the supported functionality in Section 3, together with the major design decisions. Afterwards the individual tools are described in more detail (Section 4). Exemplary results and a short outline of the usability and further research issues complete the paper.
2 Framework and Design
In Figure 1 the structure of the EvalVid framework is shown. The interactions between the implemented tools and the data flows are also symbolized. Section 3 explains what can be calculated, and Section 4 shows how it is done and which results can be obtained.

[Figure: source video → encoder → VS (video trace, sender trace) → network (or simulation, loss/delay) → play-out buffer → decoder → user; EvalVid API / tcpdump at sender and receiver side; further elements: receiver trace, coded video, reconstructed erroneous video, FV, raw YUV video (sender) and reconstructed raw YUV video (receiver), PSNR, MOS; results: frame loss / frame jitter, user perceived quality]
Fig. 1. Scheme of evaluation framework
Also in Figure 1, a complete transmission of a digital video is symbolized: from the recording at the source, through encoding, packetization, transmission over the network,
jitter reduction by the play-out buffer, decoding, and display for the user. Furthermore, the points where data are tapped from the transmission flow are marked. This information is stored in various files, which are used to gather the desired results, e.g., loss rates, jitter, and video quality. A lot of information is required to calculate these values. The required data are, from the sender side:
– the raw uncompressed video
– the encoded video
– the time-stamp and type of every packet sent
and from the receiver side:
– the time-stamp and type of every packet received
– the reassembled encoded video (possibly erroneous)
– the raw uncompressed video to be displayed
The evaluation of these data is done on the sender side, so the information from the receiver has to be transported back to the sender. Of practical concern is that the raw uncompressed video can be very large, for instance 680 MB for a 3-minute PDA-screen-sized video. On the other hand, it is possible to reconstruct the video to be displayed from the information available at the sender side. The only additional information required from the receiver side is the file containing the time-stamps of every received packet. This is much more convenient than the transmission of the complete (erroneous and decoded) video files from the receiver side. The processing of the data takes place in three stages. The first stage requires the time-stamps from both sides and the packet types. The results of this stage are the frame-type-based loss rates and the inter-packet times. Furthermore, the erroneous video file at the receiver side is reconstructed using the original encoded video file and the packet loss information. This video can now be decoded, yielding the raw video frames which would be displayed to the user. At this point a common problem of video quality evaluation comes up: video quality metrics always require the comparison of the displayed (possibly distorted) frame with the corresponding original frame.
In the case of completely lost frames, the required synchronization cannot be kept up (see Section 4.4 for further explanations). The second stage of the processing provides a solution to this problem. Based on the loss information, frame synchronization is recovered by inserting the last displayed frame for every lost frame. This makes further quality assessment possible. The raw video file fixed in this way and the original raw video file are used in the last stage to obtain the video quality. The boxes in Figure 1 named VS, ET, FV, PSNR and MOS are the programs of which the framework actually consists (see Section 4). Interactions between the tools and the network (which is considered a black box) are based on trace files. These files contain all necessary data. The only file that must be provided by the user of EvalVid is the "receiver trace file". If the network is a real link, this is achieved with the help of tcpdump (for details see Section 4, too). If the network is simulated, this file must be produced by the receiver entity of the simulation. This is explained in the documentation [11].
For the tools within EvalVid, only these trace files, the original video file, and the decoder are needed. Therefore, in the context of EvalVid the network is just a "black box" which generates delay, loss, and possibly packet reordering. It can be a real link, such as Ethernet or WLAN, or a simulation or emulation of a network. Since the only interaction between EvalVid and the network is represented by the two trace files (sender and receiver), the network box can be easily replaced, which makes EvalVid very flexible. Similarly, the video codec can also be easily replaced.
3 Supported Functionalities

In this section the parameters calculated by the tools of EvalVid are described; formal definitions and references to deeper discussions of the matter, particularly for video quality assessment, are given.

3.1 Determination of Packet and Frame Loss
Packet loss. Packet losses are usually calculated on the basis of packet identifiers. Consequently, the network black box has to provide unique packet id's. This is not a problem for simulations, since unique id's can be generated fairly easily. In measurements, packet id's are often taken from IP, which provides a unique packet id. The unique packet id is also used to cancel the effect of reordering. In the context of video transmission it is not only interesting how many packets got lost, but also which kind of data is in the packets. E.g., the MPEG-4 codec defines four different types of frames (I, P, B, S) and also some generic headers. For details see the MPEG-4 standard [10]. Since it is very important for video transmissions which kind of data gets lost (or not), it is necessary to distinguish between the different kinds of packets; evaluation of packet losses should therefore be done per type (frame type, header). Packet loss is defined in Equation 1 and is expressed in percent.

    PL_T = 100 · (1 − n^T_recv / n^T_sent)    (1)

where:
    T : type of data in packet (one of all, header, I, P, B, S)
    n^T_sent : number of type T packets sent
    n^T_recv : number of type T packets received

Frame loss. A video frame (actually being a single coded image) can be relatively big, not only in the case of variable bit rate videos, but also in constant bit rate videos, since the term "constant" applies to a short-time average. I-frames are often considerably larger than the target (short-time average) constant bit rate even in "CBR" videos (Figure 2). It is possible and likely that some or possibly all frames are bigger than the maximum transfer unit (MTU) of the network. This is the maximum packet size supported by the network (e.g., Ethernet = 1500 bytes and 802.11b WLAN = 2312 bytes). These frames have to
[Figure: "Examples of MPEG-4 CBR"; frame sizes in kB/s (0–700) over frame number (1–101), with peaks at the I-frames]
Fig. 2. CBR MPEG-4 video at target bit rate 200 kbps
be segmented into smaller packets to fit the network MTU. This possible segmenting of frames introduces a problem for the calculation of frame losses. In principle the frame loss rate can be derived from the packet loss rate (packet always means IP packet here). But this process depends somewhat on the capabilities of the actual video decoder in use, because some decoders can process a frame even if some parts are missing and some cannot. Furthermore, whether a frame can be decoded depends on which of its packets got lost. If the first packet is missing, the frame can almost never be decoded. Thus, the capabilities of certain decoders have to be taken into account in order to calculate the frame loss rate. It is calculated separately for each frame type and defined in Equation 2.

    FL_T = 100 · (1 − n^T_recv / n^T_sent)    (2)

where:
    T : type of frame (one of all, header, I, P, B, S)
    n^T_sent : number of type T frames sent
    n^T_recv : number of type T frames received

Determination of Delay and Jitter. In video transmission systems not only the actual loss is important for the perceived video quality, but also the delay of frames and the variation of the delay, usually referred to as frame jitter. Digital videos always consist of frames which have to be displayed at a constant rate. Displaying a frame before or after the defined time results in "jerkiness" [20]. This issue is addressed by so-called play-out buffers. These buffers have the purpose of absorbing the jitter introduced by network delivery delays. It is obvious that a big enough play-out buffer can compensate for any amount of jitter. In the extreme case the buffer is as big as the entire video, and displaying does not start until the last frame is received. This would eliminate any possible jitter at the cost of an additional delay of the entire transmission time. The other extreme would be a buffer capable of holding exactly one frame. In this case no jitter at all can be eliminated, but no additional delay is introduced. Sophisticated techniques have been developed for optimized play-out buffers dealing with this particular trade-off [17]. These techniques are not within the scope
of the described framework. The play-out buffer size is merely a parameter for the evaluation process (Section 4.3). This currently restricts the framework to static play-out buffers. However, because of the integration of play-out buffer strategies into the evaluation process, the additional loss caused by play-out buffer over- or under-runs can be considered. The formal definition of jitter as used in this paper is given by Equations 3, 4 and 5. It is the variance of the inter-packet or inter-frame time. The "frame time" is determined by the time at which the last segment of a segmented frame is received.

inter-packet time:    it^P_0 = 0,    it^P_n = t^P_n − t^P_{n−1}    (3)

where:
    t^P_n : time-stamp of packet number n

inter-frame time:    it^F_0 = 0,    it^F_m = t^F_m − t^F_{m−1}

where:
    t^F_m : time-stamp of the last segment of frame number m

packet jitter:    j_P = (1/N) Σ_{i=1..N} (it^P_i − it̄_N)²    (4)

where:
    N : number of packets
    it̄_N : average of the inter-packet times

frame jitter:    j_F = (1/M) Σ_{i=1..M} (it^F_i − it̄_M)²    (5)

where:
    M : number of frames
    it̄_M : average of the inter-frame times
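Equations 3 and 4 can be sketched in a few lines of Python (a minimal illustration, not part of the EvalVid distribution; integer millisecond time-stamps are assumed for clarity):

```python
def inter_packet_times(timestamps):
    """it_0 = 0, it_n = t_n - t_(n-1) (Equation 3)."""
    its = [0.0]
    for prev, cur in zip(timestamps, timestamps[1:]):
        its.append(cur - prev)
    return its

def jitter(its):
    """Variance of the inter-packet (or inter-frame) times (Equations 4/5)."""
    n = len(its)
    mean = sum(its) / n
    return sum((it - mean) ** 2 for it in its) / n

# hypothetical receiver time-stamps in ms
ts = [0, 40, 90, 120, 160]
print(jitter(inter_packet_times(ts)))
```

The same `jitter` function applies to inter-frame times (Equation 5) when fed with the per-frame time-stamps of the last segment of each frame.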
For statistical purposes, histograms of the inter-packet and inter-frame times are also calculated by the tools of the framework (see Section 4.3).

3.2 Video Quality Evaluation

Digital video quality measurements must be based on the perceived quality of the actual video being received by the users of the digital video system, because the impression of the user is what counts in the end. There are basically two approaches to measuring digital video quality, namely subjective quality measures and objective quality measures. Subjective quality metrics always grasp the crucial factor, the impression of the user watching the video, but they are extremely costly: highly time-consuming, with high manpower requirements and special equipment needed. Such subjective methods are described in detail by the ITU [3, 15], ANSI [18, 19] and MPEG [9]. The human quality
Table 1. ITU-R quality and impairment scale

Scale  Quality    Impairment
5      Excellent  Imperceptible
4      Good       Perceptible, but not annoying
3      Fair       Slightly annoying
2      Poor       Annoying
1      Bad        Very annoying
impression is usually given on a scale from 5 (best) to 1 (worst), as in Table 1. This scale is called the Mean Opinion Score (MOS). Many tasks in industry and research require automated methods to evaluate video quality, since the expensive and complex subjective tests can often not be afforded. Therefore, objective metrics have been developed to emulate the quality impression of the human visual system (HVS). In [20] there is an exhaustive discussion of various objective metrics and their performance compared to subjective tests. However, the most widespread method is the calculation of the peak signal-to-noise ratio (PSNR) image by image. It is a derivative of the well-known signal-to-noise ratio (SNR), which compares the signal energy to the error energy. The PSNR compares the maximum possible signal energy to the noise energy, which has been shown to result in a higher correlation with the subjective quality perception than the conventional SNR [6]. Equation 6 is the definition of the PSNR between the luminance component Y of source image S and destination image D.

    PSNR(n)_dB = 20 log_10 [ V_peak / sqrt( (1/(N_col·N_row)) Σ_{i=0..N_col} Σ_{j=0..N_row} [Y_S(n,i,j) − Y_D(n,i,j)]² ) ]    (6)

    V_peak = 2^k − 1,  k = number of bits per pixel (luminance component)

The expression under the fraction stroke is nothing but the mean square error (MSE). Thus, the formula for the PSNR can be abbreviated as PSNR = 20 log_10 (V_peak / √MSE), see [16]. Since the PSNR is calculated frame by frame, it can be inconvenient when applied to videos consisting of several hundred or thousand frames. Furthermore, people are often interested in the distortion introduced by the network alone, so they want to compare the received (possibly distorted) video with the undistorted² video sent. This can be done by comparing the PSNR of the encoded video with the received video frame by frame, or by comparing their averages and standard deviations.

² Actually, there is always the distortion caused by the encoding process, but this distortion also exists in the received video.
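As a minimal sketch (illustrative only, assuming 8-bit luminance planes given as 2-D lists), Equation 6 can be computed as follows:

```python
import math

def psnr(y_src, y_dst, bits=8):
    """PSNR of one frame per Equation 6; y_src, y_dst are equally sized
    2-D lists of luminance values."""
    v_peak = 2 ** bits - 1
    rows, cols = len(y_src), len(y_src[0])
    mse = sum((y_src[i][j] - y_dst[i][j]) ** 2
              for i in range(rows) for j in range(cols)) / (rows * cols)
    if mse == 0:   # binary-equivalent frames: PSNR is undefined (infinite)
        return float("inf")
    return 20 * math.log10(v_peak / math.sqrt(mse))

# e.g. a 2x2 frame with two slightly distorted pixels -> roughly 37 dB
print(psnr([[255, 0], [128, 64]], [[250, 5], [128, 64]]))
```

The guard for an MSE of zero reflects the problem discussed in Section 4.5: PSNR cannot be calculated for two binary-equivalent images.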
Another possibility is to calculate the MOS first (see Table 2) and then calculate the percentage of frames with a MOS worse than that of the sent (undistorted) video. This method has the advantage of clearly showing, at a glance, the distortion caused by the network. In Section 4 you can see an example produced with the MOS tool of EvalVid. Further results gained using EvalVid are shown briefly in Section 5.

Table 2. Possible PSNR to MOS conversion [14]

PSNR [dB]  MOS
> 37       5 (Excellent)
31 - 37    4 (Good)
25 - 31    3 (Fair)
20 - 25    2 (Poor)
< 20       1 (Bad)
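The mapping of Table 2 can be sketched as a small function (illustrative only; the assignment of the exact boundary values, e.g. whether 37 dB maps to grade 4 or 5, is our assumption, as the table does not specify it):

```python
def psnr_to_mos(psnr_db):
    """Map a per-frame PSNR value [dB] to a MOS grade per Table 2."""
    if psnr_db > 37:
        return 5   # Excellent
    if psnr_db > 31:
        return 4   # Good
    if psnr_db > 25:
        return 3   # Fair
    if psnr_db > 20:
        return 2   # Poor
    return 1       # Bad
```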
4 Tools

This section introduces the tools of the EvalVid framework, describes their purpose and usage, and shows examples of the results attained. Furthermore, sources of sample video files and codecs are given.

4.1 Files and Data Structures
At first a video source is needed. Raw (uncoded) video files are usually stored in the YUV format, since this is the preferred input format of many available video encoders. Such files can be obtained from different sources, as can free MPEG-4 codecs. Sample videos can also be obtained from the author. Once encoded video files (bit streams) exist, trace files are produced from them. These trace files contain all relevant information for the tools of EvalVid to obtain the results discussed in Section 3. The evaluation tools provide routines to read and write these trace files and use a central data structure containing all the information needed to produce the desired results. The exact format of the trace files, the usage of the routines, and the definition of the central data structure are described briefly in the next section and in detail in the documentation [11].

4.2 VS – Video Sender

For MPEG-4 video files, a parser was developed based on the MPEG-4 video standard [10]; the simple profile and the advanced simple profile are implemented. This makes it possible to read any MPEG-4 video file produced by a conforming encoder. The purpose of VS is to generate a trace file from the encoded video file. Optionally, the video file can be transmitted via UDP (if the investigated system is a network set-up). The results produced
Table 3. The relevant data contained in the video trace file are the frame number, the frame type and size, and the number of segments in case of (optional) frame segmentation. The time in the last column is only informative when transmitting the video over UDP, so that one can see during transmission whether all runs as expected (the time should reflect the frame rate of the video, e.g., 40 ms at 25 Hz).

Format of video trace file:
Frame Number  Frame Type  Frame Size  Number of UDP-packets  Sender Time
0             H           24          1 segm                 40 ms
1             I           9379        10 segm                80 ms
2             P           2549        3 segm                 120 ms
3             B           550         1 segm                 160 ms
...
Table 4. The relevant data contained in the sender trace file are the time stamp, the packet id, and the packet size. This file is generated separately because it can be obtained by other tools as well (e.g., tcpdump, see documentation).

Format of sender trace file:
time stamp [s]     packet id  payload size
1029710404.014760  id 48946   udp 24
1029710404.048304  id 48947   udp 1024
1029710404.048376  id 48948   udp 1024
...
by VS are two trace files containing information about every frame in the video file and every packet generated for transmission (Tables 3 and 4). These two trace files together represent a complete video transmission (at the sender side) and contain all information needed for further evaluations by EvalVid. With VS you can generate these coupled trace files for different video files and with different packet sizes, which can then be fed into the network black box (e.g., a simulation). This is done with the help of the input routines and data structures provided by EvalVid, which are described in the documentation. The network then causes delay and possibly loss and re-ordering of packets. At the receiver side another trace, the receiver trace file, is generated, either with the help of the output routines of EvalVid or, in the case of a real transmission, simply by tcpdump (4.7), which produces trace files compatible with EvalVid. It is worth noting that although the IP layer will segment UDP packets exceeding the MTU of the underlying layers and will try to reassemble them at the receiving side, it is much better to do the segmenting oneself: if one segment (IP fragment) is missing, the whole packet (UDP) is considered lost. Since it is preferable to get the rest of the segments of the packet, I would strongly recommend using the optional MTU segmentation function of VS, if possible.
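The segmentation of a frame into MTU-sized packets can be sketched as follows (a hypothetical illustration of the principle; the actual VS implementation in C may differ):

```python
def segment_frame(frame_size, mtu):
    """Payload sizes of the UDP packets carrying one frame of
    frame_size bytes, each payload at most mtu bytes."""
    full, rest = divmod(frame_size, mtu)
    return [mtu] * full + ([rest] if rest else [])

# e.g. the 9379-byte I-frame of Table 3 with 1024-byte payloads
print(segment_frame(9379, 1024))   # 10 segments, matching Table 3
```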
4.3 ET – Evaluate Traces
The heart of the evaluation framework is a program called ET (evaluate traces). Here the actual calculation of packet and frame losses and of delay/jitter takes place. For the calculation of these data only the three trace files are required, since they include all the necessary information (see Section 4.2) to perform the loss and jitter calculation, even frame/packet-type based. The calculation of loss is quite easy, considering the availability of unique packet id's. With the help of the video trace file, every packet gets assigned a type. Every packet of this type not included in the receiver trace is counted as lost. The type-based loss rates are calculated according to Equation 1. Frame losses are calculated by checking, for every frame, whether one of its segments (packets) got lost, and which one. If the first segment of the frame is among the lost segments, the frame is counted as lost, because the video decoder cannot decode a frame whose first part is missing. The type-based frame loss is calculated according to Equation 2. This is a sample output of ET for losses (a video transmission of 4498 frames in 8301 packets):

PACKET LOSS
H:      1     0   0.0%
I:   2825     3   0.1%
P:   2210    45   2.0%
B:   3266   166   5.1%
ALL: 8302   214   2.6%

FRAME LOSS
H:      1     0   0.0%
I:    375     3   0.8%
P:   1125    45   4.0%
B:   2998   166   5.5%
ALL: 4499   214   4.8%
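The loss calculation just described can be sketched as follows (illustrative only; the trace-file representation as a dict and a set is our simplification of ET's actual C data structures):

```python
def packet_loss(sent, received, ptype):
    """Equation 1: percentage of lost packets of type ptype.
    sent: packet id -> (type, frame number); received: set of packet ids."""
    ids = [pid for pid, (t, _) in sent.items() if t == ptype]
    lost = sum(1 for pid in ids if pid not in received)
    return 100.0 * lost / len(ids)

def frame_loss(sent, received):
    """Equation 2 (over all types): a frame counts as lost if its
    first packet is missing, as described above."""
    first = {}                       # frame number -> id of its first packet
    for pid in sorted(sent):
        first.setdefault(sent[pid][1], pid)
    lost = sum(1 for pid in first.values() if pid not in received)
    return 100.0 * lost / len(first)
```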
The calculation of inter-packet times is done using Equations 3 and 4. Yet, in the case of packet losses these formulas cannot be applied offhand, because no time-stamp is available in the receiver trace file for the lost packets. This raises the question of how the inter-packet time is calculated if at least one of two consecutive packets is lost. One possibility would be to set the inter-packet time in the case of the lost packet to an "error" value, e.g., 0. If a packet is then actually received, one could search backwards until a valid value is found. The inter-packet time in this case would be t_n − t_{last received packet}. This has the disadvantage of not getting a value for every packet, and inter-packet times could grow unreasonably big. That is why the approach used by ET is slightly different. If at least one of the two packets used in a calculation is missing, no invalid value is generated; rather, a value is "guessed" by calculating a supposed arrival time of the lost packet, as given by Equation 7. This practically means that for lost packets the expected value of the sender inter-packet time is used. If relatively few packets get lost, this method does not have a significant impact on the jitter statistics. On the other hand, if there are very high loss rates, we recommend another possibility: to calculate only pairwise received packets and count lost packets separately.
arrival time (lost packet):    t^R_n = t^R_{n−1} + (t^S_n − t^S_{n−1})    (7)

where:
    t^S_n : time-stamp of sent packet number n
    t^R_n : time-stamp of the (not) received packet number n
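Filling in the missing receiver time-stamps per Equation 7 can be sketched as (illustrative only; integer millisecond time-stamps and a received first packet are assumed):

```python
def fill_arrival_times(sent_ts, recv_ts):
    """Complete the receiver time-stamps per Equation 7. recv_ts holds
    the arrival time of each packet, or None if it was lost; packet 0
    is assumed to be received."""
    filled = [recv_ts[0]]
    for n in range(1, len(sent_ts)):
        if recv_ts[n] is not None:
            filled.append(recv_ts[n])
        else:  # t_R(n) = t_R(n-1) + (t_S(n) - t_S(n-1))
            filled.append(filled[-1] + sent_ts[n] - sent_ts[n - 1])
    return filled
```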
Now, having a valid time-stamp for every packet, the inter-packet (and, based on this, the inter-frame) delay can be calculated according to Equation 3. Figure 3 shows an example of the inter-frame times calculated by ET.

[Figure: inter-frame delay [ms] (0–350) over frame number (1–901)]
Fig. 3. Example inter-packet times (same video transmission as used for loss calculation)
ET can also take into account the possibility of the existence of certain time bounds. If there is a play-out buffer implemented at the receiving network entity, this buffer will run empty if no frame arrives for a certain time, the maximum play-out buffer "size". Objective video quality metrics like PSNR cannot take delay or jitter into account. However, an empty (or full) play-out buffer effectively causes loss (no frame there to be displayed). The maximum play-out buffer size can thus be used to "convert" delay into loss. With ET you can do this by providing the maximum play-out buffer size as a parameter. The matching of delay to loss is then done as follows:

MAX = maximum play-out buffer size
new_arrival_time(0) := orig_arrival_time(0);
FOREACH frame m
  IF (m is lost)
    -> new_arrival_time(m) := new_arrival_time(m-1) + MAX
  ELSE IF (inter-frame_time(m) > MAX)
    -> frame is marked lost
    -> new_arrival_time(m) := new_arrival_time(m-1) + MAX
  ELSE
    -> new_arrival_time(m) := new_arrival_time(m-1) +
       (orig_arrival_time(m) - orig_arrival_time(m-1));
  END IF
END FOREACH

Another task ET performs is the generation of a corrupted (due to losses) video file. This corrupted file is needed later to perform the end-to-end video quality assessment.
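The delay-to-loss matching of the pseudocode above can be sketched in Python (illustrative only; function and variable names are ours, and per-frame arrival times in a common time unit are assumed):

```python
def playout(orig_arrival, lost, max_buf):
    """Mark frames whose inter-frame time exceeds the maximum play-out
    buffer size max_buf as lost. orig_arrival: (possibly guessed)
    arrival time per frame; lost: set of frames already lost."""
    new_arrival = [orig_arrival[0]]
    all_lost = set(lost)
    for m in range(1, len(orig_arrival)):
        if m in lost:
            new_arrival.append(new_arrival[-1] + max_buf)
        elif orig_arrival[m] - orig_arrival[m - 1] > max_buf:
            all_lost.add(m)           # buffer under-run: frame arrives too late
            new_arrival.append(new_arrival[-1] + max_buf)
        else:
            new_arrival.append(new_arrival[-1]
                               + orig_arrival[m] - orig_arrival[m - 1])
    return new_arrival, all_lost
```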
Thus another file is needed as input for ET, namely the original encoded video file. In principle the generation of the corrupted video is done by copying the original video packet by packet, where lost packets are omitted. One has to pay attention to the actual error handling capabilities of the video decoder in use. It is possible that the decoder expects special markings in the case of missing data, e.g., special code words or simply an empty (zero-filled) buffer instead of a missing packet. You must check the documentation of the video codec you want to use.
4.4 FV – Fix Video

Digital video quality assessment is performed frame by frame. That means that exactly as many frames are needed at the receiver side as at the sender side. This raises the question of how lost frames should be treated if the decoder does not generate "empty" frames for lost frames³. The FV tool is only needed if the codec used cannot provide lost frames. How lost frames are handled by FV is described later in this section. First, some explanations of video formats may be required; you can skip these parts if you are already familiar with them.
Raw video formats. Digital video is a sequence of images. No matter how this sequence is encoded, whether only by exploiting spatial redundancy (like Motion-JPEG, which actually is a sequence of JPEG-encoded images) or by also taking advantage of temporal redundancy (as in MPEG or H.263), in the end every video codec generates a sequence of raw images (pixel by pixel) which can then be displayed. Normally such a raw image is just a two-dimensional array of pixels, each pixel given by three color values, one each for the red, green, and blue components of its color. In video coding, however, pixels are not given by the three primary colors, but rather as a combination of one luminance and two chrominance values. Both representations can be converted back and forth (Equation 8) and are therefore exactly equivalent. It has been shown that the human eye is much more sensitive to the luminance than to the chrominance components of a picture. That is why in video coding the luminance component is calculated for every pixel, while the two chrominance components are often averaged over four pixels. This halves the amount of data transmitted for every pixel in comparison to the RGB scheme. There are other variants of this so-called YUV coding; for details see [10].
Y = 0.299 R + 0.587 G + 0.114 B
U = 0.565 (B − Y)
V = 0.713 (R − Y)    (8)
³ This is a quality-of-implementation issue of the video decoder. Because of the time stamps available in the MPEG stream, a decoder could figure out if one or more frames are missing between two received frames.
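Equation 8 and its inverse can be sketched per pixel (illustrative only; the scaling and offsets of particular file formats are ignored, and values are treated as floats):

```python
def rgb_to_yuv(r, g, b):
    """Forward conversion of Equation 8 for a single pixel."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.565 * (b - y), 0.713 * (r - y)

def yuv_to_rgb(y, u, v):
    """Inverse conversion for a single pixel."""
    return y + 1.403 * v, y - 0.344 * u - 0.714 * v, y + 1.770 * u
```

A round trip through both functions recovers the original RGB values up to small rounding differences, confirming that the representations are equivalent.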
R = Y + 1.403 V
G = Y − 0.344 U − 0.714 V
B = Y + 1.770 U

The decoding process of most video decoders results in raw video files in the YUV format. The MPEG-4 decoder which I mostly use writes YUV files in the 4:2:0 format.

Decode and display order. The MPEG standard basically defines three types of frames, namely I, P and B frames. I frames contain an entire image, which can be decoded independently; only spatial redundancy is exploited. I frames are intra-coded frames. P frames are predicted frames; they contain intra-coded parts as well as motion vectors which are calculated in dependence on previous (I or P) frames. P frame coding exploits both spatial and temporal redundancy. These frames can only be completely decoded if the previous I or P frame is available. B frames are coded exclusively in dependence on previous and successive (I or P) frames. B frames only exploit temporal redundancy. They can be decoded completely only if the previous and the successive I or P frame are available. That is why MPEG reorders the frames before transmission, so that any frame received can be decoded immediately, see Table 5.

Table 5. MPEG decode and display frame ordering
Decode order 2 3 1 5 6 4
Because of this reordering issue, a coded frame does not correspond to the decoded (YUV) frame with the same number. FV fixes this issue, by matching display (YUV) frames to transmission (coded) frames according to Table 5. There are more possible coding schemes than the one shown in this table (e.g. schemes without B frames, with only one B frame in between or with more than two B frames between two I (or P) frames), but the principle of reordering is always the same. Handling of missing frames. Another issue fixed by FV is the possible mismatch of the number of decoded to the original number of frames caused by losses. A mismatch would make quality assessment impossible. A decent decoder can decode every frame, which was partly received. Some decoders refuse to decode parts of frames or to decode B frames, where one of the frames misses from which it was derived. Knowing the handling of missing or corrupted frames by the decoder in use, FV can be tuned to fix
the handling weaknesses of the decoder. The fixing always consists of inserting missing frames, and there are two possibilities for doing so. The first is to insert an "empty" frame for every frame not decoded (for whatever reason). An empty frame is a frame containing no information; it will cause certain decoders to display a black (or white) picture. This is not a clever approach, because of the usually small differences between two consecutive video frames. So FV uses the second possibility, which is the insertion of the last decoded frame instead of an empty frame in the case of a decoder frame loss. This handling has the further advantage of matching the behaviour of a real-world video player.

4.5 PSNR – Quality Assessment
The PSNR is the basis of the quality metric used in the framework to assess the resulting video quality. Considering the preparations by the preliminary components of the framework, the calculation of the PSNR itself is now a simple process described by Equation 6. It must be noted, however, that the PSNR cannot be calculated if two images are binary equivalent: the mean square error is then zero, and Equation 6 is undefined. Usually this is solved by calculating the PSNR between the original raw video file before the encoding process and the received video. This ensures that there will always be a difference between two raw images, since all modern video codecs are lossy.
[Figure: two PSNR [dB] time series plotted over the frame number, one for low losses and one for very high losses.]
Fig. 4. Example of PSNR (same video transmitted with few and with high losses)
Almost all authors who use PSNR use only the luminance component of the video (see Section 3). This is not surprising, considering the relevance of the Y component for the HVS (Section 3.2). Figure 4 shows two example PSNR time series. Metrics other than
EvalVid – A Framework for Video Transmission and Quality Evaluation
269
PSNR can be used; in that case, the desired video quality assessment software, e.g., [20], [2], or [4], must replace the PSNR/MOS modules.
4.6 MOS Calculation
Since the PSNR time series are not very concise, an additional metric is provided. The PSNR of every single frame is mapped to the MOS scale of Table 1 as described in Section 3.2. This leaves only five grades, and the number of frames in each grade is counted. The result can easily be compared with the fractions of graded frames of the original video, as pictured in Figure 5. The rightmost bar displays the quality of the original video as a reference, “few losses” means an average packet loss rate of 5%, and the leftmost bar shows the video quality of a transmission with a packet loss rate of 25%. Figure 5 shows the same video transmissions as Figure 4.
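The mapping and counting step can be sketched as follows; the PSNR-to-MOS thresholds below are placeholders standing in for the actual values of Table 1, which is not reproduced here:

```c
#include <stddef.h>

/* Map a per-frame PSNR value [dB] to a MOS grade (1 = bad ... 5 = excellent).
   The threshold values are assumptions standing in for Table 1 of the paper. */
int psnr_to_mos(double psnr)
{
    if (psnr > 37.0) return 5;   /* excellent */
    if (psnr > 31.0) return 4;   /* good */
    if (psnr > 25.0) return 3;   /* fair */
    if (psnr > 20.0) return 2;   /* poor */
    return 1;                    /* bad */
}

/* Count how many frames fall into each of the five grades;
   hist[g-1] afterwards holds the number of frames graded g. */
void mos_histogram(const double *psnr, size_t nframes, size_t hist[5])
{
    for (int g = 0; g < 5; g++) hist[g] = 0;
    for (size_t i = 0; i < nframes; i++)
        hist[psnr_to_mos(psnr[i]) - 1]++;
}
```

The five resulting counts, normalised by the number of frames, give exactly the per-grade fractions plotted in Figure 5.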
[Figure: stacked bar chart showing, for “high losses”, “few losses”, and the lossless original, the fraction of frames in each MOS grade (5 excellent, 4 good, 3 fair, 2 poor, 1 bad).]
Fig. 5. Example of MOS graded video (same video transmissions as in Figure 4)
The impact of the network is immediately visible, and the performance of the network system can be expressed in terms of user-perceived quality. Figure 5 shows how close the quality of a certain video transmission comes to the maximum achievable video quality.
4.7 Required 3rd-Party Tools
The programs described above are available as ISO-C source code or pre-compiled binaries for Linux-i386 and Windows. To perform one’s own video quality evaluations, some software from other sources is still needed; its integration into the EvalVid framework is described in the documentation. To evaluate video transmission systems on a Unix system or on Windows, tcpdump or windump, respectively, is required. They can be obtained from:
– http://www.tcpdump.org
– http://windump.polito.it
Raw video files (losslessly coded videos) and a video encoder and decoder capable of decoding corrupted video streams are also needed. MPEG-4 codecs are available from:
– MPEG-4 Industry Forum (http://www.m4if.org/resources.php)
– MPEG (http://mpeg.nist.gov/)
5 Exemplary Results
This tool-set has been used to evaluate video quality for various simulations [1,12] and measurements [7], and has proved usable and quite stable. Exemplary results are shown here and described briefly. Figure 6 shows the result of the video quality assessment with EvalVid for a simulation of MPEG-4 video transmission over a wireless link using different scheduling policies and dropping deadlines. The figure shows the percentage of frames with each of the five MOS ratings; the rightmost bar shows the MOS rating of the original (without network loss) video. It can clearly be seen that the blind scheduling policy does not work very well, and that for the two other policies the video quality approaches the reference with increasing deadlines.
[Figure: stacked bar charts of the fraction of frames in each MOS grade (excellent, good, fair, poor, bad) for the Blind, Deadline, and Deadline Drop scheduling policies with deadlines of 10–50 ms, next to the loss-free reference.]
Fig. 6. Example of video quality evaluation (MOS scale) with EvalVid
Similarly, Figure 7 shows the enhancement of user satisfaction with increasing dropping deadlines and better scheduling schemes in a simulation of an OFDM system. The “user satisfaction” was calculated based on the MOS results obtained with EvalVid. The
bars in this figure show the number of users that could be supported with a certain mean MOS.
[Figure: 3-D bar chart of the number of satisfied users (0–10) per subcarrier assignment and semantic scheduling scheme (S/OFF, S/ON, D/OFF, D/ON) and per dropping deadline; the underlying data:]

deadline [ms]   S/OFF   S/ON   D/OFF   D/ON
100             2       3      4       6
175             3       4      5       8
250             4       5      6       9

Fig. 7. Example of video quality evaluation (number of satisfied users) with EvalVid
6 Conclusion and Topics for Further Research
The EvalVid framework can be used to evaluate the performance of network setups, or simulations thereof, with respect to user-perceived application quality. Furthermore, the calculation of delay, jitter, and loss is implemented. The tool-set currently supports MPEG-4 video streaming applications, but it can easily be extended to other video codecs or even other applications such as audio streaming. Certain quirks of common video decoders (omitting lost frames), which would make it impossible to calculate the resulting quality, are resolved. A PSNR-based quality metric is introduced which, especially for longer video sequences, is more concise than the traditionally used average PSNR. The tool-set has been implemented in ISO-C for maximum portability and is designed modularly, in order to be easily extensible with other applications and performance metrics. It was successfully tested under Windows, Linux, and Mac OS X. The tools of the EvalVid framework are continuously being extended to support other video codecs such as H.263, H.26L, and H.264, and to address additional codec functionalities such as fine-grained scalability (FGS) [13] and intra-frame resynchronisation. Furthermore, the support of dynamic play-out buffer strategies is a subject of future development. It is also planned to add support for other applications, e.g., voice over IP (VoIP) [8] and synchronised audio-video streaming. Last but not least, metrics other than PSNR-based ones will be integrated into the EvalVid framework.
References
[1] A. C. C. Aguiar, C. Hoene, J. Klaue, H. Karl, A. Wolisz, and H. Miesmer. Channel-aware schedulers for VoIP and MPEG-4 based on channel prediction. To be published at MoMuC, 2003.
[2] Johan Berts and Anders Persson. Objective and subjective quality assessment of compressed digital video sequences. Master's thesis, Chalmers University of Technology, 1998.
[3] ITU-R Recommendation BT.500-10. Methodology for the subjective assessment of the quality of television pictures, March 2000.
[4] Sarnoff Corporation. JNDmetrix-IQ software and JND: A human vision system model for objective picture quality measurements, 2002.
[5] EURESCOM Project P905-PF. AQUAVIT: Assessment of quality for audio-visual signals over Internet and UMTS, 2000.
[6] Lajos Hanzo, Peter J. Cherriman, and Juergen Streit. Wireless Video Communications. Digital & Mobile Communications. IEEE Press, 445 Hoes Lane, Piscataway, 2001.
[7] Daniel Hertrich. MPEG-4 video transmission in wireless LANs: Basic QoS support on the data link layer of 802.11b. Minor thesis, 2002.
[8] H. Sanneck, W. Mohr, L. Le, C. Hoene, and A. Wolisz. Quality of service support for voice over IP over wireless. In Wireless IP and Building the Mobile Internet, December 2002.
[9] ISO-IEC/JTC1/SC29/WG11. Evaluation methods and procedures for July MPEG-4 tests, 1996.
[10] ISO-IEC/JTC1/SC29/WG11. ISO/IEC 14496: Information technology: Coding of audio-visual objects, 2001.
[11] J. Klaue. EvalVid. http://www.tkn.tu-berlin.de/research/evalvid/fw.html.
[12] J. Klaue, J. Gross, H. Karl, and A. Wolisz. Semantic-aware link layer scheduling of MPEG-4 video streams in wireless systems. In Proc. of Applications and Services in Wireless Networks (AWSN), Bern, Switzerland, July 2003.
[13] Weiping Li. Overview of fine granularity scalability in MPEG-4 video standard. IEEE Transactions on Circuits and Systems for Video Technology, March 2001.
[14] Jens-Rainer Ohm. Bildsignalverarbeitung für Multimedia-Systeme (Image signal processing for multimedia systems). Lecture notes, 1999.
[15] ITU-T Recommendations P.910, P.920, P.930. Subjective video quality assessment methods for multimedia applications; interactive test methods for audiovisual communications; principles of a reference impairment system for video, 1996.
[16] Martyn J. Riley and Iain E. G. Richardson. Digital Video Communications. Artech House, 685 Canton Street, Norwood, 1997.
[17] Cormac J. Sreenan, Jyh-Cheng Chen, Prathima Agrawal, and B. Narendran. Delay reduction techniques for playout buffering. IEEE Transactions on Multimedia, 2(2):100–112, June 2000.
[18] ANSI T1.801.01/02-1996. Digital transport of video teleconferencing / video telephony signals. ANSI, 1996.
[19] ANSI T1.801.03-1996. Digital transport of one-way video signals: Parameters for objective performance assessment. ANSI, 1996.
[20] Stephen Wolf and Margaret Pinson. Video quality measurement techniques. Technical Report 02-392, U.S. Department of Commerce, NTIA, June 2002.
[21] D. Wu, Y. T. Hou, W. Zhu, H.-J. Lee, T. Chiang, Y.-Q. Zhang, and H. J. Chao. On end-to-end architecture for transporting MPEG-4 video over the Internet. IEEE Transactions on Circuits and Systems for Video Technology, 10(6):923–941, September 2000.
A Class-Based Least-Recently Used Caching Algorithm for World-Wide Web Proxies

Boudewijn R. Haverkort¹, Rachid El Abdouni Khayari², and Ramin Sadre³

¹ University of Twente, Department of Electrical Engineering, Mathematics and Computer Science, P.O. Box 217, 7500 AE Enschede, The Netherlands
² University of the Federal Armed Forces, Department of Computer Science, D-85577 Neubiberg, Germany
³ RWTH Aachen, Department of Computer Science, D-52056 Aachen, Germany
Abstract. In this paper we study and analyze the influence of caching strategies on the performance of WWW proxies. We propose a new strategy, class-based LRU, that is both recency- and size-based, with the ultimate aim of obtaining a well-balanced mixture of large and small documents in the cache, and hence good performance for both small and large object requests. To achieve this aim, the cache is partitioned into classes, each one assigned to a specific document-size range; within a class, the classical LRU strategy is applied. We show that class-based LRU obtains good results for both the hit rate and the byte hit rate, provided the sizes of the classes and the corresponding document-size ranges are well chosen. The latter is achieved by the use of a Bayesian decision rule and a characterisation of the requested object-size distribution. In doing so, class-based LRU is an adaptive strategy: a change in request patterns results, via a change in the distributions, in a change in cache partitioning and request classification. Finally, the complexity of class-based LRU is comparable to that of LRU and therefore smaller than that of its “competitors”.
1 Introduction
Today, the largest share of traffic in the internet originates from WWW requests. The increasing use of WWW-based services has led not only to highly frequented web servers but also to heavily used components of the internet. Fortunately, it is well known that there are often-visited sites, so that object caching can be employed to reduce internet traffic [1] and to decrease the perceived end-to-end delays. Properties of Web Caching. Web caching has some special properties which make it an interesting research area, different from traditional approaches to caching: (i) The sizes of the objects to be cached vary greatly. Thus, one
The research reported in this paper was performed while all the authors were at the RWTH Aachen, Germany. The first two authors have been supported by the German DFG, under contracts HA 2966/2-1 and HA 2966/2-2.
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 273–290, 2003. c Springer-Verlag Berlin Heidelberg 2003
cannot assume fixed-size “pages” as in main-memory caching. Additionally, objects may be of different types, thus influencing caching decisions; (ii) The costs to request particular objects vary greatly and are difficult to compute in advance. These costs not only differ per object; even for the same object they depend on the load of the origin server, its operational state, and the path/distance between client and server; (iii) Objects in the cache are read-only, hence no write-back mechanism is needed.
Cache Location. There are three possible locations for object caching. The locality in the requests of many clients can be exploited by client-side caching in order to reduce network traffic and response time. With proxy-server caching, clients must configure their browsers so that all HTTP requests are directed to a so-called proxy, located “between” the LAN (to which the clients are connected) and the internet. In doing so, external bandwidth is saved, however at the risk of the proxy server itself becoming a bottleneck. Primary web servers might cache objects in main memory, in order to reduce disk I/O. In doing so, no (internet) bandwidth is saved, and, since the retrieval time is often dominated by the network latency [2], we do not address this option further; moreover, the load on the disks seldom forms a bottleneck in a web server. Table 1 gives an overview of the pros and cons of the three caching locations.

Table 1. Pros and cons of web caching according to its location

at client and proxy:
  + reduces network traffic
  + reduces response time
  + decreases server workload
  − distorts server access statistics
  − danger of data inconsistency
at proxy (additionally):
  − cache miss extends response time
  − possible new bottleneck
at server:
  − does not reduce network traffic
  − does not (really) reduce response time
  − does not decrease server workload
  − higher server load
  + decreases response time in LAN
  + decreases I/O load

Analysis Methods. Cache performance heavily depends on the size of the provided cache and on the employed replacement strategy. To validate the performance of caching algorithms against these factors, one can use one of the following three methods. With a trace-driven simulation, client requests are recorded in a trace file, which is used as input to the simulation. Note that simulation time passes much faster than real time; in our studies, a trace comprising 54 days could be simulated in only a few minutes. In contrast to a (recorded) trace-driven simulation, traces could also be generated synthetically. However, one then runs the risk that some typical characteristics of WWW traces are not accounted for in the correct way. Finally, in a real implementation, the evaluation
is performed by observing a real client or server running with the (new) caching strategy. This is the most accurate, but also the most resource-consuming, method (which should only be performed after the other methods have yielded promising results).
Organisation of the Paper. We give an overview of existing web caching strategies in Section 2, before we present the new class-based LRU strategy in Section 3. The new strategy is evaluated in Section 4 (and, additionally, in Appendix B). Conclusions are drawn in Section 5.
2 Caching Strategies
Over the last few years, many well-known caching strategies have been evaluated; see for instance [3,4,5,6]. The aim of these strategies has been to improve the cache hit rate (defined as the percentage of requests that can be served from the cache), the cache byte hit rate (defined as the percentage of bytes that can be served from the cache), or, even better, both. At the center of all the approaches is the question of which object has to be replaced when a new object has to be stored and the cache is already completely filled. Below, we summarize some strategies that did well in the past (see also Table 2).
LRU (Least-Recently Used). This method removes the object whose last request lies longest in the past.
SLRU (Segmented LRU). This method has been developed for disk caching and is a refinement of standard LRU [7]. The cache is divided into two segments, called protected and unprotected. Insertion and deletion of objects are handled by applying the LRU strategy to the unprotected segment. However, when an object hit occurs in the unprotected segment, the object is moved to the protected segment. If there is not enough space in the protected segment, the least valuable object is moved out of it and inserted as the most valuable object into the unprotected segment. If there is not enough space in the unprotected segment, the least valuable object is removed from it. Throughout this paper, we employ SLRU with cache fractions of 10%, 60%, and 90% for the unprotected segment.
LRU-k. This method removes the object whose k-th most recent request lies longest in the past. LRU-k has been developed by O’Neil et al. for use in databases [8]. This strategy respects not only the request recency, but also the request frequency.
LFU (Least Frequently Used). This method removes the object with the fewest requests. If two or more objects have the same number of requests, a secondary strategy is necessary; often LRU is chosen.
Note that once-popular objects stay long in the cache, even when they are not requested anymore (leading to a so-called polluted cache).
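The recency bookkeeping underlying LRU can be illustrated with a minimal victim-selection sketch (names are our own; a real cache would keep a doubly linked list for O(1) updates rather than scanning linearly):

```c
#include <stddef.h>

/* Minimal LRU bookkeeping: each cached slot remembers the logical time
   of its last request; the victim is the slot whose last request lies
   longest in the past. The linear scan only illustrates the policy. */
struct slot { unsigned id; unsigned long last_used; };

size_t lru_victim(const struct slot *cache, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    return victim;
}
```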
LFU-Aging. This method is similar to LFU, but avoids polluted caches [9] by introducing two parameters. The first one limits the average number of requests over all objects in the cache. If the average exceeds this limit, all request counters are divided by two, thus causing inactive objects to lose their popularity. The second parameter limits the value of the request counters, thus controlling how long a formerly popular object will stay in the cache. It has been found that LFU-Aging performs better than simple LFU without being very sensitive to the values of the parameters [3].
LFF (Largest File First). This method removes the largest object from the cache. It has been developed by Williams et al. [4] especially for web proxy caches.
GDS (Greedy Dual-Size). This method has been developed by Cao and Irani [6] and assigns to each object a value of benefit, initially set to β = cost/size. Object removal takes place in the order of the smallest value of benefit. When an object is removed, its value is subtracted from all other object values in the cache. If an object gets a hit, its value is increased to its original value. To determine the cost-value of an object, various models have been proposed, of which we consider the following two: 1. GDS-Hit tries to maximize the hit rate and therefore sets cost = 1; thus, large objects have a small benefit and are removed faster than small ones. 2. GDS-Byte tries to maximize the byte hit rate, by setting cost = 2 + size/536. In doing so, one sets the costs roughly equal to the number of TCP segments that have to be transferred in case of a cache miss (the cache miss itself costs 2 extra units, and the size of a TCP segment is 536 bytes). Recently, a variant of GDS, called GD* or GDSP, was proposed by Jin and Bestavros [10]. Lindemann et al. studied the performance of web caching algorithms in relation to the type of object being cached, e.g., html, doc, etc. [11]. Although GD* seems to work best for image, html, and multimedia objects, when the variance in multimedia object size increases (which is, actually, quite likely) the advantages seem to diminish.
GDSF (GDS with Frequency). This method is a variant of GDS which also respects the request frequency φ when computing the value of benefit β = φ × cost/size. In this paper, we use GDSF with cost = 1, which yields the largest hit rate.
LUV (Least-Unified Value). In this method, recently proposed in [12], the “reference potential” of an object is estimated, using a function that relates the reference potential of an object to its most recent reference; an example of such a function is f(x) = (1/2)^(λx), with 0 ≤ λ < 1, and where x measures the time since the last reference. Various simulations show that both a good hit rate and a good byte hit rate can be attained. However, what remains a problem is the right choice of λ; no algorithm for that exists. The authors propose to have
λ depend, dynamically, on the recent past, but how this should be done is left for further study.

Table 2. Classification and origin of caching algorithms according to the employed object characteristics

strategy     size-based   recency-based   frequency-based   origin
FIFO
LFF          √                                              web caching
LRU                       √                                 memory caching
LUV                       √                                 web caching
LFU                                       √                 memory caching
SLRU                      √               √                 disk caching
LRU-k                     √               √                 database caching
LFU-Aging                 √               √                 web caching
GDS          √            √                                 web caching
GDSF         √            √               √                 web caching
Roughly speaking, comparisons of the various caching strategies indicate that size-based strategies tend to yield better results for the hit rate, whereas frequency-based strategies tend to improve the byte hit rate. No strategy has been recognized as the ultimate best one; rather, the choice of a good strategy depends on the characteristics of the considered workload. These considerations have led us to develop a workload-based caching strategy for web proxies, as discussed in the next section.
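As an illustration of the value-based strategies above, the GDS bookkeeping can be sketched as follows (a minimal illustration with our own naming; the subtraction variant of the aging step is used, as described in the text):

```c
#include <stddef.h>

/* Greedy Dual-Size: every object carries a benefit value, initially
   cost/size; the object with the smallest benefit is evicted and its
   benefit is subtracted from all remaining objects; on a hit, the
   benefit is restored to cost/size. */
struct gds_obj { double size; double cost; double benefit; };

void gds_insert(struct gds_obj *o) { o->benefit = o->cost / o->size; }
void gds_hit(struct gds_obj *o)    { o->benefit = o->cost / o->size; }

/* Returns the index of the evicted object and ages the survivors. */
size_t gds_evict(struct gds_obj *cache, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (cache[i].benefit < cache[victim].benefit)
            victim = i;
    double v = cache[victim].benefit;
    for (size_t i = 0; i < n; i++)
        cache[i].benefit -= v;      /* "aging" of the remaining objects */
    return victim;
}
```

With cost = 1 this behaves as GDS-Hit; setting cost = 2 + size/536 on insertion gives GDS-Byte.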
3 Class-Based LRU
In this section we introduce the basic idea of class-based LRU (C-LRU). After that, we present the scheme used to determine its required parameters. Since C-LRU allows for workload adaptation, we also discuss the adaptation frequency.
3.1 Basic Idea
The caching strategy C-LRU is a refinement of standard LRU. Its justification lies in the fact that object-size distributions in the WWW have a heavy tail: although small objects are more popular and are requested more frequently, large objects occur more often than was expected in the past, and therefore have a great impact on the perceived performance. In most caching methods, the object sizes are completely ignored, or either small or large objects are favored. However, upon hitting a large cached object, the byte hit rate increases strongly, but the hit rate only in a limited fashion. On the other hand, hitting many small cached objects increases the byte hit rate only in a limited way, but increases the hit rate substantially. Hence, both a
[Figure: a cache partitioned into classes by object size (e.g., class 1 holding ~5 KB objects, class 2 holding 10–20 KB objects, class 3 holding >100 KB objects, class 4 holding ~1 MB objects); a decision function assigns each new object to the partition of its class.]
Fig. 1. Principle of Class-Based LRU
high byte hit rate and a high hit rate can only be attained by creating a proper balance between large and small objects in the cache. Given the heavy-tailedness of the object-size distribution (see below), we should be careful to reserve cache space for small objects, that is, we should not allow large objects to fully utilise the cache. Similar considerations play a role in load-balancing approaches for web server clusters, such as EquiLoad [13]. With C-LRU, this is achieved by partitioning the cache into portions reserved for objects of a specific size, as follows (see Figure 1): (i) the available cache memory is divided into I partitions, where partition i (for i = 1, · · · , I) takes a specific fraction pi of the cache (0 < pi < 1, Σi pi = 1); (ii) partition i caches objects belonging to class i, where class i is defined to encompass all objects of size s with ri−1 ≤ s < ri (0 = r0 < r1 < · · · < rI−1 < rI = ∞); (iii) each partition in itself is managed with the LRU strategy. Thus, when an object is to be cached, its class has to be determined before it is passed to the corresponding partition. For this strategy to work, we need an approach to determine the values p1 , · · · , pI and r1 , · · · , rI ; this is addressed in the next section.
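The decision function of step (ii), which maps an object's size to its class before the object is passed to the corresponding LRU partition, can be sketched as a binary search over the precomputed boundaries (a minimal illustration; names are our own):

```c
#include <stddef.h>

/* C-LRU decision function: given the nclasses-1 ascending class
   boundaries r[0..nclasses-2], map an object size s to its (0-based)
   class index k, i.e. the k whose range contains s (sizes below r[0]
   are class 0, sizes at or above the last boundary land in the last
   class). Binary search, as suggested in the paper. */
size_t clru_class(double s, const double *r, size_t nclasses)
{
    size_t lo = 0, hi = nclasses - 1;      /* candidate class indices */
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (s < r[mid]) hi = mid;          /* r[mid] is the upper bound of class mid */
        else lo = mid + 1;
    }
    return lo;
}
```

For example, with boundaries r1 = 8, r2 = 64, r3 = 512, an object of size 100 lands in the third class.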
3.2 Determining the Fractions pi and the Boundaries ri
Object-size distributions. As has recently been shown, the object-size distribution of objects requested at proxy servers can very well be described with a hyper-exponential distribution¹; the parameters of such a hyper-exponential
¹ In fact, some case studies show that, for instance, a lognormal distribution also does a fine job in describing web object-size distributions; still, hyper-exponential distributions can be used to approximate such distributions as well.
distribution can be estimated easily with the EM algorithm [14]. This implies that the object-size density f(s) takes the form of a probabilistic mixture of exponential terms:

f(s) = Σ(i=1..I) ci λi e^(−λi s) = Σ(i=1..I) fi(s),  with 0 ≤ ci ≤ 1 and Σ(i=1..I) ci = 1.  (1)

This can now be interpreted as follows: the weight ci indicates the frequency of occurrence of objects of class i, and the average size of objects in class i is given by 1/λi. In [14] it is shown that I normally lies in the range of 4 to 8.
Cache fractions. For the fractions pi, we propose three possible choices: (a) to optimize the hit rate, we take the partition size pi proportional to the probability that a request refers to an object from class i, that is, we set pi = ci; (b) to optimize the byte hit rate, we take into account the expected amount of bytes of objects of class i in relation to the overall expected amount of bytes, that is, we set pi = (ci/λi) / Σ(j=1..I) (cj/λj); (c) finally, we may take into account the fraction of bytes “encompassed by” class i (with class ranges ri as defined below), i.e., we take

pi = ∫ from ri−1 to ri of Σ(j=1..I) fj(s) ds = Σ(j=1..I) cj (e^(−λj ri−1) − e^(−λj ri)).

Cache range boundaries. The cache range boundaries ri are computed using a Bayesian decision rule. In Appendix A we show that the class C(s) an object of size s belongs to is given by the i for which fi(s) > fj(s) for all j ≠ i (with fi(·) as in (1)), that is:

C(s) = argmax over i of (ci λi e^(−λi s)),  i = 1, · · · , I.  (2)

From this expression, we find that the class boundaries ri are obtained by solving ci λi e^(−λi s) = ci+1 λi+1 e^(−λi+1 s) for s, which yields

ri = ( ln(ci λi) − ln(ci+1 λi+1) ) / (λi − λi+1),  for i = 1, · · · , I − 1.  (3)
Note that the class boundaries need to be determined only once. Upon the arrival of a request for an object of size s, a simple (binary) search in the ranges {[0, r1), [r1, r2), · · · , [rI−1, rI)} yields the appropriate class C(s).
3.3 When to Compute pi and ri?
Since C-LRU exploits characteristics of the requested object sizes, it is important to determine how often the characterisation has to be adapted². We see three possibilities: (i) once-only determination: a typical sample of the (past) request log is chosen and the parameters are determined once, and assumed not to change in the future; (ii) periodical application: after a predetermined period of time, e.g., every 24 hours, one re-analyses the object-size characteristics and changes
² Note that the EM algorithm takes only a few minutes for traces covering hundreds of millions of requests.
the parameters accordingly; (iii) on-demand: by observing the achieved (byte) hit rate over time, performance changes might be detected, which might lead to adaptations of the parameters. A recently finished master's thesis investigated adaptation in detail [15]; a full investigation of adaptation strategies and their performance implications (as done for adaptive load balancing in [16]) is, however, beyond the scope of the current paper.
4 Evaluation of Class-Based LRU
To evaluate and compare the performance of C-LRU, we performed trace-driven simulations. We used two traces: the RWTH trace, collected over 54 days early in 2000, consists of the logged requests to the proxy server of the RWTH; the DEC trace is a 7-day 1996 trace of the web proxy of the Digital Equipment Corporation [17]. Below, we start with a detailed analysis of the traces, before we continue with a detailed comparison study. We finish with a discussion of the complexity of the caching algorithms.

Table 3. Statistics for the RWTH and the DEC trace

                                  RWTH           DEC
total #requests                   32,341,063     3,763,710
total #bytes                      353.27 GB      31.93 GB
#cacheable requests               26,329,276     3,571,761
#cacheable bytes                  277.25 GB      30.14 GB
fraction of cacheable requests    81.4 %         94.9 %
fraction of cacheable bytes       78.5 %         94.4 %
average object size               10,529 Bytes   10,959 Bytes
squared coeff. of variation       373.54         90.92
median                            3,761 Bytes    3,696 Bytes
smallest object                   118 Bytes      14 Bytes
largest object                    228.9 MB       132.7 MB
unique objects                    8,398,821      1,379,865
total size of unique objects      157.31 GB      17.08 GB
HR∞                               30.46 %        47.34 %
BHR∞                              16.01 %        39.32 %
original size of trace file       2 GB           800 MB
size after preprocessing          340 MB         47 MB
RWTH DEC 32,341,063 3,763,710 353.27 GB 31.93 GB 26,329,276 3,571,761 277.25 GB 30.14 GB 81.4 % 94.9 % 78.5 % 94.4 % 10,529 Bytes 10,959 Bytes 373.54 90.92 3,761 Bytes 3,696 Bytes 118 Bytes 14 Bytes 228.9 MB 132.7 MB 8,398,821 1,379,865 157.31 GB 17.08 GB 30.46 % 47.34 % 16.01 % 39.32 % 2 GB 800 MB 340 MB 47 MB
Analysis of the Traces
In our study, we considered only static (cacheable) objects; requests to dynamic objects were removed as far as they could be identified. Table 3 presents some important statistics for both traces. Note that the object-size distributions exhibit high squared coefficients of variation and very small medians (compared to the means); this is an indicator of heavy-tailedness. The maximum reachable hit rate (denoted HR∞) and the maximum reachable byte hit rate (BHR∞) have been computed using a trace-driven simulation with an infinite cache. Below, we address the object sizes, and the recency and frequency of object requests, for the RWTH trace; corresponding results for the DEC trace are given in Appendix B.
Object-Size Distribution. Figure 2 (left) shows the complementary log-log plot of the object-size distribution. As can be seen, this distribution decays more slowly than an exponential distribution with the same mean, thus showing heavy-tailedness. This becomes even clearer from the histogram of object sizes in Figure 3. The heavy-tailedness is also present when looking at the request frequency as a function of the object size (see Figure 4): small objects are not only more numerous, they are also requested more often than large objects (this inverse correlation between file size and file popularity has also been observed in [9]). Thus, caching strategies which favor small objects can be expected to perform better. However, the figure also shows that large objects cannot be neglected.
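HR∞ and BHR∞ can be obtained in a single pass over the trace: with an infinite cache, every repeated request is a hit. A minimal sketch (the directly indexed seen array stands in for the hash table a real implementation would use):

```c
#include <stddef.h>

/* One logged request: object identifier and object size in bytes. */
struct request { unsigned id; unsigned size; };

/* Maximum reachable (byte) hit rate over a trace, assuming an infinite
   cache: every request to a previously seen object is a hit. 'seen'
   must be a zero-initialised array indexable by object id. */
void infinite_cache_rates(const struct request *trace, size_t n,
                          unsigned char *seen, double *hr, double *bhr)
{
    unsigned long hits = 0, hit_bytes = 0, bytes = 0;
    for (size_t i = 0; i < n; i++) {
        bytes += trace[i].size;
        if (seen[trace[i].id]) { hits++; hit_bytes += trace[i].size; }
        else seen[trace[i].id] = 1;
    }
    *hr  = (double)hits / (double)n;
    *bhr = (double)hit_bytes / (double)bytes;
}
```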
[Figure: complementary log-log plots (1 − F(x) versus object size) of the object-size distributions, each compared against a negative exponential distribution.]
Fig. 2. Complementary log-log plot of the object-size distribution for the RWTH trace (left) and the DEC trace (right)
Recency of Reference (Temporal Locality). Another way to determine the popularity of objects is the temporal locality of their references [18]. However, recent tests have pointed out that this property is decreasing [19], possibly due to client caching. We applied the common LRU stack-depth method [18] to analyse the temporal locality of references. The results are given in Figure 5 (left); the positions of the requested objects within the LRU stack are combined into blocks of 5000. The figure shows that about 20% of all requests have strong temporal locality, thus suggesting the use of a recency-based caching strategy.
[Figure: histograms of the number of objects per 100-byte object-size bin.]
Fig. 3. Number of objects as function of object size for the RWTH trace: (left) linear scale for objects smaller than 10 KB; (right) log-log scale for objects larger than 10 KB
Fig. 4. Number of requests by object size for the RWTH trace: (left) linear scale for objects smaller than 10 KB; (right) log-log scale for objects larger than 10 KB
Fig. 5. Analysis of the RWTH trace: (left) temporal locality characteristics (LRU stack-depth); (right) frequency of reference as a function of object rank (Zipf's law)
A Class-Based Least-Recently Used Caching Algorithm
283

Frequency of Reference. Objects that have often been requested in the past are likely to remain popular in the future. This is captured by Zipf's law: if one ranks the words of a given text by popularity (rank denoted ρ) according to their frequency of use (denoted P), then P ∼ 1/ρ. Studies have shown that Zipf's law also holds for WWW objects. Figure 5 (right) shows a log-log plot for all 8.3 million requested objects of the RWTH trace. As can be seen, the slope of the log-log plot is nearly −1, as predicted by Zipf's law, suggesting the use of frequency-based strategies. It should be mentioned that many objects have been requested only once: 72.64% of all objects in the DEC trace and 67.5% in the RWTH trace. Frequency-based strategies have the advantage that such "one-timers" are assigned a low value, so that frequently requested objects stay longer in the cache and cache pollution is avoided.
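The Zipf check can be reproduced on any trace by fitting a least-squares line to the log-log rank-frequency data. A small sketch on synthetic data (the real analysis uses the 8.3 million objects of the trace):

```python
import math
from collections import Counter

def ranked_frequencies(requests):
    """Frequencies of the requested objects, sorted into rank order."""
    return sorted(Counter(requests).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) versus log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

# Synthetic trace in which object i is requested about 1000/i times,
# i.e. exactly Zipf-distributed popularity:
trace = [i for i in range(1, 101) for _ in range(round(1000 / i))]
print(round(loglog_slope(ranked_frequencies(trace)), 2))  # close to -1
```

A fitted slope near −1 is the signature the paper reports for both traces.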
4.2 Performance Comparison
In this section we consider only the RWTH trace; similar experiments were done for the DEC trace (see Appendix B). We performed the simulations using our own trace-driven simulator written in C++; its design was inspired by an earlier simulator [20]. The code of the simulator is very compact, and its main function is to efficiently update a few tables that represent the current cache content upon the arrival of a new request. The reason not to use a "standard simulator" (whatever that may be) was to obtain high performance. As an example, simulating the RWTH trace (covering over 26 million requests) took less than 15 minutes on a 500 MHz Linux PC with 320 MB of main memory, which amounts to about 30000 handled requests per second. More information on the simulator implementation can be found in [21]. To obtain reasonable results for the hit rate and the byte hit rate, the simulator has to run for a certain amount of time without hits or misses being counted. This so-called warm-up phase was set to 8% of all requests, which corresponds to two million requests and a time period of approximately four days. Since the cache size is a decisive factor for the performance of the cache, we have performed the evaluation with different cache sizes, as shown in Table 4.

Table 4. Cache sizes as percentages of the total amount of data requested

         64 MB   256 MB  1 GB    4 GB    16 GB   64 GB   256 GB
RWTH     0.04%   0.16%   0.64%   2.54%   10.17%  40.69%  162.7%
DEC      0.37%   1.49%   5.86%   23.43%  93.7%   374.8%  —
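The essentials of such a trace-driven cache simulation — an LRU cache plus a warm-up phase during which hits and misses are not counted — can be sketched as follows (a deliberately simple Python version, not the authors' C++ simulator; the toy trace and sizes are invented):

```python
from collections import OrderedDict

def simulate_lru(trace, cache_size, warmup=0):
    """Trace-driven simulation of an LRU cache.

    trace      -- sequence of (object_id, size_in_bytes) requests
    cache_size -- cache capacity in bytes
    warmup     -- number of initial requests excluded from the statistics
    Returns (hit_rate, byte_hit_rate) over the post-warm-up requests.
    """
    cache = OrderedDict()               # object_id -> size, in LRU order
    used = 0
    hits = bytes_hit = counted = bytes_total = 0
    for i, (obj, size) in enumerate(trace):
        hit = obj in cache
        if hit:
            cache.move_to_end(obj)      # mark as most recently used
        elif size <= cache_size:
            while used + size > cache_size:          # evict LRU objects
                _, evicted = cache.popitem(last=False)
                used -= evicted
            cache[obj] = size
            used += size
        if i >= warmup:                 # count only after the warm-up phase
            counted += 1
            bytes_total += size
            hits += hit
            bytes_hit += size * hit
    return hits / counted, bytes_hit / bytes_total

# Tiny example: a 100-byte cache with a 2-request warm-up phase.
trace = [("a", 60), ("b", 50), ("a", 60), ("a", 60), ("b", 50)]
print(simulate_lru(trace, 100, warmup=2))
```

The real simulator additionally maintains per-strategy priority structures; this sketch only shows the warm-up accounting that both hit-rate metrics depend on.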
First, we have to specify the parameters for the C-LRU strategy, as described in Section 3. Using the EM-algorithm with I = 4, the corresponding values for pi and ri are listed in Table 5 (with the cases (a)–(c) as presented in Section 3.2). Note that, in a sense, we have tuned C-LRU, since we knew the object-size distribution in advance (by studying the traces a priori). However, experiments have shown that a very similar classification is derived even from small fractions of the trace (e.g., the warm-up part).
Table 5. RWTH trace: parameters for C-LRU

i   ci      λi             pi (a)   pi (b)   pi (c)   r_{i-1}   r_i
1   0.65    0.0003858      65%      16%      48.6%    0         7455
2   0.321   0.0000798      32.1%    38.2%    30.2%    7455      63985
3   0.027   0.000015633    2.7%     16.4%    18.0%    63985     386270
4   0.002   0.000000646    0.2%     29.4%    3.4%     386270    ∞
In Figure 6, we show the simulation results for the RWTH trace for the different caching strategies with respect to the hit rate (top) and the byte hit rate (bottom). For the C-LRU strategy, we have included the results for the cases (a), (b) and (c), referred to as "C-LRU(a)", "C-LRU(b)" and "C-LRU(c)", respectively. With respect to the hit rate, the simulations show that GDS-Hit provides the best performance for smaller cache sizes. For larger cache sizes, however, it is outperformed by C-LRU(a) and C-LRU(c). The weak performance of C-LRU(a) and C-LRU(c) for small absolute cache sizes can be understood by looking at the partition sizes: the assigned cache fraction of only 0.2% (respectively 3.4%) for class 4 (i = 4) is too small considering the fact that partition 4 is responsible for all objects larger than 367 KB. In practical use this fact does not pose a problem, since typical caches nowadays are larger than 1 GB and, indeed, C-LRU performs well for those cache sizes. For the byte hit rate, one observes that the performance of all strategies is nearly equal, except for LFF, which yields the worst results. C-LRU(a) shows a small performance decrease of about 1% for very large cache sizes; C-LRU(c), however, performs as well as the other strategies. Note that this behaviour is not surprising, since C-LRU(a) was chosen to optimise the hit rate (see Section 3.2). The reverse can be observed for C-LRU(b): chosen to optimise the byte hit rate, its performance is quite low when considering the hit rate.
4.3 Time Complexity
When choosing a caching strategy for practical use, the CPU overhead incurred for managing the cache is of utmost importance. Table 6 shows the time complexity of the typical operations performed on the cache (N is the number of cached objects and I is the number of classes): the identification of a cache hit or miss, the insertion of an object into the cache, the deletion of an object from the cache, and the update of the specific data structures when an access to a cache entry has taken place. These complexities follow from the employed data structures: unordered lists for (most) LRU variants, and ordered lists (trees) for the other caching approaches. C-LRU just maintains I (smaller) LRU data structures. A cache hit or miss is determined using a constant-time hashing function.
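The O(log I) bounds quoted below for C-LRU come from locating an object's class by binary search over the class size boundaries; within a class, the update is a plain O(1) LRU operation. A minimal sketch of this bookkeeping (boundaries taken from Table 5, case (c); the cache-partition space accounting is omitted):

```python
import bisect
from collections import OrderedDict

class CLRU:
    """Minimal sketch of C-LRU bookkeeping (no eviction/space accounting).

    boundaries -- upper size limits r_1 < r_2 < ... < r_{I-1} separating the
                  I classes (r_I = infinity is implicit).
    """
    def __init__(self, boundaries):
        self.boundaries = boundaries
        self.lists = [OrderedDict() for _ in range(len(boundaries) + 1)]

    def class_of(self, size):
        # O(log I): binary search over the class boundaries.
        return bisect.bisect_right(self.boundaries, size)

    def access(self, obj, size):
        # O(log I) class lookup, then an O(1) LRU update within the class.
        lru = self.lists[self.class_of(size)]
        hit = obj in lru
        if hit:
            lru.move_to_end(obj)
        else:
            lru[obj] = size
        return hit

# Class boundaries from Table 5, case (c): 7455, 63985, 386270 bytes.
c = CLRU([7455, 63985, 386270])
print(c.class_of(5000), c.class_of(400000))  # → 0 3
```

Since I is small (4 to 8 in the experiments), the log I factor is negligible in practice.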
Fig. 6. Hit rate (top) and byte hit rate (bottom) comparison of the caching strategies for the RWTH trace
As can be seen, LRU and SLRU have the smallest time complexity. The performance of C-LRU depends on the number of classes I, which in all our experiments ranges from 4 to 8; hence, the time complexity of C-LRU is nearly equal to that of LRU. In contrast, GDS and several other methods require O(log N) operations. Table 6. Complexity of various cache operations
         Hit/Miss   Insert      Delete     Update
LRU      O(1)       O(1)        O(1)       O(1)
SLRU     O(1)       O(1)        O(1)       O(1)
LRU-k    O(1)       O(log N)    O(1)       O(log N)
LFU      O(1)       O(log N)    O(1)       O(log N)
LFF      O(1)       O(log N)    O(1)       O(log N)
GDS      O(1)       O(log N)    O(1)       O(log N)
C-LRU    O(1)       O(log I)    O(log I)   O(log I)

5 Conclusion
In this paper, we have proposed a new caching strategy. Unlike most existing strategies, the new C-LRU strategy bases its replacement decisions on both the size of the requested objects and the recency of the requests. We have shown that these characteristics are important for WWW proxy-server caching, making our strategy an interesting choice. For the performance of C-LRU, we can make two statements: considering the byte hit rate, its performance is comparable to existing strategies, but when looking at the hit rate, C-LRU is clearly better than most other strategies, sharing the first place with GDS-Hit depending on cache size. This is important since the response time of web servers, as perceived by the end users, is mainly determined by the hit rate [6]. The run-time complexity of C-LRU is nearly equal to that of LRU; that is, it does not depend on the number of cached objects (as the complexity of GDS-Hit does). The evaluation described in this paper has been based on trace-driven simulations. We have recently also implemented class-based LRU in Squid, a public-domain proxy server for Linux [22]. This prototype shows performance comparable to what was observed in the simulations. The C-LRU caching approach naturally allows for an adaptive caching strategy. A thorough investigation of this aspect has recently been performed [15]; its results will be presented in the near future.
A Bayesian Decision
When an object of size s is requested, we have to compute its class C(s) ∈ {1, ..., I}. The assignment of objects to classes has to be done such that the probability of a wrong decision is minimised. We therefore define the cost function L[k, l], k, l ∈ {1, ..., I}, which expresses the cost of assigning an observation from class k to class l. For our purpose it suffices to set L[k, l] = 0 if k = l, and L[k, l] = 1 if k ≠ l. The overall cost of making (wrong) decisions (denoted R) can now be obtained by integrating over all possible object sizes:

R = ∫_s Σ_{k=1}^I p(s, k) · L[k, l] ds = ∫_s p(s) Σ_{k=1}^I p(k|s) · L[k, l] ds,   (4)

where p(s) is the probability for an object of size s, and p(s, k) = p(s) · p(k|s) is the joint probability for an object of size s to be classified as class k. We now write l = C(s) and note that the overall cost R is minimised when the summation in the above integration is minimised. This is accomplished by setting:

C(s) := argmin_{l=1,...,I} Σ_{k=1}^I p(k|s) · L[k, l].   (5)

Because the costs are either 0 or 1, this can be further reduced:

C(s) = argmin_l { Σ_{k=1}^I p(k|s) − p(l|s) } = argmin_l {1 − p(l|s)} = argmax_l {p(l|s)}.   (6)

The EM-algorithm delivers the probability p(s) as a weighted sum of exponential densities: p(s) = Σ_{i=1}^I c_i · p(s|i). Using Equation (1), we thus obtain

p(l|s) = c_l · λ_l e^{−λ_l s} / Σ_{k=1}^I c_k · λ_k e^{−λ_k s}.   (7)

Since the denominator in (7) is constant for fixed s, it plays no role in determining the maximum in (6); hence, (6) reduces to C(s) = argmax_l { c_l · λ_l e^{−λ_l s} }.
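Plugging the RWTH parameters from Table 5 into this rule gives a direct implementation of the classifier; note that the resulting decision boundaries coincide with the class boundaries r_i of case (c). A sketch (not the authors' code):

```python
import math

# Hyper-exponential parameters for the RWTH trace (Table 5): weights c_i
# and rates lambda_i obtained from the EM fit.
c = [0.65, 0.321, 0.027, 0.002]
lam = [0.0003858, 0.0000798, 0.000015633, 0.000000646]

def classify(s):
    """C(s) = argmax_l c_l * lambda_l * exp(-lambda_l * s), classes 1..I."""
    scores = [ci * li * math.exp(-li * s) for ci, li in zip(c, lam)]
    return scores.index(max(scores)) + 1

# The decisions reproduce the size boundaries 7455, 63985, 386270 of Table 5:
for size in (1000, 20000, 100000, 1000000):
    print(size, classify(size))  # → classes 1, 2, 3, 4 respectively
```

For example, the class-1/class-2 boundary predicted by the rule, ln(c_1 λ_1 / (c_2 λ_2)) / (λ_1 − λ_2) ≈ 7455 bytes, matches r_1 in Table 5.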
B Results of the DEC Trace
The results for the DEC trace are given in Figure 2(right) and Figures 7 through 10.
Fig. 7. Number of objects as a function of object size for the DEC trace: (left) linear scale for objects smaller than 10 KB; (right) log-log scale for objects larger than 10 KB
Fig. 8. Number of requests by object size for the DEC trace: (left) linear scale for objects smaller than 10 KB; (right) log-log scale for objects larger than 10 KB
Fig. 9. Analysis of the DEC trace: (left) temporal locality characteristics (LRU stack-depth); (right) frequency of reference as a function of object rank (Zipf's law)
Fig. 10. Hit rate (top) and byte hit rate (bottom) comparison of the caching strategies for the DEC trace
References

1. J. Gettys, T. Berners-Lee, H.F. Nielsen: Replication and Caching Position Statement. http://www.w3.org/Propagation/activity.html (1997)
2. I. Tatarinov, A. Rousskov, V. Soloviev: Static caching in web servers. In: Proc. 6th IEEE Int'l Conf. on Computer Communication and Networks. (1997) 410–417
3. M.F. Arlitt, R. Friedrich, T. Jin: Workload characterization of a web proxy in a cable modem environment. In: Proc. ACM SIGMETRICS '99. (1999) 25–36
4. S. Williams, M. Abrams, C.R. Standridge, G. Abdulla, E.A. Fox: Removal policies in network caches for World-Wide Web documents. In: Proc. ACM SIGCOMM '96. (1996) 293–305
5. P. Lorenzetti, L. Rizzo: Replacement Policies for a Proxy Cache. Technical report, Universita di Pisa (1996)
6. P. Cao, S. Irani: Cost-aware WWW proxy caching algorithms. In: Proc. USENIX, Monterey, CA (1997) 193–206
7. R. Karedla, J. Love, B. Wherry: Caching strategies to improve disk system performance. IEEE Computer 27 (1997) 207–218
8. E.J. O'Neil, P.E. O'Neil, G. Weikum: The LRU-k page replacement algorithm for database disk buffering. In: Proc. ACM SIGMOD '93. (1993) 297–306
9. J. Robinson, M. Devarakonda: Data cache management using frequency-based replacement. In: Proc. ACM SIGMETRICS '90. (1990) 134–142
10. T. Jin, A. Bestavros: Greedy-dual* web caching algorithm: Exploiting the two sources of temporal locality in web request streams. Computer Communications 22 (2000) 174–283
11. C. Lindemann, O. Waldhorst: Evaluating the impact of different document types on the performance of web cache replacement schemes. In: Proc. IEEE Int'l Performance and Dependability Symposium. (2002) 717–726
12. H. Bahn, K. Koh, S.H. Noh, S.L. Min: Efficient replacement of nonuniform objects in web caches. IEEE Computer 35 (2002) 65–73
13. G. Ciardo, A. Riska, E. Smirni: EquiLoad: a load balancing policy for clustered web servers. Performance Evaluation 46 (2001) 101–124
14. R. El Abdouni Khayari, R. Sadre, B.R. Haverkort: Fitting world-wide web request traces with the EM-algorithm. Performance Evaluation 52 (2003) 175–191
15. S. Celik: Adaptives Caching in Proxy-Servern. Master's thesis, RWTH Aachen, Department of Computer Science, Aachen, Germany (2003)
16. A. Riska, W. Sun, E. Smirni, G. Ciardo: AdaptLoad: effective balancing in clustered web servers under transient load conditions. In: Proc. 22nd IEEE Int'l Conf. on Distributed Computing Systems. (2002) 104–111
17. Digital Equipment Corporation: Digital's Web Proxy Traces. (ftp://ftp.digital.com/pub/DEC/traces/proxy)
18. M.F. Arlitt, C.L. Williamson: Internet web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking 5 (1997) 631–645
19. P. Barford, A. Bestavros, A. Bradley, M. Crovella: Changes in Web Client Access Patterns. WWW Journal 2 (1999) 3–16
20. M.F. Arlitt, C.L. Williamson: Trace-driven simulation of document caching strategies for internet web servers. SCS Simulation Journal 68 (1997) 23–33
21. M. Pistorius: Caching strategies for web-servers. Master's thesis, RWTH Aachen, Department of Computer Science, Aachen, Germany (2001)
22. T. Isenhardt: Einsatz von klassenbasierten Verfahren in Proxy-Servern. Master's thesis, RWTH Aachen, Department of Computer Science, Aachen, Germany (2002)
Performance Analysis of a Software Design Using the UML Profile for Schedulability, Performance, and Time

Jing Xu, Murray Woodside, and Dorina Petriu
Dept. of Systems and Computer Engineering, Carleton University, Ottawa K1S 5B6, Canada
{xujing, cmw, petriu}@sce.carleton.ca

Abstract. As software development cycles become shorter, it is increasingly important to evaluate non-functional properties of a design, such as its performance (in the sense of response times, capacity, and scalability). To assist users of UML (the Unified Modeling Language), a language extension called the Profile for Schedulability, Performance and Time has been adopted by the OMG. This paper demonstrates the use of the profile to describe performance aspects of a design, and to evaluate and evolve the design to deal with performance issues, based on a performance model in the form of a layered queueing network. The focus is on addressing different kinds of performance concerns, and on interpreting the results into modifications to the design and to the planned run-time configuration.
1 Introduction

The Unified Modeling Language (UML) [2] is the most widely used design notation for software at this time, unifying a number of popular approaches to specifying structure and behaviour. To enable users to capture time and performance requirements, and to evaluate those properties from early specifications, a language extension called the UML Profile for Schedulability, Performance and Time (the SPT Profile) has been defined and adopted [7]. In [18], the process of specifying a system with the SPT Profile was described, together with a layered queueing model created from it. The example was a building security system called BSS. This paper considers how to use the same model to study several performance questions, and to improve the design. The goal of the study is to provide a blueprint to users of the SPT Profile for exploring how performance issues are related to features of a software design, and to gain experience with use of the Profile. This is the first step towards a methodology for guiding design changes and explorations, based on UML and layered modeling, and on previous work such as view navigation [19], optimal configuration [4], and performance patterns and anti-patterns [16]. The use of the SPT Profile can be envisaged as in Figure 1, with a process to interpret performance estimates made by a model, and to suggest changes to the design or to the configuration in the intended environment. If there is a performance shortfall, the process could iteratively improve the design until it is satisfactory. The SPT Profile extends UML by providing stereotypes and tagged values to represent performance requirements, the resources used by the system, and some
P. Kemper and W.H. Sanders (Eds.): TOOLS 2003, LNCS 2794, pp. 291–307, 2003. © Springer-Verlag Berlin Heidelberg 2003
292
J. Xu, M. Woodside, and D. Petriu
behaviour parameters, to be applied to certain UML behaviour models. The selected behaviour models describe scenarios (e.g. sequence diagrams and activity diagrams), because performance is usually specified and analyzed relative to selected scenarios
Fig. 1. Performance measures: targets, input and output, and improvement process
(which in turn represent system responses). Some examples of the Profile stereotypes are shown below for the example system. The performance model used in this work is a layered queueing network (LQN) model, just one of several possible target formalisms. LQNs are particularly well suited to analyzing software performance because they model layered resources and logical resources in a natural way, and they scale up well for large systems [6]. The concepts and notation for LQNs will be briefly introduced for the example below. The process for improving designs will be explored using a Building Security System (BSS), which is intended to control access and to monitor activity in a building like a hotel or a university laboratory. Scenarios derived from two Use Cases will be considered, related to control of door locks by access cards and to video surveillance. In the Access Control scenario, a card is inserted into a door-side reader, read, and transmitted to a server, which checks the access rights associated with the card in a database of access rights, and then either triggers the lock to open the door or denies access. In the Acquire/Store Video scenario, video frames are captured periodically from a number of web cameras located around the building and stored in the database. The system must implement other Use Cases as well, such as operations for administration of the access rights, for sending an alarm after multiple access failures, or for viewing the video frames, but for simplicity we assume that the main performance concerns relate to the two Use Cases described above. Both scenarios have delay requirements. The access control scenario has a target completion time of one second, and the surveillance cycle has a target of one second or less between consecutive polls of a given camera. In both cases we will suppose that 95% of responses, or of polling cycles, should meet the target delay.
Further, it is desired initially to handle access requests at about one per two seconds on average, and to deploy about 50 cameras. Additional camera capacity would be desirable, and a practical plan for scaling the system up to larger buildings and higher loads is to be created.
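A percentile requirement of this form ("95% of responses under 1 second") can be checked directly against simulation output; a minimal nearest-rank percentile estimator (the sample response times below are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response-time samples (seconds) from one simulation run:
times = [0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.4]
user_r = percentile(times, 95)     # the estimated 95th percentile ("UserR")
print(user_r, user_r <= 1.0)       # → 1.4 False (this run misses the target)
```

A simulation tool would report such a percentile as the predicted value bound to the requirement's output variable.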
Performance Analysis of a Software Design Using the UML Profile
293
2 Behaviour Specification of BSS and Its Performance Annotations

The BSS has the planned deployment shown in Figure 2, with one application processor, a separate database processor, and peripheral devices accessed over a LAN.
Fig. 2. Deployment of the Building Security System
The access and surveillance scenarios will be described through sequence diagrams, using stereotypes and tagged values defined in the SPT Profile [7]. Some of the key stereotypes seen in these diagrams are a performance context defining a scenario made up of steps and driven by a workload, and a resource, with a special host resource for a processor. These stereotypes are, respectively, <<PAcontext>>, <<PAstep>>, <<PAopenLoad>> and <<PAclosedLoad>> for workloads, <<PAresource>>, and <<PAhost>>. Figure 3 shows the scenario for access control. The User provides an open workload, meaning a given arrival process. The tagged values define it as a Poisson process with a mean interarrival time of 0.5 seconds, and state a percentile requirement on the response time (95% of responses under 1 second). They also define a variable name UserR for the resulting 95th percentile value, to be estimated. Each step is defined as a focus of control for some component, and the stereotype can be applied to the focus of control or to the message that initiates it; it can also be defined in a note. The steps are tagged with a demand value for processing time (tag PAdemand), which is the CPU demand for the step. The request goes from the card reader to the AccessController software task, to the database and its disk, and then back to execute the check logic and either allow the entry or not. openDoor is a conditional step which can be tagged with a probability (PAprob), which here is set to unity.
Fig. 3. Annotated Sequence Diagram for the Access Control Scenario
The devices are stereotyped as <<PAresource>>, as in the deployment diagram, and so are the software tasks AccessController and Database; this is because a task has a queue and acts as a server to its messages. A resource can be tagged as having multiple copies, as in a multiprocessor or a multithreaded task. The Database process is tagged with 10 threads, by {PAcapacity=10}, and its disk subsystem is tagged as having two disks. The scenario for the video surveillance is shown in Figure 4. There is a single VideoController task which commands the acquisition of video frames from N cameras in turn, by a process AcquireProc. The initial step is the focus of control of VideoController, which is stereotyped as a closed workload source with one instance, with a required cycle time having 95% of cycles below 1 second, and a predicted value Cycle to represent the model result. AcquireProc is a concurrent process (<<PAresource>>). It acquires a Buffer resource by a step allocBuf, which is also stereotyped as <<GRMacquire>>, indicating a resource acquisition. Buffer is a passive resource shown in the deployment diagram with a multiplicity Nbuf, managed by BufManager. In the sequence diagram the use of Buffer is indicated by a note and by the stereotype <<GRMacquire>>. In the base case, Nbuf is set to 1. Once a buffer is acquired, AcquireProc requests the
Fig. 4. Annotated Sequence Diagram for the Acquire/Store Video Scenario
image from the camera, receives it, and passes the full buffer to a separate process StoreProc, which stores the frame in the database and releases the buffer. The writeImg operation on the Database has a tag PAextOp to indicate that it calls (B times) an operation writeBlock which is not defined in the diagram. This operation can be filled in, in the performance model, by a suitable operation to write one block of data to disk.
3 LQN Model

A layered queueing model was derived from the concurrent processes and their interactions, using the principles of scenario traversal described in [11]. The resulting model is shown in Figure 5. Each process is represented by a "task" rectangle with
one or more "entry" rectangles attached to its left. A "task" models an active object, process, thread, or any other logical resource that requires mutual exclusion (such as the buffer pool described below). An "entry" models the operation which processes a distinct class of messages received by the task. For example, if a "task" models an object, an "entry" models a method. Arrows to other entries indicate requests made by an operation to other components. A solid arrowhead shows a synchronous call (where the caller expects a reply and is blocked until receiving it), as from the User to the CardReader in Figure 5; it may be shown as a call and its corresponding return in the sequence diagram. An open arrowhead shows an asynchronous message, and a dashed arrow shows a synchronous request which is forwarded to another task. A server task can carry out part of its work after replying to its client; this is termed a "second phase" of service, and may have its own workload. For each entry the host demand is represented by [s1, s2] for the first- and second-phase CPU demand in time units. For each request arc the mean numbers of calls in the two phases are represented by (y1, y2); the second-phase value is optional. For example, the entry admit of the task AccessController logs a message to the database and has some execution in second phase. A request arc in the model can have a mean number of calls per entry invocation, or a deterministic integer number. Here the calls are mostly given as averages; however, N is the exact number of calls in a polling cycle, and each one leads to exactly one buffer request, one getImage, one passImage, one storeImage, and one writeImage operation. Similarly, one User leads to one readRights and one unlock operation.
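As a toy illustration of this vocabulary (not the authors' tooling): entries with two-phase demands and mean call counts already suffice to aggregate the total CPU demand of one response, by recursing over the synchronous calls. All names and numbers below are invented for the sketch:

```python
# Hypothetical LQN fragment: entry -> (phase-1, phase-2) CPU demand in ms,
# and (caller entry, callee entry) -> mean synchronous calls per invocation.
demand = {"admit": (2.0, 1.0), "readRights": (3.0, 0.0), "readData": (5.0, 0.0)}
calls = {("admit", "readRights"): 1.0, ("readRights", "readData"): 0.5}

def total_demand(entry):
    """Mean CPU demand of one invocation of `entry`, including all nested
    synchronous calls weighted by their mean call counts (no queueing)."""
    s1, s2 = demand[entry]
    return s1 + s2 + sum(y * total_demand(callee)
                         for (caller, callee), y in calls.items()
                         if caller == entry)

print(total_demand("admit"))  # → 8.5  (= 2 + 1 + 1*(3 + 0.5*5))
```

An LQN solver does far more than this (it computes contention at each layered server), but this aggregation is the lower bound on response time when there is no queueing at all.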
It has an "entry" bufEntry which makes synchronous virtual calls to invoke the operations which are carried out while holding the buffer (in [20] these operations were identified with the resource context of the buffer). Although the Sequence Diagram shows that these operations are in the same AcquireProc task in the software, they are separated in the model into a nested pseudo-task, which executes while AcquireProc is blocked. This only breaks a calling cycle which would otherwise appear around Buffer, and does not affect the behaviour of the model. Passing the buffer to Store is also modeled as a call from the Buffer virtual task to Store. It is a second-phase call because the reference task VideoController, the originator of the chain of requests, is not supposed to wait for the storing of the frame in the database; only Buffer must wait for it. Store finally calls the BufferManager task to release the buffer; however, to avoid another calling cycle in the model, the release is again modeled as an entry of a pseudo-task BufMgr. Task multiplicities represent the number of identical replicated processes (or threads) that work in parallel serving requests from the same queue, or the number of logical resources of the same type (e.g., buffers). The parameter values for P, packets per video frame, and B, disk operations to store a video frame, are both set to 8. The number of buffers NBuf for the buffer pool was set to 1.
Fig. 5. Layered Queueing Network model for the Building Security System
4 Performance Evaluation and Improvement
The model was solved by simulation to obtain percentile values for delays, giving results for user response times, throughputs, service times of entries and tasks, utilizations and waiting times for software and hardware resources, and probabilities of missing the deadlines. As mentioned above, the performance requirements are to meet a 1-second deadline for both the Access Control scenario and the Acquire/Store Video scenario, with a 95% confidence level. In the LQN model, these requirements translate into requiring the service time of the VideoController task (also called its cycle time) and the response time of the User task to be less than 1 second with probability 95%.
4.1 Base Case for Performance Evaluation
At the outset we did not know whether the performance requirements could be satisfied, or whether there were bottlenecks or design pitfalls in the system. Therefore, we started the evaluation with a base case, which uses a single copy of all
J. Xu, M. Woodside, and D. Petriu
software and hardware resources, except for the network, database, and disk (whose multiplicities were set according to the system design). Table 1 shows the LQN results for the base case. It lists the cycle time for polling all cameras, the response time for a human user accessing the door, the normalized utilizations of the software and hardware resources, and the probabilities of missing the deadlines. Here we list only the normalized utilizations of the most heavily loaded resources. The normalized utilization is the ratio of the mean number of busy resources to the total number of the corresponding resources. A resource with a normalized utilization of 100% is fully saturated. By using normalized utilization, we can assess at a glance the actual usage of a resource without worrying about the total number of resources. Checking the simulation results for the base case, we can see that the internal throughputs and utilizations are constant, and the cycle time to poll all cameras grows linearly as the number of cameras is increased. This follows from the design, which polls the cameras one at a time: no new polling request is generated before AcquireProc completes the polling of a camera and returns.
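The linear growth can be sketched directly. The per-camera time below is back-calculated from the 10-camera base case in Table 1; the function is a simple extrapolation, not part of the LQN model itself.

```python
# Under strictly serial polling, the cycle time scales linearly with the
# number of cameras. The per-camera time comes from the 10-camera base
# case in Table 1 (cycle time 0.327 s).
PER_CAMERA = 0.327 / 10  # seconds per camera

def cycle_time(ncam):
    """Predicted polling cycle time (seconds) for ncam cameras."""
    return ncam * PER_CAMERA
```

For 40 cameras this predicts about 1.31 s, closely matching the simulated 1.310 s in Table 1, which is why the 1-second cycle deadline cannot be met much beyond 30 cameras in the base case.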
Table 1. Simulation results for the base case (Cycle and User are average response times; AcqProc, Buffer, StoreProc, and AppCPU are normalized utilizations; the last two columns are probabilities of missing the deadline)

| Ncam | Cycle (sec) | User (sec) | AcqProc | Buffer | StoreProc | AppCPU | P(miss Cycle) | P(miss RUser) |
|------|-------------|------------|---------|--------|-----------|--------|---------------|---------------|
| 10   | 0.327       | 0.127      | 0.960   | 0.9998 | 0.582     | 0.549  | 0             | 0.031         |
| 20   | 0.655       | 0.138      | 0.963   | 0.9999 | 0.582     | 0.545  | 0.0007        | 0.036         |
| 30   | 0.983       | 0.133      | 0.964   | 0.9999 | 0.582     | 0.544  | 0.4196        | 0.038         |
| 40   | 1.310       | 0.129      | 0.965   | 0.9999 | 0.582     | 0.544  | 0.9962        | 0.034         |
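The deadline-miss probabilities reported in these tables are estimated directly from the simulation output. As a minimal sketch (not the actual simulator), the estimate is simply the fraction of sampled response times that exceed the deadline:

```python
def miss_probability(samples, deadline=1.0):
    """Fraction of simulated response times that exceed the deadline.
    The requirement in this case study is that the fraction stay below 0.05."""
    return sum(1 for t in samples if t > deadline) / len(samples)
```

For example, over the sample [0.5, 0.8, 1.2, 0.9] the estimate is 0.25, well above the 5% target.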
The results show that the performance requirement for the Access Control scenario can be achieved in all cases, with about a 3%-4% probability of missing the deadline. However, the requirement for the Video Acquire scenario cannot be fulfilled for 50 cameras, or even for 30. The probability of missing the deadline jumps from 0.07% for 20 cameras to 41.96% for 30 cameras, and to 99.62% for 40 cameras. This is clearly unsatisfactory. In this paper, we use the term capacity to indicate the maximum number of cameras the system can support while still meeting the 5% deadline-miss requirement. From the simulation results, we learn that the capacity for the base case is just above 20, which is far from satisfactory. Therefore, we have to analyze the LQN performance results more deeply, in order to identify bottlenecks and to eliminate design pitfalls. We can see that in the base case two tasks are nearly fully saturated, AcquireProc and Buffer. This is a typical example of the bottleneck push-back phenomenon described in [8]. Here Buffer can be viewed as a server that provides services to AcquireProc. In spite of being saturated, AcquireProc is not the
bottleneck, because its underlying server Buffer is also saturated. On the other hand, Buffer is the real bottleneck, because it is saturated while its direct and indirect servers are not. As suggested in [8], a standard inexpensive way of relieving a software bottleneck is cloning (i.e., making multiple identical copies of the constrained server that share the same incoming request queue). In the case of the buffer pool, clones take the form of additional buffers. We also expect that by relieving one bottleneck, another bottleneck may appear, and that we can repeat the process until the bottleneck is either pushed down to the hardware resources (hardware saturation) or up to the client end (adequate capacity for the offered load). Hardware bottlenecks can also be solved by cloning, in the form of multiple devices such as multiprocessors. There are other ways of solving bottlenecks, such as changing the scenario design, using more efficient scheduling strategies, or modifying the deployment. Furthermore, when all bottlenecks are eventually solved or have been pushed to the client end, we have to depend on other methods for further improving the performance, as discussed in the next section.
4.2 Strategy for Improving the Performance
Our strategy for improving the system performance is sketched in Figure 6. We start with the base case of the performance model, which is translated directly from the design. Solving the model by simulation, we get performance result data from which we can identify performance problems. If the performance requirements are satisfied, the current design is adequate. Otherwise, we further explore the
[Figure 6 (flowchart): "Initial design" leads to "Get LQN results", then "Performance satisfied?"; if yes, "Feedback to design"; if no, "Bottleneck found?"; if yes, "Clone bottleneck resource" and solve again; if no, "Other solutions".]
Fig. 6. Strategy for performance improvement of the BSS
performance results, looking for bottlenecks. If a bottleneck is found, we can solve it by cloning the bottleneck resource, for example by using multiple buffers, multithreading software processes, or using multiple processors. We achieve this by modifying the performance model, then solving it with the new parameters and repeating the same procedure.
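The loop of Figure 6 can be sketched in a few lines. The `solve` and `clone` stand-ins below are toy stand-ins for the LQN simulation and the model edit, wired to echo the sequence of fixes found for the BSS (4 buffers, then 2 StoreProc threads); they are illustrative assumptions, not the real solver.

```python
def tune(solve, clone, max_iter=20):
    """Sketch of the tuning loop: solve the model, check the requirements,
    clone the bottleneck, and repeat until satisfied or no bottleneck remains."""
    for _ in range(max_iter):
        results = solve()
        if results["satisfied"]:
            return "feedback to design", results
        if results["bottleneck"] is None:
            return "seek other solutions", results
        clone(results["bottleneck"])
    return "budget exhausted", results

# Toy stand-in for the LQN simulation, echoing the BSS sequence of fixes.
config = {"Buffer": 1, "StoreProc": 1}

def solve():
    if config["Buffer"] < 4:
        return {"satisfied": False, "bottleneck": "Buffer"}
    if config["StoreProc"] < 2:
        return {"satisfied": False, "bottleneck": "StoreProc"}
    return {"satisfied": True, "bottleneck": None}

def clone(resource):
    config[resource] += 1  # add one more copy of the constrained resource

status, _ = tune(solve, clone)
```

Running the loop clones Buffer three times and StoreProc once before the requirements are met, mirroring the progression in Tables 2 and 3.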
In [8], utilization measures are used to locate the bottleneck in client-server systems and rendezvous networks. In this paper, we use the normalized utilization, defined in Section 4.1, as one of the indicators. The most saturated resource has the greatest potential to be the bottleneck. However, to decide whether it is the real bottleneck, the system architecture must also be considered, because a client resource may be blocked by a server resource which is in fact the real bottleneck. (Note that a resource that requests a service is called a client resource, whereas one that provides the service is called a server resource.) Usually, a resource with a large number of outgoing calls (fan-out), second-phase service, or incoming asynchronous calls can easily become the bottleneck. Using multiple identical copies of resources is a straightforward way to resolve a bottleneck. Usually, resolving one bottleneck in this way pushes the bottleneck to another resource, in either a lower or a higher layer. By repeatedly adjusting the number of copies of different resources within the system, the bottleneck moves around until it is finally eliminated. In this paper, we call this procedure system configuration tuning. At that point, in open systems there is no saturated resource, and in closed systems the saturation is pushed back to the external client end. There is a second path in the strategy of Figure 6: seeking other solutions for performance improvement when no bottleneck can be identified. If the performance requirements still cannot be satisfied after tuning the system configuration, the cause is not limited resources; it may be heavy execution demand, long scenario paths, or lack of concurrency in the system. In this case, we take the second path of the strategy, for which there is no standard approach. The solution is usually project-specific.
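The indicator can be made concrete in a small sketch. The utilizations below are taken from Table 1 at 40 cameras, but the client-to-server edges and the saturation threshold are simplified assumptions made for illustration, not the full call graph of Figure 5.

```python
SATURATED = 0.90  # threshold for "nearly fully saturated" (an assumption)

# Normalized utilizations from Table 1 at 40 cameras; simplified edges.
util = {"AcquireProc": 0.965, "Buffer": 0.9999,
        "StoreProc": 0.582, "AppCPU": 0.544}
servers = {"AcquireProc": ["Buffer"], "Buffer": ["StoreProc"],
           "StoreProc": ["AppCPU"], "AppCPU": []}

def reachable_servers(resource):
    """All direct and indirect servers of a resource."""
    seen, stack = set(), list(servers[resource])
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(servers[s])
    return seen

def bottlenecks():
    """Saturated resources none of whose servers are themselves saturated."""
    return [r for r, u in util.items()
            if u >= SATURATED
            and not any(util[s] >= SATURATED for s in reachable_servers(r))]
```

Here `bottlenecks()` returns only Buffer: AcquireProc, although saturated, is excluded because its server Buffer is saturated too, matching the push-back reasoning above.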
Typical solutions include changing the scenario design, shortening long scenarios, decomposing large components, using more efficient scheduling strategies, and modifying the deployment. After applying these solutions, bottlenecks usually appear again in the system, because such solutions lead to a more efficient, and therefore more intense, usage of the existing resources. Thus we are back on the main path of the strategy. By repeating the strategy, we will eventually reach a point where the performance requirements can be met (assuming that the requirements are reasonable). Then we translate the changes that were applied to the performance model into system configuration information and software design descriptions, and give this feedback to the designer.
4.3 Using Multiple Copies or Clones of Resources
This section describes efforts to solve the software and hardware bottlenecks by using multiple copies of resources (i.e., by cloning). As discussed in the base-case evaluation (Section 4.1), the first bottleneck is the Buffer. Therefore, our first solution step is to use multiple buffers. Many cases were solved for the system under different configurations, i.e., with different numbers of cameras and buffers. Table 2 lists the data for 40 cameras and different numbers of buffers. As seen in Table 2, the performance improvement due to multiple buffers is obvious. The probability of missing the cycle-time deadline drops greatly, from 99% for 1 buffer to 9.35% for 10 buffers, but the requirement of a 5% probability for
missing the deadline is still not achieved. We can see that there is now a newly saturated resource, namely StoreProc, which is the real bottleneck in the case with 10 buffers. We notice that the normalized utilization of Buffer drops at first as NBuf grows from 1 to 4, then rises slightly afterwards. However, the normalized utilization of Buffer is high (over 84%) in the case with 10 buffers only because it is blocked by its server resource StoreProc. The bottleneck has been pushed to a lower layer in the model.

Table 2. LQN results for multiple buffers (40 cameras)

| NBuf | Cycle (sec) | User (sec) | AcqProc | Buffer | StoreProc | AppCPU | P(miss Cycle) | P(miss RUser) |
|------|-------------|------------|---------|--------|-----------|--------|---------------|---------------|
| 1    | 1.309       | 0.137      | 0.965   | 0.9999 | 0.583     | 0.544  | 0.9961        | 0.034         |
| 2    | 1.016       | 0.132      | 0.975   | 0.8762 | 0.800     | 0.702  | 0.5503        | 0.032         |
| 3    | 0.941       | 0.132      | 0.980   | 0.8235 | 0.893     | 0.756  | 0.2506        | 0.036         |
| 4    | 0.911       | 0.131      | 0.983   | 0.8042 | 0.936     | 0.782  | 0.1597        | 0.032         |
| 7    | 0.879       | 0.132      | 0.986   | 0.8136 | 0.984     | 0.810  | 0.0948        | 0.033         |
| 10   | 0.872       | 0.129      | 0.987   | 0.8437 | 0.995     | 0.817  | 0.0935        | 0.034         |
Therefore, the second solution step is to clone the StoreProc task. Table 3 shows the results for the case of 40 cameras with 4 buffers. We can see that with 2 StoreProc threads, the probability of missing the deadline has dropped to a satisfactory level. The system capacity is now above 40 cameras, about double that of the base case.

Table 3. LQN results for multithreading StoreProc (40 cameras, 4 buffers)

| StoreProc threads | Cycle (sec) | User (sec) | AcqProc | Buffer | StoreProc | AppCPU | P(miss Cycle) | P(miss RUser) |
|-------------------|-------------|------------|---------|--------|-----------|--------|---------------|---------------|
| 1                 | 0.911       | 0.131      | 0.983   | 0.8042 | 0.936     | 0.782  | 0.1597        | 0.032         |
| 2                 | 0.756       | 0.137      | 0.946   | 0.5805 | 0.616     | 0.940  | 0.0022        | 0.035         |
| 3                 | 0.743       | 0.139      | 0.932   | 0.5484 | 0.441     | 0.956  | 0.0015        | 0.039         |
By the same reasoning, the new bottleneck is the ApplicationCPU: the bottleneck has been pushed from the software resources to the hardware resources. This bottleneck can be relieved by using a multiprocessor, giving the results in Table 4. The simulation results show that 2 ApplicationCPUs are enough to solve the hardware bottleneck here. Using a dual processor is a typical configuration strategy. The system capacity is now 50 cameras, with 4 Buffers, 2 StoreProc threads, and 2 ApplicationCPUs. This is 2.5 times the capacity of the base case and achieves our initial goal for system capacity. The point has been reached where, except for the reference task VideoController, there is only one saturated resource in the system, namely
AcquireProc. We might consider it the bottleneck; however, the LQN results show that cloning it gives no improvement. In fact, its queue contains at most one request at any time and never grows. The performance is no longer limited by a lack of resources, but by design limitations.

Table 4. LQN results for multiple processors for ApplicationCPU (40 cameras, 4 buffers, 2 StoreProc threads)

| CPUs | Cycle (sec) | User (sec) | AcqProc | Buffer | StoreProc | AppCPU | P(miss Cycle) | P(miss RUser) |
|------|-------------|------------|---------|--------|-----------|--------|---------------|---------------|
| 1    | 0.756       | 0.137      | 0.946   | 0.5805 | 0.616     | 0.94   | 0.0022        | 0.035         |
| 2    | 0.648       | 0.127      | 0.995   | 0.6111 | 0.653     | 0.549  | 0             | 0.035         |
| 3    | 0.644       | 0.128      | 0.997   | 0.6105 | 0.652     | 0.368  | 0             | 0.033         |
Now we take the second path of our performance-improvement strategy.
4.4 Changing the Scenario Design to Introduce More Concurrency
As mentioned before, there are two saturated tasks in the model, the reference task VideoController and AcquireProc. A reference task in a closed model drives the system by generating workload, and usually represents the behaviour of an external client. Its normalized utilization is always 1, because it is always blocked by all of the services in the scenario. Here VideoController is similar to an external client, although it is part of the system. The VideoController has to wait for the message returned from AcquireProc before generating the next polling call. The call from the VideoController to AcquireProc is synchronous, and all of the work of AcquireProc is finished in its first phase. Therefore, only one instance of AcquireProc can be active at any time. The system suffers from too much serialization, and its capacity is limited by the duration of the scenario that polls one video camera. To solve this problem, a change in the system design is required. The key is to enable concurrent activations of the AcquireProc task by multithreading the process. A solution is to move the calls made by AcquireProc for allocating and using the buffer into its second phase, making an early reply to VideoController. Then VideoController can generate its next polling call earlier. After introducing more concurrency into the system, the software and hardware bottlenecks appear again, and we return to the main path of our strategy, tuning the system configuration. During the tuning, the bottleneck moves around within the system; for example, it may move to the AcquireProc or
StoreProc tasks, as well as to the ApplicationCPU. There are different configuration strategies to address these problems. By repeatedly tuning the system configuration of software and hardware, the system performance can be improved dramatically. Table 5 shows some system performance results under different configurations. Here we aim for a capacity of 100 cameras and increase the numbers of Buffer, AcquireProc, and StoreProc threads and the number of ApplicationCPUs step by step. Finally, with 3 AcquireProc threads, 6 StoreProc threads, and 3 ApplicationCPUs, we manage to meet the 1-second deadline for the cycle time in the case of 100 cameras with a probability of 99.95%. This capacity is 5 times that of the base case, and twice the capacity before changing the design.

Table 5. LQN results for the case with more concurrency (100 cameras)
| Multiplicity (Acquire, Buffer, Store, AppCPU) | Cycle (sec) | User (sec) | AcqProc | Buffer | StoreProc | AppCPU | P(miss Cycle) | P(miss RUser) |
|-----------------------------------------------|-------------|------------|---------|--------|-----------|--------|---------------|---------------|
| 2, 4, 2, 2                                    | 1.250       | 0.133      | 0.988   | 0.923  | 0.886     | 0.710  | 0.9995        | 0.0332        |
| 2, 10, 6, 3                                   | 0.837       | 0.132      | 0.988   | 0.689  | 0.751     | 0.707  | 0.0057        | 0.0307        |
| 3, 10, 6, 3                                   | 0.768       | 0.134      | 0.983   | 0.895  | 0.910     | 0.769  | 0.0005        | 0.0352        |
The results also show that the AcquireProc and StoreProc tasks are saturated again. Therefore, we expect that there is more room for improving the capacity by further tuning the system configuration.
4.5 Feedback into the Software Design
The exploration described above is carried out in the space of LQN models, but the final result must be transferred back into the software design. The modified sequence diagram in Figure 7 shows two kinds of change:
• suggestions on the multithreading of active objects, represented by the tag PAcapacity, as in the objects AcquireProc and StoreProc;
• the design change to AcquireProc, which puts all the buffer processing into a second phase.
The second phase is incorporated into the specification by changing the synchronous message from VideoController to AcquireProc into two asynchronous messages, for the VideoController's request and the corresponding reply. After sending the reply, AcquireProc invokes its own getBuffer operation and everything that follows, as the second phase. As this example shows, some kinds of feedback can be presented in the design model just by using tag values defined in the SPT Profile. However, others require deeper changes to be made by the designers. For instance, multithreading may require changes to synchronize threads or to maintain consistency in data shared by the threads; partitioning an object into two concurrent objects would require new classes.
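The early-reply change can be illustrated with a small concurrency sketch (hypothetical names, not the BSS code): the server answers its caller at the end of phase 1 and performs the buffer work afterwards, so the caller is unblocked sooner and can issue the next poll.

```python
import queue
import threading

def acquire_proc(requests, replies, second_phase_log):
    """Sketch of the early-reply design change: reply to the caller first,
    then do the buffer/storage work as a second phase after the reply."""
    i = requests.get()
    replies.put(("frame-started", i))  # early reply ends phase 1
    # Phase 2: buffer allocation and image transfer happen after the reply,
    # so the controller does not wait for them.
    second_phase_log.append(i)

requests, replies, log = queue.Queue(), queue.Queue(), []
worker = threading.Thread(target=acquire_proc, args=(requests, replies, log))
worker.start()
requests.put(7)          # the controller issues a poll for camera 7
msg = replies.get()      # the controller unblocks here, before phase 2 may end
worker.join()            # wait for phase 2 so the log is complete
```

In the LQN model this corresponds exactly to moving workload from phase 1 to phase 2 of the AcquireProc entry, which is what allows concurrent activations once the task is multithreaded.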
Techniques to support these changes, possibly based on patterns that solve typical problems that arise, will be needed.
[Figure 7 (diagram): sequence diagram annotated with the SPT Profile: a <<PAcontext>> scenario; <<PAresource>> stereotypes on VideoController, AcquireProc, StoreProc, BufferManager, and Database; {PAcapacity} tags on AcquireProc and StoreProc; <<PAstep>> {PAdemand} annotations on steps including getBuffer, getImage, passImage, storeImage, store, and writeImg; <<GRMacquire>> on allocBuf and <<GRMrelease>> on releaseBuf; and a <<PAclosedLoad>> annotation giving the required and predicted percentile for the cycle.]
Fig. 7. Modified Sequence Diagram for Acquire/Store scenario
5 Related Work
Other researchers have developed approaches to convert UML software specifications into different kinds of performance models, including queueing networks (Smith and Williams [17]), Petri net models (Merseguer et al. [9], and work surveyed by Pooley [14]), and stochastic process algebras (Canevet et al. [3]). Layered queueing models were produced from UML activity diagrams by Petriu and Shen in [13]. Other kinds of specifications have also been converted; for instance, LQN models were produced by automated transformation of a non-UML scenario specification in [11]. Most of these papers focus on the conversion process itself. The use of a model to improve a software design is the focus of the “performance principles” in Smith's book [17] and of her work with Williams on antipatterns [16]. In
[10], Menasce and Gomaa modeled an information system and redesigned database transactions for performance. In [9], Merseguer et al. evolved the design of a wireless application. Improvements related to software bottlenecks were defined by Neilson et al. in [8], including their systematic removal by the introduction of task threads. In [12], Petriu and Woodside showed how part of an e-commerce system could be redesigned to achieve performance goals. The question of how to navigate through the results of a software model, to identify the source of performance problems, was discussed in a general way in [19]. In the broader context of computer and network systems (not just software designs), various kinds of reasoning aids for performance diagnosis have been described. An example for distributed computing (with references to other work) is described by Hellerstein [5] in the form of a QPD (Quantitative Performance Diagnosis) algorithm. It estimates how much certain system attributes, such as traffic levels and hardware capacities, contribute to performance problems, to assist in hardware improvement. QPD includes navigation of the measurements, guided by a model-like view of the system.
6 Conclusion
A layered performance model has been used to expose performance problems in a software design, and to evaluate design changes, in a case study that combines client-server aspects with real-time deadlines, including video frame transfer. Three kinds of performance issue emerged in this study. First, there was the question of video buffers: it is essential to provide multiple buffers, to overlap video acquisition from the cameras with the storage of the frames. For a capacity of 50 cameras, double buffering (with two buffers alternating) is not as good as four buffers used in rotation; for a higher capacity of 100 cameras, 10 buffers are needed. Second, there are deployment and configuration issues, such as providing multithreaded tasks for adequate concurrency. Finally, a change in the software execution sequence within the video acquisition task was found beneficial: by providing an early reply to the control task that manages the acquisition loop, greater concurrency in video acquisition and a much higher capacity were obtained. A notable result here is that the software change could only be identified as beneficial after the thread-configuration and buffer questions had been resolved. Without the buffers and the threads, the early reply from AcquireProc would not help (the results for that case are not given here, but they are identical to the base case). So we have evidence for a general principle for software design improvement:
Holistic Improvement Principle: Software design improvements can only be evaluated in the context of the best possible deployment and configuration alternatives.
The case study shows how drastic changes in the design can be inserted into a performance model and evaluated quickly and inexpensively. Many possible changes can be assessed, the best ones selected, and finally the design updated to incorporate the beneficial changes.
A question raised but not resolved here is the systematic navigation of the model results to identify and rank the potential design changes at each step. This is one subject of the PUMA project [15] for the integration of UML design and performance engineering. An outline of the complete PUMA project is shown in Figure 1, including UML model transformation, performance model experimentation, and feedback of results.
Acknowledgements. This research was supported by the Natural Sciences and Engineering Research Council of Canada.
References
1. Simona Bernardi, Susanna Donatelli, Jose Merseguer, “From UML sequence diagrams and statecharts to analysable Petri net models”, Proc. 3rd International Workshop on Software and Performance, Rome, Italy.
2. Grady Booch, Ivar Jacobson, and James Rumbaugh, The Unified Modeling Language User Guide, Reading, Mass.: Addison-Wesley.
3. C. Canevet, S. Gilmore, J. Hillston, M. Prowse, and P. Stevens, “Performance modelling with UML and stochastic process algebras”, IEE Proceedings: Computers and Digital Techniques.
4. H. E. El-Sayed, Don Cameron, C. M. Woodside, “Automation Support for Software Performance Engineering”, Proc. Joint Int. Conf. on Measurement and Modeling of Computer Systems (Sigmetrics/Performance), Cambridge, MA.
5. Joseph L. Hellerstein, “A General-Purpose Algorithm for Quantitative Diagnosis of Performance Problems”, Journal of Network and Systems Management.
6. Prasad Jogalekar, Murray Woodside, “Evaluating the Scalability of Distributed Systems”, IEEE Trans. on Parallel and Distributed Systems.
7. Object Management Group, “UML Profile for Schedulability, Performance, and Time Specification”, OMG Adopted Specification.
8. J. E. Neilson, C. M. Woodside, D. C. Petriu, and S. Majumdar, “Software Bottlenecking in Client-Server Systems and Rendezvous Networks”, IEEE Trans. on Software Engineering.
9. Jose Merseguer, Javier Campos, Eduardo Mena, “Performance analysis of Internet-based software retrieval systems using Petri Nets”, Proc. ACM International Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems, Rome, Italy.
10. D. Menasce and H. Gomaa, “A Method for Design and Performance Modeling of Client/Server Systems”, IEEE Transactions on Software Engineering.
11. Dorin Petriu, Murray Woodside, “Software Performance Models from System Scenarios in Use Case Maps”, Proc. Int. Conf. on Modeling Tools and Techniques for Computer and Communication System Performance Evaluation (Performance TOOLS), London.
12. Dorin Petriu, Murray Woodside, “Analysing Software Requirements Specifications for Performance”, Proc. 3rd Int. Workshop on Software and Performance, Rome.
13. D. C. Petriu, H. Shen, “Applying the UML Performance Profile: Graph Grammar based derivation of LQN models from UML specifications”, in Computer Performance Evaluation: Modelling Techniques and Tools (Tony Fields, Peter Harrison, Jeremy Bradley, Uli Harder, Eds.), Lecture Notes in Computer Science, Springer-Verlag.
14. R. Pooley, “Software Engineering and Performance: a Roadmap”, in The Future of Software Engineering, part of the Int. Conf. on Software Engineering (ICSE 2000), Limerick, Ireland.
15. PUMA: Performance from Unified Model Analysis, www.sce.carleton.ca/rads/puma.
16. C. Smith and L. Williams, “Software Performance Antipatterns”, in Proc. Second International Workshop on Software and Performance (WOSP), Ottawa, Canada.
17. C. U. Smith and L. G. Williams, Performance Solutions, Addison-Wesley.
18. C. M. Woodside, D. Petriu, “Performance Analysis with UML”, chapter in UML for Real: Design of Embedded Real-Time Systems (Luciano Lavagno, Grant Martin, and Bran Selic, Eds.), Kluwer Academic Publishers, New York.
Author Index

Benoit, A. 98; Bohnenkamp, H. 116; Brenner, L. 98; Buchholz, P. 218; Chandrasekaran, B. 29; Ciardo, G. 78; El Abdouni Khayari, R. 273; Fernandes, P. 98; German, R. 11; Harchol-Balter, M. 182, 200; Harrison, P. 152; Haverkort, B.R. 273; Heindl, A. 237; Hermanns, H. 116; Hielscher, K.-S.J. 11; Jones, R.L. 78; Kasprowicz, K. 47; Katoen, J.-P. 116; Klaren, R. 116; Klaue, J. 255; Lenzini, L. 134; Liefvoort, A. van de 237; Liljenstam, M. 1; Liu, J. 1; Liu, Z. 63; Meini, B. 134; Miner, A.S. 78; Mingozzi, E. 134; Mitchell, K. 237; Nicol, D.M. 1; Osogami, T. 182, 200; Panda, D.K. 29; Petriu, D. 291; Plateau, B. 98; Rathke, B. 255; Riabov, A. 63; Sadre, R. 273; Schulman, M. 63; Siminiceanu, R. 78; Stea, G. 134; Stewart, W.J. 98; Wolisz, A. 255; Wolter, K. 47; Woodside, M. 169, 291; Wyckoff, P. 29; Xia, C. 63; Xu, J. 291; Zertal, S. 152; Zhang, F. 63; Zhang, L. 63; Zheng, T. 169