Lecture Notes in Computer Science, Volume 2537
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
Dror G. Feitelson, Larry Rudolph, Uwe Schwiegelshohn (Eds.)
Job Scheduling Strategies for Parallel Processing
8th International Workshop, JSSPP 2002
Edinburgh, Scotland, UK, July 24, 2002
Revised Papers
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Dror G. Feitelson, The Hebrew University, School of Computer Science and Engineering, 91904 Jerusalem, Israel, E-mail: [email protected]
Larry Rudolph, Massachusetts Institute of Technology, Laboratory for Computer Science, Cambridge, MA 02139, USA, E-mail: [email protected]
Uwe Schwiegelshohn, Universität Dortmund, Computer Engineering Institute, 44221 Dortmund, Germany, E-mail: [email protected]
CR Subject Classification (1998): D.4, D.1.3, F.2.2, C.1.2, B.2.1, B.6, F.1.2
ISSN 0302-9743
ISBN 3-540-00172-7 Springer-Verlag Berlin Heidelberg New York
© Springer-Verlag Berlin Heidelberg 2002
Preface
This volume contains the papers presented at the 8th Workshop on Job Scheduling Strategies for Parallel Processing, which was held in conjunction with HPDC11 and GGF5 in Edinburgh, UK, on July 24, 2002. The papers have been through a complete review process, with the full version being read and evaluated by five to seven members of the program committee. We would like to take this opportunity to thank the program committee, Andrea Arpaci-Dusseau, Walfredo Cirne, Allen Downey, Wolfgang Gentzsch, Allan Gottlieb, Moe Jette, Richard Lagerstrom, Jens Mache, Cathy McCann, Reagan Moore, Bill Nitzberg, Mark Squillante, and John Towns, for an excellent job. Thanks are also due to the authors for their submissions, presentations, and final revisions for this volume. Finally, we would like to thank the MIT Laboratory for Computer Science and the School of Computer Science and Engineering at the Hebrew University for the use of their facilities in the preparation of these proceedings.

This year saw an emphasis on two main themes. The first was the classical MPP scheduling area. The main focus in this area was on backfilling, including several advanced variations on the basic scheme. It is also noteworthy that several papers discussed the use of adaptiveness in job scheduling. The second major theme was scheduling in the context of grid computing, which is emerging as an area of much activity and rapid progress. These are complemented by an invited paper providing an overview of the scheduling and resource management area of the Global Grid Forum (GGF) effort.

This was the eighth annual workshop in this series, which reflects the continued interest in this area. The proceedings of previous workshops are available from Springer-Verlag as LNCS volumes 949, 1162, 1291, 1459, 1659, 1911, and 2221 (and since 1998 they have also been available online). We hope you find these papers interesting and useful.
September 2002
Dror Feitelson Larry Rudolph Uwe Schwiegelshohn
Table of Contents
A Self-Tuning Job Scheduler Family with Dynamic Policy Switching
  Achim Streit
Preemption Based Backfill
  Quinn O. Snell, Mark J. Clement, and David B. Jackson
Job Scheduling for the BlueGene/L System
  Elie Krevat, José G. Castaños, and José E. Moreira
Selective Reservation Strategies for Backfill Job Scheduling
  Srividya Srinivasan, Rajkumar Kettimuthu, Vijay Subramani, and Ponnuswamy Sadayappan
Multiple-Queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems
  Barry G. Lawson and Evgenia Smirni
Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy
  William A. Ward, Jr., Carrie L. Mahood, and John E. West
The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance
  Su-Hui Chiang, Andrea Arpaci-Dusseau, and Mary K. Vernon
Economic Scheduling in Grid Computing
  Carsten Ernemann, Volker Hamscher, and Ramin Yahyapour
SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems
  Karl Czajkowski, Ian Foster, Carl Kesselman, Volker Sander, and Steven Tuecke
Local versus Global Schedulers with Processor Co-allocation in Multicluster Systems
  Anca I.D. Bucur and Dick H.J. Epema
Practical Heterogeneous Placeholder Scheduling in Overlay Metacomputers: Early Experiences
  Christopher Pinchak, Paul Lu, and Mark Goldenberg
Current Activities in the Scheduling and Resource Management Area of the Global Grid Forum
  Bill Nitzberg and Jennifer M. Schopf
Author Index
A Self-Tuning Job Scheduler Family with Dynamic Policy Switching

Achim Streit
PC2 - Paderborn Center for Parallel Computing, Paderborn University, 33102 Paderborn, Germany
[email protected], http://www.upb.de/pc2
Abstract. The performance of job scheduling policies strongly depends on the properties of the incoming jobs. If the job characteristics change often, the scheduling policy should follow these changes. For this purpose the dynP job scheduler family has been developed. The idea is to dynamically switch the scheduling policy during runtime. In a basic version the policy switching is controlled by two parameters. The basic concept of the self-tuning dynP scheduler is to compute virtual schedules for each policy in every scheduling step; the policy which generates the 'best' schedule is chosen. The performance of the self-tuning dynP scheduler no longer depends on an adequate setting of the input parameters. We use a simulative approach to evaluate the performance of the self-tuning dynP scheduler and compare it with previous results. To drive the simulations we use synthetic job sets that are based on trace information from four computing centers (CTC, KTH, PC2, SDSC) with clearly different characteristics.
1 Introduction
A modern resource management system for supercomputers consists of many different components which are all vital for the everyday usage of the machine. Besides the management software working properly, the scheduler plays a major role in improving the acceptance, usability, and performance of the machine. The performance of the scheduler can be seen as a quality of service with regard to the performance of the users' jobs (e.g. wait and response time). But the machine owner is also interested in a good scheduler performance, e.g. for increasing the utilization of the machine. Hence, much work has been done in the field of improving or developing new scheduling algorithms and policies in general. Some examples are: gang scheduling [2] (combined with migration and backfilling [20]), several backfilling variants (conservative [4], EASY [13], or slack-based [18]), and a tool for predicting job runtimes [14]. Also, research for specific machines was done, e.g. for the IBM SP2 and the LoadLeveler system [18, 11], or the IBM ASCI Blue [5, 10].
It is common to use simulation environments for evaluating scheduling algorithms. Job sets are often based on trace information from real machines. Especially for that purpose a Parallel Workload Archive [17] was established. During the last years such simulation environments were also used to evaluate scheduling algorithms for upcoming computational grid environments [1]. Admittedly, no grid job trace is available at the moment. In this paper we follow a similar approach: we built a simulation environment tailored for our resource management system CCS (Computing Center Software) [8]. In that simulation environment the exact scheduling process of CCS is modelled. Because real trace data from our hpcLine cluster exists, it is possible to develop and evaluate new scheduling algorithms. Our cluster is operated in space-sharing mode, as our users often need exclusive access to the network interface and the full compute power of the nodes. Three scheduling policies have grown historically: FCFS (first come, first serve), SJF (shortest jobs first), and LJF (longest jobs first), each supplemented with conservative backfilling. In the following we present a scheduler family developed for our system. It is based on the three single policies and dynamically switches between them automatically and in real time. Furthermore, it offers a self-tuning ability, so that no interaction or startup parameter is necessary. Besides a trace-based job set from our machine (PC2), we also evaluated the algorithms with three other trace-based job sets from the Cornell Theory Center (CTC), the Swedish Royal Institute of Technology (KTH), and the San Diego Supercomputer Center (SDSC). The remainder of this paper is organized as follows: in the next section some related work is presented. In Section 3 the algorithms are presented, starting with the basic variant and followed by the self-tuning dynP scheduler with two different deciders. After that, the used workloads are presented and examined in Section 4. The evaluation in Section 5 starts with a short look at the used performance metrics and proceeds with the results. We surveyed different aspects of the algorithms and also present a comparison with previous work. Finally, this paper ends with a conclusion in Section 6.
2 Related Work
Ramme and Gehring [12, 6] introduced the IVS (Implicit Voting System) for scheduling the MPP systems of a virtual machine room in 1996. The problem was that the systems were switched between batch and interactive mode manually at fixed times of the day (i.e. interactive mode during working hours and batch mode for the rest of the day and during weekends). This static solution restricted the usage of the systems very much. The idea was that the users themselves should vote for the scheduling method used, depending on the characteristics of their favored resource requests. However, the users should not vote explicitly. Therefore the IVS was developed. Three strategies are used as a basis: FCFS, FFIH (first fit, increasing height), and FFDH (first fit, decreasing height). FFIH sorts the request list by increasing estimated job runtime, so that
short (interactive) jobs are at the front. FFDH sorts the requests in the opposite order to FFIH. Using FFIH leads to a shorter average waiting time in general, whereas FFDH commonly improves the overall system utilization. Hence the basic idea of IVS is to check whether more batch or more interactive jobs are in the system. Depending on that, IVS switches between FFDH (more batch jobs) and FFIH (more interactive jobs). If the system is not saturated, FCFS is used. IVS was never implemented and tested in a real environment, as the project finished with the Ph.D. thesis. Feitelson and Naaman [3] published work about self-tuning systems in 1999. Modern operating systems are highly parameterized, so that the administrative staff is forced to use a trial-and-error approach for optimizing these parameters. A better way would be to automate this process. The idea is to look at past information (i.e. log files), use this information as input for simulations with different parameter values, and evaluate them. Genetic algorithms are applied to derive new parameter values. With these genetically derived parameter values, simulations are again run in the idle loop of the machine to conduct a systematic search for optimal parameter values. The authors call such systems, which learn about their environment, self-tuning, as the system itself automatically searches for optimized parameter values. In a case study for scheduling batch jobs on an iPSC/860 they found out that with the self-tuning search procedure the overall system utilization can be improved from 88% (with the default parameters) to 91%. This means that the number of resources lost to fragmentation is reduced by one quarter.
3 Algorithms
In this section the different versions of the dynP (for dynamic Policy) scheduler family and the history of their development are presented. We start with the basic dynP scheduler, which needs a lower and an upper bound as parameters. Then follows the self-tuning dynP scheduler with an introduction to the simple decider and its disadvantages. Finally, the new, advanced decider for the self-tuning dynP scheduler is presented.

3.1 The Basic dynP Scheduler
We started our work with two job sets which were derived from traces of our 96-node hpcLine cluster. This machine is managed by CCS, a long-term running project at the PC2. All policies are combined with (conservative) backfilling [4]. Currently, CCS is configured to use FCFS for scheduling jobs. The question is whether performance has suffered because we used FCFS instead of SJF or LJF. So we developed a simulation framework for evaluating the three scheduling policies with two trace-based job sets from our machine. The results show that FCFS is a good average for both job sets [15]. The other policies show opposing results: for the first job set SJF is better than FCFS and LJF is worst, and for the second job set LJF is the best, followed by
FCFS and SJF is worst. From that we developed the idea of dynamically switching the scheduling policy during runtime. A decision criterion was needed to decide when to switch from one policy to another. For that we used the average estimated runtime of all jobs currently in the waiting queue. The decider is invoked every time a new job is submitted and the algorithm works as follows:

    basic_dynP_algorithm() {
        IF (jobs in waiting queue >= 5) {
            AERT = average estimated runtime of all jobs currently in the waiting queue;
            IF      (0 < AERT <= lower_bound)           { switch to SJF;  }
            ELSE IF (lower_bound < AERT <= upper_bound) { switch to FCFS; }
            ELSE IF (upper_bound < AERT)                { switch to LJF;  }
            reorder waiting queue according to new policy;
        }
    }
Note that we are using a threshold of 5 jobs to prevent unnecessary policy switches if the waiting queue is too short. An experimental search is done to find appropriate parameter values for the two bounds. With the right settings this basic dynP scheduler outperforms FCFS for both job sets. The different behavior for the two job sets (with the same bounds) is obvious when looking at the usage of the three policies. For the first job set (with more short jobs) FCFS and SJF are used for more than three quarters of the whole schedule time. The second job set consisted of more long jobs, so that LJF was used most of the time.
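As a concrete illustration, the following small Python sketch mirrors the pseudocode above. The bound values 7200 s and 9000 s are the ones used later in Sec. 5 for the basic dynP scheduler; the function and variable names are ours, chosen for the example, and not CCS code.

    def basic_dynp_policy(waiting_runtimes, lower_bound=7200, upper_bound=9000, current="FCFS"):
        """waiting_runtimes: estimated runtimes (in seconds) of all queued jobs."""
        if len(waiting_runtimes) < 5:        # too few waiting jobs: keep the current policy
            return current
        aert = sum(waiting_runtimes) / len(waiting_runtimes)
        if 0 < aert <= lower_bound:          # queue dominated by short jobs
            return "SJF"
        if lower_bound < aert <= upper_bound:
            return "FCFS"
        return "LJF"                         # queue dominated by long jobs

    # e.g. five waiting jobs with an average estimated runtime of 4 000 s select SJF:
    print(basic_dynp_policy([600, 1800, 3600, 7200, 6800]))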
3.2 The Self-Tuning dynP Scheduler
One remaining problem is that a long-lasting trial-and-error process is needed to find proper values for the two bounds. The fact that our simulation environment now works with full schedules¹, together with Feitelson's and Naaman's work about self-tuning systems [3], brought us to the idea of the self-tuning dynP scheduler: let the scheduler generate full schedules for each of the three strategies in every scheduling step, and switch to the policy which generates the best schedule for the current situation.

    core_self_tuning_dynP_algorithm() {
        mySchedule = JobQueue.generateSchedule("FCFS");
        FCFS = mySchedule.getQuality(QualityParameter);
        mySchedule = JobQueue.generateSchedule("SJF");
        SJF = mySchedule.getQuality(QualityParameter);
        mySchedule = JobQueue.generateSchedule("LJF");
        LJF = mySchedule.getQuality(QualityParameter);
        // call the decider
        switch_to_best_policy(FCFS, SJF, LJF);
    }
¹ A newly submitted job is directly placed in the schedule and a proposed start time is assigned to it. This feature makes it possible to work with advanced reservations.

The quality parameter in getQuality() specifies the metric for evaluating the virtual schedule and is one of the following:
– Makespan (MS):
  $MS = \max_{j \in Jobs} j.EndTime$
– Average Response Time (ART):
  $ART = \frac{\sum_{j \in Jobs} (j.EndTime - j.SubmitTime)}{|Jobs|}$
– Average Response Time weighted by Width (ARTwW):
  $ARTwW = \frac{\sum_{j \in Jobs} (j.requestedResources \cdot (j.EndTime - j.SubmitTime))}{\sum_{j \in Jobs} j.requestedResources}$
Note that the best schedule has the lowest quality number. Preliminary evaluations have shown that using the average response time weighted by job width leads to good results for different workloads. By weighting the response time with the job width, jobs requesting more resources have a greater influence on the quality of the schedule. Otherwise small, often insignificant jobs would have the same influence as such large jobs. The initial policy is FCFS.

A Simple Decider
At first we use a quite simple decider mechanism for the switch_to_best_policy() method. In [16] we show that the self-tuning dynP scheduler combined with this simple decider achieves only average results for the two trace-based job sets. The simple decider works as follows:

    switch_to_best_policy(FCFS, SJF, LJF) {
        // the simple decider
        IF (SJF <= LJF) {
            IF (FCFS <= SJF) { newPolicy = "FCFS"; }
            ELSE             { newPolicy = "SJF";  }
        } ELSE {
            IF (FCFS <= LJF) { newPolicy = "FCFS"; }
            ELSE             { newPolicy = "LJF";  }
        }
    }
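To make the interplay between the virtual schedules, the quality metric, and the decider concrete, here is a minimal, self-contained Python sketch. It is an illustration only: the greedy placement in generate_schedule() is plain list scheduling rather than the conservative backfilling used in CCS, and the job representation and all names are assumptions made for this example.

    def generate_schedule(queue, nodes, now, policy):
        """Order the waiting jobs by `policy` and greedily place them on `nodes` resources.
        Jobs are dicts with: submit, width, runtime (estimated runtime in seconds)."""
        key = {"FCFS": lambda j: j["submit"],
               "SJF":  lambda j: j["runtime"],
               "LJF":  lambda j: -j["runtime"]}[policy]
        free_at = [now] * nodes                      # next time each node becomes free
        schedule = []
        for job in sorted(queue, key=key):
            idx = sorted(range(nodes), key=lambda i: free_at[i])[:job["width"]]
            start = max(max(free_at[i] for i in idx), job["submit"])
            end = start + job["runtime"]
            for i in idx:
                free_at[i] = end                     # occupy the chosen nodes
            schedule.append((job, start, end))
        return schedule

    def artww(schedule):
        """Quality metric ARTwW of a virtual schedule (lower is better)."""
        num = sum(j["width"] * (end - j["submit"]) for j, _, end in schedule)
        den = sum(j["width"] for j, _, _ in schedule)
        return num / den

    def simple_decider(fcfs, sjf, ljf):
        """Direct transcription of the simple decider pseudocode above."""
        if sjf <= ljf:
            return "FCFS" if fcfs <= sjf else "SJF"
        return "FCFS" if fcfs <= ljf else "LJF"

    def self_tuning_step(queue, nodes, now):
        quality = {p: artww(generate_schedule(queue, nodes, now, p))
                   for p in ("FCFS", "SJF", "LJF")}
        return simple_decider(quality["FCFS"], quality["SJF"], quality["LJF"])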
When looking at how often each policy was used during the whole schedule to start jobs, we found out that the simple decider preferred to use FCFS and SJF, but seldom used LJF. Therefore, we analyzed all possible combinations of schedule quality numbers (cf. Tab. 1). Note that we are using the following abbreviations in the table:

– with FCFS (SJF and LJF respectively) we mean the quality (ARTwW) of the schedule generated with FCFS
– the three symbols <, =, > are used for comparing the quality numbers

For example, FCFS < SJF means that the FCFS-generated schedule has a lower ARTwW than the SJF-generated schedule and is therefore better. Note that case 4 is split up for counting the cases (cf. Tab. 6) and for better understanding.
Table 1. Behavior of the simple and advanced decider for all combinations of schedule quality numbers

case | combinations                           | simple decider | advanced decider
1    | FCFS = SJF = LJF                       | FCFS           | current policy *
2    | SJF < FCFS, SJF < LJF                  | SJF            | SJF
3    | FCFS < SJF, FCFS < LJF                 | FCFS           | FCFS
4    | LJF < FCFS, LJF < SJF                  |                |
  4a |   FCFS < SJF                           | LJF            | LJF
  4b |   FCFS = SJF                           | LJF            | LJF
  4c |   FCFS > SJF                           | LJF            | LJF
5    | FCFS = SJF, LJF < FCFS (→ LJF < SJF)   | LJF            | LJF
6    | FCFS = SJF, FCFS < LJF (→ SJF < LJF)   |                |
  6a |   current policy = FCFS                | current policy | current policy
  6b |   current policy = SJF                 | FCFS           | current policy *
  6c |   current policy = LJF                 | FCFS           | FCFS
7    | FCFS = LJF, SJF < FCFS (→ SJF < LJF)   | SJF            | SJF
8    | FCFS = LJF, FCFS < SJF (→ LJF < SJF)   |                |
  8a |   current policy = FCFS                | current policy | current policy
  8b |   current policy = SJF                 | FCFS           | FCFS
  8c |   current policy = LJF                 | FCFS           | current policy *
9    | SJF = LJF, FCFS < SJF (→ FCFS < LJF)   | FCFS           | FCFS
10   | SJF = LJF, SJF < FCFS (→ LJF < FCFS)   |                |
  10a|   current policy = FCFS                | SJF            | SJF
  10b|   current policy = SJF                 | current policy | current policy
  10c|   current policy = LJF                 | SJF            | current policy *
The Advanced Decider
In Tab. 1 four cases are marked with an asterisk (cases 1, 6b, 8c, and 10c). In these cases the simple decider favors FCFS (three times) and SJF. We developed a new, more advanced decider which generates the decisions found in the last column. Note that in three cases no exact decision is possible based on the quality of the three schedules and the current policy:

– case 6c: the current policy (LJF) needs to be changed, as it is obviously the worst. FCFS or SJF can be taken, as both are equal. We choose FCFS, as it might be beneficial for the average response time of the generated schedule.
– case 8b: similar to case 6c, but FCFS or LJF can be taken.
– case 10a: similar to case 6c, but SJF or LJF can be taken. We choose SJF for preferring short jobs.
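Read as a whole, the last column of Tab. 1 boils down to one compact rule: keep the current policy whenever it is among the best schedules, otherwise switch to the unique best policy, and if two other policies are tied for best, prefer FCFS over SJF over LJF (cases 6c, 8b, and 10a above). The following Python sketch is our compact reading of that rule, not code taken from the scheduler itself; quality values are assumed to be lower-is-better.

    def advanced_decider(quality, current):
        """quality: dict mapping "FCFS"/"SJF"/"LJF" to the ARTwW of the corresponding
        virtual schedule (lower is better); current: the currently active policy."""
        best_value = min(quality.values())
        best = [p for p in ("FCFS", "SJF", "LJF") if quality[p] == best_value]
        if current in best:          # e.g. cases 1, 6a, 6b, 8a, 8c, 10b, 10c of Tab. 1
            return current
        if len(best) == 1:           # a unique best policy, e.g. cases 2, 3, 4, 5, 7, 9
            return best[0]
        # two policies tie for best and the current one is not among them:
        # case 6c or 8b -> FCFS, case 10a -> SJF
        return "FCFS" if "FCFS" in best else "SJF"

    # Example for case 10a (SJF = LJF < FCFS, current policy FCFS) -> "SJF"
    print(advanced_decider({"FCFS": 3.0, "SJF": 2.0, "LJF": 2.0}, "FCFS"))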
4 Workloads
In previous work we used two job sets which were 3-month traces from our machine from the year 2000. They consist of roughly 8 000 jobs and are described in more detail in [15]. For this work we generated a job trace of the complete year 2001. Additionally, we downloaded three job traces from Feitelson's Parallel Workload Archive [17] which have already been used in many other publications. We used traces from the Cornell Theory Center (CTC), the Swedish
Royal Institute of Technology (KTH), and the San Diego Supercomputer Center (SDSC). All logs were derived from IBM SP2 machines. Besides the standard job information (e.g. job width, submit time, etc.) these three traces also hold information about the estimated runtime of the jobs. This information is essential when working with a backfilling scheduler. The four job traces were then analyzed to build synthetic job sets with 10 000 jobs each. These synthetic job sets retain the characteristics of the original traces (e.g. unused nodes during the night or on weekends). This mechanism of analyzing the trace and generating new jobs from the obtained information has some advantages:

1. Job sets of various sizes (number of jobs) can be generated. This method is used here.
2. The information from the trace analysis can be modified to generate job sets with different characteristics.
3. Other modifications can easily be applied, e.g. large jobs are split up in width, so that the maximum job width is only 64, but the total area of all jobs is not changed.

The complete process of analyzing the trace and generating synthetic job sets is described in [9]. The four most important job properties to be analyzed are: submission time (more precisely: interarrival time), width (number of requested resources), estimated runtime, and actual runtime. The submission time of all jobs is best expressed by a Weibull distribution, $f(x) = 1 - e^{-(x/\beta)^{\alpha}}$. The three other properties could not be expressed by any distribution, although they are dependent on each other. Hence, a 3-dimensional matrix is generated which holds probability values for all possible combinations of width, estimated and actual runtime. Tab. 2 shows the output of the trace analyzer. Note that, except for the PC2 trace, jobs with a longer actual than estimated runtime occur in the traces. CCS kills jobs that try to run longer than estimated. The job generator is started with the information of the trace analyzer. When a new job is to be generated, first the submission time is computed from a random number and the Weibull distribution of job interarrival times. Another three random numbers and the probability matrix for the width, estimated and actual runtime are used to generate the three missing parameters of the new job. This way of generating job sets was also used before in [7, 1]. The four job sets we constructed are described in Tab. 3. Note that again the maximum values for the actual runtime are often greater than estimated, so this property was taken over from the traces. Of course our simulation environment kills all jobs that try to run longer than estimated. Fig. 1 shows distributions for the estimated and actual runtimes. The staircase shape of the estimated runtime curves indicates that users tend to use common or round values (e.g. for PC2 syn: 10 minutes (600 seconds), 30 minutes (1800 seconds), or 2 hours (7200 seconds); these estimates are used for about 52% of all jobs). The curves for the actual runtime are smoother, as jobs usually have different runtimes and do not end at specific times.
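The generation scheme just described can be illustrated with a short sketch: interarrival times are drawn from the Weibull distribution by inverse-transform sampling, and the remaining three job properties are drawn from a probability table over (width, estimated runtime, actual runtime) triples. The table format, names, and parameters below are assumptions made for the illustration, not the actual output format of the trace analyzer.

    import math
    import random

    def sample_interarrival(alpha, beta):
        """Inverse-transform sample of the Weibull distribution f(x) = 1 - exp(-(x/beta)^alpha)."""
        u = random.random()
        return beta * (-math.log(1.0 - u)) ** (1.0 / alpha)

    def sample_job(prob_table):
        """prob_table: list of ((width, est_runtime, act_runtime), probability) entries
        whose probabilities sum to 1.0 (a flattened view of the 3-dimensional matrix)."""
        r, acc = random.random(), 0.0
        for (width, est, act), p in prob_table:
            acc += p
            if r <= acc:
                return width, est, act
        return prob_table[-1][0]

    def generate_job_set(n, alpha, beta, prob_table):
        jobs, t = [], 0.0
        for _ in range(n):
            t += sample_interarrival(alpha, beta)
            width, est, act = sample_job(prob_table)
            # the actual runtime may exceed the estimate; the simulator kills such jobs at runtime
            jobs.append({"submit": t, "width": width, "estimated": est, "actual": act})
        return jobs

    # e.g. with the PC2 parameters from Tab. 2: generate_job_set(10000, 0.25, 40, prob_table)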
Table 2. Output of the trace analyzer

                                     | CTC trace        | KTH trace      | PC2 trace                | SDSC trace
number of jobs                       | 79 302           | 28 490         | 35 094                   | 67 667
maximum job width                    | 336              | 100            | 96                       | 128
width of machine (batch)             | 430              | 100            | 96                       | 352
average requested resources          | 10.72            | 7.68           | 6.34                     | 10.53
estimated runtime (avg)              | 24 324 s         | 13 677 s       | 11 716 s                 | 14 337 s
estimated runtime (min)              | 0 s              | 60 s           | 1 s                      | 0 s
estimated runtime (max)              | 64 800 s         | 216 000 s      | 1 209 600 s              | 172 800 s
actual runtime (avg)                 | 10 983 s         | 8 876 s        | 4 346 s                  | 6 119 s
actual runtime (min)                 | 0 s              | 0 s            | 1 s                      | 0 s
actual runtime (max)                 | 71 998 s         | 226 709 s      | 604 800 s                | 510 209 s
jobs with actual > estimated runtime | 7 180 (= 9.05 %) | 478 (= 1.68 %) | 0 (= 0 %)                | 4 327 (= 6.39 %)
interarrival time (avg)              | 369 s            | 1 031 s        | 870 s                    | 934 s
interarrival time (min)              | 0 s              | 0 s            | 0 s                      | 0 s
interarrival time (max)              | 164 472 s        | 327 952 s      | 313 861 s                | 79 503 s
likelihood of interarrival time      | 0.0063932        | 0.0116883      | 0.0659942                | 0.0104926
Weibull: α                           | 0.35             | 0.35           | 0.25                     | 0.4
Weibull: β                           | 60               | 200            | 40                       | 290
original scheduler                   | EASY             | EASY           | FCFS + cons. backfilling | EASY
Table 3. Properties of the four synthetic job sets

                                     | CTC syn        | KTH syn        | PC2 syn     | SDSC syn
number of jobs                       | 10 000         | 10 000         | 10 000      | 10 000
maximum job width                    | 330            | 100            | 96          | 128
width of machine (batch)             | 430            | 100            | 96          | 352
average requested resources          | 10.67          | 7.73           | 6.48        | 10.71
estimated runtime (avg)              | 24 260 s       | 13 643 s       | 11 516 s    | 14 305 s
estimated runtime (min)              | 0 s            | 60 s           | 1 s         | 0 s
estimated runtime (max)              | 64 800 s       | 216 000 s      | 604 800 s   | 64 800 s
actual runtime (avg)                 | 10 924 s       | 9 060 s        | 4 442 s     | 6 085 s
actual runtime (min)                 | 0 s            | 0 s            | 1 s         | 0 s
actual runtime (max)                 | 71 998 s       | 215 965 s      | 604 800 s   | 64 884 s
jobs with actual > estimated runtime | 929 (= 9.29 %) | 178 (= 1.78 %) | 0 (= 0 %)   | 640 (= 6.40 %)
interarrival time (avg)              | 287 s          | 1 047 s        | 1 013 s     | 945 s
interarrival time (min)              | 0 s            | 0 s            | 0 s         | 0 s
interarrival time (max)              | 24 815 s       | 191 258 s      | 236 546 s   | 76 019 s
However, steps are also visible, e.g. for PC2 syn at the 2-hour mark. This indicates that many jobs that were estimated to run for 2 hours also ended at that time. Likely these jobs were underestimated, as CCS kills jobs that try to run longer than estimated.
Fig. 1. Distributions of job runtime for the four synthetic job sets (left panel: estimated runtime, right panel: actual runtime; x-axis: runtime in seconds, y-axis: accumulated percentage of jobs; curves: CTC_syn, PC2_syn, KTH_syn, SDSC_syn).
5 Evaluation
It is common practice to evaluate new or modified scheduling algorithms with simulations before they are actually deployed in a real environment. For that we developed a simulation framework called MuPSSiE (Multi Purpose Scheduling Simulation Environment). Today, many scheduling systems only look at the current time and try to start as many waiting jobs as possible.² With that, a scheduler cannot specify a starting time for a submitted job. A unique feature of our simulation environment is that it always generates the full schedule for all jobs in the system. Two advantages of such an approach are:

1. Proposed start times are assigned to jobs right after their submission, as they are directly placed in the schedule.
2. Advanced reservations with guaranteed start times are possible.

² Of course, information about the near future (i.e. the runtime of already running jobs) is also needed when backfilling is applied.

Like backfilling schedulers, such an approach needs information about the job's estimated runtime. Reservations also need information about their start time, which can either be a keyword (now, asap) or a time/date string. All jobs without reservations are called variable jobs, as they can be started at any time. A rescheduling process is done every time a job ends. At that time, every running and pending job is placed into the schedule. The order in which variable jobs are newly placed is specified by the scheduling policy. Note that the scheduling policy has no influence on reservations. Generating the full schedule is also the basis for the self-tuning scheduler. In principle it would be possible to evaluate this algorithm with job sets which contain both variable jobs and reservations. But as no real job/reservation traces exist, we concentrate only on variable jobs in this context.

The simulation framework basically is a set of stand-alone tools: job set converters/modifiers, job set and schedule analyzers, a schedule viewer, a single-machine scheduler (with several policies and the dynP algorithm implemented), and a multi-site grid scheduler that can work with advanced reservations. It is developed and implemented in ANSI C++ on a Windows system (Borland C++ Builder), but the simulation runs are done on Linux systems. The execution time of a single simulation run is between a few seconds and up to 24 hours (on AMD Athlon XP 1800+ systems). This strongly depends on the length of the average backlog, as the longer the backlog gets, the longer the rescheduling process takes. Note that the self-tuning dynP scheduler has to compute two additional schedules in each step for comparing the three policies. A detailed survey of the execution time of a self-tuning step showed that it took about 0.05 seconds for an average of 185 jobs (CTC syn, shrinking factor of 0.80) to find the best scheduling policy. For resource management systems like CCS, where 2 minutes are estimated for constructing a partition of nodes, checking and cleaning them, the execution time of the self-tuning step can be neglected, even if it grows.

For evaluating the self-tuning scheduler and the advanced decider we did simulation runs with all combinations of the three quality metrics and the four synthetic job sets. Additional simulation runs for the three single policies (FCFS, SJF, LJF) were done as an evaluation basis. We also used a shrinking factor to decrease the average interarrival time. With that, the 10 000 jobs are submitted in a shorter time. The consequences are: the average backlog grows, the utilization increases, and jobs wait longer for their start.
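The rescheduling step described above (placing every running and waiting job into a full schedule, in the order given by the current policy) can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions: no reservations, estimated runtimes as job lengths, and invented helper names; it is not the MuPSSiE or CCS implementation.

    def peak_usage(profile, t0, t1):
        """Peak number of busy nodes during [t0, t1) for reserved intervals (start, end, width)."""
        points = [t0] + [s for s, e, w in profile if t0 < s < t1]
        return max(sum(w for s, e, w in profile if s <= t < e) for t in points)

    def earliest_start(profile, nodes, width, runtime, not_before):
        """Earliest time >= not_before at which `width` nodes are free for `runtime` seconds
        without moving any job already placed in `profile` (conservative placement)."""
        assert width <= nodes
        candidates = sorted({not_before} | {e for s, e, w in profile if e > not_before})
        for t in candidates:
            if peak_usage(profile, t, t + runtime) + width <= nodes:
                return t
        return candidates[-1]            # safety fallback, normally not reached

    def place_all(running, waiting, nodes, now, policy_key):
        """Rebuild the full schedule: running jobs keep their nodes, waiting jobs are placed
        in policy order and each receives a proposed start time."""
        profile = [(now, j["end"], j["width"]) for j in running]
        plan = []
        for job in sorted(waiting, key=policy_key):
            start = earliest_start(profile, nodes, job["width"], job["estimated"],
                                   max(now, job["submit"]))
            profile.append((start, start + job["estimated"], job["width"]))
            plan.append((job, start))
        return plan

    # e.g. SJF order on a 96-node machine: place_all(running, waiting, 96, now, lambda j: j["estimated"])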
5.1 Metrics
The performance of a scheduler, or the quality of the generated schedule, can be measured with different metrics. In general they can be classified into two groups: owner-centric and user-centric. Owner-centric criteria mainly focus on the schedule as a whole, e.g. the makespan, utilization, loss of capacity [19], or the number of jobs processed over a given time period. Machine owners use such numbers to document how well their machine was used or to emphasize the need for a faster machine. The machine users are generally not interested in such numbers, as they only want the output of their jobs as soon as possible. For them, metrics like the average response or wait time or the slowdown of their jobs are interesting. But the requirements of both groups sometimes overlap. Owners may be interested in short response times for their users, as this somehow represents a quality of service they provide. In this work we use three main metrics in the evaluation: the utilization of a schedule, the average response time (ART), and the average job slowdown (SLD). We further weight each job's response time and slowdown with its width, giving larger jobs a larger influence. Additionally we bound the slowdown by 60 seconds, so that very short jobs are not considered [4].
Utilization (N = total number of resources):

  $UTIL = \frac{\sum_{j \in Jobs} (j.RequestedResources \cdot j.RunTime)}{N \cdot (lastEndTime - firstSubmitTime)}$

Average response time weighted by job width:

  $ARTwW = \frac{\sum_{j \in Jobs} (j.RequestedResources \cdot j.ResponseTime)}{\sum_{j \in Jobs} j.RequestedResources}$

Average slowdown weighted by width and bound by 60 seconds:

  $SLDwW\_60 = \frac{\sum_{j \in Jobs} (j.RequestedResources \cdot j.Slowdown)}{\sum_{j \in Jobs} j.RequestedResources}$
with:

– $j.RunTime = j.EndTime - j.StartTime$
– $j.ResponseTime = j.EndTime - j.SubmitTime$
– $j.Slowdown = \frac{\max(j.ResponseTime, 60)}{\max(j.RunTime, 60)}$

Later we will show diagrams where these three metrics are plotted on the y-axis for different shrinking factors on the x-axis. Note that each data point in a diagram refers to one simulation run. Points connected by a line are simulation runs with equal configurations (quality parameter, decider) and job input. Only the shrinking factor is decreased to simulate a higher workload for the scheduler.
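These definitions translate directly into code. The sketch below computes the three performance metrics for a finished schedule; the job fields (submit, start, end, width) are again illustrative assumptions, not the format used by the simulation framework.

    def evaluate_schedule(jobs, nodes):
        """jobs: list of dicts with submit, start, end, width; nodes: total number of resources N."""
        first_submit = min(j["submit"] for j in jobs)
        last_end = max(j["end"] for j in jobs)
        area = sum(j["width"] * (j["end"] - j["start"]) for j in jobs)
        util = area / (nodes * (last_end - first_submit))

        weights = sum(j["width"] for j in jobs)
        artww = sum(j["width"] * (j["end"] - j["submit"]) for j in jobs) / weights

        def slowdown(j, bound=60.0):
            return max(j["end"] - j["submit"], bound) / max(j["end"] - j["start"], bound)

        sldww_60 = sum(j["width"] * slowdown(j) for j in jobs) / weights
        return util, artww, sldww_60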
5.2 Results
In the following we present the performance of the self-tuning dynP scheduler with the advanced decider. We compare these results against the simple decider and the three basic strategies. When talking about metrics (e.g. ARTwW) we will use the following definitions: with 'quality metrics' we mean the metrics used in the self-tuning scheduler for measuring the three schedules in each step (cf. Sec. 3.2), whereas with 'performance metrics' we are talking about measuring the complete schedule after the simulation has ended. Additionally, when 'response time' is written, the average response time weighted by job width (ARTwW) of the whole schedule is meant.

Comparison of Simple and Advanced Decider
At first we compare the results with the three quality metrics ART, ARTwW, and MS (makespan). In Tab. 4 the most important criteria are presented. In the first column different performance metrics for measuring the complete simulated schedule are given. With
LOC we mean the loss of capacity; when the system is in a saturated state, the equation loss of capacity = 1 - utilization holds. A high loss of capacity indicates that many resources are lost, as jobs are waiting but could not be started (they need more resources than are available). A small loss of capacity indicates that either no jobs were waiting or the machine was fully utilized. The numbers in the table were obtained using a shrinking factor of 1.00. The columns show the performance numbers with the simple or advanced decider and one quality metric for the self-tuning function. Finally, the last three rows of each block show the number of jobs started with each of the three policies during the whole scheduling process.
Table 4. Comparing the results (rows) of the simple and advanced decider with shrinking factor = 1.00 and three different quality metrics. Note that the numbers in 'jobs started with ...' can also be read as 1/100ths of a percent of all jobs.

CTC syn                | ARTwW (simple) | ARTwW (adv.) | ART (simple) | ART (adv.) | MS (simple) | MS (adv.)
ARTwW                  | 25 980 s       | 21 912 s     | 26 082 s     | 21 648 s   | 27 268 s    | 25 829 s
UTIL                   | 75.36 %        | 75.70 %      | 74.02 %      | 74.96 %    | 75.64 %     | 75.08 %
LOC                    | 0.12650        | 0.10911      | 0.13731      | 0.12099    | 0.09054     | 0.10395
jobs started with SJF  | 2 703          | 6 115        | 2 472        | 5 860      | 130         | 551
jobs started with FCFS | 6 817          | 1 296        | 6 418        | 1 144      | 5 295       | 953
jobs started with LJF  | 477            | 2 586        | 1 107        | 2 993      | 4 572       | 8 493

KTH syn                | ARTwW (simple) | ARTwW (adv.) | ART (simple) | ART (adv.) | MS (simple) | MS (adv.)
ARTwW                  | 48 077 s       | 33 299 s     | 49 301 s     | 40 663 s   | 50 497 s    | 50 395 s
UTIL                   | 66.00 %        | 65.64 %      | 66.01 %      | 65.64 %    | 66.14 %     | 66.11 %
LOC                    | 0.16398        | 0.16456      | 0.17121      | 0.18386    | 0.16588     | 0.16629
jobs started with SJF  | 2 865          | 7 885        | 2 768        | 5 730      | 142         | 322
jobs started with FCFS | 6 854          | 1 345        | 6 313        | 1 313      | 5 951       | 686
jobs started with LJF  | 278            | 767          | 916          | 2 954      | 3 904       | 8 989

PC2 syn                | ARTwW (simple) | ARTwW (adv.) | ART (simple) | ART (adv.) | MS (simple) | MS (adv.)
ARTwW                  | 43 124 s       | 31 493 s     | 42 795 s     | 31 994 s   | 47 747 s    | 47 260 s
UTIL                   | 42.60 %        | 42.49 %      | 42.60 %      | 42.58 %    | 42.60 %     | 42.60 %
LOC                    | 0.22423        | 0.23792      | 0.22147      | 0.23937    | 0.22724     | 0.22722
jobs started with SJF  | 2 553          | 7 550        | 2 615        | 6 651      | 65          | 391
jobs started with FCFS | 7 344          | 1 062        | 7 106        | 766        | 6 789       | 229
jobs started with LJF  | 103            | 1 388        | 279          | 2 673      | 3 146       | 9 380

SDSC syn               | ARTwW (simple) | ARTwW (adv.) | ART (simple) | ART (adv.) | MS (simple) | MS (adv.)
ARTwW                  | 9 969 s        | 9 936 s      | 9 963 s      | 9 921 s    | 9 982 s     | 9 991 s
UTIL                   | 31.32 %        | 31.32 %      | 31.32 %      | 31.32 %    | 31.32 %     | 31.32 %
LOC                    | 0.00133        | 0.00119      | 0.00135      | 0.00124    | 0.00135     | 0.00135
jobs started with SJF  | 317            | 3 924        | 309          | 1 031      | 1           | 0
jobs started with FCFS | 9 561          | 4 206        | 9 517        | 4 797      | 9 570       | 3 087
jobs started with LJF  | 121            | 1 869        | 173          | 4 171      | 428         | 6 912
Obviously the SDSC syn job set is not really useful, as the schedule seems to be quite empty (low utilization), and the schedules do not change very much with different deciders or quality metrics (e.g. the response times are almost the same). Combined with the low utilization, the loss of capacity shows that almost no jobs were waiting when resources could have been utilized. When looking at the user-centric performance metric ARTwW, the advanced decider is almost always better than the simple decider, with only one exception: MS as quality metric and the SDSC syn job set (by 0.09%). The maximum benefit is reached with the KTH syn job set and the quality metric ARTwW, where the response time gets better by 30.74%. A user-centric quality metric for the self-tuning scheduler also improves the response time of the whole simulated schedule more than it improves the utilization. On the other hand, the MS quality metric does not improve the total utilization that much. Because the job area does not change, a shorter makespan directly results in a higher utilization. Possibly the utilization can be improved more easily with the MS quality metric if a higher workload is used (shrinking factor < 1.00). Overall, the ARTwW quality metric generates the best results for KTH syn and PC2 syn; only for CTC syn is ART slightly better. The usage of the policies over time shows that the simple decider favors FCFS regardless of which quality metric or job set is used. Especially when the utilization should be increased, it is better to use the LJF policy. This behavior is seen for the advanced decider with the MS quality metric, as roughly 7 000 or more jobs are started with LJF. In case of the two user-centric quality metrics ARTwW and ART, only between 750 and 1 350 jobs are started with FCFS (for CTC syn, KTH syn and PC2 syn). Here SJF is the best choice for optimizing the scheduler's performance (both for response time and utilization), as about two-thirds of all jobs are started with that policy. In general the advanced decider achieves better results than the simple decider, especially when focusing on user-centric performance metrics, and this does not depend on the quality metric used (even when using MS). Except for the KTH syn job set this also holds for the owner-centric performance metrics. The advanced decider does not favor any specific policy, as was the case with the simple decider and FCFS. Hence, completely different numbers for the jobs started with each policy are achieved with the advanced decider. This shows that the self-tuning scheduler with the advanced decider really reacts to different job set characteristics.

Detailed Look at the Advanced Decider
After this brief performance overview we now take a more detailed look at the advanced decider. In Tab. 5 we used a shrinking factor of 1.00 and ARTwW as quality metric for the self-tuning function. Again some average values, the utilization, the loss of capacity, and the number of jobs started with each policy are shown. In addition, the number of switches to each policy, the number of cases with the same policy as before, and the average size of the backlog (length of the waiting queue) at the start of the self-tuning function are printed.
Table 5. Detailed results of the self-tuning dynP scheduler and the advanced decider using ARTwW as quality metrics. The numbers are equal to the third column of Tab. 4

                                        | CTC syn  | KTH syn  | PC2 syn  | SDSC syn
average actual runtime                  | 10 927 s | 9 063 s  | 4 442 s  | 6 085 s
average wait time                       | 3 178 s  | 6 925 s  | 4 577 s  | 44.02 s
ART                                     | 14 105 s | 15 989 s | 9 018 s  | 6 129 s
ARTwW                                   | 21 912 s | 33 299 s | 31 493 s | 9 936 s
utilization                             | 75.70 %  | 65.64 %  | 42.49 %  | 31.32 %
LOC                                     | 0.10911  | 0.16456  | 0.23792  | 0.00119
jobs started with SJF                   | 6 115    | 7 885    | 7 550    | 3 924
jobs started with FCFS                  | 1 296    | 1 345    | 1 062    | 4 206
jobs started with LJF                   | 2 586    | 767      | 1 388    | 1 869
total tries of switches                 | 16 946   | 18 840   | 15 552   | 509
switches to SJF                         | 4.49 %   | 5.80 %   | 3.61 %   | 3.73 %
switches to FCFS                        | 4.59 %   | 5.85 %   | 3.64 %   | 3.93 %
switches to LJF                         | 0.59 %   | 0.45 %   | 0.23 %   | 0.98 %
same policy                             | 90.32 %  | 87.92 %  | 92.52 %  | 91.36 %
average backlog at start of self-tuning | 22.05    | 14.48    | 15.04    | 6.36
Like before, the SDSC syn job set is not very helpful for our evaluations, as jobs only wait some 44 seconds on average, which is roughly 0.7% of the average actual runtime. For the other job sets the percentages are 35% for CTC syn, 84% for KTH syn, and even 91% for PC2 syn. Also the numbers for the 'total tries of switches' and 'average backlog at start of self-tuning' show that submitted jobs in the SDSC syn job set are started right away, without getting delayed. Additionally, the low number of total tries of policy switches shows that the self-tuning functionality only makes sense when more than one job is in the schedule and not yet running. Later we will use the shrinking factor to increase the workload for the SDSC syn job set in order to achieve useful results. We now take a look at the number of policy switches. Obviously almost all calls (> 90%) to the advanced decider (total tries of switches) result in using the same policy as before, although the average size of the backlog at the start of the self-tuning function is considerably large (again except for SDSC syn). When a decision is made and the new policy is different from the current policy, most of the switches are done between SJF and FCFS for all four job sets. Only a minority of policy switches is done to LJF. Given that most of the jobs are started with SJF, the advanced decider seems to use FCFS only for one or two job starts and then switch back to SJF. But if it is decided to switch to LJF, many jobs are started with that policy. So the switches to LJF are quite effective and useful, whereas the decisions to switch to FCFS are often undone after a short time. We also did a case analysis, where the occurrence of each case in the decision algorithm (cf. Tab. 1) is counted. Tab. 6 shows the results. Over two-thirds of all cases are 6b (FCFS is equal to SJF, LJF is worse, the current policy is SJF, and SJF is chosen).
Table 6. Case analysis of the advanced decider using ARTwW as quality metrics. Refer to Tab. 1 for details

                                                    | CTC syn          | KTH syn          | PC2 syn          | SDSC syn
case: 1                                             | 2 549            | 1 693            | 2 209            | 271
cases: 2 & 7                                        | 1 324            | 1 756            | 920              | 27
cases: 3 & 9                                        | 692              | 1 029            | 535              | 15
cases: 4b & 5                                       | 1 298            | 403              | 304              | 31
case: 4a                                            | 5                | 5                | 2                | 1
case: 4c                                            | 4                | 3                | 0                | 0
case: 6a                                            | 81               | 116              | 58               | 8
case: 6b                                            | 10 669           | 13 664           | 11 449           | 150
case: 6c                                            | 100              | 81               | 36               | 5
case: 8a                                            | 224              | 89               | 39               | 1
case: 8b                                            | 0                | 1                | 0                | 0
case: 8c                                            | 0                | 0                | 0                | 0
case: 10a                                           | 0                | 0                | 0                | 0
case: 10b                                           | 37               | 26               | 17               | 1
case: 10c                                           | 0                | 0                | 0                | 0
total                                               | 16 946           | 18 840           | 15 552           | 509
differences to the simple decider (1+6b+8c+10c)     | 13 218 (=78.00%) | 15 357 (=81.51%) | 13 658 (=87.82%) | 421 (=82.71%)
taking the current policy (1+5+6a+6b+8a+8c+10b+10c) | 14 858 (=87.68%) | 15 932 (=84.56%) | 14 076 (=90.51%) | 462 (=90.76%)
Later, Fig. 4 shows that SJF is the best single policy for a shrinking factor of 1.00. Hence, the advanced decider chooses SJF most of the time. And as FCFS is not much worse than (or equal to) SJF, it can often happen that FCFS and SJF generate equal schedules in a scheduling step, so the current policy SJF is used further on. Case 6b also affects the percentages of decisions that differ from the simple decider and of cases where the current policy is taken, so that both are above 78%. Looking at the three critical cases (6c, 8b, and 10a) from Sec. 3.2 (a new policy has to be taken but the decider can choose from two) shows that case 6c is seldom used (< 1%) and the other two never. Thus the advanced decider only favors FCFS in an insignificant number of cases, so that the advanced decider can be called fair with regard to the used policies.

Performance at Different Workloads
The following shrinking factors for the interarrival time are used to generate varying workloads:

– CTC syn: from 1.00 down to 0.65
– KTH syn: from 1.00 down to 0.60
– PC2 syn: from 0.80 down to 0.40
– SDSC syn: from 0.60 down to 0.30
Note that these values for the shrinking factor are chosen to generate useful performance numbers for medium utilizations. If the values are decreased further, the saturated state is reached.
Fig. 2. Average Response Time weighted by width (ARTwW) of complete schedules: comparing the three quality parameters for the advanced and simple decider. Each data point refers to a single simulation run using different shrinking factors. The legend applies to all four diagrams. Smaller values are better. (Panels: CTC_syn, KTH_syn, PC2_syn, SDSC_syn; x-axis: shrinking factor; y-axis: ARTwW of the whole schedule in 10 000 sec.; legend: simple decider + ARTwW, advanced decider + ART, advanced decider + ARTwW, advanced decider + Makespan.)
In the diagrams we plot the shrinking factor on the x-axis and one of the three metrics introduced in Sec. 5.1 on the y-axis. Each data point refers to a single simulation run. Reducing the shrinking factor (and thereby the average interarrival time) increases the workload for the scheduler. In Fig. 2 the performance of the simple and advanced decider with the three quality metrics from Sec. 3.2 is compared. The quality metrics ART and ARTwW generate better results than the makespan criterion, which in turn improves the utilization more (not shown in a diagram). As can be seen in Tab. 4, the MS quality parameter for the advanced decider prefers to take LJF. LJF (combined with backfilling) is known for improving the utilization, as it allows the backfilling routine to start jobs in the holes of the LJF schedule. Comparing the simple and advanced decider for the ARTwW quality parameter shows that the advanced decider is mostly superior when looking at the response times. Noticeable improvements are achieved for the PC2 syn job set, where the advanced decider is approximately 30% better than the simple decider with the same quality metric. For KTH syn the improvements are still around 20%, whereas for CTC syn both deciders are not that different. If a different metric (slowdown) is used for the comparison (Fig. 3 only shows the job sets CTC syn and PC2 syn), the order changes, especially for the CTC syn job set. Only in that case does the simple decider achieve smaller (i.e. better)
slowdown values than the advanced decider. For PC2 syn, on the other hand, the difference between the simple and advanced decider even grows for small shrinking factors, and the slowdown value remains almost constant over a wide range of shrinking factors. Note that the PC2 syn job set generates 10 times higher slowdowns than CTC syn. In the following we concentrate on the quality parameter ARTwW for the self-tuning scheduler with the advanced decider. In three diagrams (performance metric ARTwW in Fig. 4, SLDwW 60 in Fig. 5, and utilization in Fig. 8) we compare the advanced decider with quality metric ARTwW, the three single strategies FCFS, SJF, and LJF (each combined with conservative backfilling), and the basic dynP scheduler (with 7200 s for the lower and 9000 s for the upper bound). These bounds for basic dynP were the best in [15] for two job sets extracted from the PC2 workload. Here the old bound setting for the basic dynP scheduler performs badly for the four synthetic job sets. This shows that in a real-world environment the system administrator would have to re-adapt these bounds once in a while. Especially when the overall job set characteristics often change, a self-tuning scheduler is easier to handle. The single policy SJF combined with conservative backfilling achieves the lowest response times and slowdowns for all four job sets, also over a wide range of shrinking factors. For maximum utilization LJF is the best choice. In some way this is interesting: FCFS is commonly used in many resource management systems, as it might be a good tradeoff between low response times and high utilizations on many machines.
Fig. 3. Slowdown weighted by width and bound by 60 seconds (SLDwW 60) of complete schedules. Similar diagrams as in Fig. 2 but with a different metrics on the y-axis. Again, smaller values are better. (Panels: CTC_syn, KTH_syn, PC2_syn, SDSC_syn; x-axis: shrinking factor; y-axis: SLDwW_60 of the whole schedule.)
Fig. 4. ARTwW of complete schedules: comparing the advanced decider with quality parameter ARTwW with the three single policies FCFS, SJF, and LJF (each combined with conservative backfilling) and basic dynP with lower bound 7200 s and upper bound 9000 s (this setting achieved the best results in [15]). (Panels: CTC_syn, KTH_syn, PC2_syn, SDSC_syn; x-axis: shrinking factor; y-axis: ARTwW of the whole schedule in 10 000 sec.)
For the PC2 syn job set the advanced decider (with the ARTwW quality parameter) achieves a performance similar to SJF, but cannot outperform it. For the CTC syn job set and shrinking factors smaller than 0.9 the performance drops significantly. The advanced decider cannot compete with FCFS and even falls back to the performance of LJF, the worst policy in this comparison. For the other two job sets the advanced decider generally follows FCFS. Note that these statements hold for response time (ARTwW in Fig. 4) as well as for slowdown (SLDwW 60 in Fig. 5). The diagrams for the utilization (Fig. 8) show that for small shrinking factors all schedulers achieve similar or equal utilizations. As soon as the saturated state is reached, the values differ. Obviously the LJF scheduler achieves the highest utilizations (95% and more), as the sorting of the waiting queue might leave much room for the backfilling method. It is very interesting and at the same time inexplicable that the self-tuning scheduler with the advanced decider and quality parameter ARTwW can follow SJF and/or FCFS for 3 of the 4 job sets, but not for CTC syn. Pure curiosity led us to the idea of computing the slowdown for different actual runtimes separately. Arbitrarily we used the following categorization: ≤5m, ≤10m, ≤1h, ≤12h, and ≥12h. Almost 50% of all jobs fall in the first class with a
maximum actual runtime of 5 minutes. Therefore we analyzed the results again and now computed the average slowdown weighted by width with a 300-second (= 5 minutes) bound. The diagrams in Fig. 7 (SLDwW 300) are similar to the 60-second bound version, except that the scaling of the y-axis has changed to smaller values. This shows that very short running jobs (roughly 50% of all jobs) have a great influence on the overall quality of the schedule, but they have no influence on the ranking of the schedulers themselves. Still, these small jobs are considered in the self-tuning step (Sec. 3.2). So another solution would be to neglect these small jobs there, too, but the problem is that the self-tuning process only knows about runtime estimates (and not actual values). One possible reason for the poor performance of the self-tuning scheduler could be false or inaccurate runtime estimates. Therefore we ran the simulations again, but now with each job's estimated runtime set to its actual runtime. Comparing the diagrams in Fig. 6 with Fig. 4 shows only a small performance gain of the advanced decider. SJF is still the best choice and the self-tuning scheduler comes close to its performance. But the perfect-estimates diagrams show a different interesting aspect: good runtime estimates become more important with an increased workload (i.e. smaller shrinking factor; saturated state). As an example, we take a look at the SJF curve and the CTC syn job set. In both cases the response time starts at about 20 000 seconds (for 1.00 as shrinking factor), with either real or perfect estimates used. As soon as the workload is increased (e.g. to a shrinking factor of 0.7) the difference is remarkable (102 725 s for perfect and 145 212 s for real estimates). Even bigger differences can be observed for the PC2 syn job set.
Fig. 5. SLDwW 60 of complete schedules. Similar diagrams as in Fig. 4, but with a different metrics on the y-axis. (Panels: CTC_syn, KTH_syn, PC2_syn, SDSC_syn; x-axis: shrinking factor; y-axis: SLDwW_60 of the whole schedule.)
Fig. 6. Similar diagrams as in Fig. 4 (response time), but now with perfect estimates, i.e. estimated runtime = actual runtime. Note that the curve for basic dynP is left out. (Panels: CTC_syn, KTH_syn, PC2_syn, SDSC_syn; x-axis: shrinking factor; y-axis: ARTwW of the whole schedule in 10 000 sec.; legend: advanced decider + ARTwW and FCFS/SJF/LJF + cons. backfilling, all with perfect estimates.)
6 Conclusions and Future Work
In this paper we presented the process of developing a job scheduler family with dynamic policy switching and self-tuning ability. The basic idea of self-tuning is to generate complete schedules for each of the three policies (FCFS, SJF, LJF) in each scheduling step. These schedules are then measured by means of a quality metric, and the policy which generates the best schedule is chosen. We use the average response time (unweighted and weighted by the job width) and the makespan as quality metrics. At first we developed a simple decider mechanism which already achieved reasonably good results for two trace-based job sets. When investigating the behavior of that decider in detail, we found out that it sometimes prefers FCFS when other policies obviously should have been taken. Therefore we developed the advanced decider. We evaluated the self-tuning dynP scheduler with this advanced decider in a simulation environment. To have more diversity in the results we used four synthetic job sets, which are based on real trace logs from four different computing centers. Additionally we decreased the average interarrival time between two jobs to generate a higher workload for the scheduler.
The results show that the advanced decider achieves a better performance (up to 30%) than the simple decider using the same quality metric. The best quality metric to use is ARTwW. A future deployment of the algorithm in the resource management software CCS will show whether the simulated results can be achieved, or whether the influence of overestimation is too large. A real-world deployment also has to show whether or not the non-predictability of the scheduler discourages users from working with the system. Additionally, users might submit fake jobs (long estimated runtime and short or no actual runtime) to trick the system, so that their real computing job is favored. Future versions of the self-tuning dynP scheduler might come with several enhancements. With a 'slackness' option the schedule quality of the current schedule is virtually increased by e.g. 5%; hence a new policy has to be better than the current policy by 5%. With a 'reduced future' option enabled, the qualities of the three schedules in each self-tuning step are only computed from e.g. the first 20 started jobs or all jobs that are started within the next 6 hours. Thereby jobs which will be started far in the future are not considered. Additionally, more scheduling policies (e.g. FFIA) and quality parameters (e.g. slowdown and wait time) might be added. Also, combining several quality parameters for the self-tuning process might improve the performance. A desirable performance curve for the self-tuning dynP scheduler would always stay below or equal to the best single policy at any workload.
[Fig. 7 panels: CTC_syn, KTH_syn, PC2_syn, and SDSC_syn; y-axis: SLDwW_300 of the whole schedule; x-axis: shrinking factor; curves: advanced decider + ARTwW, basic dynP (lb 7200 sec., ub 9000 sec.), FCFS + cons. backfilling, SJF + cons. backfilling, and LJF + cons. backfilling.]
Fig. 7. Similar diagrams as in Fig. 5 (slowdown), but now with a bound of 300 seconds.
[Fig. 8 panels: CTC_syn, KTH_syn, PC2_syn, and SDSC_syn; y-axis: utilization of the whole schedule (in %); x-axis: shrinking factor; curves as in Fig. 7.]
Fig. 8. Utilization of complete schedules. Similar diagrams as in Fig. 4, but with a different metric on the y-axis. Note, only here higher values on the y-axis are better.
Preemption Based Backfill
Quinn O. Snell, Mark J. Clement, and David B. Jackson
Brigham Young University, Provo, Utah 84602
{snell,clement}@cs.byu.edu, [email protected]
Abstract. Recent advances in DNA analysis, global climate modeling and computational fluid dynamics have increased the demand for supercomputing resources. Through increasing the efficiency and throughput of existing supercomputing centers, additional computational power can be provided for these applications. Backfill has been shown to increase the efficiency of supercomputer schedulers for large, homogeneous machines [1]. Utilizations can still be as low as 60% for machines with heterogeneous resources and strict administrative requirements. Preemption based backfill allows the scheduler to be more aggressive in filling up the schedule for a supercomputer [2]. Utilization can be increased and administrative requirements relaxed if it is possible to preempt a running job to allow a higher priority task to run.
1 Introduction
The last few years have witnessed an explosion in supercomputing applications. Biologists, who have traditionally performed most of their analysis by hand, are now able to analyze the complete human genome. Although this information promises to revolutionize the way research is performed in several disciplines, the corresponding computational requirements are increasing at a rate that surpasses funding increases for supercomputing resources. The supercomputing infrastructure landscape is also changing rapidly in response to the demand for computational resources. Many researchers are creating clusters of workstations to run computationally intense workloads. These clusters often grow at an incremental rate with additional machines being added to the cluster as new funding for research is received. Additions to existing clusters often have different memory capabilities and may have licenses that allow software to run on only a subset of the processors in the cluster. These additions invariably cause the cluster to become more heterogeneous. This complicates the job of a scheduler since tasks that require a certain set of resources may only be schedulable on a small subset of the nodes in the system. Another cause of heterogeneity arises due to administrative policies. As new infrastructure is purchased by a particular organization, the organization may impose scheduling policies intended to insure that their community has priority access to the resources. Other users are normally welcome to use the nodes if they are available, but should not delay the execution of users in the community.
These administrative policies cause a normal scheduler to be overly conservative in scheduling low priority jobs and computational resources will often be wasted in order to insure the priority of community tasks.
1.1 Backfill
Backfill is performed when the highest priority job requires more resources than are currently available on the supercomputer. A lower priority task that is guaranteed to complete before the anticipated initiation time of a high priority job is allowed to run. Since these resources are not usable by the highest priority job, throughput and efficiency are increased on the machine. In some cases, there will be no lower priority task that will fit in the idle time slot. In this case, processing time will be wasted in order to guarantee that the high priority task will run as soon as possible. Preemptive backfill allows the scheduler to start lower priority jobs even if they are not assured of completion before higher priority users require the resource. In the majority of cases, the low priority task will finish and have no impact on the high priority schedule. If the low priority job takes longer than the available idle slot, it will be preempted by suspending it or killing it. This research addresses the following questions with respect to preemptive backfill: – Which jobs are the best candidates for preemptive backfill? Users normally provide an estimate of execution time when they submit a job. If this time limit is exceeded, the job is killed. As a result, users often submit estimates that are far greater than the real time used by a task. By looking at the accuracy of past estimates, the scheduler is able to select tasks for preemptive backfill that have the highest probability of completion in the allotted time. – Which preemption policies provide the best utilization and throughput on the machine? Many supercomputer operating systems do not provide the ability to suspend a parallel task. This is due to the fact that all threads and spawned processes must be suspended along with the main task. The operating system must also implement security measures so that a running task does not have access to data and files used by suspended jobs. When a job must be preempted, all of the progress that the job has made toward completion will be lost. This is the main downside of preemption-based backfill. In order to minimize this negative effect, appropriate policies must be adopted. Possible policies for determining which jobs to preempt include: • Most Recently Started • Furthest From Wall Clock Limit • Furthest From Historical Information Scaled Wall Clock Limit • Minimum Work Completed Supercomputer schedulers that do not perform backfill are typically limited to significantly lower utilization and throughput than classical backfill systems, but priority jobs are guaranteed to run as soon as resources are available. Classical backfill improves the efficiency of the machine by filling idle time with
lower priority tasks, but without preemption, high priority jobs may be delayed. Preemption based backfill provides for high utilization, while guaranteeing that prioritization is preserved.
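One plausible way to realize the candidate selection discussed above is to scale each job's wallclock estimate by the historical accuracy of the submitting user's past estimates and then test the scaled estimate against the idle slot. The Python sketch below is a hypothetical illustration; the per-user ratio statistic and the job field names are assumptions, not part of the Maui Scheduler's actual interface.

```python
# Hypothetical candidate selection for preemptive backfill based on the
# historical accuracy of a user's wallclock estimates.

from collections import defaultdict

class EstimateHistory:
    def __init__(self):
        self.ratios = defaultdict(list)  # user -> [actual / estimated, ...]

    def record(self, user, actual_runtime, estimated_runtime):
        if estimated_runtime > 0:
            self.ratios[user].append(actual_runtime / estimated_runtime)

    def expected_runtime(self, job):
        history = self.ratios.get(job.user)
        if not history:
            return job.wallclock_estimate       # no data: trust the estimate
        mean_ratio = sum(history) / len(history)
        return job.wallclock_estimate * mean_ratio

def preemptive_backfill_candidates(queued_jobs, idle_nodes, window_seconds, history):
    """Return queued jobs likely to finish inside the backfill window."""
    candidates = []
    for job in queued_jobs:
        if job.nodes <= idle_nodes and history.expected_runtime(job) <= window_seconds:
            candidates.append(job)
    return candidates
```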
2 Background
Several different backfill strategies have been employed in the past to improve utilization in supercomputing systems. This section outlines important concepts and existing strategies for improving supercomputer utilization. When a user submits a job, several attributes are included with the submission. These attributes, summarized in the sketch after this list, include:
– Resources required - This includes memory, network and software licenses.
– Number of Processors - The job will not be scheduled until this number of processors is available. These processors may be scheduled in a contiguous block, or may be distributed across the machine. Some submission systems may allow the user to specify that processors should be allocated in groups that share the same memory.
– Wallclock Time Limit - After the job has been running for this amount of time, the job will be killed. Users will often overestimate this parameter. The IBM SP2 at the Cornell Theory Center was examined in prior research. Of the jobs submitted, 38% used less than 4% of their wall clock limit. Less than 5% of the jobs used 40% of the wall clock limit [3].
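For illustration only, these submission attributes can be captured in a small record like the one below; the field names and defaults are assumptions, not the format of any particular submission system.

```python
# Illustrative record of the attributes attached to a job submission.
from dataclasses import dataclass, field
from typing import List

@dataclass
class JobSubmission:
    user: str
    processors: int              # job waits until this many processors are free
    wallclock_limit: float       # seconds; the job is killed when this expires
    memory_per_node: int = 0     # MB, part of "resources required"
    licenses: List[str] = field(default_factory=list)
    contiguous: bool = False     # whether processors must form one block

# Example: a 16-processor job with a (probably overestimated) 4-hour limit.
job = JobSubmission(user="alice", processors=16, wallclock_limit=4 * 3600,
                    memory_per_node=2048, licenses=["fluent"])
```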
2.1 Non-backfill
All of the jobs that are submitted will be accounted for in a scheduling queue. The scheduler prioritizes jobs in the queue according to several different policies. Jobs that are associated with certain users and projects may be given higher priority. As jobs wait in the queue, their priority will normally increase so that indefinite postponement is avoided. When a job completes, the scheduler will attempt to start the highest priority job in the queue. If resources are not available for this job, processors may remain idle waiting for another job to complete that will allow the highest priority job to run. Figure 1 shows this wasted processing power with Non-backfill scheduling.
2.2 Classical Backfill
With classical backfill, the scheduler may attempt to backfill jobs with lower priority into this wasted processor space. If the backfilled jobs can complete before Job 1 completes, then Job 2 will not be delayed. Classical backfill will only select jobs with a wallclock time estimate such that the backfilled job will complete before the end time of Job 1. There may be many jobs that will actually complete in this time, but none of them will be scheduled into this time if the user estimate is longer than the available space. This conservative backfill policy
causes lower utilization. If preemption is allowed, the scheduler can run jobs that have a high probability of completing before Job 1 completes. If Job 1 completes early, the scheduler can preempt these backfilled jobs and run Job 2, increasing the overall efficiency of the machine. Even when no jobs are backfilled that have wallclock estimates greater than available time, backfill can cause high priority jobs to be delayed longer than they would have been on a non-backfill system. Figure 2 shows a Classical backfill schedule with wallclock estimates and actual runtimes. When a high priority job completes early, jobs that have been backfilled may prevent the next high priority job from running. This increases the expansion factor for these jobs. Expansion factor is defined as the total time from submission to completion, divided by the actual runtime of a job. Administrators can disallow backfilled jobs on their resources in order to eliminate this expansion in the runtime for their high priority jobs.
Fig. 1. Non-backfill scheduling.
Fig. 2. Classical backfill.
Fig. 3. Preemptive backfill.
2.3 Preemptive Backfill
The wasted processors in Figure 2 are present because there were no jobs in the queue that matched the available resources. Preemptive backfill will run jobs in these slots that have a high probability of completion in the available time, even if their wallclock estimate exceeds the backfill space. Figure 3 shows the same set of tasks with preemptive backfill. The expansion factor for high priority jobs is minimized because backfilled jobs are preempted when resources become available for the next high priority job. Efficiency is higher because backfill slots can be filled, even if no job exists in the queue with small enough resource requirements.
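The two admission rules can be contrasted in a few lines: classical backfill requires the user's estimate to fit the window, while preemptive backfill only requires a sufficiently high completion probability (for example from the historical-accuracy sketch earlier) and accepts the risk of preemption. The threshold and field names below are assumptions, not values from this paper.

```python
# Hypothetical admission checks for a backfill window of `window` seconds.

def classical_backfill_ok(job, idle_nodes, window):
    # Classical backfill: the user's wallclock estimate must fit the window.
    return job.nodes <= idle_nodes and job.wallclock_estimate <= window

def preemptive_backfill_ok(job, idle_nodes, window, completion_probability,
                           threshold=0.8):
    # Preemptive backfill: the estimate may exceed the window, as long as the
    # job is likely to finish in time; otherwise it risks being preempted.
    if job.nodes > idle_nodes:
        return False
    return completion_probability(job, window) >= threshold
```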
2.4 Gang Scheduling
Gang Scheduling provides some of the same advantages as preemptive backfill. With Gang Scheduling, all of the processes associated with a task are suspended periodically so that more than one task can be time sliced on the same set of processors. A lower priority job can be backfilled into a slot using gang scheduling and if resources become available to let a higher priority job run, the backfilled job can be preempted by the higher priority job and may only run when the high priority job completes. Several problems arise with gang scheduling. – Migration – It may be better to run the backfilled job in some other processor space than to wait for the high priority task to complete. Some Gang Scheduling systems allow jobs to migrate to processors where they can complete more quickly, but since data files and memory must be migrated with the process, migration can lead to inefficiency in supercomputing systems [4]. – Memory Contention – Many supercomputing applications require large amounts of memory. When time slicing occurs, the caches and memory may have to be swapped out to disk, causing significant performance degradation [5].
– Security – As mentioned previously, each job must be isolated from the file space and environment of other tasks. This is difficult to implement when more than one job is being time-sliced on a machine. This occurs because in a meta-scheduling environment, each user may not have a separate account. – Availability – Gang scheduling is not available on many operating systems. Even when it is available, preemptive backfill may provide higher performance for many applications.
3 Preemptive Backfill
This research was performed using the Maui Scheduler [6, 7, 8]. The Maui Scheduler is used in many supercomputing sites throughout the world and provides for a simulation mode as well as production mode. Extensive trace data is available from production runs of large jobs over a long period of time. This trace data is used to evaluate different options for preemptive scheduling.
3.1 Preemption
Several forms of preemption are possible. The form used is dependent on operating system functionality. For this research we assume the minimal functionality of being able to kill a job and all of its spawned processes. Several other forms may be available depending on the operating system used:
– stop/restart - All supercomputer operating systems provide functionality to stop a parallel job and to start it over again at a later time. All preliminary results are lost and the task must be able to remove any changes that have been made to input files so that the second invocation of the task will have exactly the same environment as the first. Once the job has been stopped, it will return to the queue and be backfilled onto another set of processors or run as a priority job when it reaches the head of the queue. Any partial results will be lost and any time spent running a job that is stopped is wasted.
– suspend/resume - Many operating systems provide functionality that allows a task to be suspended and then resumed at a later time. All of the processes associated with this task are made non-runnable, but they retain process state and any file system changes remain in place so that the job can continue when it is resumed. The operating system must insure that jobs do not have access to each other's files, and memory will need to be transferred to a swap file before the new task can start. This may cause some additional delay in the start time of the high priority task when compared to stop/restart. When the job is resumed, it must run on the same set of processors that it was suspended on. This may cause a large expansion factor for jobs that are backfilled in this way.
– checkpoint/restart - Many applications perform checkpoint operations to save their intermediate state. Once a checkpoint is performed, the job can be terminated and restarted with the same state present when the checkpoint
was performed. This preemption form has an advantage over suspend/resume since the task can be migrated to another set of processors for continued execution after preemption. It also has an advantage over stop/restart since intermediate computations performed before preemption are not wasted. The principal disadvantage of this option lies in the fact that in most operating systems, the application itself must perform checkpoints. Most operating systems do not support checkpointing because it is difficult for an operating system to determine which parts of the memory space should be saved in order to restart effectively. Many parallel processing systems do not have support for suspend/resume or checkpoint/restart. The results presented in this paper are all based on stop/restart program semantics. If additional functionality is available, preemption becomes even more advantageous in scheduling resources.
3.2 Backfill Options
Several factors must be considered in determining the implementation of a preemptive backfill scheduler. The user and system administrator have different goals in mind, and these goals must all be addressed in order to provide optimal service. Three distinct goals must be considered: – Priority scheduling Goal: The scheduler should run the most important job (based on value of job) first. In many cases one organization will own computing resources, but will be willing to have other jobs serviced if the resources are idle. When a high priority job is submitted, it should be given preferential treatment, or the owner of the resource will not be willing to allow lower priority jobs to run at all. – Backfill scheduling Goal: The scheduler should run the greatest aggregate sum of jobs with the greatest likelihood of successful completion. (based on the value of the job, probability of successful completion). This goal translates into high utilization from the system administrator’s perspective. Possible priorities for this goal include: • Backfill jobs using the most resources first • Backfill the jobs that are most likely to complete first • Backfill highest priority jobs first. High priority jobs may not be candidates for backfill since they may be preempted if an existing job finishes early, or the predicted completion time is inaccurate. – Job Preemption Goal: The scheduler should stop the minimum set of jobs required to allow priority jobs to run immediately (based on the value of jobs, probability of successful completion and the value of the completed work). Experiments performed in this research indicate that preemptive backfill provides lower delay for users and higher utilization for system administrators. By giving more weight to each of the backfill goals, the scheduling algorithm can favor the goal that is most important to the user community.
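As a rough illustration of how such weighting might be expressed, the sketch below scores each backfill candidate by its value, its likelihood of completing in the window, and the work expected to be lost on preemption; the weights, the scoring terms, and the field names are all assumptions rather than the configuration used in these experiments.

```python
# Hypothetical weighted scoring of backfill candidates, reflecting the three
# goals above: job value, likelihood of completion, and the work that would
# be lost on preemption.

WEIGHTS = {"value": 1.0, "completion": 2.0, "lost_work": 0.5}

def backfill_score(job, window, completion_probability):
    p = completion_probability(job, window)
    expected_lost_work = (1.0 - p) * job.nodes * min(job.wallclock_estimate, window)
    return (WEIGHTS["value"] * job.priority
            + WEIGHTS["completion"] * p
            - WEIGHTS["lost_work"] * expected_lost_work)

def order_backfill_candidates(candidates, window, completion_probability):
    """Highest-scoring candidates are backfilled first."""
    return sorted(candidates,
                  key=lambda j: backfill_score(j, window, completion_probability),
                  reverse=True)
```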
[Fig. 4 axes: number of processors (1–32) vs. job duration (hh:mm, 0:02–34:08) vs. number of jobs.]
Fig. 4. Job Characterization of trace data. Many of the jobs are in the 8 processor range and can be backfilled effectively.
[Fig. 5 axes: number of processors vs. job duration in hours vs. percentage of processor hours.]
Fig. 5. Distribution of job processor hours. Most of the time in the trace is spent with jobs that take more than 17 hours and have more than 8 processors.
4 Experimental Results
Several experiments were performed to determine the impact of preemptive backfill on system performance. These experiments were performed on the Maui Scheduler [6] with job trace files from the Center for High Performance Computing at the University of Utah. The trace files included 11,445 jobs ranging from 1 to 32 processors, lasting up to 68 hours. Figure 4 shows the job mix present in the trace. There are a large number of jobs requiring less than 8 processors,
providing a large opportunity for backfill. Figure 5 shows the percentage of processor hours used in each job classification. Although there are a large number of jobs with small numbers of processors and a short duration, they account for a small percentage of the overall runtime on the machine. Jobs with duration of more than 17 hours and more than 8 processors dominate in this area. Any preemptive backfill scheme should not hurt the performance of these jobs.
4.1 Non-prioritized Schedule Analysis
Several experiments were run to determine the impact of preemption on the performance of jobs in the trace when no priority was used. The scheduling algorithm was changed so that jobs could be backfilled even if their wallclock estimate of completion time was after the start time of the next normal job. When a scheduling iteration occurs and backfilled jobs are preventing normal jobs from running, a set of the backfilled jobs will be preempted or terminated (under the current model, where restart is not available). Metrics used in evaluating the algorithms were queue time and expansion factor. An effective preemptive backfill algorithm will decrease idle time for processors since jobs with a wallclock estimate precluding them from backfill under normal circumstances can be scheduled if they can be preempted when a higher priority job must be run. It is expected that queue time will be reduced with preemptive backfill. Expansion factor considers the percentage of additional time that a job has to wait because of queuing delay (Xfactor = [QueueTime + RunTime]/RunTime). A job with a 5 minute run time that is queued for 100 minutes will notice the delay much more than a job with a 1000 minute run time and the same delay. The expansion factor for the 5 minute job would be 21, where the expansion factor for the 1000 minute job would be 1.1. Optimal expansion factors are close to 1. Figure 6 shows expansion factor results for several algorithms that were used to select which jobs to preempt. In one case a random set of jobs were selected to preempt. This will often result in a job being terminated when it is nearly complete. The random plot is presented to show the worst case behavior of a naive algorithm. The First Fit plot shows the behavior of the normal backfill algorithm without preemption. Notice that the random algorithm behavior has significantly higher expansion factor and queue time than the non-preemption First Fit algorithm. The duration-consumed algorithm selects a set of jobs to preempt that have been running for the shortest length of time. This strategy attempts to waste the minimum number of node-hours in the preemption. The duration-remaining plot results from preempting the jobs that have the most time left to run. By preempting these jobs, the scheduler will free up larger slots of time for each job preempted, and may be able to preempt a smaller number of jobs. Each job has a Wall Clock prediction of execution time that is an estimate provided by the user. When this Wall Clock time expires, the job will be terminated. As a result, users tend to overestimate the time for their jobs, but the estimates can still provide information as to which jobs to preempt. The
wcduration plot represents results from preempting the job with the longest estimate of duration, and the wcduration-percentresusage plot shows the results of preempting the job using the largest percentage of resources on the machine. The strategies resulting in the best expansion factor are duration-remaining and duration-consumed. These strategies consistently outperform First Fit. Similar results can be observed in the queue time analysis. Figure 7 shows the average queue times for each of the strategies. The average queue time for all of the preemption strategies is less than that of the First Fit non-preemption scheduling algorithm. It is difficult to select a clear winner, but duration-remaining and duration-consumed both achieve good performance.
Fig. 6. Expansion factor for several preemption strategies. Intelligent preemption backfill strategies outperform random strategies and in most cases also perform better than First Fit without preemption.
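The preemption strategies compared above can be read as different sort keys over the set of backfilled jobs; the sketch below is a hypothetical reconstruction (attribute names and the resource-usage approximation are assumptions), not the Maui Scheduler implementation.

```python
import random

# Hypothetical reconstruction of the victim-selection strategies as sort keys
# over the currently backfilled jobs. Field names (start_time,
# wallclock_estimate, nodes) are assumptions, not the Maui Scheduler's API.

def victim_order(strategy, jobs, now):
    jobs = list(jobs)
    if strategy == "random":
        return random.sample(jobs, len(jobs))
    keys = {
        # shortest elapsed runtime first: wastes the least completed work
        "duration-consumed": lambda j: now - j.start_time,
        # most estimated time left first: frees the largest slot per victim
        "duration-remaining": lambda j: -(j.start_time + j.wallclock_estimate - now),
        # longest user estimate of total duration first
        "wcduration": lambda j: -j.wallclock_estimate,
        # largest share of machine resources first (approximated here)
        "wcduration-percentresusage": lambda j: -(j.nodes * j.wallclock_estimate),
    }
    return sorted(jobs, key=keys[strategy])

def preempt_until_fits(backfilled_jobs, needed_nodes, strategy, now):
    """Preempt backfilled jobs in strategy order until the priority job fits."""
    victims, freed = [], 0
    for job in victim_order(strategy, backfilled_jobs, now):
        if freed >= needed_nodes:
            break
        victims.append(job)
        freed += job.nodes
    return victims
```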
4.2 Prioritized Schedule Analysis
Although non-prioritized jobs seem to benefit from preemption, one of the goals of preemption is to insure that high priority jobs receive preferential treatment. If the owner of a supercomputer can be assured that his jobs will always preempt jobs of other users, he will be more willing to allow low priority jobs to utilize idle node-hours when he doesn't have any active jobs. In our experiments high priority jobs can always preempt low priority jobs. High priority jobs can preempt backfilled medium priority jobs. Figure 8 shows expansion factor when 80% of the jobs were marked as medium priority and 20% high priority. The high priority jobs achieve an expansion factor near unity, indicating that they have a queue time near zero. The expansion factor is greater than one for the 32 processor case since high priority jobs only preempt medium priority jobs when they are backfilled. Medium priority jobs that are scheduled normally will not be preempted. Figure 9 shows
the queue times for this case. Medium priority jobs achieve queue times that are very similar to First Fit. Similar results were obtained for a combination of high and low priority jobs. Figure 10 shows the expansion factor with a mix of high priority and low priority jobs. Since preemption is possible, the high priority jobs are only queued behind other high priority jobs. When a low priority job is running and a high priority job arrives, the low priority job is preempted immediately. Figure 11 shows the queue time for this configuration. In this case low priority jobs with 32 processors received a higher average queue time than First Fit, but in most cases the queue time was similar to First Fit, even for low priority jobs.
Fig. 7. Average queue time for each of the preemption strategies. All of the strategies achieve lower queue times than First Fit due to a reduction in wasted node-hours, since slots can be filled with jobs that wouldn't normally be backfilled due to their estimated run time.
Fig. 8. Expansion factor for a mix of high and medium priority jobs. High priority jobs are able to achieve near-optimal performance even though medium priority jobs are present.
Fig. 9. Queue time for a mix of high and medium priority jobs. High priority jobs are able to achieve low queue times and medium priority jobs have queue times similar to First Fit.
Fig. 10. Expansion factor for a mix of high and low priority jobs. High priority jobs are able to achieve near-optimal performance even though low priority jobs are present.
Fig. 11. Queue time for a mix of high and low priority jobs. High priority jobs are able to achieve low queue times and low priority jobs have queue times similar to First Fit.
5 Conclusions
This paper examines the impact of preemption on the performance of the Maui supercomputer scheduler. If jobs can be preempted once they are started, more efficient use can be made of processor time slots that are too small for any eligible backfill job. Jobs that are larger than a backfill window can be started with the hope that they will complete before the window expires. Priority markings on lower priority jobs can allow high priority jobs to achieve a much higher quality of service than low priority jobs that run in idle time on a supercomputer. These priority markings allow users to share resources while maintaining preferential treatment for jobs submitted by privileged users. The duration-remaining and duration-consumed preemption strategies result in the best average queue time and expansion factor for non-prioritized jobs. The queue time and expansion factor resulting from any of the strategies represent improved performance over First Fit without preemption. This research shows that preemptive backfill algorithms can improve the performance of supercomputer schedulers. These results should make resource owners more willing to share their computing resources and will increase the utilization of supercomputing centers.
Acknowledgements
We appreciate the comments and suggestions we received from Brian Haymore and Julio Facelli from the Center for High Performance Computing at the University of Utah. They also provided us with job trace files from an environment where preemption could make a big difference. We also appreciate comments from Scott M. Jackson from the Molecular Science Computing Facility at Pacific Northwest National Laboratory. Many of the features currently implemented in the Maui Scheduler were implemented in response to suggestions from PNNL.
References
[1] D. G. Feitelson and L. Rudolph. Parallel job scheduling: Issues and approaches. Lecture Notes in Computer Science: Job Scheduling Strategies for Parallel Processing, 949, 1995.
[2] G. Berry. Preemption in concurrent systems. FSTTCS '93, Lecture Notes in Computer Science 761, pages 72–93, 1993.
[3] D. G. Feitelson and M. Jette. Improved utilization and responsiveness with gang scheduling. Proceedings of the IPPS '97 Workshop on Job Scheduling Strategies for Parallel Processing, 1997.
[4] Uwe Schwiegelshohn. Preemptive weighted completion time scheduling of parallel jobs. Lecture Notes in Computer Science: 4th European Symposium on Algorithms, 1136:39–51, 1996.
[5] Anat Batat and Dror G. Feitelson. Gang scheduling with memory considerations. In 14th International Parallel and Distributed Processing Symposium, 2000.
[6] D. Jackson. The Maui Scheduler. Technical report. http://supercluster.org/projects/maui.
[7] D. Jackson, Q. Snell, and M. Clement. Core algorithms of the Maui scheduler. Lecture Notes in Computer Science: Job Scheduling Strategies for Parallel Processing, 2221:87–102, 2001.
[8] Q. Snell, M. Clement, D. Jackson, and C. Gregory. The performance impact of advance reservation metascheduling. Lecture Notes in Computer Science: Job Scheduling Strategies for Parallel Processing, 1911, 2000.
Job Scheduling for the BlueGene/L System
Elie Krevat (1), José G. Castaños (2), and José E. Moreira (2)
(1) Massachusetts Institute of Technology, Cambridge, MA 02139-4307, [email protected]
(2) IBM T. J. Watson Research Center, Yorktown Heights, NY 10598-0218, {castanos,jmoreira}@us.ibm.com
Abstract. BlueGene/L is a massively parallel cellular architecture system with a toroidal interconnect. Cellular architectures with a toroidal interconnect are effective at producing highly scalable computing systems, but typically require job partitions to be both rectangular and contiguous. These restrictions introduce fragmentation issues that affect the utilization of the system and the wait time and slowdown of queued jobs. We propose to solve these problems for the BlueGene/L system through scheduling algorithms that augment a baseline first come first serve (FCFS) scheduler. Restricting ourselves to space-sharing techniques, which constitute a simpler solution to the requirements of cellular computing, we present simulation results for migration and backfilling techniques on BlueGene/L. These techniques are explored individually and jointly to determine their impact on the system. Our results demonstrate that migration can be effective for a pure FCFS scheduler but that backfilling produces even more benefits. We also show that migration can be combined with backfilling to produce more opportunities to better utilize a parallel machine.
1 Introduction
BlueGene/L (BG/L) is a massively parallel cellular architecture system. 65,536 self-contained computing nodes, or cells, are interconnected in a three-dimensional toroidal pattern [19]. In that pattern, each cell is directly connected to its six nearest neighbors, two each along the x, y, and z axes. Three-dimensional toroidal interconnects are simple, modular, and scalable, particularly when compared with systems that have a separate, typically multistage, interconnection network [13]. Examples of successful toroidal-interconnected parallel systems include the Cray T3D and T3E machines [11]. There is, however, a price to pay with toroidal interconnects. We cannot view the system as a simple fully-connected interconnection network of nodes that are equidistant to each other (i.e., a flat network). In particular, we lose an important feature of systems like the IBM RS/6000 SP, which lets us pick any set of nodes for execution of a parallel job, irrespective of their physical location in the machine [1]. In a toroidal-interconnected system, the spatial allocation
of nodes to jobs is of critical importance. In most toroidal systems, including BG/L, job partitions must be both rectangular (in a multidimensional sense) and contiguous. It has been shown by Feitelson and Jette [7] that, because of these restrictions, significant machine fragmentation occurs in a toroidal system. Fragmentation results in low system utilization and high wait time for queued jobs. In this paper, we analyze a set of strictly space-sharing scheduling techniques to improve system utilization and reduce the wait time of jobs for the BG/L system. Time-sharing techniques such as gang-scheduling are not explored since these types of schedulers require more memory and operating system involvement than are practically available in a cellular computing environment. We analyze the two techniques of backfilling [8, 14, 17] and migration [3, 20] in the context of a toroidal-interconnected system. Backfilling is a technique that moves lower priority jobs ahead of other higher priority jobs, as long as execution of the higher priority jobs is not delayed. Migration moves jobs around the toroidal machine, performing on-the-fly defragmentation to create larger contiguous free space for waiting jobs. We conduct a simulation-based study of the impact of our scheduling algorithms on the system performance of BG/L. Using actual job logs of supercomputing centers, we measure the impact of migration and backfilling as enhancements to a first-come first-serve (FCFS) job scheduling policy. Migration is shown to be effective in improving maximum system utilization while enforcing a strict FCFS policy. We also find that backfilling, which bypasses the FCFS order, can lead to even higher utilization and lower wait times. Finally, we show that there is a small benefit from combining backfilling and migration. The rest of this paper is organized as follows. Section 2 discusses the scheduling algorithms used to improve job scheduling on a toroidal-interconnected parallel system. Section 3 describes the simulation procedure to evaluate these algorithms and presents our simulation results. Section 4 describes related work and suggests future work opportunities. Finally, Section 5 presents the conclusions.
2 Scheduling Algorithms
System utilization and average job wait time in a parallel system can be improved through better job scheduling algorithms [4, 5, 7, 9, 10, 12, 14, 15, 16, 17, 21, 22, 26]. The opportunity for improvement over a simple first-come first-serve (FCFS) scheduler is much greater for toroidal interconnected systems because of the fragmentation issues discussed in Section 1. The following section describes four job scheduling algorithms that we evaluate in the context of BG/L. In all algorithms, arriving jobs are first placed in a queue of waiting jobs, prioritized according to the order of arrival. The scheduler is invoked for every job arrival and job termination event in order to schedule new jobs for execution. Scheduler 1: First Come First Serve (FCFS). For FCFS, we adopt the heuristic of traversing the waiting queue in order and scheduling each job in a way that
maximizes the largest free rectangular partition remaining in the torus. For each job of size p, we try all the possible rectangular shapes of size p that fit in the torus. For each shape, we try all the legal allocations in the torus that do not conflict with running jobs. Finally, we select the shape and allocation that results in the maximal largest free rectangular partition remaining after allocation of this job. We stop when we find the first job in the queue that cannot be scheduled. A valid rectangular partition does not always exist for a job. There are job sizes which are always impossible for the torus, such as prime numbers greater than the largest dimension size. Because job sizes are known at job arrival time, before execution, jobs with impossible sizes are modified to request the next largest possible size. Additionally, there are legal job sizes that cannot be scheduled because of the current state of the torus. Therefore, if a particular job of size p cannot be scheduled, but some free partition of size q > p exists, the job will be increased in size by the minimum amount required to schedule it. For example, consider a 4 × 4 (two-dimensional) torus with a single free partition of size 2 × 2. If a user submits a job requesting 3 nodes, that job cannot be run. The scheduler increases the job size by one, to 4, and successfully schedules the job. Determining the size of the largest rectangular partition in a given threedimensional torus is the most time-intensive operation required to implement the maximal partition heuristic. When considering a torus of shape M × M × M , a straightforward exhaustive search of all possible partitions takes O(M 9 ) time. We have developed a more efficient algorithm that computes incremental projections of planes and uses dynamic programming techniques. This projection algorithm has complexity O(M 5 ) and is described in Appendix A. An FCFS scheduler that searches the torus in a predictable incremental fashion, implements the maximal partition heuristic, and modifies job sizes when necessary is the simplest algorithm considered, against which more sophisticated algorithms are compared. Scheduler 2: FCFS With Backfilling. Backfilling is a space-sharing optimization technique. With backfilling, we can bypass the priority order imposed by the job queuing policy. This allows a lower priority job j to be scheduled before a higher priority job i as long as this reschedule does not delay the estimated start time of job i. The effect of backfilling on a particular schedule for a one-dimensional machine can be visualized in Figure 1. Suppose we have to schedule five jobs, numbered from 1 to 5 in order of arrival. Figure 1(a) shows the schedule that would be produced by a FCFS policy without backfilling. Note the empty space between times T1 and T2 , while job 3 waits for job 2 to finish. Figure 1(b) shows the schedule that would be produced by a FCFS policy with backfilling. The empty space was filled with job 5, which can be executed before job 3 without delaying it. The backfilling algorithm seeks to increase system utilization without job starvation. It requires an estimation of job execution time, which is usually not very accurate. However, previous work [8, 18, 23] has shown that overesti-
mating execution time does not significantly affect backfilling results. Backfilling has been shown to increase system utilization in a fair manner on an IBM RS/6000 SP [8, 23].
Fig. 1. FCFS policy without (a) and with (b) backfilling. Job numbers correspond to their position in the priority queue.
Backfilling is used in conjunction with the FCFS scheduler and is only invoked when there are jobs in the waiting queue and FCFS halts because a job does not fit in the torus. A reservation time for the highest-priority job is then calculated, based on the worst case execution time of jobs currently running in the torus. The reservation guarantees that the job will be scheduled no later than that time, and if jobs end earlier than expected the reservation time may improve. Then, if there are additional jobs in the waiting queue, a job is scheduled out of order so long as it does not prevent the first job in the queue from being scheduled at the reservation time. Jobs behind the first job, however, may be delayed. Just as the FCFS scheduler dynamically increases the size of jobs that cannot be scheduled with their current size, similar situations may arise during backfilling. Unlike FCFS, however, the size increase is performed more conservatively during backfilling because there are other jobs in the queue which might better utilize the free nodes of the torus. Therefore, a parameter I specifies the maximum size by which the scheduler will increase a job. For example, by setting I = 1 (our default value), backfilling increases a job size by at most one node. This parameter is used only during the backfilling phase of scheduling; the FCFS phase will always increase the first job in the queue as much as is required to schedule it.
Scheduler 3: FCFS With Migration. The migration algorithm rearranges the running jobs in the torus in order to increase the size of the maximal contiguous rectangular free partition. Migration in a toroidal-interconnected system compacts the running jobs and counteracts the effects of fragmentation. While migration does not require any more information than FCFS, it may require additional hardware and software functionality. This paper does not attempt to quantify the overhead of that functionality. However, accepting that this overhead exists, migration is only undertaken when the expected benefits are deemed substantial. The decision to migrate is therefore based on two parameters: FNtor, the ratio of free nodes in the system compared to the size of
the torus, and FNmax , the fraction of free nodes contained in the maximal free partition. In order for migration to establish a significant larger maximal free partition, FNtor must be sufficiently high and FNmax must be sufficiently low. Section 3.4 contains further analysis of these parameters. The migration process is undertaken immediately after the FCFS phase fails to schedule a job in the waiting queue. Jobs already running in the torus are organized in a queue of migrating jobs sorted by size, from largest to smallest. Each job is then reassigned a new partition, using the same algorithm as FCFS and starting with an empty torus. After migration, FCFS is performed again in an attempt to start more jobs in the rearranged torus. In order to ensure that all jobs fit in the torus after migration, job sizes are not increased if a reassignment requires a larger size to fit in the torus. Instead, the job is removed from the queue of migrating jobs, remaining in its original partition, and reassignment begins again for all remaining jobs in the queue. If the maximal free partition size after migration is worse than the original assignment, which is possible but generally infrequent under the current scheduling heuristics, migration is not performed. Scheduler 4: FCFS with Backfilling and Migration. Backfilling and migration are independent scheduling concepts, and an FCFS scheduler may implement both of these functions simultaneously. First, we schedule as many jobs as possible via FCFS. Next, we rearrange the torus through migration to minimize fragmentation, and then repeat FCFS. Finally, the backfilling algorithm from Scheduler 2 is performed to make a reservation for the highest-priority job and attempt to schedule jobs with lower priority so long as they do not conflict with the reservation. The combination of these policies should lead to an even more efficient utilization of the torus. For simplicity, we call this scheduling technique, that combines backfilling and migration, B+M.
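A single scheduling cycle of the combined B+M scheme might be organized as in the sketch below; the sim object, its pass routines, and the two migration thresholds are illustrative assumptions, not the simulator's actual interface or the parameter values studied here.

```python
# Hypothetical sketch of one B+M scheduling cycle on the supernode torus.
# `sim` stands in for the simulator and is assumed to expose FCFS, migration,
# and backfilling passes; the two thresholds are illustrative only.

FN_TOR_MIN = 0.2   # enough free nodes overall to make migration worthwhile
FN_MAX_MAX = 0.7   # maximal free partition holds too small a share of them

def schedule_cycle(sim, torus, waiting_queue):
    sim.fcfs_pass(torus, waiting_queue)                  # Scheduler 1

    if waiting_queue:                                    # FCFS halted on a job
        free = torus.free_nodes()
        fn_tor = free / torus.size()
        fn_max = torus.max_free_partition() / max(free, 1)
        if fn_tor >= FN_TOR_MIN and fn_max <= FN_MAX_MAX:
            sim.migrate(torus)                           # compact running jobs
            sim.fcfs_pass(torus, waiting_queue)          # retry after migration

    if waiting_queue:
        # Scheduler 2's backfilling with a reservation for the head job,
        # growing backfilled jobs by at most I = 1 node.
        sim.backfill_with_reservation(torus, waiting_queue, max_increase=1)
```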
3 Experiments
We use a simulation-based approach to perform quantitative measurements of the efficiency of the proposed scheduling algorithms. An event-driven simulator was developed to process actual job logs of supercomputing centers. The results of simulations for all four schedulers were then studied to determine the impact of their respective algorithms. We begin this section with a short overview of the BG/L system. We then describe our simulation environment. We proceed with a discussion of the workload characteristics for the two job logs we consider. Finally, we present the experimental results from the simulations.
3.1 The BlueGene/L System
The BG/L system is organized as a 32 × 32 × 64 three-dimensional torus of nodes (cells). Each node contains processors, memory, and links for interconnecting to
its six neighbors. The unit of allocation for job execution in BG/L is a 512-node ensemble organized in an 8 × 8 × 8 configuration. This allocation unit is the smallest granularity for which the torus can be electrically partitioned into a toroidal topology. Therefore, BG/L behaves as a 4 × 4 × 8 torus of these supernodes. We use this supernode abstraction when performing job scheduling for BG/L.
3.2 The Simulation Environment
The simulation environment models a torus of 128 (super)nodes in a three-dimensional 4 × 4 × 8 configuration. The event-driven simulator receives as input a job log and the type of scheduler (FCFS, Backfill, Migration, or B+M) to simulate. There are four primary events in the simulator: (1) an arrival event occurs when a job is first submitted for execution and placed in the scheduler's waiting queue; (2) a schedule event occurs when a job is allocated onto the torus, (3) a start event occurs after a standard delay of one second following a schedule event, at which time a job begins to run, and (4) a finish event occurs upon completion of a job, at which point the job is deallocated from the torus. The scheduler is invoked at the conclusion of every event that affects the states of the torus or the waiting queue (i.e., the arrival and finish events). A job log contains information on the arrival time, execution time, and size of all jobs. Given a torus of size N, and for each job j the arrival time t_j^a, execution time t_j^e and size s_j, the simulation produces values for the start time t_j^s and finish time t_j^f of each job. These results are analyzed to determine the following parameters for each job: (1) wait time t_j^w = t_j^s − t_j^a, (2) response time t_j^r = t_j^f − t_j^a, and (3) bounded slowdown t_j^bs = max(t_j^r, Γ) / max(t_j^e, Γ) for Γ = 10 seconds. The Γ term
appears according to recommendations in [8], because some jobs have very short execution time, which may distort the slowdown. Global system statistics are also determined. Let the simulation time span be T = max_∀j(t_j^f) − min_∀k(t_k^a). We then define system utilization (also called capacity utilized) as
w_util = ( Σ_∀j s_j t_j^e ) / (T N).    (1)
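These per-job metrics and the utilization of Eq. (1) can be computed directly from the simulator output; the sketch below assumes a simple list of job records with arrival, start, finish, and size fields and is not the authors' analysis code.

```python
# Hypothetical post-processing of simulation output into the metrics above.

GAMMA = 10.0  # seconds, bound used in the bounded slowdown

def job_metrics(job):
    wait = job.start - job.arrival                      # t_j^w
    response = job.finish - job.arrival                 # t_j^r
    runtime = job.finish - job.start                    # t_j^e
    bounded_slowdown = max(response, GAMMA) / max(runtime, GAMMA)
    return wait, response, bounded_slowdown

def utilization(jobs, torus_size):
    span = max(j.finish for j in jobs) - min(j.arrival for j in jobs)   # T
    used = sum(j.size * (j.finish - j.start) for j in jobs)
    return used / (span * torus_size)                   # Eq. (1)
```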
Similarly, let f(t) denote the number of free nodes in the torus at time t and q(t) denote the total number of nodes requested by jobs in the waiting queue at time t. Then, the total amount of unused capacity in the system, w_unused, is defined as
w_unused = ( ∫_{min_∀j(t_j^a)}^{max_∀j(t_j^f)} max(0, f(t) − q(t)) dt ) / (T N).    (2)
This parameter is a measure of the work unused by the system because there is a lack of jobs requesting free nodes. The max term is included because the amount of unused work cannot be less than zero. The balance of the system capacity is lost despite the presence of jobs that could have used it. The measure of
lost capacity in the system, which includes capacity lost because of the inability to schedule jobs and the delay before a scheduled job begins, is then derived as
w_lost = 1 − w_util − w_unused.    (3)

Table 1. Statistics for 10,000-job NASA and SDSC logs

                                     NASA Ames iPSC/860 log   SDSC IBM RS/6000 SP log
Number of nodes:                     128                      128
Job size restrictions:               powers of 2              none
Job size (nodes), mean:              6.3                      9.7
Job size (nodes), std. dev.:         14.4                     14.8
Workload (node-seconds), mean:       0.881 × 10^6             7.1 × 10^6
Workload (node-seconds), std. dev.:  5.41 × 10^6              25.5 × 10^6
3.3 Workload Characteristics
We performed experiments on a 10,000-job span of two job logs obtained from the Parallel Workloads Archive [6]. The first log is from NASA Ames's 128-node iPSC/860 machine (from the year 1993). The second log is from the San Diego Supercomputer Center's (SDSC) 128-node IBM RS/6000 SP (from the years 1998-2000). For our purposes, we will treat each node in those two systems as representing one supernode (512-node unit) of BG/L. This is equivalent to scaling all job sizes in the log by 512, which is the ratio of the number of nodes in BG/L to the number of nodes in these 128-node machines. Table 1 presents the workload statistics and Figure 2 summarizes the distribution of job sizes and the contribution of each job size to the total workload of the system. Using these two logs as a basis, we generate logs of varying workloads by multiplying the execution time of each job by a coefficient c, mostly varying c from 0.7 to 1.4 in increments of 0.05. Simulations are performed for all scheduler types on each of the logs. With these modified logs, we plot wait time and bounded slowdown as a function of system utilization.
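The varied-load logs amount to a one-line transformation of the trace; the sketch below illustrates that procedure under the assumption that each job record carries a runtime field.

```python
# Hypothetical generation of varied-workload logs by scaling execution times.
from collections import namedtuple

Job = namedtuple("Job", ["arrival", "runtime", "size"])

def scaled_logs(jobs, coefficients=None):
    """Yield (c, modified_log) pairs with every runtime multiplied by c."""
    if coefficients is None:
        coefficients = [round(0.70 + 0.05 * i, 2) for i in range(15)]  # 0.70 .. 1.40
    for c in coefficients:
        yield c, [job._replace(runtime=job.runtime * c) for job in jobs]
```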
3.4 Simulation Results
Figures 3 and 4 present plots of average job wait time (t_j^w) and average job bounded slowdown (t_j^bs), respectively, vs. system utilization (w_util) for each of the four schedulers considered and each of the two job logs. We observe that the overall shapes of the curves for wait time and bounded slowdown are similar. The most significant performance improvement is attained through backfilling, for both the NASA and SDSC logs. Also, for both logs, there is a certain benefit from migration, whether combined with backfilling or not. We analyze these results from each log separately.
Fig. 2. Job sizes and total workload for NASA Ames iPSC/860 ((a) and (c)) and San Diego Supercomputer Center (SDSC) IBM RS/6000 SP ((b) and (d)). Panels (a) and (b) are histograms of job sizes; panels (c) and (d) show workload (runtime × number of nodes) vs. size of job.
Fig. 3. Mean job wait time vs utilization for (a) NASA and (b) SDSC logs. Curves: FCFS, Backfill, Migration, and B+M.
Fig. 4. Mean job bounded slowdown vs utilization for (a) NASA and (b) SDSC logs.
NASA log: All four schedulers provide similar average job wait time and average job bounded slowdown for utilizations up to 65%. The FCFS scheduler saturates at about 77% utilization, whereas the Migration scheduler saturates at about 80% utilization.
Fig. 4. Mean job bounded slowdown vs utilization for (a) NASA and (b) SDSC logs
Backfilling (with or without migration) allows utilizations above 80% and saturates closer to 90% (the saturation region for these schedulers is shown here by plotting values of c > 1.4). We note that migration provides only a small improvement in wait time and bounded slowdown for most of the utilization range, and the additional benefit of migration with backfilling becomes unpredictable for utilization values close to the saturation region. In the NASA log, all jobs are of sizes that are powers of two, which results in a good packing of the torus. Therefore, the benefits of migration are limited.

SDSC log: With the SDSC log, the FCFS scheduler saturates at 63%, while the stand-alone Migration scheduler saturates at 73%. In this log, with jobs of more varied sizes, fragmentation occurs more frequently. Therefore, migration has a much bigger impact on FCFS, significantly improving the range of utilizations at which the system can operate. However, we note that when backfilling is used there is again only a small additional benefit from migration, more noticeable for utilizations between 75 and 85%. Utilization above 85% can be achieved, but only with exponentially growing wait time and bounded slowdown, independent of whether migration is performed.

Figure 5 presents a plot of average job bounded slowdown (t_j^bs) versus system utilization (w_util) for each of the four schedulers considered and each of the two job logs. We also include results from the simulation of a fully connected (flat) machine, with and without backfilling. (A fully connected machine does not suffer from fragmentation.) This allows us to assess the effectiveness of our schedulers in overcoming the difficulties imposed by a toroidal interconnect. The overall shapes of the curves for wait time are similar to those for bounded slowdown. Migration by itself cannot make the results for a toroidal machine as good as those for a fully connected machine. For the SDSC log, in particular, a fully connected machine saturates at about 80% utilization with just the FCFS scheduler.
Fig. 5. Mean job bounded slowdown vs utilization for the NASA and SDSC logs, comparing toroidal and flat machines
For the NASA log, results for backfilling with or without migration in the toroidal machine are just as good as the backfilling results in the fully connected machine. For utilizations above 85% in the SDSC log, not even a combination of backfilling and migration performs as well as backfilling on a fully connected machine.

Figure 6 plots the number of migrations performed and the average time between migrations versus system utilization for both workloads. We show results for the number of total migrations attempted, the number of successful migrations, and the maximum possible number of successful migrations (max successful). As described in Section 2, the parameters that determine whether a migration should be attempted are FNtor, the ratio of free nodes in the system to the size of the torus, and FNmax, the fraction of free nodes contained in the maximal free partition. According to our standard migration policy, a migration is only attempted when FNtor ≥ 0.1 and FNmax ≤ 0.7. A successful migration is defined as a migration attempt that improves the maximal free partition size. The max successful value is the number of migrations that are successful when a migration is always attempted (i.e., FNtor ≥ 0.0 and FNmax ≤ 1.0).

Almost all migration attempts were successful for the NASA log. This property of the NASA log is a reflection of the better packing caused by having jobs that are exclusively powers of two in size. For the SDSC log, we notice that many more total attempts are made, while about 80% of them are successful. If we always try to migrate every time the state of the torus is modified, no more than 20% of these migrations are successful, and usually much less. For the NASA log, the number of migrations increases linearly while the average time between these migrations varies from about 90 to 30 minutes, depending on the utilization level and its effect on the amount of fragmentation in the torus. In contrast to the NASA log, the number of migrations in the SDSC log does not increase linearly as utilization increases. Instead, the relationship is closer to an elongated bell curve. As utilization levels increase, at first migration attempts and successes also increase slightly to a fairly steady level.
Around the first signs of saturation, the number of migrations tends to decrease (i.e., at around 70% utilization for the Migration scheduler and 77% for B+M). Even though the number of successful migrations is greater for the SDSC log, the average time between migrations is still longer as a result of the larger average job execution time.

Most of the benefit of migration is achieved when we only perform migration according to our parameters. Applying these parameters has three main advantages: we reduce the frequency of migration attempts, so that the overhead of migration is not always incurred; we increase the percentage of migration attempts that are successful; and we increase the average benefit of a successful migration. This third advantage is apparent when we compare the mean job wait time results for our standard FNtor and FNmax settings to those of the scheduler that always attempts to migrate. Even though the maximum possible number of successful migrations is sometimes twice as many as our actual number of successes, Figure 7 reveals that the additional benefit of these successful migrations is very small.
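A minimal sketch of the standard migration policy described above is given below; the 0.1 and 0.7 thresholds and the definition of a successful migration are from the text, while the function and variable names are assumptions.

```python
# Hypothetical sketch of the migration-attempt decision; not the authors' code.

FN_TOR_MIN = 0.1   # attempt only if at least 10% of the torus is free
FN_MAX_MAX = 0.7   # ... and the maximal free partition holds at most 70% of the free nodes

def should_attempt_migration(free_nodes, torus_nodes, max_free_partition):
    if free_nodes == 0:
        return False
    fn_tor = free_nodes / torus_nodes          # FNtor: fraction of the torus that is free
    fn_max = max_free_partition / free_nodes   # FNmax: fraction of free nodes in the maximal free partition
    return fn_tor >= FN_TOR_MIN and fn_max <= FN_MAX_MAX

def migration_successful(max_partition_before, max_partition_after):
    # An attempt counts as successful only if it grows the maximal free partition.
    return max_partition_after > max_partition_before
```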
Fig. 6. Number of total, successful, and maximum possible successful migrations vs utilization ((a) and (b)), and average time between migrations vs utilization ((c) and (d))
Fig. 7. Mean job wait time vs utilization for the NASA and SDSC logs, comparing the standard migration policy to a full migration policy that always attempts to migrate
We complete this section with an analysis of results for system capacity utilized, unused capacity, and lost capacity. The results for each scheduler type and both standard job logs (c = 1.0) are plotted in Figure 8. The utilization improvements for the NASA log are barely noticeable, again because its jobs fill the torus more compactly. The SDSC log, however, shows the greatest improvement when using B+M over FCFS, with a 15% increase in capacity utilized and a 54% decrease in the amount of capacity lost. By themselves, the Backfill and Migration schedulers increase capacity utilization by 15% and 13%, respectively, while decreasing capacity loss by 44% and 32%, respectively. These results show that B+M is significantly more effective at transforming lost capacity into unused capacity. Under the right circumstances, it should be possible to utilize this unused capacity more effectively.
4 Related and Future Work
The topics of our work have been the subject of extensive previous research. In particular, [8, 14, 17] have shown that backfilling on a flat machine like the IBM RS/6000 SP is an effective means of improving quality of service. The benefits of combining migration and gang-scheduling have been demonstrated both for flat machines [24, 25] and toroidal machines like the Cray T3D [7]. The results in [7] are particularly remarkable, as system utilization was improved from 33%, with a pure space-sharing approach, to 96% with a combination of migration and gang-scheduling. The work in [21] discusses techniques to optimize spatial allocation of jobs in mesh-connected multicomputers, including changing the job size, and how to combine spatial- and time-sharing scheduling algorithms. An efficient job scheduling technique for a three-dimensional torus is described in [2].
Fig. 8. Capacity utilized, lost, and unused as a fraction of the total system capacity for (a) the NASA iPSC/860 and (b) the SDSC RS/6000 SP logs

This paper, therefore, builds on this previous research by applying a combination of backfilling and migration algorithms, exclusively through space-sharing techniques, to improve system performance on a toroidal-interconnected system. Future work can build further on these results. The impact of different FCFS scheduling heuristics for a torus, besides the largest-free-partition heuristic currently used, can be studied. It is also important to identify how the current heuristic relates to the optimal solution in different cases. Additional study of the parameters I, FNtor, and FNmax may determine further tradeoffs associated with partition size increases and more or less frequent migration attempts. Finally, while we do not attempt to implement complex time-sharing schedulers such as those used in gang-scheduling, a more limited time-sharing feature may be beneficial. Preemption, for example, allows a job to be suspended and resumed at a later time. Such time-sharing techniques may provide the means to further enhance the B+M scheduler and make the system performance of a toroidal-interconnected machine more similar to that of a flat machine.
5 Conclusions
We have investigated the behavior of various scheduling algorithms to determine their ability to increase processor utilization and decrease job wait time in the BG/L system. We have shown that a scheduler which uses only a backfilling algorithm performs better than a scheduler which uses only a migration algorithm, and that migration is particularly effective under a workload that produces a large amount of fragmentation (i.e., when many small to mid-sized jobs of varied sizes represent much of the workload). Migration has a significant implementation overhead but it does not require any additional information besides what is required by the FCFS scheduler. Backfilling, on the other hand, does not have a significant implementation overhead but requires additional information pertaining to the execution time of jobs.
Simulations of FCFS, backfilling, and migration space-sharing scheduling algorithms have shown that B+M, a scheduler which implements all of these algorithms, shows a small performance improvement over just FCFS and backfilling. However, B+M does convert significantly more lost capacity into unused capacity than backfilling alone. Additional enhancements to the B+M scheduler may harness this unused capacity to provide further system improvements. Even with the performance enhancements of backfilling and migration techniques, a toroidal-interconnected machine such as BG/L can only approximate the job scheduling efficiency of a fully connected machine in which all nodes are equidistant.
References

[1] T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir. SP2 system architecture. IBM Systems Journal, 34(2):152–184, 1995.
[2] H. Choo, S.-M. Yoo, and H. Y. Youn. Processor Scheduling and Allocation for 3D Torus Multicomputer Systems. IEEE Transactions on Parallel and Distributed Systems, 11(5):475–484, May 2000.
[3] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Future Generation Computer Systems, 12(1):53–65, May 1996.
[4] D. G. Feitelson. A Survey of Scheduling in Multiprogrammed Parallel Systems. Technical Report RC 19790 (87657), IBM T. J. Watson Research Center, October 1994.
[5] D. G. Feitelson. Packing schemes for gang scheduling. In Job Scheduling Strategies for Parallel Processing, IPPS'96 Workshop, volume 1162 of Lecture Notes in Computer Science, pages 89–110, Berlin, March 1996. Springer-Verlag.
[6] D. G. Feitelson. Parallel Workloads Archive. URL: http://www.cs.huji.ac.il/labs/parallel/workload/index.html, 2001.
[7] D. G. Feitelson and M. A. Jette. Improved Utilization and Responsiveness with Gang Scheduling. In IPPS'97 Workshop on Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 238–261. Springer-Verlag, April 1997.
[8] D. G. Feitelson and A. Mu'alem Weil. Utilization and predictability in scheduling the IBM SP2 with backfilling. In 12th International Parallel Processing Symposium, pages 542–546, April 1998.
[9] H. Franke, J. Jann, J. E. Moreira, and P. Pattnaik. An Evaluation of Parallel Job Scheduling for ASCI Blue-Pacific. In Proceedings of SC99, Portland, OR, November 1999. IBM Research Report RC21559.
[10] B. Gorda and R. Wolski. Time Sharing Massively Parallel Machines. In International Conference on Parallel Processing, volume II, pages 214–217, August 1995.
[11] D. Hyatt. A Beginner's Guide to the Cray T3D/T3E. URL: http://www.jics.utk.edu/SUPER COMPS/T3D/T3D guide/T3D guideJul97.html, July 1997.
[12] H. D. Karatza. A Simulation-Based Performance Analysis of Gang Scheduling in a Distributed System. In Proceedings 32nd Annual Simulation Symposium, pages 26–33, San Diego, CA, April 11-15, 1999.
[13] D. H. Lawrie. Access and Alignment of Data in an Array Processor. IEEE Transactions on Computers, 24(12):1145–1155, December 1975.
[14] D. Lifka. The ANL/IBM SP scheduling system. In IPPS'95 Workshop on Job Scheduling Strategies for Parallel Processing, volume 949 of Lecture Notes in Computer Science, pages 295–303. Springer-Verlag, April 1995.
[15] J. E. Moreira, W. Chan, L. L. Fong, H. Franke, and M. A. Jette. An Infrastructure for Efficient Parallel Job Execution in Terascale Computing Environments. In Proceedings of SC98, Orlando, FL, November 1998.
[16] U. Schwiegelshohn and R. Yahyapour. Improving First-Come-First-Serve Job Scheduling by Gang Scheduling. In IPPS'98 Workshop on Job Scheduling Strategies for Parallel Processing, March 1998.
[17] J. Skovira, W. Chan, H. Zhou, and D. Lifka. The EASY-LoadLeveler API project. In IPPS'96 Workshop on Job Scheduling Strategies for Parallel Processing, volume 1162 of Lecture Notes in Computer Science, pages 41–47. Springer-Verlag, April 1996.
[18] W. Smith, V. Taylor, and I. Foster. Using Run-Time Predictions to Estimate Queue Wait Times and Improve Scheduler Performance. In Proceedings of the 5th Annual Workshop on Job Scheduling Strategies for Parallel Processing, April 1999. In conjunction with IPPS/SPDP'99, Condado Plaza Hotel & Casino, San Juan, Puerto Rico.
[19] H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1993.
[20] C. Z. Xu and F. C. M. Lau. Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic Publishers, Boston, MA, 1996.
[21] B. S. Yoo and C. R. Das. Processor Management Techniques for Mesh-Connected Multiprocessors. In Proceedings of the International Conference on Parallel Processing (ICPP'95), volume 2, pages 105–112, August 1995.
[22] K. K. Yue and D. J. Lilja. Comparing Processor Allocation Strategies in Multiprogrammed Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 49(2):245–258, March 1998.
[23] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam. Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques. In Proceedings of IPDPS 2000, Cancun, Mexico, May 2000.
[24] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. In Proceedings of the 6th International Euro-Par Conference, pages 242–251, August 29 - September 1, 2000.
[25] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam. An Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling. In Job Scheduling Strategies for Parallel Processing, Sigmetrics'01 Workshop, June 2001.
[26] B. B. Zhou, R. P. Brent, C. W. Johnson, and D. Walsh. Job Re-packing for Enhancing the Performance of Gang Scheduling. In Job Scheduling Strategies for Parallel Processing, IPPS'99 Workshop, pages 129–143, April 1999. LNCS 1659.
A Projection of Partitions (POP) Algorithm

In a given three-dimensional torus of shape M × M × M where some nodes have been allocated to jobs, the POP algorithm provides an O(M^5)-time method
for determining the size of the largest free rectangular partition. This algorithm is a substantial improvement over an exhaustive search algorithm that takes O(M^9) time.

Let FREEPART = {(B, S) | B is a base location (i, j, k) and S is a partition size (a, b, c) such that for all x, y, z with i ≤ x < (i + a), j ≤ y < (j + b), k ≤ z < (k + c), node (x mod M, y mod M, z mod M) is free}. POP narrows the scope of the problem by determining the largest rectangular partition P ∈ FREEPART rooted at each of the M^3 possible base locations and then deriving a global maximum. Given a base location, POP works by finding the largest partition first in one dimension, then by projecting adjacent one-dimensional columns onto each other to find the largest partition in two dimensions, and iteratively projecting adjacent two-dimensional planes onto each other to find the largest partition in three dimensions.

First, a partition table of the largest one-dimensional partitions P ∈ FREEPART is pre-computed for all three dimensions and at every possible base location in O(M^4) time. This is done by iterating through each partition and, whenever an allocated node is reached, filling in all entries for the current "row" from a counter value, where the counter is incremented for each adjacent free node and reset to zero whenever an additional allocated node is reached.

For a given base location (i, j, k), we fix one dimension (e.g., k), start a counter X̃ = i in the next dimension, and multiply X̃ by the minimum partition table entry of the third dimension for (x mod M, j, k), where x varies as i ≤ x ≤ X̃ and X̃ varies as i ≤ X̃ ≤ (i + M). As the example in Figure 9 shows, when X̃ = 1 for some fixed k at base location (1, 2, k), the partition table entry in the Y dimension equals 3, since there are 3 consecutive free nodes, and our largest possible partition size is initially set to 3. When X̃ increases to 2, the minimum table entry becomes 2 because of the allocated node at location (2, 4, k), and the largest possible partition size is increased to 4. When X̃ = 3, we calculate a new largest possible partition size of 6. Finally, when we come across a partition table entry in the Y dimension of 0, because of the allocated node at location (4, 2, k), we stop increasing X̃. We would also have to repeat a similar calculation along the Y dimension, by starting a counter Ỹ.
Fig. 9. 2-dimensional POP Algorithm applied to Base Location (1,2): Adjacent 1-dimensional columns are projected onto each other as X̃ is incremented
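The following sketch illustrates the two-dimensional projection step for a fixed plane k, following the Figure 9 example; it assumes the one-dimensional partition table for the Y direction has already been pre-computed, and the array and function names are ours, not the authors'.

```python
# Sketch of 2-D POP for one base location in a fixed plane k (assumed names).
# run_y[x][y] is the pre-computed 1-D table: the number of consecutive free
# nodes in the Y direction starting at (x, y), counted with torus wraparound.

def largest_2d_partition(run_y, base_x, base_y, M):
    best = 0
    min_width = M                 # minimum Y-extent over the columns projected so far
    for dx in range(M):           # grow the X-extent (the counter X~) one column at a time
        x = (base_x + dx) % M
        min_width = min(min_width, run_y[x][base_y])
        if min_width == 0:        # allocated node in this column at base_y: stop
            break
        best = max(best, (dx + 1) * min_width)
    return best
```

On the Figure 9 example this yields candidate areas 3, 4, and 6 before stopping at the fourth column, matching the description above.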
Finally, this same idea is extended to work for 3 dimensions. Given a similar base location (i, j, k), we start a counter Z̃ in the Z dimension and calculate the maximum two-dimensional partition given the current value of Z̃. Then we project the adjacent two-dimensional planes by incrementing Z̃ and calculating the largest two-dimensional partition while using the minimum partition table entry of the X and Y dimensions for (i, j, z mod M), where z varies as k ≤ z ≤ Z̃. Using the initial partition table, it takes O(M) time to calculate a projection for two adjacent planes and to determine the largest two-dimensional partition. Since there are O(M) projections required for each of the O(M^3) base locations, our final algorithm runs in O(M^5) time.

When we implemented this algorithm in our scheduling simulator, we achieved a significant speed improvement. For the original NASA log, scheduling time improved from an average of 0.51 seconds for every successfully scheduled job to 0.16 seconds, while the SDSC log improved from an average of 0.125 seconds to 0.063 seconds. The longest time to successfully schedule a job also improved from 38 seconds to 8.3 seconds in the NASA log, and from 50 seconds to 8.5 seconds in the SDSC log.
Selective Reservation Strategies for Backfill Job Scheduling

Srividya Srinivasan, Rajkumar Kettimuthu, Vijay Subramani, and Ponnuswamy Sadayappan

The Ohio State University, Columbus, OH, USA
{srinivas,kettimut,subraman,saday}@cis.ohio-state.edu

Supported in part by a grant from Sandia National Laboratories.
Abstract. Although there is wide agreement that backfilling produces significant benefits in scheduling of parallel jobs, there is no clear consensus on which backfilling strategy is preferable: should conservative backfilling be used, or the more aggressive EASY backfilling scheme? Using trace-based simulation, we show that if performance is viewed within various job categories based on their width (processor request size) and length (job duration), some consistent trends may be observed. Using insights gleaned from the characterization, we develop a selective reservation strategy for backfill scheduling. We demonstrate that the new scheme is better than both conservative and aggressive backfilling. We also consider the issue of fairness in job scheduling and develop a new quantitative approach to its characterization. We show that the newly proposed schemes are also comparable or better than aggressive backfilling with respect to the fairness criterion.
1 Introduction
Effective job scheduling schemes are important for supercomputer centers in order to improve system metrics like utilization, and user metrics like slowdown and turnaround time. It is widely accepted that the use of backfilling in job scheduling results in significant improvement to system utilization over non-backfilling scheduling approaches [8]. However, when comparing different backfilling strategies, many studies have concluded that the relative effectiveness of different schemes depends on the job mix [10], [12]. The two main variants are conservative backfilling [6] and aggressive (EASY) [6], [13] backfilling. With conservative backfilling, each job is given a reservation when it arrives in the queue, and jobs are allowed to move ahead in the queue as long as they do not cause any queued job to be delayed beyond its reserved start time. With aggressive backfilling, only the job at the head of the queue is given a reservation. Jobs are allowed to move ahead of the reserved job as long as they do not delay that job. There has been no consensus on which of these two backfilling schemes is better.

In order to gain greater insight into the relative effectiveness of conservative and aggressive backfilling, we group jobs into categories and study their effect on
jobs in the different categories. Two important factors that affect the scheduling of a job are its length (run time) and width (number of nodes requested). By classifying jobs along these dimensions, and interpreting metrics like slowdown for various job categories instead of just a single average for the entire job trace, we are able to obtain new insights into the performance of conservative and EASY backfilling. We show that very consistent trends are observed with four different traces from Feitelson's archive [4].

We observe that conservative and aggressive backfilling each benefit certain job categories while adversely affecting other categories. Conservative backfilling allows less backfilling than aggressive backfilling due to the constraints imposed on the schedule by the reservations of all waiting jobs. Although aggressive backfilling enables many more jobs to be backfilled, those jobs (e.g., wide jobs) that do not easily backfill suffer, since they might have to wait until they get to the head of the queue before they get a reservation.

We propose a selective reservation scheme intended to obtain the best characteristics from both strategies while avoiding the drawbacks. The main idea is to provide reservations selectively, only to jobs that have waited long enough in the queue. By limiting the number of reservations, the amount of backfilling is greater than with conservative backfilling; but by assuring reservations to jobs after a limited wait, the disadvantage of potentially unbounded delay with aggressive backfilling is avoided. We show that the new strategy is quite consistently superior to both conservative and aggressive backfilling.

Finally, we address the issue of fairness in job scheduling. We propose a new model for quantitative characterization of fairness in job scheduling and show that the new schemes are comparable or better than aggressive backfilling.

The paper is organized as follows. In Section 2, we provide some background information pertinent to this paper. Section 3 addresses the comparison of conservative and aggressive backfilling. The new selective backfilling schemes are presented and evaluated in Section 4. In Section 5, we develop a new model for characterizing the fairness of a job scheduling scheme. Related work is presented in Section 6. Concluding remarks are provided in Section 7.
2 Background
Scheduling of parallel jobs is usually viewed in terms of a 2D chart with time along one axis and the number of processors along the other axis. Each job can be thought of as a rectangle whose length is the user estimated run time and width is the number of processors required. Parallel job scheduling strategies have been widely studied in the past [1], [2], [3], [9], [15]. The simplest way to schedule jobs is to use the First-Come-First-Served (FCFS) policy. This approach suffers from low system utilization. Backfilling [11], [12], [16] was proposed to improve system utilization and has been implemented in several production schedulers [7]. Backfilling works by identifying “holes” in the 2D chart and moving forward smaller jobs that fit those holes. There are two common variants to backfilling – conservative and aggressive (EASY)[12], [13]. With conservative backfill, every
job is given a reservation when it enters the system. A smaller job is moved forward in the queue as long as it does not delay any previously queued job. With aggressive backfilling, only the job at the head of the queue has a reservation. A small job is allowed to leap forward as long as it does not delay the job at the head of the queue.

Some of the common metrics used to evaluate the performance of scheduling schemes are the average turnaround time and the average bounded slowdown. We use these metrics for our studies. The bounded slowdown [6] of a job is defined as follows:

Bounded Slowdown = (Wait time + Max(Run time, 10)) / Max(Run time, 10)

A threshold of 10 seconds is used to limit the influence of very short jobs (which usually are aborted jobs) on the metric.
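A small helper computing this metric (assumed names; the 10-second threshold is from the definition above):

```python
def bounded_slowdown(wait_time, run_time, threshold=10.0):
    # Bounded slowdown as defined above, with short jobs clamped to 10 seconds.
    return (wait_time + max(run_time, threshold)) / max(run_time, threshold)

# Example: a 5-second job that waited 300 seconds has bounded slowdown
# (300 + 10) / 10 = 31, rather than the raw slowdown of (300 + 5) / 5 = 61.
```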
2.1 Workload Characterization
The simulation studies were performed using a locally developed scheduler with workload logs from several supercomputer centers. From the collection of workload logs available from Feitelson's archive [4], the CTC workload trace, the SDSC workload trace, the KTH workload trace, and the LANL workload trace were used to evaluate the various schemes. The CTC trace was logged from a 430-node IBM SP2 at the Cornell Theory Center, the KTH trace from a 100-node IBM SP2 system at the Swedish Royal Institute of Technology, the SDSC trace from a 128-node IBM SP2 system at the San Diego Supercomputer Center, and the LANL trace from a 1024-node CM-5 system at the Los Alamos National Laboratory.

An analysis based only on the aggregate slowdown of the system as a whole does not provide insight into the variability between different job categories. Therefore, in our discussion, we classify the jobs into various categories based on the run time and the number of processors requested, and analyze the average slowdown and turnaround time for each category. In the initial part of the study we compare the performance of the different schemes under the idealistic assumption of accurate user estimates. In later sections, we present results using the actual user estimates from the workload logs.
Table 1. Job categorization criteria - CTC, KTH and SDSC traces

          ≤ 8 Processors   > 8 Processors
≤ 1 Hr    SN               SW
> 1 Hr    LN               LW
Table 2. Job categorization criteria - LANL trace

          ≤ 64 Processors   > 64 Processors
≤ 1 Hr    SN                SW
> 1 Hr    LN                LW
Table 3. Job distribution by category

Trace   SN       SW       LN       LW
CTC     45.06%   11.84%   30.26%   12.84%
KTH     53.78%   19.52%   16.50%   10.20%
SDSC    47.24%   21.44%   20.94%   10.38%
LANL    70.80%   11.72%    9.42%    8.06%
To analyze the performance of jobs of different sizes and lengths, jobs were grouped into four categories based on their run time, Short (S) vs. Long (L), and the number of processors requested, Narrow (N) vs. Wide (W). The criteria used for job classification for the CTC, SDSC, and KTH traces are shown in Table 1. For the LANL trace, since no job requested fewer than 32 processors, the classification criteria shown in Table 2 were used. The distribution of jobs in the various traces, corresponding to the four categories, is given in Table 3. The choice of the partition boundaries for the categories is somewhat arbitrary; however, we show in the next section that the categorization permits us to observe some consistent trends that are not apparent when only the overall averages for the entire trace are computed. We find that the same overall trends are observed if the partition boundaries are changed.
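A sketch of this categorization (the one-hour boundary and the 8- or 64-processor boundary are from Tables 1 and 2; the function and field names are assumptions):

```python
def categorize(run_time_seconds, processors, width_boundary=8):
    # Short vs. Long by run time, Narrow vs. Wide by processors requested.
    length = "S" if run_time_seconds <= 3600 else "L"
    width = "N" if processors <= width_boundary else "W"
    return length + width   # one of "SN", "SW", "LN", "LW"

# For the LANL trace, width_boundary=64 is used, since no LANL job requests
# fewer than 32 processors.
```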
3 Conservative versus EASY Backfilling
Previous studies [10], [12] have concluded that the relative performance of EASY and conservative backfill policies is trace and metric dependent and that no consistent trend can be observed. However, on finer categorization of the jobs in a trace, consistent category-wise trends become evident.

With conservative backfilling, when a job is submitted, it is given a reservation to start at the earliest time that does not violate any previously existing reservations. The existing reservations constrain later arriving jobs from backfilling easily. The longer a job is, the more difficult it is for it to get a reservation ahead of the previously arrived jobs. Therefore, long jobs find it difficult to backfill under conservative backfilling. EASY backfilling relaxes this constraint by maintaining only one reservation at any point in time. The presence of only one "blocking" reservation in the schedule helps long jobs to backfill more easily.

Wide jobs find it difficult to backfill because they cannot easily find enough free processors. Conservative backfill helps such wide jobs by guaranteeing them a start time when they enter the system. In EASY backfill, since these jobs are not given a reservation until they reach the head of the idle queue, even jobs having lower priority than these can backfill ahead of them if they find enough free processors. Thus the jobs in the Long Narrow (LN) category benefit from EASY backfilling, while the jobs in the Short Wide (SW) category benefit from conservative backfilling. As far as the Short Narrow (SN) jobs are concerned, there is no consistent trend between EASY and conservative, because these jobs backfill very quickly in both schemes.
Fig. 1. Category-wise performance comparison of conservative vs. EASY backfilling: normal load. The SW jobs have better slowdowns under conservative backfilling while the LN jobs have better slowdowns under EASY backfilling. This trend is consistent across different traces
Fig. 2. Comparison of conservative and EASY backfilling: high load. The trends for the SW and the LN jobs are more pronounced under high load compared to normal load

Similarly, for the Long Wide (LW) jobs, there is no clear advantage of one scheme over the other, because conservative backfilling provides them with the advantage of reservations, while EASY backfilling provides them with better backfilling opportunities due to fewer "blockades" in the schedule. Thus the overall performance of EASY versus conservative backfilling will depend on the relative mix of the jobs in each of the categories.

Fig. 1 compares the slowdowns and turnaround times of jobs in the different categories, for EASY and conservative backfilling, for the four traces. The average slowdown and turnaround time for EASY backfilling are shown as a percentage change compared to the corresponding average for the same set of jobs under conservative backfill scheduling. For example, if the average slowdown of jobs in the SW category were 8.0 for conservative backfill and 12.0 for EASY backfill, the bar in the graph would show +50%. Therefore, negative values indicate better performance. The figures indicate that the above-mentioned trends are observed irrespective of the job trace used and the metric used. Fig. 2 shows a comparison of the two schemes for the CTC trace under high system load (obtained by multiplying each job's run time by a factor of 1.3). We find that the same trends are observed and that the differences between the schemes are more pronounced under high load.

The data above highlights the strengths and weaknesses of the two backfilling schemes:

– Conservative backfilling provides reservations to all jobs at arrival time and thus limits the slowdown of jobs that would otherwise have difficulty getting started via backfilling. But it is indiscriminate and provides reservations to all jobs, whether they truly need them or not. By providing reservations to all jobs, the opportunities for backfilling are decreased, due to the blocking effect of the reserved jobs in the schedule.
– EASY backfilling provides a reservation to only the job at the head of the job queue. Thus it provides much more opportunity for backfilling. However, jobs that inherently have difficulty backfilling (e.g., wide jobs) suffer relative to conservative backfilling, because they only get a reservation when they manage to get to the head of the queue.
4 Proposed Schemes

4.1 Selective Reservation Schemes
Instead of the non-selective reservations of both conservative and aggressive backfilling, we propose a selective backfilling strategy: jobs do not get a reservation until their expected slowdown exceeds some threshold, whereupon they get a reservation. By doing so, if the threshold is chosen judiciously, few jobs should have reservations at any time, but the most needy jobs are assured of getting reservations.

It is convenient to describe the selective reservation approach in terms of two queues with different scheduling policies: an entry "no-guarantee" queue, where start time guarantees are not provided, and an "all-guaranteed" queue, in which all jobs are given a start time guarantee (similar to conservative backfilling). Jobs enter the system through the entry queue, which schedules jobs based on FCFS priority without providing start time guarantees. If a job waits long enough in the entry queue, it is transferred to the guaranteed queue. This is done when the eXpansion Factor (XFactor) of the job exceeds some "starvation threshold". The XFactor of a job is defined as:

XFactor = (Wait time + Estimated Run time) / Estimated Run time

An important issue is the determination of a suitable starvation threshold. We chose the starvation threshold to simply be the running average slowdown of the previously completed jobs. This is referred to as the Selective-Adaptive or Sel-Adaptive scheme. In the Selective-Adaptive scheme, a single starvation threshold is used for all job categories. Since different job categories have very different slowdowns, another variant of selective reservations was evaluated, where different starvation thresholds are used for different job categories, based again on the running average slowdown of the previously completed jobs in each of these categories. We call this the Selective-Differential-Adaptive or Sel-D-Adaptive scheme. In both schemes the thresholds are initialized to one, and as jobs complete, the running average is updated appropriately. Since different thresholds are used for different job categories, the Selective-D-Adaptive scheme can also be used to tune specific job categories by appropriately scaling the corresponding starvation thresholds. In the rest of the paper, selective backfilling and selective reservation are used interchangeably.
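The sketch below captures the core of the Selective-Adaptive policy as described above; the class structure and names are assumptions, and the surrounding queue management and backfilling machinery are omitted.

```python
# Hypothetical sketch of the Selective-Adaptive reservation test; not the
# authors' implementation.

def xfactor(wait_time, estimated_run_time):
    return (wait_time + estimated_run_time) / estimated_run_time

class SelectiveAdaptive:
    def __init__(self):
        self.threshold = 1.0      # starvation threshold, initialized to one
        self.completed = 0
        self.slowdown_sum = 0.0

    def job_completed(self, slowdown):
        # maintain the running average slowdown of completed jobs
        self.completed += 1
        self.slowdown_sum += slowdown
        self.threshold = self.slowdown_sum / self.completed

    def needs_reservation(self, wait_time, estimated_run_time):
        # move the job from the no-guarantee queue to the all-guaranteed queue
        # once its XFactor reaches the starvation threshold
        return xfactor(wait_time, estimated_run_time) >= self.threshold
```

The Sel-D-Adaptive variant would keep one such running average per job category (SN, SW, LN, LW) rather than a single global threshold.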
4.2 Performance Evaluation
Fig. 3a compares the percentage change in the average slowdowns for EASY backfill and the selective schemes, with respect to conservative backfilling under high load. It can be observed that the selective reservation scheme achieves at least a 45% reduction in the overall slowdown compared to conservative and EASY backfilling. Further, it improves the slowdowns of all categories compared to EASY and conservative backfilling, except the LW category, for which there is a slight degradation in slowdown.
Fig. 3. Performance of selective backfilling schemes: accurate user estimates ((a) average slowdown, (b) average turnaround time, (c) worst-case slowdown). The selective backfilling schemes achieve a significant reduction in the overall slowdown and turnaround time. The selective schemes also improve the average and worst case slowdowns of most categories
This degradation in the slowdown for the LW jobs is explained as follows. The LW jobs have difficulty backfilling and hence rely on reservations. Further, the average slowdown for the LW category tends to be much lower than the overall average slowdown. Use of the overall average slowdown as the starvation threshold implies that LW jobs will not be moved to the guarantee queue and given a reservation until their XFactor is significantly higher than their group average. This causes a degradation in the slowdown for the LW category. The Selective-D scheme improves the performance of all the categories, including the LW category, although the magnitude of improvement for the SW category is slightly lower than with the selective scheme. Similar trends are observed when comparing the turnaround times, as indicated in Fig. 3b. From Fig. 3c it can be observed that the Selective-D scheme achieves dramatic reductions in the worst case slowdowns for all the categories when compared to conservative and EASY backfilling.
Fig. 4. Performance of the selective schemes for the various traces under different load conditions: exact estimates. The selective reservation schemes outperform conservative and EASY backfilling, especially at high load
Fig. 5. Performance of selective backfill schemes: actual user estimates ((a) average slowdown, (b) average turnaround time, (c) worst-case slowdown). The selective schemes achieve a significant improvement in the average slowdown and turnaround time of all the categories compared to conservative backfilling
Fig. 4 shows the performance of the selective schemes compared to EASY and conservative backfilling for the various traces under different load conditions. The different loads are modeled through modification of the traces by multiplying the run times of the jobs by suitable constants, keeping their arrival times the same as in the original trace. Higher values of the constant represent proportionately higher offered load to the system, in terms of processor-time product. We observe that the improvements obtained by the selective reservation schemes are more pronounced under high load.
4.3 User Estimate Inaccuracies
We have so far assumed that the user estimates of run time are perfect. Now, we consider the effect of user estimate inaccuracy on the selective reservation schemes. This is desirable from the point of view of realistic modeling of an actual system workload, since a job scheduler only has user run time information to make its scheduling decisions.
A clarification about these threshold values is in order. Real traces contain a number of aborted jobs and jobs with poorly estimated run times. The slowdowns of these jobs tend to be much larger than the slowdowns of similar well-estimated jobs, because the large degree of over-estimation of their run time makes these jobs very hard to backfill. Instead of using the average slowdown of all jobs, which tends to be skewed high by the aborted or poorly estimated jobs, the starvation threshold is computed from the average slowdown of only the well-estimated jobs (whose actual run times are within a factor of two of their estimated run times).

Fig. 5a shows the percentage change in the average slowdown for EASY backfill and the selective reservation schemes with respect to conservative backfill. It can be observed from the figure that the selective schemes perform better than conservative backfilling for all job categories. Similar trends can be observed with respect to the average turnaround time from Fig. 5b. Fig. 5c shows the percentage change in the worst case slowdown of the various schemes relative to that of conservative backfilling.

Comparing the Selective-Adaptive schemes with EASY backfill, the improvements are not as good as with exact run time estimates. The jobs with significantly over-estimated run times do not get reservations easily (since their XFactors increase at a slower rate compared to an accurately estimated job of the same length) and also cannot backfill easily owing to their seemingly large length. Therefore these jobs tend to incur higher slowdowns with the Selective-Adaptive schemes than under EASY backfill, which provides greater opportunities for these jobs to backfill (because there is never more than one impeding reservation).

In Fig. 6, we show the performance of well-estimated jobs (those with estimated run time within a factor of two of the actual run time). The percentage change in the average slowdown and turnaround time and the worst case slowdown are shown for EASY backfill and the selective reservation schemes, relative to conservative backfill. For well-estimated jobs, the performance trends for the various categories are quite similar to the case of exact run time estimates: the selective schemes are significantly better than conservative backfill, and also better than EASY backfill for most of the cases.

Fig. 7 shows the performance of the selective schemes compared to EASY and conservative backfilling for the SDSC, CTC and KTH traces under different load conditions. The LANL trace did not contain user run time estimates. We again observe that the improvements obtained by the selective reservation schemes are more pronounced under high load.
5 Fairness
Of great importance for production job scheduling is the issue of fairness. A strict definition of fairness for job scheduling could be that no later arriving job should be started before any earlier arriving job. Only an FCFS scheduling policy without backfilling would be fair under this strict definition of fairness.
Fig. 6. Performance of selective backfill schemes: well-estimated jobs ((a) average slowdown, (b) average turnaround time, (c) worst-case slowdown)
Once backfilling is allowed, clearly the strict definition of fairness will be violated. It is well established that backfilling significantly improves system utilization and average slowdown/turnaround time; thus backfilling is virtually indispensable for non-preemptive scheduling. If we consider FCFS with conservative backfilling under a scenario of perfect estimation of job run times, a weaker definition of fairness is satisfied: no job is started any later than the earliest time it could have been started under the strictly fair FCFS-No-Backfill schedule. In other words, although later arriving jobs may overtake queued jobs, it is not considered unfair because they do not delay queued jobs.

Still considering the scenario of accurate user estimates of run time, how can we evaluate whether an alternative scheduling scheme is fair under the above weak criterion? One possibility would be to compare the start time of each job with its start time under the strictly fair FCFS-No-Backfill schedule. However, this is unsatisfactory since the start times of most jobs under FCFS-No-Backfill will likely be worse than under FCFS-Conservative, due to the poorer utilization and higher loss of capacity with FCFS-No-Backfill. What if we compared the start time of each job under the new schedule with the corresponding start time under FCFS-Conservative? This has a problem too: those jobs that got backfilled and leaped ahead under FCFS-Conservative would have a much earlier reference start time than would be fair to compare against.
Fig. 7. Performance of the selective schemes for the various traces under different load conditions: actual user estimates. The selective reservation schemes outperform conservative and EASY backfilling, especially under high load
To address this problem, we define a "fair-start" time for each job under a FCFS-Conservative schedule. It is defined as the earliest possible start time the job would have received under FCFS-Conservative if the scheduling strategy were suddenly changed to strict FCFS-No-Backfill at the instant the job arrived. We then define the fair-slowdown of a job as:

Fair-Slowdown = (Fair-Start time under FCFS-Conservative − Queue time + Run time) / Run time

We can now quantify the fairness of a scheduling scheme by looking at the percentage of jobs that have a higher slowdown than their fair slowdown.
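A sketch of this fairness accounting, reading "Queue time" as the job's submission time; the record fields and function names are assumptions.

```python
# Hypothetical sketch: fair_start is the job's fair-start time under
# FCFS-Conservative, arrival its submission time, run_time its run time, and
# actual_slowdown its slowdown under the schedule being evaluated.

def fair_slowdown(fair_start, arrival, run_time):
    return (fair_start - arrival + run_time) / run_time

def unfairly_treated_fraction(jobs):
    unfair = sum(
        1 for j in jobs
        if j["actual_slowdown"] > fair_slowdown(j["fair_start"], j["arrival"], j["run_time"])
    )
    return unfair / len(jobs)
```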
Table 4. Fairness comparison

                       ≤1      1-1.5   1.5-2   2-4     >4
FCFS EASY              90.46   7.20    1.28    0.76    0.30
FCFS Sel-Adaptive      92.64   4.98    0.70    1.04    0.54
FCFS Sel-D-Adaptive    92.18   5.18    1.02    0.88    0.60
SJF EASY               91.08   5.24    1.16    1.18    1.34
Table 4 shows the percentage of jobs in five different groups. The first column indicates the percentage of jobs that have a slowdown less than or equal to their fair slowdown value. Column two indicates the percentage of jobs that have a slowdown between 1 and 1.5 times their fair slowdown value. Column three shows the percentage of jobs that have a slowdown between 1.5 and 2 times their fair slowdown value. Column four indicates the percentage of jobs that have a slowdown between 2 and 4 times their fair slowdown value. Column five shows the percentage of jobs that have a slowdown greater than 4 times their fair slowdown value.

From the table, it can be observed that 92% of the jobs received fair treatment under the selective reservation schemes, while the remaining 8% of the jobs had a worse slowdown than their fair slowdown and can be considered to have been treated unfairly, relative to FCFS-Conservative. However, it may be observed that the percentage of jobs that got unfair treatment under the aggressive backfilling schemes is higher. Compared to SJF-EASY backfilling, the selective reservation schemes are clearly more fair. But the percentage of jobs that had a slowdown greater than twice their fair slowdown value is slightly greater under the selective reservation scheme when compared to FCFS-EASY backfilling.

A scheme that worsens the slowdowns of many jobs in the long categories is not likely to be acceptable even if it improves the slowdowns of most of the other categories. For example, a delay of 1 hour for a 10-minute job (slowdown = 7) is much more tolerable than a slowdown of 7 (i.e., a one-week wait) for a 24-hour job. In order to get insight into how different categories of jobs are treated by the different schemes, we categorized the jobs based on their run time. We compare the number of jobs that received unfair treatment in each of the categories for the different schemes.

Fig. 8 shows a comparison of the fairness of the selective reservation schemes with the FCFS-EASY and SJF-EASY schemes. From the figure we observe that under the selective reservation schemes, all the jobs that have slowdowns greater than four times their fair slowdown value are short jobs (run time less than or equal to 4 hours), and none of the very long jobs suffer a degradation greater than two times their fair slowdown value. For most length categories, the number of unfairly treated jobs is lower with the selective reservation schemes than with the aggressive backfilling schemes. Overall, we can conclude that the new schemes are better than or comparable to FCFS-EASY with respect to fairness. FCFS-EASY is a widely used scheduling strategy in practice; thus the new selective scheduling schemes would appear to be very attractive, since they have better performance and comparable or better fairness properties.
Fig. 8. Fairness comparison of various schemes. The selective backfilling schemes are better than or comparable to FCFS-EASY with respect to fairness
The above model for fairness was based on the observation that FCFS-Conservative satisfies a weak fairness property, and therefore the fair-start time of jobs under FCFS-Conservative can be used as a reference against which to compare the start times under other schedules. Of course, in practice user estimates of run time are not accurate, and in this scenario even the weak definition of fairness is not satisfied by FCFS-Conservative schedules. Nevertheless, FCFS-Conservative is considered completely acceptable as a scheduling scheme from the viewpoint of fairness. Hence we believe it is appropriate to use it as a reference standard in evaluating the fairness of other schedules in the practical scenario of inaccurate user estimates of run time.
6
Related Work
The relative performance of EASY and conservative backfilling is compared in [5] using different workload traces and metrics. A conclusion of the study is that the relative performance of conservative and EASY backfilling depends on the percentage of long serial jobs in the workload and the accuracy of user estimates. It is observed that if user estimates are very accurate and the trace contains many long serial jobs, then conservative backfilling degrades the performance of the long serial jobs and enhances the performance of the larger short jobs. This is consistent with our observations in this paper. In [14], the effect of backfill policy and priority policy on different job categories was evaluated. A conclusion of the study is that when actual user estimates
are used, the average slowdown of the well-estimated jobs decreases compared to their average slowdown when all user estimates are accurate. Poorly estimated jobs, on the other hand, have worse slowdowns compared to when all user estimates are accurate. This effect is more pronounced under conservative backfilling than under EASY. Other studies that have sought approaches to improve on standard backfilling include [9] and [16]. In [16], an approach is developed where each job is associated with a deadline (based on its priority) and a job is allowed to backfill provided it does not delay any job in the queue by more than that job's slack. Such an approach provides greater flexibility to the scheduler compared to conservative backfilling while still providing an upper bound on each job's actual start time. In [9], it is shown that systematically lengthening the estimated execution times of all jobs results in improved performance of backfilling schedulers. Another scheme evaluated via simulation in [9] is to sort the waiting queue by length and provide no start-time guarantees. But this approach can result in very high worst-case delays and potentially lead to starvation of jobs.
7
Conclusions
In this paper we used trace-based simulation to characterize the relative performance of conservative and aggressive backfilling. We showed that by examining the performance within different job categories, some very consistent trends can be observed across different job traces. We used the insights gleaned from the characterization of conservative and aggressive backfilling to develop a new selective backfilling approach. The new approach promises to be superior to both aggressive and conservative backfilling. We also developed a new model for characterizing the fairness of a scheduling scheme, and showed that the new schemes perform comparably or better than aggressive backfilling schemes.
Acknowledgments We thank the anonymous referees for the numerous suggestions for improving the paper, especially “Referee 4” for his/her extensive comments and suggestions.
References [1] K. Aida. Effect of Job Size Characteristics on Job Scheduling Performance. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 1–17, 2000. 56 [2] O. Arndt, B. Freisleben, T. Kielmann, and F. Thilo. A Comparative Study of Online Scheduling Algorithms for Networks of Workstations. Cluster Computing, 3(2):95–112, 2000. 56 [3] P. J. Keleher and D. Perkovic. Randomization, Speculation, and Adaptation in Batch Schedulers. In Supercomputing, 2000. 56
[4] D. G. Feitelson. Logs of real parallel workloads from production systems. http:// www.cs.huji.ac.il/labs/parallel/workload/logs.html. 56, 57 [5] D. G. Feitelson. Analyzing the Root Causes of Performance Evaluation Results. Technical report 2002-4, Leibniz Center, Hebrew University, 2002. 69 [6] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In Workshop on Job Scheduling Strategies for Parallel Processing , pages 1–34. 1997. 55, 57 [7] D. Jackson, Q. Snell, and M. J. Clement. Core Algorithms of the Maui Scheduler. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 87–102, 2001. 56 [8] J. P. Jones and B. Nitzberg. Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 1–16, 1999. 55 [9] P. J. Keleher, D. Zotkin, and D. Perkovic. Attacking the Bottlenecks of Backfilling Schedulers. Cluster Computing, 3(4):245–254, 2000. 56, 70 [10] J. Krallmann, U. Schwiegelshohn, and R. Yahyapour. On the Design and Evaluation of Job Scheduling Algorithms. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 17–42, 1999. 55, 58 [11] D. Lifka. The ANL/IBM SP Scheduling System. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 295–303, 1995. 56 [12] A. W. Mu’alem and D. G. Feitelson. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. In IEEE Trans. Par. Distr. Systems, volume 12, pages 529–543, 2001. 55, 56, 58 [13] J. Skovira, W. Chan, H. Zhou, and D. Lifka. The EASY - LoadLeveler API Project. In Wkshp. on Job Sched. Strategies for Parallel Processing, pages 41–47, 1996. 55, 56 [14] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Characterization of Backfilling Strategies for Parallel Job Scheduling. In Proceedings of the ICPP-2002 Workshops, pages 514–519, 2002. 69 [15] A. Streit. On Job Scheduling for HPC-Clusters and the dynP Scheduler. In Proc. Intl. Conf. High Perf. Comp., pages 58–67, 2001. 56 [16] D. Talby and D. G. Feitelson. Supporting Priorities and Improving Utilization of the IBM SP Scheduler Using Slack-Based Backfilling. In Proceedings of the 13th International Parallel Processing Symposium, pages 513–517, 1999. 56, 70
Multiple-Queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems
Barry G. Lawson (1) and Evgenia Smirni (2)
(1) University of Richmond, Department of Mathematics and Computer Science, Richmond, VA 23173, USA, [email protected]
(2) The College of William and Mary, Department of Computer Science, P.O. Box 8795, Williamsburg, VA 23187, USA, [email protected]
Abstract. We describe a new, non-FCFS policy to schedule parallel jobs on systems that may be part of a computational grid. Our algorithm continuously monitors the system (i.e., the intensity of incoming jobs and variability of their resource demands), and adapts its scheduling parameters according to workload fluctuations. The proposed policy is based on backfilling, which reduces resource fragmentation by executing jobs in an order different than their arrival without delaying certain previously submitted jobs. We maintain multiple job queues that effectively separate jobs according to their projected execution time. Our policy supports different job priorities and job reservations, making it appropriate for scheduling jobs on parallel systems that are part of a computational grid. Detailed performance comparisons via simulation using traces from the Parallel Workload Archive indicate that the proposed policy consistently outperforms traditional backfilling. Keywords: batch schedulers, computational grids, parallel systems, backfilling schedulers, performance analysis.
1
Introduction
The ubiquity of parallel systems, from clusters of workstations to large-scale supercomputers interconnected via the Internet, makes parallel resources easily available to researchers and practitioners. Because this commodity of parallel resources is often underutilized, new research challenges emerge that focus on how best to harness the available parallelism of such computational grids. Resource allocation in parallel systems that are part of a grid is nontrivial. One of the major challenges is co-scheduling distributed applications
This work was partially supported by the National Science Foundation under grants EIA-9977030, EIA-9974992, CCR-0098278, and ACI-0090221.
across multiple independent systems, each of which may itself be parallel with its own scheduler. Traditional scheduling policies for stand-alone parallel systems focus on treating interactive and batch jobs differently in order to maximize the utilization of an (often) expensive system [2]. Because it reduces resource fragmentation and increases system utilization, backfilling has been proposed as a more efficient alternative to simple FCFS schedulers [7, 11]. Users are expected to provide nearly accurate estimates of job execution times. Using these estimates, the scheduler rearranges the waiting queue, allowing short jobs to move ahead of long jobs provided certain previously submitted jobs are not delayed. Various versions of backfilling have been proposed [4, 7, 9]. Keleher et al. characterize the effect of job length and parallelism on backfilling performance [4]. Perkovic and Keleher propose sorting by job length to improve backfilling and introduce the idea of speculative execution, in which long jobs are given a short trial execution to detect whether or not the jobs crash [9].

Industrial schedulers that are widely accepted by the high performance community, including the Maui Scheduler [6] and the PBS scheduler [10], offer a variety of configuration parameters. Such parameters include multiple queues to which different job classes are assigned, multiple job priorities, multiple scheduling policies per queue, and the capability to treat interactive jobs differently from batch jobs. The immediate benefit of such flexibility in policy parameterization is the ability to customize the scheduling policy according to the site's needs. Yet optimal policy customization to meet the needs of an ever changing workload is an elusive goal.

Scheduling jobs on a site that is part of a computational grid imposes additional challenges. The policy must cater to three classes of jobs: local jobs (parallel or sequential) that should be executed in a timely manner, jobs external to the site that do not have high priority (i.e., jobs that can execute when the system is not busy serving local jobs), and jobs that require reservations (i.e., require resources within a very restricted time frame to be successful).

In previous work, we proposed a multiple-queue aggressive backfilling scheduling policy that continuously monitors system performance and changes its own parameters according to changes in the workload [5]. In this paper, we propose modifications to the policy that address job priorities and job reservations. We conduct a set of simulation experiments using trace data from the Parallel Workload Archive [8]. Our simulations indicate that, even in the presence of inaccurate estimates, the proposed multiple-queue backfilling policy outperforms traditional backfilling when job priorities and reservations are considered.

This paper is organized as follows. Section 2 describes the proposed multiple-queue backfilling policy. Detailed performance analysis of the policy is given in Section 3. Concluding remarks are provided in Section 4.
2
Scheduling Policies
Successful contemporary schedulers utilize backfilling, a non-FCFS scheduling policy that reduces fragmentation of system resources by permitting jobs to execute in an order different from their arrival [4, 7]. A job that is backfilled is allowed to begin executing before previously submitted jobs that are delayed due to insufficient idle processors. Such non-FCFS execution order exploits otherwise idle processors, thereby increasing system utilization and throughput. The IBM LoadLeveler [3] and the Maui Scheduler [6] are examples of popular schedulers that incorporate backfilling. Aggressive backfilling permits a job to backfill provided the job does not delay the first job in the queue. Alternatively, conservative backfilling permits a job to backfill provided the job does not delay any previous job in the queue. Because the performance of aggressive backfilling has been shown superior to that of conservative backfilling [7], in this work we consider only aggressive backfilling.

Standard aggressive backfilling assumes a single queue of jobs to be executed. In previous work, we showed that the performance of aggressive backfilling improves by directing incoming jobs to separate queues according to job duration [5]. The goal of this multiple-queue policy is to reduce the likelihood of delaying a short job behind a long job. By separating jobs into different queues, a queued job competes directly only with jobs in the same queue for access to resources. Relative to using a single queue, short jobs therefore tend to gain access to resources more quickly, while long jobs tend to be delayed slightly. As a result, short jobs are assisted at the expense of long jobs using the multiple-queue policy, thereby improving the average job slowdown. Using detailed workload characterization of traces from the Parallel Workload Archive, four queues provided a nearly equal proportion of jobs per queue. Hence, we employed a four-part job classification (using actual job execution times) to effectively separate short from long jobs, thereby improving job slowdown (we direct the interested reader to [5] for further details on the four-part classification).

However, if estimates are not accurate, the classification does not separate jobs of different lengths effectively, and the policy performance may degrade significantly. Indeed, according to trace data, users tend to overestimate run times. Analysis of the workloads shows that the mean estimated run time is consistently twice the mean actual run time. In addition, many jobs appear to crash, i.e., have very long estimates but very short actual run times. For three traces from the Parallel Workload Archive, Table 1 shows the significant proportion of total jobs that have estimated run times greater than 1000 seconds but actual run times less than 180 seconds. This combination of overestimates and crashed jobs causes the four-part classification presented in [5] to fail. In this work, we use estimated job execution times from the traces and assume that users overestimate the actual run times of jobs. Correspondingly, we use a three-part job classification in response to these overestimates. In addition, because actual job execution times cannot be known a priori, we employ speculative execution of jobs [9] to quickly remove from the system a large proportion of the jobs that appear to crash.
Table 1. Proportion of (possibly) crashed jobs for three parallel workload traces

Trace   Jobs     Crashed   Proportion
CTC     79 302   12 903    0.16
KTH     28 487    3 000    0.11
SP2     67 665   15 974    0.24
The immediate benefit is that such jobs, which are actually short jobs, are not unwittingly grouped and scheduled with long jobs. These modifications permit our multiple-queue policy to improve job slowdown even in the presence of poor user estimates.

Within the context of scheduling resources in a computational grid, we supplement our multiple-queue backfilling policy by considering static job priority levels and job reservations as follows.

• We consider jobs submitted by local users to have high priority and those jobs submitted from an external source (i.e., from elsewhere in the computational grid) to have low priority. Our goal is to serve these external jobs as quickly as possible without inflicting delays on local jobs.
• We also assume that the system serves jobs that require execution at a specific time. Our goal is to accommodate these reservations as quickly as possible regardless of the consequences on remaining jobs.

We now describe in detail the multiple-queue backfilling policy with the necessary job prioritization and reservation schemes.
2.1 Multiple-Queue Backfilling with Job Priorities and Speculation
Multiple-queue backfilling allows the scheduler to automatically change system parameters in response to workload fluctuations, thereby reducing the average job slowdown [5]. The system is divided into multiple disjoint partitions, with one queue per partition. As shown in Figure 1, each partition is initially assigned an equal number of processors. As time evolves, the partitions may exchange control of processors so that processors idle in one partition can be used for backfilling in another partition. Therefore, partition boundaries become dynamic, allowing the system to adapt itself to changing workload conditions. Furthermore, the policy does not starve a job that requires the entire machine for execution.

In [5], based on workload characterization of actual run times, four queues provided the best separation of jobs to improve slowdown. Here, we empirically determined that a similar separation is achieved by directing jobs into three queues according to estimated run times. When a job is submitted, it is classified and assigned to the queue in partition p according to

    p = 1   if 0 < te < 1000
    p = 2   if 1000 ≤ te < 10 000          (1)
    p = 3   if 10 000 ≤ te
Fig. 1. In multiple-queue backfilling, initial partition boundaries adapt as workload conditions change. In this example, we consider 30 processors and three partitions (queues)
where te is the estimated job execution time in seconds. If the arriving job cannot begin executing immediately, it is placed into the queue in partition p after all jobs of the same priority that arrived earlier. More specifically, if the job has high priority, it is placed into the queue after any high priority jobs that arrived before it. If the job has low priority, it is placed into the queue after all high priority jobs and after any low priority jobs that arrived before it.

We use speculative execution to address the issue of a significant proportion of jobs that appear to crash.² If the estimated execution time of a submitted job is greater than 1000 seconds (i.e., belongs to partition two or three), the job is scheduled for speculative execution at the earliest possible time for a maximum of 180 seconds.³ If the job does not terminate (successfully or unsuccessfully) within the allotted 180 seconds, the job is killed and is then placed into the queue in partition p according to the job's priority. Without speculative execution, jobs with long estimates that crash quickly and jobs with extremely poor estimates will be classified inappropriately, causing the performance of the multiple-queue policy to suffer.

In general, the process of backfilling exactly one queued job (of possibly many queued jobs to be backfilled) proceeds as follows. Let p be the partition to which the job belongs. Define pivot_p to be the first job in the queue in partition p, and define pivot_start_time_p to be the time when pivot_p can begin executing. If the job under consideration is pivot_p, it begins executing only if the current time is equal to pivot_start_time_p, in which case a new pivot_p is defined. If the job is not pivot_p, the job begins executing only if there are sufficient idle processors in partition p without delaying pivot_p, or if partition p can obtain sufficient idle processors from one or more other partitions without delaying any pivot. This process of backfilling exactly one job is repeated, one job at a time, until all queued jobs have been considered. The policy considers high priority jobs first (in their order of arrival, regardless of partition) followed by low priority jobs (in their order of arrival, regardless of partition).

The multiple-queue aggressive backfilling policy with job priorities and speculation, outlined in Figure 2, is utilized whenever a job is submitted or whenever an executing job completes. If a high priority job arrives at partition p and finds pivot_p to have low priority, the high priority job immediately replaces the low priority job as pivot_p. Note that a high priority pivot takes precedence over any low priority pivot(s). In other words, the scheduling of a start time for a high priority pivot is permitted to delay other low priority pivots (but not other high priority pivots). The scheduling of a start time for a low priority pivot cannot delay any other pivots.

² Within the context of real systems, as a general rule, jobs cannot be killed and restarted. Speculative execution can be used, however, by permitting a user to flag a job as restartable (when appropriate) with the anticipation of improved slowdown [1].
³ We experimented with speculative execution times from one to five minutes. Speculative execution for a maximum of three minutes removes most of the jobs that appear to crash, as depicted in Table 1.
if (non-speculative arriving job or speculative job killed to be queued)
    1. p ← partition to which job is assigned
    2. insert into queue in partition p after all earlier-arriving, same-priority jobs
else
    schedule job immediately for speculative execution

for (high priority jobs in arrival order, then low priority jobs in arrival order)
    1. p ← partition in which job resides
    2. pivot_p ← first job in queue in partition p
    3. pivot_start_time_p ← earliest time when sufficient procs (from this and perhaps
       other partitions) will be available for pivot_p without delaying any pivot of
       equal or higher priority
    4. idle_p ← currently idle procs in partition p
    5. extra_p ← idle procs in partition p at pivot_start_time_p not used by pivot_p
    6. if job is pivot_p
        a. if current time equals pivot_start_time_p
            I. if necessary, reassign procs from other partitions to partition p
            II. start job immediately
    7. else
        a. if job requires ≤ idle_p and will finish by pivot_start_time_p, start job immediately
        b. else if job requires ≤ min{idle_p, extra_p}, start job immediately
        c. else if job requires ≤ (idle_p plus some combination of idle/extra procs from
           other partitions) such that no pivot is delayed
            I. reassign necessary procs from other partitions to partition p
            II. start job immediately
Fig. 2. Multiple-queue aggressive backfilling algorithm with job priorities and speculation
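The classification of Eq. (1) and the priority-ordered queue insertion can be sketched as follows. This is a minimal illustration, not the authors' simulator: the Job record and helper names are assumed, and the 180-second speculative trial for partitions two and three is omitted.

    # Sketch: Eq. (1) classification and insertion after earlier-arriving, same-priority jobs.
    from dataclasses import dataclass

    @dataclass
    class Job:
        arrival: float
        estimated_runtime: float   # user estimate t_e, in seconds
        high_priority: bool = True

    def partition_of(t_e):
        """Eq. (1): map an estimated run time (seconds) to partition 1, 2, or 3."""
        if t_e < 1000:
            return 1
        elif t_e < 10_000:
            return 2
        return 3

    def enqueue(queues, job):
        """Insert the job behind all queued jobs of its own priority class."""
        q = queues[partition_of(job.estimated_runtime)]
        if job.high_priority:
            idx = sum(1 for j in q if j.high_priority)   # directly after the last high-priority job
        else:
            idx = len(q)                                 # behind everything already queued
        q.insert(idx, job)

    queues = {1: [], 2: [], 3: []}
    enqueue(queues, Job(arrival=0.0, estimated_runtime=600))                          # partition 1
    enqueue(queues, Job(arrival=5.0, estimated_runtime=7200, high_priority=False))    # partition 2

Because high priority jobs are always inserted ahead of any low priority jobs, each partition queue keeps its high priority jobs at the front in arrival order, followed by the low priority jobs in arrival order.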
2.2
Backfilling with Reservations
A user may schedule a reservation for future execution of a job if, for example, a dedicated environment is desired. Accordingly, when a request for a reservation is submitted, the scheduler determines the earliest time greater than or equal to the requested reservation time when the job can be serviced, and immediately schedules the job for execution at that time. For simplicity, we assume that once a job receives a reservation, the reservation will not be canceled nor can the time of the reservation be changed. Furthermore, we assume that all non-reservation jobs have the same priority. Therefore, the process of backfilling with reservations remains as described in Section 2.1, with the exception that all reservations must be honored.
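As an illustration of this earliest-feasible-time search (not the authors' implementation), the sketch below assumes a simple availability profile of (time, free processors) breakpoints, sorted by time, and scans candidate start times at or after the requested reservation time.

    # Sketch: earliest start >= requested_time at which `procs` CPUs are free for `runtime` seconds.
    def earliest_reservation_start(profile, requested_time, procs, runtime):
        candidates = sorted({requested_time} | {t for t, _ in profile if t > requested_time})
        for start in candidates:
            if fits(profile, start, start + runtime, procs):
                return start
        return None

    def fits(profile, start, end, procs):
        """True if at least `procs` processors are free over [start, end)."""
        for i, (t, free) in enumerate(profile):
            t_next = profile[i + 1][0] if i + 1 < len(profile) else float("inf")
            if t_next <= start or t >= end:
                continue            # this segment does not overlap [start, end)
            if free < procs:
                return False
        return True

    profile = [(0, 2), (100, 6), (400, 10)]      # e.g., some of 10 CPUs are busy early on
    print(earliest_reservation_start(profile, requested_time=50, procs=5, runtime=200))   # -> 100

Because free capacity only changes at breakpoints, it suffices to test the requested time itself and each later breakpoint.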
3
Performance Analysis
In this section, we evaluate via simulation the performance of our multiple-queue backfilling policy relative to standard single-queue backfilling. Our simulation experiments are driven using the CTC, KTH, PAR (1996), and SP2 workload traces from the Parallel Workload Archive [8]. From the traces, for each job we extract the arrival time of the job (i.e., the submission time), the number of processors requested, the estimated duration of the job (if available), and the actual duration of the job. Because we do not use the job completion times from the traces, the scheduling strategies used on the corresponding systems are not relevant to our study. The selected traces are summarized below.

• CTC: This trace contains entries for 79 302 jobs that were executed on a 512-node IBM SP2 at the Cornell Theory Center from July 1996 through May 1997.
• KTH: This trace contains entries for 28 487 jobs executed on a 100-node IBM SP2 at the Swedish Royal Institute of Technology from October 1996 through August 1997.
• PAR: This trace contains entries for 38 723 jobs that were executed on a 416-node Intel Paragon at the San Diego Supercomputer Center during 1996. Because this trace contains no user estimates, we use the actual run times as accurate estimates.
• SP2: This trace contains entries for 67 665 jobs that were executed on a 128-node IBM SP2 at the San Diego Supercomputer Center from May 1998 through April 2000.

For all results to follow, we compare the performance of multiple-queue backfilling (using the three-part classification described in Section 2.1) to single-queue backfilling, both employing speculative execution. We consider aggregate performance measures, i.e., average statistics computed for all jobs for the entire experiment, and transient performance measures, i.e., snapshot statistics for batches of 1000 jobs that are plotted across the experiment time and illustrate how well
the policy reacts to sudden changes in the workload. The performance measure of interest here is the job slowdown s defined by

    s = 1 + d/ν                                    (2)

where d and ν are respectively the average delay time and actual service time of a job. To compare the performance results of multiple-queue backfilling with standard single-queue backfilling, we also define the slowdown ratio R by the equation

    R = (s1 − sm) / min{s1, sm}                    (3)

where s1 and sm are the single-queue and multiple-queue average slowdowns respectively. R > 0 indicates a gain in performance using multiple queues relative to a single queue. R < 0 indicates a loss in performance using multiple queues relative to a single queue.
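Both metrics translate directly into code; the following minimal sketch (ours, not from the paper) evaluates Eqs. (2) and (3) for a pair of average slowdowns.

    # Sketch: slowdown (Eq. 2) and slowdown ratio (Eq. 3).
    def slowdown(delay, service):
        return 1.0 + delay / service                         # Eq. (2)

    def slowdown_ratio(s_single, s_multi):
        return (s_single - s_multi) / min(s_single, s_multi)  # Eq. (3)

    # Example: single-queue slowdown 4.0 vs. multiple-queue slowdown 2.5
    print(slowdown_ratio(4.0, 2.5))    # 0.6 > 0, i.e., a gain for multiple queues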
3.1 Multiple-Queue Backfilling Performance
We first consider the performance of multiple-queue backfilling with no job priorities or reservations. Figure 3 depicts the aggregate slowdown ratio R of multiple-queue backfilling relative to single-queue backfilling (computed using the average slowdown obtained using each policy) for each of the four traces. Figure 3(a) depicts R for all job classes combined, while Figures 3(b)–(d) each depict R for an individual job class. As shown, multiple-queue backfilling provides better job slowdown (i.e., R > 0) for all classes combined (Figure 3(a)). With the exception of the long job class in the two SDSC workloads (Figure 3(d)), multiple-queue backfilling also provides better average job slowdown within each of the individual job classes.

Because a system can experience significant changes in workload across time, we also consider the transient performance of multiple-queue backfilling. Figure 4 depicts transient snapshots of the slowdown ratio versus time for each of the four traces. Again, marked improvement in job slowdown is achieved (R > 0) using multiple-queue backfilling. Although single-queue backfilling provides better slowdown (R < 0) for a few batches, R is positive a majority of the time, corresponding to performance gains with multiple-queue backfilling.
3.2 Performance under Heavy Load
Most policies perform well under low system load because little, if any, queuing is present. To further evaluate multiple-queue backfilling, we now consider its performance under heavy system load when scheduling is more difficult. We impose a heavier system load than that of the trace by linearly scaling (reducing) subsequent interarrival times in the trace. Effectively, we linearly increase the arrival rate of jobs in the system. Note that with this modification, we preserve the statistical characteristics of the arrival pattern in the original trace, except that the same jobs now arrive “faster”.
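A minimal sketch of this interarrival-time scaling (ours; arrival times are assumed to be seconds from the start of the trace) is shown below. A factor of 1.25 makes the same job sequence arrive 1.25 times faster while preserving the shape of the arrival pattern.

    # Sketch: linearly scaling interarrival times to increase the arrival rate.
    def scale_arrivals(arrival_times, factor):
        scaled = [arrival_times[0]]
        for prev, cur in zip(arrival_times, arrival_times[1:]):
            scaled.append(scaled[-1] + (cur - prev) / factor)
        return scaled

    print(scale_arrivals([0, 100, 400, 1000], 1.25))   # [0, 80.0, 320.0, 800.0]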
Fig. 3. Overall and per-class aggregate slowdown ratio R for each of the four traces
Fig. 4. Slowdown ratio R per 1000 job submissions as a function of time
Fig. 5. Overall and per-class aggregate slowdown ratio R for each of the four traces with increasing system load. All ratios are computed relative to single-queue backfilling under the same load

Figure 5 again depicts the aggregate slowdown ratio R of multiple-queue backfilling relative to single-queue backfilling for each of the four workloads. However, here for each workload we display R for the original arrival rate and for arrival rates multiplied by factors of 1.25 and 1.50. In all figures, the multiple-queue and single-queue backfilling policies experience the same rate of arriving jobs. Consistent with the original trace results presented in Figure 3 (i.e., a multiplicative factor of 1.00), multiple-queue backfilling provides better average job slowdown than single-queue backfilling for all job classes combined and for each individual job class (with the exception of the two SDSC workloads in the long job class). When we increase the arrival rate, multiple-queue backfilling continues to provide better average job slowdown for all job classes combined (Figure 5(a)) and for the small and medium job classes (Figures 5(b) and (c)). As discussed earlier, a queued job competes directly only with other jobs in the same queue, so that short jobs tend to be scheduled more quickly and long jobs tend to be delayed slightly. Therefore, multiple-queue backfilling assists shorter jobs at the expense of long jobs, and a decline in the performance of the long job class is unavoidable (Figure 5(d)). However, the magnitude of this decline is generally much smaller than the magnitude of improvement achieved for the other job classes. Also notice in Figure 5(d) that, relative to single-queue backfilling, multiple-queue backfilling performs worse for a multiplicative increase of 1.25 than for
1.50 for the PAR and SP2 traces. Because job scheduling is very dependent on the arriving workload, the backfilling that occurs for an increase of 1.25 may be very different from that for an increase of 1.50. Therefore, a monotone pattern in improvement or decline cannot be expected.
3.3 Performance under Job Priorities
We now consider a system in which 75% of the jobs are high priority, i.e., 25% of the submissions are from an external source in the computational grid. We select at random 75% of the jobs from the trace to be high priority jobs, so that the remaining 25% have low priority.

Figure 6 depicts the corresponding aggregate slowdown ratio R of multiple-queue backfilling relative to single-queue backfilling for each of the four traces. Figure 6(a) shows R for all job classes combined, while Figures 6(b)–(d) each show R for an individual job class. For each trace, we also provide R as computed for high priority jobs, for low priority jobs, and for both priorities combined. As shown in this figure, multiple-queue backfilling provides better average job slowdown than single-queue backfilling for all job classes combined (Figure 6(a)). Also note that, with the exception of the long job class (Figure 6(d)), multiple-queue backfilling tends to perform better within each of the individual job classes. Again, because multiple-queue backfilling assists shorter jobs at the expense of long jobs, a relatively small decline in the performance of the long job class is unavoidable.

We also consider the transient performance of multiple-queue backfilling under job priorities. Figure 7 depicts transient snapshots of the slowdown ratio versus time for each of the four traces with 75% high priority jobs. Each figure shows slowdown ratio snapshots for high priority jobs and low priority jobs. Again, marked improvement in slowdown is achieved (R > 0) using multiple-queue backfilling. Although single-queue backfilling provides better slowdown (R < 0) for a few batches, R is positive a majority of the time, corresponding to performance gains with multiple-queue backfilling.

In addition, we consider a system in which only 5% of the submissions are external, i.e., 95% of the jobs have high priority. Figures 8 and 9 are analogous to Figures 6 and 7 but with 95% high priority jobs. Again, multiple-queue backfilling achieves better job slowdown for all job classes combined and, with the exception of the long jobs in the two SDSC workloads, for the individual job classes. Also note the larger vertical axis scales in Figure 8, corresponding to even larger performance gains than with 75% high priority jobs.

Notice that in Figures 6 and 8, multiple-queue backfilling often achieves a greater performance improvement for low priority jobs than for high priority jobs. By its very nature, a high priority job will receive preferential treatment in either single-queue or multiple-queue backfilling. However, by appropriately separating jobs, multiple-queue backfilling is able to assist low priority jobs as well, thereby achieving performance improvements over single-queue backfilling for high and low priority jobs.
Fig. 6. Overall and per-class aggregate slowdown ratio R for each of the four traces with 75% high priority jobs
Fig. 7. Slowdown ratio R per 1000 job submissions as a function of time for high priority and low priority jobs for each of the four traces with 75% high priority jobs
Fig. 8. Overall and per-class aggregate slowdown ratio R for each of the four traces with 95% high priority jobs
Fig. 9. Slowdown ratio R per 1000 job submissions as a function of time for high priority and low priority jobs for each of the four traces with 95% high priority jobs
3.4
Performance under Reservations
We now consider a system incorporating reservation requests. For each of the four traces, Figure 10 depicts the average job slowdown for all classes combined with proportions of 0.01, 0.05, and 0.25 of the total jobs requesting reservations. As shown, multiple-queue backfilling provides better average job slowdown for the 0.01 and 0.05 proportions, and provides comparable slowdown for the 0.25 proportion. In addition, Table 2 shows the number of missed reservations for single- and multiple-queue backfilling for each of the four traces with proportions of 0.01, 0.05, and 0.25 of the total jobs requesting reservations. Note that single-queue and multiple-queue backfilling miss roughly the same number of reservations. For a proportion of 0.25 of the total jobs requesting reservations, Figure 11 depicts for each trace the tail of the distribution of delays experienced by jobs requesting reservations. As shown, multiple-queue and single-queue backfilling achieve roughly the same distribution for reservation delays. Although we cannot claim significant improvement relative to the number of missed reservations and the distribution of reservation delays, multiple-queue backfilling performs at least as well as single-queue backfilling.
4
Conclusions
We presented multiple-queue backfilling as a viable approach for scheduling resources in parallel systems that are part of a computational grid. Each job is assigned to a queue according to its expected execution time. Each queue is assigned a non-overlapping partition of system resources on which jobs from the queue can execute.
Fig. 10. Overall aggregate slowdown ratio R for each of the four traces with proportions of 0.01, 0.05, and 0.25 of the total jobs requesting reservations
Table 2. Number of missed reservations for single- and multiple-queue backfilling with proportions of 0.01, 0.05, and 0.25 of the total jobs requesting reservations

Workload   Proportion of   Number of      Single-Queue   Multiple-Queue
           Reservations    Reservations   Missed         Missed
CTC        0.01               761            19             13
           0.05             3 908            90             90
           0.25            19 897         1 396          1 441
KTH        0.01               273             9             12
           0.05             1 421            45             38
           0.25             7 178           567            546
PAR        0.01               374             5              1
           0.05             1 873             7              9
           0.25             9 543           138            119
SP2        0.01               652            54             46
           0.05             3 349           294            250
           0.25            16 968         4 051          3 983
Fig. 11. Distribution tails of the delays experienced by jobs requesting reservations with a proportion of 0.25 of the total jobs requesting reservations
Partition boundaries change dynamically, adjusting to fluctuations in arrival intensities and workload mix. For the partitioning criteria, we assume users overestimate the job run time. We also incorporate speculative execution to combat the detrimental effect of crashed jobs on the policy. The proposed policy consistently outperforms single-queue backfilling, even under heavy load. The performance gains are a direct result of the fact that the multiple-queue policy significantly reduces the likelihood that a short job is overly delayed in the queue behind a very long job. Multiple-queue backfilling also yields prominent performance gains when jobs with different priorities are considered, and performs at least as well as single-queue backfilling when reservations are considered.
Acknowledgments We thank Tom Crockett and Daniela Puiu for useful discussions that contributed to this work. We also thank Dror Feitelson for the availability of workload traces through the Parallel Workload Archive. Thanks also to the anonymous referees for their helpful comments and suggestions.
References [1] Crockett, Tom: Private Communication, June 2002. [email protected], http://www.compsci.wm.edu/∼tom/. 76 [2] Feitelson, D. G.: A Survey of Scheduling in Multiprogrammed Parallel Systems. Technical Report RC 19790, IBM Research Division, October 1994. 73 [3] IBM LoadLeveler: http://www.ibm.com/. 74 [4] Keleher, P., Zotkin, D., and Perkovic, D.: Attacking the Bottlenecks in Backfilling Schedulers. Cluster Computing: The Journal of Networks, Software Tools and Applications. 3(4) (2000) 245–254. 73, 74 [5] Lawson, B. G., Smirni, E., and Puiu D.: Self-Adapting Backfilling Scheduling for Parallel Systems. Proceedings of the 2002 International Conference on Parallel Processing (ICPP 2002). August 2002, 583–592. 73, 74, 75 [6] Maui Scheduler Open Cluster Software: http://mauischeduler.sourceforge.net/. 73, 74 [7] Mu’alem, A. W. and Feitelson, D. G.: Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Transactions on Parallel and Distributed Systems. 12(6) (2001) 529–543. 73, 74 [8] Parallel Workload Archive: http://www.cs.huji.ac.il/labs/parallel/workload/. 73, 78 [9] Perkovic, D. and Keleher, P.: Randomization, Speculation, and Adaptation in Batch Schedulers. Proceedings of Supercomputing 2000 (SC 2000). November 2000. 73, 74 [10] Portable Batch System: http://www.openpbs.org/. 73 [11] Talby, D. and Feitelson, D. G.: Supporting Priorities and Improving Utilization of the IBM SP2 Scheduler Using Slack-Based Backfilling. Proceedings of the 13th International Parallel Processing Symposium. April 1999, 513–517. 73
Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy
William A. Ward, Jr., Carrie L. Mahood, and John E. West
Computer Sciences Corporation, Attention: CEERD-IH-C, U.S. Army Engineer Research and Development Center, Major Shared Resource Center, 3909 Halls Ferry Road, Vicksburg, Mississippi 39180, USA
{William.A.Ward.Jr, Carrie.L.Mahood, John.E.West}@erdc.usace.army.mil
http://www.erdc.hpc.mil
Abstract. Backfill is a technique in which lower priority jobs requiring fewer resources are initiated before one or more currently waiting higher priority jobs requiring as yet unavailable resources. Processors are frequently the resource involved and the purpose of backfilling is to increase system utilization and reduce average wait time. Generally, a scheduler backfills when the user-specified run times indicate that executing the lower priority jobs will not delay the anticipated initiation of the higher priority jobs. This paper explores the possibility of using a relaxed backfill strategy in which the lower priority jobs are initiated as long as they do not delay the highest priority job too much. A simulator was developed to model this approach; it uses a parameter ω to control the length of the acceptable delay as a factor times the wait time of the highest priority job. Experiments were performed for a range of ω values with both user-estimated run times and actual run times using workload data from two parallel systems, a Cray T3E and an SGI Origin 3800. For these workloads, overall average job wait time typically decreases as ω increases and use of user-estimated run times is superior to use of actual run times. More experiments must be performed to determine the generality of these results.
1
Scheduling Policies
Many practical job scheduling policies, whether for uniprocessor or multiprocessor systems, incorporate the notion of job priority. Perhaps the simplest example of a priority scheme is setting a job's priority to elapsed time in the queue; if this priority scheme is used to dictate the order of job initiation, then a "first-come, first-served" (FCFS) policy results. Other, more elaborate schemes based on the number of processors requested and user estimates of run time are, of course, possible. A second important concept involves how to use the resulting prioritized list of jobs. If, when a job completes, the prioritized list is searched for the first job that will run using the available number of processors, then a "first-fit, first-served" (FFFS) policy results; there are also "best-fit, first-served" (BFFS) (to fit the available number of processors the tightest) and "worst-fit, first-served"
(WFFS) (to fit the most jobs) versions of this policy. Performance of these and other policies has been discussed in [1, 2, 3].

Another approach to using the resulting prioritized job list is to treat the highest priority job as the one that must execute next, and then save resources as other jobs complete until there are sufficient resources to run that job. Depending on the workload, this may lead to underutilization of the system, increased wait times, and also to job "starvation," where a job is waiting to execute but is never initiated [4, p. 38]. A partial solution to this is to execute lower priority jobs, but then pre-empt them when necessary to execute the highest priority job. This is implemented by virtually all operating systems on uniprocessors and tightly-coupled multiprocessors; very often the job will remain memory resident while pre-empted so that it may be easily restarted at the next time quantum. Implementing this capability on a large-scale parallel system requires a job checkpointing capability; some parallel systems, e.g., the Cray T3E, have it while others, particularly cluster systems, do not. Even on those that do, it is sometimes undesirable to operate the system in that mode because of the resources wasted by frequently changing from one job to another. In fact, on highly parallel systems where the processors required to run the job are the critical resource, the scheduler will often allocate processors to a job for the job's lifetime and simply allow the job to run to completion, never pre-empting it [5]. This is known as "variable partitioning" [6] or "pure space sharing" [1].

A second way to resolve this issue is to allow the scheduler to use "backfilling," a technique in which lower priority jobs requiring fewer resources are initiated before one or more currently waiting higher priority jobs requiring as yet unavailable resources [7, 8]. Here, users supply the number of processors required to run their jobs along with an estimate of the time required. Based on this information, lower priority jobs are allocated idle processors and allowed to run as long as they complete before the earliest feasible time for running the highest priority job. Apparently, this scheme depends on user time estimates for its effectiveness. However, there is some evidence that poor estimates do not adversely affect overall performance of this scheduling policy, although individual jobs may be delayed [9, p. 91][10, pp. 140-141]. Generally, use of backfill significantly improves system utilization and reduces job wait time versus the same scheduling policy without backfill. Not all scheduling policies are amenable to the use of backfill; the strategy that repeatedly dispatches the highest priority job that fits until no more jobs can be started is an example.
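The first-fit, best-fit, and worst-fit selections from a prioritized list can be sketched as follows (a minimal Python illustration under our own naming, not code from the paper); the queue is assumed to be sorted by decreasing priority, and each entry records the processors requested.

    # Sketch: choosing one backfill/dispatch candidate under FFFS, BFFS, and WFFS.
    def first_fit(jobs, idle_procs):
        for job in jobs:                                   # highest priority job that fits
            if job["procs"] <= idle_procs:
                return job
        return None

    def best_fit(jobs, idle_procs):
        fitting = [j for j in jobs if j["procs"] <= idle_procs]
        return max(fitting, key=lambda j: j["procs"], default=None)   # tightest fit

    def worst_fit(jobs, idle_procs):
        fitting = [j for j in jobs if j["procs"] <= idle_procs]
        return min(fitting, key=lambda j: j["procs"], default=None)   # leaves the most room

    queue = [{"priority": 30, "procs": 64}, {"priority": 20, "procs": 16}, {"priority": 10, "procs": 8}]
    print(first_fit(queue, 32), best_fit(queue, 32), worst_fit(queue, 32))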
2
Backfill Variants
There are two basic approaches to backfill: conservative backfill, which allows a lower priority job to run only if it will not delay any of the higher priority waiting jobs, and aggressive backfill, which allows a lower priority job to run if it will not delay the highest priority job [11]. Regardless of which approach is used, there are further subvariants distinguished by how the backfilled job is chosen. Possibilities include (i) selecting the highest priority job that will fit,
(ii) selecting the job that fits the best, and (iii) selecting a combination of jobs that fit the best. All three of these approaches are essentially implementations of a greedy strategy that "makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution" [12, p. 329]. Optimal performance is not achievable since it would require knowledge of jobs yet to be submitted. Obviously the interpretation of "fit" allows for variations. An easy approach considers only the number of processors requested; however, this may not be optimal in terms of system utilization. For example, suppose two jobs requesting the same number of processors may be backfilled, and the first job has a higher priority and a lower time estimate than the second. If, after running the first job, no other jobs may be backfilled, then system utilization over that period will be lower. If the number of processors over time is treated as a two-dimensional space to be filled with jobs, then packing algorithms that consider all waiting jobs as candidates may be applied.

This notion of backfill fit has been extended by Talby and Feitelson in their concept of "slack-based backfilling" [5]. In this approach, three parameters – a job's individual priority, a tunable systemwide "slack factor," and the system average wait time – are used to compute a job's "slack." The system slack factor is used to control how long jobs will have to wait for execution; e.g., a slack factor of 3 implies that jobs may wait up to 3 times the average wait time. Once priorities and slacks have been calculated for all waiting jobs, then it is possible to compute a cost for a particular schedule of these jobs. Selecting the least costly schedule from all possible schedules is analogous to the knapsack problem [13, pp. 3, 261] and is, as noted in [5], an NP-hard problem. Talby and Feitelson provide several heuristics to reduce the search space of candidate schedules, and then use a simulator implementation of their method to demonstrate its effectiveness.

A different perspective on selecting backfill candidates is found in [14] and [15]; there, the authors contrast the use of accurate run time estimates in backfill algorithms with user-supplied overestimates. A parameter R is used to specify the overestimate, with R = 1 corresponding to the actual run time. Their tests included two actual workload traces and one artificial trace. Significantly, they observed decreasing average wait time with increasing R, particularly for the actual workloads. (For convenience, this will be referred to as the "ZK" method after the last names of the authors in [14].)

Instead of using the standard aggressive backfill criterion, this paper proposes a "relaxed" backfill technique in which jobs are backfilled as long as they do not delay the highest priority job "too much." This approach is similar in spirit to the slack-based approach in that the highest priority job may not be scheduled at the earliest possible time, and similar in technique to the ZK method in that it increases the size of the backfill time window. A tunable system parameter, ω, expressing the allowable delay as a factor of the user-specified run time of the highest priority job, controls the degree of backfill relaxation used by the scheduler. For example, if ω = 1.2, then a job may be backfilled as long as it does not increase the wait time of the highest priority job by more than 20 percent. This relaxed backfill approach is illustrated in Fig. 1.
(Panels: a) no backfill, b) strict backfill, c) relaxed backfill; each plots number of CPUs versus time for jobs A, B, C (pri = 30), D (pri = 20), and E (pri = 10), with C, D, and E queued.)
Fig. 1. Schedule for running five jobs under three scheduling policies
Increasing ω has an effect similar to increasing R in the ZK approach in that it increases the backfill time window in which jobs may be scheduled. Setting the relaxation parameter to ∞ allows several interesting cases, depending on the priority calculation used; this will be discussed below. Other variants include making ω relative to CPU hours requested or to priority instead of run time. For example, in the latter case, a job would be eligible for backfilling if there were sufficient processors to run it and if its priority when multiplied by ω was greater than the priority of the highest priority job.
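A minimal sketch of the relaxed eligibility test, in the wait-time form used by the simulator described in Sect. 4 (the parameter names here are ours), is given below.

    # Sketch: relaxed aggressive-backfill eligibility with window omega * hpwait.
    def can_backfill(hpwait, omega, idle_procs, procs_after_hp_start, job_procs, job_runtime):
        """hpwait: time until the highest priority job could start; omega: relaxation factor."""
        if job_procs > idle_procs:
            return False                          # not enough idle processors right now
        if job_runtime <= omega * hpwait:
            return True                           # finishes inside the relaxed window
        return job_procs <= procs_after_hp_start  # or fits beside the highest priority job

    # omega = 1.2 lets a candidate extend the top job's wait by up to 20 percent.
    print(can_backfill(hpwait=600, omega=1.2, idle_procs=16,
                       procs_after_hp_start=4, job_procs=8, job_runtime=700))   # True

With ω = 1 the window test essentially reduces to standard aggressive backfill; larger values of ω widen the window at the cost of possibly delaying the highest priority job.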
3
Job Priority
As has already been noted, use of backfill requires some notion of job priority, since that determines the highest priority job for which resources are being reserved and affects the choice of backfilled jobs. Intuitively, as a job waits longer, its priority should increase; e.g., the priority calculation should include a factor such as (t − tq)^α, where t is the current time and tq is the time the job was queued. Next, although this is somewhat arbitrary, jobs with shorter user time estimates may be favored over ones estimated to run longer, corresponding to a factor of u^β, β < 0, where u is the user-estimated walltime for the job. Further, the number of processors may be taken into account in calculating job priority. In a similar spirit to the estimated walltime, one might assign a higher priority to a job requesting fewer processors. However, running jobs requiring only one or a few processors is often considered not the best use of expensive, highly parallel systems, and so one might be led to the opposite conclusion and assign a lower priority to such a job. When selecting a backfill candidate, this latter alternative has the attractive property of tending to favor jobs that fit the backfill space tighter. This recommends inclusion of the factor n^γ, where n is the user-requested number of processors. Note that even with a low priority, a job requiring few processors still makes a good backfill candidate and tends to be initiated relatively quickly. Finally, the local queue structure may also be reflected in the priority calculation; e.g., a site having four queues – background, primary, challenge, and urgent – might assign factors of 1, 10, 100, and 1000, respectively, to reflect the increasing importance of jobs in those categories. More generally, this factor could be of the form r^δ, where r represents the relative importance of the queues.

Suppose time is measured in seconds, a typical job uses 32 processors, and that the relative importance of the queues is background: r = 1, primary: r = 10, challenge: r = 100, and urgent: r = 1000. Then a possible job priority calculation is of the form

    P = ((t − tq)/3600)^α (u/3600)^β (n/32)^γ (r/10)^δ .        (1)

Thus, a typical 32-processor job having an estimated 1-hour run time that has waited in the primary queue for 1 hour has a current priority of 1. Obviously,
the purpose of the denominators is to approximately equalize the importance of the various terms. Coupled with the backfill scheme described above, several policies are possible. If α = 1 and β = γ = δ = ω = 0, then an FCFS algorithm results. α = 1, β = γ = δ = 0, and ω = ∞ yields an FFFS policy. If α = β = δ = 0 and ω = ∞, then γ = 1 produces a BFFS policy, while γ = −1 produces a WFFS policy. Setting α, β, and δ to nonzero values specifies a family of interesting subvariants of these policies. Furthermore, the tunability of the parameters allows for emphasis of job aspects that are most important for a particular system.
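Eq. (1) translates directly into a small function; the sketch below (ours, not the authors' simulator code) reproduces the worked example of a 32-processor, 1-hour-estimate job that has waited one hour in the primary queue.

    # Sketch: the priority of Eq. (1). Times are in seconds; r is the queue weight.
    def priority(t, tq, u, n, r, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
        return (((t - tq) / 3600.0) ** alpha) * ((u / 3600.0) ** beta) \
             * ((n / 32.0) ** gamma) * ((r / 10.0) ** delta)

    print(priority(t=3600, tq=0, u=3600, n=32, r=10))   # 1.0, matching the example above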
4
Simulation Results
A scheduler simulator written in Perl was used to compare the effectiveness of the proposed approach with several other scheduling policies. The algorithm used in the simulator is shown in Fig. 2. The key element of this algorithm is the backfill window calculation. If t + hpwait is the earliest time that the highest priority job could run, then the backfill window is bfwindow = ω ∗ hpwait. Backfill candidates are jobs for which there are currently enough CPUs and that will either complete before t + bfwindow or that will run using less than or equal to the number of CPUs available after the highest priority job starts.

The priority calculation in Eq. 1 was used to assign job priorities with α = β = γ = δ = 1 and the previously mentioned r = 1, 10, 100, 1000 values to discriminate between job types. In particular, the (u/3600)^β term places jobs with longer run time estimates higher in the priority list, while the (n/32)^γ term does the same for jobs using larger numbers of processors; this tends to somewhat counteract the tendency to backfill with numerous short, small jobs. Note that the wait time term in the priority calculation causes jobs submitted at the same time to have the same zero priority even though they may have very different characteristics. This situation was remedied by assigning a minimum one minute wait time to each job. Finally, priorities for waiting jobs were recalculated once every simulated minute.

Utilization data from two systems at the U.S. Army Engineer Research and Development Center (ERDC) Major Shared Resource Center (MSRC) were used in this study. The first machine is a Cray T3E currently running version 2.0.5.49 of the UnicosMK operating system (OS) and version 2.1p14 of the OpenPBS scheduler; the scheduler has been highly customized for use at the ERDC MSRC. The T3E was initially configured with 512 600-MHz Alpha processors, each with 256 Mb of memory. During 5-15 August 2001, part of the period under study, this system was out of service while it was reconfigured to include an additional 256 675-MHz Alpha processors, each with 512 Mb of memory. This change in number of processors is reflected in this study. These processors are termed "application processing elements" (PEs). There are additional "command PEs" available for interactive login and "OS PEs" for OS processes, but because they are not part of the processor pool available to the scheduler, they are not included in this study.
SPECIFY number of CPUs in model
SPECIFY queues (e.g., background, primary, challenge)
SPECIFY dt (time interval for checking queue status)
SPECIFY omega and parameters for priority calculation
SET t = 0
SET twake = 0
WHILE ( more input OR jobs in ready queue OR jobs in run queue )
    IF ( t >= twake AND more input )
        READ next command
        IF ( command is "queue <job>" )
            PUT job in ready queue
        ELSE IF ( command is "sleep <tsleep>" )
            SET twake = t + tsleep
        END IF
    END IF
    REMOVE any completed jobs from run queue
    RECALCULATE priorities of jobs in ready queue
    SORT ready queue in descending order by priority
    WHILE ( enough CPUs )
        START highest priority job (top of ready queue)
    END WHILE
    SET hpwait = time until highest priority job can run
    SET bfwindow = omega * hpwait
    SET npavail = CPUs idle after highest pri. job starts
    WHILE ( more jobs in ready queue to check )
        IF ( enough CPUs to run this job )
            IF ( this job will complete before t+bfwindow )
                START backfill job
            ELSE IF ( this job needs <= npavail CPUs )
                START backfill job
                npavail = npavail - this job's no. of CPUs
            END IF
        END IF
    END WHILE
    SET t = t + dt
END WHILE
Fig. 2. Simulator algorithm for relaxed backfill
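The backfill step of Fig. 2 can also be sketched in a few lines of Python; the job fields (procs, estimate) and the bookkeeping arguments are illustrative assumptions, and the sketch omits the reservation and priority-recalculation logic shown in the figure.

    # Illustrative sketch of the relaxed-backfill candidate test from Fig. 2;
    # the field names and arguments are hypothetical, not taken from the Perl simulator.
    def backfill_pass(ready_queue, hpwait, cpus_idle, npavail, omega):
        """Start every queued job that fits the relaxed backfill window."""
        bfwindow = omega * hpwait              # relaxed window: omega >= 1
        started = []
        for job in ready_queue:                # ready_queue is already sorted by priority
            if job.procs > cpus_idle:
                continue                       # not enough idle CPUs right now
            if job.estimate <= bfwindow:       # would finish before t + bfwindow
                started.append(job)
                cpus_idle -= job.procs
            elif job.procs <= npavail:         # fits beside the reserved highest-priority job
                started.append(job)
                cpus_idle -= job.procs
                npavail -= job.procs
            # otherwise starting the job could delay the highest-priority job; skip it
        return started

With omega = 1 this reduces to aggressive backfill, and letting omega grow without bound reproduces the ω = ∞ case simulated below.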
The second machine is an SGI Origin 3800 (O3K) currently running version 6.5 of the IRIX OS and version 5.1.4C of the PBS Pro scheduler. The O3K has 512 400-MHz MIPS R12000 processors grouped into 4-processor nodes; each node has 4 GB of memory. On 2 February 2002, this system was reconfigured to implement cpusets with the net effect being to reduce the number of available processors from 512 to 504, with 8 processors being reserved for the OS. This change is also reflected in the study. Ordered from lowest to highest priority, there are "background," "primary," "challenge," and "urgent" queues on both systems. A guiding operational tenet is to give jobs in these latter two categories premium service. Additionally, there is a "dedicated" queue that allows a particular group of users exclusive access to the system. In preparation for running in dedicated mode, the scheduler stops submitting jobs from the other queues and running jobs are allowed to complete. Then, while in dedicated mode, only jobs in the dedicated queue are run. At the end of this period, the other queues are restarted. While in this mode a system is essentially operating outside of the normal scheduling regime because only jobs in one particular queue may run and other jobs, regardless of their priority and of system resource availability, may not. Consequently, 80 out of 66,038 jobs on the T3E and 362 out of 71,500 jobs on the O3K run in this mode were eliminated from the study. This left 65,958 jobs run on the T3E between 10 October 1999 and 10 April 2002 with an observed average wait time of 2.15 hours, and 71,138 jobs run on the O3K between 13 November 2000 and 9 April 2002 with an observed average wait time of 3.33 hours; these were used as input data to the simulation study. A range of backfill relaxation levels was studied using the run logs from both systems: ω = 1 (a variant of aggressive backfill) to ω = 10 in increments of 0.5, and ω = ∞ (simulated by setting ω = 1,000,000). For each of the two systems and for each of the 20 ω levels, two scenarios were simulated: one using the user-estimated run time, and one using the actual run time. In each case the run time (estimated or actual) was used in both the priority calculation (Eq. 1) and the backfill algorithm. In each of these 80 runs, overall average job wait time was computed. The results contrasting use of user-estimated run time with use of actual run time are shown in Figs. 3 and 4. As expected, as ω increases, overall average job wait time for the entire period generally decreases; this decrease is also reflected in overall averages by month (not shown). Specifically, relaxed backfill for higher values of ω typically outperforms aggressive backfill (ω = 1). Furthermore, job wait times for schedules based on user-estimated run times are consistently lower than those for schedules based on actual run times. (The average ratio of actual run time to user-estimated run time on the T3E for this period is 29 percent and on the O3K, 25 percent.) Even for the ω = ∞ case (the rightmost points on the graphs) where one would expect the wait time values to converge, there is a consistent gap between the wait times. The key difference here is that even though the backfill windows are both unbounded in size, the run times still affect
the priority calculation and so change the order in which jobs are considered for initiation. As part of these simulation runs, average job wait time data broken down by job type were also gathered; these results are shown in Figs. 5 and 6 for schedules based on user-estimated run times only. As before, the ω = ∞ case is shown in the rightmost points on the graphs. On the T3E, the primary beneficiaries of relaxed backfill are the lower priority background and primary queue jobs, while wait times for jobs in the higher priority challenge queue appear unaffected by increasing ω. On the O3K, as ω increases wait times for background, primary, and challenge jobs all decrease, but wait times for urgent jobs increase by about 50 percent, from 0.30 to 0.44 hours. (In practice, for urgent jobs the scheduler is often overridden and they are manually scheduled to give them best service.) The simulation results were further analyzed by breaking down results by job size (number of CPUs); these results are shown in Figs. 7 and 8 for schedules based on user-estimated run times only. Again, the rightmost points on the graphs illustrate the ω = ∞ case; also note the use of a logarithmic scale on the y-axis. These results provide insight into which types of jobs are being delayed by use of relaxed backfill. Not surprisingly, as ω increases wait times for the smallest categories consistently improve, "large" jobs requiring 65-256 CPUs (denoted by "L" on the graphs) appear unaffected, and "extra large" jobs requiring > 256 CPUs (denoted by "X" on the graphs) are significantly delayed. On the T3E the delay for this latter category increases from 10.42 hours at ω = 1 to 24.13 hours for ω = 10 and on the O3K from 46.04 hours at ω = 1 to 114.33 hours for ω = 10. A final aspect of this simulation study involved computation of system utilizations and average job wait times by month for various values of ω. For the purposes of this computation, when a job began in one month and completed in the next, its contribution to both months (with respect to counting the number of jobs in a month) was based on the fraction of its total wait time accrued in each month. Results for ω = 1, 2, and 4, again for schedules based on user-estimated run times only, are shown in Figs. 9 and 10. Note that different schedules alter system utilization, so that data for a particular month are not vertically aligned. Although there is wide variation in these data, it is clear that schedules based on higher ω values outperform ω = 1 in terms of overall average job wait time as system utilization increases.
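As a small illustration of the per-month attribution rule used for Figs. 9 and 10, a helper along the following lines splits one job's wait time across month boundaries; the function and its arguments are hypothetical and not part of the authors' simulator.

    # Hypothetical helper: credit a job to each month in proportion to the
    # fraction of its total wait time accrued in that month.
    def monthly_wait_shares(wait_start, wait_end, month_starts):
        """month_starts: sorted timestamps of month boundaries covering the wait interval."""
        total = wait_end - wait_start
        shares = {}
        for begin, end in zip(month_starts, month_starts[1:]):
            overlap = max(0.0, min(wait_end, end) - max(wait_start, begin))
            if total > 0 and overlap > 0:
                shares[begin] = overlap / total   # fraction of this job credited to the month
        return shares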
5 Conclusions
This study indicates that accurate user run time estimates are not necessary for a backfill scheme to be effective and that in some cases use of overestimates gives the scheduler more latitude in scheduling jobs and thereby reduces overall average wait time; however, the average wait time for some classes of jobs may increase. Although the formulation of this approach is somewhat different from the ZK methodology, the results here support similar findings in [14] and [15] in
Fig. 3. Average job wait times on the T3E versus ω. "E" denotes times for jobs scheduled based on user estimated run times and "A" denotes times for jobs scheduled based on actual run times. The rightmost points correspond to ω = ∞
Fig. 4. Average job wait times on the O3K versus ω. "E" denotes times for jobs scheduled based on user estimated run times and "A" denotes times for jobs scheduled based on actual run times. The rightmost points correspond to ω = ∞
Fig. 5. Average job wait times on the T3E versus ω for various job types. Jobs were scheduled based on user estimated run times. "E" denotes the overall average illustrated previously. "B", "P", and "C" denote background (lowest priority), primary, and challenge (highest priority) jobs, respectively. There was only one urgent job during this observation period. The rightmost points correspond to ω = ∞
Fig. 6. Average job wait times on the O3K versus ω for various job types. Jobs were scheduled based on user estimated run times. "E" denotes the overall average and "B", "P", "C", and "U" denote background (lowest priority), primary, challenge, and urgent (highest priority) jobs, respectively. The rightmost points correspond to ω = ∞
Fig. 7. Average job wait times on the T3E versus ω for jobs requiring various numbers of processors. Jobs were scheduled based on user estimated run times. "E" denotes the overall average and "1", "4", "S", "M", "L", "X" denote jobs requiring 1, 2-4, 5-16, 17-64, 65-256, and >256 CPUs, respectively. The rightmost points correspond to ω = ∞
Fig. 8. Average job wait times on the O3K versus ω for jobs requiring various numbers of processors. Jobs were scheduled based on user estimated run times. "E" denotes the overall average illustrated previously. "1", "4", "S", "M", "L", "X" denote jobs requiring 1, 2-4, 5-16, 17-64, 65-256, and >256 CPUs, respectively. The rightmost points correspond to ω = ∞
Fig. 9. Monthly average job wait times on the T3E versus system utilization for three selected values of ω. Jobs were scheduled based on user estimated run times. "1", "2", and "4" denote ω = 1, 2, and 4, respectively
Fig. 10. Monthly average job wait times on the O3K versus system utilization for three selected values of ω. Jobs were scheduled based on user estimated run times. "1", "2", and "4" denote ω = 1, 2, and 4, respectively
the sense that as the size of the backfill time window is increased, more jobs are backfilled and overall wait time decreases. More specifically, the ZK parameter R is analogous in function to the ω used here. Relaxed backfill with ω = ∞ is equivalent to a variant of some first-, best-, or worst-fit policy. Thus, all of these approaches are part of a larger family of methods and are obtainable by selecting appropriate values for ω and for the parameters in the priority scheme. It is possible that this parameter space could form the basis for a taxonomy of scheduling policies. The results gathered here indicate that if relaxed backfill is allowable from a policy perspective, then it may be quite effective in reducing average job wait time versus aggressive backfill. Although the time span and size of the input data, as well as the use of data from two different systems, give credence to this conclusion, further study is necessary to provide additional confirmation of the method's applicability. This would include use of data from other systems or other user populations, a different priority assignment scheme, a different backfill scheme, or use of ω values relative to processor hours or priority instead of hours. Obviously, some modification to this approach will be necessary to mitigate the effect on jobs requiring large numbers of CPUs. More specifically, assuming the technique is widely applicable, sensitivity analyses should be conducted to determine how the method behaves for various priority parameters and to determine the best settings for a given system and workload. An important aspect of this would also be to determine the best ω value for a given priority scheme. Finally, an appropriate modification to this approach should be developed to guarantee that jobs do not wait indefinitely.
6 Acknowledgments and Disclaimer
The authors gratefully acknowledge the inspiration for this approach provided by Dr. Daniel Duffy, Computer Sciences Corporation (CSC), ERDC MSRC, and also thank Dr. P. Sadayappan, Department of Computer and Information Science, The Ohio State University, for kindly correcting an algorithmic misconception regarding aggressive backfill. This work was supported by the U.S. Department of Defense (DoD) High Performance Computing Modernization Program through the ERDC MSRC under contract number DAHC94-96-C-0002 with CSC. The findings of this article are not to be construed as an official DoD position unless so designated by other authorized documents. Citation of trademarks herein does not constitute an official endorsement or approval of the use of such commercial products, nor is it intended to infringe in any way on the rights of the trademark holder.
References

[1] Aida, K., Kasahara, H., Narita, S.: Job scheduling scheme for pure space sharing among rigid jobs. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 1459 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (1998) 98–121
[2] Gibbons, R.: A historical application profiler for use by parallel schedulers. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 1291 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (1997) 58–77
[3] Parsons, E. W., Sevcik, K. C.: Implementing multiprocessor scheduling disciplines. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 1291 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (1997) 166–192
[4] Finkel, R.: An Operating System Vade Mecum. Prentice-Hall, Englewood Cliffs, New Jersey (1988)
[5] Talby, D., Feitelson, D. G.: Supporting priorities and improving utilization of the IBM SP2 scheduler using slack-based backfilling. In: 13th Intl. Parallel Processing Symp. (1999) 513–517
[6] Feitelson, D. G.: A survey of scheduling in multiprogrammed parallel systems. Research Report RC 19790 (87657), IBM T. J. Watson Research Center (1994)
[7] Intel Corp.: iPSC/860 Multi-User Accounting, Control, and Scheduling Utilities Manual. (1992) Order Number 312261-002
[8] Das Sharma, D., Pradhan, D. K.: Job scheduling in mesh multicomputers. In: Intl. Conf. Parallel Processing. Volume II. (1994) 1–18
[9] Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 2221 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (2001) 87–102
[10] Zhang, Y., Franke, H., Moreira, J. E., Sivasubramanian, A.: An integrated approach to parallel scheduling using gang-scheduling, backfill, and migration. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 2221 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (2001) 133–158
[11] Lifka, D. A.: The ANL/IBM SP scheduling system. In Feitelson, D. G., Rudolph, L., eds.: Job Scheduling Strategies for Parallel Processing. Volume 949 of Lecture Notes in Computer Science, Berlin Heidelberg New York, Springer-Verlag (1995) 295–303
[12] Cormen, T. H., Leiserson, C. E., Rivest, R. L.: Introduction to Algorithms. MIT Press, Cambridge, Massachusetts (1990)
[13] Moret, B. M. E., Shapiro, H. D.: Algorithms from P to NP. Benjamin/Cummings, Redwood City, California (1991)
[14] Zotkin, D., Keleher, P. J.: Job-length estimation and performance in backfilling schedulers. In: 8th High Performance Distributed Computing Conf., IEEE (1999)
[15] Zotkin, D., Keleher, P. J., Perkovic, D.: Attacking the bottlenecks of backfilling schedulers. Cluster Computing 3 (2000) 245–254
The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance

Su-Hui Chiang, Andrea Arpaci-Dusseau, and Mary K. Vernon

Computer Sciences Department, University of Wisconsin
1210 W. Dayton Street, Madison, Wisconsin
{suhui, dusseau, vernon}@cs.wisc.edu
Abstract. The question of whether more accurate requested runtimes can significantly improve production parallel system performance has previously been studied for the FCFS-backfill scheduler, using a limited set of system performance measures. This paper examines the question for higher performance backfill policies, heavier system loads as are observed in current leading edge production systems such as the large Origin 2000 system at NCSA, and a broader range of system performance measures. The new results show that more accurate requested runtimes can improve system performance much more significantly than suggested in previous results. For example, average slowdown decreases by a factor of two to six, depending on system load and the fraction of jobs that have the more accurate requests. The new results also show that (a) nearly all of the performance improvement is realized even if the more accurate runtime requests are a factor of two higher than the actual runtimes, (b) most of the performance improvement is achieved when test runs are used to obtain more accurate runtime requests, and (c) in systems where only a fraction (e.g., 60%) of the jobs provide approximately accurate runtime requests, the users that provide the approximately accurate requests achieve even greater improvements in performance, such as an order of magnitude improvement in average slowdown for jobs that have runtime up to fifty hours.
1 Introduction
Many state-of-the-art production parallel job schedulers are non-preemptive and use a requested runtime for each job to make scheduling decisions. For example, the EASY Scheduler for the SP2 [3, 4] implements the First-Come First-Served (FCFS)-backfill policy, in which the requested runtime is used to determine whether a job is short enough to be backfilled on a subset of the nodes during a period when those nodes would otherwise be idle. The more recent Maui Scheduler ported to the NCSA Origin 2000 (O2K) [1] and the large NCSA Linux Cluster [2] implements a parameterized priority-backfill policy that uses the requested runtime to determine job priority as well as whether it can be backfilled.
Recent work [5] has shown that the priority-backfill policy on the O2K has similar performance to FCFS-backfill, but that modifying the priority function to favor jobs with short requested runtimes provides superior average wait, 95th-percentile wait, and average slowdown, as well as similar maximum wait time as for FCFS-backfill, for the large production workloads that run on the NCSA O2K. Thus, requested runtimes are needed not only for backfill decisions but also to enable favoring short jobs in a way that improves service for nearly all jobs. For example, in the LXF&W-backfill priority function derived from high-performance uniprocessor scheduling policies, a job's priority increases linearly with its expansion factor, where the expansion factor is the ratio of the job's wait time plus requested runtime to requested runtime.

The key advantage of non-preemptive scheduling policies is that they are significantly easier to implement than preemptive policies, particularly for systems with many processors. Furthermore, simulation results for the O2K job traces and system show that the non-preemptive LXF&W-backfill policy has performance that is reasonably competitive with high performance (but more difficult to implement) preemptive policies such as gang scheduling or spatial equi-partitioning [5]. This relatively high performance is achieved in spite of the fact that user requested runtimes are often highly inaccurate [6, 7, 8, 9]. For example, analysis of the NCSA O2K logs shows that 30% of the jobs that request 200 or more hours of runtime terminate in under ten hours [9].

The key open question addressed in this paper is whether the high performance backfill policies can be further improved with more accurate requested runtimes. Several previous simulation studies of FCFS-backfill show that more accurate requested runtimes have only minimal impact on the average wait time or average slowdown [6, 10, 11, 12, 7]. We briefly revisit the question for FCFS-backfill, using workloads from recent months on the O2K that have significantly heavier system load (e.g., up to 100% cpu demand), and using a more complete set of performance measures. More importantly, we investigate the question of whether more accurate requested runtimes can significantly improve the high performance backfill policies such as LXF&W-backfill that use requested runtimes to favor short jobs. We evaluate this question using complete workload traces from the NCSA O2K and consider not only average wait time and average slowdown as in previous studies, but also the maximum and 95th-percentile wait time. Each of these measures is obtained as a function of actual job runtime and as a function of the number of requested processors, to determine how performance varies with these job parameters.

To study the above key question, two design issues that relate to preventing starvation in backfill policies that favor short jobs require further investigation. As discussed further in Sections 2.3 and 3.1, preventing starvation was not fully addressed in previous policy evaluations. In particular, the problem is more significant for the heavier system load in recent months on the O2K. The first policy design issue relates to reservations; that is, how many jobs are given reservations and, in the case of dynamic priority functions, whether the reservations are fixed or dynamic. The second design issue is the relative weight in the priority function
for requested job runtime and current job wait time. A more complete analysis of these issues is needed in order to set these policy parameters properly for studying the potential improvement of more accurate requested runtimes.

The key results in the paper are as follows:
– For a set of high performance backfill policies that favor short jobs (i.e., LXF&W-, SJF&W-, L√XF&W-, and S√TF&W-backfill), accurate requested runtimes dramatically improve the average slowdown, greatly improve the average and maximum wait for short jobs without increasing the average wait for long jobs, and greatly improve the 95th-percentile wait for all jobs. More precisely, the 95th-percentile wait for all jobs is reduced by up to a factor of two, while the average slowdown is reduced by up to a factor of six, depending on the system load. Policy performance with more accurate requested runtimes is thus even more similar to the high performance preemptive policies such as gang scheduling or spatial equipartitioning.
– Nearly all of the improvement is realized even if requested runtimes are only approximately accurate; that is, if all requested runtimes are up to a factor of two times the actual runtime. Furthermore, most of the improvement can be achieved (a) even if only 60% - 80% of the jobs provide the approximately accurate runtime requests, and (b) when test runs are used to more accurately estimate requested runtime.
– In systems where only a fraction (e.g., 60%) of the jobs provide approximately accurate requested runtimes, the jobs with improved runtime requests have even greater improvements in performance, such as more than an order of magnitude improvement in average slowdown of the jobs that have actual runtime up to fifty hours. Thus, there is a significant incentive for individual users to improve the accuracy of their runtime requests.

Additional contributions of the paper include:
– A summary of the very recent workloads (October 2000 - July 2001) on the O2K, including several months with heavier processor and memory demand than workloads used previously to design scheduling policies. Note that heavier system load can have a significant impact on the magnitude of the performance differences among alternative scheduling policies. For example, more accurate requested runtimes improve the average slowdown of FCFS-backfill more significantly for the recent heavy loads on the O2K.
– For the NCSA O2K architecture and workload, using a small number of reservations (2 to 4) outperforms a single reservation, but a larger number of reservations results in poor performance during months with exceptionally heavy load.
– Compared to the highest performance previous backfill policy, namely LXF&W-backfill with a single reservation, LXF&W-backfill with two to four reservations, as well as two proposed new priority backfill policies (L√XF&W- and S√TF&W-backfill) with two reservations, significantly improve the maximum wait time.
The remainder of this paper is organized as follows. Section 2 provides background on the system and workloads used in this study, and on related previous work. Section 3 evaluates the impact of reservation policies and the relative priority weight between job requested runtime and current job wait time for backfill policies. Section 4 evaluates the potential benefit of using more accurate requested runtimes in priority backfill policies. Section 5 shows whether the performance benefit of more accurate requested runtimes can be achieved if test runs are used to estimate the more accurate requested runtimes. Section 6 provides the conclusions of this work.
2 Background

2.1 The NCSA Origin 2000 System
The NCSA O2K is a large production parallel system that provides 960 processors and 336 GB of memory for processing batch jobs that do not request a dedicated host. The processors are partitioned into eight hosts, each of which has 64 or 128 processors and 32 or 64 GB of memory. The jobs are scheduled using a ported version of the Maui Scheduler that implements a backfill policy with a parameterized priority function, and evicts a job if it has run one hour longer than its requested runtime. More detail about the system and scheduler configuration can be found in [1, 5].

2.2 Workloads
In this study, we have evaluated scheduling policy performance using simulations with ten different one-month job traces obtained during October 2000 - July 2001 from the O2K. Three of these months (October - December 2000) were fully characterized in [9]. The load during each month is summarized in Table 1. The overall processing demand (”proc demand”) per month is the actual runtime of a job times the requested processors for the job, summed over all jobs submitted that month, expressed as a percentage of the total available processor-minutes for the month. The memory demand (”mem demand”) is the equivalent measure for the job memory requests. Processor and memory demand are also given for each job class, where job class is defined by the requested runtime and requested processor and memory resources, as defined below the table. There are two key differences in the traces summarized in the table compared to those considered previously [5, 9]. First, the actual job runtime in these traces includes the initial data setup time, during which the job occupies its requested resources (i.e., processors and memory) but it has not yet started its computation. The data setup time adds negligible (≤ 1%) total cpu and memory load each month, but it is significant (e.g., 10 hours) for some jobs. Second, the traces include four months (January - March and May 2001) that have exceptionally high demand for processor resources (i.e., very close to 100%), and three of those months (February, March, and May 2001) also have exceptionally high memory
Table 1. Summary of Monthly NCSA O2K Workloads

  Month   #jobs   proc demand   mem demand
  Oct00    6552       82%           81%
  Nov00    6257       85%           61%
  Dec00    4782       89%           63%
  Jan01    4837     *102%           76%
  Feb01    6784      *97%          *87%
  Mar01    5929     *100%          *92%
  Apr01    6206       78%           77%
  May01    6573      *99%          *92%
  Jun01    6364       86%           75%
  Jul01    5705       89%           81%

  '*' indicates exceptionally high load.

  Job Class Definition
  Requested Run Time Class: vst ≤ 5 hrs; st [5, 50) hrs; mt [50, 200) hrs; lt [200, 400) hrs
  Space Class: sj (P ≤ 8, M ≤ 2 GB); mj (P ≤ 16, M ≤ 4 GB); lj (P ≤ 64, M ≤ 25 GB)
  P = requested processors; M = requested memory
demand (> 90%). The other three months in 2001 (April, June, and July) have cpu demand (80 - 90%) and memory demand (70 - 80%) that is typical in earlier O2K workloads [5]. Results will be shown in the paper for three of the heavy load months (January - March 2001), one of the months that follows a heavy load month (June 2001) and one typical month (July 2001). Other characteristics of the workloads during 2001 are similar to the previous months in 2000. In particular, there is an approximately similar mix of job classes (i.e., sizes) from month to month (as shown in Table 1), and there is
Fig. 1. Requested Runtime (R) Compared with Actual Runtime (T) (O2K Workloads, January 2001 - July 2001): (a) Distribution of Actual Runtime (Jobs with R = 50 or 400 hours); (b) Distributions of T/R vs. R (All Jobs)
a large discrepancy between requested runtime (R) and actual runtime (T), as illustrated in Figure 1. Figure 1(a) plots the distribution of actual runtime for the jobs submitted during January - July 2001 that requested 50 hours (R = 50h) or 400 hours (R = 400h). These results show that almost 30% of the jobs with R = 400 hours terminate in under 10 hours, another 10% have actual runtime between 10 and 50 hours, and approximately 10% of the jobs with R = 50 hours or R = 400 hours terminate in under one minute. Figure 1(b) plots points in the distribution (i.e., the 20th-, 35th-, and 50th-percentile) of T/R as a function of the requested runtime of the job. This figure shows that for any requested runtime greater than one minute, 35% of the jobs use less than 10% of their requested runtime (i.e., R ≥ 10T), and another 15% of the jobs have actual runtime between 10% and 30% of the requested runtime. Similarly large discrepancies between requested and actual runtimes have also recently been reported for many SP2 traces [7, 8]. In particular, the results by Cirne and Berman [8] show that for four SP2 traces, 50-60% of the jobs use under 20% of the requested runtime, which is very similar to the results for the NCSA O2K workloads examined here.

2.3 Previous Work
In this section, we review previous work on three topics: alternative priority functions for backfill policies, the impact of reservation policies, and the impact of using more accurate requested runtimes on backfill policies. The most comprehensive previous comparison of alternative priority backfill policies [5] shows that, among the priority functions defined in Table 2, the LXF&W-backfill policy that gives priority to short jobs while taking current job waiting time into account outperforms FCFS-backfill, whereas SJF-backfill has the problem of starvation (i.e., large maximum wait) under high load. This previous paper also provides a review of earlier papers [13, 11, 14] that compare the SJF-backfill and FCFS-backfill policies.
Table 2. Priority Functions of Previous Backfill Policies

            Priority Weight
  FCFS   SJF   LXF   LXF&W(w)    Job Measure
  1      0     0     w = 0.02    current wait time, Jw, in hours
  0      1     0     0           inverse of requested runtime (1/R), R in hours
  0      0     1     1           current job expansion factor Jx = (Jw + R)/R, R in hours
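Read as code, the Table 2 priority functions amount to the following minimal sketch; the function and variable names are illustrative assumptions and do not come from the Maui scheduler or the authors' simulator.

    # Illustrative computation of the Table 2 priorities for one waiting job.
    def previous_priorities(Jw, R, w=0.02):
        """Jw = current wait time (hours), R = requested runtime (hours)."""
        Jx = (Jw + R) / R                 # current job expansion factor
        return {
            "FCFS":  Jw,                  # weight 1 on current wait time only
            "SJF":   1.0 / R,             # weight 1 on inverse requested runtime
            "LXF":   Jx,                  # weight 1 on expansion factor
            "LXF&W": w * Jw + Jx,         # expansion factor plus small wait-time weight w
        }

    # Example: a job that has waited 10 hours and requests 5 hours of runtime
    print(previous_priorities(Jw=10.0, R=5.0))   # LXF&W = 0.02*10 + (10+5)/5 = 3.2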
Reservation policies concern (a) the number of jobs waiting in the queue that are given (earliest possible) reservations for processor and memory resources, and (b) whether the reservations are dynamic or fixed. Previous results by Feitelson and Weil [6] show that, for FCFS-backfill and a set of SP workloads, average slowdown is similar when only one (i.e., the oldest) waiting job has a reservation or when all jobs have a reservation. In more recent work [7] they find similar results for further SP2 workloads, for workloads from other systems, and for many synthetic workloads, but they find that for many other SP2 monthly workloads, a single reservation significantly improves the average slowdown (by > 40%) and average response time (by > 30%). Several papers evaluate backfill policies that have reservations for all waiting jobs [10, 14, 11], while still other papers evaluate backfill policies that give reservations to only one waiting job [3, 4, 15, 13, 5]. With dynamic reservations, job reservations and the ordering of job reservations can change when a new job arrives, or if the relative priorities of the waiting jobs change with time. For example, in SJF-backfill with a single dynamic reservation, an arriving job will preempt the reservation held by a longer job. With fixed reservations, in contrast, once a job is given a reservation, it may be given an earlier reservation when another job terminates earlier than its requested runtime, but recomputed job reservations will have the same order as the existing reservations, even if a job that has no reservation or a later reservation attains a higher priority. A single fixed reservation is used to reduce starvation in SJF-backfill in [5]. In [14], each job is given a reservation when it arrives. They compare a form of dynamic (”no guarantee”) reservations, in which reservations are only recomputed if and when a job finishes early but the recomputed reservations are done in priority (i.e., FCFS or SJF) order, against ”guaranteed reservations”, in which job reservations are recomputed only in the same order as the existing reservations. They find that the dynamic reservations have lower average slowdown and average wait than guaranteed reservations for the priority backfill policies studied, including SJF-backfill. This paper includes the maximum wait measure and concludes that fixed reservations significantly improve the performance of SJF-backfill; otherwise the results in this paper are consistent with their results. Two previous papers show that perfectly accurate requested runtimes for FCFS-backfill improve the average slowdown by no more than 30% [7] and the average wait time by only 10 - 20% [10], compared to using the highly inaccurate requested runtimes given in SP traces. Several papers [6, 13, 11, 12, 7] compare the performance of various models of requested runtimes against per-
fectly accurate runtime requests. For a given actual runtime, they model the requested runtime overestimation (i.e., requested runtime - actual runtime) as a factor times the actual runtime, where the factor is drawn from a uniform distribution between 0 and a fixed parameter C. The paper [13] also includes a model where the factor is deterministic. The results in those papers show that even for C as large as 300 [6, 7] (or 50 [13] or 10 [11, 12]), the average slowdown or average wait is similar to, or even slightly better than that of C = 0. Additional results in [7] show that multiplying the user requested runtimes by two slightly improves on average slowdown and response time for SP workloads and FCFS-backfill. These papers conclude that there is no benefit of using accurate requested runtimes for FCFS-backfill and SJF-backfill. We note that for large C (or when multiplying requested runtime by two), jobs with long runtimes can have very large runtime overestimation, which leaves larger holes for backfilling shorter jobs. As a result, average slowdown and average wait may be lower, as reported in these previous papers. On the other hand, these systems may have poorer maximum wait, which was not studied in any of these previous papers.
3 Reducing Starvation in Systems that Favor Short Jobs
Backfill policies that favor short jobs have the potential problem of poor maximum wait for long jobs. Mechanisms for reducing the maximum wait include using a larger number of reservations, and increasing the priority weight on the current job wait time. On the other hand, either of these mechanisms may increase the average and 95th-percentile wait for all jobs. The goal of this section is to provide a more comprehensive evaluation of the trade-offs in the wait time measures for different reservation policies and for alternative priority functions that give different relative weight to the current job waiting time. In evaluating the tradeoffs for each policy, we seek to achieve a maximum wait that is no greater than the maximum wait in FCFS-backfill, while reducing the average and 95th-percentile wait time as much as possible. In this section, and in the remainder of the paper, policy comparisons will be shown for five representative workloads. These workloads include (a) three of the four new exceptionally heavy load months (i.e., January - March 2001), which are the most important months for policy optimization, (b) June 2001, which has similar policy performance as in April 2001 since both of these months follow an exceptionally heavy load month, and (c) July 2001 which has a typical load and policy performance similar to October - December 2000 and other previously studied workloads. The other new exceptionally heavy load month (May 2001) has somewhat lower wait time statistics for each policy than the other three exceptionally heavy months, due to a larger number of short jobs submitted that month. Section 3.1 re-evaluates previous backfill policies, showing that starvation is a more significant issue for the new exceptionally heavy load months on the NCSA O2K. Section 3.2 evaluates several alternative reservation policies. Section 3.3 evaluates several new priority functions with different relative weights on
Fig. 2. Performance of Previous Backfill Policies: (a) Average Wait; (b) 95th-percentile Wait; (c) Maximum Wait; (d) Average Wait vs. Actual Runtime (January 2001); (e) 95th-percentile Wait vs. Actual Runtime (January 2001)
the current job waiting time and compares the best new priority backfill policies against FCFS-backfill.

3.1 Re-evaluation of Previous Policies
In this section, we use the recent O2K workloads to re-evaluate the FCFS-backfill, SJF-backfill, and LXF&W-backfill policies that are defined in Table 2. Note that both SJF-backfill and LXF&W-backfill favor short jobs, but LXF&W-backfill also has a priority weight for current job wait time. The reservation policies in these previously defined schedulers are: FCFS-backfill uses one reservation, LXF&W-backfill uses one dynamic reservation, and SJF-backfill uses one fixed reservation (to reduce the maximum wait). Figure 2 compares the three policies, showing (a) overall average wait, (b) 95th-percentile wait, (c) maximum wait, and (d)-(e) average and 95th-percentile wait, respectively, as a function of actual runtime, during a representative heavy load month. Results in previous work [5] are similar to the results for the July 2001 workload in figures (a) - (c). Conclusions for the new heavy load months that are similar to previous work are that (1) SJF-backfill and LXF&W-backfill have significantly lower 95th-percentile wait (for all ranges of actual runtime)
Fig. 3. Impact of Number of Reservations on LXF-backfill (Dynamic Reservations): (a) Average Wait; (b) 95th-percentile Wait; (c) Maximum Wait; (d) Average Slowdown
than that of FCFS-backfill, and (2) SJF-backfill has the problem of poor maximum wait for many of the workloads, as shown in figure (c). Conclusions for the new heavy load months that differ from the results in previous work (and also differ from the results for July 2001) are that (1) LXF&W-backfill and SJF-backfill have even greater improvement in average wait compared with FCFS-backfill (for most ranges of actual runtimes), and (2) LXF&W-backfill has higher maximum wait than FCFS-backfill. The starvation problems that lead to high maximum wait in LXF&W-backfill and SJF-backfill systems are addressed in the next two sections. The questions are (1) whether multiple reservations can improve the performance, particularly the maximum wait, of SJF-backfill and LXF&W-backfill, (2) whether fixed reservations can improve the maximum wait for LXF&W-backfill, and (3) whether new priority functions, such as adding a priority weight for current waiting time to the SJF-backfill priority function, or more generally whether new relative priority weights between requested job runtime and current job wait time, can improve on the previous policy priority functions. Section 3.2 addresses the first two questions. Section 3.3 studies the third question.

3.2 New Reservation Policy Comparisons
This section studies the impact of reservation policies, i.e., the number of reservations and dynamic versus fixed reservations, on backfill policies. We use three simple priority backfill policies to evaluate the reservation policies, namely: FCFS-backfill, SJF-backfill, and LXF-backfill (all with weight on current waiting
time equal to zero). Adding weights for current waiting time will be studied in the next section. For each of the three policies, we evaluated the performance for the following numbers of reservations: 1, 2, 4, 6, 8, 12, and 16. For the LXF-backfill and SJF-backfill policies that have dynamic priority functions, we evaluate the performance of both dynamic and fixed reservations, each over the entire range of number of reservations. Figure 3 shows the performance of LXF-backfill with up to eight dynamic reservations. Twelve and sixteen reservations have performance similar to or worse than that of eight reservations. The impact of the number of reservations is similar for FCFS-backfill and SJF-backfill (not shown to conserve space), except that four reservations performs slightly better for SJF-backfill. For months with a typical O2K load (e.g., July 2001), the impact of reservation policies on backfill policies is minimal, which agrees with previous results for the average slowdown of FCFS-backfill in [6]. However, for most of the new heavy load months, as shown in Figure 3(c), the key new result is that using a small number of reservations (i.e., 2-4) reduces the maximum wait time (by about 30%) compared to using a single reservation. Furthermore, as shown in Figure 3(a) - (c), using more than four reservations usually makes minimal further improvement in the maximum wait, yet significantly increases the average and 95th-percentile wait, for the new heavy load workloads or immediately following the heavy load months. Other results omitted to conserve space show that fixed and dynamic reservations (with 2-4 reservations) have similar performance for LXF-backfill and the policies developed in the next section. However, for SJF-backfill, dynamic reservations have higher maximum wait than fixed reservations because (particularly under heavy load) dynamic reservations for jobs with long requested runtimes are often usurped by newly arriving jobs that have short requested runtimes.
Table 3. Weights for New Priority Backfill Policies

            Priority Weight
  SJF&W(w)        S√TF&W(w)       L√XF&W(w)       Job Measure
  w = 0.05-0.2    w = 0.01-0.05   w = 0.01-0.02   current wait time, Jw, in hours
  1               0               0               Jr = 400 / (R in hours) *
  0               1               0               √Jr, R in hours
  0               0               1               √Jx, where Jx = (Jw + R)/R, in hours

(* R = requested runtime. The maximum value of R is 400 hours.)
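A corresponding sketch for the new priority functions of Table 3 (discussed in Section 3.3 below) is given here; the particular w defaults are simply picked from the ranges in the table, and the names are illustrative assumptions rather than the authors' implementation.

    import math

    # Illustrative reading of Table 3: Jr normalizes 1/R to the 400-hour maximum request,
    # and the square-root variants damp the runtime/expansion term relative to the wait term.
    def new_priorities(Jw, R, w_sjf=0.1, w_stf=0.02, w_lxf=0.01):
        """Jw = current wait time (hours), R = requested runtime (hours, at most 400)."""
        Jr = 400.0 / R                    # inverse requested runtime, normalized
        Jx = (Jw + R) / R                 # expansion factor, as in Table 2
        return {
            "SJF&W":  w_sjf * Jw + Jr,
            "S√TF&W": w_stf * Jw + math.sqrt(Jr),
            "L√XF&W": w_lxf * Jw + math.sqrt(Jx),
        }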
3.3 New Priority Functions
In this section, we propose three alternative new priority functions, evaluate the impact of the alternative priority weights for current job wait time together with
Fig. 4. Alternative Job Wait Priority Values (w) for L√XF&W-backfill (One Dynamic Reservation): (a) Average Wait; (b) 95th-percentile Wait; (c) Maximum Wait
the impact of reservation policies, and compare the best new priority functions against the previous backfill policies. The new (dynamic) priority functions are defined in Table 3. Note that in the Jr metric for the SJF&W and S√TF&W priority functions, the inverse of requested runtime (1/R) is normalized to the maximum allowed requested runtime (i.e., 400 hours). The SJF&W priority function extends the previous SJF function with a weight for the current job wait time. The S√TF&W and the L√XF&W priority functions are designed to increase the relative weight on the current job waiting time (Jw), by applying a square root to the job metric that includes requested runtime. Results below show that S√TF&W-backfill and L√XF&W-backfill only very slightly outperform SJF&W-backfill and LXF&W-backfill. Thus, further alternative relative weights for these job measures are not likely to lead to any significant improvement.

Let w denote the priority weight on current job wait time in the new priority functions. Figure 4 shows the performance of alternative values of w in the L√XF&W-backfill scheduler with one dynamic reservation. Results are similar for 2-4 reservations, and for each of the other two new priority functions, as well as for LXF&W (not shown). The Figure shows that average and 95th-percentile wait are not highly sensitive to w in the range of 0.005 - 0.05, and that during heavy load months, this range of w values significantly reduces the maximum wait (by 30-50%) compared to w = 0. Larger values of w, such as w = 1, significantly increase the average and 95th-percentile wait time, with only small improvements in the maximum wait (not shown). Similar to the previous section, we find that using a small number of reservations (i.e., two or three) outperforms a single reservation for each of the alternative new priority functions.

Figure 5 compares the performance of FCFS-backfill, LXF&W(0.02)-backfill, and the two best alternative new priority backfill policies (i.e., L√XF&W(0.01)- and S√TF&W(0.05)-backfill, which slightly outperform SJF&W-backfill), each with 2 - 3 reservations. One key result is that using 2-4 reservations instead of one reservation has improved the overall performance of all four policies.
Fig. 5. Performance of New Priority Backfill Policies that Favor Short Jobs: (a) Average Wait; (b) 95th-percentile Wait; (c) Maximum Wait; (d) Average Slowdown
For example, compared to Figure 2, the maximum wait for FCFS-backfill and LXF&W(0.02)-backfill is reduced by up to 30% while the average or 95th-percentile wait is increased by on the order of 10% or less. Another key result is that L√XF&W(0.01)-backfill with 2-4 reservations has maximum wait that is reasonably competitive with FCFS-backfill, yet significantly outperforms FCFS-backfill with respect to the other wait time statistics, particularly average slowdown. L√XF&W-backfill has slightly better overall performance than LXF&W-backfill. Finally, S√TF&W-backfill has better average and 95th-percentile wait than L√XF&W-backfill, but more often has significantly poorer maximum wait than FCFS-backfill (e.g., in February and June 2001). The overall conclusion is that, similar to results in [5], giving priority to short jobs but also using an appropriate weight for current job wait can significantly outperform FCFS-backfill, particularly with respect to the 95th-percentile wait time and the average slowdown measures. In the remainder of this paper, we study the impact of more accurate requested runtimes on such high performance backfill policies.
4 More Accurate Requested Runtimes
There is reason to believe that runtimes can be more accurately estimated for the jobs that run on the O2K. In particular, a majority of the jobs use one of the default requested runtimes, which are 5, 50, 200, or 400 hours. This indicates that users specify highly approximate requested runtimes due to the coarse-grain defaults that are available. Furthermore, since the current priority-backfill policy
Table 4. Notation

  Symbol   Definition
  T        Actual job runtime
  R        User requested runtime from the O2K logs
  R*       Simulated requested runtime
  P        Number of requested processors
provides similar 95th-percentile waiting time for the entire range of job runtimes (see Figure 2(e) and results in [5]), there isn't currently any incentive for an individual user to provide a more accurate requested runtime. These factors explain why, for example, many of the jobs that have actual runtime of 10 hours have requested runtime of 50, 200, or 400 hours. Previous results suggest that using more accurate requested runtimes has only minimal impact on the average slowdown and average wait time for FCFS-backfill [6, 10, 13, 11, 12]. This section investigates whether the benefit of more accurate requested runtimes is more significant for the high performance priority backfill policies that use requested runtimes to favor short jobs. We consider various scenarios of more accurate runtime requests, the O2K workloads that include exceptionally heavy load in recent months, and more extensive performance measures than in the previous evaluations of FCFS-backfill. Section 4.1 describes the scenarios of more accurate requested runtime that will be evaluated. Section 4.2 reassesses the impact of more accurate requested runtimes on FCFS-backfill, whereas Section 4.3 evaluates the impact of more accurate requested runtimes on the policies that favor short jobs.

4.1 More Accurate Requested Runtime Scenarios
Using the notation defined in Table 4, we consider three different scenarios of more accurate runtime requests for the O2K workloads. In the first (implausible) case, each requested runtime is perfect (i.e., the simulated requested runtime, R∗ = T ). In the second case, all requested runtimes are imperfect but are approximately accurate (i.e., R∗ = min{R, kT }, 1 < k ≤ 2). In the third case, only a fraction (e.g., 80% or 60%) of the jobs have the approximately accurate requested runtimes, while the rest of the jobs, selected randomly, have requested runtimes as given in the job log, which are generally highly inaccurate. In the third case, the fraction of jobs that have the inaccurate requested runtimes from the trace represent carelessly specified runtime requests or runtimes that can’t (easily) be accurately estimated. This fraction is varied in the experiments. The first case is used to provide a bound on the maximum benefit of more accurate requested runtimes, while the second and third cases are used to assess performance gains that are more likely to be achievable. Section 5 will explore the performance impact of using short test runs to achieve the more accurate runtime requests. We present results for k = 2. We also considered smaller values of k, in particular k = 1.2, which results in slightly better performance, but we omit
those results below to conserve space. As noted in Section 2.3, several previous papers [6, 13, 11, 12] have used a uniform distribution of requested runtime overestimations, with a large upper bound factor (e.g., 10, 50, or 300). In contrast, our scenarios assume that the simulated requested runtime is not larger than the user requested runtime (R) given in the workload trace.

Fig. 6. Impact of Perfect Requested Runtimes for FCFS-backfill: (a) average wait vs. T (June 2001), (b) 95th-percentile wait, (c) maximum wait vs. T (Jan. 2001), (d) average slowdown; each graph compares R* = R against R* = T.
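To make the three scenarios concrete, the following is a minimal sketch of how the simulated requested runtime R* could be drawn for each job. It assumes per-job records carrying the actual runtime T and the user requested runtime R (in hours); the function name, the random selection of "accurate" jobs, and the seed are illustrative rather than taken from the simulator used in this paper.

import random

def simulated_request(T, R, scenario, k=2.0, accurate_fraction=1.0, rng=random):
    """Return the simulated requested runtime R* for one job.

    T: actual runtime, R: user requested runtime from the trace (T <= R).
    scenario: 'perfect'     -> R* = T
              'approximate' -> R* = min(R, k*T), with 1 < k <= 2
              'hybrid'      -> a randomly selected 'accurate_fraction' of the
                               jobs get min(R, k*T); the rest keep R
    """
    if scenario == "perfect":
        return T
    if scenario == "approximate":
        return min(R, k * T)
    if scenario == "hybrid":
        return min(R, k * T) if rng.random() < accurate_fraction else R
    raise ValueError("unknown scenario: %s" % scenario)

# Example: the hybrid(4:1) setting (80% of the jobs approximately accurate, k = 2).
rng = random.Random(0)
job_T, job_R = 10.0, 200.0   # a 10-hour job submitted with a 200-hour request
print(simulated_request(job_T, job_R, "hybrid", k=2.0, accurate_fraction=0.8, rng=rng))
# prints 20.0 if this job was selected as accurate, otherwise 200.0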
4.2 FCFS-backfill Results
Figure 6 compares the performance of perfectly accurate requested runtimes (i.e., R* = T) against user requested runtimes from the trace (i.e., R* = R) for FCFS-backfill with two reservations. The results for previous typical O2K workloads (e.g., July 2001) agree with previous results in [6]; that is, using more accurate runtimes has only a very slight impact on system performance. Moreover, perfect requested runtimes have minimal impact on the overall average waiting time for each month (not shown), and on the 95th-percentile wait each month, shown in Figure 6(b). On the other hand, as shown in Figure 6(a) for June 2001, accurate runtime requests improve the average wait of very short jobs (T < 30 minutes) during and immediately following the exceptionally heavy load months. More significantly, Figure 6(c) shows that accurate requested runtimes significantly improve the maximum wait time for most actual job runtimes, for many of the exceptionally heavy load months and the months immediately following them. Figure 6(d) shows that accurate requested runtimes significantly reduce the average slowdown during and immediately following the heavy load months (by up to 60% in Feb. 2001).
We note that perfect requested runtimes generally improve the wait time for short jobs because these jobs can be backfilled more easily. Accurate requested runtimes also improve the maximum wait for long jobs due to shorter backfill windows. Using approximately accurate requested runtimes (i.e., R* = min{R, kT}) has a somewhat lower impact on system performance than using perfect runtime requests (not shown to conserve space).

4.3 Results for High Performance Backfill Policies
This section evaluates the impact of more accurate requested runtimes on the performance of high performance backfill policies that favor short jobs. We present the results for L√XF&W-backfill. Results are similar for the other high performance backfill policies such as S√TF&W-backfill and LXF&W-backfill. Initially, we consider the case where all jobs have requested runtimes within a small factor of their actual runtimes. Then, we consider the case where only 80% or 60% of the jobs have approximately accurate requested runtimes. Figure 7 compares the performance of perfectly accurate runtime requests (i.e., R* = T) and approximately accurate runtime requests (i.e., R* = min{R, 2T}) against requested runtimes from the trace (i.e., R* = R). Graphs (a)-(d) contain the overall performance measures each month, whereas graphs (e)-(h) show performance versus requested number of processors or actual runtime for a given representative monthly workload. Results for other months (not shown) are similar. In contrast to the FCFS-backfill results shown in Figure 6, there is an even larger benefit of using more accurate requested runtimes for L√XF&W-backfill, because accurate runtime requests enable the scheduler to give priority to jobs that are actually shorter. In particular, Figures 7(a)-(d) show that perfectly accurate runtime requests improve not only the maximum wait and average slowdown, but also the average and 95th-percentile wait time over all jobs. Furthermore, the average slowdown is dramatically improved in every month, including the months with typical O2K load (e.g., July 2001). These four graphs also show that even if the requested runtimes are only approximately accurate (i.e., R* = min{R, 2T}), similar improvements are achieved. Figure 7(e) shows that accurate or approximately accurate requested runtimes improve the average wait time for jobs that require a large number of processors (i.e., greater than 16 processors). Figures 7(f)-(h) show that more accurate requested runtimes improve the average wait for short jobs (up to 10 hours), the 95th-percentile wait for all jobs, and the maximum wait for all but the very largest jobs. Note that these improvements occur for typical system loads as well as for the exceptionally heavy loads and during the months following the exceptionally heavy load. Note also that the improvement in the average wait for short jobs is significantly larger than the improvement for FCFS-backfill, and the improvement is achieved without increasing the average wait time for longer jobs. Furthermore, when requested runtimes are accurate or approximately accurate, the average wait under L√XF&W-backfill decreases (monotonically) as the actual runtime decreases;
this is a desirable property that, to our knowledge, has not been demonstrated for any previous backfill scheduler.

Fig. 7. Impact of Accurate Runtime Requests for L√XF&W-backfill: (a) average wait, (b) 95th-percentile wait, (c) maximum wait, (d) average slowdown, (e) average wait vs. P (March 2001), (f) average wait vs. T (March 2001), (g) 95th-percentile wait vs. T (July 2001), (h) maximum wait vs. T (July 2001); each graph compares R* = R, R* = min{R, 2T}, and R* = T.

Next, we consider scenarios in which not all jobs have approximately accurate requested runtimes. Two systems are evaluated: hybrid(4:1) and hybrid(3:2). In the hybrid(4:1) system, 4 out of 5 jobs (i.e., 80% of the jobs), selected randomly, have approximately accurate requested runtimes (i.e., R* = min{R, 2T}). The hybrid(3:2) system is similar to the hybrid(4:1) system, except that only three out of five (or 60%) of the jobs have the approximately accurate runtime requests. Results will be shown for L√XF&W-backfill; the results are similar for the other priority backfill policies that favor short jobs. Figure 8 compares hybrid(4:1) and hybrid(3:2) against the case where all jobs have perfectly accurate runtime requests (i.e., R* = T), and the case where all jobs use requested runtimes from the trace (i.e., R* = R).
Fig. 8. Hybrid(x:y) Approximately Accurate : Inaccurate Requested Runtimes (L√XF&W-backfill; approximately accurate R* = min{R, 2T}): (a) average wait, (b) 95th-percentile wait, (c) maximum wait, (d) average slowdown; each graph compares R* = R, R* = T, hybrid(4:1), and hybrid(3:2).

The key conclusion is that much of the benefit of accurate requested runtimes can be achieved even if only 60% or 80% of the jobs have approximately accurate requested runtimes. Specifically, Figures 8(a) and (b) show that hybrid(4:1) has similar average and 95th-percentile wait time as the system where all jobs have R* = T. Figure 8(c) shows that hybrid(4:1) has a somewhat higher maximum wait than when requested runtimes are perfectly accurate, but a lower maximum wait than for the user requested runtimes in the trace. Figure 8(d) shows that hybrid(4:1) has much lower average slowdown than the system with user requested runtimes from the trace. If only 60% of the jobs have improved requested runtimes, i.e., hybrid(3:2), the performance improvement is smaller than for hybrid(4:1), but hybrid(3:2) still has lower average and 95th-percentile wait time and significantly lower average slowdown than when the very inaccurate requested runtimes from the trace are used. Further reducing the fraction of jobs that have improved requested runtimes results in a system increasingly similar to the system where all jobs have requested runtimes from the trace. The next results show that the jobs in the hybrid systems that have more accurate requested runtimes experience a substantial performance benefit. In particular, Figure 9 compares the wait time statistics for 'approx. accurate jobs' (i.e., R* ≤ 2T) in the hybrid system against the wait time statistics for 'inaccurate jobs' (i.e., R* = R > 2T) in the hybrid system. The figure also includes the performance when all jobs have requested runtimes as in the workload trace (i.e., all jobs have R* = R). The results are shown for hybrid(3:2), in which only 60% of the jobs have approximately accurate requested runtimes.
Fig. 9. Performance for Jobs with More Accurate Requested Runtimes (L√XF&W-backfill; approximately accurate R* = min{R, 2T}): (a) average wait (T ≤ 50 hours), (b) 95th-percentile wait (T ≤ 50 hours), (c) maximum wait (T ≤ 50 hours), (d) average slowdown (T ≤ 50 hours), (e) average wait vs. T (June 2001); each graph compares the case where all jobs have R* = R against the inaccurate and approximately accurate jobs of hybrid(3:2).

As noted in the figure captions, only the jobs with under 50 hours of actual runtime are
considered in the first four graphs because requested runtime accuracy has little impact on jobs with larger actual runtime (as can be seen in Figure 9(e)). The key results are:

– Figures 9(a)-(c) show that during and immediately following the extremely heavy load months, for actual runtime up to 50 hours, jobs with approximately accurate runtime requests have 20% lower average and 95th-percentile wait time and up to 50% lower maximum wait time than the jobs with inaccurate runtime requests.
– Furthermore, the jobs with approximately accurate runtime requests improve the average and 95th-percentile wait time of inaccurate jobs, compared to when all jobs have the requested runtimes from the trace.
– Figure 9(d) shows that for any month, the average slowdown of jobs with approximately accurate runtime requests is dramatically (i.e., more than an order of magnitude) lower than either the average slowdown of jobs with inaccurate requests, or the overall average slowdown when all jobs use the requested runtime from the trace (i.e., R* = R).
– Figure 9(e) further shows that for actual runtime of up to 10 hours, jobs with approximately accurate requests achieve significantly lower average wait time than that of inaccurate jobs, and average wait decreases monotonically as actual runtime decreases for the jobs with approximately accurate requests.
5 Test Runs for Improving Requested Runtimes
Results in Section 4 show that if a majority of the jobs (e.g., 60% or more) have estimated runtimes within a factor of two of their actual runtime, system performance improves greatly, particularly for the jobs that have such approximately accurate runtime requests. Thus, if users are provided with incentives and tools to provide more accurate requested runtimes, the users will reap significant performance benefit. We hypothesize that approximately accurate requested runtimes are feasible in at least three cases. First, many large scale simulations are run with similar input parameters to previous runs, or with changes in the input parameters that will affect run time in an approximately predictable way (e.g., runtime can be estimated within a factor of two). Second, for other applications, the runtime request can be more accurate if a short test run is made before the full run. Example applications that can estimate requested runtime after a test run include those that involve iterative computation in which the number of iterations and/or the time per iteration are dependent on the input data, but can be estimated reasonably well after having run the first few iterations. In the third case, many applications such as stochastic optimization have a number of iterations that is dependent on how quickly the solution converges, which generally can’t be predicted ahead of time. In this case approximately accurate requested runtimes could be provided if the computation is performed in several runs, each except the last run having requested runtime that is shorter than needed to reach final convergence, and with the solution from one run being input to the next run. The remainder of this section investigates whether most of the benefit of more accurate requested runtimes shown in the previous section can still be realized if some jobs perform a short test run before providing an approximately accurate requested runtime. To address this question, the following assumptions are made regarding the test runs. If the user requested runtime is already within a factor of two of the actual runtime (i.e., R ≤ 2T ), we assume that the user is aware that a test run is not needed, and the job is simply submitted with the requested runtime supplied by the user. For the remaining jobs, a specified fraction (i.e., 100% in section 5.1 or 50% - 80% in section 5.2) are assigned more accurate requested runtimes than specified by the user. The jobs that do not have more accurate requested runtimes represent jobs for which the user is either not able or not interested in estimating runtime more accurately. Of the jobs that are given more accurate requested runtimes, some fraction (e.g., 25%) require a test run before the more accurate request is given. If the test run is not used, the more accurate runtime request is assumed to be estimated from previous runs of the application. The requested runtime for a test run is equal to: (a) 10% of the user requested runtime if the user requested runtime is under 10 hours, or (b) one hour if the user requested runtime is greater than 10 hours. That is, the requested runtime for the test run is equal to the minimum of 1 hour and 10% of the user requested runtime (R). The requested runtime for the test run represents the runtime the
user believes is needed to estimate the full job runtime within a small factor. Note that because the user requested runtimes can be highly inaccurate, the actual job runtime may be shorter than the requested runtime for the test run. In such cases only the test run is needed. This represents the case in the actual system in which jobs complete during the test run, either due to the user's lack of experience in how long the test run should be, or due to an unexpected error in the execution. If the actual job runtime is longer than the test run or a test run is not needed, the job is submitted with an approximately accurate requested runtime (i.e., a requested runtime equal to twice the actual runtime, 2T). Section 5.1 considers the scenario in which all full runs have requested runtime within a factor of two of the actual runtime, but two different fractions of jobs (i.e., 100% or 25%), randomly selected, make test runs before submitting with the approximately accurate requested runtime. Section 5.2 considers the scenario in which only 50% or 80% of the jobs provide approximately accurate requested runtimes, whereas the other 50% or 20% of the jobs provide the same requested runtimes as in the job trace. Of the jobs that provide approximately accurate runtime requests, 25% make the test run before submitting with the approximately accurate request.
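The test-run assumptions above can be summarized in a small sketch. The function names are hypothetical, and the selection of which jobs actually perform a test run (e.g., the 25% considered later) is omitted, but the submission logic follows the rules stated in this section (times in hours).

def test_run_request(R):
    """Requested runtime for a test run: 10% of the user request R for requests
    under 10 hours, otherwise one hour, i.e. min(1 hour, 0.1 * R)."""
    return min(1.0, 0.1 * R)

def submit_with_test_run(T, R):
    """Return the list of (requested_runtime, actual_runtime) submissions for a
    job with actual runtime T and user requested runtime R, following the
    assumptions of this section."""
    if R <= 2 * T:
        return [(R, T)]               # user request already approximately accurate
    t_test = test_run_request(R)
    if T <= t_test:
        return [(t_test, T)]          # the job completes during the test run
    return [(t_test, t_test),         # test run uses its full allocation
            (2 * T, T)]               # resubmission with R* = 2T

print(submit_with_test_run(T=12.0, R=400.0))
# [(1.0, 1.0), (24.0, 12.0)]: a one-hour test run followed by the full run with R* = 2T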
5.1 Improved Requested Runtimes for All Jobs
This section studies the impact of test runs for the optimistic ("best case performance") scenario in which all jobs provide approximately accurate requested runtimes. In one case ("25% testrun"), 25% of the jobs that do not have an approximately accurate user requested runtime from the trace are randomly selected to have a test run. In the other case ("100% testrun"), every job with an improved runtime request requires a test run. Note that "100% testrun" is a pessimistic assumption that is not likely to occur in practice, since many applications are run a large number of times, and in many cases previous executions can be used to provide approximately accurate runtime requests. Thus, we consider the "25% testrun" experiments to be more representative of the practical impact of using test runs to improve runtime estimate accuracy. During each month, 35-45% of the jobs have inaccurate requested runtimes (i.e., R > 2T) and actual runtime T greater than the minimum of one hour and 10% of the user runtime request. For such jobs, if a test run is used to improve the requested runtime, the job is resubmitted after the test run. The total extra load due to the test runs is very small (only a 1-3% increase in processor and memory demand each month), even for 100% testrun. However, the additional waiting time for the test run, and the test run itself, must be included in the measures of total job waiting time. The results below address the impact of this extra waiting time on the overall system performance. Figure 10 compares 100% testrun and 25% testrun against the optimal case where all jobs use actual runtimes (i.e., R* = T) without test runs and the case where all jobs use the requested runtimes from the trace (i.e., R* = R). The average total wait, 95th-percentile total wait, maximum total wait, and average slowdown are shown for representative recent O2K workloads. For each of these
Fig. 10. Impact of Test-Runs to Determine Requested Runtimes (L√XF&W-backfill; wait includes test-run wait and overhead; R* = min{R, 2T}): (a) average total wait, (b) 95th-percentile total wait, (c) maximum total wait, (d) average slowdown; each graph compares R* = R, R* = T with no test run, 25% test run, and 100% test run.
measures except average slowdown during February 2001, the performance of the 25% testrun case is very similar to the case where R* = T. Overall the results show that a significant fraction of test runs can be made to improve requested runtimes, and if the improved requested runtimes are within a factor of two of the actual runtime, then nearly the maximum possible benefit of accurate requested runtimes can be achieved. The test run overhead becomes prominent if all jobs with R > 2T require a test run (i.e., 100% testrun). Even so, Figure 10(a) shows that during and immediately following the heavy load months, "100% testrun" has lower average and 95th-percentile wait, and especially lower average slowdown, than using requested runtimes from the trace.
5.2 Improved Requested Runtimes for a Majority of the Jobs
This section evaluates scenarios where only 50% or 80% of the jobs have improved requested runtime accuracy, and test runs are needed for 25% of the jobs that have improved requested runtimes. The two scenarios are named hybrid(1:1) - 25% testrun and hybrid(4:1) - 25% testrun, respectively. Note that hybrid(1:1) with 25% testrun represents a reasonably pessimistic, but possibly realistic scenario, in which only 50% of the jobs have approximately accurate requested runtimes and one out of four jobs requires a test run to improve requested runtime accuracy. Again, we use R∗ = 2T for approximately accurate runtime requests.
Fig. 11. Performance of Hybrid(x:y) with Test Runs (L√XF&W-backfill; R* = min{R, 2T}): (a) average wait vs. T (representative month, Jan. 2001), (b) 95th-percentile wait, (c) maximum wait, (d) average slowdown; the graphs compare R* = R, hybrid(4:1), hybrid(4:1) with 25% test run, and hybrid(1:1) with 25% test run.
Figure 11 compares the above two scenarios with 25% test run against that of using requested runtimes from the trace (i.e., "R* = R"). The performance for hybrid(4:1) without test runs is also included in Figures 11(b)-(d) for comparison with hybrid(4:1) with 25% testrun. The results show that for both hybrid systems with test runs, the average wait for short jobs, the 95th-percentile wait, and the average slowdown are significantly better than for the requested runtimes in the O2K traces. The results also show that test runs do not introduce significant overhead in the hybrid(4:1) system.
6 Conclusions
In this paper, we used ten recent one-month traces from the NCSA O2K to evaluate whether high-performance backfill policies can be significantly improved if the requested runtimes are more accurate. Several of these months have exceptionally heavy load, which tends to result in larger policy performance differentials than for the lower loads used in previous work. To select the best backfill policies for studying this key question, we more fully evaluated the issues related to starvation in backfill policies that favor short jobs. The results show (1) that a few reservations (2-4) can significantly reduce the maximum wait time but a larger number of reservations results in poor performance, and (2) that fixed reservations have similar performance to dynamic reservations in most cases, except for SJF-backfill which requires fixed reservations to reduce starvation. The results also show that two new priority backfill
policies, namely L√XF&W-backfill and S√TF&W-backfill, achieve a high performance trade-off between favoring short jobs and preventing starvation. The results for the high-performance backfill policies, heavier system load, and a more complete set of performance measures show that the potential benefit of more accurate requested runtimes is significantly larger than suggested in previous results for FCFS-backfill. Furthermore, the results show that most of the benefit of more accurate requested runtimes can be achieved by using test runs to improve requested runtime accuracy, in spite of the time needed to perform the test run. Another key result is that users who provide more accurate requested runtimes can expect improved performance, even if other jobs do not provide more accurate requested runtimes. Topics for future work include developing approaches for achieving more accurate requested runtimes in actual systems, improving job placement during backfill to reduce system fragmentation, and extending the high performance policies for use in more complex distributed Grid architectures such as the TeraGrid.
References

[1] National Computational Science Alliance Scientific Computing: Silicon Graphics Origin2000. http://www.ncsa.uiuc.edu/SCD/Hardware/Origin2000
[2] NCSA Scientific Computing: IA-32 Linux Cluster. http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IA32LinuxCluster
[3] Lifka, D.: The ANL/IBM SP scheduling system. In: Proc. 1st Workshop on Job Scheduling Strategies for Parallel Processing, Santa Barbara, Lecture Notes in Comp. Sci. Vol. 949, Springer-Verlag (1995) 295-303
[4] Skovira, J., Chan, W., Zhou, H., Lifka, K.: The EASY-Loadleveler API Project. In: Proc. 2nd Workshop on Job Scheduling Strategies for Parallel Processing, Honolulu, Lecture Notes in Comp. Sci. Vol. 1162, Springer-Verlag (1996) 41-47
[5] Chiang, S. H., Vernon, M. K.: Production job scheduling for parallel shared memory systems. In: Proc. Int'l. Parallel and Distributed Processing Symp. (IPDPS) 2001, San Francisco (2001)
[6] Feitelson, D. G., Mu'alem Weil, A.: Utilization and predictability in scheduling the IBM SP2 with backfilling. In: Proc. 12th Int'l. Parallel Processing Symp., Orlando (1998) 542-546
[7] Mu'alem, A. W., Feitelson, D. G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel and Distributed Syst. 12 (2001) 529-543
[8] Cirne, W., Berman, F.: A comprehensive model of the supercomputer workload. In: Proc. IEEE 4th Annual Workshop on Workload Characterization, Austin, TX (2001)
[9] Chiang, S. H., Vernon, M. K.: Characteristics of a large shared memory production workload. In: Proc. 7th Workshop on Job Scheduling Strategies for Parallel Processing, Cambridge, MA (2001)
[10] Smith, W., Taylor, V., Foster, I.: Using run-time predictions to estimate queue wait times and improve scheduler performance. In: Proc. 5th Workshop on Job Scheduling Strategies for Parallel Processing, San Juan, Lecture Notes in Comp. Sci. Vol. 1659, Springer-Verlag (1999) 202-219
[11] Zhang, Y., Franke, H., Moreira, J. E., Sivasubramaniam, A.: Improving parallel job scheduling by combining gang scheduling and backfilling techniques. In: Proc. Int'l. Parallel and Distributed Processing Symp. (IPDPS) 2000, Cancun (2000)
[12] Zhang, Y., Franke, H., Moreira, J. E., Sivasubramaniam, A.: An analysis of space- and time-sharing techniques for parallel job scheduling. In: Proc. 7th Workshop on Job Scheduling Strategies for Parallel Processing, Cambridge, MA (2001)
[13] Zotkin, D., Keleher, P. J.: Job-length estimation and performance in backfilling schedulers. In: 8th IEEE Int'l Symp. on High Performance Distributed Computing, Redondo Beach (1999) 236-243
[14] Perkovic, D., Keleher, P. J.: Randomization, speculation, and adaptation in batch schedulers. In: Proc. 2000 ACM/IEEE Supercomputing Conf., Dallas (2000)
[15] Gibbons, R.: A historical application profiler for use by parallel schedulers. In: Proc. 3rd Workshop on Job Scheduling Strategies for Parallel Processing, Geneva, Lecture Notes in Comp. Sci. Vol. 1291, Springer-Verlag (1997)
Economic Scheduling in Grid Computing

Carsten Ernemann, Volker Hamscher, and Ramin Yahyapour

Computer Engineering Institute, University of Dortmund, 44221 Dortmund, Germany
{carsten.ernemann,volker.hamscher,ramin.yahyapour}@udo.edu
Abstract. Grid computing is a promising technology for future computing platforms. Here, the task of scheduling computing resources proves difficult as resources are geographically distributed and owned by individuals with different access and cost policies. This paper addresses the idea of applying economic models to the scheduling task. To this end a scheduling infrastructure and a market-economic method are presented. The efficiency of this approach in terms of response- and wait-time minimization as well as utilization is evaluated by simulations with real workload traces. The evaluations show that the presented economic scheduling algorithm provides average weighted response times that are similar to or even better than those of common algorithms like backfilling. This is especially promising as the presented economic models have additional advantages, e.g. support for different price models, optimization objectives, access policies or quality of service demands.
1 Introduction
Grid computing is expected to provide easier access to remote computational resources that are usually locally limited. Distributed computer systems are joined in such a grid environment (see [5, 12]), in which users can submit jobs that are automatically assigned to suitable resources. The idea is similar to metacomputing [20] where the focus is limited to compute resources. Grid computing takes a broader approach by including networks, data, visualization devices etc. as accessible resources [17, 11]. In addition to the benefit of access to locally unavailable resource types, there is also the expectation that a larger number of resources is available for a single job. This is assumed to result in a reduction of the average job response time. Moreover, the utilization of the grid computers and the job-throughput is likely to improve due to load-balancing effects between the participating systems. Typically the parallel computing resources are not exclusively dedicated to the grid environment. Furthermore, they are usually not owned and maintained by the same administrative instance. Research institutes as well as laboratories and universities are examples for such resource owners. Due to the geographically distributed resources and the different owners the management of the grid environment becomes rather complex, especially the scheduling of the computational tasks. To this end, economic models for the scheduling are an adequate
way to solve this problem. They provide support for individual access and service policies to the resource owners and grid users. Especially the ability to include cost management into the scheduling will become an important aspect in a future grid economy, as anonymous users compete for resources. In this paper, we present an architecture and an economic scheduling model for such grid environments. First examinations of the efficiency of this approach have been performed by simulations. The results are discussed in comparison to conventional scheduling algorithms that are not based on economic models. Note that these classic methods are primarily optimized for response-time minimization. The following sections are organized as follows. Section 2 gives a short overview of the background of grid scheduling and economic market methods. In Section 3 an infrastructure for a grid environment is presented that supports economic scheduling models as well as common algorithms such as backfilling. The economic scheduling method itself is described in Section 4. The simulation and the results for this scheduling method are shown in Section 5. The paper ends with a brief conclusion in Section 6.
2 Background
Scheduling is the task of allocating resources to problems over time. In grid computing these problems are typically computational tasks called jobs. They can be described by several parameters like the submission time, run time, the needed number of processors etc. In this paper we focus only on the job scheduling part of a grid management infrastructure. A complete infrastructure has to address many additional topics, e.g. network and data management, information collection or job execution. One example for a grid management infrastructure is Globus [10]. Note that we examine scheduling for parallel jobs where the job parts can be executed synchronously on different machines. It is the task of the grid scheduling system to find suitable resources for a job and to determine the allocation times. However, the actual transfer, execution, synchronization and communication of a job is not part of the grid scheduling system. Until now mostly algorithms such as FCFS and backfilling have been used for the scheduling task [8, 16]. These classic methods have been subject to research for a long time and have a well known behavior in terms of worst-case and competitive analysis. These algorithms have been used for the management of single parallel machines. In later implementations they were adapted for the application in grid environments [13, 6]. As already mentioned, the requirements on the scheduling method differ from single machine scheduling as the resources are geographically distributed and owned by different individuals. The scheduling objective is usually the minimization of the completion time of a computational job on a single parallel machine. Especially for grid applications other objectives have to be considered, such as cost, quality of service etc. To this end, other scheduling approaches are necessary that can deal better with different user objectives as
well as owner and resource policies. Here, naturally, economic models come to mind. An overview of such models can be found in [2], and economic concepts have additionally been examined in the Mariposa project, which is restricted to distributed database systems [22]. In comparison to other economic approaches to job scheduling (e.g. [27, 21]), our model supports varying utility functions for the different jobs and resources. Additionally, the model is not restricted to single parallel machines and further allows co-allocation of resources from different owners without disclosing policy information. In this paper we just give a brief introduction to the background for our scheduling setting.

2.1 Market Methods
Market methods, sometimes called Market oriented programming in combination with Computer Science, are used to solve the following problems which occur in real scheduling environments ([4]):

– The site autonomy problem arises as the resources within the system are owned by different companies.
– The heterogeneous substrate problem results from the fact that different companies use different resource management systems.
– The policy extensibility problem means that local management systems can be changed without any effects for the rest of the system.
– The co-allocation problem addresses the aspect that some applications need several resources of different companies at the same time. Market methods allow the combination of resources from different suppliers without further knowledge of the underlying schedules.
– The online control problem is caused by the fact that the system works in an online environment.

The supply and demand mechanisms provide the possibility to optimize different objectives of the market participants under the usage of costs, prices and utility functions. It is expected that such methods provide high robustness and flexibility in the case of failures and a high adaptability during changes. Next, the definitions of market, market method and agent will be presented briefly. A market can be defined as a virtual market or, from an economical point of view, as follows: "Generally any context in which the sale and purchase of goods and services takes place." [25]. The minimal conditions to define a virtual market are: "A market is a medium or context in which autonomous agents exchange goods under the guidance of price in order to maximize their own utility." [25]. The main aspect is that autonomous agents voluntarily exchange their goods in order to maximize their own utility. A market method can be defined as follows: "A market method is the overall algorithmic structure within which a market mechanism or principle is embedded." [26]. It has to be emphasized that a market method is an equilibrium protocol and not a complete algorithm.
The definition of an agent can be found in [26]: "An agent is an entity whose supply and demand functions are equilibrated with those of others by the mechanism, and whose utility is increased through exchange at equilibrium ratios." The question is now how the equilibrium can be obtained. One possible method is the application of auctions: "An auction is a market institution with an explicit set of rules determining resource allocation and price on the basis of bids from the market participants." [28]. More details about the general equilibrium and its existence can be found in [29].

2.2 Economic Scheduling in Existing Systems
Economic methods have been applied in various contexts. Besides the references explained in [2], we want to briefly mention some other typical algorithms of economic models.

WALRAS. The WALRAS method is a classic approach that translates a complex, distributed problem into an equilibrium problem [1]. One of the assumptions is that agents do not try to manipulate the prices with speculation, which is called perfect competition. To solve the equilibrium problem the WALRAS method uses a Double Auction. During that process all agents send their utility functions to a central auctioneer who calculates the equilibrium prices. A separate auction is started for every good. At the end, the resulting prices are transmitted to all agents. As the utility of goods may not be independent for the agents, they can react on the new equilibrium prices by re-adjusting their utility functions. Subsequently, the process starts again. This iteration is repeated until the equilibrium prices are stabilized. The WALRAS method has been used for transportation problems as well as for processor rental. The transportation problem requires to transport different goods over an existing network from different start places to different end places. The processor rental problem consists of allocating one processor for different processes, while all processes have to pay for the utilization.

Enterprise. Another application example for market methods is the Enterprise [24] system. Here, machines create offers for jobs to be run on these machines. To this end, all jobs describe their necessary environment in detail. After all machines have created their offers the jobs select between these offers. The machine that provides the shortest response time has the highest priority and will be chosen by the job. All machines have a priority scheme where jobs with a shorter run time have a higher priority.

Under the premise of these methods, we present in the next sections our infrastructure and scheduling method for grid job scheduling.
3 Infrastructure
The scheduling model presented in this paper has been implemented within the NWIRE (Net-Wide-Resources) management infrastructure which has been
132
Carsten Ernemann et al.
Scheduler
Scheduler
MetaManager
Network
MetaManager
Resource 1 Resource m Resource k
MetaDomain
MetaDomain
Fig. 1. Structure of NWIRE
developed at our institute [19]. The general idea is that local management structures provide remote access to resources, which are represented by CORBA objects. The scheduling part is using those structures to trade resources between them. While staying locally controlled, the resources are offered throughout the connected management-systems. To address the site autonomy problem, NWIRE structures the system into separate domains, that are constituted by a set of local resources and local management instances. Each so called MetaDomain is controlled by a MetaManager, as shown in Figure 1. This MetaManager administers the local resources and answers to local job requests. Additionally, this MetaManager consists of a local scheduler and acts as a broker/trader to other remote MetaDomains respectively their MetaManagers. That is the local MetaManager can offer local resources to other domains or tries to find suitable resource allocations for local requests. The MetaManager can discover other domains by using directory services as well as exploring the neighborhood similar to peer-to-peer network strategies. If necessary, requests can be forwarded to the MetaManager of other domains. Parameters in the request are used to control depth and strategy of this search. Information on the location of specific resource types can be cached for later requests. Each MetaManager maintains a list with links to other dedicated MetaManagers. This list can be set up by the administrator to comply with logical
Fig. 2. Scheduling Steps
or physical relationships to other domains, e.g. according to network or business connections. Additionally, directory services can be introduced to find specific resource types. Information on remote resources can be cached and used to select suitable MetaManagers to which a request is forwarded. This concept provides several advantages, e.g. an increased reliability and fail-safety as the domains act independently. A failure at one site has only local impact as the overall network is still intact. Another feature is the ability to allow different implementations of the scheduling and the offer generation. According to the policy at an institution, the owner can set up an implementation that suits his needs best. Note that the policy on how offers for remote job requests are created does not have to be revealed. This scheduling infrastructure provides the base to implement different strategies for the scheduler. This also includes the ability to use conventional methods like for instance backfilling. Within the NWIRE system, this is achieved by using so-called requests for the information exchange between the user and the components involved in the scheduling. The request is a flexible description of the conditions of a set of resources that are necessary for a job.
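As an illustration only, such a request might be represented roughly as follows; the field names are taken from the request parameters discussed in Section 4 and are not the actual NWIRE interface.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ResourceRequest:
    """Illustrative container for the data a request could carry."""
    processors: int                   # number of required resources
    runtime: float                    # expected run time (hours)
    earliest_start: float             # earliest start time
    deadline: float                   # latest end time
    reservation_time: float           # how long offered resources stay reserved
    max_search_time: float            # maximum time to search for offers
    hops: int                         # remaining search depth over remote domains
    utility: Callable[[dict], float]  # user-supplied utility function over an offer

# Example: ask for 3 processors and prefer the earliest possible start time.
request = ResourceRequest(processors=3, runtime=2.0, earliest_start=0.0,
                          deadline=48.0, reservation_time=1.0, max_search_time=0.1,
                          hops=3, utility=lambda offer: -offer["start_time"])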
4 Economic Scheduling
This section includes a description of the scheduling algorithm that has been implemented for the presented infrastructure. The general application flow can be seen in Figure 3. In contrast to [3], our scheduling model does not rely on a single central scheduling instance. Moreover, each domain acts independently and may have different objective policies. Also the job requests of the users can have individual objective functions. The scheduling model has the task to combine these objectives to find the equilibrium of the market. This is a derivation of the previously presented methods of WALRAS and Enterprise.

Fig. 3. General application flow: new request, local machines create offers for the new request, first selection, interrogation of remote domains, second selection, create offer.

In our scheduling model all users submit their job requests to the local MetaManager of the domain as shown in Figure 2. For example, the user specifies that his job requires 3 processors with certain properties as for instance the architecture. Additionally a utility function UF is supplied by the user. For instance, the user in our example is interested in the minimization of the job start time, which can be achieved by maximizing the utility function UF = (−StartTime). The estimated job run-time is also given in addition to an earliest start and latest end time. Note that a job is allocated for the requested run-time and is terminated if the job exceeds this time. If a job finishes earlier, the resulting idle time of resources can be allocated to later submitted jobs. These idle resources can be further exploited by introduction of a rescheduling step which has not been applied in this work. Rescheduling can be used to re-allocate jobs while maintaining the guarantees of the previous allocations. This can be compared with backfilling, although guaranteed allocations, e.g. due to remote dependencies by co-allocation, must be fulfilled. The rescheduling may require additional requests for offers. The request is analyzed by the scheduler of the receiving MetaManager. The scheduler creates, if possible, offers for all local machines. After this step, a first selection takes place where only the best offers are kept for further processing. According to the job parameters and the found offers, the request is forwarded to the schedulers of other domains. This is possible as long as the number of hops (search depth of involved domains) for this request is not exceeded and the time to live for this request is still valid. In addition none of the domains must have received this request before. The remote domains create new offers and send their best combinations back. If a job has been processed before no
further offers are generated. A second selection process takes place in order to find the best offers among the returned results of this particular domain. Note that this method is an auction with neither a central nor a decentral auctioneer. Moreover, the different objective functions of all participants are used for equilibration. For each potential offer o for request i the utility value UV_{i,o} is evaluated and returned within the offer to the originating MetaDomain that received the user's request. The utility value is calculated by the user supplied utility function UF_i which can be formulated with the job and offer parameters. Additionally to this parameter set P_u the machine value MV_{i,j} of the corresponding machine j can be included:

UV_{i,o} = UF_i(P_u, MV_{i,j}),    MV_{i,j} = MF_j(P_m)

The machine value results from the machine objective function MF_j which can depend on a parameter set P_m. The originating MetaManager selects the offer with the highest utility value UV_{i,o}. In principle this MetaManager serves the tasks of an auctioneer.

Next, we examine the local offer generation in more detail. To this end the application flow is shown in Figure 4.

Fig. 4. Local offer creation: a request passes through Check Request, then (in the no-multi-site case) search for free intervals within the schedule, grain selection of an interval, fine selection of an interval, and create offer; otherwise multi-site handling is initiated.

Within the Check Request phase it is determined whether the best offer is to be selected automatically or whether the user is going to select an offer interactively among a given number of possible offers. In the same step the user's budget is checked whether it is sufficient in order to process the job at the local machines. The actual accounting and billing was not part of this study and requires additional work. Furthermore in this step, it is verified if local resources meet the requirements of the request. Next, the necessary scheduling parameters are extracted which are included in the request, e.g. the earliest start time of the job, the deadline (end time), the maximum search time, the time until the resources will be reserved for the job (reservation time), the expected run time and the number of required resources. Another parameter is the utility function which is applied in the further selection process. If not enough resources can be found during the Check Request phase, but all other requirements can be fulfilled by the local resources, a multi-site scheduling
will be initiated. In this case additional and modified offers are requested from remote domains to meet in combination the original job requirements. This is an example of co-allocating resources from different owners.

The next step, Search for free intervals within the schedule, tries to find all free time intervals within the requested time frame on the suitable resources. As a simple example assume a parallel computer with dedicated processors as the resources. The example schedule is given in Figure 5. The black areas within the schedule are already allocated by other jobs. The job in our example requests three processors and has a start time A, an end time D and a run time less than (C − B). First, free time intervals are extracted for each processor. Next, the free intervals of several processors are combined in order to find possible solutions. To this end, a list is created with triples of the form {time, processor number, +/-1}, which means that the processor with the specified processor number is free (+1) or not free (-1) at the examined time. The generated list (sourceList) is used to find possible solutions as shown in the following pseudo-code:

list tempList;
LOOP: while (sourceList not empty) {
    get the time t of the next element in the sourceList;
    test for all elements in tempList whether the difference between the
        beginning of the free interval and the time t is bigger than or
        equal to the run time of the job;
    if (number of elements in tempList which fulfill the time condition
        is bigger than or equal to the needed number of processors) {
        create offers from the elements of the tempList;
    }
    if (enough offers found) {
        finish LOOP;
    }
    add or subtract the elements of the sourceList to or from tempList
        which have time entry t;
}

The given algorithm creates potential offers that include e.g. start time, end time, run time and the requested number of processors as well as the user utility value (UV_{i,o}).
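One way to read the pseudo-code is as a sweep over the sorted event list. The sketch below is an illustrative Python rendering under the assumption that the free intervals have already been clipped to the requested time frame; it is not the NWIRE implementation, and an offer's start time is simply taken as the latest interval start among the chosen processors.

def find_offers(free_intervals, needed, runtime, max_offers=10):
    """free_intervals: list of (processor, start, end) free slots. Returns up to
    max_offers candidate offers as (start_time, processors) where the chosen
    processors are simultaneously free for at least `runtime`."""
    events = []                        # (time, processor, delta, interval_start)
    for proc, start, end in free_intervals:
        events.append((start, proc, +1, start))
        events.append((end, proc, -1, start))
    events.sort()

    active = {}                        # processor -> start of its current free interval
    offers = []
    i = 0
    while i < len(events) and len(offers) < max_offers:
        t = events[i][0]
        # Elements of the "tempList": processors whose free interval began early
        # enough that a run of length `runtime` fits before time t.
        candidates = sorted((s, p) for p, s in active.items() if t - s >= runtime)
        if len(candidates) >= needed:
            chosen = candidates[:needed]
            offer = (max(s for s, _ in chosen), [p for _, p in chosen])
            if offer not in offers:
                offers.append(offer)
        # Add or subtract all elements with time entry t.
        while i < len(events) and events[i][0] == t:
            _, proc, delta, start = events[i]
            if delta > 0:
                active[proc] = start
            else:
                active.pop(proc, None)
            i += 1
    return offers

# Example: processors 1, 2 and 5 are free in [0, 10), processor 3 only in [4, 6);
# a job needing 3 processors for 5 time units can be placed on {1, 2, 5} at time 0.
print(find_offers([(1, 0, 10), (2, 0, 10), (3, 4, 6), (5, 0, 10)], needed=3, runtime=5))
# [(0, [1, 2, 5])]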
Fig. 5. Start situation

Fig. 6. Bucket 1

Fig. 7. Bucket 2

Fig. 8. Bucket 3
Note that it remains to be shown how the offer is created from the elements of this list. Such an algorithm will be presented in the following. The goal is to find areas of enough resources within the schedule for a given list of free time intervals. This has to take into account that resources possibly have different start and end times. The resulting areas are characterized by the earliest start and latest end time. To this end a derivation of a bucket sort is used. In the first step all intervals with the same start time are collected in the same bucket. In the second step, for each bucket, the elements with the same end time are collected in new buckets. At the end each bucket has a list of resources available between the same start and end time. For the example above, the algorithm creates three buckets as shown in Figures 6, 7 and 8. After the creation of buckets suitable offers are generated either with elements from one bucket, if the bucket includes enough resources, or by combining elements of different buckets. Additional care must be taken as elements from different buckets can have different start and end times. The maximum start and the minimum end time must be calculated. In our example only bucket 1 can fulfill the requirements alone and therefore an offer can be built, e.g. with resources 1, 2 and 5. In order to generate different offers, a bucket for which an offer was only possible by its own elements is modified to contain one resource less than the required number. Afterwards, the process is continued. If not enough solutions are found yet, no further bucket can fulfill the request by itself, and the number of remaining elements of all buckets is greater than or equal to the requested resource number, new solutions are generated by combinations of bucket elements with regard to the intersecting time frames.
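The bucket step can be expressed compactly by grouping the free intervals by identical start and end times and then combining buckets whose common time frame is still long enough. The sketch below is illustrative, with hypothetical function names, and only loosely mirrors the example in the text.

from collections import defaultdict
from itertools import combinations

def build_buckets(free_intervals):
    """Group free intervals (processor, start, end) by identical start and end time."""
    buckets = defaultdict(list)
    for proc, start, end in free_intervals:
        buckets[(start, end)].append(proc)
    return buckets

def bucket_offers(buckets, needed, runtime):
    """Yield (start, end, processors) built from single buckets first, then from
    combinations of buckets whose intersecting time frame is still long enough."""
    items = list(buckets.items())
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            start = max(s for (s, _e), _ in combo)       # maximum start time
            end = min(e for (_s, e), _ in combo)         # minimum end time
            procs = [p for _frame, plist in combo for p in plist]
            if end - start >= runtime and len(procs) >= needed:
                yield start, end, procs[:needed]

buckets = build_buckets([(1, 0, 10), (2, 0, 10), (5, 0, 10), (3, 4, 6), (4, 4, 6), (7, 4, 6)])
print(next(bucket_offers(buckets, needed=3, runtime=5)))    # (0, 10, [1, 2, 5])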
In our example, together with the solution built from bucket 1, the whole set of solutions would be: {{1,2,5}, {1,2,3}, {1,2,4}, {1,2,7}, {1,3,4}, {1,3,7}, {1,4,7}, {2,3,4}, {2,3,7}, {3,4,7}}. After the end of the Search for free intervals within the schedule phase from Figure 4, a grain selection of one of these intervals takes place in the next phase. In principle a large number of solutions is possible by modifying the start and end time for the job in every combination and then selecting the interval with the highest utility value. In practice this is not applicable with regard to the runtime of the algorithm. Therefore a heuristic is used by selecting the combination having the highest utility value for the earliest start time. Next, the start and end time are modified to improve this utility value. The modification with the highest value is selected as the resulting offer during the phase "fine selection of an interval" in Figure 4. A number of steps can be defined which specifies the number of different start and end times within the given time interval. Note that the utility function is not constrained in terms of monotonicity. Therefore, the selection process above is heuristic. After this phase the algorithm is finished and possible offers are generated. The utility functions of the machine owner and the user have not been discussed yet. This method allows both of them to define their own utility function. In our implementation any mathematical formula, using any valid time and resource variables, is supported. Overall, the resulting value for the user's utility function is maximized. The linkage to the objective function of the machine owner is created by the price for the machine usage, which equals the machine owner's utility function. The price may be included in the user's utility function. The owner of the machine can build the utility function with additional variables that are first available after the schedule has been generated. Figure 9 shows the variables that are used in our implementation. The variable under specifies the area in the schedule in which the corresponding resources (processors) are unused before the job allocation. The variable over determines the area of unused resources after the job, up to the next job start on the corresponding resources or to the end of the schedule. The variable left_right specifies the area on the left and right side of the job. The variable utilization specifies the utilization of the machine if the job is allocated. This is defined by the ratio of the sum of all allocated areas to the whole available area from the current time instance to the end of the schedule. Note that the network has explicitly not been considered. Further work can easily extend the presented model to include network dependencies into the selection and evaluation process. For example, the network latency and bandwidth during job execution can be considered by parameterizing the job run-time during the scheduling. However, the network is regarded in terms of resource partitioning and site autonomy. The presented model focuses on the cooperation scheme and economic scheduling scheme between the MetaManagers of independent domains. Herein,
Economic Scheduling in Grid Computing
time
1111 0000 0000 1111 0000 1111 0000 1111 0000 1111 0000 1111 00001111 000000000000001111 11111111111111 0000 000 111 00000000000000 11111111111111 0000 1111 00000000000000111 11111111111111 1111 000 000000 11 1
2
3
4
5
6
7
11 00 00 11 00 11 8
9
10
11
139
11111111 00000000 00000000 11111111 00000000 11111111 12
13
14
15
16
17
processors already alocated jobs
1 0 0 1
11 00 00 11
free area before job start (under)
11 00 00 11
new job
11 00 00 11
free area left and right (left_right)
free area after the job end until the next job or the ende of the schedule (over)
Fig. 9. Parameters for the calculation of the owner utility function
a MetaManager can allocate jobs without direct control over remote resources and without the exposure of local control.
5 Simulation and Evaluation
In this section the simulation environment is described. First, the resource configurations that are used for our evaluation are described, followed by an introduction of the applied job model.

5.1 Resource Configurations
All three examined resource configurations have a sum of 512 processors in common. The configurations differ in the processor distribution on machines as shown in Table 1. The configurations m128 and m256 are scenarios that resemble companies with several branch offices or a combination of universities. The configuration m384 characterizes a large data processing center which is connected to several smaller client sites. The configurations m128 and m256 are balanced in the sense of an equal number of processors at each machine. The configuration m384 in comparison is unbalanced. The resource configuration m512 serves as a reference with a single large machine executing all jobs.
Table 1. Used resource configurations

    identifier   configuration               maximum size   sum
    m128         4 · 128                     128            512
    m256         2 · 256                     256            512
    m384         1 · 384 + 1 · 64 + 4 · 16   384            512
    m512         1 · 512                     512            512
In order to apply economic scheduling methods, utility functions are required as mentioned before. Therefore, 6 different owner objective functions have been chosen for the first evaluation. Further extensive study is necessary to optimize the objective functions towards better results. The first one is the most general owner utility function, from which all others are derived. The owner machine function MF1 consists of several terms. The first term, NumberOfProcessors · RunTime, calculates the area that the job occupies within the schedule. The second term calculates the free areas before and after the job as well as the parallel idle time on the other resources within the local schedule (see Figure 9): over + under + left_right. The last term of the formula is 1 − left_right_rel, where left_right_rel describes the ratio of the free areas to the left and right of the job within the schedule (left_right) to the area actually used by the job. A small factor indicates that the free areas on both sides are small in comparison to the job area. This leads to the following objective function MF1 and its derivations MF2 - MF6:

    MF1 = (NumberOfProcessors · RunTime + over + under + left_right) · (1 − left_right_rel)
    MF2 = (NumberOfProcessors · RunTime + over + under + left_right)
    MF3 = (NumberOfProcessors · RunTime + over + under) · (1 − left_right_rel)
    MF4 = (NumberOfProcessors · RunTime + left_right) · (1 − left_right_rel)
    MF5 = (NumberOfProcessors · RunTime + over + left_right) · (1 − left_right_rel)
    MF6 = (NumberOfProcessors · RunTime + under + left_right) · (1 − left_right_rel)
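The owner objective functions can be evaluated directly from the schedule parameters of Figure 9. The following Python sketch computes MF1 through MF6 from given area values; the variable names mirror the formulas above, and the function itself is only an illustration, not the authors' implementation.

    def owner_utility(variant, processors, run_time, over, under, left_right, left_right_rel):
        # left_right_rel is the ratio of the free area left and right of the
        # job (left_right) to the area actually used by the job.
        area = processors * run_time
        balance = 1 - left_right_rel
        if variant == 1:
            return (area + over + under + left_right) * balance
        if variant == 2:
            return area + over + under + left_right
        if variant == 3:
            return (area + over + under) * balance
        if variant == 4:
            return (area + left_right) * balance
        if variant == 5:
            return (area + over + left_right) * balance
        if variant == 6:
            return (area + under + left_right) * balance
        raise ValueError("unknown machine function MF%d" % variant)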
5.2 Job Configurations
Unfortunately, no real workload is currently available for grid computing. For our evaluation we derived a suitable workload from real machine traces. These traces have been obtained from the Cornell Theory Center and are based on an IBM RS6000/SP parallel computer with 430 nodes. For more details on the traces and the configuration see the description of Hotovy [14]. The workload is available from the standard workload archive [23]. In order to use these traces for this study it was necessary to modify them to simulate submissions at independent sites with local users. To this end, the jobs from the real traces have been assigned in a round-robin fashion to the different sites. It is typical for many known workloads to favor jobs requiring a power of 2 nodes. The CTC workload shows the same characteristic. Modeling configurations with smaller machines would put these machines at a disadvantage if their number of nodes is not a power of 2. Therefore, our configurations consist of 512 nodes in total. Nevertheless, the traces contain enough workload to keep a sufficient backlog on conventional scheduling systems (see [13]). The backlog is the amount of workload that is queued at any time instance if there are not enough free resources to start the jobs. A sufficient backlog is important as a small or even no backlog indicates that the system is not fully utilized. In this case there is not enough workload available to keep the machines working. Many schedulers, e.g. the mentioned backfilling strategy, require that enough jobs are available for backfilling in order to utilize idle resources. Otherwise the result is usually bad scheduling quality and unrealistic results. Note that backlog analysis is only possible for the conventional scheduling algorithms; the economic method does not use a queue, as the job allocation is scheduled directly after submission time.

Overall, the quality of a scheduler is highly dependent on the workload. To minimize the risk of singular effects, the simulations have been performed for 4 workload sets (see Table 2). The synthetic workload is very similar to the CTC data set, see [15]. It has been generated to prevent singular effects in real traces, e.g. machine down times, from affecting the accuracy of the results. In addition, 3 extracts of the real traces are used to get information on the consistency of the results for the CTC workload. Each workload set consists of 10000 jobs, which corresponds to a period of more than three months in real time.
Table 2. The used workloads

    identifier   description
    10_20k_org   An extract of the original CTC traces from job 10000 to 20000.
    30_40k_org   An extract of the original CTC traces from job 30000 to 40000.
    60_70k_org   An extract of the original CTC traces from job 60000 to 70000.
    syn_org      The synthetically generated workload traces, derived from the CTC workload.
The same workloads have been applied for the simulations with conventional scheduling systems in [13, 6]. This allows the comparison of the economic systems in this work to the non-economic scheduling systems in [13, 6, 7]. Additionally, a utility function for each job is necessary in economic scheduling to represent the preferences of the corresponding user. To this end, the following 5 user utility functions (UF) have been applied for our first evaluations. The first user utility function prefers the earliest start time of the job; all processing costs are ignored.

    UF1 = (−StartTime)

The second user utility function only considers the calculation costs caused by the job.

    UF2 = (−JobCost)

The remaining user utility functions are combinations of the first two, but with different weights.

    UF3 = (−(StartTime + JobCost))
    UF4 = (−(StartTime + 2 · JobCost))
    UF5 = (−(2 · StartTime + JobCost))
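As a small illustration, the five user utility functions can be expressed as simple functions of the offered start time and the job cost; the dictionary below is our own sketch with hypothetical argument names and example values.

    USER_UTILITY = {
        1: lambda start_time, job_cost: -start_time,
        2: lambda start_time, job_cost: -job_cost,
        3: lambda start_time, job_cost: -(start_time + job_cost),
        4: lambda start_time, job_cost: -(start_time + 2 * job_cost),
        5: lambda start_time, job_cost: -(2 * start_time + job_cost),
    }

    # Example: rank two hypothetical offers (start time in seconds, job cost) under UF3.
    offers = [(3600.0, 12.0), (7200.0, 4.0)]
    best_offer = max(offers, key=lambda offer: USER_UTILITY[3](*offer))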
5.3 Results
Discrete event-based simulations have been performed according to the previously described architecture and settings. Figure 10 shows a comparison of the average weighted response time (AWRT) for the economically based and for the conventional first-come-first-serve/backfilling scheduling system. The average weighted response time is the sum of the corresponding run and wait times weighted by the resource consumption, which is the number of resources multiplied by the job execution time. Note that this weight prevents any prioritization of smaller over wider jobs in regard to the average weighted response time if no resources are left idle [18]. The average weighted response time is a measure of schedule quality from the user perspective: a shorter AWRT indicates that the users have to wait less for the completion of their jobs. For both systems the best achieved results have been selected. Note that the machine and utility functions used differ between the economic simulations. The results show for all used workloads and all resource configurations that the economically based scheduling system has the capability to outperform the conventional first-come-first-serve/backfilling strategy. Backfilling can be outperformed because the economic scheduling system is not restricted in the job execution order: within this system a job that was submitted after another already scheduled job can be started earlier if corresponding resources can be found.
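A minimal sketch of the AWRT computation is shown below. It assumes per-job records of submission, start, and completion times and normalizes the weighted sum by the total resource consumption, which is one common reading of the definition above; the record layout is our own assumption.

    def awrt(jobs):
        # jobs is an iterable of (processors, submit, start, end) tuples; the
        # weight of a job is its resource consumption, i.e. the number of
        # processors multiplied by its execution time (end - start).
        weighted_sum = 0.0
        total_weight = 0.0
        for processors, submit, start, end in jobs:
            weight = processors * (end - start)
            weighted_sum += weight * (end - submit)   # wait time plus run time
            total_weight += weight
        return weighted_sum / total_weight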
Fig. 10. Comparison between economic and conventional scheduling
The conventional backfilling strategy used with the first-come-first-serve algorithm ([16]) can only start jobs earlier if all jobs that were submitted before are not additionally delayed. Even EASY backfilling, which relaxes this restriction to require only that the first job in the queue is not delayed ([9]), does not result in better performance. The restriction on out-of-order execution in backfilling prevents job starvation. The economic method does not encounter the starvation problem as the job execution is allocated immediately after submission.

Figure 10 only shows the best results for the economic scheduling system. Now, in Figure 11, a comparison between the economic and the conventional scheduling system for only one machine/utility function combination is presented. The combination of MF1 and UF1 leads to scheduling results that can outperform the conventional system for all used workloads and the configurations m128 and m512. Note that this benefit of the economic method was achieved by applying a single machine/utility function combination for all workloads. This indicates that suitable machine/user utility functions can provide good results for various workloads.

Figure 12 presents the AWRT combined with the average weighted wait time (AWWT) using the same weight selection. In all cases the same resource configuration as well as the same machine/utility function combination are used. The time differences between the simulations for both resource configurations are small. This shows that the algorithm for multi-site scheduling (for resource configuration m128), although it is more complex, does not result in a much worse response time in comparison to a single machine. Note that multi-site execution is not penalized by an overhead in our evaluation.
Fig. 11. Comparison between economic and conventional scheduling for the resource configurations m128 and m512 using MF1 - UF1

Therefore, the optimal benefit of job splitting is examined and only the capability of supporting multi-site execution in an economic environment over remote sites is regarded. Here, effects of splitting jobs may even improve the scheduling results. Figure 13 demonstrates that the average weighted response time as well as the average weighted wait time do not differ significantly between the different resource configurations. In this case, the machine configurations have only a limited impact on the effect of multi-site scheduling; the overall number of processors is of higher significance in our economic algorithm. Configurations with bigger machines have smaller average weighted response times than configurations with a collection of smaller machines.

The influence of using different machine/utility function combinations for a resource set is shown in Figure 14. Here, the squashed area (the sum of the products of the run time and the number of processors) is given per machine for the resource configuration m128. The variant m128 is balanced in the sense of having equally sized machines. The desired optimal behavior is usually an equally balanced workload distribution over all machines. The combination (MF1, UF1) leads to a workload distribution where the decrease of the local squashed area is nearly constant between the machines ordered by their number, as shown in Figure 14. The maximum difference between the squashed areas is about 18%. In the second case, the combination (MF1, UF2) yields a better outcome in the sense of a nearly equally distributed workload.
Fig. 12. AWRT and AWWT for m128 and m512 using several workloads, machine function MF1 and utility function UF1

The third function combination (MF2, UF2) leads to an unbalanced result: two of the machines execute about 67% of the overall workload and the two remaining machines the rest. Simulation results for the same machine/utility function combinations are shown in Figure 15. The combination (MF1, UF2) does not perform very well in terms of utilization, as all machines achieve less than 29%. In combination with Figure 14, this indicates that a well distributed workload corresponds with a lower utilization. The combination (MF1, UF1) leads to a utilization between 61% and 77% on all machines. The third examined combination (MF2, UF2) shows a very good utilization of two machines (over 85%) and a very low utilization of the others (under 45%). In this case the distributed workload correlates with the utilization of the machines. After the presentation of the distributed workload and the corresponding utilization, the AWWT and AWRT shown in Figure 16 clearly indicate that only the function combination (MF1, UF1) leads to reasonable scheduling results. The results from Figures 14, 15, and 16 demonstrate that different machine/utility function combinations may result in completely different scheduling behaviors. Therefore an appropriate selection of these functions is important for an economic scheduling system. In the following, the comparison of different machine/utility functions is shown for the resource configuration m128.
Fig. 13. AWRT and AWWT for all resource configurations and the syn_org workload in combination with MF1 - UF1

In Figure 17 the average weighted response time is drawn for all different machine functions in combination with utility function UF3. The average weighted response time for the machine function MF2 is significantly better than for all other machine functions. Here, the factor 1 − left_right_rel, which is used in all other machine functions, does not work well for this machine configuration. It seems to be beneficial to use absolute values for the areas instead, e.g. (NumberOfProcessors · RunTime + over + under + left_right). Unexpectedly, Figure 17 also shows that the intended reduction of the free areas within the schedule before the job starts, via the attribute under, results in very poor average weighted response times (see the results for MF1, MF3, and MF6). As machine function MF2 provided significantly better results, different user utility functions are compared in combination with MF2 in Figure 18. Utility function UF1, which only takes the job start time into account, results in the best average weighted response time. In this case, no attention is paid to the resulting job cost. For our selection of the machine objective function this means that the minimization of the free areas around the job is not regarded. The utility functions that include the job cost deliver inferior results in terms of the average weighted response times. The second best result originates from the usage of utility function UF3, in which, in contrast to UF1, the starting time and the job costs are equally weighted. All other utility combinations, in which either only the job costs (UF2) or unbalanced weights for the starting time and the job costs are used, lead to higher response times.
Fig. 14. The used squashed area of simulations with m128 and syn_org using different machine and utility functions
Fig. 15. The resulting utilization of simulations with m128 and syn_org using different machine and utility functions
Fig. 16. The resulting average weighted response and wait times of simulations with m128 and syn_org using different machine and utility functions
Fig. 17. The resulting average weighted response time for resource configuration m128, utility function UF3 and several machine functions
Fig. 18. The resulting average weighted response time for resource configuration m128, machine function MF2 and several utility functions

Note that the execution time of the simulations on a SUN Ultra III machine varied according to the chosen machine and user utility functions. For example, the scheduling of 10000 jobs required about 1 hour, which means that the scheduling of one job took about one second on average. Nevertheless, this highly depends on the number of available resources. In an actual implementation the search time can be limited by a parameter given by the user or chosen by a heuristic based on job length and/or job arrival rate.
6 Conclusion
In this paper we presented an infrastructure and an economic scheduling system for grid environments. The quality of the algorithm has been examined by discrete event simulations with different workloads (4, each with 10000 jobs), different machine configurations (4, each with a sum of 512 processors) and several parameter settings for owner and user utility functions. The results demonstrate that the economic model used provides results in the range of conventional algorithms in terms of the average weighted response time. In comparison, the economic method offers much higher flexibility in defining the desired resources. Also the problems of site autonomy, heterogeneous resources and individual owner policies are solved by the structure of this economic approach. Moreover, the owner and user utility functions may be
set individually for each job request. Additionally, features such as co-allocation and multi-site scheduling over different resource domains are supported. Especially the possibility of advance reservation of resources is an advantage. In comparison to conventional scheduling systems there is instant feedback by the scheduler on the expected execution time of a job already at submission time. Note that conventional schedulers based on list scheduling, such as backfilling, can also provide estimates or bounds on the completion time. However, the economic method presented in this paper leads to a specific allocation of start and end time as well as of the resources. Guarantees can be given and maintained if requested. This includes the submission of jobs that request a specific start and end time, which is also necessary for co-allocating resources. Note that the utility functions examined in the simulations are first approaches and leave room for further analysis and optimization. Nevertheless, the results presented in this paper indicate that an appropriate utility function for a given resource configuration delivers steady performance on different workloads.

Further research is necessary to extend the presented model to incorporate the network as a limited resource which has to be managed and scheduled as well. In this case a network service can be designed similar to a managed computing resource which provides information on offers or guarantees for possible allocations, e.g. bandwidth or quality-of-service features. A more extensive parameter study is necessary for comprehensive knowledge of the influence of the utility functions on cost and execution time. To this end, future work can analyze scenarios in which different objective functions are assigned to each domain. Also the effect of a larger number of machines and domains in the grid must be evaluated. The presented architecture in general provides support for re-scheduling, i.e. improving the schedule by continuously exploring alternative offers for existing allocations. This feature should be examined in more detail for optimizing the schedule as well as for re-organizing the schedule in case of a system or job failure.
References

[1] N. Bogan. Economic Allocation of Computation Time with Computation Markets. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May 1994.
[2] R. Buyya, D. Abramson, J. Giddy, and H. Stockinger. Economic Models for Resource Management and Scheduling in Grid Computing. Special Issue on Grid Computing Environments, The Journal of Concurrency and Computation: Practice and Experience (CCPE), May 2002 (accepted for publication).
[3] R. Buyya, J. Giddy, and D. Abramson. An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications. In The Second Workshop on Active Middleware Services (AMS 2000), in conjunction with the Ninth IEEE International Symposium on High Performance Distributed Computing (HPDC 2000), Pittsburgh, USA, August 2000. Kluwer Academic Press.
[4] K. Czajkowski, I. Foster, C. Kesselman, S. Martin, W. Smith, and S. Tuecke. A Resource Management Architecture for Metacomputing Systems. In Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 62–68. Springer-Verlag, 1998.
[5] European Grid Forum, http://www.egrid.org, August 2002.
[6] C. Ernemann, V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. On Advantages of Grid Computing for Parallel Job Scheduling. In Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2002), Berlin, pages 39–46, 2002.
[7] C. Ernemann, V. Hamscher, A. Streit, and R. Yahyapour. On Effects of Machine Configurations on Parallel Job Scheduling in Computational Grids. In International Conference on Architecture of Computing Systems (ARCS 2002), pages 169–179. VDE-Verlag, April 2002.
[8] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, IPPS'97 Workshop: Job Scheduling Strategies for Parallel Processing, pages 1–34. Springer-Verlag, Lecture Notes in Computer Science LNCS 1291, 1997.
[9] D. G. Feitelson and A. M. Weil. Utilization and Predictability in Scheduling the IBM SP2 with Backfilling. In Proceedings of IPPS/SPDP 1998, pages 542–546. IEEE Computer Society, 1998.
[10] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128, 1997.
[11] I. Foster and C. Kesselman, editors. The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.
[12] The Grid Forum, http://www.gridforum.org, August 2002.
[13] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. Evaluation of Job-Scheduling Strategies for Grid Computing. Lecture Notes in Computer Science, 1971:191–202, 2000.
[14] S. Hotovy. Workload Evolution on the Cornell Theory Center IBM SP2. In D. G. Feitelson and L. Rudolph, editors, IPPS'96 Workshop: Job Scheduling Strategies for Parallel Processing, pages 27–40. Springer-Verlag, Lecture Notes in Computer Science LNCS 1162, 1996.
[15] J. Krallmann, U. Schwiegelshohn, and R. Yahyapour. On the Design and Evaluation of Job Scheduling Systems. In D. G. Feitelson and L. Rudolph, editors, IPPS/SPDP'99 Workshop: Job Scheduling Strategies for Parallel Processing. Springer-Verlag, Lecture Notes in Computer Science, 1999.
[16] D. A. Lifka. The ANL/IBM SP Scheduling System. In D. G. Feitelson and L. Rudolph, editors, IPPS'95 Workshop: Job Scheduling Strategies for Parallel Processing, pages 295–303. Springer-Verlag, Lecture Notes in Computer Science LNCS 949, 1995.
[17] M. Livny and R. Raman. High-Throughput Resource Management. In I. Foster and C. Kesselman, editors, The Grid - Blueprint for a New Computing Infrastructure, pages 311–337. Morgan Kaufmann, 1999.
[18] U. Schwiegelshohn and R. Yahyapour. Analysis of First-Come-First-Serve Parallel Job Scheduling. In Proceedings of the 9th SIAM Symposium on Discrete Algorithms, pages 629–638, January 1998.
[19] U. Schwiegelshohn and R. Yahyapour. Resource Allocation and Scheduling in Metasystems. In P. Sloot, M. Bibak, A. Hoekstra, and B. Hertzberger, editors, Proceedings of the Distributed Computing and Metacomputing Workshop at HPCN Europe, pages 851–860. Springer-Verlag, Lecture Notes in Computer Science LNCS 1593, April 1999.
[20] L. Smarr and C. E. Catlett. Metacomputing. Communications of the ACM, 35(6):44–52, June 1992.
[21] I. Stoica, H. Abdel-Wahab, and A. Pothen. A Microeconomic Scheduler for Parallel Computers. In D. G. Feitelson and L. Rudolph, editors, IPPS'95 Workshop: Job Scheduling Strategies for Parallel Processing, pages 200–218. Springer-Verlag, Lecture Notes in Computer Science LNCS 949, 1995.
[22] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu. Mariposa: A Wide-Area Distributed Database System. VLDB Journal, 5(1):48–63, 1996.
[23] Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/, August 2002.
[24] T. W. Malone, R. E. Fikes, K. R. Grant, and M. T. Howard. Enterprise: A Market-like Task Scheduler for Distributed Computing Environments. In The Ecology of Computation, volume 2 of Studies in Computer Science and Artificial Intelligence, pages 177–255, 1988.
[25] P. Tucker. Market Mechanisms in a Programmed System. Department of Computer Science and Engineering, University of California, 1998.
[26] P. Tucker and F. Berman. On Market Mechanisms as a Software Technique, 1996.
[27] C. A. Waldspurger, T. Hogg, B. Huberman, J. O. Kephart, and W. S. Stornetta. Spawn: A Distributed Computational Economy. IEEE Transactions on Software Engineering, 18(2):103–117, 1992.
[28] W. Walsh, M. Wellman, P. Wurman, and J. MacKie-Mason. Some Economics of Market-based Distributed Scheduling. In Eighteenth International Conference on Distributed Computing Systems, pages 612–621, 1998.
[29] F. Ygge. Market-Oriented Programming and its Application to Power Load Management. PhD thesis, Department of Computer Science, Lund University, 1998.
SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems

Karl Czajkowski¹, Ian Foster²,³, Carl Kesselman¹, Volker Sander⁴, and Steven Tuecke²

¹ Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, U.S.A. {karlcz,carl}@isi.edu
² Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, U.S.A. {foster,tuecke}@mcs.anl.gov
³ Department of Computer Science, The University of Chicago, Chicago, IL 60657, U.S.A.
⁴ Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich, 52425 Jülich, Germany
Abstract. A fundamental problem in distributed computing is to map activities such as computation or data transfer onto resources that meet requirements for performance, cost, security, or other quality of service metrics. The creation of such mappings requires negotiation among application and resources to discover, reserve, acquire, configure, and monitor resources. Current resource management approaches tend to specialize for specific resource classes, and address coordination across resources only in a limited fashion. We present a new approach that overcomes these difficulties. We define a resource management model that distinguishes three kinds of resource-independent service level agreements (SLAs), formalizing agreements to deliver capability, perform activities, and bind activities to capabilities, respectively. We also define a Service Negotiation and Acquisition Protocol (SNAP) that supports reliable management of remote SLAs. Finally, we explain how SNAP can be deployed within the context of the Globus Toolkit.
1 Introduction
A common requirement in distributed computing systems such as Grids [17, 20] is to negotiate access to, and manage, resources that exist within different administrative domains than the requester. Acquiring access to these remote resources is complicated by the competing needs of the client and the resource owner. The client needs to understand and affect resource behavior, often requiring assurance or guarantee on the level and type of service being provided by the resource. Conversely, the owner wants to maintain local control and discretion over how the resource can be used. Not only does the owner want to control usage policy,
he often wants to restrict how much service information is exposed to clients. A common means for reconciling these two competing demands is to negotiate a service-level agreement (SLA), by which a resource provider "contracts" with a client to provide some measurable capability or to perform a task. An SLA allows clients to understand what to expect from resources without requiring detailed knowledge of competing workloads or resource owners' policies. This concept holds whether the managed resources are physical equipment, data, or logical services.

However, negotiation of SLAs for distributed Grid applications is complicated by the need to coordinate access to multiple resources simultaneously. For example, large distributed simulations [5] can require access to many large computational resources at one time. On-line experiments [41] require that computational resources be available when the experiment is being conducted, and processing pipelines such as data-transfer [22], data-analysis [26, 3] and visualization pipelines [9] require simultaneous access to a balanced resource set. Given that each of the resources in question may be owned and operated by a different provider, establishing a single SLA across all of the desired resources is not possible. Our solution to this problem is to define a resource management model in which management functions are decomposed into different types of SLAs that can be composed incrementally, allowing for coordinated management across the desired resource set. Specifically, we propose three different types of SLAs:

– Task service level agreements (TSLAs) in which one negotiates for the performance of an activity or task. A TSLA is, for example, created by submitting a job description to a queuing system. The TSLA characterizes a task in terms of its service steps and resource requirements.
– Resource service level agreements (RSLAs) in which one negotiates for the right to consume a resource. An RSLA can be negotiated without specifying for what activity the resource will be used. For example, an advance reservation takes the form of an RSLA. The RSLA characterizes a resource in terms of its abstract service capabilities.
– Binding service level agreements (BSLAs) in which one negotiates for the application of a resource to a task. For example, an RSLA promising network bandwidth might be applied to a particular TCP socket, or an RSLA promising parallel computer nodes might be applied to a particular job task. The BSLA associates a task, defined either by its TSLA or some other unique identifier, with the RSLA and the resource capabilities that should be met by exploiting the RSLA.

As illustrated in Figure 1, the above SLAs define a resource management model in which one can submit tasks to be performed, get promises of capability, and lazily bind the two. By combining these agreements in different ways, we can represent a variety of resource management approaches including: batch submission, resource brokering, co-allocation and co-scheduling. One concrete example of a lazily-established BSLA might be to increase the number of physical memory pages bound to a running process, based on
Fig. 1. Three types of SLA—RSLA, TSLA, and BSLA—allow a client to schedule resources as time progresses from t0 to t6 . In this case, the client acquires two resource promises (RSLAs) for future times; a complex task is submitted as the sole TSLA, utilizing RSLA 1 to get initial portions of the job provisioned; later, the client applies RSLA 2 to accelerate provisioning of another component of the job; finally, the last piece of the job is provisioned by the manager without an explicit RSLA
observed data regarding the working-set size of the service. Another example is network QoS: a reservation regarding the path between two Internet host addresses may guarantee a client a minimum bandwidth flow as an RSLA. The client must bind TCP socket addresses to this reserved capability at runtime as a BSLA—the sockets are identifiable “tasks” most likely not managed with a TSLA. The complexity of real-world scenarios is addressed with combinations of such SLAs. The proposed SLA model is independent of the service being managed—the semantics of specific services are accommodated by the details of the agreement, and not in the types of agreements negotiated. Because of its general applicability, we refer to the protocols used to negotiate these SLAs as the Service Negotiation and Acquisition Protocol (SNAP). The service management approach proposed here extends techniques first developed within the Globus Toolkit’s GRAM service [8] and then extended in the experimental GARA system [21, 22, 36]. An implementation of this architecture and protocol can leverage a variety of existing infrastructure, including the Globus Toolkit’s Grid Security Infrastructure [19] and Monitoring and Discovery Service [10]. We expect the SNAP protocol to be easily implemented within the Open Grid Services Architecture (OGSA) [18, 39], which provides request transport, security, discovery, and monitoring. The remainder of this paper has the following structure: in Section 2 we present several motivating scenarios to apply SLA models to Grid RM problems; in Section 3 we present the SNAP protocol messages and state model, which embed a resource and task language characterized in Section 4. In Section 5, we briefly formalize the relationship between the various SLA and resource languages in terms of their satisfaction or solution spaces. Finally, in Sections 6 and 7, we describe how SNAP can be implemented in the context of Globus services and relate it to other QoS and RM work.
2 Motivating Scenarios
The SNAP SLA model is designed to address a broad range of applications through the aggregation of simple SLAs. In this section we examine two common scenarios: a Grid with "community schedulers" mediating access to shared resources on behalf of different client groups, and a file-transfer scenario where QoS guarantees are exploited to perform data staging under deadline conditions.

2.1 Community Scheduler Scenario
A community scheduler (sometimes referred to as a resource broker) is an entity that acts as an intermediary between the community and its resources: activities are submitted to the community scheduler rather than to the end resource, and the activities are scheduled onto community resources in such a way as to optimize the community's use of its resource set. As depicted in Figure 2, a Grid environment may contain many resources (R1–R6), all presenting an RSLA interface as well as a TSLA interface. Optimizing the use of resources across the community served by the scheduler is only possible if the scheduler has some control over the resources used by the community. Hence the scheduler negotiates capacity guarantees via RSLAs with a pool of underlying resources, and exploits those capabilities via TSLAs and BSLAs. This set of agreements abstracts away the impact of other community schedulers as well as any "non-Grid" local workloads, assuming the resource managers enforce SLA guarantees at the resources. Community scheduler services (S1 and S2 in Figure 2) present a TSLA interface to users. Thus a community member can submit a task to the scheduler by negotiating a TSLA, and the scheduler in turn hands this off to a resource by
Fig. 2. Community scheduler scenario. Multiple users (J1–J7) gain access to shared resources (R1–R6). Community schedulers (S1–S2) mediate access to the resources by making TSLAs with the users and in turn making RSLAs and TSLAs with the individual resources
binding this TSLA against one of the existing RSLAs. The scheduler may also offer an RSLA interface. This would allow applications to co-schedule activities across communities, or combine community-scheduled resources with additional non-community resources. The various SLAs offered by the community scheduler and underlying resources result in a very flexible resource management environment. Users in this environment interact with community and resource-level schedulers as appropriate for their goals and privileges. A privileged client with a batch job such as J7 in Figure 2 may not need RSLAs, nor the help of a community scheduler, because the goals are expressed directly in the TSLA with resource R6. The interactive job J1 needs an RSLA to better control its performance. Jobs J2 to J6 are submitted to community schedulers S1 and S2, which might utilize special privileges or domain-specific knowledge to efficiently implement their community jobs. Note that not all users require RSLAs from the community scheduler, but S1 does act as an RSLA "reseller" between J2 and resource R3. Scheduler S1 also maintains a speculative RSLA with R1 to more rapidly serve future high-priority job requests.

2.2 File Transfer Scenarios
In these scenarios, we consider that the activity requested by the user is to transfer a file from one storage system to another. Generalizing the community scheduler example, we augment the behavior of the scheduler to understand that a transfer requires storage space on the destination resource, and network and endpoint I/O bandwidth during the transfer. The key to providing this service is the ability of the scheduler to manage multiple resource types and perform co-scheduling of these resources. File Transfer Service As depicted in Figure 3, the file transfer scheduler S1 presents a TSLA interface, and a network resource manager R2 presents an RSLA interface. A user submits a transfer job such as J1 to the scheduler with a deadline. The scheduler obtains a storage reservation on the destination resource R3 to be sure that there will be enough space for the data before attempting the transfer. Once space is allocated, the scheduler obtains bandwidth reservations from the network and the storage devices, giving the scheduler confidence that the transfer can be completed within the user-specified deadline. Finally, the scheduler submits transfer endpoint jobs J2 and J3 to implement the transfer J1 using the space and bandwidth promises. Job Staging with Transfer Service SLAs can be linked together to address more complex resource co-allocation situations. We illustrate this considering a job that consists of a sequence of three activities: data is transferred from a storage system to an intermediate location, some computation is performed using the data, and the result is transferred to a final destination. The computation is performed on resources allocated to a community of users. However, for
Fig. 3. File transfer scenario. File transfer scheduler coordinates disk and network reservations before co-scheduling transfer endpoint jobs to perform transfer jobs for clients
security reasons, the computation is not performed using a group account, but rather, a temporary account is dynamically created for the computation (In [32], we describe a community authorization service which can be used to authorize activities on behalf of a user community). In Figure 4, TSLA1 represents a temporary user account, such as might be established by a resource for a client who is authorized through a Community Authorization Service. All job interactions by that client on the resource become linked to this long-lived SLA—in order for the account to be reclaimed safely, all dependent SLAs must be destroyed. The figure illustrates how the individual SLAs associated with the resources and tasks can be combined to address the end-to-end resource and task management requirements of the entire job. Of interest in this example are:
TSLA1 account tmpuser1 RSLA1 50 GB in /scratch filesystem BSLA1 30 GB for /scratch/tmpuser1/foo/* files TSLA2 Complex job TSLA3
1111111 0000000 0000000 1111111 0000000 1111111 0000000 1111111 0000000 1111111 0000000 1111111 000000 111111 0000000 1111111 0000000 1111111 000000 111111 0000000 1111111 0000000 1111111 000000 111111
Stage in
Stage out
TSLA4
RSLA2 Net BSLA2
time
Fig. 4. Dependent SLAs for file transfers associated with input and output of a job with a large temporary data space. BSLA2 is dependent on TSLA4 and RSLA2, and has a lifetime bound by those two
TSLA1 is the above-mentioned temporary user account.
RSLA1 promises the client 50 GB of storage in a particular file-system on the resource.
BSLA1 binds part of the promised storage space to a particular set of files within the file-system.
TSLA2 runs a complex job which will spawn constituent parts for staging of input and output data.
TSLA3 is the first file transfer task, to stage the input to the job site, without requiring any additional QoS guarantees in this case.
TSLA4 is the second file transfer task, to stage the large output from the job site, under a deadline, before the local file-system space is lost.
RSLA2 and BSLA2 are used by the file transfer service to achieve the additional bandwidth required to complete the (large) transfer before the deadline.

The job scheduled by TSLA2 might have built-in logic to establish the staging jobs TSLA3 and TSLA4, or this logic might be part of the provider performing task TSLA2 on behalf of the client. In the figure, the nesting of SLA "boxes" is meant to illustrate how the lifetime of these management abstractions can be linked in practice. Such linkage can be forced by a dependency between the subjects of the SLAs, e.g. BSLA2 is meaningless beyond the lifetime of TSLA4 and RSLA2, or optionally added as a management convenience, e.g. triggering recursive destruction of all SLAs from the root to hasten reclamation of application-grouped resources.

2.3 Resource Virtualization
In the preceding scenarios, the Community Scheduler can be viewed as virtualizing a set of resources from other managers for the benefit of its community of users. This type of resource virtualization is important as it helps implement the trust relationships that are exploited in Grid applications. The user community trusts their scheduler to form agreements providing resources (whether basic hardware capabilities or complex service tasks), and the scheduler has its own trust model for determining what resources are acceptable targets for the community workload. Another type of virtualization in dynamic service environments like the Open Grid Service Architecture (OGSA) is captured in the factory service model [18]. A SNAP manager in such an environment produces SLAs, providing a long-lived contact point to initiate and manage the agreements. The SLA factory exposes the agreements as set of short-lived, stateful services which can be manipulated to control one SLA. Resource virtualization is particularly interesting when a TSLA schedules a job which can itself provide Grid services. This process is described for “active storage” systems in [26] and [9], where data extraction jobs convert a compute cluster with parallel storage into an application-specialized data server. The submission of a TSLA running such jobs can be thought of as the dynamic deployment of new services “on demand,” a critical property for a permanent, but adaptive, global Grid [20].
Fig. 5. Agreement state transitions. State of SLAs is affected by client requests (solid arrows) and other internal behaviors in the manager (dashed arrows)
3 The SNAP Agreement Protocol
The core of the SNAP architecture is a client-service interaction used to negotiate SLAs. The protocol applies equivalently when talking to authoritative, localized resource owners or to intervening brokers. We describe each operation in terms of unidirectional messages sent from client to service or service to client. All of these operations follow a client-server remote procedure-call (RPC) pattern, so we assume the underlying transport will provide correlation of the initiating and responding messages. One way of interpreting the following descriptions is that the client-to-service message corresponds to the RPC, and the return messages represent the possible result values of the call. This interpretation is consistent with how such a protocol would be deployed in a Web Services environment, using WSDL to model the RPC messages [7, 1].

3.1 Agreement State Transitions
Due to the dependence of BSLAs on RSLAs (and possibly on TSLAs), there are four states through which SNAP progresses, as depicted in Figure 5:

S0: SLAs either have not been created, or have been resolved by expiration or cancellation.
S1: Some TSLAs and RSLAs have been agreed upon, but may not be bound to one another.
S2: The TSLA is matched with the RSLA, and this grouping represents a BSLA to resolve the task.
S3: Resources are being utilized for the task and can still be controlled or changed.

As indicated in Figure 5 with solid arrows, client establishment of SLAs enters the state S1, and can also lead to state S2 by establishing BSLAs. It is possible for
the manager to unilaterally create a BSLA representing its schedule for satisfying a TSLA, and only the manager can move from a BSLA into a run-state S3 where resources are actively supporting a task. Either client termination requests, task completion, or faults may lead back to a prior state, including termination or failure of SLAs in state S0.

3.2 Agreement Meta-language
The SNAP protocol maintains a set of manager-side SLAs using client-initiated messages. All SLAs contain an SLA identifier I, the client c with whom the SLA is made, and an expiration time t_dead, as well as a specific TSLA, RSLA, or BSLA description d: ⟨I, c, t_dead, d⟩. Each SLA type defines its own descriptive content, e.g. resource requirements or task description. In this section we assume an extensible language J for describing tasks (jobs), with a subset language R ⊆ J capable of expressing resource requirements in J as well as apart from any specific task description. The necessary features of such a language are explored later in Section 4. We also assume a relation a′ ⊑ a, or a′ models a, which means that a′ describes the same terms of agreement as a but might possibly add additional terms or further restrict a constraint expressed in a. In other words, any time the conditions of SLA a′ are met, so are the conditions of a. This concept is examined more closely in Section 5.

RSLA Content An RSLA contains the (potentially complex) resource capability description r expressed in the R subset of the J language. Therefore, a complete RSLA in a manager has the form: ⟨I, c, t_dead, r⟩_R.

TSLA Content A TSLA contains the (potentially complex) job description j expressed in the J language. Therefore, a complete TSLA in a manager has the form: ⟨I, c, t_dead, j⟩_T. The description j also includes a resource capability description r = j↓_R which expresses what capability r is to be applied to the task, and using what RSLA(s). If the named RSLAs are not sufficient to satisfy r, the TSLA implies the creation of one or more RSLAs to satisfy j.

BSLA Content A BSLA contains the description j of an existing task in the language J. The description j may reference a TSLA for the task, or some other unique description in the case of tasks not initiated by a TSLA. Therefore, a complete stand-alone BSLA in a manager has the form: ⟨I, c, t_dead, j⟩_B.
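The agreement content can be mirrored one-to-one in code. The Python dataclasses below are a minimal sketch of the manager-side SLA records ⟨I, c, t_dead, d⟩; the class and field names are our own, and the R and J descriptions are reduced to plain strings for illustration only.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class SLA:
        identifier: str          # I, allocated via getident
        client: str              # c, the client the agreement is made with
        t_dead: float            # expiration time of the agreement

    @dataclass
    class RSLA(SLA):
        resource: str            # r, a capability description in the R language

    @dataclass
    class TSLA(SLA):
        job: str                 # j, a task description in the J language
        rslas: Tuple[str, ...] = ()   # identifiers of RSLAs referenced by j restricted to R

    @dataclass
    class BSLA(SLA):
        job: str                 # j, description of an existing task (or a TSLA identifier)
        rslas: Tuple[str, ...] = ()   # identifiers of RSLAs applied to the task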
Fig. 6. RM protocol messages (getident, useident, request, agree, setdeath, willdie, error). The protocol messages establish and maintain SLAs in the manager
As for TSLAs, the BSLA description j may reference existing RSLAs and if they do not satisfy the requirements in j, the BSLA implies the creation of one or more RSLAs to satisfy j. 3.3
Operations
Allocate Identifier Operation There are multiple approaches to obtaining unique identifiers suitable for naming agreements. To avoid describing a security infrastructure-dependent approach, we suggest a special light-weight agreement to allocate identifiers from a manager. This operation is analogous to opening a timed transaction in a database system. The client sends:

    getident(t_dead),

asking the manager to allocate a new identifier that will be valid until time t_dead. On success, the manager will respond:

    useident(I, t_dead),

and the client can then attempt to create reliable RM agreements using this identifier as long as the identifier is valid. A common alternative approach would fold the identifier allocation into an initial SLA request, requiring a follow-up acknowledgment or commit message from the client to complete the agreement. With the above separation of identifier allocation, we avoid confusing this reliable messaging problem with a different multi-phase negotiation process inherent in distributed co-reservation (where the concept of "commitment" is more generally applicable).
Agreement Operation A client negotiates an SLA using a valid identifier obtained using getident(. . .). The client issues a single message with arguments expressed in the agreement language from Section 3.2:

    request(I, c, t_dead, a).

The SLA description a captures all of the requirements of the client. On success, the manager will respond with a message of the form:

    agree(I, c, t_dead, a′),

where a′ ⊑ a as described in Sections 3.2 and 5. In other words, the manager agrees to the SLA description a′, and this SLA will terminate at t_dead unless the client performs a setdeath(I, t) operation to change its scheduled lifetime. A client is free to re-issue requests, and a manager is required to treat duplicate requests received after a successful agreement as being equivalent to a request for acknowledgment on the existing agreement. This idempotence is enabled by the unique identifier of each agreement.

Set Termination Operation We believe that idempotence (i.e. an at-most-once semantics) combined with expiration is well-suited to achieving fault-tolerant agreement. We define our operations as atomic and idempotent interactions that create SLAs in the manager. Each SLA has a termination time, after which a well-defined reclamation effect occurs. This termination effect can be exploited at runtime to implement a spectrum of negotiation strategies: a stream of short-term expiration updates could implement a heart-beat monitoring system [37] to force reclamation in the absence of positive signals, while a long-term expiration date guarantees SLAs will persist long enough to survive transient outages. With this operation, a client can set a new termination time for the identifier (and any agreement named as such). The client changes the lifetime by sending a message of the form:

    setdeath(I, t_dead),

where t_dead is the new wall-clock termination time for the existing SLA labeled by I. On success the manager will respond with the new termination time:

    willdie(I, t_dead),

and the client may reissue the setdeath(. . .) message if some failure blocks the initial response. Agreements can be abandoned with a simple request of setdeath(I, 0), which forces expiration of the agreement. The lifetime represented by t_dead is the lifetime of the agreement named by I. If the agreement makes promises about times in the future beyond its current lifetime, those promises expire with the SLA. Thus, it is a client's responsibility to extend or renew an SLA for the full duration required.
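The following Python sketch illustrates how a manager could keep the getident, request/agree, and setdeath/willdie exchanges idempotent by keying all state on the agreement identifier. It is our own illustration of the message semantics under simplifying assumptions, not a reference implementation of SNAP.

    import time
    import uuid

    class Manager:
        def __init__(self):
            self.slas = {}                        # identifier -> (client, t_dead, description) or None

        def getident(self, t_dead):
            identifier = str(uuid.uuid4())
            self.slas[identifier] = None          # identifier reserved, no agreement yet
            return ("useident", identifier, t_dead)

        def request(self, identifier, client, t_dead, description):
            # Duplicate requests are acknowledgements of the existing agreement;
            # new content on a known identifier is treated as an atomic change.
            # A real manager would check policy here and possibly refine the
            # description before agreeing.
            self.slas[identifier] = (client, t_dead, description)
            return ("agree", identifier, client, t_dead, description)

        def setdeath(self, identifier, t_dead):
            if t_dead <= time.time():
                self.slas.pop(identifier, None)   # setdeath(I, 0) forces expiration
            elif self.slas.get(identifier) is not None:
                client, _, description = self.slas[identifier]
                self.slas[identifier] = (client, t_dead, description)
            return ("willdie", identifier, t_dead)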
3.4 Change
Finally, we support the common idiom of atomic change by allowing a client to resend the request on the same SLA identifier, but with modified requirement content. The service will respond as for an initial request, or with an error if the given change is not possible from the existing SLA state. When the response indicates a successful SLA, the client knows that any preceding agreement named by I has been replaced by the new one depicted in the response. When the response indicates failure, the client knows that the state is unchanged from before the request. In essence, the service compares the incoming SLA request with its internal policy state to determine whether to treat it as a create, change, or lookup. The purpose of change semantics is to preserve state in the underlying resource behavior where that is useful, e.g. it is often possible to preserve an I/O channel or compute task when QoS levels are adjusted. Whether such a change is possible may depend both on the resource type, implementation, and local policy. If the change is refused, the client will have to initiate a new request and deal with the loss of state through other means such as task check-pointing. An alternative to implicit change would be an explicit change mechanism to perform structural editing of the existing SLA content, but we do not define concrete syntax for the R and J languages as would be needed to formalize such editing. Change is also useful to adjust the degree of commitment in an agreement. An expected use is to monotonically increase the level of commitment in a promise (or cancel it) as a client converges on an application schedule involving multiple resource managers. This use essentially implements a timed, multi-phase commit protocol across the managers which may be in different administrative domains. However, there is no architectural requirement for this monotonic increase—a client may also want to decrease the level of commitment if they lose confidence in their application plan and want to relax agreements with the manager.
4 Resource and Task Meta-language
The resource and scheduling language J assumed in Section 3 plays an important role in our architecture. Clients in general must request resources by property, e.g. by capability, quality, or configuration. Similarly, clients must understand their assignments by property so that they can have any expectation of delivery in an environment where other clients’ assignments and activities may be hidden from view. In this section we examine some of the structures we believe are required in this language, without attempting to specify a concrete syntax. As a general note, we believe that resource description must be dynamically extensible, and the correct mechanism for extension is heavily dependent on the technology chosen to implement SNAP. Sets of clients and resources must be able to define new resource syntax to capture novel devices and services, so the language should support these extensions in a structured way. However, a complex
new concept may sometimes be captured by composing existing primitives, and hopefully large communities will be able to standardize a relatively small set of such composable primitives.

4.1 Resource Metrics
Many resources have parameterized attributes, i.e. a metric describing a particular property of the resource such as bandwidth, latency, or space. Descriptions may scope these metrics to a window of time [t0, t1] in which the client desires access to a resource with the given qualities. We use generic scalar metrics and suggest below how they can be composed to model conventional resources. A scalar metric can exactly specify resource capacity. Often requirements are partially constraining, i.e. they identify ranges of capacity. We extend scalar metrics as unary inequalities to use the scalar metrics as a limit. The limit syntax can also be applied to time values, e.g. to specify a start time of "≤ t" for a provisioning interval that starts "on or before" the exact time t.

Time metrics t expressed in wall-clock time, e.g. "Wed Apr 24 20:52:36 UTC 2002."
Scalar metrics x u expressed in x real-valued units u, e.g. 512 Mbytes, or 10 × 10^-3 s/seek.
Max limit < m and ≤ m specify an exclusive or inclusive upper limit on the given metric m, respectively.
Min limit > m and ≥ m specify an exclusive or inclusive lower limit on the given metric m, respectively.

These primitives are "leaf" constructs in a structural resource description. They define a syntax, but some of their meaning is defined by the context in which they appear.

4.2 Resource Composites
The resource description language is compositional. Realistic resources can be modeled as composites of simpler resource primitives. Assuming a representation of resources r1, r2, etc., we can aggregate them using various typed constructs.

Set [r1, r2, . . .] combining arbitrary resources that are all required.
Typed Set [r1, r2, . . .]_type combining type-specific resources. Groups are marked with a type to convey the meaning of the collection of resources, e.g. [x1 bytes, x2 bytes/s]_disk might collect space and bandwidth metrics for a "file-system" resource.
Array n × r is an abbreviation for the group of n identical resource instances [r, r, . . . , r], e.g. for convenient expression of symmetric parallelism.

The purpose of typed groups is to provide meaning to the metric values inside—in practice the meaning would be denoted only in an external specification of
166
Karl Czajkowski et al.
the type, and the computer system interrogating instances of R will be implemented to recognize and process the typed composite. For example, the [x1 bytes, x2 bytes/s]disk composite tells us that we are constraining the speed and size of a secondary storage device with the otherwise ambiguous metrics for space and bandwidth. Resources are required over periods of time, i.e. from a start time t0 to an end time t1 , and we denote this as r[t0 ,t1 ] . A complex time-varying description can be composed of a sequence of descriptions with consecutive time intervals: [t0 ,tn ] [t ,t ] [t ,t ] [t ,t ] r = [r1 ] 0 1 , [r2 ] 1 2 , . . . , [rn ] n−1 n . Each subgroup within a composite must have a lifetime wholly included within the lifetime of the parent group. 4.3
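A composite such as the file-system example above could be encoded along the following lines; the Python names are again illustrative, and leaf metrics are abbreviated as strings for brevity.

```python
# Sketch of the conjunctive composites (Set, Typed Set, Array) with optional
# lifetimes; structure is ours, not a normative SNAP representation.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Composite:
    members: List[object]                              # leaves or nested composites
    type_tag: Optional[str] = None                     # e.g. "disk", "node", "mpi"
    interval: Optional[Tuple[float, float]] = None     # lifetime [t0, t1]

def array(n: int, member: object, type_tag: Optional[str] = None) -> Composite:
    """The n x r abbreviation: a group of n identical resource instances."""
    return Composite([member] * n, type_tag=type_tag)

# [ >=1*2^30 bytes, >=30*2^20 bytes/s ]^disk inside a dual-processor node,
# both required over the same one-hour window
disk = Composite([">= 1*2**30 bytes", ">= 30*2**20 bytes/s"],
                 type_tag="disk", interval=(0.0, 3600.0))
node = Composite([array(2, "100% cpu"), ">= 256*2**20 bytes", disk],
                 type_tag="node", interval=(0.0, 3600.0))
```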
4.3 Resource Alternatives
We define disjunctive alternatives to complement the conjunctive composites from Section 4.2. Alternative ∨ (r1, r2, . . .) differs from a resource set in that only one element ri must be satisfied. As indicated in the descriptions above, limit modifiers are only applicable to scalar metrics, while the alternative concept applies to all resource description elements. Alternatives can be used to express alternate solution spaces for the application requirements within distinct planning regimes, or to phrase similar requirements using basic and specialized metrics in the event that a client could benefit from unconventional extensions to J that may or may not be recognized by a given manager.
4.4 Resource Configuration
The final feature present in our description language is the ability to intermingle control or configuration directives within the resource statement. In an open environment, this intermingling is merely a notational convenience to avoid presenting two isomorphic statements—one modeling the requirements of the structured resource and one providing control data to the resource manager for the structured resource. Task configuration details are what are added to the language R to define the activity language J . Configure a := v specifies an arbitrary configuration attribute a should have value v. In an environment with limited trust and strict usage restrictions, some resources may be unavailable for certain configurations due to owner policy. We therefore suggest treating them as primitive metrics when considering the meaning of the description for resource selection, while also considering them as control data when considering the meaning of the description as an activity configuration.
    [ 128 × [ 2 × [ 100% ]^cpu , [ ≥ 256 × 2^20 bytes ]^ram , [ ≥ 1 × 2^30 bytes, ≥ 30 × 2^20 bytes/s ]^disk ]^node , net := myrinet , prog := /usr/bin/a.out ]^mpi
Fig. 7. Hypothetical resource description. A parallel computer with 128 dedicated dual-processor nodes, each providing at least 256 MB of memory and 1 GB disk with disk performance of 30 MB/s, connected by Myrinet-enabled MPI. A parse tree is provided to help illustrate the nested expression.
4.5 RSLA Binding
To support the referencing of RSLAs, we require a way to associate an existing RSLA with a sub-requirement in J : RSLA Binding [r, IB ]bind specifies requirement r but also says it should be satisfied using the RSLA identified by IB . This construct supports the explicit resource planning described in Section 3.2.
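A binding could be represented as a simple pairing of a requirement with the identifier of the RSLA expected to satisfy it, as in this hypothetical sketch.

```python
# Hypothetical pairing of a requirement with the identifier I_B of the RSLA
# that should satisfy it, mirroring the [r, I_B]^bind construct.
from dataclasses import dataclass

@dataclass
class RslaBinding:
    requirement: object    # a requirement from the R language
    rsla_id: str           # identifier I_B of an existing RSLA

bound_disk = RslaBinding(">= 1*2**30 bytes disk", rsla_id="rsla-42")
```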
5
SLA Constraint-Satisfaction Model
In a fully-developed SLA environment, one can imagine agreements including auditing commitments, negotiated payments or exchange of service, and remediation steps in case of agreement violation. However, in this paper we focus on a weaker form of agreement where clients more or less trust resource providers to act in good faith, and cost models for service are neither explicitly addressed nor proscribed. Nonetheless, the entire purpose of our protocol hinges on an understanding of satisfaction of SNAP SLAs. The satisfaction of an SLA requires a non-empty "solution set" of possible resource and task schedules which deliver the capabilities and perform the directives encoded in the J language elements within the SLA. A self-contradictory or unsatisfiable SLA has an empty solution set. We denote the ideal solution set with solution operators S_R(r) and S_J(j) which apply to descriptions in R or J. While the language R is assumed to be a syntactic subset of J, the set of solution sets {S_R(r) | r ∈ R} is a superset of the set of solution sets {S_J(j) | j ∈ J}, and given a projection of requirements j↓_R ∈ R, the solution set S_R(j↓_R) is a superset of S_J(j). This inversion occurs because the additional syntactic constructs in J are used to express additional task constraints beyond the resource capabilities expressible in R. We would like a relation between descriptions to capture this relationship between solution sets for the descriptions. We say that a refined description j′ models j, written j′ ⊑ j, if and only if S_J(j′) ⊆ S_J(j). This concept of refinement is used to define the relationship between requested and agreed-upon SLAs in the SLA negotiation of Section 3.3.
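As a minimal illustration of the refinement relation, the following sketch checks j′ ⊑ j for a single scalar metric whose solution set is an interval; real descriptions are structured, so this only hints at the general case.

```python
# Minimal check of the refinement ("models") relation for one scalar metric
# whose solution set is an interval [lo, hi]; open bounds use +/- infinity.
import math

def interval(lo=-math.inf, hi=math.inf):
    return (lo, hi)

def models(refined, general):
    """refined models general iff its solution set is a subset of general's."""
    (lo1, hi1), (lo2, hi2) = refined, general
    return lo1 >= lo2 and hi1 <= hi2

requested = interval(lo=256 * 2**20)   # ">= 256 MB of memory"
offered   = interval(lo=512 * 2**20)   # ">= 512 MB of memory"
assert models(offered, requested)      # the offer refines the request
```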
Fig. 8. Constraint domain. Lower items in the figure conservatively approximate higher items. The solution spaces on the right are ordered as subsets, e.g. Provisioning ⊆ Reserves because provisioning constrains a resource promise to a particular task. Solution ordering maps to the "model" relation for constraints, e.g. BSLA ⊑ RSLA on the left
Just as J is more expressive than R , BSLAs are more expressive than TSLAs or RSLAs. The TSLA says that a manager will “run job j according to its selfexpressed performance goals and provisioning requirements.” The RSLA says that a manager will “provide resource capability r when asked by the client.” A corresponding BSLA encompasses both of these and says the manager will “apply resource r to help satisfy requirements while performing job j.” Therefore we extend our use of the “models” relation to SLAs. This set-ordered structure in the SNAP concept domain is illustrated in Figure 8.
6
Implementing SNAP
The RM protocol architecture described in this article is general and follows a minimalist design principle in that the protocol captures only the behavior that is essential to the process of negotiation. We envision that SNAP would not be implemented as a stand alone protocol, but in practice would be layered on top of more primitive protocols and services providing functions such as communication, authentication, naming, discovery, etc. For example, the Open Grid Services Architecture [18] defines basic mechanisms for creating, naming, and controlling the lifetime of services. In the following, we explore how SNAP could be implemented on top of the OGSA service model. 6.1
Authentication and Authorization
Because Grid resources are both scarce and shared, a system of rules for resource use, or policy, is often associated with a resource to regulate its use [40]. We assume a wide-area security environment such as GSI [19] will be integrated
with the OGSA to provide mutually-authenticated identity information to SNAP managers such that they may securely implement policy decisions. Both upward information flow and downward agreement policy flow in a complex service environment, such as depicted in Figure 9, are likely subject to policy evaluation that distinguishes between individual clients and/or requests. 6.2
Resource Heterogeneity
The SNAP protocol agreements can be mapped onto a range of existing local resource managers, to deploy its beneficial capabilities without requiring wholesale replacement of existing infrastructure. Results from GRAM testbeds have shown the feasibility of mapping TSLAs onto a range of local job schedulers, as well as simple time-sharing computers [16, 6, 35]. The GARA prototype has shown how RSLAs and BSLAs can be mapped down to contemporary network QoS systems [21, 22, 36]. Following this model, SNAP manager services represent adaptation points between the SNAP protocol domain and local RM mechanisms. 6.3
Monitoring
A fundamental function for RM systems is the ability to monitor the health and status of individual services and requests. Existing Grid RM services such as GRAM and GARA include native protocol features to signal asynchronous state changes from a service to a client. In addition to these native features, some RM state information is available from a more generalized information service, e.g. GRAM job listings are published via the MDS in the Globus Toolkit [8, 21, 10]. We expect the OGSA to integrate asynchronous subscription/notification features. Therefore, we have omitted this function from the RM architecture presented here. An RM service implementation is expected to leverage this common infrastructure for its monitoring data path. We believe the agreement model presented in Sections 1, 3.2 and 3.1 suggest the proper structure for exposing RM service state to information clients, propagating through the upward arrows in Figure 9. Information index services can cache and propagate this information because life-cycle of the agreement state records is well defined in the RM protocol semantics, and the nested request language allows detailed description of agreement properties. 6.4
Resource and Service Discovery
SNAP relies on the ability for clients to discover RM services. We expect SNAP services to be discovered via a combination of general discovery and registry services such as the index capabilities of MDS-2 and OGSA, client configuration via service registries such as UDDI, and static knowledge about the community (Virtual Organization) under which the client is operating. The discovery
Fig. 9. An integrated SNAP system. Discovery services provide indexed views of resources, while SNAP managers provide distributed and aggregated resource brokering abstractions to users
information flow is exactly as for monitoring in Figure 9, with information propagating from resources upward through community indexes and into clients. In fact, discovery is one of the purposes for a general monitoring infrastructure. Due to the potential for virtualized resources described in Section 2.3, we consider “available resources” to be a secondary capability of “available services.” While service environments provide methods to map from abstract service names to protocol-level service addresses, it is also critical that services be discoverable in terms of their capabilities. The primary capability of a SNAP manager is the set of agreements it offers, i.e. that it is willing to establish with clients. 6.5
Multi-phase Negotiation
There are dynamic capabilities that also restrict the agreement space, including resource load and RM policy. Some load information may be published to help guide clients with their resource selection. However, proprietary policy including priorities and hidden SLAs may affect availability to specific classes of client. The agreement negotiation itself is a discovery process by which the client determines the willingness of the manager to serve the client. By formulating future agreements with weak commitment and changing them to stronger agreements, a client is able to perform a multi-phase commit process to discover more information in an unstructured environment. Resource virtualization helps discovery by aggregating policy knowledge into a private discovery service—a community scheduler can form RSLAs with application service providers and then expose this virtual resource pool through community-specific agreement offers.
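The multi-phase commit idiom sketched above might look as follows on the client side; the manager methods (request_rsla, set_death_time, cancel) are assumptions for illustration, not literal SNAP operation names.

```python
# Client-side sketch: obtain weakly committed (short-lived) RSLAs from every
# manager, then strengthen them all, rolling back if any step fails.
def co_reserve(managers, requirements, soft_ttl=60, hard_ttl=3600):
    held = []
    try:
        # Phase 1: collect tentative promises with short lifetimes.
        for mgr, req in zip(managers, requirements):
            held.append((mgr, mgr.request_rsla(req, lifetime=soft_ttl)))
        # Phase 2: the whole plan is feasible, so strengthen each promise.
        for mgr, sla_id in held:
            mgr.set_death_time(sla_id, lifetime=hard_ttl)
        return [sla_id for _, sla_id in held]
    except Exception:
        # Lost confidence in the plan: relax/cancel whatever is already held.
        for mgr, sla_id in held:
            mgr.cancel(sla_id)
        raise
```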
6.6
Standard Modeling Language
In Section 4 we present the abstract requirements of an expressive resource language J . These requirements include unambiguous encoding of provisioning metrics, job configuration, and composites. We also identify above the propagation of resource and agreement state through monitoring and discovery data paths as important applications of the resource language. For integration with the OGSA, we envision this language J being defined by an XML-Schema [14] permitting extension with new composite element types and leaf metric types. The name-space features of XML-Schema permit unambiguous extension of the language with new globally-defined types. This language serves the same purpose as RSL in GRAM/GARA [8, 11, 21, 22] or Class Ads in Condor [34, 27]. With SNAP, we are proposing a more extensible model for novel resource composites than RSL and a more rigorously typed extension model than Class Ads, two features which we believe are necessary for large-scale, inter-operable deployments. 6.7
Agreement Delegation
In the preceding protocol description, mechanisms are proposed to negotiate agreement regarding activity implementation or resource provisioning. These agreements capture a delegation of resource or responsibility between the negotiating parties. However, it is important to note that the delegation concept goes beyond these explicit agreements. There are analogous implicit delegations that also occur during typical RM scenarios. The TSLA delegates specific task-completion responsibilities to the scheduler that are "held" by the user. The scheduler becomes responsible for reliably planning and enacting the requested activity, tracking the status of the request, and perhaps notifying the user of progress or terminal conditions. The RSLA delegates specific resource capacity to the user that is held by the manager. Depending on the implementation of the manager, this delegation might be mapped down into one or more hidden operational policy statements that enforce the conditions necessary to deliver on the guarantee. For example, a CPU reservation might prevent further reservations from being made or an internal scheduling priority might be adjusted to "steal" resources from a best-effort pool when necessary. Transfers of rights and responsibilities are transitive in nature, in that an entity can only delegate that which is delegated to the entity. It is possible to form RSLAs out of order, but in order to exploit an RSLA, the dependent RSLAs must be valid. Such transitive delegation is limited by availability as well as trust between RM entities. A manager which over-commits resources will not be able to make good on its promises if too many clients attempt to use the RSLAs at the same time. Viewing RSLAs and TSLAs as delegation simplifies the modeling of heavy-weight brokers or service providers, but it also requires a trust/policy evaluation in each delegation step. A manager may restrict its delegations to only permit certain use of the resource by a client—this client may attempt to
broker the resource to other clients, but those clients will be blocked when they try to access the resource and the manager cannot validate the delegation chain. 6.8
Many Planners
Collective resource scenarios are the key motivation for Grid RM. In our architecture, the local resource managers do not solve these collective problems. The user, or an agent of the user, must obtain capacity delegations from each of the relevant resource managers in a resource chain. There are a variety of brokering techniques which may help in this situation, and we believe the appropriate technique must be chosen by the user or community. The underlying Grid RM architecture must remain open enough to support multiple concurrent brokering strategies across resources that might be shared by multiple user communities.
7
Other Related Work
Numerous researchers have investigated approaches to QoS delivery [23] and resource reservation for networks [12, 15, 42], CPUs [25], and other resources. Proposals for advance reservations typically employ cooperating servers that coordinate advance reservations along an end-to-end path [42, 15, 12, 24]. Techniques have been proposed for representing advance reservations, for balancing immediate and advance reservations [15], for advance reservation of predictive flows [12]. However, this work has not addressed the co-reservation of resources of different types. The Condor high-throughput scheduler can manage network resources for its jobs. However, it does not interact with underlying network managers to provide service guarantees [2] so this solution is inadequate for decentralized environments where network admission-control cannot be simulated in this way by the job scheduler. The concept of a bandwidth broker is due to Jacobson. The Internet 2 Qbone initiative and the related Bandwidth Broker Working Group are developing testbeds and requirements specifications and design approaches for bandwidth brokering approaches intended to scale to the Internet [38]. However, advance reservations do not form part of their design. Other groups have investigated the use of differentiated services (e.g., [43]) but not for multiple flow types. The co-reservation of multiple resource types has been investigated in the multimedia community: see, for example, [28, 31, 30]. However, these techniques are specialized to specific resource types. The Common Open Policy Service (COPS) protocol [4] is a simple protocol for the exchange of policy information between a Policy Decision Point (PDP) and its communication peer, called Policy Enforcement Point (PEP). Communication between PEP and PDP is done by using a persistent TCP connection in the form of a stateful request/decision exchange. COPS offers a flexible and extensible mechanism for the exchange of policy information by the use of the client-type object in its messages. There are currently two classes of COPS client:
Outsourcing provides an asynchronous model for the propagation of policy decision requests. Messages are initiated by the PEP which is actively requesting decisions from its PDP. Provisioning in COPS follows a synchronous model in which the policy propagation is initiated by the PDP. Both COPS models map easily to SNAP with the SNAP manager as a PDP and the resource implementation as a PEP. A SNAP client can also be considered a PDP which provisions policy (SLAs) to a SNAP manager which is then the PEP. There is no analogue to COPS outsourcing when considering the relationship between SNAP clients and managers. 7.1
GRAM
The Globus Resource Allocation Manager (GRAM) provides job submission on distributed compute resources. It defines APIs and protocols that allow clients to securely instantiate job running agreements with remote schedulers [8]. In [11], we presented a light-weight, opportunistic broker called DUROC that enabled simultaneous co-allocation of distributed resources by layering on top of the GRAM API. This broker was used extensively to execute large-scale parallel simulations, illustrating the challenge of coordinating computers from different domains and requiring out-of-band resource provisioning agreements for the runs [5, 6]. In exploration of end-to-end resource challenges, this broker was more recently used to acquire clustered storage nodes for real-time access to large scientific datasets for exploratory visualization [9]. 7.2
GARA
The General-purpose Architecture for Reservation and Allocation (GARA) provides advance reservations and end-to-end management for quality of service on different types of resources, including networks, CPUs, and disks [21, 22]. It defines APIs that allows users and applications to manipulate reservations of different resources in uniform ways. For networking resources, GARA implements a specific network resource manager which can be viewed as a bandwidth broker. In [36], we presented a bandwidth broker architecture and protocol that addresses the problem of diverse trust relationships and usage policies that can apply in multi-domain network reservations. In this architecture, individual BBs communicate via bilaterally authenticated channels between peered domains. Our protocol provides the secure transport of requests from source domain to destination domain, with each bandwidth broker on the path being able to enforce local policies and modify the request with additional constraints. The lack of a transitive trust relation between source- and end-domain is addressed by a delegation model where each bandwidth broker on the path being able to identify all upstream partners by accessing the credentials of the full delegation chain.
8
Conclusions
We have presented a new model and protocol for managing the process of negotiating access to, and use of, resources in a distributed system. In contrast to other architectures that focus on managing particular types of resources (e.g., CPUs or networks), our Service Negotiation and Acquisition Protocol (SNAP) defines a general framework within which reservation, acquisition, task submission, and binding of tasks to resources can be expressed for any resource in a uniform fashion. We have not yet validated the SNAP model and design in an implementation. However, we assert that these ideas have merit in and of themselves, and also note that most have already been explored in limited form within the current GRAM protocol and/or the GARA prototype system.
Acknowledgments We are grateful to many colleagues for discussions on the topics discussed here, in particular Larry Flon, Jeff Frey, Steve Graham, Bill Johnston, Miron Livny, Jeff Nick, and Alain Roy. This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38; by the National Science Foundation; by the NASA Information Power Grid program; and by IBM.
References [1] SOAP version 1.2 part 0: Primer. W3C Working Draft 17. www.w3.org/TR/soap12-part0/. 160 [2] Jim Basney and Miron Livny. Managing network resources in Condor. In Proc. 9th IEEE Symp. on High Performance Distributed Computing, 2000. 172 [3] Michael Beynon, Renato Ferreira, Tahsin M. Kurc, Alan Sussman, and Joel H. Saltz. Datacutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119–134, 2000. 154 [4] J. Boyle, R. Cohen, D. Durham, S. Herzog, R. Rajan, and A. Sastry. The COPS (Common Open Policy Service) protocol. IETF RFC 2748, January 2000. 172 [5] S. Brunett, D. Davis, T. Gottschalk, P. Messina, and C. Kesselman. Implementing distributed synthetic forces simulations in metacomputing environments. In Proceedings of the Heterogeneous Computing Workshop, pages 29–42. IEEE Computer Society Press, 1998. 154, 173 [6] Sharon Brunett, Karl Czajkowski, Steven Fitzgerald, Ian Foster, Andrew Johnson, Carl Kesselman, Jason Leigh, and Steven Tuecke. Application experiences with the Globus toolkit. In HPDC7, pages 81–89, 1998. 169, 173 [7] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web services description language (WSDL) 1.1. Technical report, W3C, 2001. http://www.w3.org/TR/wsdl/. 160
[8] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke. A resource management architecture for metacomputing systems. In The 4th Workshop on Job Scheduling Strategies for Parallel Processing, pages 62–82, 1998. 155, 169, 171, 173 ´ [9] Karl Czajkowski, Alper K. Demir, Carl Kesselman, and M. Thiebaux. Practical resource management for grid-based visual exploration. In Proc. 10th IEEE Symp. on High Performance Distributed Computing. IEEE Computer Society Press, 2001. 154, 159, 173 [10] Karl Czajkowski, Steven Fitzgerald, Ian Foster, and Carl Kesselman. Grid information services for distributed resource sharing. In Proc. 10th IEEE Symp. on High Performance Distributed Computing. IEEE Computer Society Press, 2001. 155, 169 [11] Karl Czajkowski, Ian Foster, and Carl Kesselman. Co-allocation services for computational grids. In Proc. 8th IEEE Symp. on High Performance Distributed Computing. IEEE Computer Society Press, 1999. 171, 173 [12] M. Degermark, T. Kohler, S. Pink, and O. Schelen. Advance reservations for predictive service in the internet. ACM/Springer Verlag Journal on Multimedia Systems, 5(3), 1997. 172 [13] D. Draper, P. Fankhauser, M. Fern´ andez, A. Malhotra, K. Rose, M. Rys, J. Sim´eon, and P. Wadler, editors. XQuery 1.0 Formal Semantics. W3C, March 2002. http://www.w3.org/TR/2002/WD-query-semantics-20020326/. 177 [14] D. C. Fallside. XML schema part 0: Primer. Technical report, W3C, 2001. http://www.w3.org/TR/xmlschema-0/. 171 [15] D. Ferrari, A. Gupta, and G. Ventre. Distributed advance reservation of realtime connections. ACM/Springer Verlag Journal on Multimedia Systems, 5(3), 1997. 172 [16] I. Foster and C. Kesselman. The Globus project: A status report. In Proceedings of the Heterogeneous Computing Workshop, pages 4–18. IEEE Computer Society Press, 1998. 169 [17] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a Future Computing Infrastructure. Morgan Kaufmann Publishers, 1999. 153, 175 [18] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the grid: An open grid services architecture for distributed systems integration. Technical report, Globus Project, 2002. www.globus.org/research/papers/ogsa.pdf. 155, 159, 168 [19] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational grids. In ACM Conference on Computers and Security, pages 83–91. ACM Press, 1998. 155, 168 [20] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. Intl. Journal of High Performance Computing Applications, 15(3):200–222, 2001. http://www.globus.org/research/papers/anatomy.pdf. 153, 159 [21] I. Foster, A. Roy, and V. Sander. A Quality of Service Architecture that Combines Resource Reservation and Application Adaptation. In International Workshop on Quality of Service, 2000. 155, 169, 171, 173 [22] I. Foster, A. Roy, V. Sander, and L. Winkler. End-to-End Quality of Service for High-End Applications. Technical report, Argonne National Laboratory, Argonne, 1999. http://www.mcs.anl.gov/qos/qos papers.htm. 154, 155, 169, 171, 173 [23] Roch Gu´erin and Henning Schulzrinne. Network quality of service. In [17], pages 479–503. 172
[24] A. Hafid, G. Bochmann, and R. Dssouli. A quality of service negotiation approach with future reservations (nafur): A detailed study. Computer Networks and ISDN Systems, 30(8), 1998. 172 [25] Hao hua Chu and Klara Nahrstedt. CPU service classes for multimedia applications. In Proceedings of IEEE International Conference on Multimedia Computing and Systems, pages 296–301. IEEE Computer Society Press, June 1999. Florence, Italy. 172 ¨ [26] Tahsin Kurc, Umit C ¸ ataly¨ urek, Chialin Chang, Alan Sussman, and Joel Salz. Exploration and visualization of very large datasets with the Active Data Repository. Technical Report CS-TR-4208, University of Maryland, 2001. 154, 159 [27] M. Livny. Matchmaking: Distributed resource management for high throughput computing. In Proc. 7th IEEE Symp. on High Performance Distributed Computing, 1998. 171 [28] A. Mehra, A. Indiresan, and K. Shin. Structuring communication software for quality-of-service guarantees. In Proc. of 17th Real-Time Systems Symposium, December 1996. 172 [29] R. Milner, M. Tofte, R. Harper, and D. MacQueen. The Definition of Standard ML (Revised). MIT Press, 1997. 177 [30] K. Nahrstedt, H. Chu, and S. Narayan. QoS-aware resource management for distributed multimedia applications. Journal on High-Speed Networking, IOS Press, December 1998. 172 [31] K. Nahrstedt and J. M. Smith. Design, implementation and experiences of the OMEGA end-point architecture. IEEE JSAC, Special Issue on Distributed Multimedia Systems and Technology, 14(7):1263–1279, September 1996. 172 [32] L. Pearlman, V. Welch, I. Foster, C. Kesselman, and S. Tuecke. A community authorization service for group collaboration. In The IEEE 3rd International Workshop on Policies for Distributed Systems and Networks, June 2002. 158 [33] Gordon Plotkin. A structural approach to operational semantics. Technical Report DAIMI FN-19, Computer Science Department, Aarhus University, 1981. 177 [34] Rajesh Raman, Miron Livny, and Marvin Solomon. Resource management through multilateral matchmaking. In Proc. 9th IEEE Symp. on High Performance Distributed Computing, 2000. 171 [35] L. Rodrigues, K. Guo, P. Verissimo, and K. Birman. A dynamic light-weight group service. Journal on Parallel and Distributed Computing, (60):1449–1479, 2000. 169 [36] V. Sander, W. A. Adamson, I. Foster, and A. Roy. End-to-End Provision of Policy Information for Network QoS. In Proc. 10th IEEE Symp. on High Performance Distributed Computing, 2001. 155, 169, 173 [37] P. Stelling, I. Foster, C. Kesselman, C. Lee, and G. von Laszewski. A fault detection service for wide area distributed computations. In Proc. 7th IEEE Symp. on High Performance Distributed Computing, pages 268–278, 1998. 163 [38] B. Teitelbaum, S. Hares, L. Dunn, V. Narayan, R. Neilson, and F. Reichmeyer. Internet2 QBone - Building a testbed for differentiated services. IEEE Network, 13(5), 1999. 172 [39] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, and C. Kesselman. Grid services specification. Technical report, Globus Project, 2002. www.globus.org/research/papers/gsspec.pdf. 155
[40] J. Vollbrecht, P. Calhoun, S. Farrell, L. Gommans, G. Gross, B. de Bruijn, C. de Laat, M. Holdrege, and D. Spence. AAA authorization application examples. Internet RFC 2905, August 2000. 168 [41] Gregor von Laszewski, Ian Foster, Joseph A. Insley, John Bresnahan, Carl Kesselman, Mei Su, Marcus Thiebaux, Mark L. Rivers, Ian McNulty, Brian Tieman, and Steve Wang. Real-time analysis, visualization, and steering of microtomography experiments at photon sources. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 1999. 154 [42] L. C. Wolf and R. Steinmetz. Concepts for reservation in advance. Kluwer Journal on Multimedia Tools and Applications, 4(3), May 1997. 172 [43] Ikjun Yeom and A. L. Narasimha Reddy. Modeling TCP behavior in a differentiated-services network. Technical report, TAMU ECE, 1999. 172
A
SNAP Operational Semantics
Below, we provide a formal specification of the behavior of SNAP managers in response to agreement protocol messages. This might be useful to validate the behavior of an implementation, or to derive a model of the client's belief in the state of a negotiation. We use a variant of structural operational semantics (SOS) which is commonly used to illustrate the transformational behavior of a complex system [33, 29, 13]. We define our own system configuration model to retain as intuitive a view of protocol messaging as possible. Our SOS may appear limited in that it only models the establishment of explicit SLAs without capturing the implicitly-created SLAs mentioned in Section 3.3. We think these implicit SLAs should not be part of a standard, interoperable SNAP protocol model, though a particular implementation might expose them. There are four main parts to our SOS:

1. Agreement language. An important component of the semantics captures the syntax of a single agreement, embedding the resource language from Section 4.
2. Configuration language. The state of a negotiation that is evolving due to manager state and messages between clients and manager.
3. Service functions. A set of function declarations that abstract complex decision points in the SNAP model. These functions are characterized but not exactly defined, since they are meant to isolate the formal model from implementation details.
4. Transition rules. Inference rules showing how the configuration of a negotiation evolves under the influence of the service predicates and the passage of time.

This SOS is not sufficient to understand SNAP behavior and SLA meaning until a concrete language is specified to support the R ⊆ J languages proposed above.
A.1
Agreement Language
An agreement a appears in the SLA language A, a generic 4-tuple as introduced in Section 3.2:

    d ∈ D = R^R + J^T + J^B + ⊥
    a ∈ A = I × N × T × D

The domain D of SLA descriptions is a union of the individual descriptive languages described in Section 4. Because these descriptions share the same R ⊂ J language, we wrap them with a type designation to distinguish the content of RSLA, TSLA, and BSLA descriptions. An SLA containing the special ⊥-description represents an identifier which is allocated but not yet associated with SLA terms. Additional terminal domains I, N, and T are assumed for identifiers, client names, and time values, respectively.
A.2 Configuration Model
Abstractly, a configuration of negotiation between clients and a manager is a tuple of an input message queue Q, the agreement state A of the manager, an output message set X, and the manager's clock t:

    ⟨Q, A, X, t⟩

The syntax of the configuration is specified as follows using a mixture of BNF grammar and domain-constructors:

    q ∈ M_in := getident(c, t) | setdeath(I, c, t) | request(I, c, t, d) | clock(t)
    M_out := useident(I, c, t) | willdie(I, c, t) | agree(I, c, t, d) | error()
    ⟨Q, A, X, t⟩ ∈ M_in* × P(A) × P(M_out) × T

For the benefit of the following SOS rules, we include client identifiers in the message signatures which were omitted from the messages when presented in Section 3.
A.3 Service Functions
This formulation depends on a number of abstractions to isolate the implementation or policy-specific behavior of a SNAP manager. The following support functions are described in terms of their minimal behavioral constraints, without suggesting a particular implementation strategy.
Set Manipulation. We use polymorphic set operators + and − to add and remove distinct elements from a set, respectively:

    + : P(τ) × τ → P(τ) = λ S, v . S ∪ {v}
    − : P(τ) × τ → P(τ) = λ S, v . {x | x ∈ S ∧ x ≠ v}

Requirements Satisfaction. As discussed in Sections 3.2 and 5, we assume a relation ⊑ between descriptions indicating how their solution spaces are related:

    ⊑ : R × R → Bool
    ⊑ : J × J → Bool

Basic Services. Function authz maps a client name to a truth value, yielding true if and only if the client is authorized to participate in SNAP negotiations:

    authz : N → Bool

Function newident provides a new identifier that is distinct from all identifiers in the input agreement set:

    newident : A → I = λ A . i | ⟨i, . . .⟩ ∉ A

Initial Agreement. The "reserve," "schedule," and "bind" functions choose a new SLA to satisfy the client's request, or ⊥ (bottom) if the manager will not satisfy the request. Function reserve chooses a new RSLA:

    reserve : A × I × N × T × R → A
    reserve = λ A, I, c, t, r . ⟨I, c, t, r′^R⟩ | r′ ⊑ r, or ⊥

Function schedule chooses a new TSLA:

    schedule : A × I × N × T × J → A
    schedule = λ A, I, c, t, j . ⟨I, c, t, j′^T⟩ | j′ ⊑ j, or ⊥
Function bind chooses a new BSLA:

    bind : A × I × N × T × J → A
    bind = λ A, I, c, t, j . ⟨I, c, t, j′^B⟩ | j′ ⊑ j, or ⊥

Change Agreement. The "rereserve," "reschedule," and "rebind" functions choose a replacement SLA to satisfy the client's request as discussed in Section 3.4, or ⊥ if the manager will not satisfy the request. Function rereserve chooses a replacement RSLA:

    rereserve : A × I × N × T × R → A
    rereserve = λ A, I, c, t, r . ⟨I, c, t, r′^R⟩ | r′ ⊑ r, or ⊥

Function reschedule chooses a replacement TSLA:

    reschedule : A × I × N × T × J → A
    reschedule = λ A, I, c, t, j . ⟨I, c, t, j′^T⟩ | j′ ⊑ j, or ⊥

Function rebind chooses a replacement BSLA:

    rebind : A × I × N × T × J → A
    rebind = λ A, I, c, t, j . ⟨I, c, t, j′^B⟩ | j′ ⊑ j, or ⊥
A.4 Transition Rules
The following transition rules serve to describe how a SNAP configuration of manager SLA set and message environment evolves during and after negotiation. Input messages are processed according to these rules to change the SLA set of the manager and to issue response messages. Each transition is structured as an inference rule with a number of antecedent clauses followed by a consequent rewrite of the SNAP configuration:

    antecedent_1   . . .   antecedent_n
    ─────────────────────────────────────────────
    ⟨q.Q, A, X, t⟩ ⇒ ⟨Q, A′, X′, t′⟩

The first matching rule is used to rewrite the configuration.
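Read operationally, the rules below amount to a dispatch loop that tries each rule in order and falls through to the error clause. A rough sketch, with the rule functions left abstract, might look as follows; it is an illustration of the dispatch pattern, not a normative implementation.

```python
# Each rule function returns an updated (agreements, responses) pair, or None
# when its antecedents do not hold; the error clause is the fallback.
RULES = []  # e.g. [allocate_identifier, set_death, advance_clock,
            #       new_sla, repeat_sla, change_sla]

def step(agreements, outbox, clock, message):
    for rule in RULES:
        result = rule(agreements, message, clock)
        if result is not None:
            new_agreements, responses = result
            return new_agreements, outbox + responses, clock
    return agreements, outbox + [("error",)], clock   # error clause
```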
Lifetime Management. New identifiers are allocated as needed:

    authz(c)    t0 < t1    I = newident(A)    a = ⟨I, c, t1, ⊥⟩
    ─────────────────────────────────────────────────────────────────
    ⟨getident(c, t1).Q, A, X, t0⟩ ⇒ ⟨Q, A + a, X + useident(I, c, t1), t0⟩

Timeout changes affect existing agreements:

    a1 = ⟨I, c, t1, . . .⟩ ∈ A    a2 = ⟨I, c, t2, . . .⟩    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨setdeath(I, c, t2).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + willdie(I, c, t2), t0⟩

Clock advances trigger removal of stale agreements:

    t0 < t1    A′ = {⟨I, c, t, . . .⟩ | ⟨I, c, t, . . .⟩ ∈ A ∧ t > t1}
    ─────────────────────────────────────────────────────────────────
    ⟨clock(t1).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X, t1⟩

The clock message is not originated by clients, but rather synthesized within the implementation. It is formalized as a message to capture the isochronous transition semantics of the manager state with regard to messages and the passing time.

Initial Agreement. A new agreement is considered when a client requests an agreement on a stub identifier agreement.

New RSLA:

    t0 < t2    a1 = ⟨I, c, t1, ⊥⟩ ∈ A    a2 = ⟨I, c, t2, r′^R⟩ = reserve(A, I, c, t2, r)    r′ ⊑ r    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, r^R).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, r′^R), t0⟩

New TSLA:

    t0 < t2    a1 = ⟨I, c, t1, ⊥⟩ ∈ A    a2 = ⟨I, c, t2, j′^T⟩ = schedule(A, I, c, t2, j)    j′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^T).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′^T), t0⟩
New BSLA:

    t0 < t2    a1 = ⟨I, c, t1, ⊥⟩ ∈ A    a2 = ⟨I, c, t2, j′^B⟩ = bind(A, I, c, t2, j)    j′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^B).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′^B), t0⟩

Repeat Agreement. If a client requests an agreement on an existing agreement, and the existing agreement already satisfies the request, then a repeat acknowledgment is sent and the termination time of the existing agreement is adjusted to the current request.

Repeat RSLA:

    t0 < t2    a1 = ⟨I, c, t1, r′^R⟩ ∈ A    a2 = ⟨I, c, t2, r′^R⟩    r′ ⊑ r    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, r^R).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, r′^R), t0⟩

Repeat TSLA:

    t0 < t2    a1 = ⟨I, c, t1, j′^T⟩ ∈ A    a2 = ⟨I, c, t2, j′^T⟩    j′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^T).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′^T), t0⟩

Repeat BSLA:

    t0 < t2    a1 = ⟨I, c, t1, j′^B⟩ ∈ A    a2 = ⟨I, c, t2, j′^B⟩    j′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^B).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′^B), t0⟩

Change Agreement. If a client requests an agreement on an existing agreement of the same type, but the existing agreement does not satisfy the request, an SLA change is considered.
Change RSLA:

    t0 < t2    a1 = ⟨I, c, t1, r′^R⟩ ∈ A    a2 = ⟨I, c, t2, r′′^R⟩ = rereserve(A, I, c, t2, r)    r′′ ⊑ r    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, r^R).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, r′′^R), t0⟩

Change TSLA:

    t0 < t2    a1 = ⟨I, c, t1, j′^T⟩ ∈ A    a2 = ⟨I, c, t2, j′′^T⟩ = reschedule(A, I, c, t2, j)    j′′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^T).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′′^T), t0⟩

Change BSLA:

    t0 < t2    a1 = ⟨I, c, t1, j′^B⟩ ∈ A    a2 = ⟨I, c, t2, j′′^B⟩ = rebind(A, I, c, t2, j)    j′′ ⊑ j    A′ = A − a1 + a2
    ─────────────────────────────────────────────────────────────────
    ⟨request(I, c, t2, j^B).Q, A, X, t0⟩ ⇒ ⟨Q, A′, X + agree(I, c, t2, j′′^B), t0⟩

Error Clause. If none of the above inference rules match, this one signals an error to the client. A quality implementation would provide more elaborate error signaling content.

    ⟨q.Q, A, X, t⟩ ⇒ ⟨Q, A, X + error(), t⟩
Local versus Global Schedulers with Processor Co-allocation in Multicluster Systems Anca I.D. Bucur and Dick H.J. Epema Faculty of Information Technology and Systems Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
Abstract. In systems consisting of multiple clusters of processors which employ space sharing for scheduling jobs, such as our Distributed ASCI1 Supercomputer (DAS), co-allocation, i.e., the simultaneous allocation of processors to single jobs in different clusters, may be required. We study the performance of co-allocation by means of simulations for the mean response time of jobs depending on a set of scheduling decisions such as the number of schedulers and queues in the system, the way jobs with different numbers of components are distributed among these queues and the priorities imposed on the schedulers, and on the composition of the job stream.
1
Introduction
Over the last decade, clusters and distributed-memory multiprocessors consisting of hundreds or thousands of standard CPUs have become very popular. In addition, recent work in computational and data GRIDs [2, 12] enables applications to access resources in different and possibly widely dispersed locations simultaneously—that is, to employ processor co-allocation [8]—to accomplish their goals, effectively creating single multicluster systems. Most of the research on processor scheduling in parallel computer systems has been dedicated to multiprocessors and single-cluster systems, but hardly any attention has been devoted to multicluster systems. In this paper we study through simulations the performance of processor co-allocation policies in multicluster systems employing space sharing for rigid jobs [4], depending on several scheduling decisions and on the composition of the job stream. The scheduling decisions we consider are the number of schedulers and queues in the system, the way jobs with different numbers of components are distributed among queues and the priorities and restrictions imposed on the schedulers. Our performance metric is the mean job response time as a function of the utilization. Using co-allocation does not mean that all jobs have to be split into components and spread over the clusters, small jobs can also be submitted as singlecomponent jobs and go to a single cluster. In general, there is in the system 1
In this paper, ASCI refers to the Advanced School for Computing and Imaging in The Netherlands, which came into existence before, and is unrelated to, the US Accelerated Strategic Computing Initiative.
a mix of jobs with different numbers of components. In this context, an important decision to make is whether there will be one global scheduler with one global queue in the system, or more schedulers and in the second case how jobs will be divided among schedulers. Our results show that a multicluster which employs co-allocation and treats all job requests as unordered requests, i.e., the user specifies the numbers of processors needed in separate clusters but not the clusters, also improves the performance of single-component jobs by not restricting them to a cluster, and choosing from all the clusters in the system one where they fit. Evaluating different scheduling decisions, we find the best choice to be a system where there is one scheduler for each cluster, and all schedulers have global information and place jobs using co-allocation over the entire system. Our four-cluster Distributed ASCI Supercomputer (DAS) [10] was designed to assess the feasibility of running parallel applications across wide-area systems [5, 13, 17]. In the most general setting, GRID resources are very heterogeneous; in this paper we restrict ourselves to homogeneous multicluster systems, such as DAS. Showing the viability of co-allocation in such systems may be regarded as a first step in assessing the benefit of co-allocation in more general GRID environments.
2
The Model
In this section we describe our model of multicluster systems based on the DAS system. 2.1
The DAS System
The DAS [1, 10] is a wide-area computer system consisting of four clusters of identical Pentium Pro processors, one with 128, the other three with 24 processors each. The clusters are interconnected by ATM links for wide-area communications, while for local communication inside the clusters Myrinet LANs are used. The system was designed for research on parallel and distributed computing. On single DAS clusters a local scheduler is used that allows users to request a number of processors bounded by the cluster’s size, for a time interval which does not exceed an imposed limit. 2.2
The Workload
Although co-allocation is possible on the DAS, so far it has not been used enough to let us obtain statistics on the sizes of the jobs’ components. However, from the log of the largest cluster of the system we found that over a period of three months, the cluster was used by 20 different users who ran 30, 558 jobs. The sizes of the job requests took 58 values in the interval [1, 128], for an average of 23.34 and a coefficient of variation of 1.11; their density is presented in Fig. 1.
Fig. 1. The density of the job-request sizes for the largest DAS cluster (128 processors)
The results comply with the distributions we use for the job-component sizes in that there is an obvious preference for small numbers and powers of two. From the jobs considered, 28, 426 were recorded in the log with both starting and ending time, and we could compute their service time. Due to the fact that during working hours jobs are restricted to at most 15 minutes of service (they are automatically killed after that period), 94.45% of the recorded jobs ran less than 15 minutes. Figure 2 presents the density of service time values on the DAS, as it was obtained from the log. The average service time is 356.45 seconds and the coefficient of variation is 5.37. Still, not all jobs in the log were short: the longest one took around 15 hours to complete. Figure 3 divides the service times of the jobs into eight intervals: < 10s, 10 − 30s, 30 − 60s, 60 − 300s, 300 − 900s, 900 − 1800s, 1800 − 3600s, and > 3600s, each segment in the graph parallel to the horizontal axis corresponds to an interval. The vertical axis coordinate of any point of a segment represents the total number of jobs in that interval. In our simulations, beside an exponential distribution with mean 1 we also use for the service-time distribution the distribution derived from the log of the DAS, cut off at 900 seconds (which is the run-time limit during the day). The average service time for the jobs in the cut log is 62.66 and the coefficient of variation is 2.05. We made the choice to use both distributions because with the DAS distribution we obtain a more accurate, realistic evaluation of the DAS performance, but in the same time this distribution might be very specific and make our results hard to compare to those from other systems. On the other hand, the exponential distribution is less realistic but more general and more suited for analysis.
Fig. 2. The density of the service times for the largest DAS cluster (128 processors)
Fig. 3. The service times of jobs divided into eight main intervals
2.3
The Structure of the System
We model a multicluster system consisting of C clusters of processors, cluster i having Ni processors, i = 1, . . . , C. We assume that all processors have the same service rate. By a job we understand a parallel application requiring some number of processors, possibly in multiple clusters (co-allocation). Jobs are rigid, so the numbers of processors requested by and allocated to a job are fixed. We call a task the part of a job that runs on a single processor. We assume that jobs only
request processors and we do not include in the model other types of resources. For interarrival times we use exponential distributions. 2.4
The Structure of Job Requests and the Placement Policies
Jobs that require co-allocation have to specify the number and the sizes of their components, i.e., of the sets of tasks that have to go to the separate clusters. The distribution of the sizes of the job components is D(q), defined as follows: D(q) takes values on some interval [n1, n2] with 0 < n1 ≤ n2, and the probability of having job-component size i is p_i = q^i/Q if i is not a power of 2 and p_i = 3q^i/Q if i is a power of 2, with Q such that the p_i sum to 1. This distribution favours small sizes, and sizes that are powers of two, which has been found to be a realistic choice [11]. A job is represented by a tuple of C values, each of which is either generated from the distribution D(q) or is of size zero. We consider only unordered requests, where by the components of the tuple the job only specifies the numbers of processors it needs in the separate clusters, allowing the scheduler to choose the clusters for the components. Unordered requests model applications like FFT, where tasks in the same job component share data and need intensive communication, while tasks from different components exchange little or no information. To determine whether an unordered request fits, we try to schedule its components in decreasing order of their sizes on distinct clusters. We use Worst Fit (WF; pick the cluster with the largest number of idle processors) to place the components on clusters.
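The job model just described can be summarized in a short sketch: a D(q) sampler for component sizes and Worst Fit placement of an unordered request on distinct clusters, largest component first. Names and structure are ours, not taken from the authors' simulator.

```python
import random

def dq_sample(q=0.9, n1=1, n2=8):
    """Draw one component size from D(q): powers of two weighted three times higher."""
    sizes = list(range(n1, n2 + 1))
    weights = [3 * q**i if (i & (i - 1)) == 0 else q**i for i in sizes]
    total = sum(weights)                       # the normalizer Q
    return random.choices(sizes, weights=[w / total for w in weights])[0]

def worst_fit(components, idle):
    """Return the clusters chosen for each component, or None if the request does not fit.

    `idle` holds the number of idle processors per cluster; each component
    must be placed on a distinct cluster.
    """
    idle = list(idle)
    used, placement = set(), []
    for size in sorted(components, reverse=True):
        candidates = [c for c in range(len(idle)) if c not in used]
        if not candidates:
            return None
        cluster = max(candidates, key=lambda c: idle[c])   # Worst Fit
        if idle[cluster] < size:
            return None
        idle[cluster] -= size
        used.add(cluster)
        placement.append(cluster)
    return placement
```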
2.5 The Scheduling Policies
In a multicluster system where co-allocation is used, jobs can be either singlecomponent or multi-component, and in a general case both types are simultaneously present in the system. It is useful to make this division since the single-component jobs do not use co-allocation while multi-component jobs do. A scheduler dealing with the first type of jobs can be local to a cluster and does not need any knowledge about the rest of the system. For multi-component jobs, the scheduler needs global information for its decisions. Treating both types of jobs equally, or keeping single-component jobs local and scheduling only multi-component jobs globally over the entire multicluster system, having a single global scheduler or schedulers local to each cluster, all these are decisions that influence the performance of the system. We consider the following approaches: 1. [GS] The system has one global scheduler with one global queue, for both single- and multi-component jobs. All jobs are submitted to the global queue. The global scheduler knows at any moment the number of idle processors in each cluster and based on this information chooses the clusters for each job.
2. [LS] Each cluster has its own local scheduler with a local queue. All queues receive both single- and multi-component jobs and each local scheduler has global knowledge about the numbers of idle processors. However, singlecomponent jobs are scheduled only on the local cluster. The multi-component jobs are co-allocated over the entire system. In a scheduling step, all enabled queues are repeatedly visited, and in each round at most one job from each queue is started. When the job at the head of a queue does not fit, the queue is disabled until the next job departs from the system. At each job departure all the queues are enabled, in a fixed order. 3. [EQ] The system has both a global scheduler with a global queue, and local schedulers with local queues. Multi-component jobs go to the global queue and are scheduled by the global scheduler using co-allocation over the entire system. Single-component jobs are placed in one of the local queues and are scheduled by the local scheduler only on its corresponding cluster. When a job departs all queues are enabled, starting with the local queues. Then the queues are repeatedly visited in this order until no queue is enabled. This favours the local schedulers allowing them to try to place jobs before the global scheduler, but since with the chosen job stream compositions the load of the local queues is low (each of them receives maximum 12.5% of the jobs in the system — see Sect. 3), it is a bearable burden for the global scheduler. The opposite choice would be much to the disadvantage of the jobs in the local queues because, depending on the job stream composition, up to 75% of the jobs can be multi-component and go to the global queue; enabling first the global queue would give little chance to the local schedulers to fit their jobs. The order in which the local queues are enabled does not matter since those jobs are only started on the local clusters. 4. [GP] Again both global and local schedulers with their corresponding queues. Like before, the global queue receives the multi-component jobs while the single-component jobs are placed in the local queues. The local schedulers are allowed to start jobs only when the global scheduler has an empty queue. 5. [LP] Both global and local schedulers, but this time the local schedulers have priority: the global scheduler gets the permission to work only when at least one local queue is empty. When a job departs, if one or more of the local queues are empty first the global queue is enabled and then the local queues. If no local queue is empty only the local queues are enabled and repeatedly visited; the global queue is enabled and added to the list of queues which are visited when at least one of the local queues gets empty. 6. [LQ] Both global and local schedulers; at any moment either the local schedulers are allowed to work, or the global one, depending on the lengths of their queues. The global queue is enabled if it is longer than all the local queues, otherwise the local queues are enabled. This strategy might seem to favour the local schedulers (the global scheduler is only permitted to schedule jobs when its queue is longer than all the others), but our results show that this is not the case. It only takes into account the fact that each of the local schedulers accesses just one cluster, so they can be simultaneously enabled.
To allow the local schedulers to work only when more of their queues are longer than the global queue would be much to the disadvantage of the local schedulers, especially if the load of their queues is unbalanced. When the local queues receive only single-component jobs, the local schedulers manage disjoint sets of resources (a local scheduler starts jobs on a single cluster) and there is no need for coordination among them. However, for systems with both a global scheduler and local ones, or when the local schedulers also deal with the multi-component jobs and may use more clusters, the access to the data structures used in the process of scheduling (numbers of idle processors, queue lengths) has to be mutually exclusive since we made the choice to keep that data consistent at all moments. The global scheduler always uses global information since it does co-allocation over the entire system; except for the case when they also schedule multi-component jobs, the local schedulers only need access to the data associated to their own cluster. In the extreme case, GP can indefinitely delay the single-component jobs, and LP can do the same with the multi-component jobs. In practice, an aging mechanism has to be implemented in order to prevent this behaviour. In all the cases considered, both the local and the global schedulers use the First Come First Served (FCFS) policy to choose the next job to run. All the local schedulers are assumed to have the same load. We choose not to include communication in our model because it would not change the quality of the results since all policies are tested with identical job streams (the same numbers of components).
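The queue-enabling step that distinguishes the policies can be sketched as follows, using the LQ rule as the example; the other policies differ only in the enabling predicate, and the helper names are illustrative.

```python
def enabled_queues_lq(global_queue, local_queues):
    """LQ: enable the global queue iff it is longer than every local queue."""
    if len(global_queue) > max(len(q) for q in local_queues):
        return [global_queue]
    return list(local_queues)

def scheduling_round(enabled, try_start):
    """Visit enabled queues repeatedly (FCFS within each queue), starting at
    most one job per queue per round, until no head-of-queue job fits."""
    progress = True
    while progress:
        progress = False
        for queue in enabled:
            if queue and try_start(queue[0]):
                queue.pop(0)
                progress = True
```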
3
Performance Evaluation for the Different Scheduling Decisions
In this section we assess the performance of multicluster systems for the six scheduling policies introduced (Sect. 2.5) depending on the composition of the job stream. Jobs can have between 1 and 4 components, and the percentages of jobs with the different numbers of components influence the performance of the system. We consider the following cases:

– (25%, 25%, 25%, 25%) Equal percentages of 1-, 2-, 3- and 4-component jobs are submitted to the system.
– (100%, 0%, 0%, 0%) There are only 1-component jobs.
– (50%, 0%, 0%, 50%) Only 1- and 4-component jobs are present in the system, in equal percentages.
– (0%, 0%, 0%, 100%) There are only 4-component jobs.
– (50%, 25%, 25%, 0%) No 4-component jobs; half of the jobs are single-component ones.
– (0%, 50%, 50%, 0%) Just 2- and 3-component jobs in equal proportions.
– (50%, 50%, 0%, 0%) Only 1- and 2-component jobs are submitted.
The simulations in this section are for a system with 4 clusters of 32 processors each, and the job-component sizes are generated from D(0.9) on the interval [1, 8]. The simulation programs were implemented using the CSIM simulation package [15]. For all the graphs in this section we computed confidence intervals; they are at the 95% level. For the distribution of service times we use an exponential distribution with mean 1 in Sect. 3.1 and a distribution derived from the DAS log (see also Sect. 2.2) in Sect. 3.2.
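To make the simulation setup concrete, the sketch below shows how one synthetic job could be generated: a number of components is drawn according to one of the compositions listed above, each component gets a size, and the service time is exponential with mean 1. This is our own illustration, not the CSIM code used for the experiments, and sample_component_size stands in for drawing from the distribution D(0.9) on [1, 8] defined earlier in the paper.

    import random

    def make_job(composition, sample_component_size):
        # composition: probabilities of 1-, 2-, 3-, and 4-component jobs,
        # e.g. (0.25, 0.25, 0.25, 0.25)
        n_components = random.choices([1, 2, 3, 4], weights=composition)[0]
        sizes = [sample_component_size() for _ in range(n_components)]  # from D(0.9) on [1, 8]
        service_time = random.expovariate(1.0)                          # exponential, mean 1
        return sizes, service_time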
3.1 Simulations with Exponential Service Times
Figure 4 compares the different scheduling strategies for a job stream containing 1-, 2-, 3-, and 4-component jobs in equal proportions. The best performance is obtained for LS, where all jobs go to the local schedulers and all four schedulers are allowed to spread the multi-component jobs over the entire system. At any moment the system tries to schedule up to four jobs (when no queue is empty), one from each of the four local queues, and the FCFS policy is thus transformed into a form of backfilling with a window of size 4. This explains why LS is better than GS. A disadvantage of LS compared to GS is that LS can place 1-component jobs only on the cluster where they were submitted, while GS can choose from the four clusters one where the job fits. However, in the case of Fig. 4, only 25% of the jobs have one component, so their negative influence on the performance of LS is small.

GP, LP, EQ, and LQ try to schedule up to 5 jobs at a time, but since 75% of the jobs in the system are multi-component and all of them go to the global queue, while only the remaining 25% are distributed among the local queues, their performance is worse than that of LS. GP displays the worst performance; it gives priority to the global scheduler and only allows the local schedulers to run jobs when the global queue is empty. Even when the job at the head of the global queue does not fit, the policy does not allow jobs from the local queues to run, which deteriorates the performance. Since most of the jobs are multi-component, the global queue is the longest in most of the cases when a scheduling decision has to be taken, and LQ behaves similarly to GP. This explains why its performance is the second worst. LP and EQ also run mostly jobs from the global queue, but they do not delay the jobs from the local queues when the job at the head of the global queue does not fit, and this improves their performance. Moreover, they both favour the local jobs, a decision that also has a positive effect on performance.

Figures 5, 7, and 9 compare only the GS and LS strategies. The system in Fig. 5 contains only single-component jobs, so EQ, GP, LP, and LQ reduce to LS. In the other two cases there are only multi-component jobs, so EQ, GP, LP, and LQ become GS. We also used these cases to check our simulations and gain confidence in the results. When there are only single-component jobs in the system (Fig. 5), GS performs better because it chooses the clusters for the jobs (with WF), while with LS jobs can be scheduled only on the clusters they were submitted to. With single-component jobs, GS does a sort of load balancing over
Fig. 4. A performance comparison of the scheduling strategies for a job stream with composition (25%, 25%, 25%, 25%) and exponential service times
the entire system (although it does not look at the actual loads), while LS keeps the clusters in isolation. In Figures 7 and 9, LS proves to be better because for multi-component jobs the local schedulers are not restricted to their own clusters, and there are up to four jobs at a time from which to choose one that fits in the system.

Figures 6, 8, and 10 show that for GP the performance decreases as the percentage of jobs with 3 and 4 components increases. Since jobs with more components cause a higher capacity loss, it is a bad choice not to allow the local schedulers to try to fit jobs from their own queues when the job at the head of the global queue does not fit. Waiting for enough idle processors in multiple clusters for that job results in a deterioration of the performance. This is also shown by the fact that LQ has worse performance when the percentage of multi-component jobs is higher. EQ has a good performance for all the chosen job mixes because it tries to fit as many jobs as possible from all queues without taking into account the characteristics of the job stream.
Fig. 5. A performance comparison of the scheduling strategies for a job stream with composition (100%, 0%, 0%, 0%) and exponential service times
Favouring the single-component jobs by enabling the local queues first at job departures also has a positive influence on the performance.

The best performance in Figs. 6, 8, and 10 is displayed by LP. This suggests that placing the 1-component jobs, which are restricted to a certain cluster, first, and only then trying to schedule the multi-component jobs, whose components the scheduler can shuffle in order to fit them, improves the utilization of the system. It also seems that when none of the local queues is empty it is a good choice to delay the global jobs while waiting for the local jobs to fit, since LP consistently gives better results than EQ, while the opposite decision taken in the case of GP makes that policy consistently worse than EQ. The disadvantage of LP is that it tends to delay the multi-component jobs, just as GP delays the single-component ones.

The differences in performance are larger in Fig. 6, where there are 50% 1-component jobs and 50% 4-component jobs. In Figs. 8 and 10, where there are no 4-component jobs, all strategies display more similar performance. In these two cases there are 50% 1-component jobs and the rest are 2- and 3-component jobs. Increasing the percentage of 1-component jobs would improve the performance of GS and deteriorate that of all the others (when there are 100% single-component jobs, GP, EQ, LQ, and LP all become LS). Increasing the percentage of multi-component jobs would improve the performance of LS, but worsen it for the rest (when there are only multi-component jobs, GP, EQ, LQ, and LP become GS).

In all the graphs discussed so far we looked at the total (average) response time. However, when there are both local and global queues in the system, we can expect the performance to differ between the global and local queues and to depend on the policy. Figures 11 and 12 show, besides the total average response time, the average response times for the local queues and the global queue for the EQ, GP, LP, and LQ policies and for the four job compositions
Fig. 6. A performance comparison of the scheduling strategies for a job stream with composition (50%, 0%, 0%, 50%) and exponential service times
which include both single- and multi-cluster jobs. For each utilization value for which we previously reported an average response time for the entire system, we now also depict the average response times for the jobs in the global and local queues, respectively. While LP and EQ provide much better performance for local jobs, GP and LQ are better for the global jobs. We cannot say that LQ favours the global jobs in general, since in a system with many single-cluster jobs the opposite would hold. LQ is also fair to all jobs in the sense that if there is a large job, be it single- or multi-cluster, that is difficult to fit on the system, LQ will not only give that job a chance to run sooner than other policies would (unless they directly favour that type of job), but will also limit the delay for the jobs behind it in the queue. In fact, LQ keeps the load of the queues balanced, switching its behaviour between GP and LP depending on the queue lengths. On the negative side, with LQ the performance of jobs of one type is more sensitive to the performance of jobs of the other type than it is with EQ, GP, or LP.
Fig. 7. A performance comparison of the scheduling strategies for a job stream with composition (0%, 0%, 0%, 100%) and exponential service times
The figures show that EQ has better performance for the local queues and worse performance for the global queue than LP. A reason for this is that with EQ, at job departures the queues are enabled starting with the local ones. If this decision is reversed, the average response time for the local queues increases, that for the global queue decreases, and the overall performance of the system improves. When none of the local queues is empty, the LP policy strongly favours the local schedulers by not letting the global scheduler run. However, when at least one local queue is empty, the global queue is disabled (because the job at its head does not fit), and a job departs, the global scheduler is allowed to try to place a job first. This decision has a positive effect on the overall performance, but slightly deteriorates the performance of the local queues and makes it dependent on the global jobs: the better the global jobs fit, the worse the performance of the local jobs is.

Of these four policies, the most practical would be either LP or EQ, since the other two tend to delay the local jobs, and it can be expected that the organizations owning the different clusters would not like their local jobs to be delayed in favour of the global, multi-component jobs. Our results show that, for policies like LP and EQ, even a high percentage of global jobs in the system does not deteriorate the performance of the local jobs. However, the users submitting multi-component jobs to a system implementing such a policy should be aware that the performance of their jobs is strongly influenced by the local jobs and can be significantly lower than the overall performance of the system.

In most of our graphs, at high utilizations some of the curves are rather close, and one might conclude that the performance is very similar. However, this only shows that the maximum utilizations are close, not that the average response times are similar. Due to the steepness of the curves at high utilizations, for the same utilization the corresponding response times on two curves that seem very close are very different. To illustrate this, Fig. 13 compares
Fig. 8. A performance comparison of the scheduling strategies for a job stream with composition (50%, 25%, 25%, 0%) and exponential service times
the average response times for the six policies considered, for the four job-stream compositions that contain both global and local jobs, at a utilization high enough to be on the steep side of the curves for all policies and close to the maximum utilization of the policy with the worst performance. The values of the utilization in all cases correspond to a system that is not saturated for any of the policies. Although they only depict the response times at a single utilization value each, the charts in Fig. 13 are useful to show that there are large differences in response times at utilization points where the curves in Figs. 4–8 are hardly distinguishable. Since the displayed results are at different utilizations, it is not meaningful to compare the bar charts in Fig. 13 to each other.
3.2 Simulations with the DAS Service-Time Distribution
In this section we use for the service times the cut distribution derived from the DAS log.
Fig. 9. A performance comparison of the scheduling strategies for a job stream with composition (0%, 50%, 50%, 0%) and exponential service times
We only present simulations for LS, LP, and GS, and for the first four job-stream compositions. The results are very much in line with those from the previous section: in Figs. 14(a) and 14(d) LS displays the best performance, in Fig. 14(b) GS is the best, and in Fig. 14(c) LP is the best. This shows that the previous use of exponential distributions did not alter the results and that our conclusions are valid for systems such as the DAS.
4 Related Work
In two previous papers, we have assessed the influence on the performance of co-allocation of the structure and sizes of jobs and of the scheduling policy [6], and of the overhead due to communication among the tasks of jobs [7].

In [9], a model similar to ours is used, with different multicluster configurations and a single central scheduler. In that paper, workloads derived from the CTC workload [3] are used, with jobs split up into components, and the EASY backfilling scheduling policy is implemented. Co-allocation (called multi-site computing there) with flexible jobs and cluster-filling is compared to load balancing and to a system where clusters work in isolation. The communication overhead due to the slow wide-area links among clusters is included in the model as an extension of the service time of jobs using co-allocation. This service-time extension is used as a parameter in the simulations, and it is concluded that multi-site computing is advantageous for service-time extensions of up to 1.25.

In [16], a queueing system in which jobs require simultaneous access to multiple resources is studied. The interarrival and service-time distributions are only required to be stationary. Feasible job combinations are defined as the sets of jobs that can be in service simultaneously. A linear-programming problem based on an application of Little's formula for these feasible job combinations is formulated for finding the maximal utilization, regardless of the scheduling policy employed.

In [18], a performance comparison of two meta-schedulers is presented.
Fig. 10. A performance comparison of the scheduling strategies for a job stream with composition (50%, 50%, 0%, 0%) and exponential service times
It is shown there that dedicating parts of subsystems to jobs that need co-allocation is not a good idea.

In [19], NUMA multiprocessors are split up into processor pools of equal sizes along architectural lines. The number of threads into which a job is split and the number of pools across which it is spread (the pools with the lowest loads are chosen, and a parallel job incurs more overhead when it spans multiple pools) are controlled with parameters. The main result is that using intermediate pool sizes and limiting the number of pools a job is allowed to span yields the lowest response times, as this entails the best locality.

In [14], simulations of two offline algorithms for multidimensional bin-packing, a problem that resembles scheduling ordered jobs with deterministic service times and without communication, are presented. These algorithms search for items that will reduce the imbalance in the current bin. In order to relate these algorithms to scheduling in multiclusters with deterministic service demands, the algorithms are also simulated for short item lists, with replacement of items before a new bin is started.
Fig. 11. Comparing EQ, GP, LP and LQ for a job stream with composition (25%, 25%, 25%, 25%) (left) and (50%, 0%, 0%, 50%) (right), and including the separate performance for the local and global queues
Fig. 12. Comparing EQ, GP, LP and LQ for a job stream with composition (50%, 25%, 25%, 0%) (left) and (50%, 50%, 0%, 0%) (right), and including the separate performance for the local and global queues
(The four charts are at utilizations 0.869, 0.860, 0.901, and 0.915, respectively; for each policy, bars are shown for the local, total average, and global response times.)
Fig. 13. A comparison of the average response times for the policies considered, for job stream compositions (25%, 25%, 25%, 25%) (top-left), (50%, 0%, 0%, 50%) (top-right), (50%, 25%, 25%, 0%) (bottom-left) and (50%, 50%, 0%, 0%) (bottomright)
5 Conclusions
In this paper we looked at different scheduling policies for co-allocation in multicluster systems and evaluated the performance of the system in terms of the response time as a function of the utilization of the system. Co-allocation with unordered requests is a good choice not only for large jobs, which can run sooner if split into more components and spread over the clusters; it also deals well with small single-component jobs. For a high percentage of single-component jobs, allowing them to run on any of the clusters, even if scheduled by a single global scheduler, proved to be a better choice than keeping them local to the cluster they were submitted to. For multi-component jobs, having multiple schedulers in the system and distributing the jobs among them improves the performance; any of the jobs at the heads of the queues can be chosen to run if it fits, which generates a form of backfilling with a window equal to the number of queues in the system, and increases the utilization.

When there are separate queues for single- and multi-component jobs, favouring the multi-component jobs lowers the performance.
Fig. 14. Performance comparison of the scheduling strategies for a job-stream with composition: (a) (25%, 25%, 25%, 25%) (top-left), (b) (100%, 0%, 0%, 0%) (top-right), (c) (50%, 0%, 0%, 50%) (bottom-left) and (d) (0%, 0%, 0%, 100%) (bottom-right), and a service time distribution from the DAS
In order to improve the system's performance it is good to employ as many processors as possible, so if the job at the head of the global queue does not fit, it is better to try to run jobs from the other queues, even if this might delay that job, than to wait until enough processors are free for it.

If single-component jobs are restricted to one cluster, it is better to try to place them first and only then to schedule the multi-component jobs, whose components can be shuffled (unordered requests) so that there is a higher chance for them to fit, than to fit the same set of jobs starting with the multi-component ones.

Considering at one extreme a system with one global scheduler that manages all the jobs using co-allocation over the entire system, and at the other extreme a system with a local scheduler for each cluster, where the schedulers have no global information and only provide resources from the cluster they are associated with, we choose a combination of the two. Our results show that of all the strategies we considered, the best is to have multiple schedulers (for example, one for each cluster) and to drop the requirement of keeping single-component jobs local. As long as we treat all jobs the same and do not know the composition of the job stream, there is no reason to separate single- and multi-component jobs into different queues, and it is better to distribute jobs evenly among the queues.
Our choice would be LS without restricting jobs to the local clusters, since this strategy is both simple and gives good performance. However, we may expect that if the clusters have different owners, a version of LS that favours the local jobs, or LP, would be preferred in order to give priority to the local jobs.
References
[1] The Distributed ASCI Supercomputer (DAS) site. http://www.cs.vu.nl/das.
[2] The Global Grid Forum. http://www.gridforum.org.
[3] The Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
[4] K. Aida, H. Kasahara, and S. Narita. Job Scheduling Scheme for Pure Space Sharing Among Rigid Jobs. In D. G. Feitelson and L. Rudolph, editors, 4th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1459 of LNCS, pages 98–121. Springer-Verlag, 1998.
[5] H. E. Bal, A. Plaat, M. G. Bakker, P. Dozy, and R. F. H. Hofman. Optimizing Parallel Applications for Wide-Area Clusters. In Proc. of the 12th International Parallel Processing Symposium, pages 784–790, 1998.
[6] A. I. D. Bucur and D. H. J. Epema. The Influence of the Structure and Sizes of Jobs on the Performance of Co-allocation. In D. G. Feitelson and L. Rudolph, editors, 6th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1911 of LNCS, pages 154–173. Springer-Verlag, 2000.
[7] A. I. D. Bucur and D. H. J. Epema. The Influence of Communication on the Performance of Co-allocation. In D. G. Feitelson and L. Rudolph, editors, 7th Workshop on Job Scheduling Strategies for Parallel Processing, volume 2221 of LNCS, pages 66–86. Springer-Verlag, 2001.
[8] K. Czajkowski, I. Foster, and C. Kesselman. Resource Co-Allocation in Computational Grids. In 8th IEEE Int'l Symp. on High Perf. Distrib. Comp., pages 219–228, 1999.
[9] C. Ernemann, V. Hamscher, U. Schwiegelshohn, and R. Yahyapour. On Advantages of Grid Computing for Parallel Job Scheduling. In 2nd IEEE/ACM Int'l Symp. on Cluster Computing and the Grid, pages 39–46, 2002.
[10] H. E. Bal et al. The Distributed ASCI Supercomputer Project. ACM Operating Systems Review, 34(4):76–96, 2000.
[11] D. G. Feitelson and L. Rudolph. Theory and Practice in Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, 3rd Workshop on Job Scheduling Strategies for Parallel Processing, volume 1291 of LNCS, pages 1–34. Springer-Verlag, 1997.
[12] I. Foster and C. Kesselman (eds). The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
[13] T. Kielmann, R. F. H. Hofman, H. E. Bal, A. Plaat, and R. A. F. Bhoedjang. MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems. In ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 131–140, 1999.
[14] W. Leinberger, G. Karypis, and V. Kumar. Multi-capacity Bin Packing Algorithms with Applications to Job Scheduling under Multiple Constraints. In Int'l Conf. on Parallel Processing, pages 404–412, 1999.
[15] Mesquite Software, Inc. The CSIM18 Simulation Engine, User's Guide.
[16] K. J. Omahen. Capacity Bounds for Multiresource Queues. J. of the ACM, 24:646–663, 1977.
[17] A. Plaat, H. E. Bal, R. F. H. Hofman, and T. Kielmann. Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects. Future Generation Computer Systems, 17:769–782, 2001.
[18] Q. Snell, M. Clement, D. Jackson, and C. Gregory. The Performance Impact of Advance Reservation Meta-Scheduling. In D. G. Feitelson and L. Rudolph, editors, 6th Workshop on Job Scheduling Strategies for Parallel Processing, volume 1911 of LNCS, pages 137–153. Springer-Verlag, 2000.
[19] S. Zhou and T. Brecht. Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors. In ACM Sigmetrics '91, pages 133–142, 1991.
Practical Heterogeneous Placeholder Scheduling in Overlay Metacomputers: Early Experiences
Christopher Pinchak, Paul Lu, and Mark Goldenberg
Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada
{pinchak,paullu,goldenbe}@cs.ualberta.ca
http://www.cs.ualberta.ca/~paullu/Trellis/
Abstract. A practical problem faced by users of high-performance computers is: How can I automatically load balance my jobs across different batch queues, which are in different administrative domains, if there is no existing grid infrastructure? It is common to have user accounts for a number of individual high-performance systems (e.g., departmental, university, regional) that are administered by different groups. Without an administration-deployed grid infrastructure, one can still create a purely user-level aggregation of individual computing systems. The Trellis Project is developing the techniques and tools to take advantage of a user-level overlay metacomputer. Because placeholder scheduling does not require superuser permissions to set up or configure, it is well-suited to overlay metacomputers. This paper contributes to the practical side of grid and metacomputing by empirically demonstrating that placeholder scheduling can work across different administrative domains, across different local schedulers (i.e., PBS and Sun Grid Engine), and across different programming models (i.e., Pthreads, MPI, and sequential). We also describe a new metaqueue system to manage jobs with explicit workflow dependencies.
Keywords: scheduling, metascheduler, metacomputing, computational grids, load balancing, placeholders, overlay metacomputers, metaqueue
1 Introduction
Metacomputing and grid computing are active research areas with the goal of developing the infrastructure and technology to create virtual computers from a collection of computers (for example, [7, 3, 8, 6, 12]). However, the constituent computers may be heterogeneous in their operating systems, local schedulers, and administrative control. The Trellis Project at the University of Alberta is addressing some of these issues through platform-independent systems to access computational resources [16, 18], remote data access [24], and scheduling [19, 9, 20]. The goals of the Trellis Project are not as comprehensive as other grid and metacomputing projects, but all of the projects share the goal of making it easier to take advantage of distributed computational and storage resources.
Table 1. Design Options for Grid and Metacomputer Scheduling

Metaqueue
  Description: Front-end queue that can redirect jobs to other queues. (E.g., routing queues in OpenPBS [17].)
  Main Advantages: Load balancing. Unified interface.
  Current Disadvantages: Requires common software systems, protocols, and administrative support.

Computational Grid
  Description: Common set of protocols and software infrastructure for metacomputing. (E.g., Globus Toolkit [8, 6] and Legion [12].)
  Main Advantages: Comprehensive set of features, including resource discovery and load balancing.
  Current Disadvantages: Relies on common grid infrastructure and cooperation of administrative domains. Generally speaking, unprivileged users cannot install or configure a grid.

User Scripts
  Description: Manual job placement and partitioning.
  Main Advantages: Simplicity.
  Current Disadvantages: Poor load balancing. Slow queue problem. Requires user intervention.

Placeholder Scheduling
  Description: User-level implementation of metaqueue. No special infrastructure or administrative support required.
  Main Advantages: Load balancing. Flexibility to create per-user and per-workload overlay metacomputers. Can be layered on top of existing (heterogeneous) queues, metaqueues, administrative domains, and grids.
  Current Disadvantages: Single job and advance reservations cannot span multiple queues or domains. No support for cross-domain resource discovery, etc.
In this paper, we extend our previous work [20] on the problems related to effectively scheduling computational tasks on computers that have different system administrators, especially in the absence of a single batch scheduler.
1.1 Motivation: Overlay Metacomputers
Users often want to harness the cumulative power of an ad hoc collection of high-performance computers. Often, the problem is that each computer has a different batch scheduler, independent queues, and different groups of system administrators. Two known solutions to this problem are: (1) implement a system-level metaqueue or (2) deploy a computational grid (Table 1).
Fig. 1. Overlay Metacomputers
First, if all of the individual computers are under a single group of system administrators, it would be possible (and preferable) to create a system-level metaqueue. For example, the OpenPBS implementation of the Portable Batch System (PBS) [17] supports routing queues. Similar capabilities exist in other workload management systems, such as Platform Computing’s LSF [15]. Jobs are submitted to routing queues that decide which execution queue should receive the jobs. The advantage of a system-level and system-scheduled metaqueue is that more efficient scheduling decisions can be made. In effect, there is a single scheduler that knows about all jobs in all queues. Also, a system-level metaqueue would, presumably, be well-supported and conform to the security and sharing policies in force within the administrative domain. However, if the collection of computers with execution queues spans multiple administrative domains, it may be difficult and impractical to implement such a metaqueue. The disadvantage of a system-scheduled metaqueue is that the local system administrators may be required to relinquish some control over their queues. If the centres are located at different institutions, it can be difficult to obtain such administrative concessions.

Second, if the various system administrators can be persuaded to adopt a single grid infrastructure, such as Globus [8, 6], Legion [12], or Condor [3, 7], a metaqueue can be implemented as part of a computational grid. The advantage of computational grids is that they offer a comprehensive set of features, including resource discovery, load balancing, and a common platform. However, if the system administrators have not yet set up a grid, the user cannot take advantage of the grid features. Furthermore, what if a user has access to two systems that belong to two separate grids?

A practical problem that exists today is that many researchers have access to a variety of different computer systems that do not share a computational grid or a data grid (Figure 1). In fact, each of the individual systems may have a different local batch scheduler (e.g., OpenPBS, LSF, Sun Grid Engine [26]). The researcher merely has an account on each of the systems. For example, Researcher A has access to his group’s system, a departmental system, and a system at a high-performance computing centre. Researcher B has access to her group’s server and (perhaps) a couple of different high-performance computing
centres, including one centre in common with Researcher A. It would be ideal if all of the systems could be part of one metacomputer or computational grid. But the different systems may be controlled by different groups who may not run the same grid software. Yet Researchers A and B would still like to be able to exploit the aggregate power of their systems. Of course, the user can manually submit jobs to the different queues at different centres. In the case of user-scheduled jobs, the schedulers at each queue are unaware of the other jobs or queues. The user has complete control and responsibility for job placement and monitoring. Although this strategy is inconvenient, it is a common situation. The advantage is that different administrative groups do not have to agree on common policies; the user merely has to have an account on each machine. The disadvantage of user-scheduled jobs is that they are labour-intensive and inefficient when it comes to load balancing [20]. A better solution than manual interaction with the local schedulers is to create an overlay metacomputer, a user-level aggregate of individual computing systems (Figure 1). A practical and usable overlay metacomputer can be created by building upon existing networking and software infrastructure, such as Secure Shell (ssh) [1], Secure Copy (scp), and World Wide Web (WWW) protocols. Because the infrastructure is accessible at the user level (or is part of a well-supported, existing infrastructure), Researcher A can create a personal Overlay Metacomputer A. Similarly, Researcher B can create a personal Overlay Metacomputer B, which can overlap with Researcher A's metacomputer (or not).
1.2 Motivation: Placeholder Scheduling
Placeholder scheduling creates a user-level metaqueue that interacts with the local schedulers and queues of the overlay metacomputer. More details are provided in Section 2. Instead of a push model, in which jobs are moved from the metaqueue to the local queue, placeholder scheduling is based on a pull model in which jobs are dynamically bound to the local queues on demand. The individual local schedulers do not have to be aware of the user-level metaqueue (which preserves all of the local scheduler's policies) because only the placeholder jobs have to communicate with the user-level metaqueue; the local scheduler does not interact with the metaqueue. Placeholder scheduling has three main advantages. First, the user-level metaqueue is built using only standard software or well-supported infrastructure. Software systems that require a lot of new daemons, applications, configuration, and administration are less likely to be adopted and supported by a wide community. Our system is layered on top of existing secure network infrastructure (i.e., Secure Shell) and existing batch scheduler systems (i.e., we use OpenPBS [17] and Sun Grid Engine [26]). Second, placeholder scheduling does not require superuser privileges or special administrative support. Different users can create private metaqueues that can load balance across different systems. Third, user-level metaqueues have load balancing benefits similar to those of system-level metaqueues, except that placeholder scheduling works across heterogeneous systems even
if the different administrators do not have common scheduling infrastructure or policies. In the absence of a system-level metaqueue or a computational grid, which is still the common case, placeholder scheduling can still be used to load balance jobs across multiple queues.
1.3 Contributions
In our previous work [20], we described a prototype implementation of placeholder scheduling and a set of experiments. That was a proof-of-concept system and empirical evidence for the efficacy of placeholder scheduling. This paper extends our previous work and contributes to the practical aspects of computational grids and metacomputing by detailing a new implementation of placeholder scheduling that:
1. Works across three different administrative domains, none of which are part of the same system-level grid or metacomputer. We use systems located in our department, at the University of Alberta's high-performance computing centre, and at the University of Calgary.
2. Works with different local batch scheduler systems. Our previous experiments used only PBS. For the first time, we show how the Sun Grid Engine can interoperate with our user-level metaqueue as easily as PBS.
3. Can use an SQL database, instead of a flat file, to maintain the state of the user-level metaqueue. The original flat-file approach is still supported and used when appropriate. The SQL-based option adds the benefits of sophisticated concurrency control and fault tolerance. We have also implemented support for specifying and maintaining workflow dependencies between jobs. Therefore, as with a dataflow model, all jobs of a larger computation can be submitted to the system, but jobs will only be executed when their predecessor jobs have been completed.
4. Includes dynamic monitoring and throttling of placeholders. We demonstrate a simple but effective system for controlling the number of placeholders in each local queue. When the local system is lightly loaded, more placeholders are created in order to maximize the throughput of the metaqueue. When the local system is heavily loaded, fewer placeholders are used because there is no benefit in having more placeholders.
2 Placeholders
2.1 The Concept
A placeholder can be defined as a unit of potential work. For an actual unit of work (i.e., a job), it is possible for any placeholder, within a group of placeholders, to actually complete the work. For example, in Figure 2, six placeholders (i.e., PH1 to PH6) have been submitted to six different queues on three different computer systems.
Fig. 2. Placeholder System Architecture
Any one of the placeholders is capable of executing the next job in the metaqueue. The run-time binding of placeholder to job occurs at placeholder execution time (not placeholder submission time), under the control of a command-line server (discussed in Section 2.2). We provide the implementation details in Section 3, but for now one can think of a placeholder as a specially crafted job submitted to the local batch scheduler. The placeholder job does not have any special privileges. The first placeholder to request a new unit of work is given the next job in the metaqueue, which minimizes the mean response time for that job. The placeholder "pulls" the job onto the local computer system. Ignoring fault tolerance, the same job is never given to more than one placeholder, and multiple placeholders can request individual jobs from a single metaqueue containing many jobs. If there are no jobs in the metaqueue when the placeholder begins execution, it can either exit the local queue or re-submit itself to the same queue. Informally, if there is no work to give to a placeholder when it reaches the front of the queue, the placeholder can go back to the end of the line without consuming a significant amount of machine resources. Other practical aspects of placeholder management are discussed in Section 6.
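Restated as pseudocode, the pull model amounts to the following sketch. It is our own summary, not the authors' code; the actual placeholders are the PBS and SGE shell scripts shown later in Figures 4-6, and the helper names used here are assumptions.

    # What a placeholder does on each trip through its local queue (sketch).
    def placeholder_run(ask_metaqueue, execute, resubmit_self, exit_queue):
        job = ask_metaqueue()    # contact the command-line server (e.g., via ssh)
        if job is None:          # metaqueue is empty: either leave the local queue
            exit_queue()         # or go back to the end of the line by resubmitting
            return
        execute(job)             # run-time binding of placeholder to job
        resubmit_self()          # keep a placeholder in the queue for future jobs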
All placeholders submitted to any system are submitted on behalf of the user (i.e., the jobs belong to the user's account identity). Therefore, all per-user resource accounting mechanisms remain in place. Some metacomputing systems execute jobs submitted to the metaqueue under a special account. We preserved submission from user accounts for three reasons: (1) job priority, (2) job accounting, and (3) security. Some sites base job priority on information about past jobs submitted by the user; other sites record this information for accounting (and possibly billing) purposes. Finally, security breaches of user accounts are significantly less dangerous than those of a superuser or privileged account.
2.2 Command-Line Server
The command-line server controls which executables and arguments should be executed by the placeholders. Because it acts as an intermediary between the placeholders and the user-level metaqueue, users can dynamically submit jobs to the command-line server and be assured that, at some point, a placeholder will execute each job. We have augmented the command-line server with the ability to sequence jobs (and their respective command-line arguments) according to workflow dependencies. When jobs are submitted to the metaqueue, which is used by the command-line server, the user can optionally list job dependencies. Jobs cannot be assigned to placeholders (i.e., executed) until their predecessor jobs have been completed. Consequently, jobs may be executed in an order different from that in which they were submitted to the metaqueue, but the order of execution always respects the required workflow.
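The dependency rule can be captured in a few lines. The sketch below is our own minimal, in-memory illustration of a dependency-aware metaqueue; the authors' implementation keeps this state in a PostgreSQL database (Sect. 3.4), and the class and method names here are assumptions.

    # Minimal sketch of a dependency-aware metaqueue: a job is handed to a
    # placeholder only when all of its predecessor jobs have completed.
    class MetaQueue:
        def __init__(self):
            self.pending = {}   # job id -> (command line, set of unfinished predecessors)
            self.done = set()

        def submit(self, job_id, command_line, deps=()):
            self.pending[job_id] = (command_line, set(deps) - self.done)

        def next_job(self):
            # first submitted job whose predecessors are all done
            for job_id, (cmd, deps) in self.pending.items():
                if not deps:
                    del self.pending[job_id]
                    return job_id, cmd
            return None         # nothing runnable right now; the placeholder retries later

        def job_done(self, job_id):
            self.done.add(job_id)
            for _, (cmd, deps) in self.pending.items():
                deps.discard(job_id)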
3 Implementation
The basic architecture of our system is presented in Figure 2. We use the Secure Shell [1] for client-server communication across networks and either OpenPBS [17] or Sun Grid Engine [26] for the local batch schedulers. In our simple experimental system, placeholders contact the command-line server via Secure Shell. Placeholders use a special-purpose public-private key pair that allows them to authenticate and invoke the command-line server on the remote system. All placeholders within the experimental system are submitted using the same user accounts. Currently, the placeholders and command-line server execute under normal user identities that do not have any special privileges. In fact, as discussed above, it is important that the placeholders are submitted via the user account to allow for proper prioritization and accounting at the local queue. Moreover, should a malicious user acquire the private key of a placeholder, the damage would be limited because normal user accounts are non-privileged.
Fig. 3. Steps in Placeholder Execution
3.1 Example: Steps in Placeholder Execution
The flow of control for an example placeholder on the machine st-brides is shown in Figure 3. The actions this placeholder takes before executing are as follows:
1. The placeholder reaches the front of the batch scheduler queue dque.
2. The placeholder script contacts the command-line server on machine brule via Secure Shell. The name of the current machine (st-brides) is sent along as a parameter.
3. The command-line server retrieves the next command line. Command lines are stored in either (a) a flat file (as with the parallel sorting application described in Section 3.3), or (b) a PostgreSQL [21] database (as with the checkers database application described in Section 3.4).
4. The results of the query are returned to the waiting placeholder. In the event that there are more command lines available, but none can be assigned because of dependencies, the placeholder is instructed to wait a short time and resubmit itself. If no more command lines are available, a message is sent notifying the placeholder to terminate without further execution.
5. The placeholder uses the returned command line to begin execution.
3.2 Dynamic Monitoring and Throttling of Placeholders
Because placeholders progress through the queue multiple times, it may be advantageous to consider the queue waiting time of the placeholder. Waiting time
information may be utilized in order to decide how many placeholders to simultaneously maintain in a given queue. Low waiting times indicate that the queue is receiving "fast" service, and it may be a good idea to submit multiple placeholders to take advantage of the favourable conditions. For example, on a multiprocessor system, it may be possible to have different jobs execute concurrently on different processors; one job per placeholder. Conversely, high waiting times indicate that the queue is "slow" for the placeholder parameters and little will be gained by increasing the number of placeholders in the queue. Also, one does not want to have too many placeholders in the queue if the queue is making slow progress, lest they interfere with other users. This ability to throttle the number of placeholders may further reduce the makespan of a set of jobs.
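This waiting-time rule is what the small decide helper invoked by the PBS placeholder in Figure 4 implements. The following sketch is ours; the paper does not give the decision thresholds, so the values below are purely illustrative.

    # Illustrative throttling rule: map a placeholder's queue waiting time
    # (in seconds) to an action, in the spirit of the 'decide' helper of Fig. 4.
    # The threshold values are hypothetical.
    def decide(waiting_time, fast=60, slow=1800):
        if waiting_time < 0:      # waiting time unknown (no timestamp was recorded)
            return "maintain"
        if waiting_time < fast:   # fast service: add another placeholder to this queue
            return "increase"
        if waiting_time > slow:   # slow service: let this placeholder retire
            return "reduce"
        return "maintain"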
3.3 Parallel Sort
A sorting application was chosen because of its ease of implementation and because it may be implemented in a variety of different ways. Sorting may be done sequentially using a well-known efficient sorting algorithm (in our case, QuickSort) or in parallel (we used Parallel Sorting by Regular Sampling (PSRS) [14]). Additionally, PSRS may be implemented in both shared- and distributed-memory environments, allowing it to perform a sort on a variety of parallel computers. The variety of platforms on which a sort can be performed allows us to experiment with heterogeneous placeholder scheduling with respect to the programming model.

A generic PBS placeholder is shown in Figure 4. The placeholder includes the ability to dynamically increase and decrease the number of placeholders in the queue. As illustrated, a placeholder is similar to a regular PBS job script. The lines beginning with #PBS (lines 4-11, Figure 4) are directives interpreted by PBS at submission time. The command line is retrieved from the command-line server (in our case, using the program getcmdline) and stored into the OPTIONS shell variable (line 18, Figure 4). This variable is later evaluated at placeholder execution time with the command(s) that will be executed (line 50, Figure 4). The late binding of placeholder to executable name and command-line arguments is key to the flexibility of placeholder scheduling. The placeholder then evaluates the amount of time it has been queueing for (line 32, Figure 4) and consults a local script to determine what action to take (line 38, Figure 4). It may increase the number of placeholders in the queue by one (lines 44-47, Figure 4), maintain the current number of placeholders in the queue by resubmitting itself after finishing the current command line (line 60, Figure 4), or decrease the number of placeholders in the queue by not resubmitting itself after completing the current command line (lines 53-55, Figure 4).

Likewise, the basic command-line server is simple. Command lines themselves are stored in flat files, and the command-line server is implemented as a C program that accesses these files as a consumer process. Each invocation of the command-line server removes exactly one line from the flat file, which contains the arguments for one job.
 1  #!/bin/sh
 2
 3  ## Generic placeholder PBS script
 4  #PBS -S /bin/sh
 5  #PBS -q queue
 6  #PBS -l ncpus=4
 7  #PBS -N Placeholder
 8  #PBS -l walltime=02:00:00
 9  #PBS -m ae
10  #PBS -M [email protected]
11  #PBS -j oe
12
13  ## Environment variables:
14  ##   CLS_MACHINE - points to the command-line server's host.
15  ##   CLS_DIR     - remote directory in which the command-line server is located.
16  ##   ID_STR      - information to pass to the command-line server.
17  ## Note the back-single-quote, which executes the quoted command.
18  OPTIONS=`ssh $CLS_MACHINE "$CLS_DIR/getcmdline $ID_STR"`
19
20  if [ $? -ne 0 ]; then
21      /bin/rm -f $HOME/MQ/$PBS_JOBID
22      exit 111
23  fi
24
25  if [ -z $OPTIONS ]; then
26      /bin/rm -f $HOME/MQ/$PBS_JOBID
27      exit 222
28  fi
29
30  STARTTIME=`cat $HOME/MQ/$PBS_JOBID`
31  NOWTIME=`$HOME/bin/mytime`
32  if [ -n "$STARTTIME" ] ; then
33      let DIFF=NOWTIME-STARTTIME
34  else
35      DIFF=-1
36  fi
37  ## Decide if we should increase, decrease, or maintain placeholders in the queue
38  WHATTODO=`$HOME/decide $DIFF`
39
40  if [ $WHATTODO = 'reduce' ] ; then
41      /bin/rm -f $HOME/MQ/$PBS_JOBID
42  fi
43
44  if [ $WHATTODO = 'increase' ]; then
45      NEWJOBID=`/usr/bin/qsub $HOME/psrs/aurora-pj.pbs`
46      $HOME/bin/mytime > $HOME/MQ/$NEWJOBID
47  fi
48
49  ## Execute the command from the command-line server
50  $OPTIONS
51
52  ## leave if 'reduce'
53  if [ $WHATTODO = 'reduce' ] ; then
54      exit 0
55  fi
56
57  /bin/rm -f $HOME/MQ/$PBS_JOBID
58
59  ## Recreate ourselves if 'maintain' or 'increase'
60  NEWJOBID=`/usr/bin/qsub $HOME/psrs/aurora-pj.pbs`
61  $HOME/bin/mytime > $HOME/MQ/$NEWJOBID

Fig. 4. Generic PBS Placeholder
 1  #!/bin/sh
 2
 3  ## Checkers DB Placeholder PBS script
 4  #PBS -S /bin/sh
 5  #PBS -N CheckersPH
 6  #PBS -q dque
 7  #PBS -l ncpus=1
 8  #PBS -l walltime=02:00:00
 9  #PBS -j oe
10  #PBS -M [email protected]
11  #PBS -m n
12
13
14  OPTIONS=`ssh $CLS_MACHINE $CLS_DIR/next_job.py $ID_STR`
15  RETURNVAL="$?"
16
17  if [ "$RETURNVAL" -eq 2 ]; then
18      exit 111
19  fi
20
21  if [ "$RETURNVAL" -eq 1 ]; then
22      sleep 5
23      qsub checkers_script.pbs
24      exit
25  fi
26
27  if [ -z "$OPTIONS" ]; then
28      exit 222
29  fi
30
31  cd $CHECKERS_DIR
32  $OPTIONS
33  ssh $CLS_MACHINE $CLS_DIR/done_job.py $ID_STR
34
35  qsub checkers_script.pbs

Fig. 5. PBS Placeholder for Computing Checkers Databases
Each request to the command-line server invokes a new process, and mutual exclusion is implemented using the flock() system call.
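The flat-file consumer behaviour can be sketched as follows. The paper's implementation is a C program; this Python version is only an illustration of the same idea (remove exactly one command line per request while holding an exclusive flock() lock), and the file name is an assumption.

    import fcntl

    def get_command_line(path="commands.txt"):
        # Remove and return exactly one command line from the flat file, holding
        # an exclusive flock() lock so that concurrent requests never hand out
        # the same line twice.
        with open(path, "r+") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            lines = f.readlines()
            if not lines:
                return None          # no work left for this placeholder
            f.seek(0)
            f.writelines(lines[1:])
            f.truncate()
            return lines[0].strip()  # the lock is released when the file is closed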
3.4 Checkers Database
The checkers database program is an ongoing research project that aims to compute endgame databases for the game of checkers [10]. For this paper, we are only concerned with the application-specific workflow properties of the computation. The placeholders for this application are simpler than in the previous example as they are not capable of regulating the number of jobs in the queue (see Figures 5 and 6; note the similarities between the placeholder scripts for PBS and SGE). For our experiment, the local computer systems are uniprocessors and they are dedicated to the computation. Therefore, there is little advantage in having more than one placeholder per queue. The databases are computed using retrograde analysis [10]. To create parallelism and reduce memory requirements, the databases are logically divided into individual jobs, called slices. We denote a slice using four numbers. These
#!/bin/sh
## Checkers DB Placeholder SGE script
#$ -S /bin/sh
#$ -N CheckersPH
#$ -j y
#$ -M [email protected]
#$ -m n

OPTIONS=`ssh $CLS_MACHINE $CLS_DIR/next_job.py $ID_STR`
RETURNVAL="$?"

if [ "$RETURNVAL" -eq 2 ]; then
    exit 111
fi

if [ "$RETURNVAL" -eq 1 ]; then
    sleep 5
    qsub checkers_script.sge
    exit
fi

if [ -z "$OPTIONS" ]; then
    exit 222
fi

cd $CHECKERS_DIR
$OPTIONS
ssh $CLS_MACHINE $CLS_DIR/done_job.py $ID_STR

qsub checkers_script.sge

Fig. 6. Sun Grid Engine Placeholder for Computing Checkers Databases

(The lattice in the figure contains the slices 3200, 2210, 3101, 1220, 2111, 3002, 0230, 1121, 2012, 0131, 1022, and 0032.)

Fig. 7. Dependencies Between Slices of the Checkers Endgame Databases
numbers stand for the number of black kings, white kings, black checkers and white checkers. The slices are further subdivided into smaller slices based on the position of the most advanced checker of each side (see [10] for details). Because the results of one slice may be needed before another slice can be computed, there is an inherent workflow dependency.
CREATE TABLE Targets (
    tar_id   int PRIMARY KEY,
    tar_name varchar(64) UNIQUE
);

CREATE TABLE Jobs (
    tar_id    int REFERENCES Targets,  -- target ID
    j_num     int,                     -- number within target
    comm_line varchar(800),            -- command line
    PRIMARY KEY (tar_id, j_num)
);

CREATE TABLE Before (
    pre_id int REFERENCES Targets ON DELETE CASCADE,  -- prerequisite
    dep_id int REFERENCES Targets,                    -- dependent target
    PRIMARY KEY (pre_id, dep_id)
);

CREATE TABLE Running (
    tar_id  int REFERENCES Targets,  -- target ID
    j_num   int,                     -- number within target
    machine varchar(20),             -- host name
    PRIMARY KEY (tar_id, j_num)
);
Fig. 8. Definition Script for Jobs Database
Figure 7 shows the dependencies between slices of the databases for the case in which black has three pieces and white has two pieces on the board. For example, consider a position with 2 black kings, 2 white kings, 1 black checker, and no white checkers. This position is in slice "2 2 1 0" of the databases. Now, if the black checker advances and becomes a king, then we have 3 black kings, 2 white kings, and no checkers. The new position is in slice "3 2 0 0". Thus, positions in slice "2 2 1 0" can play into positions in slice "3 2 0 0". This is reflected by an edge at the top of Figure 7. Therefore, slice "3 2 0 0" has to be computed before slice "2 2 1 0". In general, slices at the same level of the lattice in Figure 7 can be computed in parallel; slices at different levels of the lattice have to be computed in the proper order (i.e., from top to bottom).

Information about the dependencies between board configurations is conveniently stored in a Makefile. This Makefile is automatically produced by a C program. Commands in the Makefile are calls to a script (called mqsub.py) that inserts job descriptions and dependencies into a simple relational database (i.e., PostgreSQL [21]). The schema definition script is shown in Figure 8. An example of the submission of a job to the database is shown in Figure 9.
./mqsub.py -deps "0022 0031" -l "3200" -c "Bin/run.it 3 2 0 0 0 0 >& Results/3200.00"
Fig. 9. Submission of the Job For Computing a Slice in “3 2 0 0”
Table 2. Experimental Platform for the Parallel Sorting Application

System A (aurora)
  Description: SGI Origin 2000, 46 × 195 MHz R10000, 12 GB RAM, Irix 6.5.14f
  Interconnect: Shared Memory NUMA
  Scheduler: PBS
  Method: Parallel Shared Memory

System B (lacrete)
  Description: Single Pentium II, 400 MHz, 128 MB RAM, Linux 2.2.16
  Interconnect: None
  Scheduler: PBS
  Method: Sequential

System C (maci-cluster)
  Description: Alpha Cluster, mixture of Compaq XP1000, ES40, ES45, and PWS 500au, 206 processors in 122 nodes, each node has from 256 MB to 8 GB RAM, Tru64 UNIX V4.0F
  Interconnect: Gigabit Ethernet
  Scheduler: Sun Grid Engine
  Method: Parallel Distributed Memory (i.e., MPI)
We provide a name (or label) for the current target (or job) (following -l), the labels of the jobs on which the current job depends (following -deps), and the command line for computing the slice (following -c). Tuples in the Targets table (Figure 8) correspond to targets in the Makefile. Commands within targets are assigned consecutive numbers; thus a command is uniquely identified by its target ID and its job number within the target (see table Jobs). Table Before summarizes the information about dependencies between targets. Table Running contains the jobs that are currently being run; for each such job, the host name of the machine on which the job is running is stored.

The command-line server consults and modifies the database of jobs in order to return the command line of a job that can be executed without violating any dependencies. The command-line server is invoked twice for each job: once to get the command line for the job (Figure 5, line 14) and once to let the server know that the job has been completed (Figure 5, line 33). Both times, the host name is passed as a parameter to the server. The design of the jobs database simplifies the task of the command-line server. All prerequisites for a target are met if the target does not appear in the dep_id field of any tuple in the Before table. Also, when the last job of a target is returned, the target is deleted from the Targets table, which results in cascading deletion of the corresponding tuples in the database.
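The "no remaining prerequisites" check maps directly onto a query against the schema of Figure 8. The fragment below is our own sketch of what a server-side helper such as next_job.py might do, not the authors' code; the connection string and the use of the psycopg2 driver are assumptions.

    import psycopg2

    def next_runnable_job(host):
        # Pick a job whose target has no unfinished prerequisites (i.e., does not
        # occur as dep_id in Before), record it as running, and return its command line.
        conn = psycopg2.connect("dbname=jobs")   # connection details are illustrative
        cur = conn.cursor()
        cur.execute("""
            SELECT j.tar_id, j.j_num, j.comm_line
            FROM Jobs j
            WHERE j.tar_id NOT IN (SELECT dep_id FROM Before)
              AND (j.tar_id, j.j_num) NOT IN (SELECT tar_id, j_num FROM Running)
            LIMIT 1""")
        row = cur.fetchone()
        if row is None:
            conn.close()
            return None                           # nothing runnable; the placeholder retries
        tar_id, j_num, comm_line = row
        cur.execute("INSERT INTO Running (tar_id, j_num, machine) VALUES (%s, %s, %s)",
                    (tar_id, j_num, host))
        conn.commit()
        conn.close()
        return comm_line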
4 Experiments

4.1 Parallel Sort
The goals of the parallel sorting experiment are to show the performance of placeholders in four orthogonal dimensions of heterogeneity: (1) parallel vs. sequential computer; (2) machine architecture; (3) distributed vs. shared memory;
and (4) local scheduling system. A summary of the systems with respect to these dimensions is shown in Table 2. We performed an on-line experiment with three different computers, in three different administrative domains, and with different local schedulers. These are not simulated results. System A sorted four million integer keys using four processors, System B sorted four million integer keys sequentially, and System C sorted four million integer keys using eight processors. During our experiment, there were other users on two of the systems (i.e., Systems A and C). Although the specific quantitative results are not repeatable, the qualitative results are representative. Also, note that System A is administered by the high-performance computing centre of the University of Alberta. System B is in our department and effectively under our administrative control. System C is administered by the University of Calgary.

The primary goal of placeholder scheduling is to maximize throughput across a number of machines. The throughput, as evidenced by the rate of execution, is shown in Figure 10. The cumulative number of work units performed by each system is shown, and the rate of execution is given by the slopes of the lines. System A exhibits a good initial execution rate, but then suddenly stops executing placeholders. System B, the dedicated sequential machine, exhibits a steady rate of execution. System C is somewhere in between, exhibiting a more or less constant rate of execution, although this rate is below that of the others. The bottom-most (bar) graph in Figure 10 shows the number of work units completed per 5000-second time period.

An interesting point illustrated in Figure 10 is the abrupt halt of execution on System A. From the PBS logs, we believe that our placeholders used up our user account's quota of CPU time on the system. As a result, System A became unable to execute additional work after roughly 7000 seconds, which can be viewed as a failure of System A. However, because of the placeholders, the other two systems (B and C) were able to compensate for the loss of System A. After 7000 seconds, only Systems B and C complete work units, and they are responsible for finishing off the remainder of the workload. Had the loss occurred without a scheduling system such as placeholder scheduling, users would likely have had to discover and correct for it on their own.

Figures 11 and 12 show the queue lengths and placeholders per queue, respectively. As Figure 11 shows, System A is significantly more loaded than System C. However, System A is also more powerful than System C, and therefore its execution rate is higher. System A is also able to sustain more placeholders in the queue for the first 7000 seconds, and both queues exhibit increases and decreases in placeholder counts due to changing queue conditions (Figure 12). It must be emphasized that these results were obtained from computers working on other applications in addition to our own. No attempt was made to control the queues on Systems A or C.
[Figure 10 plots cumulative units completed over time (0–20000 seconds from start) in separate panels for System A, System B, and System C, with a bottom bar graph of units completed per period for each system and the total; the System A panel marks where its queue stops executing jobs.]
Fig. 10. Throughput for the Parallel Sorting Application
4.2 Checkers Database
The purpose of the checkers database experiment is twofold. First, the checkers database computation is a non-trivial application. Second, the computation of one slice depends upon the completed computation of other slices; therefore, some form of workflow management must be present to coordinate the computation of board configurations.
[Figure 11 plots queue length (jobs) over time (0–20000 seconds from start) in separate panels for System A and System C; an annotation attributes changes in queue length to placeholders.]
Fig. 11. Queue Lengths of the Parallel Machines
As described above, a new command-line server was implemented to coordinate the computation. Two different computers were used for computing the checkers databases (see Table 3). Figure 13 shows the throughput of the two computers in terms of the number of board configurations each computed. Because of the dependencies between some board configurations (see Figure 7), some configurations must be computed sequentially. In our case, System E computes more of these sequential configurations than does System D. This is reflected in the load averages shown in Figure 14: overall, System E has a higher load, which indicates that it is performing more work.

Unlike the parallel sorting experiment, there are dependencies between jobs in the checkers database application.
Table 3. Experimental Platform for the Checkers Database Application

System D (samson-pk): Single AMD Athlon XP 1800+, 256 MB RAM, Linux 2.4.9. Scheduler: Sun Grid Engine.
System E (st-brides): Single AMD Athlon XP 1900+, 256 MB RAM, Linux 2.4.9. Scheduler: PBS.
[Figure 12 plots the number of placeholders in the queue over time (0–20000 seconds from start) in separate panels for System A and System C; the System A panel marks where its queue stops executing jobs.]
Fig. 12. Number of Placeholders in Parallel Machine Queues
Furthermore, the number of jobs that can be computed concurrently varies from one (at the very top and bottom of the lattice) to a significant number (in the middle of the lattice) (Figure 7). Therefore, there are bottleneck jobs, and the two computers are not fully utilized during those bottleneck phases (Figure 14). However, when concurrent jobs are available, our placeholder scheduling system and the workflow-based command-line server are able to exploit the concurrency.
5 Related Work
The basic ideas behind placeholder scheduling are quite simple, but there are some important differences between placeholder scheduling and other systems.

In concept, placeholders are very similar to the GlideIn mechanism of Condor-G [7]. GlideIns are daemon processes submitted to the local scheduler of a remote execution platform using the GRAM remote job submission protocol of Globus [4]. As with placeholders, the local scheduler controls when the placeholder or daemon begins its execution, and GlideIns also support the late binding of jobs to a computational resource. In terms of implementation, placeholders and GlideIns have some significant differences. For example, placeholders do not require any additional software infrastructure beyond a batch scheduler and the Secure Shell, whereas Condor-G and GlideIns (as currently implemented) require the Globus infrastructure.
[Figure 13 plots cumulative board configurations computed over time (0–6000 seconds from start) in separate panels for System D and System E, with a bottom panel of configurations computed per period for System D, System E, and the total.]
Fig. 13. Throughput for the Checkers Database Application
Also, placeholders have a simple dynamic monitoring and throttling capability that is compatible with local schedulers. As previously discussed, one of the goals of user-level overlay metacomputers is to build only upon existing networking and software infrastructure. Of course, placeholders can be retargeted for Globus-based computational grids by using GridSSH [8] as a plug-in replacement for the standard Secure Shell.
[Figure 14 plots load averages over time (0–6000 seconds from start) in separate panels for System D and System E.]
Fig. 14. Load Averages for the Checkers Database Application
Similarly, GlideIns can, in theory, be reimplemented for non-Globus-based grids. More generally, we suspect that, prior to the availability of full-featured and open-source batch schedulers such as OpenPBS, most users wrote custom scripts to distribute their work (for example, [10]), without generalizing the system in the manner of this paper. We feel that our contribution is in demonstrating how placeholder scheduling can be implemented in a contemporary context and how it relates to metacomputing and computational grids.

More tangentially, large-scale distributed computation projects such as SETI@home [23] use software clients that are, in essence, single-purpose placeholders that pull work on demand from a server. Placeholder scheduling also shares many similarities with self-scheduling tasks within a parallel application and the well-known master-worker paradigm [22], in which placeholders are analogous to worker processes. Our presentation of placeholder scheduling is in the context of job scheduling rather than task scheduling; nonetheless, the basic strategies are identical.

Finally, there is a large body of research in the area of job scheduling and queuing theory (for example, [5, 11, 13]). This paper has taken a more systems-oriented approach to scheduling. Our scheduling discipline at the metaqueue (i.e., the command-line server) is currently simple: first-come-first-served. In the future, we hope to investigate more sophisticated scheduling algorithms that understand the dependencies between jobs and try to compute a minimal schedule.
6 Discussion and Future Work
In this section, we discuss some other important, practical aspects of placeholder scheduling. Many of the following issues are to be addressed as part of future work.

1. Advanced Placeholder Monitoring. We have implemented a simple form of placeholder monitoring and throttling. However, there are other forms of placeholder monitoring that are also important and will be addressed in future work. Placeholders should be removed from local batch queues if the command-line server has no more jobs or too few jobs. We do not want a placeholder to make it to the front of the queue, allocate resources (which may involve draining a parallel computer of all the sequential jobs so that a parallel job can run), and then exit immediately because the command-line server has no work for it. A similarly undesirable situation occurs when there are fewer jobs in the metaqueue than there are placeholders. In both situations, placeholders should be automatically removed from the queues in order to minimize the negative impact that they might have on other users. If, later on, more work is added to the command-line server, placeholders can be re-started.

2. Fault Tolerance. Placeholders, by their nature, provide some amount of fault tolerance. Because placeholders are usually present in more than one queue, some queue failures (e.g., a machine shutdown or network break) can occur and the jobs will still be executed by placeholders in the remaining queues. However, a more systematic approach to detecting and handling faults is required to improve the practicality of placeholder scheduling. As part of advanced placeholder monitoring (discussed above), future placeholder scheduling systems have to monitor and re-start placeholders that disappear due to system faults. Also, the system should be able to allocate the same job to two different placeholders if a fault is suspected and, if both placeholders end up completing the job, deal with potential conflicts due to job side effects.

3. Resource Matching. Modern batch scheduling systems provide the ability to specify constraints on the placement of jobs due to specific resource requirements. For example, some jobs require a minimum amount of physical memory or disk space. Currently, our implementation of placeholder scheduling does not provide this capability, but it is an important feature for the future.

4. Data Movement. Another practical problem faced by users of metacomputers and computational grids is: if my computation can move from one system to another, how can I ensure that my data will still be available to my computation? Depending on the level of software, technical, and administrative support available, a data grid (for example, [2, 25, 27]) or a distributed file system (e.g., AFS, NFS) would be a reasonable solution. However, as with system-level metaqueues, it is not always possible (or practical) to have a diverse
226
Christopher Pinchak et al.
group of systems administrators agree to adopt a common infrastructure to support remote data access. Yet, having transparent access to any remote data is an important, practical capability. Data movement is something that the Trellis Project has started to address. We have developed the Trellis File System (Trellis FS) to allow programs to access data files on any file system and on any host on a network that can be named by a Secure Copy Locator (SCL) or a Uniform Resource Locator (URL) [24]. Without requiring any new protocols or infrastructure, Trellis can be used on practically any POSIX-based system on the Internet. Read access, write access, sparse access, local caching of data, prefetching, and authentication are supported.
7 Concluding Remarks
The basic ideas behind placeholders and placeholder scheduling are fairly straightforward: centralize the jobs of the workload into a metaqueue (i.e., the command-line server), use placeholders to pull the job to the next available queue (instead of pushing jobs), and use late binding to give the system maximum flexibility in job placement and load balancing. Our contribution is in showing how such a system can be built using only widely-deployed and contemporary infrastructure, such as Secure Shell, PBS, and SGE. As such, placeholder scheduling can be used in situations in which metaqueues and grids have not yet been implemented by the administrators. As an extension of our original work with placeholder scheduling, we have now empirically demonstrated that placeholder scheduling can (1) load balance a workload across heterogeneous administrative domains (Table 2), (2) work with different local schedulers (Table 2), (3) implement workflow dependencies between jobs (Section 3.4, Section 4.2), and (4) automatically monitor the load on a particular system in order to dynamically throttle the number of placeholders in the queue (Section 3.2). Given the growing interest in metacomputers and computational grids, the problems of distributed scheduling will become more important. Placeholder scheduling is a pragmatic technique to dynamically schedule, place, and load balance a workload among multiple, independent batch queues in an overlay metacomputer. Local system administrators maintain complete control of their individual systems, but placeholder scheduling provides the same user benefits as a centralized meta-scheduler.
Acknowledgments Thank you to C3.ca, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Canada Foundation for Innovation (CFI) for their research support. Thank you to Lesley Schimanski, the anonymous referees, and the attendees of the 8th Workshop on Job Scheduling Strategies for Parallel Processing (Edinburgh, Scotland, July 24, 2002) for their valuable comments.
References

[1] D. J. Barrett and R. E. Silverman. SSH, the Secure Shell: The Definitive Guide. O'Reilly and Associates, Sebastopol, CA, 2001. 208, 211
[2] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A Data Movement and Access Service for Wide Area Computing Systems. In Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, 1999. 225
[3] Condor. http://www.cs.wisc.edu/condor/. 205, 207
[4] K. Czajkowski, I. Foster, N. Karonis, S. Martin, W. Smith, and S. Tuecke. A Resource Management Architecture for Metacomputing Systems. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1459 of Lecture Notes in Computer Science, pages 62–82. Springer-Verlag, 1998. 222
[5] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 1–34. Springer-Verlag, 1997. 224
[6] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997. 205, 206, 207
[7] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke. Condor-G: A Computation Management Agent for Multi-Institutional Grids. In Proceedings of the 10th International Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, California, U.S.A., August 7–9, 2001. 205, 207, 222
[8] Globus. http://www.globus.org/. 205, 206, 207, 223
[9] M. Goldenberg. A System For Structured DAG Scheduling. Master's thesis, Dept. of Computing Science, University of Alberta, Edmonton, Alberta, Canada, in preparation. 205
[10] R. Lake, J. Schaeffer, and P. Lu. Solving Large Retrograde-Analysis Problems Using a Network of Workstations. In Proceedings of Advances in Computer Chess 7, pages 135–162, Maastricht, Netherlands, 1994. University of Limburg. 215, 216, 224
[11] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik. Quantitative System Performance: Computer Systems Analysis Using Queueing Network Models. Prentice Hall, Inc., 1984. 224
[12] Legion. http://www.cs.virginia.edu/~legion/. 205, 206, 207
[13] M. R. Leuze, L. W. Dowdy, and K. H. Park. Multiprogramming a Distributed-Memory Multiprocessor. Concurrency: Practice and Experience, 1(1):19–34, September 1989. 224
[14] X. Li, P. Lu, J. Schaeffer, J. Shillington, P. S. Wong, and H. Shi. On the Versatility of Parallel Sorting by Regular Sampling. Parallel Computing, 19(10):1079–1103, October 1993. Available at http://www.cs.ualberta.ca/~paullu/. 213
[15] Load Sharing Facility (LSF). http://www.platform.com/. 207
[16] G. Ma and P. Lu. PBSWeb: A Web-based Interface to the Portable Batch System. In Proceedings of the 12th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), pages 24–30, Las Vegas, Nevada, U.S.A., November 6–9, 2000. Available at http://www.cs.ualberta.ca/~paullu/. 205
[17] OpenPBS: The Portable Batch System. http://www.openpbs.com/. 206, 207, 208, 211
[18] PBSWeb. http://www.cs.ualberta.ca/~paullu/PBSWeb/. 205
[19] C. Pinchak. Placeholder Scheduling for Overlay Metacomputers. Master's thesis, Dept. of Computing Science, University of Alberta, Edmonton, Alberta, Canada, in preparation. 205
[20] C. Pinchak and P. Lu. Placeholders for Dynamic Scheduling in Overlay Metacomputers: Design and Implementation. Journal of Parallel and Distributed Computing. Under submission to special issue on Computational Grids. 205, 206, 208, 209
[21] PostgreSQL Database Management System. http://www.postgresql.org/. 212, 217
[22] L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 237–245, Hilton Head, South Carolina, U.S.A., July 21–24, 1991. ACM Press. 224
[23] SETI@home. http://setiathome.ssl.berkeley.edu/. 224
[24] J. Siegel and P. Lu. User-Level Remote Data Access in Overlay Metacomputers. In Proceedings of the 4th IEEE International Conference on Cluster Computing, September 2002. 205, 226
[25] H. Stockinger, A. Samar, B. Allcock, I. Foster, K. Holtman, and B. Tierney. File and Object Replication in Data Grids. In Proceedings of the 10th International Symposium on High Performance Distributed Computing (HPDC-10), San Francisco, California, U.S.A., August 7–9, 2001. 225
[26] Sun Grid Engine. http://www.sun.com/software/gridware/sge.html. 207, 208, 211
[27] B. S. White, M. Walker, M. Humphrey, and A. S. Grimshaw. LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications. In SC2001: High Performance Networking and Computing, Denver, CO, November 10–16, 2001. 225
Current Activities in the Scheduling and Resource Management Area of the Global Grid Forum

Bill Nitzberg 1 and Jennifer M. Schopf 2

1 Veridian, PBS Products, 2672 Bayshore Pkwy, Suite 810, Mountain View, CA 94043, USA
[email protected]
2 Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Ave., Argonne, IL 60439, USA
[email protected]
Abstract. The Global Grid Forum’s Scheduling and Resource Management Area is actively pursuing the standards that are needed for interoperability of Grid resource management systems. This includes work in defining architectures, language standards, APIs and protocols. In this article we overview the state of the working groups and research groups in the area as of September 2002.
1 Introduction
The Global Grid Forum (GGF) [1] is an open standards body focused on Grid computing. Organized similarly to the Internet Engineering Task Force (IETF) [2], the GGF consists of groups of committed individuals from academia, research labs, and industry, working toward standards to promote common understanding and, more importantly, interoperability. The current areas in the GGF are Architecture, Data, Information Systems and Performance, Peer-to-peer Computing, Scheduling and Resource Management, Security, and Applications, Programming Models and Environments. The main focus of the Scheduling and Resource Management Area is agreements and standards in Grid resource management: architecture, specifications for resources and requirements, queuing, scheduling and superscheduling, starting and stopping jobs (task management), and accounting and logging. Generally, the process begins by looking at what is done today and what is desired; then gathering requirements, refining protocols, interactions, capabilities, and the like, and, finally, working to standardize APIs and protocols. Overall, the goal of this area is to “enable better use of resources”. The current makeup of active participants in the Scheduling Area covers Grid “operating systems” level developers, researchers, application developers, students, Grid system managers, and a smattering of others. All GGF activities are open; anyone is welcome to participate (visit www.gridforum.org). The “output” of Global Grid Forum activities is documents relating to Grid standards.
It is important to understand how different levels of standardization promote interoperability: models (or frameworks, architectures) create a human-level common understanding among people; APIs (or interfaces) enable code portability and re-use, but not necessarily interoperability between different code bases; protocols enable interoperability between different code bases, but not necessarily code portability; and languages (or sets of tokens) are a building block for all of the above. For instance, MPI [3] is an example of a standard API, allowing a programmer to write a single parallel program that can run either on a cluster of Linux machines with MPICH or on an IBM SP with IBM’s proprietary MPI implementation. A simple recompilation is all that is required. MPI does not, however, support communication between two MPI programs (one on each of the above systems); that is, it supports code portability, not interoperability. TCP/IP, on the other hand, is a standard protocol that supports interoperability but says nothing about code portability. A program running on Microsoft Windows using the WinSock API can easily communicate with another program running on UNIX that uses the sockets API.
2 Current GGF Scheduling and Resource Management Area Efforts
As of GGF-5, July 2002, the Scheduling Area had two finishing groups, four active groups, and five new groups proposed, with over two hundred people participating. Roughly, these activities fall into the following categories:

Architecture and Overview
• “Ten Actions When Superscheduling” (Architecture, completed Group)
• Scheduling Dictionary Working Group (Language)
• Grid Scheduling (Architecture, proposed Group)
• Grid Economic Brokering (Architecture, proposed Group)

Standards to “run my job”
• Advance Reservation API Working Group (API, completed Group)
• Distributed Resource Management Application API Working Group (API)
• Grid Resource Allocation and Agreement Protocol Working Group (Protocol)

Super-Scheduling
• Scheduling Attributes Working Group (Language)
• Scheduling Optimization (proposed Group)

Basic accounting for interoperability
• Usage Record (Language, proposed Group)
• OGSA Resource Usage Service (Protocol, proposed Group)

The first document prepared by the Scheduling Area was the “Ten Actions When Superscheduling” document [4], led by J. Schopf. This document outlines the steps a user goes through when scheduling across Grid resources; the basic steps are shown in Figure 1. These are grouped into three phases – resource discovery, system selection,
and running a job – and spell out the basic steps of scheduling a job. This document is in final review and has been updated and extended for publication in a journal special issue as well [5].
Phase One - Resource Discovery
1. Authorization Filtering
2. Application Definition
3. Min. Requirement Filtering

Phase Two - System Selection
4. Information Gathering
5. System Selection

Phase Three - Job Execution
6. Advance Reservation
7. Job Submission
8. Preparation Tasks
9. Monitoring Progress
10. Job Completion
11. Clean-up Tasks
Fig. 1. Ten steps for superscheduling
Another document in the final review process is the “Advance Reservation API”, by A. Roy and V. Sander [6]. This document defines an experimental API for quality-of-service reservations for different types of resources. It is strongly based on GARA [7].

The active working group on Scheduling Attributes [8], led by U. Schwiegelshohn and R. Yahyapour, is defining a set of attributes of lower-level scheduling instances that can be used by higher-level schedulers to make resource management decisions. The document created by this group [9] is also in the final stages of review.

The Scheduling Dictionary working group [10], led by Wieder and Ziegler, is identifying and defining the terms needed to discuss schedulers. Early on, we observed that each researcher in the area used the same terms in slightly different ways. The goal of this group is to aid interoperability (especially among people working in this field). This group has a draft of its document available online.

The Distributed Resource Management Application API group (DRMAA) [11], led by J. Tollefsrud and H. Rajic, is defining an API for the submission and control of jobs to one or more distributed resource management systems. They plan to present a semi-final draft at GGF-6 in Chicago.
J. Maclaren, V. Sander, and W. Ziegler lead the working group on the Grid Resource Allocation Agreement Protocol [12]. This group is defining the interactions between a higher-level service and a local resource management system; the goal is to facilitate the allocation and reservation of Grid resources. Much of this work is growing out of the SNAP [13] work as well.

At GGF-5 in Scotland, five groups were proposed as part of the Scheduling Area. U. Schwiegelshohn proposed the Grid Scheduling Architecture working group [14]. This group will define an architecture that details the interactions between a Grid scheduler and other components, such as a Grid information system, a local resource management system, and network management systems. This group is awaiting full development of a charter and an assessment of interest.

Three groups related to accounting issues were proposed. The first, and the cornerstone of the others, is the Usage Record working group [15], presented by L. McGinnis. The goal of this group is to define a common accounting usage record (format and contents) to promote the exchange of accounting information between sites. This record is not intended to replace the records currently used at individual sites, but to serve as a common format for exchanging them. The TeraGrid project [16] has identified this as a key need. A second group related to accounting issues is the proposed Grid Economic Service Architecture working group [17], currently being led by S. Newhouse, J. MacLaren, and K. Keahey. This architecture-focused group will define a supporting infrastructure that enables organizations to “trade” services with each other. The infrastructure will include the definition of protocols and service interfaces that will enable the exploration of different economic mechanisms (but not the economic models). A charter for this group is being finalized. The third accounting-focused group is the OGSA Resource Usage Service [18], with proposed chairs S. Newhouse and J. Magowan. To track resource use within OGSA Grid services, a service interface that supports the recording and retrieval of resource usage needs to be developed. The charter for this group is being finalized.

The fifth group proposed was a research group on the topic of Scheduling Optimisation [19], led by V. Di Martino and E. Talbi. This group proposes to define measures of scheduling algorithm performance and to foster the development of Grid-wide scheduling methodology on top of available schedulers.
3 Fruitful Directions – What’s Next?
We expect the UR and DRMAA activities to complete this year and to have a positive impact on the community. The ability to exchange Usage Record (accounting) data between sites participating in Grid activities is a fundamental prerequisite to achieving acceptance and commitment of resources from both the funding agencies and the resource owners. The proposed UR group already has active participation from the TeraGrid, NASA’s IPG, and industry. DRMAA will greatly ease the applications programmer’s use of resource management systems and will foster third-party Grid-enabled commercial products. DRMAA has strong industry participation (including Sun, Intel, Veridian, Cadence, and HP).
Looking outside the current activities within the GGF, we believe the following would be fruitful directions:
• Language for resource and job specification – many different languages exist today; a standard language to promote interchange between existing systems would enable easier job migration among these distinct systems.
• API for scheduling (especially for superscheduler-scheduler interaction) – not only would this ease the implementation of superschedulers, but it would also enable “research” schedulers to be plugged into production environments for real-world experience.
• Language for describing site-specific scheduling policy and requirements – tuning any scheduling system is a complicated, iterative process; a standard language would allow one to duplicate policies at different sites, each using its own resource management system, and, in the longer term, would allow a superscheduler to reason about site policies.
• Agreements on resource fungibility – to enable economy-based trading of resources. (The proposed GESA working group may attack this topic.)
• Work on Grid-level policy management across scheduling systems.

The best standards build on existing work. Over the next ten years, we expect a snowball effect as the work coming out of the Global Grid Forum excites the community to explore new directions.
4 How to Become Involved
GGF participants come from over two hundred organizations in over thirty countries, with financial and in-kind support coming from GGF Sponsor Members including commercial vendors and user organizations as well as academic and federal research institutions. Anyone interested in Grid computing, or in the Global Grid Forum activities specifically, is welcome to participate in a GGF meeting or event. To join the GGF Scheduling and Resource Management Area mailing list, please send mail to [email protected] with the message “subscribe sched-wg”.
Acknowledgments We thank everyone involved in the GGF Scheduling and Resource Management Area. This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under contract W-31-109-Eng-38, and by NAS NASA Ames Research Center.
References

[1] The Global Grid Forum, www.gridforum.org
[2] The Internet Engineering Task Force, www.ietf.org
[3] Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, Sept. 2001, http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html
[4] Schopf, J. M., “Super Scheduler Steps/Framework”, Global Grid Forum Scheduling Area Working Document SchedWD 8.5, http://www-unix.mcs.anl.gov/~schopf/ggf-sched/WD/schedwd.8.5.pdf
[5] Schopf, J. M., “A General Architecture for Scheduling on the Grid”, submitted to a special issue of the Journal of Parallel and Distributed Computing on Grid Computing, 2002. Available from http://www-unix.mcs.anl.gov/~schopf/Pubs/sched.arch.jpdc.pdf
[6] Roy, A., and Sander, V., “Advanced Reservation API”, Scheduling Area Working Document SchedWD 9.4, http://www-unix.mcs.anl.gov/~schopf/ggf-sched/WD/schedwd.9.4.pdf
[7] Roy, Alain, “End-to-End Quality of Service for High-End Applications”, Ph.D. dissertation, University of Chicago, Department of Computer Science, August 2001.
[8] Global Grid Forum, Scheduling Attributes Working Group, Scheduling and Resource Management Area, http://ds.e-technik.uni-dortmund.de/~yahya/ggf-sched/WG/sa-wg.html
[9] Schwiegelshohn, U. and Yahyapour, R., “Attributes for Communication about Scheduling Instances”, Scheduling Area Working Document SchedWD 10.5, http://www-unix.mcs.anl.gov/~schopf/ggf-sched/WD/schedwd.10.5.pdf
[10] Global Grid Forum, Scheduling Dictionary Working Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/sd-wg.html
[11] Global Grid Forum, Distributed Resource Management Application API Working Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/drmaa-wg.html
[12] Global Grid Forum, Grid Resource Allocation Agreement Protocol Working Group, Scheduling and Resource Management Area, http://people.man.ac.uk/~zzcgujm/GGF/graap-wg.html
[13] Czajkowski, K., Foster, I., Kesselman, C., Sander, V., and Tuecke, S., “SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems”, 8th Workshop on Job Scheduling Strategies for Parallel Processing, Edinburgh, Scotland, July 2002.
[14] Global Grid Forum, Proposed Grid Scheduling Architecture Working Group, Scheduling and Resource Management Area, http://ds.e-technik.uni-dortmund.de/~yahya/ggf-sched/WG/arch-rg.html
[15] Global Grid Forum, Proposed Usage Record Working Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/ur-wg.html
[16] The TeraGrid Project, www.teragrid.org
[17] Global Grid Forum, Proposed Grid Economic Brokering Architecture Working Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/geba-wg.html
[18] Global Grid Forum, Proposed OGSA Resource Usage Service Working Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/rus-wg.html
[19] Global Grid Forum, Proposed Scheduler Optimization Research Group, Scheduling and Resource Management Area, http://www.mcs.anl.gov/~jms/ggf-sched/WG/opt-wg.html
Author Index
Arpaci-Dusseau, Andrea 103
Bucur, Anca I. D. 184
Castaños, José G. 38
Chiang, Su-Hui 103
Clement, Mark J. 24
Czajkowski, Karl 153
Epema, Dick H. J. 184
Ernemann, Carsten 128
Foster, Ian 153
Goldenberg, Mark 205
Hamscher, Volker 128
Jackson, David B. 24
Kesselman, Carl 153
Kettimuthu, Rajkumar 55
Krevat, Elie 38
Lawson, Barry G. 72
Lu, Paul 205
Mahood, Carrie L. 88
Moreira, José E. 38
Nitzberg, Bill 229
Pinchak, Christopher 205
Sadayappan, Ponnuswamy 55
Sander, Volker 153
Schopf, Jennifer M. 229
Smirni, Evgenia 72
Snell, Quinn O. 24
Srinivasan, Srividya 55
Streit, Achim 1
Subramani, Vijay 55
Tuecke, Steven 153
Vernon, Mary K. 103
Ward, William A., Jr. 88
West, John E. 88
Yahyapour, Ramin 128