... there exists a system configuration E with D ~ E. (The proof is omitted because of lack of space.) Provided that type checking succeeds for an initial system configuration, if an object has a nonempty buffer of received messages, the execution of this object cannot be blocked. In particular, the next message in the buffer is acceptable. Theorem 1 also implies that separate compilation is supported: if a process pi in C is replaced with another p'i satisfying the conditions, the consequences of the theorem still hold.
0 ≤ π^T_l(J) − p^min_l − N_l · (p^max_l − p^min_l + 1) ≤ p^max_l − p^min_l
One may note that N_l gives the block number for T(J) with respect to the l-th dimension. The previous system is effectively linear only if π^T and p^max − p^min are constant vectors. Computation of a parametric solution in other cases is left for future work but it is possible to obtain a result by asking the user to provide the values of some key parameters. Integer program (7) may so be rewritten with only linear constraints. For our program example MatInit, the template is distributed on the processor grid using two CYCLIC patterns, hence the distribution is defined by:
θ^T(j1, j2) = (j1, j2),   π^T = (8, 8)

The integer problem to solve is lexmin(C(n, m), S(i1, i2, j1, j2)) with

C(n, m) = { 1 ≤ j1 ≤ n ∧ 1 ≤ j2 ≤ m ∧ j1 − 1 = 8·q1 + p1 − 1 ∧ j2 − 1 = 8·q2 + p2 − 1 ∧ q1 ≥ 0 ∧ q2 ≥ 0 ∧ q1' ≥ 0 ∧ q2' ≥ 0 }
One may note that, in this example, there is no variable representing the remainder in the division by θ^T (as R in (5.2)) since the block sizes are equal to 1. The result of PIP is that there exists a solution not equal to ⊥ in the polyhedron defined by:

G(n, m) = { 1 ≤ j1 ≤ n ∧ 1 ≤ j2 ≤ m ∧ i1 = j1 ∧ 1 ≤ i2 ≤ m ∧ j2 − 1 = 8·q ∧ q ≥ 0 }
The new parameter q is used to express that j2 − 1 must be a multiple of 8.
5.3 Counting the Elements of the Communication Set
The last stage of our method consists in counting the number of integer vectors in the sub-domains computed by PIP. These parametric sub-domains D(I, J, P) are defined as a function of a parameter J representing the subscript of a general template element T(J), of a parameter I which represents the subscript of an operation storing a value on T(J), and of a vector of program parameters P. We need to compute the number of integer vectors in D(I, J, P) as a function of the program parameters. Fortunately, Loechner and Wilde have extended the polyhedron library to include a function able to count the number of integer vectors in a parametric polyhedron (see [4]). Like PIP, this function splits the parameter space into sub-domains on which the result can be given by a parametric expression. Hence, to apply relation (6) one has to implement an addition on Quasts. For the example MatInit the final result (in the context n ≥ 1 and m ≥ 1) is
Count(C(n, m)) − Count(G(n, m)) = (7/8)·n·m − [0, 7/8, 3/4, 5/8, 1/2, 3/8, 1/4, 1/8]_m · n

The brackets in the previous expression denote a periodic number: if we denote by v the vector (0, 7/8, 3/4, 5/8, 1/2, 3/8, 1/4, 1/8), the value of the periodic number is v_{m%8}. When the parameter m is a multiple of 8 we have the expected result of (7/8)·n·m atomic communications at the template level.
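As a quick sanity check of the closed form above, the sketch below compares it with a direct count, under the assumption (made in our reconstruction) that the non-local template elements are exactly those T(j1, j2) for which j2 − 1 is not a multiple of 8, i.e. n·m − n·⌈m/8⌉ elements; the function names are ours, not part of the original method.

```python
# Hedged check: the periodic-number formula vs. a direct count, assuming the
# local template columns are those with (j2 - 1) mod 8 == 0.
from math import ceil

V = [0, 7/8, 3/4, 5/8, 1/2, 3/8, 1/4, 1/8]   # periodic number v, indexed by m % 8

def count_formula(n, m):
    return 7/8 * n * m - V[m % 8] * n        # the closed form above

def count_direct(n, m):
    return n * m - n * ceil(m / 8)           # total elements minus local ones

for n in (1, 3, 10):
    for m in range(1, 41):
        assert abs(count_formula(n, m) - count_direct(n, m)) < 1e-9
print("closed form matches the direct count")
```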
For a detailed description of how the pre-evaluation can be done in an automatic way, see the report available at the URL ftp://ftp.lifl.fr/pub/reports/AS-publi/an98/as-182.ps.gz
6 Conclusion
Data partitioning is a major performance factor in HPF programs. To help the programmer design a good data distribution strategy, we have studied the evaluation of the communication cost of a program during the writing of this program. We have presented here a method to compute the communication volume of an HPF program. This method is based on the polyhedral model, so we are able to handle loop nests with affine loop bounds and affine array access functions. Our method is parameterized and machine independent; indeed, everything is done at the language level. An implementation has been done using the polyhedral library and the PIP software. Ongoing work includes extending this method to a larger class of programs and adding compiler optimizations to the model. The last point is quite important since the pure counting of elements exchanged is only one of the factors in the actual communication costs. We will have to recognize special communication patterns such as broadcasts, which can be implemented more efficiently than general communications. We are also integrating this method in the HPF-builder tool [6].
References
1. Jennifer M. Anderson and Monica S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. ACM Sigplan Notices, 28(6):112-125, June 1993.
2. Vincent Bouchitte, Pierre Boulet, Alain Darte, and Yves Robert. Evaluating array expressions on massively parallel machines with communication/computation overlap. International Journal of Supercomputer Applications and High Performance Computing, 9(3):205-219, 1995.
3. S. Chatterjee, J. R. Gilbert, R. S. Schreiber, and S.-H. Teng. Automatic array alignment in data-parallel programs. In Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 16-28, Charleston, South Carolina, January 1993. ACM Press.
4. Philippe Clauss, Vincent Loechner, and Doran Wilde. Deriving formulae to count solutions to parameterized linear systems using Ehrhart polynomials: Applications to the analysis of nested-loop programs. Technical Report RR 97-05, ICPS, April 1997.
5. Fabien Coelho. Contributions to HPF Compilation. PhD thesis, Ecole des mines de Paris, October 1996.
6. Jean-Luc Dekeyser and Christian Lefebvre. HPF-builder: A visual environment to transform Fortran 90 codes to HPF. International Journal of Supercomputing Applications and High Performance Computing, 11(2):95-102, 1997.
7. Thomas Fahringer. Compile-time estimation of communication costs for data parallel programs. Journal of Parallel and Distributed Computing, 39(1):46-65, November 1996.
8. Paul Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22:243-268, September 1988.
9. Paul Feautrier. Towards automatic distribution. Parallel Processing Letters, 4(3):233-244, 1994.
10. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, College of Engineering, University of Illinois at Urbana-Champaign, September 1992.
11. Manish Gupta and Prithviraj Banerjee. Compile-time estimation of communication costs of programs. Journal of Programming Languages, 2(3):191-225, September 1994.
12. Ken Kennedy and Ulrich Kremer. Automatic data layout for High Performance Fortran. In Sidney Karin, editor, Proceedings of the 1995 ACM/IEEE Supercomputing Conference, San Diego, CA, December 3-8, 1995. ACM Press and IEEE Computer Society Press.
13. Kathleen Knobe, Joan D. Lukas, and Guy L. Steele. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8:102-118, 1990.
14. Jingke Li and Marina Chen. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In Frontiers 90: The 3rd Symposium on the Frontiers of Massively Parallel Computation, College Park, MD, October 1990.
15. S. Wholey. Automatic data mapping for distributed-memory parallel computers. In Proceedings of the 1992 International Conference on Supercomputing, pages 25-34, Washington, DC, July 1992. ACM Press.
16. Doran Wilde. A library for doing polyhedral operations. Master's thesis, Oregon State University, Corvallis, Oregon, December 1993.
Modeling the Communication Behavior of Distributed Memory Machines by Genetic Programming*

L. Heinrich-Litan, U. Fissgus, St. Sutter, P. Molitor, and Th. Rauber

Computer Science Institute, Department for Mathematics and Computer Science, Martin-Luther-Universität Halle-Wittenberg, D-06099 Halle (Saale), Germany, <name>@informatik.uni-halle.de

* This work has been supported in part by DFG grant Ra 524/5 and Mo 645/5.
Abstract. Due to load imbalance and communication overhead, the runtime behavior of distributed memory machines is very complex. The contribution of this paper is to show that runtime functions predicting the execution time of communication operations can be generated by means of the genetic programming paradigm. The runtime functions generated dominate those presented in the literature to date.
1 Introduction
Distributed memory machines (DMMs) provide large computing power which can be exploited for solving large problems or computing solutions with high accuracy. Nevertheless, DMMs are still not broadly accepted. One of the main reasons is the costly development process for a specific parallel algorithm on a specific DMM. This is due to the fact that parallel algorithms on DMMs may show a complex runtime behavior caused by communication overhead and load imbalance. Thus, there is considerable research effort to model the performance of DMMs. This includes modeling the runtimes of communication operations with parametrized formulas [6, 5, 4, 8, 1]. The modeling of the execution time of communication operations can be used in compiler tools, in parallelizing compilers or simply as concise information of the performance behavior for the application programmer. Genetic algorithms (GAs) are stochastic optimization techniques which simulate the natural evolutionary process of beings. They often outperform conventional optimization methods when applied to difficult problems. We refer to [2] for an overview on complex problems in the area of industrial engineering, where GAs have been successfully applied. The genetic programming approach (GP) has been introduced by Koza [7] and is a special form of a genetic algorithm. The contribution of this paper is to show that runtime functions of high quality, which model the execution time of communication operations, can be modeled by the genetic programming approach. To illustrate the effectiveness of the approach we have chosen [8] for comparison. In [8], collective communication
operations from the communication libraries PVM and MPI are investigated and compared. Their execution time is modeled by runtime functions found by curve fitting. In this paper we demonstrate the effectiveness of GPs to model the execution time of communication operations on DMMs, by example. In particular, we demonstrate that GP generates runtime functions of higher quality than those published in the literature so far. The paper is structured as follows. Section 2 gives a brief overview of the investigations made by [8]. Section 3 presents how performance modeling can be attacked by GPs. Experimental results are shown and discussed in Section 4.

2 Runtime Functions Generated by Curve Fitting
First, we give an overview on the communication operations which we use in our experiments. Then we present the model of the communication behavior on the IBM SP2 obtained by curve fitting. The specific execution platform that [8] used for their experiments is an IBM SP2 with 37 nodes and 128 MByte main memory per processor from GMD St.Augustin, Germany, with IBM MPI-F, Version 1.41. They investigate and compare different communication operations. Due to space limitation, we only report on two communication operations.
Single-transfer operation: In MPI, the standard point-to-point communication operations are the blocking MPI_Send() and MPI_Recv(). For blocking send, the control does not return to the executing process before the send buffer can be reused safely. Some implementations use an intermediate buffer.
Scatter operation: A single-scatter operation is executed by the global operation MPI_Scatter() which must be called by all participating processes. As effect, the specified root process sends a part of its own local data to each other process.
In [8], the runtimes of MPI communication operations on the IBM SP2 are modeled by runtime formulas found by curve fitting that depend on various machine parameters including the number of processors, the bandwidth of the interconnecting networks, and the startup times of the corresponding operations. The investigations resulted in the parameterized runtime formula f_scatter(p, b) = 10.28·10^-6 + 91.59·10^-6·p + 0.030·10^-6·p·b [μsec] for the scatter operation, and f_single(b) = 211.8·10^-6 + 0.030·10^-6·b [μsec] for the single-transfer operation. The parameters b and p are the message size in bytes and the number of processors, respectively.
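As an illustration, the two curve-fitting formulas can be evaluated directly; the sketch below is ours (including the function names), and the coefficients are those transcribed from the partly garbled formulas above, so they should be treated as approximate.

```python
# Hedged sketch: predicted communication times from the curve-fitting model of [8].
def t_scatter_cf(p, b):
    """Predicted MPI_Scatter() time for p processors and b bytes per process."""
    return 10.28e-6 + 91.59e-6 * p + 0.030e-6 * p * b

def t_single_cf(b):
    """Predicted blocking MPI_Send()/MPI_Recv() time for a b-byte message."""
    return 211.8e-6 + 0.030e-6 * b

if __name__ == "__main__":
    print(t_scatter_cf(32, 1024))   # 32 processors, 1 KB per process
    print(t_single_cf(1024))
```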
3 The Genetic Programming Approach
The usual form of genetic algorithm was described by Goldberg [3]. Genetic algorithms are stochastic search techniques based on the mechanism of natural selection and natural genetics. They start with an initial population, i.e., an initial set of random solutions which are represented by chromosomes. The chromosomes evolve through successive iterations, called generations. During each
generation, the solutions represented by the chromosomes of the population are evaluated using some measures of fitness. To create the next generation, new chromosomes, called offsprings, are formed. The chromosomes involved in the generation of a new population are selected according to their fitness values. Fitter chromosomes have higher probabilities of being selected. After several generations, the algorithm converges to the best chromosome which hopefully represents the optimal or suboptimal solution to the problem. GAs whose chromosomes are syntax trees of formulas, i.e., computer programs, are GPs. In the next sections we present the GP we have applied to the problem of finding runtime functions of high quality modeling the execution time of the MPI communication operations. We use the terminology given in [7].

3.1 The Chromosomes
In genetic programming, a chromosome is an element of the set of all possible combinations of functions that can be composed recursively from a set F of basic functions and a set T of terminals. Each particular function f ∈ F takes a specified number of arguments, specifying the arity of f. The basic functions we allowed in our application are addition (+) and multiplication (·), both of arity 2, and the operations sqr, sqrt, ln, and exp, all of arity 1. The set T is composed of one or two variables p and b (p specifies the number of processors and b specifies the message size), and the constants 1 to 10, 0.1, and 10^-6. We represent each chromosome by the syntax tree corresponding to the expression. In the following, we identify a chromosome with the runtime function which is represented by that chromosome in order to make the diction easier.
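To make the encoding concrete, the following sketch represents chromosomes as nested lists over the function and terminal sets described above; the encoding, helper names and probabilities are our own choices, not the authors'.

```python
# Hedged sketch of a GP chromosome: a syntax tree over F = {+, *, sqr, sqrt, ln, exp}
# and T = {p, b, 1..10, 0.1, 1e-6}, encoded as nested lists, e.g. ["+", "p", ["ln", "b"]].
import math, random

BINARY = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
UNARY  = {"sqr": lambda x: x * x, "sqrt": math.sqrt, "ln": math.log, "exp": math.exp}
TERMINALS = ["p", "b"] + list(range(1, 11)) + [0.1, 1e-6]

def evaluate(tree, p, b):
    """Evaluate a chromosome at a measuring point (p processors, b bytes)."""
    if tree == "p":
        return p
    if tree == "b":
        return b
    if not isinstance(tree, list):
        return tree                       # numeric constant
    op = tree[0]
    args = [evaluate(t, p, b) for t in tree[1:]]
    return BINARY[op](*args) if op in BINARY else UNARY[op](*args)

def random_tree(depth=0, max_depth=15):
    """Grow a random chromosome; only terminals are chosen at the depth limit."""
    if depth >= max_depth or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(BINARY) + list(UNARY))
    arity = 2 if op in BINARY else 1
    return [op] + [random_tree(depth + 1, max_depth) for _ in range(arity)]
```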
3.2 Fitness Function
Fitness is the driving force of Darwinian natural selection and, likewise, of genetic programming. It may be measured in many different ways. The accuracy of the fitness function with respect to the application is crucial for the quality of the results produced by GP. In the application handled here, experiments have shown that using the average percentage error of a chromosome as its fitness yields the best solutions. Of course, as the aim is to minimize the average percentage error, a chromosome is said to be fit if its average percentage error is small. Let Pos be the set of measuring points and m(α) be the measured data at measuring point α; then the percentage error error_f(α) of a runtime function f at point α and the average percentage error error_f of f are given by
error_f(α) = |m(α) − f(α)| · 100 / m(α)   and   error_f = ( Σ_{α ∈ Pos} error_f(α) ) / |Pos|
respectively. There are two problems we have to care about when applying this fitness function. The populations can contain chromosomes whose average percentage error is greater than the maximal decimal value (which is about 1.79·10^308 in GNU g++). In this case, we have to redefine the error of these chromosomes to be some large value. The other problem is that some operations of the syntax trees are not defined, e.g. ln a with a < 0. The fitness function of our GP replaces these faulty operations by constants and adds a penalty to the error. Chromosomes with faulty operations are not taken as the final result of the GP.
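A possible reading of this fitness computation, including the penalty for overflow and undefined operations, is sketched below; the penalty value and the names are ours.

```python
# Hedged sketch of the fitness: average percentage error over the measuring points,
# with a large penalty whenever the candidate formula is undefined or overflows.
HUGE = 1.79e308                     # roughly the largest double, as noted above

def avg_percentage_error(f, measurements, penalty=1.0e6):
    """f: candidate runtime function f(p, b); measurements: ((p, b), measured) pairs."""
    total = 0.0
    for (p, b), m in measurements:
        try:
            err = abs(m - f(p, b)) * 100.0 / m
        except (ValueError, OverflowError, ZeroDivisionError):
            err = penalty           # faulty operation, e.g. ln of a negative argument
        total += min(err, HUGE)
    return total / len(measurements)
```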
3.3 Crossover and Mutation Operators
The crossover operator for GP creates variation in the population by producing new offsprings that consist of parts taken from each parent. It starts with two parental syntax trees selected with error-inverse-proportionate selection, and produces two offspring syntax trees by independently selecting, using a uniform probability distribution, one random point in each parent to be the crossover point for that parent. The offsprings are produced by exchanging the crossover fragments. To direct the algorithm to generate smooth runtime functions we restricted the solution region to syntax trees of depth up to a constant c. The mutation operator introduces random changes in structures randomly selected from the population. It begins by selecting a point within the syntax tree at random. Then, it removes the fragment which is below this mutation point and inserts a randomly generated subtree at that point. The depth of the new syntax tree is also limited by c.
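The two operators can be sketched as follows for the nested-list tree encoding used earlier; the path-based helpers are our own, and mutate expects a tree generator such as the random_tree function shown before, so only the depth limit c from the text is kept as a parameter.

```python
# Hedged sketch of subtree crossover and mutation on trees such as ["+", "p", ["ln", "b"]].
import copy, random

def paths(tree, path=()):
    """Enumerate the path of every node in the tree."""
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, path + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    if not path:
        return copy.deepcopy(subtree)
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(subtree)
    return new

def crossover(a, b):
    """Swap one randomly chosen subtree between the two parents."""
    pa, pb = random.choice(list(paths(a))), random.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))

def mutate(tree, random_tree, max_depth=15):
    """Replace a randomly chosen subtree by a freshly grown one (depth still bounded)."""
    p = random.choice(list(paths(tree)))
    return put(tree, p, random_tree(len(p), max_depth))
```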
3.4 Initialization and Termination
There are several methods to initialize the GP. At the beginning of our experiments we generated each chromosome of the first population recursively top down. The semantics of a node v of depth less than c is taken from F ∪ T at random. If a node has depth c, the semantics is randomly taken from the set T of terminals. Experiments showed that the results are better if we generate a part of the initial chromosomes by approximation, and the rest of them as described above. These initial approximation functions represent linear dependencies between the measuring points and the measured data. In our experiments, the GP stops after having considered a predefined number of generations.

4 Experimental Results
We begin by quantifying the quality of the runtime functions given in [8]. Then we compare these runtime functions to those generated by GP and discuss the results. We close the section by presenting a runtime formula generated by GP. We evaluated the runtime functions given in [8] (see Section 2) by using the fitness function specified in Section 3.2. The test set of measured data contained communication times for p = 4, 8, 16, 32 processors and message sizes between 4 and 800 bytes with stepsize 20, between 800 and 4,000 with stepsize 400, and between 4,000 and 240,000 with stepsize 4,000.
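The set Pos of measuring points can be generated along the lines below; the exact endpoints are our guess from the description, since the text does not say whether the interval bounds are included.

```python
# Hedged sketch of the measuring-point grid described above.
def measuring_points():
    sizes = (list(range(4, 801, 20))            # 4 .. 800 bytes, step 20
             + list(range(800, 4001, 400))      # 800 .. 4,000 bytes, step 400
             + list(range(4000, 240001, 4000))) # 4,000 .. 240,000 bytes, step 4,000
    return [(p, b) for p in (4, 8, 16, 32) for b in sorted(set(sizes))]

print(len(measuring_points()))
```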
Table 1. Deviations of the runtime functions

Operation        Approach        error_f | % of α ∈ Pos with error_f(α) < x
                                         | <0.01  <0.1   <1    <10   <25   <50   <100
scatter          curve fitting    57.36  |  0.0    0.6    2.6  23.8  49.0  72.5   83.4
scatter          GP                9.68  |  0.0    2.3   14.2  62.9  93.7  97.7  100.0
single-transfer  curve fitting    71.90  |  0.0    1.2   20.9  47.4  59.4  67.4   73.4
single-transfer  GP               14.64  |  0.3    1.0   17.6  59.6  75.6  93.0  100.0
To generate runtime functions by GP, we have to set some parameters. By setting the size of each population to 50, the maximum number of generations to 12000, the maximum depth of syntax trees to 15, and the probabilities for applying crossover, mutation, and the copy operation to 80%, 10% and 10% respectively, we obtained runtime functions by GP which dominate by far the runtime functions from [8] obtained by curve fitting. In Table 1 we compare the deviations of the runtime functions found by curve fitting to the deviations of the best runtime functions generated after having run the GP algorithm 3 times, which takes about 6 hours on a SUN SPARCstation Ultra 1 with 128 MByte RAM. The first column specifies the communication operation, the second column the approach used and the third column the average percentage error of the corresponding runtime function. Columns 4 to 10 give the percentage of measuring points α ∈ Pos for which the percentage error error_f(α) is less than x for x = 0.01, 0.1, 1, 10, 25, 50 and 100, respectively. Figure 1 graphically compares the deviation of the runtime function from [8] and the deviation of the runtime function generated by GP for the scatter operation with respect to some set of measured data. The curves differ very much for small values of parameter b, whereby the curve generated by GP dominates by far the curve found by curve fitting. For large parameter values, both runtime functions predict the execution time of the scatter operation roughly alike. Figure 2 shows the comparison of the deviations of the runtime functions found by curve fitting and GP, respectively, for the single-transfer operation (which depends only on parameter b) with respect to several sets of measured data. Figure 2 shows only the cut-out determined by the small message sizes (up to 500 byte). Let us discuss the results. The disadvantage of curve fitting is that usually only one simple function is selected to model a communication operation for a large range of message sizes. This may lead to poor results in some regions since often different methods are used for different message sizes; GP on the other hand allows for generating complex functions. We close the section by presenting the runtime formula f_scatter^GP predicting the execution time of the scatter operation which has been generated by GP:
f_scatter^GP(p, b) = 10^-6 · (b · p · (p^(1/2) − 1)^(1/4))^(1/4) · (b·p·0.000625 + 0.00005·b + 0.00375·p^2 + 0.003·p + ln(e^(p/2)) + 49)^(1/2).

It shows that the runtime formulas generated by GPs are rather complex. In our opinion, this is the only disadvantage of the approach we presented.
Fig. 1. Deviation of the runtime function found by curve fitting and deviation of the runtime function generated by GP for the scatter operation.
Fig. 2. Deviation of the runtime function obtained by curve fitting and deviation of the runtime function generated by GP for the single-transfer operation.

5 Conclusions
We showed by example that genetic programming provides runtime functions predicting the execution time of communication operations of DMMs which dominate the runtime functions found by curve fitting. The next step of our investigations is to automatically generate different runtime functions for different intervals of the parameters. The intervals themselves have to be determined by genetic programming, too. Using such non-uniform runtime functions seems to be necessary to predict the execution times more accurately, as DMMs use various communication protocols depending on the message size.
References
1. Foschia, R., Rauber, Th., and Rünger, G.: Prediction of the Communication Behavior of the Intel Paragon. In Proceedings of the 1997 IEEE MASCOTS Conference, pp. 117-124, 1997.
2. Gen, M., and Cheng, R.: Genetic Algorithms & Engineering Design. John Wiley & Sons, Inc., New York, 1997.
3. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, Reading, MA, 1989.
4. Hu, Y., Emerson, D., and Blake, R.: The communication performance of the Cray T3D and its effect on iterative solvers. Parallel Computing, 22:829-844, 1996.
5. Hwang, K., Xu, Z., and Arakawa, M.: Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing. IEEE Transactions on Parallel and Distributed Systems, 7(2):522-536, 1996.
6. Johnson, L.: Performance Modeling of Distributed Memory Architecture. Journal of Parallel and Distributed Computing, 12:300-312, 1991.
7. Koza, J.: Genetic Programming. The MIT Press, 1992.
8. Rauber, Th., and Rünger, G.: PVM and MPI Communication Operations on the IBM SP2: Modeling and Comparison. In Proceedings of HPCS 97, 1997.
Representing and Executing Real-Time Systems

Rafael Ramirez

National University of Singapore, Information Systems and Computer Science Department, Lower Kent Ridge Road, Singapore 119260, [email protected]
A b s t r a c t . In this paper, we describe an approach to the representation, specification and implementation of real-time systems. The approach is based on the notion of concurrent object-oriented systems where processes are represented as objects. In our approach, the behaviour of an object (its safety properties and time requirements) is declaratively stated as a set of temporal constraints among events which provides great advantages in writing concurrent real-time systems and manipulating them while preserving correctness. The temporal constraints have a procedural interpretation that allows them to be executed, also concurrently. Concurrency issues and time requirements are separated from the code, minimizing dependency between application functionality and concurrency/timing control.
1 Introduction
Parallel computers and distributed systems are becoming increasingly important. Their impressive computation to cost ratios offer a considerably higher performance than that possible with sequential machines. Yet there are few commercial applications written for them. The reason is that programming in these environments is substantially more difficult than programming for sequential machines, in respect of both correctness (to achieve correct synchronization) and efficiency (to minimize slow interprocess communication). While in traditional sequential programming the problem is reduced to making sure that the program's final result (if any) is correct and that the program terminates, in concurrent programming it is not necessary to obtain a final result but to ensure that several properties hold during program execution. These properties are classified into safety properties, those that must always be true, and progress (or liveness) properties, those that must eventually be true. Partial correctness and termination are special cases of these two properties. To make things even worse, there exists a wide variety of parallel architectures and a corresponding variety of concurrent programming paradigms. For most problems, it is not possible to envisage a general concurrent algorithm which is well suited to all parallel architectures. Real-time systems are inherently concurrent systems which, in addition to usually requiring synchronisation and communication both with their environment and within their own components, need to execute under timing constraints. We
propose an approach which aims to reduce the inherent complexity of writing concurrent real-time systems. Our approach consists of the following:
1. In order to incorporate the benefits of the declarative approach to concurrent real-time programming, it is necessary that a program describe more than its final result. We propose a language based on classical first-order logic in which all safety properties and time requirements of programs are declaratively stated.
2. Programs are developed in such a way that they are not specific to any particular concurrent programming paradigm. The proposed language provides a framework in which algorithms for a variety of paradigms can be expressed, derived and compared.
3. Applications can be specified as objects, which provides encapsulation and inheritance. The language object-oriented features produce structured and understandable programs, therefore reducing the number of potential errors.
4. Concurrency issues and timing requirements are separated from the rest of the code, minimizing dependency between application functionality and concurrency control. In addition, concurrency issues and timing requirements can also be independently specified. Thus, it is in general possible to test different synchronisation schemes without modifying the timing requirements and vice versa. Also, a synchronisation/time scheme may be reused by several applications. This provides great advantages in terms of program flexibility, reuse and debugging.

2 Related Work
This work is strongly related to Tempo ([8]) and Tempo++ ([17, 16]). Tempo is a declarative concurrent programming language based on first-order logic. It is declarative in the sense that a Tempo program is both an algorithm and a specification of the safety properties of the algorithm. Tempo++ extends Tempo (as presented in [8]) by adding numbers and data structures (and operations and constraints on them), as well as supporting object-oriented programming. Both languages explicitly described processes as partially ordered set of events. Events are executed in the specified order, but their execution times are only implicit. Here, we extend Tempo++ to support real-time by making event execution times explicit as well as allowing the specification of the timing requirements in terms of relations among these execution times. Concurrent logic programming is also an important influence for this work. This is because concurrent logic programming languages (e.g. Parlog [6], KL1 [22]) and their object-oriented extensions (e.g. Polka [4], Vulcan [10]) preserve many of the benefits of the abstract logic programming model, such as the logical reading of programs and the use of logical terms to represent data structures. However, although these languages preserve many benefits of the logic programming model, and their programs explicitly specify their final result, important program properties, namely safety and progress properties, remain implicit. These properties have to be preserved by using control features such as
modes and sequencing, producing programs with little or no declarative reading [17]. In addition, traditional concurrent logic programming languages do not provide support for real-time programming and thus are not suitable for this kind of applications. Concurrent constraint programming languages [19] and their object-oriented and real-time extensions (e.g. [21, 20]) suffer from the same problems as concurrent logic languages. Program safety and progress properties remain implicit. These properties are preserved by checking and imposing value constraints on shared variables in the store. Also, there is no clear separation of programs' concurrency control, timing requirements and application functionality. In addition, the concurrent constraint model is most naturally suited for shared memory architectures, not being easily adapted to model distributed programming systems. Our work is also related to specification languages for concurrent real-time systems. It is closer to languages based on temporal logic, e.g. Unity [3] and TLA [13], and real-time extensions of temporal logic, e.g. [12], than process algebras [9, 14] and real-time extensions of state-transition formalisms [2, 5]. Unity and TLA specifications can express the safety and progress properties of concurrent systems but they are not executable. Both formalisms model these systems by sequences of actions that modify a single shared state which might be a bottleneck if the specifications were to be executed. Section 3 introduces the core concepts of our approach to real-time programming, namely events, precedence constraints and real-time, and outlines a methodology for the development of concurrent real-time systems (to make the paper self-contained there is a small overlap with [8] in Sections 3.1 and 3.3). Section 4 describes how these concepts can be extended by adding practical programming features such as data structures and operations on them as well as supporting object-oriented programming. Finally, Section 5 summarizes our approach and its contributions as well as some areas of future research.
3 Events, Constraints and Real-Time

3.1 Events and Precedence
Many researchers, e.g. [11, 15], have proposed methods for reasoning about temporal phenomena using partially ordered sets of events. Our approach to the specification and implementation of real-time systems is based on the same general idea. We propose a language in which processes are explicitly described as partially ordered sets of events. The event ordering relation X < Y, read as "X precedes Y", is the main primitive predicate in the language (there are only two primitive predicates); its domain is the set of events, and it is defined by the following axioms (the last axiom is actually a corollary of the other three):
∀X ∀Y ∀Z (X < Y ∧ Y < Z → X < Z)
∀X ∀Y (time(Y, eternity) → X < Y)
∀X (X < X → time(X, eternity))
∀X ∀Y (Y < X ∧ time(Y, eternity) → time(X, eternity))

The meaning of predicate time(X, Value) is "event X is executed at time Value" and eternity is interpreted as a time point that is later than all others. Events are atomically executed in the specified order (as soon as their predecessors have been executed) and no event is executed for a variable that has execution time eternity. Long-lived processes (processes comprising a large or infinite set of events) are specified by allowing an event to be associated with one or more other events: its offsprings. The offsprings of an event E are named E + 1, E + 2, etc., and are implicitly preceded by E, i.e. E < E + N, for all N. The first offspring E + 1 of E is referred to by E+. Syntactically, offsprings are allowed in queries and bodies of constraint definitions, but not in their heads. Long-lived processes may not terminate because body events are repeatedly introduced. Interestingly, an infinitely-defined constraint need not cause nontermination: a constraint all of whose arguments have time value eternity will not be expanded. The time value of an event E can be bound to eternity (as a consequence of the axioms defining "<") by enforcing the constraint E < E. No offsprings of E will be executed. Their time values are known to be bound to eternity since they are implicitly preceded by E and the time value of E is
eternity. In the language, disjunction is specified by the disjunction operator ';' (which has lower priority than ',' but higher than '←'). The clause H ← Cs1; ...; Csn abbreviates the set of clauses H ← Cs1, ..., H ← Csn. In the absence of disjunction, a query determines a unique set of constraints. An interpreter produces any execution sequence satisfying those constraints; it does not matter which one. With disjunction, a single set of constraints must be chosen from among many possible sets, i.e., a single alternative must be selected from each disjunction. In the language described above, processes are explicitly described as partially ordered sets of events. Their behavior is specified as logical formulas which define temporal constraints among a set of events. This logical specification of a process has a well defined and understood semantics and allows for the possibility of employing both specification and verification techniques based on formal logic in the development of concurrent systems.

3.2 Real-Time
Time requirements in the language may be specified by using the primitive predicate time(X, Value). This constrains the execution time of event X by forcing X to be executed at time Value. In this way, quantitative temporal requirements (e.g. maximal, minimal and exact distance between events) can be expressed in the language. For instance, maximal distance between two events E and F may be specified by the constraint max(E, F, N), meaning "event E is followed by event F within N time units", and defined by
max(E, F, N) ← E < F, time(E, Et), time(F, Ft), Ft ≺ Et + N.
where ≺ is the usual less-than arithmetic relationship among real numbers. Thus, maximal distance between families of events, i.e. events and their offsprings, can be specified by the constraint max*(E, F, N), meaning "occurrences of events E, E+, E++, ... are respectively followed by occurrences of events F, F+, F++, ... within N time units", and defined by
max*(E, F, N) ← max(E, F, N), max*(E+, F+, N)

Minimal and exact distance between two events, as well as other common quantitative temporal requirements, may be similarly specified.
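Read operationally, max and max* simply bound the distance between corresponding execution times. The sketch below checks such constraints over recorded time stamps; the trace encoding and names are ours and are only meant to illustrate the intended meaning.

```python
# Hedged sketch: checking max(E, F, N) and max*(E, F, N) over recorded execution times.
def max_ok(t_e, t_f, n):
    """max(E, F, N): F is executed after E and within N time units."""
    return t_e <= t_f <= t_e + n

def max_star_ok(times_e, times_f, n):
    """max*(E, F, N): the k-th occurrence of F follows the k-th occurrence of E
    within N time units, for every k."""
    return all(max_ok(te, tf, n) for te, tf in zip(times_e, times_f))

# Example: sends at 0, 5, 9 and receives at 2, 6, 14 violate max*(S, R, 3).
print(max_star_ok([0, 5, 9], [2, 6, 14], 3))     # prints False
```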
Example 1. Consider an extension to the traditional producer and consumer system where each product cannot be held inside the buffer longer than time t. The system may be represented by the timed Petri net in Figure 1.
Fig. 1. A producer-consumer system with timed buffer

The producer and consumer components of the system may be respectively specified by P <* S, S <* P+ and R <* C, C <* R+, where P and S represent occurrences of events produce and send and R and C represent occurrences of events receive and consume (P+, S+, R+, C+, ... represent later occurrences of these events). The behaviour of the buffer may be specified by
tbuf(S, R, T) ← max(S, R, T), tbuf(S+, R+, T); tbuf(S+, R, T).

where T is the longest time the buffer can hold a product.
3.3 Development of Concurrent Real-Time Systems
In our approach, the specification of the behaviour of the processes in the system is a program, i.e. it is possible to directly execute the specification. Thus, a program P may be transformed into a program that logically implies P (in [7] some transformation rules that can be applied to programs are presented). The derived program is guaranteed to have the same safety properties as the original one, though its progress properties may differ, e.g. one may terminate and the
other not. The program may be incrementally strengthened by introducing timing constraints to specify the system time requirements. Finally, the program can be turned into a concurrent one by grouping constraints into processes. This final step affects neither the safety nor the progress properties of the algorithm, provided that some restrictions are observed.
4 Object-Oriented Programming
Our approach to the specification and implementation of real-time systems is based on an extension to the logic presented in the previous section. The logic is extended by adding data structures and operations on them, by allowing values to be assigned to events for inter process communication and by supporting object-oriented programming. A detailed discussion of these ideas can be found in [17]. Our language supports object-oriented programming by allowing a class to encapsulate a set of constraints, specified by a constraint query, together with the related constraint definitions, specified by a set of clauses, in such a way that it describes a set of potential run-time objects. The constraint query defines a partial order among a set of events and the timing requirements on their execution times, and the constraint definitions provide meaning to the userdefined constraints in the query. Both the query and definitions are local to the class. Each of the class run-time objects corresponds to a concurrent process. The name of a class may include variable arguments in order to distinguish different instances of the same class. Events appearing in the constraint query of an object implicitly belong to that object. If an event is shared between several objects (it belongs to two or more objects), it cannot be executed until it has been enabled by all objects that share it. In order to specify the object computation, actions may be associated with events. The representation and manipulation of data as well as the spawning of new objects is handled by these actions. Our current implementation of the language assumes an action to be a definite goal. In order to execute an event, the goal associated with it (if any) has to be solved first. Objects communicate via shared events' values. Shared events represent communication channels and values assigned to them represent messages. A novel feature of the approach described here is that an object can partially inherit another object, i.e. an object can inherit either another object's temporal constraints or actions. Thus, inheritance of concurrency issues and inheritance of code are independently supported. This allows an object to have its synchronisation scheme inherited from another object while defining its own code, or vice versa, or even inherit its synchronisation scheme and code from two different objects. Our language appears to add to the proliferation of concurrent programming paradigms: processes (objects) communicate via a new medium, shared events. However, our objective is rather to simplify matters by providing a framework in which algorithms for a variety of concurrent programming paradigms can be expressed, derived, and compared. Among the paradigms we have considered are
synchronous message passing, asynchronous message passing and shared mutable variables.
Execution: Our current implementation uses a constraint set CS containing the constraints still to be satisfied, and a try list TL containing the events that are to be tried but have not yet been executed. The interpreter constantly takes events from TL and checks if they are enabled, i.e. if they are not preceded by any other event (according to CS), and the timing constraints on their execution times are satisfiable, in which case they are executed. The order in which the events are tried is determined by the timing constraints in CS.
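A minimal, untimed version of this execution scheme can be sketched as follows; the data structures and names are ours, and the timing-constraint check is omitted for brevity.

```python
# Hedged sketch of the interpreter loop: CS holds outstanding precedence constraints,
# TL the events still to be tried; an event is executed once nothing precedes it.
def run(events, precedences):
    """events: list of event names; precedences: set of (before, after) pairs."""
    cs = set(precedences)                               # constraint set CS
    tl = list(events)                                   # try list TL
    executed = []
    while tl:
        enabled = [e for e in tl if all(after != e for (_, after) in cs)]
        if not enabled:
            break                                       # no enabled event: blocked
        e = enabled[0]
        executed.append(e)
        tl.remove(e)
        cs = {(x, y) for (x, y) in cs if x != e}        # e's successors may now be enabled
    return executed

print(run(["produce", "send", "receive", "consume"],
          {("produce", "send"), ("send", "receive"), ("receive", "consume")}))
```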
5 Conclusions
We have described an approach to the representation, specification and implementation of concurrent real-time systems. The approach is based on the notion of concurrent object-oriented systems where processes are represented as objects. In the approach, each object is explicitly described as a partially ordered set of events and executes its events in the specified order. The partial order is defined by a set of temporal constraints, and object synchronisation and communication are handled by shared events. Object behaviour (safety properties and time requirements) is declaratively stated, which provides great advantages in writing real-time systems and manipulating them while preserving correctness. The specification of the behaviour of an object has a procedural interpretation that allows it to be executed, also concurrently. Our approach can also be used as the basis for a development methodology for concurrent systems. First, the system can be specified in a perspicuous manner, and then this specification may be incrementally strengthened and divided into objects that communicate using the intended target paradigm. An object can partially inherit another object, i.e. an object can inherit either another object's temporal constraints, actions or action definitions. Thus, inheritance of concurrency issues and inheritance of code are independently supported.
Current status. A prototype implementation of the complete language has been written in Prolog, and used to test the code of a number of applications. The discussion of these applications is out of the scope of this paper.
Future work. In the language presented, the action associated with an event can, in principle, be specified in any programming language. Thus, different types of languages, such as the imperative languages, should be considered and their interaction with the model investigated. Events are considered atomic. Instead of being atomic, they could be treated as time intervals during which other events can occur ([1] and [11]). Such events can be further decomposed to provide an arbitrary degree of detail. This could be useful in deriving programs from specifications. Object behaviour is specified as logical formulas which define temporal constraints among a set of events. This logical specification of an object has a well defined and understood semantics and we are planning to look carefully into the
possibility of employing both specification and verification techniques based on formal logic in the development of concurrent real-time systems.
References
1. Allen, J.F. 1983. Maintaining knowledge about temporal intervals. Comm. ACM 26, 11, pp. 832-843.
2. Alur, R., and Dill, D.L. 1990. Automata for modeling real-time systems. In ICALP'90: Automata, Languages and Programming, LNCS 443, pp. 322-335. Springer-Verlag.
3. Chandy, K.M. and Misra, J. 1988. Parallel Program Design. Addison-Wesley.
4. Davison, A. 1991. From Parlog to Polka in Two Easy Steps. In PLILP'91: 3rd Int. Symp. on Programming Language Implementation and LP, Springer LNCS 528, pp. 171-182, Passau, Germany, August.
5. Dill, D.L. 1989. Timing assumptions and verification of finite-state concurrent systems. In CAV'89: Automatic Verification Methods for Finite-state Systems, LNCS 407, pp. 197-212, Springer-Verlag.
6. Gregory, S. 1987. Parallel Logic Programming in PARLOG. Addison-Wesley.
7. Gregory, S. 1995. Derivation of concurrent algorithms in Tempo. In LOPSTR95: Fifth International Workshop on Logic Program Synthesis and Transformation.
8. Gregory, S. and Ramirez, R. 1995. Tempo: a declarative concurrent programming language. Proc. of the ICLP (Tokyo, June), MIT Press, 1995.
9. Hoare, C.A.R. 1985. Communicating Sequential Processes. Prentice Hall.
10. Kahn, K.M., Tribble, D., Miller, M.S., and Bobrow, D.G. 1987. Vulcan: Logical Concurrent Objects. In Research Directions in Object-Oriented Programming, B. Shriver, P. Wegner (eds.), MIT Press.
11. Kowalski, R., and Sergot, M. 1986. A Logic-based Calculus of Events. New Generation Computing, 4, 1, pp. 67-95.
12. Koymans, R. 1990. Specifying real-time properties with metric temporal logic. Real-time Systems, 2, 4, pp. 255-299.
13. Lamport, L. 1994. The temporal logic of actions. ACM Trans. on Programming Languages and Systems, 16, 3, pp. 872-923.
14. Milner, R. 1989. Communication and Concurrency. Prentice Hall.
15. Pratt, V. 1986. Modeling concurrency with partial orders. International Journal of Parallel Programming, 1(15):33-71.
16. Ramirez, R. 1995. Declarative concurrent object-oriented programming in Tempo++. In Proceedings of the ICLP'95 Workshop on Parallel Logic Programming, Japan, T. Chikayama, H. Nakashima and E. Tick (eds.).
17. Ramirez, R. 1996. A logic-based concurrent object-oriented programming language. PhD thesis, Bristol University.
18. Ramirez, R. 1996. Concurrent object-oriented programming in Tempo++. In Proceedings of the Second Asian Computing Science Conference (Asian'96), Singapore. LNCS 1179, pp. 244-253. Springer-Verlag.
19. Saraswat, V. 1993. Concurrent constraint programming languages. PhD thesis, Carnegie-Mellon University, 1989. Revised version appears as Concurrent constraint programming, MIT Press, 1993.
20. Saraswat, V. et al. 1993. Programming in timed concurrent constraint languages. In Constraint Programming - Proceedings of the 1993 NATO ACM Symposium, pp. 461-410. Springer-Verlag.
21. Smolka, G. 1995. The Oz programming model. Lecture Notes in Computer Science Vol. 1000, Springer-Verlag, pp. 324-243.
22. Ueda, K. and Chikayama, T. 1990. Design of the kernel language for the parallel inference machine. Computer Journal 33, 6, pp. 494-500.
Fixed Priority Scheduling of Age Constraint Processes

Lars Lundberg

Department of Computer Science, University of Karlskrona/Ronneby, S-372 25 Ronneby, Sweden, [email protected]
Abstract. Real-time systems often consist of a number of independent processes which operate under an age constraint. In such systems, the maximum time from the start of process Li in cycle k to the end in cycle k+1 must not exceed the age constraint Ai for that process. The age constraint can be met by using fixed priority scheduling and periods equal to Ai/2. However, this approach restricts the number of process sets which are schedulable. In this paper, we define a method for obtaining process periods other than Ai/2. The periods are calculated in such a way that the age constraints are met. Our approach is better in the sense that a larger number of process sets can be scheduled compared to using periods equal to Ai/2.
1 Introduction
Real-time systems often consist of a number of independent periodic processes. These processes may handle external activities by monitoring sensors and then producing proper outputs within certain time intervals. A similar example is a process which continuously monitors certain variables in a database. When these variables or sensors change, the system has to produce certain outputs within certain time intervals. These outputs must be calculated from input values which are fresh, i.e. the age of the input value must not exceed certain time limits. The processes in these kinds of systems operate under the age constraint. The age constraint defines a limit on the maximum time from the point in time when a new input value appears to the point in time when the appropriate output is produced. Figure 1 shows a scenario where a value Ei appears shortly after process Li has started its k:th cycle (denoted Lik). Process Li starts its execution by reading the sensor or variable. Consequently, Ei will not affect the output in cycle k. The output Fi corresponding to value Ei (or fresher) is produced at the end of cycle k+1. A scheduling scheme can be either static or dynamic. In dynamic schemes the priority of a process is decided at run-time, e.g. the earliest deadline algorithm [3]. In static schemes, processes are assigned a fixed priority, e.g. the rate-monotone
algorithm [4]. Fixed priority scheduling is relatively easy to implement and it requires less overhead than dynamic schemes. Most studies in this area have looked at scenarios where the computation time and the period of a process are known. However, for age constraint processes the period is not known. We have instead defined a maximum time between the beginning of the process' execution in cycle k and the end of the process' execution in cycle k+1. We would like to translate this restriction into a period for the process, thus making it possible to use fixed priority schemes. The age constraint is met by specifying a process period T'i = Ai/2 (we will use the notation Ti for other purposes), thus obtaining a set of processes with known periods T'i and computation times Ci. For such scenarios, it is well known that rate-monotone scheduling is optimal, and a number of schedulability tests have been obtained [4]. Specifying a process period T'i = Ai/2 for age constraint processes is, however, an unnecessarily strong restriction, which does not allow that the start of Li in cycle k and the end in cycle k+2 may be separated by a time greater than 3T'i, whereas the age constraint allows a separation of up to 2Ai = 4T'i.
In this paper we show that, by using rate-monotone scheduling, it is possible to define periods which are better than using periods T'i = Ai/2. Our method is better in the sense that we will be able to schedule a larger number of process sets than using T'i = Ai/2.
(Figure 1 shows the age constraint Ai and the computation time Ci of process Li on a time axis, together with the arrival of the input value Ei and the production of the corresponding output Fi.)
Fig. 1. The age constraint for process Li.
2 Calculating process periods
Consider a set of n processes L = [L1, L2, ..., Ln], with associated age constraints Ai and computation times Ci. The priority of each process is defined by its age constraint Ai. The smaller the value Ai, the higher the priority of Li, i.e. the priority order is the same as for rate-monotone scheduling with T'i = Ai/2. We assume preemptive scheduling and we order the processes in such a way that Ai ≤ Ai+1.
We start by considering L1. This process has the highest priority, and is thus not interrupted by any other process. We want to select as long periods as possible, thus minimizing processor utilization. Obviously, the period of a process must not exceed Ai − Ci. Since L1 has the highest priority, it is safe to set the period of L1 to A1 − C1. In order to distinguish our periods from the ones which are simply equal to half the age constraint Ai, we denote our periods as Ti, i.e. T1 = A1 − C1.

We now consider process L2. The maximum response time R2 of process L2 is defined as the maximum time from the release of L2 in cycle k to the time that L2 completes in the same cycle. If the execution of L2 is not interrupted by any other process, the response time is simply equal to the computation time C2, i.e. R2 = C2. However, the execution of L2 may be interrupted by L1. Process L1 may in fact interfere with as much as ⌈R2/T1⌉·C1 [3]. Consequently, R2 = C2 + ⌈R2/T1⌉·C1. The only unknown value in this equation is R2. The equation is somewhat difficult to solve due to the ceiling function (⌈R2/T1⌉). In general there could be many values of R2 that solve this equation. The smallest such value represents the worst-case response time for process L2. It has been shown that R2 can be obtained from this equation by forming a recurrence relationship. The technique for doing this is shown in [3]. The beginning of L2 in cycle k and the end of L2 in cycle k+1 may be separated by as much as T2 + R2, where T2 is the period that we will assign to process L2. From the age constraint we know that T2 + R2 ≤ A2. In order to minimize processor utilization we would like to select as long a period T2 as possible. Consequently, T2 = A2 − R2.

In general, the maximum response time of process i can be obtained from the relation Ri = Ci + Σ_{j=1..i-1} ⌈Ri/Tj⌉·Cj [3]. When we know Ri, the cycle time for Li is set to Ti = Ai − Ri. Figure 2 shows a set with three processes L1, L2 and L3, defined by A1 = 8, C1 = 2, A2 = 10, C2 = 2, A3 = 12 and C3 = 2. From these values we obtain the period T1 = A1 − C1 = 8 − 2 = 6. The maximum response time R2 for process L2 is obtained from the relation R2 = C2 + ⌈R2/T1⌉·C1 = 2 + ⌈R2/6⌉·2. The smallest value R2 which solves this equation is 4, i.e. R2 = 4. Consequently, T2 = A2 − R2 = 10 − 4 = 6. The maximum response time R3 for process L3 is obtained from the relation R3 = C3 + ⌈R3/T1⌉·C1 + ⌈R3/T2⌉·C2 = 2 + ⌈R3/6⌉·2 + ⌈R3/6⌉·2. The smallest value R3 which solves this equation is 6, i.e. R3 = 6. Consequently, T3 = A3 − R3 = 12 − 6 = 6. In figure 2, the first release of L1 is done at time 2, the first release of L2 is done at time 1 and the first release of L3 is done at time 0. In the worst-case scenario, process Li may suffer from the maximum response time Ri in two consecutive cycles, i.e. in order to meet the age constraint we know that 2Ri ≤ Ai. Consequently, there is no use in selecting a Ti smaller than Ri. However, as long as we obtain Ti which are longer than or equal to Ri, the age constraint will be met. Therefore, process Li can be scheduled if and only if
Ri ≤ Ai/2.
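The calculation can be summarised by the short sketch below, which iterates the response-time relation to a fixed point, derives the periods Ti = Ai − Ri and applies the schedulability test Ri ≤ Ai/2; the code is ours and reproduces the numbers of the example in Figure 2.

```python
# Hedged sketch: response times and periods for age constraint processes.
from math import ceil

def age_constraint_periods(A, C):
    """A: age constraints, C: computation times, both sorted by increasing A."""
    R, T = [], []
    for i in range(len(A)):
        r = C[i]
        while True:                                            # fixed-point iteration
            r_new = C[i] + sum(ceil(r / T[j]) * C[j] for j in range(i))
            if r_new == r or r_new > A[i]:
                r = r_new
                break
            r = r_new
        if r > A[i] / 2:
            raise ValueError(f"process {i + 1} is not schedulable (R = {r})")
        R.append(r)
        T.append(A[i] - r)
    return R, T

print(age_constraint_periods([8, 10, 12], [2, 2, 2]))          # ([2, 4, 6], [6, 6, 6])
```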
(Figure 2 shows the resulting schedule: L1 with A1 = 8, C1 = 2 and T1 = 6, L2 with A2 = 10, C2 = 2 and T2 = 6, and L3 with A3 = 12, C3 = 2 and T3 = 6. S denotes the start and E the end of a computation; the time axis runs from 0 to 18.)
Fig. 2. A set with three processes L1, L2 and L3, defined by A1 = 8, C1 = 2, A2 = 10, C2 = 2, A3 = 12 and C3 = 2.
Theorem 1. A set of processes which is schedulable using rate-monotone priority assignment and T'i = Ai/2 is also schedulable using our scheme.
Proof. The difference between our scheme and rate-monotone with T'i = Ai/2 is that we use different periods Ti. Since the process set is schedulable using the T' periods we know that R'i ≤ T'i (1 ≤ i ≤ n), where R'i denotes the maximum response time using the T' periods. By use of induction, we show that Ri ≤ R'i (1 ≤ i ≤ n).
• R1 = C1 ≤ R'1.
• Assume that Rj ≤ R'j for all j < i. Then Tj = Aj − Rj ≥ Aj − R'j ≥ Aj − T'j = Aj − Aj/2 = Aj/2 = T'j. If T'j ≤ Tj, then ⌈x/Tj⌉ ≤ ⌈x/T'j⌉ for any x ≥ 0, and consequently Ri = Ci + Σ_{j<i} ⌈Ri/Tj⌉·Cj ≤ Ci + Σ_{j<i} ⌈Ri/T'j⌉·Cj, which gives Ri ≤ R'i.
Consequently, Ri ≤ R'i ≤ T'i = Ai/2, i.e. Ri ≤ Ai/2, which means that process Li (1 ≤ i ≤ n) can be scheduled using our scheme. Consider the processes in figure 2. If we would have used the periods Ai/2, we would have got T'1 = 4, T'2 = 5 and T'3 = 6. This would have resulted in a utilization of C1/T'1 + C2/T'2 + C3/T'3 = 2/4 + 2/5 + 2/6 = 1.23 > 1, i.e. the process set would not have been schedulable. Consequently, our scheme is better than rate-monotone with T'i = Ai/2 in the sense that we are able to schedule a larger number of process sets.
3 Simple analysis
In the scheme that we propose, the period of a process depends on the priority of the process. The period for a process Li gets shorter if the priority of Li is reduced and vice versa. This property makes our scheme hard to analyze. However, for the limited case when there are two processes, a thorough analysis is possible.
Theorem 2. The priority assignment used in our scheme is optimal for all sets containing two processes.

Proof. Consider two processes L1 and L2, such that A1 ≤ A2. Assume that these two processes are schedulable if the priority of L2 is higher than the priority of L1. We will now show that if this is the case, the two processes are also schedulable if L1 has higher priority than L2. If L1 and L2 are schedulable when the priority of L2 is higher than the priority of L1, then R1 = C1 + ⌈R1/(A2 − C2)⌉C2 = C1 + kC2 ≤ A1/2 (for some integer k > 0). Consequently, C2 ≤ A1/2 − C1. If we consider the opposite priority assignment we know that the schedulability criterion is that R2 = C2 + ⌈R2/(A1 − C1)⌉C1 ≤ A2/2. Since C2 ≤ A1/2 − C1, we know that the maximum interference from process L1 on process L2 is C1, i.e. R2 = C2 + C1. Since C2 ≤ A1/2 − C1 and R2 = C2 + C1, we know that R2 ≤ A1/2, and since A1 ≤ A2, we know that R2 ≤ A2/2, thus proving the theorem.
Theorem 3. All sets containing two processes L1 and L2 for which C1/(A1 − C1) + C2/(A2 − C2) is less than 2(√2 − 1) ≈ 0.83 are schedulable using our scheme, and there are process sets containing two processes for which C1/(A1 − C1) + C2/(A2 − C2) = 2(√2 − 1) + ε (for any ε > 0) which are not schedulable using our scheme.

Proof. We assume that A1 ≤ A2, and that we use our scheme for calculating periods and priorities. We want to find the minimum value of C1/(A1 − C1) + C2/(A2 − C2) such that the process set is not schedulable. We know that as long as C1/(A1 − C1) ≤ 1, process L1 can be scheduled. This is a trivial observation. Process L2 can be scheduled if R2 ≤ A2/2, i.e. we want to minimize C1/(A1 − C1) + C2/(A2 − C2) under the constraint that R2 = C2 + ⌈R2/(A1 − C1)⌉C1 > A2/2. If ⌈R2/(A1 − C1)⌉ = k (for some integer k > 0), then the minimum for C1/(A1 − C1) + C2/(A2 − C2) is obtained when R2 = k(A1 − C1) = C2 + ⌈R2/(A1 − C1)⌉C1 = C2 + kC1, i.e. C2 = k(A1 − 2C1). Consequently, we want to find the k which minimizes C1/(A1 − C1) + k(A1 − 2C1)/(A2 − k(A1 − 2C1)). Since C2 = k(A1 − 2C1) ≤ A2/2 we see that A2 − k(A1 − 2C1) > 0, and since 0 < A1 − 2C1, we see that the minimum for C1/(A1 − C1) + k(A1 − 2C1)/(A2 − k(A1 − 2C1)) is obtained for k = 1. Consequently, we want to minimize C1/(A1 − C1) + (A1 − 2C1)/(A2 − (A1 − 2C1)), under the constraint that R2 = A1 − C1 > A2/2. The minimum is obviously obtained when A1 − C1 is as small as possible, i.e. when A1 − C1 = A2/2 + ε (for some infinitely small positive number ε). Consequently, we want to minimize C1/(A2/2 + ε) + (A2/2 + ε − C1)/(A2 − (A2/2 + ε − C1)). Without loss of generality we assume that A2 = 2. In that case we obtain the following function (disregarding ε)
f(C1) = C1 + (1 - C1)/(1 + C1)
From this we obtain the derivative of f(C1):

    f'(C1) = 1 + (−(1 + C1) − (1 − C1))/(1 + C1)² = 1 − 2/(1 + C1)²

By setting f'(C1) = 0 we find the value C1 which minimizes f:

    f'(C1) = 0  ⟺  (1 + C1)² = 1 + 2C1 + C1² = 2  ⟺  C1 = √2 − 1

From this we see that min f(C1) = f(√2 − 1) = 2(√2 − 1) ≈ 0.83.
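As a quick numerical cross-check (not part of the original paper), the following Python snippet minimizes f(C1) = C1 + (1 − C1)/(1 + C1) on a grid and confirms that the minimum lies at C1 = √2 − 1 ≈ 0.414 with value 2(√2 − 1) ≈ 0.828:

import math

def f(c1):
    # utilization-style bound from the two-process analysis (A2 normalized to 2)
    return c1 + (1 - c1) / (1 + c1)

# coarse grid search over 0 < C1 < 1
best_c1 = min((i / 10000 for i in range(1, 10000)), key=f)
print(best_c1, f(best_c1))                        # ~0.4142, ~0.8284
print(math.sqrt(2) - 1, 2 * (math.sqrt(2) - 1))   # analytic optimum for comparison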
It is interesting to note that the schedulability bound for C1/(A1 − C1) + C2/(A2 − C2) is the same as the schedulability bound for rate-monotone scheduling and two processes. However, in that case we have C1/T1' + C2/T2' = 2C1/A1 + 2C2/A2 = 0.83. At this point we do not know whether it is a coincidence that the values are the same. It would be interesting to examine the case with three processes and see if the schedulability bound for C1/(A1 − C1) + C2/(A2 − C2) + C3/(A3 − C3) is 3(2^{1/3} − 1) ≈ 0.78, which is the bound for rate-monotone scheduling and three processes.
4 Improving the scheme
In the previous sections we assumed that the interference from a higher priority process Lj affected the maximum response time Rj+1 of a process Lj+1 according to the formula Rj+1 = Cj+1 + ... + ⌈Rj+1/Tj⌉Cj. However, if Tj+1 = kTj (for some integer k > 0), we can adjust the phasing of Lj and Lj+1 in such a way that the interference of Lj on Lj+1 is limited to (⌈Rj+1/Tj⌉ − 1)Cj (see figure 3). The phasing is adjusted in such a way that a release of process Lj+1 always occurs at exactly the same time as a release of process Lj. Consequently, if Tj+1 < kTj ≤ Tj+1 + Cj we can extend the period of Lj+1 to kTj. In order to distinguish the periods obtained when using the optimized version of the scheme from the ones obtained using the unoptimized version, we denote the optimized period for process Lj as tj. In order to obtain the optimized periods tj for a set containing n processes, we start with the periods Ti and we then use the following algorithm:
    t1 = T1
    for y = 2 to n loop
        ty = Ty
        for j = 1 to y−1 loop
            if ty < k·tj ≤ Ty + Cj (for some integer k > 0) then ty = k·tj
        end loop
    end loop

The execution of the process set is started by releasing all processes at the same time.
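A small Python rendering of this optimization step (an illustration only; the function name optimize_periods is hypothetical, it assumes the unoptimized periods Ti and computation times Ci have already been computed as in section 2, and it uses the window (Ty, Ty + Cj] from the reconstruction above):

import math

def optimize_periods(T, C):
    """Extend periods to integer multiples of higher-priority periods where possible.

    T[i]: unoptimized period of process i (priority order, index 0 highest).
    C[i]: computation time of process i.
    Returns the optimized periods t[i].
    """
    t = [T[0]]                                   # t1 = T1
    for y in range(1, len(T)):
        ty = T[y]
        for j in range(y):                       # higher-priority processes only
            k = math.ceil(ty / t[j])             # smallest multiple of t[j] not below ty
            if ty < k * t[j] <= T[y] + C[j]:     # extending saves one interference C[j]
                ty = k * t[j]
        t.append(ty)
    return t

# Processes of figure 3: T = (2, 3.9), C = (1, 1.3)  ->  t = [2, 4.0]
print(optimize_periods([2, 3.9], [1, 1.3]))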
Fig. 3. In scenario (a) there are two processes L1 and L2. Process L1 has a period T1 = 2 and a computation time C1 = 1. Process L2 has a period T2 = 3.9 and a computation time C2 = 1.3. In this scenario we are able to meet the age constraints A1 = 3 and A2 = 7.2, i.e. A2 = T2 + R2 = T2 + C2 + 2C1 = 3.9 + 1.3 + 2 = 7.2. In scenario (b) we consider the same processes, with the exception that the period of L2 has been extended, i.e. T2 = 2T1 = 4. The phasing of L1 and L2 has also been adjusted such that a release of L2 always coincides with a release of L1. In this scenario we are able to meet the age constraints A1 = 3 and A2 = 6.3, i.e. A2 = T2 + R2 = T2 + C2 + C1 = 4 + 1.3 + 1 = 6.3. Consequently, by extending the period of L2 we were able to meet tougher age constraints.
Theorem 4. A set of processes which is schedulable using the unoptimized periods Ti and an arbitrary phasing of processes is also schedulable using the optimized periods ti, provided that we are able to adjust the phasing of the processes.

Proof. Let ri denote the maximum response time for process Li using the periods ti and the optimized phasing. Obviously, Ti ≤ ti. From this we conclude that ri = Ci + Σ_{j=1}^{i−1} ⌈ri/tj⌉Cj ≤ Ri = Ci + Σ_{j=1}^{i−1} ⌈Ri/Tj⌉Cj. Consequently, if Ri ≤ Ai/2, then ri ≤ Ai/2.

Figure 4 shows a process set which is schedulable using the improved version of our scheme. This process set would not have been schedulable using the unoptimized version of our scheme. Consequently, the improved version of the scheme is better in the sense that we are able to schedule a larger number of process sets.
[Figure 4: L1 is defined by A1 = 7 and C1 = 1, from this we calculate t1 = 6; L2 is defined by A2 = 8 and C2 = 2, from this we calculate t2 = 6; L3 is defined by A3 = 9 and C3 = 3, from this we calculate t3 = 6.]
Fig. 4. Three processes which are schedulable using the improved scheme.
5 Conclusions
In this paper a method for scheduling age constraint processes has been presented. An age constraint process is defined by two values: the maximum time from the beginning of the process' execution in cycle k to the end of the process' execution in cycle k + 1 (the age constraint), and the maximum computation time in each cycle, i.e. the period of each process is not explicitly defined. We present an algorithm for calculating process periods. Once the periods have been calculated the process set can be executed using preemption and fixed priority scheduling. The priorities are defined by the rate-monotone algorithm. Trivially, the age constraint can be met by using periods equal to half the age constraint. However, the periods obtained from our method are better in the sense that they make it possible to schedule a larger number of process sets
than we would have been able to do if we had used periods equal to half the age constraint. All process sets which are schedulable using periods equal to half the age constraint are also schedulable using our method. A simple analysis of our method shows that the rate-monotone priority assignment algorithm is optimal for all process sets containing two processes, using our process periods. We also show that all process sets with two processes for which C1/(A1 − C1) + C2/(A2 − C2) is less than 2(√2 − 1) ≈ 0.83 are schedulable using our scheme. There are, however, process sets containing two processes for which C1/(A1 − C1) + C2/(A2 − C2) = 2(√2 − 1) + ε (for any ε > 0) which are not schedulable using our scheme, i.e. we provide a simple schedulability test for process sets containing two processes. We also define an improved version of our method. The improved version capitalizes on the fact that there is room for optimization when the period of one process is an integer multiple of the period of another process. The improved version of the method is better than the original version in the sense that we are able to schedule a larger number of process sets. All process sets which are schedulable using the original version are also schedulable using the improved version. Previous work on age constraint processes has concentrated on creating cyclic interleavings of processes [1]. Other studies have looked at age constraint processes which communicate [5]. One such scenario is scheduling of age constraint processes in the context of hard real-time database systems [2].
References

1. W. Albrecht and R. Wisser, "Schedulers for Age Constraint Tasks and Their Performance Evaluation", In Proceedings of the Third International Euro-Par Conference, Passau, Germany, August 1997.
2. N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings, "Absolute and Relative Temporal Constraints in Hard Real-Time Databases", In Proceedings of the IEEE Euromicro Workshop on Real-Time Systems, Los Alamitos, California, February 1992.
3. A. Burns and A. Wellings, "Real-Time Systems and Programming Languages", Addison-Wesley, 1996.
4. J. P. Lehoczky, L. Sha, and Y. Ding, "The rate monotonic scheduling algorithm: exact characterization and average case behavior", In Proceedings of the IEEE 10th Real-Time Systems Symposium, December 1989.
5. X. Song and J. Liu, "How Well can Data Temporal Consistency be Maintained", In Proceedings of the 1992 IEEE Symposium on Computer Aided Control Systems Design, Napa, California, USA, March 1992.
Workshop 03: Scheduling and Load Balancing

Susan Flynn Hummel, Graham Riley and Rizos Sakellariou (Co-chairmen)
Introduction

Mapping a parallel computation onto a parallel computer system is one of the most important questions for the design of efficient parallel algorithms. In the case of irregular data structures, the problem of initially distributing and then maintaining an even workload as computation progresses becomes very complex. This workshop will discuss the state-of-the-art in this area. Besides the discussion of novel techniques for mapping and scheduling irregular dynamic or static computations onto a processor architecture, a number of open problems are of special interest for this workshop. One is the development of partitioning algorithms used in numerical simulation, taking into account not only the cutsize but also the special characteristics of the application and their impact on the partitioning. Another topic of special relevance for this workshop concerns re-partitioning algorithms for irregular and adaptive finite element computations that minimize the movement of vertices in addition to balancing the load and minimising the cut of the resulting new partition. A third topic is the development of dynamic load balancing algorithms that adapt themselves to the special characteristics of the underlying parallel computer, easing the development of portable applications. Other topics, in this incomplete list of open problems, relate to application-specific graph partitioning, adaptable load balancing algorithms, scheduling algorithms, novel applications of scheduling and load balancing, load balancing on workstation clusters, parallel graph partitioning algorithms, etc.
The Papers

The papers selected for presentation in the workshop were split into four sessions, each of which consists of work which covers a wide spectrum of different topics. We would like to sincerely thank the more than 20 referees that assisted us in the reviewing process. Rudolf Berrendorf considers the problem of minimising load imbalance along with communication on distributed shared memory systems; a graph-based technique using data gathered from execution times of different program tasks and memory access behaviour is described. An approach for repartitioning and data remapping for numerical computations with highly irregular workloads on distributed memory machines is presented by Leonid Oliker, Rupak Biswas, and
Harold Gabow. Sajal Das and Azzedine Boukerche experiment with a load balancing scheme for the parallelisation of discrete event simulation applications. The problem of job scheduling on a parallel machine is tackled by Christophe Rapine, Isaac Scherson and Denis Trystram, who analyse an on-line time/space scheduling scheme. The second session starts with a paper by Wolf Zimmermann, Martin Middendorf and Welf Loewe who consider k-linear scheduling algorithms on the LogP machine. Cristina Boeres, Vinod Rebello and David Skillicorn also consider a LogP-type model and investigate the effect of communication overheads. The problem of mesh partitioning and a measure for efficient heuristics is analysed by Ralf Diekmann, Robert Preis, Frank Schlimbach and Chris Walshaw. Konstantinos Antonis, John Garofalakis and Paul Spirakis analyse a strategy for balancing the load on a two-server distributed system. The third session starts with Salvatore Orlando and Raffaele Perego who present a method for load balancing of statically predicted stencil computations in heterogeneous environments. Fabricio Alves Barbosa da Silva, Luis Miguel Campos and Isaac Scherson consider the problem of parallel job scheduling again. A general architectural framework for building parallel schedulers is described by Gerson Cavalheiro, Yves Denneulin and Jean-Louis Roch. A dynamic loop scheduling algorithm which makes use of information from previous executions of the loop is presented by Mark Bull. A divide-and-conquer strategy motivated by problems arising in finite element simulations is analysed by Stefan Bischof, Ralf Ebner and Thomas Erlebach. The fourth session starts with a paper by Dingchao Li, Akira Mizuno, Yuji Iwahori and Naohiro Ishii that proposes an algorithm for scheduling a task graph on a heterogeneous parallel architecture. Mario Dantas considers a scheduling mechanism within an MPI implementation. An approach of using alternate schedules for achieving fault tolerance on a network of workstations is described by Dibyendu Das. An upper bound for the load distribution by a randomized algorithm for embedding bounded degree trees in arbitrary networks is computed by Jaafar Gaber and Bernard Toursel.
Optimizing Load Balance and Communication on Parallel Computers with Distributed Shared Memory

Rudolf Berrendorf
Central Institute for Applied Mathematics, Research Centre Jülich, D-52425 Jülich, Germany
r.berrendorf@fz-juelich.de
Abstract. To optimize programs for parallel computers with distributed shared memory two main problems need to be solved: load balance between the processors and minimization of interprocessor communication. This article describes a new technique called data-driven scheduling which can be used on sequentially iterated program regions on parallel computers with a distributed shared memory. During the first execution of the program region, statistical data on execution times of tasks and memory access behaviour are gathered. Based on this data, a special graph is generated to which graph partitioning techniques are applied. The resulting partitioning is stored in a template which is used in subsequent executions of the program region to efficiently schedule the parallel tasks of that region. Data-driven scheduling is integrated into the SVM-Fortran compiler. Performance results are shown for the Intel Paragon XP/S with the DSM-extension ASVM and for the SGI Origin2000.
1 Introduction
Parallel computers with a global address space share an important abstraction appreciated by programmers as well as compiler writers: the global, linear address space seen by all processors. To build such a computer in a scalable and economical way, such systems usually distribute the memory with the processors. Parallel computers with physically distributed memory but a global address space are termed distributed shared memory machines (DSM). Examples are SGI Origin2000, KSR-1, and Intel Paragon XP/S with ASVM [3]. To implement the global address space on top of a distributed memory, techniques for multi-cache systems are used which distinguish between read and write operations. If processors read from a memory location, the data is copied to the local memory of that processor where it is cached. On a write operation of a processor, this processor gets exclusive ownership of this location and all read copies get invalidated. The unit of coherence (and therefore the unit of communication) is a cache line or a page of the virtual memory system. For this reason, care has to be taken to avoid false sharing (independent data objects are mapped to the same page).
There are two main problems to be solved for parallel computers with a distributed shared memory and a large number of processors: load balancing and minimization of interprocessor communication. Data-driven scheduling is a new approach to solve both problems. The compiler modifies the code such that at run-time data on task times and memory access behaviour of the tasks is gathered. With this data a special graph is generated and partitioned for communication minimization and load balance. The partitioning result is stored in a template and it is used in subsequent executions of the program region to efficiently schedule the parallel tasks to the processors. The paper is organized as follows. After giving an overview of related work in section 2, section 3 gives an introduction to SVM-Fortran. In section 4 the concept of data-driven scheduling is discussed and in section 5 performance results are shown for an application executed on two different machines. Section 6 concludes and gives a short outlook of further work.
2 Related Work
There are a number of techniques known for parallel machines which try to balance the load, or to minimize the interprocessor communication, or both. Dynamic scheduling methods are well known as an attempt to balance the load on parallel computers, usually for parallel computers with a physically shared memory. The most rigid approach is self scheduling [10] where each idle processor requests only one task to be executed. With Factoring [7] each idle processor requests larger chunks of tasks at the beginning of the scheduling process to reduce the synchronization overhead. All of these dynamic scheduling techniques take in some sense a greedy approach and therefore they have problems if the tasks at the end of the scheduling process have significantly larger execution times than tasks scheduled earlier. Another disadvantage is the local scheduling aspect with respect to one parallel loop only and the fact that the data locality aspect is not taken into account. There are several scheduling methods known whose main objective is to generate data locality. Execute-on-Home [6] uses the information of a data distribution to execute tasks on that processor to which the accessed data is assigned (i.e. the processor which owns the data). An efficient implementation of the execute-on-home scheduling based on data distributions is often difficult if the data is accessed in an indirect way. In that case, run-time support is necessary. CHAOS/PARTI [13] (and in a similar manner RAPID [4]) is an approach to handle indirect accesses as is common, for example, in sparse matrix problems. In an inspector phase the indices for indirection are examined and a graph is generated which includes through the edges the relationship between the data. Then the graph is partitioned and the data is redistributed according to the partitioning. Many problems in the field of technical computing are modeled by the technique of finite elements. The original (physical) domain is partitioned into discrete elements connected through nodes. A usual approach of mapping such
problems to parallel computers is the partitioning (e.g. [8]) of the resulting grid such that equally sized subgrids are mapped to the processors. For regular grids, partitioning can be done as a geometric partitioning. Non-uniform computation times involved with each node and data that need to be communicated between the processors have to be taken into account in the partitioning step. The whole problem of mapping such a finite element graph onto a parallel computer can be formulated as a graph partitioning problem where the nodes of the graph are the tasks associated with each node of the finite element grid and the edges represent communication demands. There are several software libraries available which can be used to partition such graphs (e.g. Metis [8], Chaco [5], Party [9],
Jostle [12]). Getting realistic node costs (i.e. task costs) is often difficult for non-uniform task execution times. Also, graph partitioners take no page boundaries into account when they map finite element nodes onto the memory of a DSM-computer and thus they ignore the false sharing problem. [11] discusses a technique to reorder the nodes after the partitioning phase with the aim to minimize communication.
3 SVM-Fortran
SVM-Fortran [2] is a shared memory parallel Fortran77 extension targeted mainly towards data parallel applications on DSM-systems. SVM-Fortran supports coarse-grained functional parallelism where a parallel task itself can be data parallel. A compiler and run-time system are implemented on several parallel machines such as Intel Paragon XP/S with ASVM, SGI Origin2000, and SUN and DEC multiprocessor machines. SVM-Fortran provides standard features of shared memory parallel Fortran languages as well as specific features for DSM-computers. In SVM-Fortran the main concept to generate data locality and to balance the load is the dedicated assignment of parallel work to processors, e.g. the distribution of iterations of a parallel loop to processors. Data locality is not a problem to be solved on the level of individual loops but a global problem. SVM-Fortran uses the concept of processor arrangements and templates as a tool to specify scheduling decisions globally via template distributions. Loop iterations are assigned to processors according to the distribution of the appropriate template element. Therefore, in SVM-Fortran templates are used to distribute the work rather than to distribute the data as is done in HPF. Different from HPF, it is not necessary for the SVM-Fortran compiler to know the distribution of a template at compile time.
4 Data-Driven Scheduling
User-directed scheduling, where the user specifies the distribution of work to processors (e.g. specifying a block distribution for a template), makes sense and
is efficient if the user has reliable information on the access behaviour (interprocessor communication) and execution times (load balance) of the parallel tasks in the application. But the prerequisite for this is often a detailed understanding of the program, all data structures, and their content at run-time. If the effort to gain this information is too high or if the interrelations are too complex (e.g. indirect addressing on several levels, unpredictable task execution times), the help of a tool or an automated scheduling is desirable. Data-driven scheduling tries to help the programmer in solving the load balancing problem as well as the communication problem. Due to the way data-driven scheduling works, the new technique can be applied to program regions which are iterated several times (sequentially iterated program regions; see Fig. 1 for an example). This type of program region is common in iterative algorithms (e.g. iterative solvers), nested loops where the outer loop has not enough parallelism, or nested loops where data dependencies prohibit the parallel execution of the outer loop. The basic idea behind data-driven scheduling is to gather run-time data on task execution times and memory access behaviour on the first execution of the program region, use this information to find a good schedule which optimizes communication and load balance, and use the computed schedule in subsequent executions of the program region. For simplicity, we restrict the description to schedule regions with one parallel loop inside, although more complex structures can be handled. Fig. 1 shows a small example program. This program is an artificial program to explain the basic idea. In a real program, the false sharing problem introduced with variable a could easily be solved with the introduction of a second array. In this example, the strategy to optimize for data locality, and therefore how to distribute the iterations of the parallel loop to the processors, might differ depending on the content of the index field ix. On the other side, the execution time for each of the iterations might differ significantly based on the call to work(i), and therefore load balancing might be a challenging task. Data-driven scheduling tries to model both optimization aspects with the help of a weighted graph. The compiler generates two different codes (code-1, code-2) for the program region. Code-1 is executed on the first execution of the program region.¹ Besides the normal code generation (e.g. handling of PDOs), the code is further instrumented to gather statistics. At run-time, execution times of tasks and accesses to shared variables (marked with the VARS-attribute in the directive) are stored in parallel in distributed data structures. Reaching the end of the schedule region, the gathered data is processed. A graph is generated by the run-time system where each node of the graph represents a task and the node weights are the task's execution time. To get a realistic value, measured execution times have to be reduced by overhead times (measurement times, page miss times etc.) and synchronization times. To explain the generation of edges in the graph, have a look at Fig. 2(a) which shows the assignments done in the example program. Tasks 1 and 3 access page 1,

¹ The user can reset a schedule region through a directive, in which case code-1 is executed again, for example if the content of an index field has changed.
C
C
C
C --- simple, artificial example to explain the basic idea ---
C --- a memory page shall consist of 2 words ---
      PARAMETER (n=4)
      REAL,SHARED,ALIGNED a(2,n)    ! shared variable
      INTEGER ix(n)                 ! index field
      DATA ix /1,2,1,2/
C --- declare proc. field, template, and distribute template ---
CSVM$ PROCESSORS:: proc(numproc())
CSVM$ TEMPLATE templ(n)
CSVM$ DISTRIBUTE (BLOCK) ONTO proc:: templ
C --- sequentially iterated program region ---
      DO iter=1,niter
C --- parallel loop enclosed in schedule region ---
CSVM$ SCHEDULE_REGION(ID(1), VARS(a))
CSVM$ PDO (STRATEGY (ON_HOME (templ(i))))
      DO i=1,n
        a(1,i) = a(2,ix(i))         ! possible communication
        CALL work(i)                ! possible load imbalance
      ENDDO
CSVM$ SCHEDULE_REGION_END
      ENDDO

Fig. 1. (Artificial) Example of a sequentially iterated program region.
[Fig. 2. Memory accesses and resulting graph for the example program: (a) executed assignments (task i performs a(1,i) = a(2,ix(i))), (b) generated graph with one node per task (node weight = task time) and edges between tasks that share a page (edge weight = one page miss).]

              task 1   task 2   task 3   task 4   #tasks
   page 1      w/r       -        r        -         2
   page 2       -       w/r       -        r         2
   page 3       -        -        w        -         1
   page 4       -        -        -        w         1
   task time   10       10       10       10
Fig. 3. Page accesses and task times (w = write access, r = read access).
task 2 and 4 access page 2 either with a read access or with a write access. Page j (j = 3, 4) is accessed only within task j. Fig. 3 shows the corresponding access table as it is generated internally as a distributed sparse data structure at run-time. The most interesting column is labeled #tasks and gives the number of different tasks accessing a page. If this entry is 1, no conflicts exist for this page and therefore no communication between processors is necessary for this page, independent of the schedule of tasks to processors. If two or more tasks access a page (#tasks > 1) of which at least one access is a write access, communication might be necessary due to the multi-cache protocol if these tasks are not assigned to the same processor. This is modeled in the graph with an edge between the two tasks/nodes with edge costs of one page miss. Fig. 2(b) shows the graph corresponding to the access table in Fig. 3 (for this simple example, task times are assumed to be 10 units and a page miss counts 1 unit). The handling of compulsory page misses is difficult as this depends on previous data accesses. If the distribution of pages to processors can be determined at run time (e.g. in ASVM [3]) and if this distribution is similar for all iterations, then these misses can be modeled by artificial nodes in the graph which are pre-assigned to processors. In a next step, this graph is partitioned. We found that multilevel algorithms as found in most partitioning libraries give good results over a large variety of graphs. The partitioning result (i.e. the mapping of task i to processor j) is stored in the template given in the PDO-directive. The data gathering phase is done in parallel, but the graph generation and partitioning phase is done sequentially on one processor. We investigate the use of parallel partitioners in the future. Code-2 is executed at the second and all subsequent executions of the program region and this code is optimized code without any instrumentation. The iterations of the parallel loop are distributed according to the template distribution which is the schedule found in the graph partitioning step.
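The following Python sketch (an illustration only, not from the paper; the name build_task_graph is hypothetical) mirrors this construction: nodes are tasks weighted by their measured execution time, and an edge with the cost of one page miss is added between two tasks whenever they touch a common page and at least one of the accesses is a write.

from itertools import combinations

def build_task_graph(task_times, page_accesses, page_miss_cost=1):
    """Build the weighted task graph used for partitioning.

    task_times:    dict task -> measured execution time (node weight)
    page_accesses: dict page -> dict task -> 'r' or 'w' (access table)
    Returns (node_weights, edge_weights) where edge_weights maps
    a task pair to the accumulated communication cost.
    """
    nodes = dict(task_times)
    edges = {}
    for page, accesses in page_accesses.items():
        if len(accesses) < 2:
            continue                       # only one task: no conflict, no edge
        if all(mode == 'r' for mode in accesses.values()):
            continue                       # read-only sharing causes no invalidations
        for t1, t2 in combinations(sorted(accesses), 2):
            key = (t1, t2)
            edges[key] = edges.get(key, 0) + page_miss_cost
    return nodes, edges

# Access table of the example (Fig. 3): pages 1 and 2 are shared, pages 3 and 4 are not.
times = {1: 10, 2: 10, 3: 10, 4: 10}
table = {1: {1: 'w', 3: 'r'}, 2: {2: 'w', 4: 'r'}, 3: {3: 'w'}, 4: {4: 'w'}}
print(build_task_graph(times, table))      # edges (1,3) and (2,4), each with cost 1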
5 Results
We show performance results for the ASEMBL program which is the assembly of a finite element matrix done in a sequential loop modeling time steps, as it is used in many areas of technical computing. The results were obtained on an Intel Paragon XP/S with the extension ASVM [3] implementing SVM in software (embedded in the MACH3 micro kernel) and on a SGI Origin2000 with a fast DSM-implementation in hardware. A multilevel algorithm with Kernighan-Lin refinement of the Metis library [8] was used for partitioning the graph. For the Paragon XP/S, a data set with 900 elements was taken, and for the Origin2000 a data set with 8325 elements was chosen. Due to a search loop in every parallel task, task times differ. The sparse matrix to be assembled is allocated in the shared memory region; the variable accesses are protected by page locks where only one processor can access a page but accesses to different pages can be done in parallel. The bandwidth of the sparse matrix is small,
therefore strategies which partition the iteration space in clusters will generate better data locality. Fig. 4 shows speedup values for the third sequential time step. Because of the relatively small size of the shared array and the large page size on the Paragon XP/S (8 KB), the performance improvements are limited to a small number of processors. On both systems, the performance with the Factoring strategy is limited because data locality (and communication) is of no concern with this strategy. Particularly on the Paragon system with high page miss times, this strategy shows no real improvements. Although the shared matrix has a low bandwidth and therefore the Block strategy is in favor, differences in task times cause load imbalances with the Block strategy. Data-driven Scheduling (DDS) performs better than the other strategies on both systems.
"~-.-,, oos 4.5-=
[Figure 4 plots: speedup versus number of processors for the DDS, Block, and Factoring strategies on both machines.]
(a) Speedup on Paragon XP/S.
(b) Speedup on SGI Origin2000.
Fig. 4. Results for ASEMBL.
6 Summary
We have introduced data-driven scheduling to optimize interprocessor communication and load balance on parallel computers with a distributed shared memory. With the new technique, statistics on task execution times and data sharing are gathered at run-time. With this data, a special graph is generated and graph partitioning techniques are applied. The resulting partitioning is stored in a template and used in subsequent executions of that program region to efficiently schedule the parallel tasks to the processors. We have compared this method with two loop scheduling methods on two DSM-computers where data-driven scheduling performs better than the other two methods.
Currently, we use deterministic values for all model parameters. To refine our model, we plan to investigate non-deterministic values, e.g. the number of page misses and the page miss times.
7 Acknowledgments
Reiner Vogelsang (SGI/Cray) and Oscar Plata (University of Malaga) gave me access to SGI Origin2000 machines. Heinz Bast (Intel SSD) supported me on the Intel Paragon. I would like to thank the developers of the graph partitioning libraries I used in my work, namely: George Karypis (Metis), Chris Walshaw (Jostle), Bruce A. Hendrickson (Chaco), and Robert Preis (Party).
References

1. R. Berrendorf, M. Gerndt. Compiling SVM-Fortran for the Intel Paragon XP/S. Proc. Working Conference on Massively Parallel Programming Models (MPPM'95), pages 52-59, Berlin, October 1995. IEEE Society Press.
2. R. Berrendorf, M. Gerndt. SVM-Fortran reference manual version 1.4. Technical Report KFA-ZAM-IB-9510, Research Centre Jülich, April 1995.
3. R. Berrendorf, M. Gerndt, M. Mairandres, S. Zeisset. A programming environment for shared virtual memory on the Intel Paragon supercomputer. In Proc. Intel User Group Meeting, Albuquerque, NM, June 1995. http://www.cs.sandia.gov/ISUG/ps/pesvm.ps.
4. C. Fu, T. Yang. Run-time compilation for parallel sparse matrix computations. In Proc. ACM Int'l Conf. Supercomputing, pages 237-244, 1996.
5. B. Hendrickson, R. Leland. The Chaco user's guide, version 2.0. Technical Report SAND95-2344, Sandia National Lab., Albuquerque, NM, July 1995.
6. High Performance Fortran Forum. High Performance Fortran Language Specification, 2.0 edition, January 1997.
7. S. F. Hummel, E. Schonberg, L. E. Flynn. Factoring - a method for scheduling parallel loops. Comm. ACM, 35(8):90-101, August 1992.
8. G. Karypis, V. Kumar. Analysis of multilevel graph partitioning. Technical Report 95-037, Univ. Minnesota, Department of Computer Science, 1995.
9. R. Preis, R. Diekmann. The PARTY Partitioning-Library, User Guide, Version 1.1. Univ. Paderborn, September 1996.
10. P. Tang, P.-C. Yew. Processor self-scheduling for multiple nested parallel loops. In Proc. IEEE Int'l Conf. Parallel Processing, pages 528-535, August 1986.
11. K. A. Tomko, S. G. Abraham. Data and program restructuring of irregular applications for cache-coherent multiprocessors. In Proc. ACM Int'l Conf. Supercomputing, pages 214-225, July 1994.
12. C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus. Partitioning & mapping of unstructured meshes to parallel machine topologies. In Proc. Irregular 95: Parallel Algorithms for Irregularly Structured Problems, volume 980 of LNCS, pages 121-126. Springer, 1995.
13. J. Wu, R. Das, J. Saltz, H. Berryman, S. Hiranandani. Distributed memory compiler design for sparse problems. IEEE Trans. Computers, 44(6), 1995.
Performance Analysis and Portability of the PLUM Load Balancing System

Leonid Oliker (1), Rupak Biswas (2), and Harold N. Gabow (3)
(1) RIACS, NASA Ames Research Center, Moffett Field, CA 94035, USA
(2) MRJ, NASA Ames Research Center, Moffett Field, CA 94035, USA
(3) CS Department, University of Colorado, Boulder, CO 80309, USA
Abstract. The ability to dynamically adapt an unstructured mesh is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult. To address this problem, we have developed PLUM, an automatic portable framework for performing adaptive numerical computations in a message-passing environment. PLUM requires that all data be globally redistributed after each mesh adaption to achieve load balance. We present an algorithm for minimizing this remapping overhead by guaranteeing an optimal processor reassignment. We also show that the data redistribution cost can be significantly reduced by applying our heuristic processor reassignment algorithm to the default mapping of the parallel partitioner. Portability is examined by comparing performance on an SP2, an Origin2000, and a T3E. Results show that PLUM can be successfully ported to different platforms without any code modifications.
1 Introduction
The ability to dynamically adapt an unstructured mesh is a powerful tool for efficiently solving computational problems with evolving physical features. Standard fixed-mesh numerical methods can be made more cost-effective by locally refining and coarsening the mesh to capture these phenomena of interest. Unfortunately, an efficient parallelization of these adaptive methods is rather difficult, primarily due to the load imbalance created by the dynamically-changing nonuniform grid. Nonetheless, it is generally thought that unstructured adaptive-grid techniques will constitute a significant fraction of future high-performance supercomputing. With this goal in mind, we have developed a novel method, called PLUM [7], that dynamically balances processor workloads with a global view when performing adaptive numerical calculations in a parallel message-passing environment. The mesh is first partitioned and mapped among the available processors. Once an acceptable numerical solution is obtained, the mesh adaption procedure [8] is invoked. Mesh edges are targeted for coarsening or refinement based on an error indicator computed from the solution. The old mesh is then coarsened, resulting in a smaller grid. Since edges have already been marked for refinement, the new mesh can be exactly predicted before actually performing the refinement step. Program control is thus passed to the load balancer at this time.
If the current partitions will become load imbalanced after adaption, a repartitioner is used to divide the new mesh into subgrids. The new partitions are then reassigned among the processors in a way that minimizes the cost of data movement. If the remapping cost is compensated by the computational gain that would be achieved with balanced partitions, all necessary data is appropriately redistributed. Otherwise, the new partitioning is discarded. The computational mesh is then refined and the numerical calculation is restarted.
2
Dynamic Load Balancing

2.1
Repartitioning the Initial Mesh Dual Graph
Repeatedly using the dual of the initial computational mesh for dynamic load balancing is one of the key features of PLUM [7]. Each dual graph vertex has a computational weight, Wcomp, and a remapping weight, Wremap. These weights model the processing workload and the cost of moving the corresponding element from one processor to another. Every dual graph edge also has a weight, Wcomm, that models the runtime communication. New computational grids obtained by adaption are represented by modifying these three weights. If the dual graph with a new set of Wcomp is deemed unbalanced, the mesh is repartitioned.
2.2
Processor Reassignment
New partitions generated by a partitioner are mapped to processors such that the data redistribution cost is minimized. In general, the number of new partitions is an integer multiple F of the number of processors, and each processor is assigned F partitions. Allowing multiple partitions per processor reduces the volume of data movement but increases the partitioning and reassignment times [7]. We first generate a similarity measure M that indicates how the remapping weights Wremap of the new partitions are distributed over the processors. It is represented as a matrix where entry Mij is the sum of the Wremap values of all the dual graph vertices in new partition j that already reside on processor i. Various cost functions are usually needed to solve the processor reassignment problem using M for different machine architectures. We present three general metrics: TotalV, MaxV, and MaxSR, which model the remapping cost on most multiprocessor systems. TotalV minimizes the total volume of data moved among all the processors, MaxV minimizes the maximum flow of data to or from any single processor, while MaxSR minimizes the sum of the maximum flow of data to and from any processor. Experimental results [2] have indicated the usefulness of these metrics in predicting the actual remapping costs. A greedy heuristic algorithm to minimize the remapping overhead is also presented.

TotalV Metric. The TotalV metric assumes that by reducing network contention and the total number of elements moved, the remapping time will be reduced. In general, each processor cannot be assigned F unique partitions cor-
responding to their F largest weights. To minimize TotalV, each processor i must be assigned F partitions j_{i_f}, f = 1, 2, ..., F, such that the objective function

    ℱ = Σ_{i=1}^{P} Σ_{f=1}^{F} M_{i, j_{i_f}}

is maximized subject to the constraint

    j_{i_r} ≠ j_{k_s}  for i ≠ k or r ≠ s;   i, k = 1, 2, ..., P;   r, s = 1, 2, ..., F.
We can optimally solve this by mapping it to a network flow optimization problem described as follows. Let G = (V, E) be an undirected graph. G is bipartite if V can be partitioned into two sets A and B such that every edge has one vertex in A and the other vertex in B. A matching is a subset of edges, no two of which share a common vertex. A maximum-cardinality matching is one that contains as many edges as possible. If G has a real-valued cost on each edge, we can consider the problem of finding a maximum-cardinality matching whose total edge cost is maximized. We refer to this as the maximally weighted bipartite graph (MWBG) problem (also known as the assignment problem). When F = 1, optimally solving the TotalV metric trivially reduces to MWBG, where V consists of P processors and P partitions in each set. An edge of weight Mij exists between vertex i of the first set and vertex j of the second set. If F > 1, the processor reassignment problem can be reduced to MWBG by duplicating each processor and all of its incident edges F times. Each set of the bipartite graph then has P×F vertices. After the optimal solution is obtained, the solutions for all F copies of a processor are combined to form a one-to-F mapping between the processors and the partitions. The optimal solution for the TotalV metric and the corresponding processor assignment of an example similarity matrix is shown in Fig. 1(a). The fastest MWBG algorithm can compute a matching in O(|V|² log |V| + |V||E|) time [3], or in O(|V|^{1/2}|E| log(|V|C)) time if all edge costs are integers of absolute value at most C [5]. We have implemented the optimal algorithm with a runtime of O(|V|³). Since M is generally dense, |E| ≈ |V|², implying that we should not see a dramatic performance gain from a faster implementation.

MaxV Metric. The metric MaxV, unlike TotalV, considers data redistribution in terms of solving a load imbalance problem, where it is more important to minimize the workload of the most heavily-weighted processor than to minimize the sum of all the loads. During the process of remapping, each processor must pack and unpack send and receive buffers, incur remote-memory latency time, and perform the computational overhead of rebuilding internal and shared data structures. By minimizing max(α × max(ElemsSent), β × max(ElemsRecd)), where α and β are machine-specific parameters, MaxV attempts to reduce the total remapping time by minimizing the execution time of the most heavily-loaded processor. We can solve this optimally by considering the problem of finding a maximum-cardinality matching whose maximum edge cost is minimum. We refer to this as the bottleneck maximum cardinality matching (BMCM) problem. To find the BMCM of the graph G corresponding to the similarity matrix, we first need to transform M into a new matrix M'. Each entry M'_{ij} represents
[Figure 1 data (new partitions mapped to new processors): (a) TotalV moved = 525, MaxV moved = 275, MaxSR moved = 485; (b) TotalV moved = 640, MaxV moved = 245, MaxSR moved = 475; (c) TotalV moved = 570, MaxV moved = 255, MaxSR moved = 465; (d) TotalV moved = 550, MaxV moved = 260, MaxSR moved = 470.]
Fig. 1. Various cost metrics of a similarity matrix M for P = 4 and F = 1 using (a) the optimal MWBG, (b) the optimal BMCM, (c) the optimal DBMCM, and (d) our heuristic algorithms.

the maximum cost of sending data to or receiving data from processor i and partition j:

    M'_{ij} = max( α (Σ_{y=1}^{P} M_{iy} − M_{ij}),  β (Σ_{x=1}^{P} M_{xj} − M_{ij}) )
Currently, our framework for the Max]/metric is restricted to F = 1. We have implemented the BMCM algorithm of Bhat [1] which combines a m a x i m m n cardinality matching Mgorithm with a binary search, and runs in O(IVI1/2IEI log IVI). The fastest known BMCM algorithm, proposed by Gabow and Tarjan [4], has a runtime of O((IV Ilog IV[)I/21EI). The new processor assignment for the similarity matrix in Fig. 1 using this approach with a = ,3 = 1 is shown in Fig. 1 (b). Notice that the total number of elements moved in Fig. l(b) is larger than the corresponding value in Fig. l(a); however, the m a x i m u m number of elements moved is smaller. M a x S R M e t r i c . Our third metric, MaxSR, is similar to MaxV in the sense that the overhead of the bottleneck processor is minimized during the remapping phase. MaxSR differs, however, in that it minimizes the sum of the heaviest data flow from any processor and to any processor, expressed as (a • + fl• We refer to this as the double bottleneck m a x i m u m cardinality matching (DBMCM) problem. The MaxSR formulation allows us to capture the computational overhead of packing and unpacking data, when these two phases are separated by a barrier synchronization. Additionally, the 14axSl~ metric may also approximate the m a n y - t o - m a n y communication pattern of our remapping phase. Since a processor can either be sending or receiving data, the overhead of these two phases should be modeled as a sum of costs. We have developed an algorithm for computing the m i n i m u m MaxSR of the graph G corresponding to our similarity matrix. We first transform M to a new matrix M . Each entry Mij contains a pair of values (Send,Receive) correspond9
l/
l/
311 ing to the total cost of sending and receiving data, when partition j is mapped to processor i: P
P
tt
. . j : { s.j = (4 E
y
j) .
y=l
=
E Mx . x
i) }
x=l
Currently, our algorithm for the MaxSR metric is restricted to F = 1. Let σ1, σ2, ..., σk be the distinct Send values appearing in M'', sorted in increasing order. Thus, σi < σ_{i+1} and k ≤ P². Form the bipartite graph Gi = (V, Ei), where V consists of processor vertices u = 1, 2, ..., P and partition vertices v = 1, 2, ..., P, and Ei contains edge (u, v) if S_{uv} ≤ σi; furthermore, edge (u, v) has weight R_{uv} if it is in Ei. For small values of i, graph Gi may not have a perfect matching. Let imin be the smallest index such that G_{imin} has a perfect matching. Obviously, Gi has a perfect matching for all i ≥ imin. Solving the BMCM problem of Gi gives a matching that minimizes the maximum Receive edge weight. It gives a matching with MaxSR value at most σi + MaxV(Gi). Defining MaxSR(i) =
    MaxSR(i) = min_{imin ≤ j ≤ i} ( σ_j + MaxV(G_j) )
it is easy to see that MaxSR(k) equals the correct value of MaxSR. Thus, our algorithm computes MaxSR by solving k BMCM problems on the graphs Gi and computing the minimum value MaxSR(k). However, we can prematurely terminate the algorithm if there exists an imax such that σ_{imax+1} > MaxSR(imax), since it is then guaranteed that the MaxSR solution is MaxSR(imax). Our implementation has a runtime of O(|V|^{1/2}|E|² log |V|) since the BMCM algorithm is called |E| times in the worst case; however, it can be decreased to O(|E|²). The following is a brief sketch of this more efficient implementation. Suppose we have constructed a matching ℳ that solves the BMCM problem of Gi for i ≥ imin. We solve the BMCM problem of G_{i+1} as follows. Initialize a working graph G to be G_{i+1} with all edges of weight greater than MaxV(Gi) deleted. Take the matching ℳ on G, and delete all unmatched edges of weight MaxV(Gi). Choose an edge (u, v) of maximum weight in ℳ. Remove edge (u, v) from ℳ and G, and search for an augmenting path from u to v in G. If no such path exists, we know that MaxV(Gi) = MaxV(G_{i+1}). If an augmenting path is found, repeat this procedure by choosing a new edge (u', v') of maximum weight in the matching and searching for an augmenting path. After some number of repetitions of this procedure, the maximum weight of a matched edge will have decreased to the desired value MaxV(G_{i+1}). At this point our algorithm to solve the BMCM problem of G_{i+1} will stop, since no augmenting path will be found. To see that this algorithm runs in O(|E|²), note that each search for an augmenting path uses time O(|E|) and that there are O(|E|) such searches. A successful search for an augmenting path for edge (u, v) permanently eliminates it from all future graphs, so there are at most |E| successful searches. Furthermore, there are at most |E| unsuccessful searches, one for each value of i. The new processor assignment for the similarity matrix in Fig. 1 using the DBMCM algorithm with α = β = 1 is shown in Fig. 1(c). Notice that the MaxSR
solution is minimized; however, the number of TotalV elements moved is larger than the corresponding value in Fig. 1(a), and more MaxV elements are moved than in Fig. 1(b). Also note that the optimal similarity matrix solution for MaxSR is provably no more than twice that of MaxV.

Heuristic Algorithm. We have developed a heuristic greedy algorithm that gives a suboptimal solution to the TotalV metric in O(|E|) steps [7]. All partitions are initially flagged as unassigned and each processor has a counter set to F that indicates the remaining number of partitions it needs. The non-zero entries of the similarity matrix M are then sorted in descending order. Starting from the largest entry, partitions are assigned to processors that have less than F partitions until done. If necessary, the zero entries in M are also used. Oliker and Biswas [7] proved that a processor assignment obtained using the heuristic algorithm can never result in a data movement cost that is more than twice that of the optimal TotalV assignment. In addition, experimental results in Sec. 3.1 demonstrate that our heuristic quickly finds high quality solutions for all three metrics. Applying this heuristic algorithm to the similarity matrix in Fig. 1 generates the new processor assignment shown in Fig. 1(d).

2.3 Remapping Cost Model

Once the reassignment problem is solved, a model is needed to quickly predict the expected redistribution cost for a given architecture. Our redistribution algorithm consists of three major steps: first, the data objects moving out of a partition are stripped out and placed in a buffer; next, a collective communication distributes the data to its destination; and finally, the received data is integrated into each partition and the boundary information is consistently updated. This remapping procedure closely follows the superstep model of BSP [9]. The expected time for the redistribution procedure on bandwidth-rich systems can be expressed as γ × MaxSR + O, where MaxSR = max(ElemsSent) + max(ElemsRecd), γ is the total computation and communication cost to process each redistributed element, and O is the sum of all constant overheads [7]. This formulation demonstrates the need to model the MaxSR metric when performing processor reassignment. By minimizing MaxSR, we can guarantee a reduction in the computational overhead of our remapping algorithm. To compute γ and O, a simple least squares fit through several data points for various redistribution patterns and their corresponding runtimes can be used. This procedure needs to be performed only once for each architecture, and the values of γ and O can then be used in actual computations to estimate the redistribution cost.
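As an illustration of the greedy heuristic described above (a sketch only, not the authors' code; the name greedy_reassign is hypothetical and the similarity matrix and F are passed in explicitly), one possible Python rendering is:

def greedy_reassign(M, F=1):
    """Greedy suboptimal solution to the TotalV metric.

    M[i][j] is the similarity value: data of new partition j already on processor i.
    Returns a list mapping each partition j to a processor, giving each processor
    at most F partitions.
    """
    P = len(M)
    remaining = [F] * P                     # partitions each processor still needs
    owner = [None] * len(M[0])              # partition -> processor
    # sort all (value, processor, partition) entries in descending order of value
    entries = sorted(((M[i][j], i, j) for i in range(P) for j in range(len(M[i]))),
                     reverse=True)
    for value, i, j in entries:             # zero entries are used last, if needed
        if owner[j] is None and remaining[i] > 0:
            owner[j] = i
            remaining[i] -= 1
    return owner

# Example: a 4x4 similarity matrix with F = 1; each partition goes to the processor
# holding most of its data, subject to the one-partition-per-processor limit.
M = [[10, 0, 5, 0],
     [ 0, 8, 0, 3],
     [ 7, 0, 9, 0],
     [ 0, 2, 0, 6]]
print(greedy_reassign(M))   # -> [0, 1, 2, 3]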
3 Experimental Results
The 3D_TAG parallel mesh adaption procedure [8] and the PLUM global load balancing strategy [7] have been implemented in C and C++, with the parallel activities in MPI for portability. All experiments were performed on the wide-node SP2 at NASA Ames, the Origin2000 at NCSA, and the T3E at NASA Goddard, without any machine-specific optimizations.
The computational mesh used in this paper is one used to simulate an acoustics wind-tunnel experiment of a UH-1H helicopter rotor blade [7]. Three different cases are studied, with varying fractions of the domain targeted for refinement based on an error indicator calculated directly from the flow solution. The strategies, called Real_1, Real_2, and Real_3, subdivided 5%, 33%, and 60% of the 78,343 edges of the initial mesh. This increased the number of mesh elements from 60,968 to 82,489, 201,780, and 321,841, respectively.

3.1
Comparison of Reassignment Algorithms
Table 1 presents a comparison of our five different processor reassignment algorithms in terms of the reassignment time (in secs) and the amount of data movement. Results are shown for the Real_2 strategy on the SP2 with F = 1. The PMeTiS [6] case does not require any explicit processor reassignment since we choose the default partition-to-processor mapping given by the partitioner. The poor performance for all three metrics is expected since PMeTiS is a global partitioner that does not attempt to minimize the remapping overhead. Previous work [2] compared the performance of PMeTiS with other partitioners.

Table 1. Comparison of reassignment algorithms for Real_2 on the SP2 with F = 1
Algorithm   |              P = 32                |              P = 64
            | TotalV  MaxV   MaxSR   Reass. Time | TotalV  MaxV   MaxSR   Reass. Time
PMeTiS      | 58297   5067   7467    0.0000      | 67439   2667   4452    0.0000
MWBG        | 34738   4410   5822    0.0177      | 38059   2261   3142    0.0650
BMCM        | 49611   4410   5944    0.0323      | 52837   2261   3282    0.1327
DBMCM       | 50270   4414   5733    0.0921      | 54896   2261   3121    1.2515
Heuristic   | 35032   4410   5809    0.0017      | 38283   2261   3123    0.0088
The execution times of the other four algorithms increase with the number of processors because of the growth in the size of the similarity matrix; however, the heuristic time for 64 processors is still very small and acceptable. The total volume of data movement is obviously the smallest for the MWBG algorithm since it optimally solves for the TotalV metric. In the optimal BMCM method, the maximum of the number of elements sent or received is explicitly minimized, but all the other algorithms give almost identical results for the MaxV metric. In our helicopter rotor experiment, only a few localized regions of the domain incur a dramatic increase in the number of grid points between refinement levels. These newly-refined regions must shift a large number of elements onto other processors in order to achieve a balanced load distribution. Therefore, a similar MaxV solution should be obtained by any reasonable reassignment algorithm. The DBMCM algorithm optimally reduces MaxSR, but achieves no more than a 5% improvement over the other algorithms. Nonetheless, since we believe that the MaxSR metric can closely approximate the remapping cost on many architectures, computing its optimal solution can provide useful information. Notice
that the minimum TotalV increases slightly as P grows from 32 to 64, while MaxSR is dramatically reduced by over 45%. This trend continues as the number of processors increases, and indicates that PLUM will remain viable on a large number of processors, since the per-processor workload decreases as P increases. Finally, observe that the heuristic algorithm does an excellent job in minimizing all three cost metrics, in a trivial amount of time. Although theoretical bounds have only been established for the TotalV metric, empirical evidence indicates that the heuristic algorithm closely approximates both MaxV and MaxSR. Similar results were obtained for the other edge-marking strategies.
3.2 Portability Analysis
The top three plots in Fig. 2 illustrate parallel speedup for the three edge-marking strategies on the SP2, Origin2000, and T3E. Two sets of results are presented for each machine: one when data remapping is performed after mesh refinement, and the other when remapping is done before refinement. The speedup numbers are almost identical on all three machines. The Real_3 case shows the best speedup values because it is the most computation intensive. Remapping data before refinement has the largest relative effect for Real_1, because it has the smallest refinement region and predictively load balancing the refined mesh returns the biggest benefit. The best results are for Real_3 with remapping before refinement, showing an efficiency greater than 87% on 32 processors.
Fig. 2. Refinement speedup (top) and remapping time (bottom) within PLUM on the SP2, Origin2000, and T3E, when data is redistributed after or before mesh refinement To compare the performance on the SP2, Origin2000, and T 3 E more critically, one needs to look at the actual times rather than the speedup values. Table 2 shows how the execution time (in sees) is spent during the refinement and subsequent load balancing phases for the Real_2 case when data is remapped
Table 2. Anatomy of execution times for Real_2 on the Origin2000, SP2, and T3E
      |     Adaption Time      |    Remapping Time      |   Partitioning Time
  P   | O2000   SP2     T3E    | O2000   SP2     T3E    | O2000   SP2     T3E
  2   | 5.261   12.06   3.455  | 3.005   3.440   2.648  | 0.628   0.815   0.701
  4   | 2.880   6.734   1.956  | 3.005   3.440   1.501  | 0.584   0.537   0.477
  8   | 1.470   3.434   1.034  | 2.963   3.321   1.449  | 0.522   0.424   0.359
 16   | 0.794   1.846   0.568  | 2.346   2.173   0.880  | 0.396   0.377   0.301
 32   | 0.458   1.061   0.333  | 0.491   1.338   0.592  | 0.389   0.429   0.302
 64   |         0.550   0.188  |         0.890   0.778  |         0.574   0.425
128   |                 0.121  |                 1.894  |                 0.599
before the subdivision phase. The processor reassignment times are not presented since they are negligible compared to other times, as is evident from Table 1. Notice that the T3E adaption times are consistently more than 1.4 times faster than the Origin2000 and three times faster than the SP2. One reason for this performance difference is the disparity in the clock speeds of the three machines. Another reason is that the mesh adaption code does not use the floating-point units on the SP2, thereby adversely affecting its overall performance. The bottom three plots in Fig. 2 show the remapping time for each of the three cases on the SP2, Origin2000, and T3E. In almost every case, a significant reduction in remapping time is observed when the adapted mesh is load balanced by performing data movement prior to refinement. This is because the mesh grows in size only after the data has been redistributed. In general, the remapping times also decrease as the number of processors is increased. This is because even though the total volume of data movement increases with the number of processors, there are actually more processors to share the work. The remapping times when data is moved before mesh refinement are reproduced for the Real_2 case in Table 2 since the exact values are difficult to read off the log-scale. Perhaps the most remarkable feature of these results is the peculiar behavior of the T3E when P ≥ 64. When using up to 32 processors, the remapping performance of the T3E is very similar to that of the SP2 and Origin2000. It closely follows the redistribution cost model given in Sec. 2.3, and achieves a significant runtime improvement when remapping is performed prior to refinement. However, for 64 and 128 processors, the remapping overhead on the T3E begins to increase and violates our cost model. The runtime difference when data is remapped before and after refinement is dramatically diminished; in fact, all the remapping times begin to converge to a single value! This indicates that the remapping time is no longer affected only by the volume of data redistributed but also by the interprocessor communication pattern. One way of potentially improving these results is to take advantage of the T3E's ability to efficiently perform one-sided communication. Another surprising result is the dramatic reduction in remapping times when using 32 processors on the Origin2000. This is probably because network contention with other jobs is essentially removed when using the entire machine. When using up to 16 processors, the remapping times on the SP2 and the Origin2000
are comparable, while the T3E is about twice as fast. Recall that the remapping phase within PLUM consists of both communication and computation. Since the results in Table 2 indicate that computation is faster on the Origin2000, it is reasonable to infer that bulk communication is faster on the SP2. These results generally demonstrate that our methodology within PLUM is effective in significantly reducing the data remapping time and improving the parallel performance of mesh refinement. Table 2 also presents the PMeTiS partitioning times for Real_2 on all three systems; the results for Real_1 and Real_3 are almost identical because the time to repartition mostly depends on the initial problem size. There is, however, some dependence on the number of processors used. When there are too few processors, repartitioning takes more time because each processor has a bigger share of the total work. When there are too many processors, an increase in the communication cost slows down the repartitioner. Table 2 demonstrates that PMeTiS is fast enough to be effectively used within our framework, and that PLUM can be successfully ported to different platforms without any code modifications.
4 Conclusions
In this paper, we verified the effectiveness of our PLUM load balancer for adaptive unstructured meshes on a helicopter acoustics problem. We developed three generic metrics to model the remapping cost on most multiprocessor systems. Optimal solutions for these metrics, as well as a heuristic approach, were implemented. We showed that the data redistribution overhead can be significantly reduced by applying our heuristic processor reassignment algorithm to the default mapping given by the global partitioner. Portability was demonstrated by presenting results on the three vastly different architectures of the SP2, Origin2000, and T3E, without the need for any code modifications. Results showed that, in general, PLUM will remain viable on large numbers of processors. However, our redistribution cost model was violated on the T3E when 64 or more processors were used. Future research will address the improvement of these results, and the development of a more comprehensive remapping cost model.
References
1. Bhat, K.: An O(n^2.5 log^2 n) time algorithm for the bottleneck assignment problems. AT&T Bell Laboratories Unpublished Report (1984)
2. Biswas, R., Oliker, L.: Experiments with repartitioning and load balancing adaptive meshes. NASA Ames Research Center Technical Report NAS-97-021 (1997)
3. Fredman, M., Tarjan, R.: Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM 34 (1987) 596-615
4. Gabow, H., Tarjan, R.: Algorithms for two bottleneck optimization problems. J. of Alg. 9 (1988) 411-417
5. Gabow, H., Tarjan, R.: Faster scaling algorithms for network problems. SIAM J. on Comput. 18 (1989) 1013-1036
6. Karypis, G., Kumar, V.: Parallel multilevel k-way partitioning scheme for irregular graphs. University of Minnesota Technical Report 96-036 (1996)
7. Oliker, L., Biswas, R.: PLUM: Parallel load balancing for adaptive unstructured meshes. NASA Ames Research Center Technical Report NAS-97-020 (1997)
8. Oliker, L., Biswas, R., Strawn, R.: Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2. Springer-Verlag LNCS 1117 (1996) 35-47
9. Valiant, L.: A bridging model for parallel computation. Comm. ACM 33 (1990) 103-111
Experimental Studies in Load Balancing*
Azzedine Boukerche and Sajal K. Das
Department of Computer Sciences, University of North Texas, Denton, TX, USA
Abstract. This paper takes an experimental approach to the load balancing problem for parallel simulation applications. In particular, it focuses upon a conservative synchronization protocol, making use of an optimized version of the Chandy-Misra null message method, and proposes a dynamic load balancing algorithm which assumes no compile-time knowledge about the workload parameters. The proposed scheme is also implemented on an Intel Paragon A4 multicomputer, and the performance results for several simulation models are reported.
1 Introduction
A considerable number of research projects on load balancing in parallel and distributed systems has been carried out in the literature, owing to the potential performance gain from this service. However, these algorithms are not suitable for parallel simulation, since the synchronization constraints exacerbate the dependencies between the LPs. In this paper, we consider the load balancing problem associated with the conservative synchronization protocol that makes use of an optimized version of the Chandy-Misra null messages protocol [3]. Our primary goal is to minimize the synchronization overhead in conservative parallel simulation, and to significantly reduce the total execution time of the simulation by evenly distributing the workload among processors [1, 2].
2 Load Balancing Algorithm
Before the load balancing algorithm can determine the new assignment of processes, it must receive information from the processes. We propose to implement the load balancing facility as two kinds of processes: load balancer and process migration processes. A load balancer makes decisions on when to move which process to where, while a migration process carries out the decisions made by the load balancer to move processes between processors. To prevent a bottleneck and to reduce the message traffic of a fully distributed approach, we propose a Centralized approach (CL) and a Multi-Level (ML) hierarchical scheme, where processors are grouped and the workloads of processors are balanced hierarchically through multiple levels. At the present time,
we settle on a Two-Level scheme. In level 1, we use a centralized approach that makes use of a (Global) Load Balancing Facility (GLBF), and the global state consisting of the process/processor mapping table. The motivation for this is that many contemporary parallel computers are usually equipped with a front-end host processor (e.g., hypercube multiprocessor machines). The load balancer (GLBF) periodically sends a message <Request_Load> to a specific processor, called first_proc, within each cluster, requesting the average load of each processor within each group. In level 2, the processors are partitioned into clusters, and the processors within each group are structured as a virtual ring and operate in parallel to collect the workloads of the processors. A virtual ring is designed to be traversed by a token which originates at a particular processor, called first_proc, passes through intermediate processors, and ends its traversal at a pre-identified processor called last_proc. Each ring has its own circulating token, so that information (i.e., workload) gathering within the rings is concurrent. As the token travels through the processors of the ring, it accumulates the information (the load of each processor and the ids of the processors that contain the highest/lowest loads), so that when it arrives at last_proc, information has been gathered from all processors of the ring. When all processors have responded, the load balancer (GLBF) computes the new load balance. Once the load balancer makes the decision on when to move which LP to where, it sends a Migrate_Request to the migration process, which in turn initiates the migration mechanism. Then, the migration process sends the (heavily overloaded) selected process to the most lightly loaded neighbor processor. The computational load in our parallel simulation model consists of executing the null and the real messages. We wish to distribute the null messages and the real messages among all available processors. We define the (normalized) load at each processor Pr_k as Load_k = f(R^k_avg, N^k_avg) = α · R^k_avg/R_avg + (1 − α) · N^k_avg/N_avg, where R^k_avg and N^k_avg are respectively the average CPU-queue lengths for real messages and null messages at processor Pr_k, and α and (1 − α) are respectively the relative weights of the corresponding parameters. The value of α was determined empirically.
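A minimal sketch of this load metric follows (our code, not the authors' implementation; normalizing each term by the system-wide average queue length is an assumption, and α is the empirically chosen weight mentioned above):

#include <vector>

// ravg[k] and navg[k]: average CPU-queue lengths for real and null messages
// on processor Pr_k; alpha: relative weight of the real-message term.
std::vector<double> computeLoads(const std::vector<double>& ravg,
                                 const std::vector<double>& navg,
                                 double alpha) {
    const std::size_t m = ravg.size();
    double rMean = 0.0, nMean = 0.0;
    for (std::size_t k = 0; k < m; ++k) { rMean += ravg[k]; nMean += navg[k]; }
    rMean /= m;  nMean /= m;                       // system-wide averages
    std::vector<double> load(m);
    for (std::size_t k = 0; k < m; ++k)
        load[k] = alpha * ravg[k] / rMean + (1.0 - alpha) * navg[k] / nMean;
    return load;
}

In the two-level scheme, the per-cluster token would carry the ravg/navg samples back to first_proc, and the GLBF would evaluate this metric to pick the most and least loaded processors.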
3 Simulation Experiments
The experiments were conducted on a 72-node Intel Paragon, and we examined a general distributed communication model (DCM), see Fig. 1, modeling a national network consisting of four regions whose subnetworks are a toroid, circular loops, a fully connected network, and a pipeline network. These regions are connected through four centrally located delays. A node in the network is represented by a process. Final message destinations are not known when a message is created; hence, no routing algorithm is simulated. Instead, one third of the arriving messages are forwarded to neighboring nodes. A uniform distribution is employed to select which neighbor receives a message. Consequently, the volume of messages between any two nodes is inversely proportional to their hop distance.
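For concreteness, the forwarding rule just described can be written as a few lines of code (an illustrative sketch under our own naming, not part of the simulator itself):

#include <random>
#include <vector>

// A received message is forwarded with probability 1/3; the destination
// neighbour is drawn uniformly.  Returns the neighbour index, or -1 if the
// message terminates at this node.
int forwardTarget(const std::vector<int>& neighbours, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) >= 1.0 / 3.0 || neighbours.empty())
        return -1;
    std::uniform_int_distribution<std::size_t> pick(0, neighbours.size() - 1);
    return neighbours[pick(rng)];
}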
Fig. 1. Distributed Communication Model

Messages may flow between any two nodes, and typically several paths may be employed. Nodes receive messages at varying rates that are dependent on traffic patterns. Since there are numerous deadlocks in this model, it provides a stress test for any conservative synchronization mechanism. In fact, deadlocks occur so frequently and null messages are generated so often that a load balancing strategy is almost a necessity in this model. Various simulation conditions were created by mimicking the daily load fluctuation found in a large communication network operating across various time zones in the nation [1]. In the pipeline region, for instance, we arranged the subregion into stages, and all processes in the same stage follow the same normal distribution with a standard deviation of 25%. The time required to execute a message is significantly larger than the time to raise an event message in the next stage (message passing delay). The experimental data was obtained by averaging several trial runs. The execution times (in seconds) as a function of the number of processors are presented in the form of graphs. We also report the synchronization overhead in the form of the null message ratio (NMR), which is defined as the number of null messages processed by the simulation using the Chandy-Misra null-message approach divided by the number of real messages processed. The performance of our dynamic load balancing algorithms was compared with a static partitioning. Figure 2 depicts the results obtained for the DCM model. We observe that the load balancing algorithm improves the running time. In other words, the running time for both network models, i.e., Fully Connected and Distributed Communication, decreases as we increase the number of processors. The results in Fig. 2 indicate a reduction of 50% in the running time using the CL dynamic load balancing strategy compared with the static one when the number of processors is increased from 32 to 64, and about a 55-60% reduction when we use the ML dynamic load balancing algorithm. Figure 3 displays NMR as a function of the number of processors employed in the network models. We observe a significant reduction of null messages for all load balancing schemes. The results show that NMR increases as the number of processors increases for both population levels.

Fig. 2. Run Time Vs. Nbr of Processors

Fig. 3. NMR Vs. Nbr of Processors
For instance, if we confine ourselves to fewer than 4 processors, we see approximately a 30-35% reduction of the synchronization overhead using the CL dynamic load balancing algorithm over the static one. Increasing the number of processors from 4 to 16, we observe about a 35-40% reduction in NMR. We also observe a 45% reduction using the CL strategy over the static one when 64 processors are used, and a reduction of more than 50% using the ML scheme over the static one. These results show that the multi-level load balancing strategy significantly reduces the synchronization overhead when compared to a centralized method. In other words, careful dynamic load balancing improves the performance of a conservative parallel simulation.
4 Conclusion and Future Work
We have described a dynamic load balancing algorithm based upon a notion of CPU-queue length utilization, in which process migration takes place between physical processors. The synchronization protocol makes use of the Chandy-Misra null message approach. Our results indicate that careful load balancing can be used to further improve the performance of Chandy-Misra's approach. We note that a Multi-Level approach seems to be a promising solution for further reducing the execution time of the simulation. An improvement of between 30% and 50% in event throughput was also noted, which resulted in a significant reduction in the running time of the simulation models.
References
1. Boukerche, A., Das, S. K.: Dynamic Load Balancing Strategies For Parallel Simulations. TR-98, University of North Texas.
2. Boukerche, A., Tropper, C.: A Static Partitioning and Mapping Algorithm for Conservative Parallel Simulations. IEEE/ACM PADS'94, 1994, 164-172.
3. Misra, J.: Distributed Discrete-Event Simulation. ACM Computing Surveys, Vol. 18, No. 1, 1986, 39-65.
On-Line Scheduling of Parallelizable Jobs
Christophe Rapine 1, Isaac D. Scherson 2, and Denis Trystram 1
1 LMC - IMAG, Domaine Universitaire, BP 53, 38041 GRENOBLE cedex 9, France
2 University of California, Irvine, Department of Information and Computer Science, Irvine, CA 92717-3425, U.S.A.
Abstract. We consider the problem of efficiently executing a set of parallel jobs on a parallel machine by effectively scheduling the jobs on the computer's resources. This problem is one of optimizing resource utilization by parallel computing programs and/or managing multi-user requests on a distributed system. We assume that each job is parallelizable and can be executed on any number of processors. Various on-line scheduling strategies of time/space sharing are presented here. The goal is to assign jobs to processors in space and time such that the total execution time is optimized.
1 Introduction

1.1 Description of the Computational Model
A model of computation is a description of a computer together with its program execution rules. The interest of a model is to provide a high-level and abstract vision of the real computer world such that the design and the theoretical development and analysis of algorithms and of their complexity are possible. The most commonly adopted parallel computer model is the PRAM model [4]. A PRAM architecture is a synchronous system consisting of a global shared memory and an unbounded number of processors. Each processor owns a small local memory and is able to execute any basic instruction in one unit of time. After each basic computational step a synchronization is done and each processor can access a global variable (for reading or writing) in one time unit. This model provides a powerful basis for the theoretical analysis of algorithms. It allows in particular the classification of problems in terms of their intrinsic parallelism. Basically, the parallel characterization of a problem P is based on the order of magnitude of the time T_∞(n) of the best known algorithm to solve an instance of size n. T_∞(n) provides a lower bound for the time to solve P in parallel. However, in addition to answering the question how fast can the problem be solved?, one important characteristic is how efficient is the parallelization?, i.e. what is the order of magnitude of the number of processors P_∞(n) needed to achieve T_∞(n). This is an essential parameter to deal with in order to implement an algorithm on an actual parallel system with limited resources. This is best represented by the ratio μ_∞ between the work of the PRAM algorithm, W_∞(n) = T_∞(n)·P_∞(n), and the total number of instructions in the algorithm, W_tot. In this paper we consider a physically-distributed, logically-shared memory machine composed of m
identical processors linked by an interconnection network. In the light of modern fast processor technology, compared to the PRAM model, communications and synchronizations become the most expensive operations and must be considered as important as computations.
1.2 Model of Jobs
For reasons of efficiency, algorithms must be considered at a high level of granularity, i.e. elementary operations are to be grouped into tasks, linked by precedence constraints to form programs. A general approach is to write a parallel algorithm for v virtual processors with m ≤ v ≤ P_∞. The execution of the program needs a scheduling policy σ, static and/or dynamic, which directly influences the execution time T_m of the program. We define the work of the application on m processors as the space-time product W_m = m × T_m. Ideally speaking, one may hope to obtain a parallel time equal to T_seq/m, where T_seq denotes the sequential execution time of the program. This would imply a conservation of the work, W_m = W_tot. Such a linear speed-up is hard to reach even with an optimal scheduling policy, due mainly to communication delays and the intrinsic non-parallelism of the algorithm (even in the PRAM model we may have μ_∞ ≥ 1). We introduce the ratio μ_m to represent the "penalty" of a parallel execution with respect to a sequential one and define it as:

    μ_m = W_m / W_tot = (m · T_m) / T_seq

1.3 Discussion on Inefficiency
The inefficiency factor μ_m is an experimental parameter which reflects the quality of the parallel execution, even though it can be computed theoretically for an application and a scheduling. Let us remark that the parameter μ_m depends theoretically on n and on the algorithm itself. Practically speaking, most existing implementations of large industrial and/or academic scientific codes show a small constant upper bound. However, we can assume some general properties about the inefficiency that seem realistic. First, it is reasonable to assume that the work W_p is an increasing function of p. Hence the inefficiency is also an increasing function of p: μ_p ≤ μ_{p+1} for any p. Moreover, we can also assume that the parallel time T_p is a non-increasing function of p, at least for "reasonable" numbers of processors. Hence, (p + 1)·T_{p+1} ≤ (1 + 1/p)·p·T_p, which implies μ_{p+1} ≤ (1 + 1/p)·μ_p. Developing the inequality, for any numbers of processors p, q ≥ 1 we get μ_{p+q} ≤ ((p + q)/p)·μ_p. In other words, μ_p/p is a non-increasing function of p. As a consequence, and as μ_1 is 1 by definition, μ_p is bounded by p, which intuitively means that in the worst case the execution of the application on p processors leads in fact to a sequential execution, i.e. the application is not parallel.

1.4 Competitivity
The factor μ_p is a performance characteristic of the parallel execution of a program. It can be easily computed a posteriori after an execution, in the same way as speedup or efficiency. A more theoretical and precise performance guarantee is the competitive ratio: an algorithm is said to be ρ(m)-competitive if for any instance I to be scheduled on m processors, T_m(I) ≤ ρ(m)·T*_m(I), where T_m(I) denotes the time of the execution produced by the algorithm and T*_m(I) is the optimal time on m processors. Our goal is, given multiprocessor tasks individually characterized by an inefficiency μ_p when executed on p processors, to determine the competitive ratio for the whole system.
2 Scheduling Independent Jobs
We define an application as a set of jobs linked by precedence constraints. Most parallel programming paradigms assume that a parallel program produces independent subproblems, to be treated concurrently. Consider the following problem: given an m-processor parallel machine and a set of N independent parallelizable jobs (J_i), i = 1, ..., N, whose durations are not known until they terminate, what competitive ratio can we guarantee for the execution? We assume that each job J_i has a certain inefficiency μ_p^i when executed on p processors and that it can be preempted. The question can be restated as: given a set of inefficiencies, what is the competitive ratio we may hope for from an on-line scheduling strategy? In the following, we denote by T*(I) the optimal execution time and by T_σ(I) that of the scheduling strategy σ. The competitive ratio ρ_σ of σ is defined as the maximum over all instances I of the fraction T_σ(I)/T*(I). Another interesting performance guarantee is the comparison between T_σ(I) and the optimal execution time T*_seq(I) when the parallelization of the jobs is not allowed, i.e. jobs are computed one at a time on single processors. We denote by ρ_σ^seq the maximum over I of T_σ(I)/T*_seq(I). This competitive ratio will highlight the (possible) gain of allowing the multiprocessing of jobs compared to the classical approach we present below, where jobs are purely sequential.
2.1 Two Extreme Strategies: Graham and Gang Scheduling
The scheduling of multiprocessor jobs has recently become a matter of much research and study. In off-line scheduling theory, several articles have been published by J. Błażewicz et al. [2, 1], considering multiprocessor tasks with each one requiring a fixed number of processors for execution. If we look at the field of on-line scheduling, the most important contributions have focused on the case of one-processor tasks. In this classical approach any task requires only one processor for execution. The most famous algorithm is due to Graham [3] and consists simply in a greedy affectation of the tasks to the processors such that a processor cannot be idle if there is a remaining unscheduled job. Its competitive ratio is 2 − 1/m, which is the best possible ratio for any deterministic scheduling strategy when the computation costs of the tasks are not known until their completion. At the other end of the spectrum, Gang scheduling, analyzed recently by Scherson [5], assumes multiprocessor jobs and schedules them by allocating the whole machine, m processors, to every task. For the Happiness function defined by Scherson, the Gang strategy is proved to be the best one for preemptive multiprocessor job scheduling. Certainly, if we are concerned with the competitive ratio, the best strategy will be a compromise between allocating one processor per job (Graham) and the whole machine to every job (Gang). In particular we hope that multiprocessing of the jobs may improve the competitive ratio (2 − 1/m) of Graham. In the following, let μ_p = max_{i=1,...,N} μ_p^i.
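Before turning to the lemmas, here is a minimal sketch (ours, not from [3]) of Graham's greedy rule for one-processor tasks: whenever a processor frees up it simply grabs the next remaining job.

#include <functional>
#include <queue>
#include <vector>

// Durations are only used here to advance simulated time; the rule itself
// needs no advance knowledge of them.  Returns the makespan on m processors.
double grahamMakespan(const std::vector<double>& durations, int m) {
    // Min-heap of processor finishing times, all initially 0.
    std::priority_queue<double, std::vector<double>, std::greater<double>> finish;
    for (int p = 0; p < m; ++p) finish.push(0.0);
    for (double d : durations) {               // jobs taken in arrival order
        double t = finish.top(); finish.pop();
        finish.push(t + d);                    // earliest-free processor runs the job
    }
    double makespan = 0.0;
    while (!finish.empty()) { makespan = finish.top(); finish.pop(); }
    return makespan;
}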
Lemma 1. The competitive ratio of Gang scheduling is equal to μ_m.
Proof. Consider any instance I. Let a_i be the work of job J_i, and W_tot the total amount of work of the program. We have T_Gang = (1/m)·Σ_i μ_m·a_i = μ_m·W_tot/m. Noticing that W_tot/m is always a lower bound on T*, it follows that ρ_Gang ≤ μ_m. Consider now a particular instance consisting of m identical jobs of work a. Gang scheduling produces an execution time of μ_m·a, while assigning one processor per job produces a schedule of length exactly a. As this schedule is a one-processor job schedule, it proves that ρ_Gang ≥ μ_m, which gives the result. □
Lemma 2. For Graham scheduling, ρ^seq_Graham = 2 − 1/m and ρ_Graham ≥ m/μ_m.
Proof. Consider the scheduling of a single job of size a. Any one-processor scheduling delivers an execution time of a, while the optimal one is reached by giving the whole machine to the job. Thus T* = μ_m·a/m, and so ρ_Graham ≥ m/μ_m. □
3 SCHEduling Multiprocessors Efficiently: SCHEME
A Graham scheduling breaks into two phases: in the first one all processors are busy, giving an efficiency of one. Inefficiency appears when fewer than m tasks remain. Our idea is to preempt these tasks and to schedule them with a Gang policy in order to avoid idle time.
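Under this idealised model, the SCHEME makespan is easy to compute; the sketch below is an illustration (ours), assuming N ≥ m independent jobs whose works a_i are known to the simulator only.

#include <algorithm>
#include <functional>
#include <numeric>
#include <queue>
#include <vector>

// Phase 1 is plain Graham scheduling, one processor per job; at the first
// instant t when fewer than m jobs remain uncompleted, the leftovers are
// preempted and gang-scheduled on all m processors with inefficiency mu_m.
double schemeMakespan(const std::vector<double>& work, int m, double mu_m) {
    std::priority_queue<double, std::vector<double>, std::greater<double>> proc;
    for (int p = 0; p < m; ++p) proc.push(0.0);
    std::vector<double> completion;
    for (double a : work) {                       // greedy (Graham) phase
        double start = proc.top(); proc.pop();
        completion.push_back(start + a);
        proc.push(start + a);
    }
    std::sort(completion.begin(), completion.end());
    const std::size_t n = work.size();            // requires n >= m
    double t = completion[n - m];                 // instant when only m-1 jobs remain
    double wtot = std::accumulate(work.begin(), work.end(), 0.0);
    double remaining = wtot - m * t;              // all processors are busy in [0,t]
    return t + mu_m * remaining / m;              // Graham phase + Gang phase
}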
Fig. 1. Principle of a SCHEME scheduling

3.1 Analysis of the Sequential Competitivity
Let us consider the first time t in the Graham scheduling at which fewer than m jobs remain uncompleted. Let J_1, ..., J_k, k ≤ m − 1, be these unfinished jobs, and denote by a_1, ..., a_k their remaining amounts of work. Let W = m·t be the work executed between time 0 and t. We have the bound T_SC(m) ≤ W/m + (μ_m/m)·Σ_{i=1..k} a_i. Introducing the total amount of work of the jobs, W_tot = W + Σ_{i=1..k} a_i, and the maximum amount of work remaining for a job, a = max{a_1, ..., a_k}, it follows that T_SC(m) ≤ W_tot/m + ((μ_m − 1)/m)·Σ_{i=1..k} a_i ≤ W_tot/m + (μ_m − 1)·(1 − 1/m)·a. To obtain an expression for the sequential competitivity, we need a lower bound on the one-processor optimal time. A lower bound is always W_tot/m. As we consider a one-processor job scheduling, T*_seq is greater than the sequential time of any task, hence T*_seq ≥ a. We get the following lemma:
Lemma 3. SCHEME has a sequential competitivity of (1 − 1/m)·μ_m + 1/m.
Proof. It is sufficient to prove that the bound is tight. Consider the following instance composed of m − 1 jobs of size m and m jobs of size 1. The optimal schedule allocates one processor per task of size m and assigns all the small jobs to the last processor. It realizes an execution time m = W_tot/m. A possible execution of the SCHEME scheduling allocates first one small job per processor and then uses the gang strategy for the m − 1 large jobs. This leads to an execution time T_SC(m) = 1 + (m − 1)·μ_m, which realizes the bound. □
3.2 Analysis of the Competitive Ratio
We determine here the competitive ratio of the SCHEME scheduling compared to an optimal multiprocessor job scheduling. Consider an instance of the problem on m processors, and let S* be an optimal schedule. We decompose S* into temporal area slices (A*_i), each slice corresponding to the jobs computed between times t_{i−1} and t_i, defined recursively as follows: t_0 = 0, and t_{i+1} is the maximal date such that in the time interval [t_i, t_{i+1}[ no job is preempted or completed. Notice that in the slice A*_i either all the jobs are computed on one processor, in which case the slice is said to be "one-processored", or at least one job is executed on p ≥ 2 processors. Let A_i denote the area needed in the SCHEME schedule to compute the instructions of A*_i. Notice that any piece of job in the SCHEME scheduling is processed either on one or on m processors.

- Let A*_i be a one-processored slice. Denote by G_i the amount of instructions of A*_i executed on m processors in A_i and by W_i the amount of instructions of A*_i executed on one processor in A_i. By definition A*_i = G_i + W_i. At least one of the jobs is entirely computed on one processor in the SCHEME schedule, hence W_i ≥ A*_i/m. In A_i, the area to compute the instructions of G_i is increased by a factor μ_m. Thus A_i = μ_m·G_i + W_i = μ_m·A*_i − (μ_m − 1)·W_i ≤ μ_m·A*_i − (μ_m − 1)·A*_i/m = ((1 − 1/m)·μ_m + 1/m)·A*_i. It follows that A_i ≤ ρ^seq_SC(m)·A*_i.

- Let A*_i be a multi-processored slice. There exists a job J computed on p ≥ 2 processors in the slice. Let a be its amount of instructions and denote by B = A*_i − μ_p·a the remaining area less J. In the SCHEME schedule, the area A_i needed to execute the instructions of A*_i is at most μ_m·(a + B) = μ_m·(A*_i − (μ_p − 1)·a). Moreover A*_i ≤ (μ_p·a/p)·m, i.e. a ≥ p·A*_i/(m·μ_p). Hence A_i ≤ μ_m·(1 − p·(μ_p − 1)/(m·μ_p))·A*_i. The maximum ratio appears when the term p·(μ_p − 1)/μ_p is minimized. Noticing that p/μ_p and μ_p − 1 are non-decreasing functions of p, this minimum is reached for p = 2. Hence A_i ≤ μ_m·(1 − 2·(μ_2 − 1)/(m·μ_2))·A*_i. We call this ratio factor ρ^par_SC(m).

For any slice A*_i of S*, the corresponding area in the SCHEME schedule is thus bounded by max{ρ^seq_SC(m), ρ^par_SC(m)}·A*_i. Noticing that the SCHEME schedule contains no idle time, we get the following lemma:

Lemma 4. The competitive ratio of SCHEME scheduling is equal to

    ρ_SC(m) = (1 − 1/m)·μ_m + (1/m)·max{1, 2·μ_m/μ_2 − μ_m}
Proof. To prove that the ratio is tight, consider the instance composed of m − 2 jobs of size 1 and one job of size 2/μ_2. A possible schedule allocating one processor per unit job and two processors to the last job completes at time 1. The SCHEME scheduling leads to an execution time of (μ_m/m)·((m − 2) + 2/μ_2), which realizes the bound. □
4 Concluding Remarks
We are currently experimenting with the mixed strategy SCHEME, shown above to be (theoretically) promising. Other, more sophisticated scheduling strategies are also under investigation.
References
1. J. Błażewicz, P. Dell'Olmo, M. Drozdowski, and M.G. Speranza. Scheduling multiprocessor tasks on three dedicated processors. Information Processing Letters, 41:275-280, April 1992.
2. J. Błażewicz, M. Drabowski, and J. Węglarz. Scheduling multiprocessor tasks to minimize schedule length. IEEE Transactions on Computers, 35(5):389-393, 1986.
3. R.L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.
4. R.M. Karp. On the computational complexity of combinatorial problems. Networks, 5:45-68, 1975.
5. I. D. Scherson, R. Subramanian, V. L. M. Reis, and L. M. Campos. Scheduling computationally intensive data parallel programs. In Placement dynamique et répartition de charge : application aux systèmes parallèles et répartis (École Française de Parallélisme, Réseaux et Systèmes), pages 107-129. Inria, July 1996.
Quantum Cryptography on Optical Fiber Networks
Paul D. Townsend
BT Laboratories, B55-131D, Martlesham Heath, Ipswich, IP5 3RE, U.K.
[email protected]
Abstract. The security of conventional or classical cryptography systems relies upon the supposed (but often unproven) difficulty of solving certain classes of mathematical problem. Quantum cryptography represents a new paradigm for secure communications systems since its security is based not on computational complexity, but instead on the laws of quantum physics, the same fundamental laws that govern the behaviour of the universe. For brevity, this paper concentrates solely on providing a simple overview of the practical security problems that quantum cryptography addresses and the basic concepts that underlie the technique. The accompanying talk will also cover this introductory material, but the main emphasis will be on practical applications of quantum cryptography in optical fiber systems. In particular, I will describe a number of experimental systems that have been developed and tested recently at BT Laboratories. The experimental results will be used to provides some insights about the likely performance parameters and application opportunities for this new technology.
1 Introduction
The information revolution of the latter half of the twentieth century has been driven by rapid technological developments in the fields of digital communications and digital data processing. A defining feature of today’s digital information systems is that they operate within the domain of classical physics. Each bit of information can be fully described by a classical variable such as, for example, the voltage level at the input to a transistor or microprocessor. However, information can also be processed and transported quantum-mechanically, leading to a range of radically new functionalities [1]. These new features occur because the properties of quantum and classical information are fundamentally different. For example, while classical bits must take on one of the two mutually exclusive values of zero or one, quantum superposition enables a bit of quantum information to have the strange property of being both zero and one simultaneously. Furthermore, it is always possible, in principle, to read classical information and make a perfect copy of it without changing it. In contrast, quantum systems can be used to carry information coded in such a way that it is impossible to read it without changing it and impossible to make a perfect copy of it [2]. It is
these properties that are directly exploited in quantum cryptography [3,4,5] to provide secure communications on optical fiber systems [6,7,8,9,10]. The technique enables the secrecy of information transmitted over public networks to be tested in a fundamental way, since an eavesdropper will inevitably introduce readily detectable errors in the transmission. Quantum cryptography offers the intriguing prospect of certifiable levels of security that are guaranteed by fundamental physical laws. This possibility arises from the quantum nature of the communication system and cannot be realised with any conventional classical system.
2 Classical Cryptography
In order to understand the problems that quantum cryptography addresses it is necessary to first review some background ideas from conventional or classical cryptography. Encryption is the process of taking a message or plaintext and scrambling it so that it is unreadable by anyone except the authorised recipient. This process is perhaps best described by a simple analogy. Encryption is the mathematical analogue of taking a message and placing it inside a lockable box. The box is then locked with a key held only by the sender and the authorised recipients of the message. Any unauthorised person intercepting the locked box will be unable to retrieve the message from the box without possessing a copy of the key. Mathematically encryption can be described as an invertible mapping of a message m into a ciphertext c. The mapping is achieved by an encryption algorithm, E, which takes as input the secret key, k, and the message so that c = E(m, k). The message, key and ciphertext are expressible as binary data strings. The decryption is achieved with the use of the inverse algorithm, D, so that m = D(c, k). In terms of the lockable box analogy the algorithm is the locking mechanism which is activated by the key. As with the case of a mechanical lock it is the secrecy of the key that keeps the system secure. A well-designed lock should be difficult to open without the key even when the details of the lock are known. It is an accepted design principle of cryptographic algorithms that the security should not be dependent on the secrecy of the algorithm but should reside entirely in the secrecy of the key. A graphic example of this idea is provided by the ‘one-time pad’ cipher system [11] that was proposed by Vernam in 1926. In the binary version of the one-time pad encryption is performed by modulo 2 addition (equivalent to a bit by bit Boolean XOR) of the message and a random key of equal length. Despite the simplicity (and public availability!) of the XOR algorithm the one-time pad offers theoretically perfect secrecy as long as the key is kept secret and used only once [12]. In practice, the one-time pad is not widely used because of the requirement for large volumes of key data. However, most commercial algorithms, such as the Data Encryption Standard (DES), also use publicly available algorithms [13]. The requirement that the secrecy be entirely dependent upon the key also imposes another good design principle. In a practical system such as DES, that does not offer the perfect secrecy of the one-time pad, an eavesdropper can always attack the system simply by trying
each possible key in turn. A good algorithm will be designed such that this least efficient strategy is, nevertheless, the fastest method of attack. DES, for example, uses a key length of 56 bits, so that an attacker performing an exhaustive key search will have to test on average 2^55 keys. Although this is a huge number, future increases in computational power will necessitate the use of longer keys.
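As a minimal illustration of the binary one-time pad described above (illustrative code only, not a production cipher), encryption and decryption are the same bitwise XOR with a random key of the same length as the message:

#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<std::uint8_t> xorPad(const std::vector<std::uint8_t>& data,
                                 const std::vector<std::uint8_t>& key) {
    if (key.size() != data.size())
        throw std::invalid_argument("one-time pad key must match message length");
    std::vector<std::uint8_t> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        out[i] = data[i] ^ key[i];      // modulo-2 addition, bit by bit
    return out;
}
// Decryption is the identical call: xorPad(xorPad(m, k), k) == m.

The security rests entirely on the key being truly random, kept secret, and never reused, which is exactly why key distribution is the central problem addressed later in this paper.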
Fig. 1. The basic elements of a cipher system.
Figure 1 shows a schematic representation of a cryptography system, where the standard terminology of Alice, Bob and Eve is used to describe the transmitter, receiver and an unauthorised eavesdropper respectively. In order to operate the scheme Alice and Bob must exchange keys and, most importantly, must ensure that these keys do not fall into Eve’s hands. Furthermore, if the system is to remain secure Alice and Bob must regularly repeat this process ensuring that the updated keys are also secret. How to achieve this goal is one of the central problems in secure communications; keys must be generated and distributed and the whole process managed in such a way as to ensure the secrecy of the system. In essence quantum cryptography is a communication protocol that enables cryptographic keys to be securely distributed over communication channels that are physically insecure and hence potentially subject to eavesdropping. The technique achieves the desirable goal of an automated key establishment procedure between two or more parties in such a way that any unauthorised interception can be discovered and dealt with. Some modern cryptography systems attempt to solve the key management problems using a technique known as public-key cryptography [13]. In public-key cryptography each user has two keys, a private key, and a public key. The public key is widely distributed; the private key is kept secret. To send a confidential message using a public-key scheme Alice obtains Bob’s public key from a public directory and uses this to encrypt her message to Bob. Once this has been done only Bob can decrypt the message using his private key. In terms of the lockable box analogy this is like having a box whose lock is operated by two keys. Everyone has access to the key to lock the box but
cannot unlock it. Only one key will open the box and that is the private key held by the message recipient. The security of public-key cryptography systems is based on ‘one-way functions’, that is, readily computable mathematical problems for which the solution of the inverse problem is thought to be computationally infeasible. The classic example of such a function is multiplication and its inverse, factorisation. For example, it is easy to multiply two large prime numbers, but extremely difficult to factorise the resulting product into its two prime factors. Cryptographers using public-key cryptography have to make the assumption that advances in mathematics and computing will not dramatically change the ability of an attacker to compute these difficult problems. This issue has received a great deal of attention recently because of Shor’s discovery of a high-speed factorisation algorithm for quantum computers [14]. If such computers can ever be built, the security of public-key cryptosystems may be dramatically reduced.
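A toy illustration of this asymmetry (our sketch, not a real cryptosystem): forming n = p·q is a single multiplication, while the most naive attack, trial division, needs on the order of √n operations to recover the factors; practical public-key systems use numbers of hundreds of digits, far beyond such a search.

#include <cstdint>
#include <utility>

// Returns the smallest factor of n together with its cofactor.
std::pair<std::uint64_t, std::uint64_t> factorByTrialDivision(std::uint64_t n) {
    for (std::uint64_t d = 2; d <= n / d; ++d)
        if (n % d == 0) return {d, n / d};   // smallest prime factor found
    return {n, 1};                           // n is prime (or 1)
}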
3 Quantum Cryptography

3.1 Background
The following section describes the original ‘BB84’ quantum key distribution protocol developed by Bennett and Brassard [3]. This protocol is used exclusively in the BT experiments that will be described in the talk. In general, quantum information systems require the use of some suitable two-state quantum objects to provide quantum-level representations of binary information (qubits). In the BB84 scheme Alice and Bob employ the linear and circular polarization states of single photons of light for this purpose. Figure 2 shows schematic representations
Fig. 2. Schematic representation and notation used to describe the linear and circular polarization states of single photons of light. In the BB84 quantum cryptography scheme the linear and circular states are used to provide two different quantum level representations of zero and one.
of these states together with the notation used to represent them and their associated binary values. The linear and circular ’bases’ are used to provide two different quantum level representations of zero and one. Before describing how the properties of these single photon polarization states are exploited in the key distribution protocol we will consider the outcomes and interpretation of various possible measurements that can be performed on them.
Fig. 3. A receiver uses a polarizer and two single photon detectors to perform linear polarization measurements on the linear (a and b) and circular (c and d) single photon polarization states. The probabilities of the various outcomes are indicated as percentage values. The linear measurement is incompatible with the circular representation and the photon is equally likely to be repolarized as either a horizontal or vertical state and detected at the appropriate detector.
Figures 3(a)-3(d) illustrate the possible outcomes of measurements on each of the four polarization states shown in figure 2. In each case the receiver has arranged a polarizer and two single photon detectors to perform a linear polarization measurement on the incoming photon (the apparatus is assumed to be perfect). For the two linear states the outcome of the measurement is deterministic, i.e. the |V⟩_linear photon is registered at detector D_V and the |H⟩_linear photon is registered at detector D_h, both with 100% accuracy. Of course similar results would also be obtained for classical input fields in the vertical and horizontal polarization states. In contrast, a classical input state with circular polarization would generate equal-intensity signals at the two detectors. Single photons, however, are elementary excitations of the electro-magnetic field and they cannot split in two. Instead the |R⟩_circ and |L⟩_circ states behave in a random fashion and have a 50% probability of being repolarized in the state |V⟩_linear and registered at D_V and, similarly, a 50% probability of being repolarized in the state |H⟩_linear and registered at D_h. In quantum mechanics terminology the photon is said to be projected into an eigen-state of the measurement operator, namely either |V⟩_linear or |H⟩_linear. Taking the example of the |R⟩_circ state, the proba-
bility of each possible outcome is given by the squared modulus of the amplitude coefficients in (1),

    |R⟩_circ = (1/√2)·(|V⟩_linear + i·|H⟩_linear)        (1)
the expansion of |R⟩_circ in the linear representation. In the case of the current example it can be seen that the measurement provides no information at all on a photon’s state of circular polarization. Of course, the receiver could have included a 1/4-wave retardation plate in front of the polarizer in order to perform a perfect circular measurement. However, this would only be obtained at the expense of no information on the photon’s linear polarization state, since the |V⟩_linear and |H⟩_linear states behave in a random fashion when subjected to a circular polarization measurement. This situation arises because linear and circular polarization are complementary quantum observables that are incompatible in the sense that a measurement of one property will always disturb the other. Consider now the situation where one of the four possible polarization states is chosen at random and the receiver is asked to distinguish which one has been sent. Evidently the receiver must choose a measurement to perform on the photon. If he happens to choose an incompatible measurement the outcome will be random, but this fact cannot be deduced from the measurement itself. Hence, the receiver cannot determine unambiguously the polarization state of the photon unless he is supplied with additional information on which type of state has been sent. All that can be said after a particular measurement is that the photon is now in the polarization state measured. Although we have used the example of a specific type of measurement here, there is in fact no single measurement or sequence of measurements that will allow the receiver to accurately determine the state of the photon when he is told only that it has been coded in one of two incompatible representations, but not which one. As we shall see, this peculiarly quantum phenomenon can be directly exploited to provide security in a quantum cryptography scheme. The reader might be tempted to suggest at this point that a solution to the measurement problem could be to copy the photon so that the receiver could measure different properties on the identical copies. Fortunately, at least from the quantum cryptographer’s point of view, this type of ’cloning’ is forbidden by quantum mechanics [2].
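A quick numerical check of equation (1) and of the 50/50 statistics discussed above (illustrative code; it simply evaluates the squared moduli of the two amplitudes):

#include <cmath>
#include <complex>
#include <cstdio>

int main() {
    // Amplitudes of |R>_circ in the linear basis, from equation (1).
    const std::complex<double> ampV(1.0 / std::sqrt(2.0), 0.0);
    const std::complex<double> ampH(0.0, 1.0 / std::sqrt(2.0));
    // std::norm gives the squared modulus, i.e. the detection probability.
    std::printf("P(V) = %.3f, P(H) = %.3f\n",
                std::norm(ampV), std::norm(ampH));   // prints 0.500, 0.500
    return 0;
}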
3.2 Quantum Key Distribution Protocol
The aim of quantum key distribution is not to take a specific predetermined key and send it to the recipient but rather to generate a random bit sequence in two separate locations that can then be used as a key. The key only comes into existence when the entire protocol has been executed. In order to carry out this process the transmitter Alice and the receiver Bob employ a pair of communication channels: a quantum channel that is used for sending the photons and a classical channel that is used for a subsequent public discussion about various aspects of the quantum transmission. It is assumed that Eve may have access to
both of these channels at any time. She may make any measurement that she chooses on the quantum channel and is free to learn the contents of all the public channel messages although, as discussed later, she cannot change their content. The key distribution process is illustrated schematically in figure 4. Alice begins by choosing a random bit sequence and for each bit randomly selects either the linear or the circular representation. She then prepares a regular-clocked sequence of photons in the appropriate polarization states and transmits them to Bob over the quantum channel. For the example shown in the figure Alice has chosen to send a one in the first timeslot using the linear representation i.e. a vertically polarized photon, a one in the second timeslot using the circular representation i.e. a right polarized photon, and so on. At the other end of the channel Bob makes an independent random choice of whether to perform a linear or a circular measurement in each time period and records the result. As discussed in the example above, the type of measurement is selected by applying a 1/4-wave retardation before the polarizer for circular and no retardation for linear, and the result depends on which of the detectors registers the photon. The optical transmission system will also inevitably suffer from a variety of loss processes that randomly delete photons from the system and in these cases neither of Bob’s detectors fires and he denotes this null event by an ’X’. In the timeslots where Bob did receive a photon and happened to choose the same representation as Alice (as in timeslot 1, for example) the bit that he registers agrees with what Alice sent. In timeslots where Bob used a different representation to Alice (as in timeslot 2, for example) the outcome of the measurement is unpredictable and he is equally likely to obtain a one or a zero. Alice and Bob now carry out a discussion using the classical public channel that enables them to generate a shared random bit sequence from the data and test its secrecy. Bob announces the timeslots when he received a photon and the type of measurement he performed in each but, crucially, not the result of the measurement as this would give away the bit value if Eve were monitoring the discussion. In the case of the transmission shown in figure 4 Bob reveals that he received photons in timeslots 1, 2, 6 and 8 using linear, and timeslots 3 and 5 using circular. Alice then tells Bob in which of these timeslots they both used the same representation (1, 3 and 8) and they discard all other data. In this way they identify the photons for which Bob’s measurements had deterministic outcomes and hence distil the shared subset 101 from Alice’s initial random bit sequence. This is the ‘raw’ key data. In practice, the process would of course be continued until a much longer sequence had been established. Why do Alice and Bob go through this tedious and inefficient process? Recall that the goal is to establish a secret random bit sequence for use as a cryptographic key. Consider the situation illustrated in Fig. 5 where Eve attempts to intercept the quantum transmission, make a copy, and then resend it on to Bob. In each timeslot Eve (like Bob) has to make a choice of what measurement to perform. Without any a priori knowledge of the representation chosen by Alice, Eve (like Bob) has a 50% chance of performing an incompatible measurement as shown in the figure. This leads to a random outcome and a change of polariza-
Fig. 4. Schematic representation of the quantum transmission phase of the BB84 quantum cryptography protocol. The top row shows Alice’s outputs in 8 time slots that run sequentially from right to left. Bob’s choice of measurements (cross=linear, circle =circular) and his results for the 8 timeslots are shown in the lower rows (X=no photon detected). During the public discussion Alice and Bob agree to discard the results where they used different representations (greyed boxes) and retain the shared bit sequence 101.
Fig. 5. Eve tries to intercept, make a copy, and then resend the bit sequence on to Bob. For each photon Eve has a 50% chance of choosing an incompatible measurement that leads to a random outcome. This is the case shown here where Eve has chosen to perform a linear measurement on Alice’s circular photon. As a result Eve uses her source S to send on a vertical photon to Bob who now has a 50% chance of obtaining an error even though he has performed a compatible (as far as Alice and Bob are concerned) measurement.
Quantum Cryptography on Optical Fiber Networks
43
tion state for the photon (note that we assume for simplicity that Eve makes the same type of polarization measurements as Bob, but the system is designed to be secure for any measurements that Eve may make). Eve has no way of knowing that she has performed an incompatible measurement because, unlike Bob, she does not have the benefit of the post transmission public discussion. The latter only takes place after the photons have arrived at Bob’s end of the channel. Consequently, Eve sends a repolarized photon on to Bob who now has a 50% chance of observing an error in a timeslot where he has used the same representation as Alice and therefore expects a deterministic outcome. Consequently, if Eve makes continuous measurements on the channel she will generate a substantial bit-error rate of the order of 25% in the raw key. Of course, Eve may make measurements on only some of the photons or she may use other measurement techniques that yield less information but produce less disturbance on the channel [4,15,16]. Nevertheless, the crucial point is that Eve can only reduce the disturbance that she causes by also reducing the amount of information that she obtains; quantum mechanics ensures that these two quantities are directly related. For the specific example given above, Eve has a 50% chance per interception of making the correct choice of measurement and hence obtaining the correct bit and a 50% chance of making an incorrect choice of measurement that yields no information and causes disturbance on the channel. If Eve listens to the subsequent public discussion she can identify the instances in which her choice of measurement was correct and hence identify the bits (approximately 50%) in her data string that are shared with Alice and Bob. Hence if Eve intercepts 20% of the photons, for example, she will learn 10% of Alice and Bob’s bits and generate an error-rate of around 5%. This simple example is unrealistic, because it does not describe all the measurement strategies that are available to Eve in a practical system, or the relative ’values’ of the different types of information that she can obtain. Nevertheless, it does emphasise the basic principle of the technique which is that Alice and Bob can set an upper limit on the amount of information that Eve has obtained on their shared bit string simply by evaluating the error rate for their transmission [4]. As discussed below, if this is sufficiently low, they can generate a final highly secret key, if not they identify that the channel is currently insecure and cannot be used for secure key distribution. Quantum cryptography thus prevents the situation, which can always arise in principle with any classical key distribution system, where Alice and Bob’s encrypted communications are compromised because they are unaware that Eve has surreptitiously obtained a copy of the key. A full discussion of the final stages of the protocol that achieve this goal is beyond the scope of this paper, however, the main procedures that are involved will be briefly outlined. After discarding the inconclusive results from their bit strings, Alice and Bob publicly compare the parities of blocks of their data, and where these do not match, performing a bisective search within the block to identify and discard the error [4]. In this way they can derive a shared, error free, random bit sequence from their raw data together with a measurement of the error rate. This is achieved at the cost of leaking some additional information to
Eve in the form of the parities of the data subsets. In real systems transmission errors occur even when no eavesdropper is present on the channel. These errors can arise, for example, from the finite polarization extinction ratios of the various optical components in the system or from noise sources such as detector dark counts. In principle there is no way to distinguish these errors from those caused by eavesdropping, hence all errors are conservatively assumed to have been generated by Eve. If the error rate is sufficiently low (typically of the order of a few percent in practice), then a technique known as 'privacy amplification' [17,18] can be employed to distil from the error-corrected (but possibly only partially secret) key a smaller amount of highly secret key about which Eve is very unlikely to know even one bit. If, for example, Alice and Bob calculate that Eve may know e bits of their error-corrected n-bit string, they can generate from this data a highly secret m-bit key (where m < n - e) simply by computing (but not announcing, as in error-correction) the parities of m publicly agreed-on random subsets of their data [4,17,18]. The final stage of the process is for Alice and Bob to authenticate their public channel discussions using an information-theoretically secure authentication scheme [19]. This is to prevent the powerful attack in which Eve breaks into the public channel and impersonates Bob to Alice and vice versa [4]. If this were possible Eve could share one key with Alice and another with Bob, who would be unaware that this was the case. Like the one-time pad cipher, the authentication scheme requires Alice and Bob to possess in advance a modest amount of shared secret key data, part of which is used up each time a message is authenticated. However, this initial key is only required at system turn-on time because each implementation of the quantum key distribution protocol generates a fresh volume of key data, some of which can be used for authentication. After privacy amplification and authentication Alice and Bob are now in possession of a shared key that is certifiably secret. They can now safely use this key together with an appropriate algorithm for data encryption purposes.
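A rough sketch of the privacy-amplification step may help fix ideas. The subset-parity construction below follows the general idea referred to above ([4,17,18]); the function and variable names are invented, and a real system would choose the output length m from the estimated information leakage rather than as a free parameter.

    import random

    def privacy_amplify(corrected_key, m, seed=0):
        """Derive an m-bit key from an error-corrected key: each output bit is the
        parity of a publicly chosen random subset of the input bits. The subset
        choices may be announced publicly; the parities themselves are not."""
        rng = random.Random(seed)
        n = len(corrected_key)
        final_key, subsets = [], []
        for _ in range(m):
            subset = [i for i in range(n) if rng.randint(0, 1)]
            subsets.append(subset)
            parity = 0
            for i in subset:
                parity ^= corrected_key[i]
            final_key.append(parity)
        return final_key, subsets

    corrected = [1, 0, 1, 1, 0, 0, 1, 0]      # toy 8-bit error-corrected key
    key, public_subsets = privacy_amplify(corrected, m=3)
    print(key, public_subsets)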
4 Conclusions
In the accompanying talk I will demonstrate how the theoretical ideas discussed above can be implemented and tested in a number of scenarios of real practical interest. In particular I will describe a scheme based on photon interference that we have developed over the last few years at BT [7,20,21,22,23,24]. This system operates in the 1.3µm-wavelength fiber transparency window over point-to-point links up to ∼50km in length [23] and on multi-user optical networks [6,25]. I will also discuss how this technology performs on fiber links installed in BT’s public network and discuss issues such as cross-talk with conventional data channels propagating at different wavelengths in the same fiber [24]. The experiments demonstrate that quantum cryptography can provide, at least from a technical point of view, a practical solution for secure key distribution on optical fiber links and networks. In summary, quantum cryptography has now moved a long way
from the original fascinating theoretical concept towards practical applications in optical communication systems.

Acknowledgements

I would especially like to thank Keith Blow for his continued advice, support and encouragement and Simon Phoenix for his theoretical contributions. I would also like to acknowledge the important contributions of two graduate students, Christophe Marand and Guilhem Ensuque, to the experimental work mentioned above.
References
1. C. H. Bennett, 'Quantum information and computation', Physics Today, October (1995), for a review of this topic.
2. W. K. Wootters and W. H. Zurek, 'A single quantum cannot be cloned', Nature, 299 802-803 (1982)
3. C. H. Bennett and G. Brassard, 'Quantum cryptography: public-key distribution and coin tossing', in Proceedings of IEEE International Conference on Computers, Systems and Signal Processing, Bangalore, India, 175-179 (1984)
4. C. H. Bennett, F. Bessette, G. Brassard, L. Salvail and J. Smolin, 'Experimental quantum cryptography', Journal of Cryptology, 5 3-28 (1992)
5. A. K. Ekert, 'Quantum cryptography based on Bell's Theorem', Physical Review Letters, 67 661-663 (1991)
6. P. D. Townsend, 'Quantum cryptography on multi-user optical fiber networks', Nature, 385 47-49 (1997)
7. C. Marand and P. D. Townsend, 'Quantum key distribution over distances as long as 30km', Optics Letters, 20 1695-1697 (1995)
8. J. D. Franson and B. C. Jacobs, 'Operational system for quantum cryptography', Electronics Letters, 31 232-234 (1995)
9. R. J. Hughes, G. G. Luther, G. L. Morgan and C. Simmons, 'Quantum cryptography over 14km of installed optical fiber', Proc. 7th Rochester Conf. on Coherence and Quantum Optics (eds J. H. Eberly, L. Mandel and E. Wolf), 103-112 (Plenum, New York, 1996)
10. H. Zbinden, J. D. Gautier, N. Gisin, B. Huttner, A. Muller, and W. Tittel, 'Interferometry with Faraday mirrors for quantum cryptography', Electronics Letters, 33 586-587 (1997)
11. G. S. Vernam, J. Amer. Inst. Electr. Engrs., 45 109-115 (1926)
12. C. E. Shannon, 'Communication theory of secrecy systems', Bell Syst. Tech. J., 28 656-715 (1949)
13. H. Beker and F. Piper, Cipher Systems: the Protection of Communications (Northwood Publications, London, 1982). See also G. Brassard, Modern Cryptology, Lecture Notes in Computer Science, eds G. Goos and J. Hartmanis (Springer-Verlag, Berlin, 1988)
14. P. Shor, 'Algorithms for quantum computation: discrete logarithm and factoring', Proc. 35th Annual IEEE Symposium on Foundations of Computer Science (IEEE Computer Society Press, 1994), 124-134
15. A. K. Ekert, B. Huttner, G. M. Palma and A. Peres, 'Eavesdropping on quantum cryptosystems', Physical Review A 50, 1047-1056 (1994)
16. B. Huttner and A. K. Ekert, 'Information gain in quantum eavesdropping', J. Mod. Opt., 41 2455-2466 (1994)
17. C. H. Bennett, G. Brassard and J.-M. Robert, 'Privacy amplification by public discussion', SIAM Journal on Computing, 17 210-229 (1988)
18. C. H. Bennett, G. Brassard, C. Crepeau and U. Maurer, 'Generalized privacy amplification', SIAM Journal on Computing, 17 210-229 (1988)
19. M. N. Wegman and J. L. Carter, 'New hash functions and their use in authentication and set equality', J. Computer and System Sciences, 22 265-279 (1981)
20. P. D. Townsend, J. G. Rarity and P. R. Tapster, 'Single-photon interference in a 10km long optical fiber interferometer', Electronics Letters, 29 634-635 (1993)
21. P. D. Townsend, 'Secure key distribution system based on quantum cryptography', Electronics Letters, 30 809-810 (1994)
22. S. J. D. Phoenix and P. D. Townsend, 'Quantum cryptography: how to beat the code breakers using quantum mechanics', Contemporary Physics, 36 165-195 (1995)
23. P. D. Townsend, 'Quantum cryptography on optical fiber networks', Optical Fiber Technology, (in press)
24. P. D. Townsend, 'Simultaneous quantum cryptographic key distribution and conventional data transmission over installed fiber using wavelength division multiplexing', Electronics Letters, 33 188-189 (1997)
25. P. D. Townsend, S. J. D. Phoenix, K. J. Blow and S. M. Barnett, 'Design of quantum cryptography systems for passive optical networks', Electronics Letters, 30 1875-1877 (1994). See also S. J. D. Phoenix, S. M. Barnett, P. D. Townsend and K. J. Blow, 'Multi-user quantum cryptography on optical networks', Journal of Modern Optics, 42 1155-1163 (1995)
On Optimal k-linear Scheduling of Tree-Like Task Graphs for LogP-Machines

Wolf Zimmermann 1, Martin Middendorf 2, and Welf Löwe 1

1 Institut für Programmstrukturen und Datenorganisation, Universität Karlsruhe, 76128 Karlsruhe, Germany, E-Mail: {zimmer|loewe}@ipd.info.uni-karlsruhe.de
2 Institut für Angewandte Informatik und Formale Beschreibungsverfahren, Universität Karlsruhe, 76128 Karlsruhe, Germany, E-Mail: mmi@aifb.uni-karlsruhe.de
Abstract. A k-linear schedule may map up to k directed paths of a task graph onto one processor. We consider k-linear scheduling algorithms for the communication cost model of the LogP-machine, i.e. without assumption on processor bounds. The main result of this paper is that optimal k-linear schedules of trees and tree-like task graphs G with n tasks can be computed in time O(n^{k+2} log n) and O(n^{k+3} log n), respectively, if o ≥ g. These schedules satisfy a capacity constraint, i.e., there are at most ⌈L/g⌉ messages in transit from any or to any processor at any time.
1 Introduction

The LogP-machine [3,4] assumes a cost model reflecting latency of point-to-point communication in the network, overhead of communication on processors themselves, and the network bandwidth. These communication costs are modeled with parameters Latency, overhead, and gap. The gap is the inverse bandwidth of the network per processor. In addition to L, o, and g, parameter P describes the number of processors. Furthermore, the network has finite capacity, i.e. there are at most ⌈L/g⌉ messages in transit from any or to any processor at any time. If a processor attempts to send a message that exceeds this capacity, then it stalls until the message can be sent without violation of the capacity constraint. The LogP-parameters have been determined for several parallel computers [3,4,6,8]. These works confirmed the LogP-based runtime predictions. Much research has been done in recent years on scheduling task graphs with communication delay [9-11, 14, 21, 23, 25], i.e. o = g = 0. In the following, we discuss those works assuming an unrestricted number of processors. Papadimitriou and Yannakakis [21] have shown that given a task graph G with unit computation times for each task, communication latency L > 1, and integer Tmax, it is NP-complete to decide whether G can be scheduled in time ≤ Tmax. Their results hold no matter whether redundant computations of tasks are allowed or not. Finding optimal time schedules remains NP-complete for DAGs obtained by the concatenation of a join and a fork (if redundant computations are allowed) and for fine grained trees (no matter if redundant computations are allowed or
not). A task graph is fine grained if the granularity γ, which is a constant closely related to the ratio of computation and communication, is < 1 [22]. Gerasoulis and Yang [9] find schedules for task graphs guaranteeing the factor 1 + 1/γ of the optimal time if recomputation is not allowed. For some special classes of task graphs, such as join, fork, coarse grained (inverse) trees, an optimal schedule can be computed in polynomial time [2, 9, 15, 25]. If recomputation is allowed some coarse grained DAGs can also be scheduled optimally in polynomial time [1, 5, 15]. The problem of finding optimal schedules having a small amount of redundant computations has been addressed in [23]. Recently, there are some works investigating scheduling task graphs for the LogP-machine without limitations on the number of processors [12, 16-18, 20, 24, 26] and with limitations on the number of processors [19, 7]. Most of these works discuss approximation algorithms. In particular, [18, 26] generalizes the result of Gerasoulis and Yang [9] to LogP-machines using linear schedules. [20] shows that optimal linear schedules for trees and tree-like task graphs can be computed in polynomial time if the cost of each task is at least g - o. However, linear schedules for trees often have bad performance. Unfortunately, if we drop the linearity, even scheduling trees becomes NP-complete [20]. [24] shows that the computation of a schedule of length at most B is NP-complete even for fork and join trees and o = g. They discuss several approximation algorithms for fork and join trees. [12] discusses an optimal scheduling algorithm for some special cases of fork graphs. In this paper, we generalize the linearity constraint and allow instead to map up to k directed paths on one processor. We show that it is possible to compute optimal k-linear schedules on trees and tree-like task graphs in polynomial time if o = g. In contrast to most other works on scheduling task graphs for the LogP-machine, we take the capacity constraint into account. Section 2 gives the basic definitions used in this paper. Section 3 discusses some normalization properties for k-linear schedules on trees. Section 4 presents the scheduling algorithm with an example that shows the improvement of k-linearity w.r.t. linear schedules.
2 Definitions
A task graph is a directed acyclic graph G = (V, E, τ), where the vertices are sequential tasks, the edges are data dependencies between them, and τ_v is the computation cost of task v ∈ V. The set of direct predecessors PRED_v of a task v in G is defined by PRED_v = {u ∈ V | (u, v) ∈ E}. The set of direct successors SUCC_v is defined analogously. Leaves are tasks without predecessors, roots are tasks without successors. A task v is an ancestor (descendant) of a task u iff there is a path from v to u (u to v). The set of ancestors (descendants) of u is denoted by ANC_u (DESC_u). A set of tasks U ⊆ V is independent iff for all u, v ∈ U, u ≠ v, neither u ∈ ANC_v nor v ∈ ANC_u. A set of tasks U ⊆ V is k-independent iff there is an independent subset W ⊆ U of size k. G is an inverse tree iff |SUCC_v| ≤ 1 for every task v and there is exactly one root. In this case, SUCC_v
denotes the unique successor of non-root tasks v. G is tree-like iff for each root v, the subgraph of G induced by the ancestors of v is a tree. A schedule for G specifies for each processor P_i and each time step t the operation performed by processor P_i at time t, provided one starts at that time. The operations are the execution of a task v ∈ V (denoted by the task), sending a message containing the result of a task v to another processor P (denoted by send(v, P)), and receiving a message containing the result of a task v¹ from processor P (denoted by recv(v, P)). A schedule is therefore a partial mapping s : ℙ × ℕ → OP, where ℙ denotes an infinite set of processors and OP denotes the set of operations. s must satisfy some restrictions: (i) Two operations on the same processor must not overlap in time. The computation of a task v requires time τ_v. Sending from v or receiving a message sent by v requires time o. (ii) If a processor computes a task v ∈ V then any task w ∈ PRED_v must be computed or received before on the same processor. (iii) If a processor sends a message of a task, it must be computed or received before on the same processor. (iv) Between two consecutive send (receive) operations there must be at least time g, where v is the message sent earlier (received later). (v) For any send operation there is a corresponding receive operation, and vice versa. Between the end of a send operation and the beginning of the corresponding receive operation there must be at least time L. (vi) Any v ∈ V is computed on some processor. (vii) The capacity constraints are satisfied by s.
A trace of s induced by processor P is the sequence of operations performed by P together with their starting times. tasks(P) denotes the set of tasks computed by processor P. A schedule s for G is k-linear, k ∈ ℕ, iff tasks(P) is not (k + 1)-independent for every processor P, i.e. at most k directed paths of G are computed by P. The execution time TIME(s) of a schedule s is the time when the last operation is finished. A sub-schedule s_v induced by a task v computes just the ancestors of v and performs all send and receive operations of the ancestors except v on the same processors at the same time as s, i.e.
    s_v(P, t) =
        s(P, t)      if s(P, t) ∈ ANC_v,
                     or s(P, t) = recv(w, P') for a processor P' and v ≠ w ∈ ANC_v,
                     or s(P, t) = send(w, P') for a processor P' and v ≠ w ∈ ANC_v;
        undefined    otherwise.
It is easy to see that s_v is a schedule of the subgraph of G induced by ANC_v. A schedule s for an inverse tree G = (V, E, τ) is non-redundant if each task of G is computed exactly once, for each v ∈ V there exists at most one send operation, and there exists no send operation if v is computed on the same processor as its successor.

¹ For simplicity, we say sending and receiving the task v, respectively.
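The independence and k-linearity conditions above are easy to check by brute force on small instances. The following Python sketch is illustrative only (the helper names are invented); the example graph is a complete binary in-tree with 15 tasks, root 0 and leaves 7-14, in which each internal task v has the direct predecessors 2v+1 and 2v+2.

    from itertools import combinations

    def ancestors(preds, v):
        """All tasks with a path to v, given a map from each task to its direct predecessors."""
        seen, stack = set(), list(preds.get(v, ()))
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(preds.get(u, ()))
        return seen

    def respects_k_linearity(preds, tasks_on_processor, k):
        """True iff the given task set is not (k+1)-independent, i.e. it contains
        no k+1 pairwise incomparable tasks (at most k directed paths)."""
        anc = {v: ancestors(preds, v) for v in tasks_on_processor}
        for subset in combinations(tasks_on_processor, k + 1):
            if all(u not in anc[v] and v not in anc[u]
                   for u, v in combinations(subset, 2)):
                return False                     # found an independent set of size k+1
        return True

    preds = {v: [2 * v + 1, 2 * v + 2] for v in range(7)}   # inverse tree, leaves 7-14
    print(respects_k_linearity(preds, {0, 1, 3, 7}, k=1))   # a single path: True
    print(respects_k_linearity(preds, {7, 8, 3}, k=1))      # tasks 7 and 8 are independent: False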
3 Properties of k-linear Schedules
In this section we prove some general properties of k-linear schedules on trees. These properties are used to restrict the search space for optimal schedules.

Lemma 1. Let G = (V, E, τ) be an inverse tree. For every schedule s for G there exists a non-redundant schedule s' with TIME(s') ≤ TIME(s). If s is k-linear, then so is s'.

Proof: Let s be a schedule for G. First we construct a sub-schedule of s where each node is computed only once. Assume that a task is computed more than once in s. If the root is computed more than once, just omit all but one computation of the root. Otherwise let v be computed more than once such that SUCC_v is computed only once. Let SUCC_v be computed on processor P. Then there exists a computation of v on a processor P_0 and a (possibly empty) sequence send(v, P_1), recv(v, P_0), send(v, P_2), ..., send(v, P), recv(v, P_k) of corresponding send and receive operations of v such that v is received on P before the computation of SUCC_v starts. Now, all other computations of v are omitted, the above sequence of send and receive operations is replaced by send(v, P), recv(v, P_0), and all other operations sending or receiving v are removed, provided P_0 ≠ P. If P_0 = P, then all operations sending or receiving v are removed. This process is repeated until each task is computed exactly once. It is clear that the obtained schedule s' is non-redundant, TIME(s') ≤ TIME(s), and the above transformations preserve k-linearity. □
Lemma 2. Let G = (V, E, τ) be an inverse tree. For every non-redundant k-linear schedule s for G there is a non-redundant k-linear schedule s' satisfying TIME(s') ≤ TIME(s) such that for every processor P, the subgraph of G induced by tasks(P) is an inverse tree with at most k leaves.

Proof: Suppose there is a processor P such that the subgraph G_P of G induced by tasks(P) is not an inverse tree. Since every subgraph of G is a forest consisting of inverse trees, G_P must have more than one connected component. Since s is k-linear, each connected component contains at most k leaves. Let s' be the schedule obtained from s by computing each of these connected components on separate (new) processors, and updating accordingly the operations that send tasks to processor P. It is easy to see that this transformation neither affects the execution time nor the k-linearity. The claim follows by induction. □

Corollary 1. Let G = (V, E, τ) be an inverse tree. Then there exists an optimal k-linear schedule s such that for every v ∈ V the sub-schedule s_v induced by v is an optimal k-linear schedule for G_v which is non-redundant and for each processor P the subgraph induced by tasks(P) is an inverse tree with at most k leaves.

Proof: Let s be an optimal k-linear schedule for G that is non-redundant. If for a v ∈ V the sub-schedule s_v is not optimal, replace it by an optimal k-linear sub-schedule that is non-redundant. □

Thus, for obtaining optimal k-linear schedules, it is sufficient to consider non-redundant schedules that satisfy Corollary 1.
4 The Scheduling Algorithm
In this section, we present the algorithm computing optimal k-linear schedules for task graphs, if o = g. We first reduce the scheduling problem to a special case of the 1-machine scheduling problem with precedence constraints and release dates, i.e. a task graph G = (V, E, τ) is to be scheduled on one processor where for every task v there is a time r_v (release date) such that v cannot be scheduled before r_v.

Lemma 3. Let G = (V, E, τ) be an inverse tree. Let s be a schedule for G that satisfies the property of Corollary 1. For a v ∈ V consider the sub-schedule s_v and let P be the processor computing v. Let tr be the trace of P in s_v where receive operations are replaced by computations of length o computing the received tasks, and let G' be the subgraph of G_v induced by tasks(P) ∪ S, where S denotes the set of tasks received by P in s_v. Then tr is an optimal 1-machine schedule for G' with release dates TIME(s_v) + L + o for v ∈ S.
Proof: No v ∈ S can be received before time t_v = TIME(s_v), because send(v, P) cannot be scheduled earlier than t_v. Since every reception of a task costs time o, tr is a 1-machine schedule for G' with release dates t_v + L + o for v ∈ S. Since s is optimal, tr is also optimal. □

ALGORITHM 1
(1)  for v ∈ V according to a topological order of G do
(2)    t_v := ∞; s_v := ∅;                        -- try to compute an optimal k-linear schedule
(3)    for every independent U ⊆ ANC_v, 0 ≤ |U| ≤ k do
(4)      s'_v := optimal_trace(G, v, U, t)
(5)      if TIME(s'_v) < t_v then                 -- a better schedule is found
(6)        t_v := TIME(s'_v);
(7)        s_v := s'_v ∪ ⋃_{u ∈ PRED(U)} ( s_u ∪ { s_v(P_u, t') = send(u, P_v) } ),
               where s_v(P_v, t' + L + o) = recv(u, P_u)
(8)      end;  -- if
(9)    end;  -- for
(10) end;  -- for
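The enumeration performed by Algorithm 1 can be restated compactly. The sketch below is illustrative only: independent(U) and optimal_trace(v, U, t) are assumed to be supplied by the caller (the latter standing for the 1-machine procedure of Lemma 3, given the finish times t of the already scheduled subtrees), and it records only the best finish time and leaf set per task rather than the full schedule.

    from itertools import combinations

    def algorithm1_outline(tasks_topo, ancestors, independent, optimal_trace, k):
        """For each task v (in topological order, leaves first) try every independent
        set U of at most k ancestors as the leaf set of the cluster computed with v,
        and keep the choice with the smallest finish time."""
        t = {}          # t[v]: finish time of the best k-linear sub-schedule for G_v
        best_U = {}     # best_U[v]: leaf set of the cluster computed together with v
        for v in tasks_topo:
            t[v], best_U[v] = float("inf"), None
            for size in range(k + 1):                          # 0 <= |U| <= k
                for U in combinations(sorted(ancestors[v]), size):
                    if not independent(U):
                        continue
                    finish = optimal_trace(v, U, t)
                    if finish < t[v]:                          # a better schedule is found
                        t[v], best_U[v] = finish, U
        return t, best_U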
Algorithm 1 computes (in topological order) for each task v ∈ V an optimal schedule. In the iteration step, for every possible set R for the subtree rooted at v as defined in Lemma 3, the optimal trace for the processor computing v is derived (function optimal_trace). According to Lemma 3, one of these schedules is optimal for the subtree rooted at v. The algorithm uses the fact that R is uniquely defined by its leaves.

Theorem 1 (Correctness of Algorithm 1). Let G = (V, E, τ) be an inverse tree with root r. The schedule s_r computed by Algorithm 1 is an optimal k-linear schedule for G.
Proof: We prove by induction on |ANC_v| that for every v ∈ V, s_v is an optimal k-linear schedule. If |ANC_v| = 0, then v is a leaf and the claim is trivial. If v is not a leaf, the claim follows directly by the induction hypothesis and Lemma 3. By line (7) every send operation is scheduled L + o time steps earlier than the corresponding receive operation. Hence, a processor can perform at most ⌈L/o⌉ receive operations in every time interval of length L. Since the corresponding send operations start at the latest possible time, the capacity constraint is satisfied. □

Theorem 2. Let G = (V, E, τ) be a tree. Algorithm 1 computes in time O(|V|^{k+2} log |V|) an optimal k-linear schedule for G.

Proof: For every v ∈ V, loop (3)-(8) is executed at most O(|ANC_v|^k) times, because the number of subsets of size at most k is O(n^k). Thus, if optimal_trace can be implemented by a polynomial algorithm, Algorithm 1 is polynomial. It is known that a simple greedy algorithm with time complexity O(|V| log |V|) computes an optimal schedule for a 1-machine scheduling problem [13], i.e. optimal_trace can be executed in time O(|ANC_v| log |ANC_v|). Hence, the time complexity of loop (3)-(8) is O(|ANC_v|^{k+1} log |ANC_v|) by the above discussion. Summing over all v ∈ V yields the claimed time complexity of Algorithm 1. □
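The inner problem that optimal_trace solves is a 1-machine scheduling problem with precedence constraints and release dates. For illustration only, the sketch below builds a feasible trace for such an instance from plain dictionaries; it is not the optimal O(|V| log |V|) greedy rule from [13] that the proof refers to, merely a prototype showing the data involved.

    import heapq

    def list_schedule_1machine(tasks, preds, release, duration):
        """Return a feasible single-machine schedule [(task, start, finish), ...]
        that respects predecessor completion and release dates, always picking
        the ready task with the earliest release date."""
        indeg = {v: len(preds.get(v, ())) for v in tasks}
        ready = [(release[v], v) for v in tasks if indeg[v] == 0]
        heapq.heapify(ready)
        t, order = 0, []
        while ready:
            r, v = heapq.heappop(ready)
            start = max(t, r)                      # predecessors finished at times <= t
            t = start + duration[v]
            order.append((v, start, t))
            for w in tasks:                        # release newly available successors
                if v in preds.get(w, ()):
                    indeg[w] -= 1
                    if indeg[w] == 0:
                        heapq.heappush(ready, (release[w], w))
        return order

    # Toy instance: chain a -> b plus an independent task c released at time 5.
    print(list_schedule_1machine(["a", "b", "c"], {"b": ["a"]},
                                 release={"a": 0, "b": 0, "c": 5},
                                 duration={"a": 2, "b": 2, "c": 1}))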
Example (Execution of Algorithm 1): We compute optimal 1- and 2-linear schedules for the inverse tree G = (V, E, τ) shown in Fig. 1(a) (where τ_v = 1 for all v ∈ V) with the parameters L = 2 and o = 1 (cf. Fig. 1(b) and (c)). Fig. 1(d) shows an optimal schedule. The schedules are visualized by Gantt charts. The x-axis corresponds to time, the y-axis to processors. We place a box at (t, i) if processor P_i starts an operation at time t. The horizontal length of the box corresponds to the cost (i.e. τ_v if v is computed and o if it is a send or receive operation). Send and receive operations are drawn as dark boxes. White boxes denote the computation of a task.
[Fig. 1. A Task Graph with optimal 1- and 2-linear Schedules: (a) A Task Graph; (b) An optimal 1-linear schedule; (c) An optimal 2-linear schedule; (d) An optimal schedule. The Gantt charts plot time on the x-axis against processors on the y-axis.]
Table 1. Execution Times of Optimal k-linear Schedules s_v

             | v ∈ {7,...,14} | v ∈ {3,...,6} | v ∈ {1,2} | v = 0
    1-linear |    t_v = 1     |    t_v = 6    | t_v = 11  | t_v = 16
    2-linear |    t_v = 1     |    t_v = 3    | t_v = 8   | t_v = 12
    3-linear |    t_v = 1     |    t_v = 3    | t_v = 7   | t_v = 11
    8-linear |    t_v = 1     |    t_v = 3    | t_v = 7   | t_v = 11
Table 1 shows the execution times of the optimal k-linear schedules for the tasks v ∈ V. All tasks on the same level of the tree have the same schedules up to renaming. The table also shows that optimal schedules for G have execution time 11. The optimal schedule for tasks 7-14 is obviously the computation of the task. The optimal 1-linear schedule for task 3 computes tasks 7 and 3 on one processor and task 8 on another processor. The optimal 2-linear schedule computes the tasks 7, 8 and 3 sequentially. For 1-linear schedules, an optimal trace for task 1 and task 0 starts at task 7 (time 11 and 16, respectively). For 2-linear schedules, an optimal trace for task 1 starts with task 7 (time 8), and an optimal trace for task 0 starts with tasks 7 and 2 (time 12). □

Obviously, an optimal n-linear schedule for an inverse tree G is optimal among all schedules. Thus, Algorithm 1 can be used to compute an optimal schedule. Although this problem is NP-hard and the algorithm is exponential, many subsets of V can be ignored, because they are not independent. As our example shows, using symmetry may reduce the number of subsets of V even further. E.g. if all tasks of the task graph in Fig. 1(a) have equal weights whenever they are on the same level, then only 66 subsets instead of 2^15 need to be examined. Although there are still 2^{cn} possibilities to be examined, the constant c in the exponent is much smaller than 1, which makes Algorithm 1 more feasible than a general brute-force search. The generalization of Algorithm 1 to tree-like task graphs is straightforward: Let G = (V, E, τ) be a tree-like task graph and O be the set of its roots. Then an optimal schedule for G is obtained by scheduling separately the subgraph induced by ANC_o (which is a tree by definition of tree-like) for every o ∈ O according to Algorithm 1. Hence, Theorem 2 directly implies

Corollary 2. Let G = (V, E, τ) be a tree-like task graph. Then an optimal k-linear schedule for G can be computed in time O(|V|^{k+3} log |V|).
5 Conclusion
We have shown that optimal k-linear schedules for trees and tree-like task graphs can be computed in polynomial time on the LogP-machine with an unbounded number of processors, if o = g and k = O(1). For every task graph G that is an inverse tree, Algorithm 1 constructs optimal k-linear schedules for the subtrees G_v of G rooted at v, where the tasks are processed from the leaves to the root. The main step is to construct an optimal k-linear schedule for G_v when for all proper ancestors u of v the optimal k-linear schedules for G_u are already known. The subgraph of G_v induced by the tasks computed by the processor computing v is always a tree with at most k leaves. Algorithm 1 is polynomial because there is only a polynomial number of independent sets over the ancestors of v of size at most k, and for a given independent set an optimal trace can be computed in polynomial time. Thus, it is possible to consider all possibilities in polynomial time. The generalization of the approach for g > o is not straightforward. The principal approach remains the same, but computing the optimal traces is more difficult. For this, more normalization properties are required. [20] discusses some of those for 1-linear schedules. However, it is not clear whether these apply to k-linear schedules in general, and whether other normalization properties are required. [18, 26] show that for every task graph G 1-linear schedules guaranteeing the factor 1 + 1/γ of the optimal time can be computed in polynomial time, where γ is the granularity of G. Obviously, with increasing k the execution time of optimal k-linear schedules does not increase. In fact, the example in Section 4 shows that it may decrease drastically (in our example by 25%). The question arises whether it is possible to compute k-linear schedules with better performance guarantees for increasing k.
References
1. F. D. Anger, J. Hwang, and Y. Chow. Scheduling with sufficient loosely coupled processors. Journal of Parallel and Distributed Computing, 9:87-92, 1990.
2. P. Chretienne. A polynomial algorithm to optimally schedule tasks over an ideal distributed system under tree-like precedence constraints. European Journal of Operations Research, 2:225-230, 1989.
3. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 93), pp. 1-12, 1993. Published in: SIGPLAN Notices (28) 7.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78-85, 1996.
5. S. Darbha and D. P. Agrawal. SBDS: A task duplication based optimal algorithm. In Scalable High Performance Conference, 1994.
6. B. Di Martino and G. Ianello. Parallelization of non-simultaneous iterative methods for systems of linear equations. In Parallel Processing: CONPAR 94 - VAPP VI, volume 854 of Lecture Notes in Computer Science, pp. 253-264. Springer, 1994.
7. J. Eisenbiegler, W. Löwe, and W. Zimmermann. Optimizing parallel programs on machines with expensive communication. In Europar'96 Parallel Processing Vol. 2, volume 1124 of Lecture Notes in Computer Science, pp. 602-610. Springer, 1996.
8. Jörn Eisenbiegler, Welf Löwe, and Andreas Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC'97), pp. 149-155. IEEE Computer Society Press, 1997.
9. A. Gerasoulis and T. Yang. On the granularity and clustering of directed acyclic task graphs. IEEE Transactions on Parallel and Distributed Systems, 4:686-701, June 1993.
10. J. A. Hoogeveen, J. K. Lenstra, and B. Veltman. Three, four, five, six or the complexity of scheduling with communication delays. Operations Research Letters, 16:129-137, 1994.
11. H. Jung, L. M. Kirousis, and P. Spirakis. Lower bounds and efficient algorithms for multiprocessor scheduling of directed acyclic graphs with communication delays. Information and Computation, 105:94-104, 1993.
12. I. Kort and D. Trystram. Scheduling fork graphs under LogP with an unbounded number of processors. Submitted to Europar'98, 1998.
13. E. L. Lawler. Optimal sequencing of a single machine subject to precedence constraints. Management Science, 19:544-546, 1973.
14. J. K. Lenstra, M. Veldhorst, and B. Veltman. The complexity of scheduling trees with communication delays. Journal of Algorithms, 20:157-173, 1996.
15. W. Löwe and W. Zimmermann. On finding optimal clusterings in task graphs. In N. Mirenkov, editor, Parallel Algorithms/Architecture Synthesis (pAs'95), pp. 241-247. IEEE, 1995.
16. W. Löwe and W. Zimmermann. Programming data-parallel - executing process parallel. In P. Fritzson and L. Finmo, editors, Parallel Programming and Applications, pp. 50-64. IOS Press, 1995.
17. W. Löwe and W. Zimmermann. Upper time bounds for executing PRAM-programs on the LogP-machine. In M. Wolfe, editor, Proceedings of the 9th ACM International Conference on Supercomputing, pp. 41-50. ACM, 1995.
18. Welf Löwe, Wolf Zimmermann, and Jörn Eisenbiegler. On linear schedules for task graphs for generalized LogP-machines. In Europar'97: Parallel Processing, volume 1300 of Lecture Notes in Computer Science, pp. 895-904, 1997.
19. W. Löwe, J. Eisenbiegler, and W. Zimmermann. Optimizing parallel programs on machines with fast communication. In 9th International Conference on Parallel and Distributed Computing Systems, pp. 100-103, 1996.
20. M. Middendorf, W. Löwe, and W. Zimmermann. Scheduling inverse trees under the communication model of the LogP-machine. Accepted for publication in Theoretical Computer Science.
21. C. H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322-328, 1990.
22. C. Picouleau. New complexity results on the uet-uct scheduling algorithm. In Proc. Summer School on Scheduling Theory and its Applications, pp. 187-201, 1992.
23. J. Siddhiwala and L.-F. Chao. Path-based task replication for scheduling with communication cost. In Proceedings of the International Conference on Parallel Processing, volume II, pp. 186-190, 1995.
24. J. Verriet. Scheduling tree-structured programs in the LogP-model. Technical Report UU-CS-1997-18, Dept. of Computer Science, Utrecht University, 1997.
25. T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951-967, 1994.
26. W. Zimmermann and W. Löwe. An approach to machine-independent parallel programming. In Parallel Processing: CONPAR 94 - VAPP VI, volume 854 of Lecture Notes in Computer Science, pp. 277-288. Springer, 1994.
Static Scheduling Using Task Replication for LogP and BSP Models

Cristina Boeres 1, Vinod E. F. Rebello 1 and David B. Skillicorn 2

1 Departamento de Ciência da Computação, Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil, {boeres, vefr}@pgcc.uff.br
2 Computing and Information Science, Queen's University, Kingston, Canada, skill@qucis.queensu.ca
Abstract. This paper investigates the effect of communication overhead on the scheduling problem. We present a scheduling algorithm, based on LogP-type models, for allocating task graphs to networks of processors. The makespans of schedules produced by our multi-stage scheduling approach (MSA) are compared with other well-known scheduling heuristics. The results indicate that new classes of scheduling heuristics are required to generate efficient schedules for realistic abstractions of today's parallel computers. The scheduling strategy of MSA can also be used to generate BSP-structured programs from more abstract representations. The performance of such programs is compared with "conventional" versions.
1 Introduction
Programmers face daunting problems when attempting to achieve efficient execution of parallel programs on parallel computers. There is a considerable range in communication performance on the platforms currently in use. More importantly, the sources of performance penalties are varied, so that the best implementation of a program on two different architectures may differ widely in its form. Understandably, programmers wish to build programs at a level of abstraction where these variations can be ignored, and have the program automatically transformed to an appropriate form for each target. One common abstraction is a directed acyclic graph whose nodes denote tasks, and whose edges denote data dependencies (and hence communication, if the tasks are mapped to distinct processors). There are two related issues in transforming this DAG: clustering the tasks that will share each processor, and arranging the communication actions of the program to optimally use the communication substrate of the target architecture. Clustering tasks reduces communication costs, but may also reduce the available parallelism. The objective, therefore, is to find the optimal task granularity with respect to the communication characteristics of the target machine. In task clustering without replication, tasks are partitioned into disjoint sets or clusters and exactly one copy of each task is scheduled. In task clustering with
duplication, a task may have several copies belonging to different clusters, each of which is scheduled independently. In architectures with high communication costs, this often produces schedules with smaller total execution times [12]. The task scheduling (or clustering) problem, with or without task duplication, has been well studied and is known to be NP-hard [15]. A more complicated issue is the choice of a model of the target's communication system. The standard model in the scheduling community is the delay model, where the sole architectural parameter for the communication system is the latency, that is the transit time for each message [15]. A scheduling technique based on latency is unlikely to produce good schedules on most of today's popular parallel computers because it assumes a fully-connected network. In practice, latency depends heavily on congestion in the shared communication paths. Furthermore, the dominant cost of communication in today's architectures is that of crossing the network boundary, a cost that arises partly from the protocol stack and partly from interaction with other messages leaving and entering the processor at about the same time. In practice, the latter dominates the communication cost [10] and cannot be modelled as a latency. The delay model also assumes that communication and computation can be completely overlapped. This may not be possible because: the CPU may have to manage, or at least initiate, communication (even when the details are handled by a network interface card); the CPU may have to copy data from user space to kernel space; and it may be desirable to postpone some communication to allow messages to be packed together, or routed in a congestion-free way [10]. These architectural properties have motivated new parallel programming models such as LogP [6] and BSP [16], both of which have accurate cost models. Given a program, it is possible to accurately predict its runtime given a few program and architecture parameters. However, both models require writing programs in a slightly awkward way, and so it is useful to consider using a scheduling technique to map more general programs into the required structured form. The LogP model [6] is an MIMD message-passing model with four architectural parameters: the latency, L; the overhead, o, the CPU penalty incurred by a communication action; the gap, g, a lower bound on the interval between two successive communication actions, designed to prevent overrun of the communication capacity of the network (and hence related to the inverse of the network's bisection bandwidth); and the number of processors, P. Because the parameter g depends on the target architecture, it is hard to write LogP programs in a general and architecture-independent way. Nevertheless, LogP-based predictions of runtimes have been confirmed in practice [6, 7]. BSP is similar, except that it takes a less prescriptive view of communication actions. Programs are divided into supersteps which consist of three sequential phases: local computation, communication between processors, and a barrier synchronisation, after which communicated data becomes visible on destination processors. The architecture parameters are p, the number of processors, g the inverse of the effective network bandwidth, and l the time taken for a barrier
synchronisation. The g parameter is significantly different from that of the LogP model despite their superficial similarity. BSP does not limit the rate at which data are inserted into the network. Instead, g is measured by inserting traffic to uniformly-random destinations as fast as possible and measuring how long it takes to be delivered in the steady state (note that this includes an accounting for latency). In a superstep where each processor spends time w_i in computation, and sends and receives some quantity of data h_i, the BSP cost model charges

    MAX_i w_i + MAX_i h_i · g + l

for the superstep. The units of g are instruction execution times per word, and of l are instruction execution times, so that the sum is sensible. This cost model has been shown to be extremely accurate [8, 9]. Note that the second term accounts for the fact that the cost of communication is dominated by fan-in and fan-out, while the inability of many architectures to overlap computation and communication is taken into account by summing the first two terms, rather than taking their maximum. To the best of our knowledge, few if any scheduling algorithms exist for communication models which consider such parameters as those discussed above. Löwe et al. [13] proposed a clustering algorithm with task replication for the LogP model, but it only operates on restricted types of algorithms. ETF [11] and DSC [17], which do not use task replication, and PY [15] and PLW [14], which do, are well-known and effective scheduling algorithms based on the delay model. When applied to parallel computers, delay model approaches implicitly assume two properties: the ability to multicast, i.e. to send a number of different messages to different destinations simultaneously; and that the communication overhead and task computation on a processor can completely overlap, i.e. communication does not require computation time. Neither assumption is particularly realistic, as discussed earlier. In this paper, we briefly present a versatile multi-stage scheduling approach (MSA) for general DAGs on LogP-type machines (with bounded numbers of processors) which can also generate BSP-structured schedules. MSA considers the following parameters, adopted by the Cost and Latency Augmented DAG (CLAUD) model [5] (developed independently of LogP). In the CLAUD model, a DAG G represents the application and 𝒫 is a set of m identical processors. The overheads associated with the sending and receiving of a message are denoted by λ_s and λ_r, respectively, and the communication latency or delay by τ. The CLAUD model very naturally models LogP, given the following parameter relationships: L = τ (LogP only considers fully connected networks), o = (λ_s + λ_r)/2, g = λ_s and P = m, and BSP with h = λ_s + λ_r. (Note the delay model is an instance of these models.)
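As a small worked example of the superstep cost formula quoted above (toy numbers only, not taken from the paper):

    def superstep_cost(w, h, g, l):
        """BSP cost of one superstep: w[i] is processor i's computation time, h[i]
        the number of words it sends or receives, g the per-word communication
        cost and l the barrier synchronisation cost."""
        return max(w) + max(h) * g + l

    # 4 processors, g = 4 instruction times per word, l = 100 instruction times.
    print(superstep_cost(w=[500, 480, 520, 510], h=[30, 25, 40, 20], g=4, l=100))  # 780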
A Multi-stage
Scheduling
Approach
The Multi-stage Scheduling Approach (MSA) currently consists of two main stages: the first transforms the input program DAG into one whose characteris-
340 tics (e.g. granularity, message lengths) are better suited to the target architecture; and the second stage maps this resulting DA G onto the target architecture. Each stage consists of two algorithms which are described in greater detail in [1,
3]. A parallel application is represented by a directed acyclic graph (DAG) G = (V, E, ~, ~), where V is the set of tasks, E is the precedence relation among them, c(v~) is the execution cost of task v~ E V and ~(v~, vj) is the weight associated to the edge (vi, vj) E E representing the amount of data units transmitted from vi to vj. Presently, MSA only considers input 1)AGs whose vertices have unit execution costs and edges have unit weight. For the duration of each communication overhead (known here as a communication event), the processor is effectively blocked unable to execute other tasks in V or even to send or receive other messages. Therefore, these sending and receiving overheads can also be viewed as "tasks" executed by processors. In parallel architectures with high overheads, the number of communication events can be reduced by bundling many small messages into fewer, larger ones. This can be achieved by clustering together tasks with common ancestors or successors and restricting the necessary communication events of the cluster to take place only before and after the execution of all of the tasks belonging to that cluster. This allows messages for the same destination processor, from various tasks within the cluster, to be combined and sent as one. In contrast, other clustering techniques (for example [17]) allow communication to occur as soon as each task within the cluster finishes its execution. This leads to a subtle difference between these clustering approaches: the latter only reduces the total communication cost through the elimination of messages, by allocating the source and destination tasks to the same processor; the former, tries to minimise the cost of sending the remaining messages as well. T h e F i r s t S t a g e . This stage attempts to transform the input DAG G into one which will have better communication behaviour on the target architecture. The formation of clusters, each containing a task vi of G and replicated copies of some of its ancestors, minimises communication in 3 ways: no cost is incurred for communication within the cluster; the bundling of messages to a common destination reduces the number of times that overheads are incurred; and the execution of the replicated tasks within the cluster can hide some of the communication cost between v~ and ancestors outside the cluster. The degree of replication, determined by characteristics of both the input DAG and the target architecture, is represented by a single parameter 7, the clustering factor. Determining the best value of 7, i.e. the value that leads to the smallest makespan is discussed elsewhere [1,2]. This clustering process is based on the replication principle (also adopted by PY [15]) and on the latest-finish time of a task i.e. the latest time that a cluster can finish and still have the schedule achieve the minimum makespan assuming all communication costs are 7. The result of this replication algorithm is a set of n clusters of tasks (n being the number of tasks in G). Every cluster contains "y+ 1 tasks except those in the first band which may have fewer. Furthermore, the latest finish time of every
341 cluster is a multiple of 3, + 1. Therefore, groups of parallel clusters exist in nonoverlapping bands or computation intervals. Later, gaps will be created, known as communication intervals, between adjacent bands in which only communication event tasks will be scheduled. For target architectures where the overheads of communication are much higher than the delay, minimising the in- and out-degrees of the clusters is essential for effective scheduling. When overheads are negligible and communication cost is dominated by the delay, it is better to minimise the length of messages. A second algorithm is now applied to specify the appropriate dependencies between required clusters taking into consideration the communication characteristics of the target architecture. The problem of reducing the communication costs of the schedule has now been simplified to iteratively considering the communication costs between at most three adjacent computation intervals [1]. For each computation band, this involves finding the combination of immediate ancestor clusters (containing the required immediate ancestor tasks of all tasks in the the current band), which will incur smallest communication cost. The exact solution is computationally expensive and instead a heuristic approach has been proposed [1,3]. T h e S e c o n d S t a g e . This stage performs a cluster merging step to map each cluster's distinct virtual processor to a physical one, minimising communication costs by applying Brent's Principle [4]. Communication is not permitted during execution of a cluster. When a relatively small overhead cost is associated with sending and receiving events, the bundling of messages can actually increase the makespan. In such cases, the edges of the final DAG can be unbundled and the clusters opened to allow messages to be sent after each task is executed. Once the supertasks have been allocated to the physical processors, excess copies of tasks allocated to the same processor are removed and the final schedule is defined for the new DAG based on the parameters ),~, ,~r and r. The required communication event tasks can now be allocated and the start times of all tasks estimated.
3
The
Structure
of BSP
Programs
To write efficient BSP programs a good strategy is to balance the computation across the processes; balance the communication fan-in and fan-out; and minimise the number of supersteps, since this determines the number of times 1 is incurred. This problem is similar to the problem of clustering for task scheduling in the sense that the number of supersteps can be reduced by increasing the amount of local computation carried out in a process. However, this also reduces the amount of parallelism and thus, just as the scheduling problem tries to balance parallelism against communication, here parallelism has to be balanced against communication and synchronisation. From the previous discussion of MSA, it is clear that there are similarities in the structure of efficient programs for both BSP and LogP-based models. BSP's
342 notion of local computation is most accurately captured by the cluster definition of MSA since communication only occurs after all of the tasks within the cluster have been executed. The cost of the superstep computation, wi, in BSP is equivalent to the computation band of constant cost "~ + 1 in MSA. Since each cluster in the band has the same size, balanced computation is achieved through task replication - a novel feature for BSP programs. The communication and synchronisation costs are represented by the communication interval in MSA. The exploitation of these similarities suggests the following. An approach similar to that adopted by MSA can be used to generate efficient BSP versions of MIMD programs automatically. The efficient execution of programs depends on their schedule which, in turn, is influenced by architectural parameters. Any change of target architecture is likely to require a change to the program structure i.e. the superstep size. BSP emphasises architecture-independence and portability of programs across architectures. If MSA is able to generate effective program schedules whose costs are known to programmers, many of the benefits of the BSP approach can be maintained in a more general programming setting. Further, if MSA generates schedules with good execution times for general-purpose programs on general parallel machines then this implies that the superstep discipline is not too restrictive. This might be expected to increase the use and acceptance of BSP, particularly for irregular problems where superstep structure is sometimes seen as overly restrictive.
4
The Influence
of the Overheads
on Task Scheduling
The impact of LogP-type parameters on the scheduling of tasks can be analysed by comparing of makespans generated by various scheduling approaches for a suite of (in this case, unit execution task unit data communication) DAGs over a range of values for communication overheads and latency. Table 1 presents three scheduling approaches with a selection of DA Gs which include: an irregular DAG based on a molecular physics application (Ir41); two random graphs with 80 and 186 tasks (Ras0, Rals6); a binary in- and out-tree of 511 tasks (It511, 0t511); and a diamond DAG (Di4G0). The application of MSA to a larger variety of DAGs with differing sizes and characteristics (including: course- and fine-grain graphs; regular DAGs such as binary trees and diamond DAGs; and irregular and random graphs) is presented in [1-3]. Table 1 also presents the makespans of MSA schedules generated under the BSP model in order to investigate the additional costs incurred by the superstep discipline. For the LogP-type architectures, MSA is compared with two recently published scheduling algorithms (DSC [17] and PLW [14]) both of which have been proved to generate schedules within a factor of two of the optimal under the delay model. While the delay model-based PLW ignores the overhead parameters (o or As and )~r), DSC is in fact based on an extension of this model (the linear model) and treats the overhead as an additional latency parameter.
343
One objective of this paper is to show that for architectures with LogP-type characteristics, even scheduling strategies which simply add the overhead parameter to the latency do not perform well. Furthermore, in practice, the execution time of the schedule produced can be much worse than predicted. The actual execution times (ATs) of the schedules produced by PLW, DSC and MSA were measured on a discrete event-driven simulation of a parallel machine. The main reason for adopting a simulator is to facilitate the study of the relationship between the various parameters and the effects of hardware and operating system techniques to improve message-passing performance.
Table 1. A comparison of predicted and simulated makespans for the DSC, PLW and MSA scheduling heuristics. λ_s and λ_r are the sending and receiving overheads, respectively, and τ is the latency. The number of processors required by the schedule is represented by Prs, l is the hardware cost of barrier synchronisation and γ the clustering factor.
[Table 1 data: for each DAG (Ir41, Ra80, Ra186, It511, Ot511, Di400) and each setting of λ_s, λ_r and τ, the table lists the number of processors (Prs), the predicted makespan (PT) and the actual simulated execution time (AT) for DSC, PLW and MSA, together with the clustering factor γ and the corresponding figures for MSA_BSP with l = 0 and l = 2.]
Under the delay model (where λ_s = λ_r = 0), the performance of both DSC and PLW for the DAG Ir41 progressively worsens in comparison with MSA as the DAG effectively becomes more fine-grained due to the increasing latency value. Considering the other graphs under this model, one can argue that the improvements in the makespans for PLW over DSC are due mostly to the benefits of task replication, which very much depend on the graph structure (e.g. compare Ra80 and Ot511). Although MSA also employs replication, it generally utilises fewer processors to greater effect. As expected under this model, the three approaches predict the execution times of their schedules quite accurately. Under LogP-type models (i.e. when the overheads are nonzero), DSC and PLW perform poorly on two counts: the ATs of the generated makespans in many cases are worse than the single processor execution time; and the predictions are not very accurate. In the case of DSC, this is partly because the overhead cost is added to the latency, thus specifying that communication events can be overlapped with task computation. PLW produces the same schedule under both the LogP and delay models, and thus even poorer predictions (PTs), because it ignores overheads altogether. While adopting the linear model for PLW may improve matters with respect to the ATs, both PLW and DSC still apply the fundamental assumptions of the delay model which no longer hold on LogP-type machines; e.g. because of the multicasting assumption they tend to create clusters with large out-degrees. The predicted makespans (PTs) of the schedules produced by MSA are, on the whole, quite reasonable although there is scope for improvement. Compared to the predictions for the BSP model, those for the LogP model are in general less accurate. The reason is that it is difficult to predict the communication behaviour since, for example, the order in which messages are sent from a cluster is not defined, or at least not under the control of the schedule. This effect, hidden by the barrier synchronisations, is more pronounced when strict banding is relaxed. Under BSP, a barrier synchronisation occurs after each pair of computation and communication intervals. In the simulation, the parameter l only models the "system" cost of signalling that all processors have reached the barrier. For different graphs, the reason why the ATs of the LogP schedules are in some cases much better than those of the BSP ones (especially under delay model conditions) is that MSA considered it better to uncluster the supertasks. For the BSP model this is not possible, so MSA attempts to counter this effect by usually increasing γ (compare the values used by MSA with those of the two MSA_BSP schedules for the graph Ir41). A larger γ may sometimes lead to fewer bands with fewer but larger clusters of smaller in- or out-degree. In the entries where γ and the number of processors are the same for the MSA and MSA_BSP schedules, the difference in ATs represents the total cost of the barriers. On the whole, it seems that the MSA scheduling approach does manage to produce reasonable DAG transformations (and thus schedules) for the BSP model, particularly under LogP model conditions. Note also that the ATs of the BSP schedules are better than those of DSC and PLW when the delay is large and particularly when the overheads are nonzero.
5
Conclusions
A scheduling strategy has been proposed to produce good schedules for a range of parallel computers, including those in which the overheads due to sending and receiving are relatively high compared to the latency. When these communication events (the sending and receiving overheads) are considered, communication is not totally overlapped with computation of tasks as assumed in a number of scheduling heuristics. In the case where overheads are significant, task replication is exploited to decrease the number of communication events. In an attempt to decrease these further, messages are bundled into fewer, larger ones. MSA effectively transforms the structure of the input DAG into one which is more amenable to allocation on the target architecture. For realistic models (LogP and BSP), initial results indicate that MSA generates better schedules than some well known conventional delay model-based approaches. This suggests that a new class of scheduling heuristics is required to generate efficient schedules for today's parallel computers. The definition of computational and communication intervals facilitates both a more precise prediction of the number of communication events, especially when the bundling of messages is necessary, and the exploitation of parallelism. Note that the main objective when defining such intervals is not to specify synchronisation points but rather to easily identify potential parallel clusters for efficient allocation onto physical processors and for the scheduling of communication event tasks. In the final LogP schedule, these clearly defined intervals may no longer exist since unnecessary idle times will be removed. For the BSP model, however, these intervals are maintained in the schedule. Under this model, although barrier synchronisation is an additional cost to be incurred by programs, MSA is able to reduce its impact by appropriately determining the sizes and the number of supersteps. Given that, for LogP-type machines, the makespans of BSP-structured schedules are not significantly worse than conventional ones and that their predictions are relatively accurate, it seems that the superstep discipline is not too restrictive and that an approach to generating BSP-structured programs using a scheduling strategy such as MSA could be advantageous to programmers.
Acknowledgements The authors would like to thank Tao Yang and Jing-Chiou Liou for providing the DSC and PLW algorithms, respectively, and for their help and advice.
References
1. C. Boeres. Versatile Communication Cost Modelling for Multicomputer Task Scheduling Heuristics. PhD thesis, Department of Computer Science, University of Edinburgh, May 1997.
2. C. Boeres and V. E. F. Rebello. Versatile task scheduling of binary trees for realistic machines. In C. Lengauer, M. Griebl, and S. Gorlatch, editors, Proceedings of the 3rd International Euro-Par Conference on Parallel Processing (Euro-Par'97), LNCS 1300, pages 913-921, Passau, Germany, August 1997. Springer-Verlag.
3. C. Boeres and V. E. F. Rebello. A versatile cost modelling approach for multicomputer task scheduling. Accepted for publication in Parallel Computing, 1998.
4. R. P. Brent. The parallel evaluation of general arithmetic expressions. J. ACM, 21:201-206, 1974.
5. G. Chochia, C. Boeres, M. Norman, and P. Thanisch. Analysis of multicomputer schedules in a cost and latency model of communication. In Proceedings of the 3rd Workshop on Abstract Machine Models for Parallel and Distributed Computing, Leeds, UK, April 1996. IOS Press.
6. D. Culler et al. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993.
7. B. Di Martino and G. Iannello. Parallelization of non-simultaneous iterative methods for systems of linear equations. In Parallel Processing (CONPAR'94-VAPP VI), LNCS 854, pages 254-264. Springer-Verlag, 1994.
8. J. M. D. Hill, P. I. Crumpton, and D. A. Burgess. The theory, practice, and a tool for BSP performance prediction. In Proceedings of the 2nd International Euro-Par Conference on Parallel Processing (Euro-Par'96), volume 1124 of LNCS, pages 697-705. Springer-Verlag, August 1996.
9. J. M. D. Hill, S. Donaldson, and D. B. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Oxford University Computing Laboratory, October 1997.
10. J. M. D. Hill and D. B. Skillicorn. Lessons learned from implementing BSP. In High-Performance Computing and Networks, Springer Lecture Notes in Computer Science Vol. 1225, pages 762-771, April 1997.
11. J-J. Hwang, Y-C. Chow, F. D. Anger, and C-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal of Computing, 18(2):244-257, 1989.
12. H. Jung, L. Kirousis, and P. Spirakis. Lower bounds and efficient algorithms for multiprocessor scheduling of DAGs with communication delays. In Proc. ACM Symposium on Parallel Algorithms and Architectures, pages 254-264, 1989.
13. W. Lowe, W. Zimmermann, and J. Eisenbiegler. Optimization of parallel programs on machines with expensive communication. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Proceedings of the 2nd International Euro-Par Conference on Parallel Processing (Euro-Par'96), LNCS 1124, pages 602-610, Lyon, France, August 1996. Springer-Verlag.
14. M. A. Palis, J-C. Liou, and D. S. L. Wei. Task clustering and scheduling for distributed memory parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 7(1):46-55, January 1996.
15. C. H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal of Computing, 19:322-328, 1990.
16. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249-274, 1997.
17. T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951-967, 1994.
Aspect Ratio for Mesh Partitioning
Ralf Diekmann 1, Robert Preis 1, Frank Schlimbach 2, and Chris Walshaw 2
1 University of Paderborn, Germany
2 University of Greenwich, London, UK, {sf634,C.Walshaw}@gre.ac.uk
Abstract. This paper deals with the measure of Aspect Ratio for mesh partitioning and gives hints why, for certain solvers, the Aspect Ratio of partitions plays an important role. We define and rate different kinds of Aspect Ratio, present a new center-based partitioning method which optimizes this measure implicitly, and rate several existing partitioning methods and tools under the criterion of Aspect Ratio.
1
Introduction
Almost all numerical scientific simulation codes belong to the class of data-parallel applications: their parallelization executes the same kind of operations on every processor but on different parts of the data. This requires the partitioning of the mesh into equal-sized subdomains as a preprocessing step. Together with the additional constraint of minimizing the number of cut edges (i.e. minimizing the total interface length in the case of FE-mesh partitioning), the problem is NP-complete. Thus, a number of graph (mesh) partitioning heuristics have been developed in the past (e.g. [8]) and used in practical applications. Most of the existing graph partitioning algorithms optimize the balance of subdomains plus the number of cut edges, the cut size. This is sufficient for many applications, because an equal balance is necessary for a good utilization of the processors of a parallel system and the cut size indicates the amount of data to be transferred between different steps of the algorithms. The efficiency of parallel versions of global iterative methods like (relaxed) Jacobi, Conjugate Gradient (CG) or Multigrid is (mainly) determined by these two measures. But if the decomposition is used to construct pre-conditioners (DD-PCG) [1, 2], cut and balance may no longer be the only efficiency-determining factors. The shape of subdomains heavily influences the quality of pre-conditioning and, thus, the overall execution time [10]. First attempts at optimizing the Aspect Ratio of subdomains weigh elements depending on their distance from a subdomain's center (e.g. [5]) and include this weight into the cost function of local iterative search algorithms like the Kernighan-Lin heuristic [6]. In the following section we will define the Aspect Ratio of subdomains and give hints why it can improve the overall execution time of domain decomposition methods. Section 3 introduces a new mesh partitioning heuristic which implicitly optimizes the Aspect Ratio, and Section 4 finishes with experimental results. Further information on this topic can be found in [4].
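As a minimal illustration of the two standard optimization criteria mentioned above, the following Python sketch computes the cut size and the balance of a given partition of an element graph; the adjacency-dictionary representation and the partition encoding are assumptions made only for this example.

```python
from collections import Counter

def cut_and_balance(adjacency, part):
    """adjacency: dict mapping element -> iterable of neighbouring elements (symmetric).
    part: dict mapping element -> subdomain index.
    Returns (cut size, balance), where balance is the ratio of the largest
    subdomain size to the average subdomain size (1.0 is perfectly balanced)."""
    cut = 0
    for v, neighbours in adjacency.items():
        for w in neighbours:
            if part[v] != part[w]:
                cut += 1
    cut //= 2  # each crossing edge was counted from both endpoints

    sizes = Counter(part.values())
    avg = len(part) / len(sizes)
    balance = max(sizes.values()) / avg
    return cut, balance

# Example: a 2x2 grid of elements split into two parts.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(cut_and_balance(adj, {0: 0, 1: 0, 2: 1, 3: 1}))  # (2, 1.0)
```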
2
Aspect Ratios
An example showing that cut size is not always the right measure in mesh partitioning can be found in Fig. 1. The sample mesh is partitioned into two parts with different ARs and different cuts. A Poisson problem with homogeneous Dirichlet-0 boundary conditions is solved with DD-PCG. It can be observed that the number of iterations is determined by the Aspect Ratio and that the cut size (number of neighboring elements) would be the wrong measure.
Fig. 1. Aspect Ratio vs. Cut Size. [Figure: two partitions of the sample mesh; the recoverable label of one partition reads cut = 4, AR = 1.001, #It's = 5.]
Fig. 2. Different definitions of Aspect Ratio AR: L_max/L_min, R_o/R_i, and C^2/(16A).
Possible definitions of AR can be found in Fig. 2. The first two are motivated by common measures in triangular mesh generation, where the quality of triangles is expressed as either L_max/L_min (longest to shortest boundary edge) or R_o/R_i (area of the smallest circle containing the domain to the area of the largest inscribed circle).
The definition of AR = R_o/R_i expresses the fact that circles are perfect shapes. Unfortunately, the circles are quite expensive to find for arbitrary polygons: using Voronoi diagrams, both can be determined in O(2n log n) steps, where n is the number of nodes of the polygon. AR = R_o^2/A (A is the area of the domain) is another measure favoring circle-like shapes. It still requires the determination of the smallest outer circle but turns out to be better in practice. We can go a further step and approximate R_o by the length C of the boundary of the domain (which can be determined fast and updated incrementally in O(1)). The definition of AR = C^2/(16A) additionally assumes that squares are perfect domains. Circles offer a better circumference/area ratio but force neighboring domains to become non-convex (Fig. 3). For a sub-domain with area A and circumference C, AR = C^2/(16A) is the ratio between the area of a square with circumference C and A. For irregular meshes and partitions, the first definition AR = L_max/L_min does not express the shape properly. Fig. 4 shows an example. P1 is perfectly shaped, but as the boundary towards P4 is very short, L_max/L_min is large.
Fig. 3. Circles as perfect shapes?
Fig. 4. L_max/L_min?
Fig. 5. Problems of AR := R_o^2/A.
The circle-based measures usually fail to rate complex boundaries like "zig-zags" or inscribed corners. Fig. 5 shows examples which all have the same AR but are very different in shape.
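To illustrate how cheap the boundary-based measure is compared to the circle-based ones, the following Python sketch computes L_max/L_min and C^2/(16A) for a subdomain given as a closed polygon; the vertex-list representation is an assumption made for this example.

```python
import math

def aspect_ratios(polygon):
    """polygon: list of (x, y) boundary vertices in order (closed implicitly).
    Returns (L_max/L_min, C^2/(16*A)) for the enclosed region."""
    n = len(polygon)
    edge_lengths = []
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        length = math.hypot(x2 - x1, y2 - y1)
        edge_lengths.append(length)
        perimeter += length
        twice_area += x1 * y2 - x2 * y1      # shoelace formula
    area = abs(twice_area) / 2.0
    ar_edges = max(edge_lengths) / min(edge_lengths)
    ar_boundary = perimeter ** 2 / (16.0 * area)
    return ar_edges, ar_boundary

# A square is "perfect" under C^2/(16A): the measure evaluates to 1.
print(aspect_ratios([(0, 0), (1, 0), (1, 1), (0, 1)]))   # (1.0, 1.0)
# A 4x1 rectangle: C^2/(16A) = 100/64 = 1.5625, L_max/L_min = 4.
print(aspect_ratios([(0, 0), (4, 0), (4, 1), (0, 1)]))
```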
3
Mesh Partitioning
The task of mesh partitioning is to divide the set of elements of a mesh into a given number of parts. The usual optimization criteria are the balance and the cut, which is the number of crossing edges of the element graph (the element graph expresses the data-dependencies between elements in the mesh). The calculation of a balanced partition with minimum cut is NP-complete. We will consider methods included in the tools JOSTLE [11], METIS [7] and PARTY [9]. The coordinate partitioning (COO) method cuts the mesh by a straight line perpendicular to the axis of the longest elongation such that both parts have an equal number of elements, resulting in a stripe-wise partition. If used recursively (COO_R), it usually results in a more box-wise partition. There are many examples of greedy methods, in which one element after another is added to a subdomain. We will consider the variations in which the next element to be added is either chosen in a breadth-first manner (BFS) or as the one which increases the cut least of all (CFS). The new method Bubble (BUB) is designed to create compact domains; the idea is to grow subdomains simultaneously from different seed elements, as sketched in the code below. A partition is represented by a set of (initially random) seeds, one for each part. Starting from the seeds, the parts are grown in a breadth-first manner until all elements are assigned to a subdomain. Each subdomain then computes its center based on the graph distance measure, i.e. it determines the element which is "nearest" to all others in the subdomain. These center elements are taken as new seeds and the process is started again. The iteration stops if the seeds do not move any more, or if an already evaluated configuration is visited again. In general, the method stops after few iterations and produces connected and very compact parts, i.e. the shape of the parts is naturally smooth. As a drawback, the parts do not have to contain the same number of elements. Therefore we integrated a global load balancing step based on the diffusion method [3] as post-processing, trying to optimize either the cut (BUB+CUT) or the Aspect Ratio (BUB+AR).
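The following Python sketch illustrates the seed-growing idea just described on an element graph given as an adjacency dictionary; the data structures, the random seeding, and the stopping rule on repeated seed sets are assumptions made for this illustration, not the exact implementation used in PARTY.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Graph distances from source to all reachable elements."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def grow_parts(adj, seeds):
    """Grow the parts breadth-first from the seeds until every element is assigned."""
    part = {s: i for i, s in enumerate(seeds)}
    queue = deque(seeds)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in part:
                part[w] = part[v]
                queue.append(w)
    return part

def center(adj, members):
    """Element of 'members' minimizing the sum of graph distances to the others."""
    members = set(members)
    best, best_cost = None, None
    for v in members:
        dist = bfs_distances(adj, v)
        cost = sum(dist[w] for w in members if w in dist)
        if best_cost is None or cost < best_cost:
            best, best_cost = v, cost
    return best

def bubble(adj, k, rng=random):
    seeds = tuple(rng.sample(list(adj), k))
    seen = set()
    while seeds not in seen:
        seen.add(seeds)
        part = grow_parts(adj, seeds)
        seeds = tuple(center(adj, [v for v, p in part.items() if p == i])
                      for i in range(k))
    return grow_parts(adj, seeds)

# Example: partition a 4x4 grid of elements into 2 parts.
grid = {(i, j): [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= i + di < 4 and 0 <= j + dj < 4]
        for i in range(4) for j in range(4)}
print(bubble(grid, 2))
```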
We also implemented the meta-heuristic Simulated Annealing (SA) for the shape optimization and used AR = C^2/(16A) as the optimization function. If the parameters are chosen carefully, SA is able to find very good solutions. Unfortunately, it also requires a very large number of steps.
4
Results
We compare the described approaches with respect to the number of global iterations, the cut size and the Aspect Ratio in Fig. 6 and 7. Methods 1-7 are included in PARTY and 8-11 are default settings of the tools. The partitions are listed by method number in order of increasing number of iterations. The methods are: 1: COO, 2: COO_R, 3: BFS, 4: CFS, 5: BUB, 6: BUB+CUT, 7: BUB+AR, 8: PARTY, 9: JOSTLE, 10: KMETIS, 11: PMETIS, 12: SA.
Fig. 6. Results of example turin (531 elements) into 8 parts. [Figure: plots of #Its, Cut, L_max/L_min, R_o/R_i, R_o^2/A and C^2/(16A) per method; the values are not recoverable from the source.]
Fig. 7. Results of example cooler (749 elements) into 8 parts. [Figure: same quantities plotted as in Fig. 6; values not recoverable.]
The Aspect Ratios L_max/L_min and R_o/R_i do not, whereas the values of R_o^2/A and C^2/(16A)
roughly follow the number of iterations. A low cut also leads to a fairly low number of iterations. The coordinate and greedy methods usually result in very high iteration numbers, whereas the bubble variations result in adequate ones. Very low iteration numbers can be observed for the tools and for Simulated Annealing.
References
1. S. Blazy, W. Borchers, and U. Dralle. Parallelization methods for a characteristic's pressure correction scheme. In Flow Simulation with High-Perf. Comp. II, 1995.
2. J. H. Bramble, J. E. Pasciak, and A. H. Schatz. The construction of preconditioners for elliptic problems by substructuring I+II. Math. Comp., 47+49, 1986+87.
3. G. Cybenko. Load balancing for distributed memory multiprocessors. J. Par. Distr. Comp., 7:279-301, 1989.
4. R. Diekmann, F. Schlimbach, and C. Walshaw. Quality balancing for parallel adaptive FEM. In IRREGULAR, LNCS. Springer, 1998.
5. N. Chrisochoides et al. Automatic load balanced partitioning strategies for PDE computations. In Int. Conf. on Supercomputing, pages 99-107, 1989.
6. C. Farhat, N. Maman, and G. Brown. Mesh partitioning for implicit computations via iterative domain decomposition. Int. J. Num. Meth. Engrg., 38:989-1000, 1995.
7. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report 95-035, University of Minnesota, 1995.
8. B. W. Kernighan and S. Lin. An effective heuristic procedure for partitioning graphs. The Bell Systems Technical Journal, pages 291-308, Feb 1970.
9. R. Preis and R. Diekmann. The PARTY partitioning-library, user guide, version 1.1. Technical Report TR-RSFB-96-024, Universität-GH Paderborn, Sep 1996.
10. D. Vanderstraeten, R. Keunings, and C. Farhat. Beyond conventional mesh partitioning algorithms... In SIAM Conf. on Par. Proc., pages 611-614, 1995.
11. C. Walshaw, M. Cross, and M. G. Everett. A localised algorithm for optimising unstructured mesh partitions. Int. J. Supercomputer Appl., 9(4):280-295, 1995.
A Competitive Symmetrical Transfer Policy for Load Sharing*
Konstantinos Antonis, John Garofalakis, Paul Spirakis
1 Computer Technology Institute, P.O. Box 1122, 26110 Patras, Greece. Phone: (+30)61-273496, Fax: (+30)61-222086. {antonis, garofala, spirakis}@cti.gr
2 University of Patras, Department of Computer Engineering and Informatics, 26500 Rion, Patras, Greece.
Abstract. Load Sharing is a policy used to improve the performance of distributed systems by transferring workload from heavily loaded nodes to lightly loaded ones in the system. In this paper, we propose a dynamic and symmetrical technique for a two-server system, called Difference-Initiated (DI), in which transferring decisions are based on the difference between the populations of the two servers. In order to measure the performance of this policy, we apply the SSP analytical approximation technique proposed in [3]. Finally, we compare the theoretically derived results of the DI technique with two of the most commonly used dynamic techniques: the Sender-Initiated (SI), and the Receiver-Initiated (RI), which were simulated.
1
Introduction
Transferring of jobs between two nodes lies at the heart of any distributed load sharing algorithm. Especially in the case of dynamic or adaptive load sharing techniques (in which scheduling decisions are based on the current state of the system), the transferring of jobs from a heavily loaded node to a lightly loaded node requires in most cases mutual knowledge of the current state of the two nodes (sender and receiver). Such dynamic load sharing techniques are the nearest neighbor approach, the sender-initiated, the receiver-initiated, etc. In this work, we propose a symmetrical transfer policy for load sharing between two nodes that uses a threshold on the difference of the populations of the respective queues in order to make decisions for remote processing. We evaluate the performance of this policy by applying the State Space Partition (SSP) supplementary queueing approximation technique [3], and also simulations. In the simulation model that we developed, we incorporated two of the most widely used dynamic load sharing policies, the sender-initiated (SI) and receiver-initiated (RI) [1], in order to compare their performance with our policy, which we call difference-initiated (DI).
* This work was partially supported by ESPRIT LTR Project no. 20244 - ALCOM-IT basic research program of the E.U. and the Operational Program for Research and Development 2 No. 3.3 453 sponsored by the General Secretariat for Research and Technology.
2
The Difference-Initiated Technique
2.1
The Model
We consider two serving queues as the processing nodes, which are subject to the following simple load sharing mechanism: when the difference of their queue lengths is greater than or equal to a threshold c, then jobs are transferred from the heavier loaded queue to the lightly loaded queue at a constant rate β (i.e. the transfer period is exponential of mean length 1/β). The above algorithm behaves symmetrically, either as sender-initiated or receiver-initiated, depending on whether the difference between the two queue lengths exceeds or is under the corresponding threshold c. A server probes for the queue length of the other server every time a job arrives to or departs from its own queue.
2.2
The State Space Partition (SSP) Technique
It is easy to observe that the behaviour of our system, as time progresses, balances between 3 elementary queueing systems, QS1, QS2, and QS3, when |n1 - n2| < c, n1 - n2 >= c, and n2 - n1 >= c, respectively. This technique computes the steady-state probability P(n1, n2) of finding n1 and n2 customers at queue 1 and queue 2, respectively. The approximation technique conjectures that:
P(n1, n2) = A P1(n1, n2) + B P2(n1, n2) + C P3(n1, n2)
(1)
where P1(n1, n2), P2(n1, n2), P3(n1, n2) denote the steady-state probabilities of finding n1 and n2 customers at queue 1 and queue 2 for the three elementary, easily solvable systems above, with no constraints on (n1 - n2), and
- A = Prob{e1}, where e1 = {(n1, n2) : |n1 - n2| < c},
- B = Prob{e2}, where e2 = {(n1, n2) : n1 - n2 >= c},
- C = Prob{e3}, where e3 = {(n1, n2) : n2 - n1 >= c}.
In applying equation 1 we are confronted with the problem of estimating A, B, and C. So, we use the following iteration technique:
Iteration Algorithm
- Solve QS1, QS2, QS3 by any existing technique for Product Form Queueing Networks (Mean Value Analysis, LBANC, etc.) and get Pi(n1, n2), i = 1, 2, 3.
- Assume initial values for A, B, C, namely A^(0), B^(0), C^(0). Here we select: A^(0) = P1(e1), B^(0) = P1(e2), C^(0) = P1(e3).
- i = 1
- (*) Compute the i-th iteration value of P(n1, n2) by equation 1 using A^(i-1), B^(i-1), C^(i-1). Call it P^(i)(n1, n2).
- Find new values A^(i), B^(i), C^(i) by using the following equation:
A^(i) = A^(i-1) P1(e1) + B^(i-1) P2(e1) + C^(i-1) P3(e1)    (2)
Similarly for B^(i) and C^(i), summing over e2 and e3, respectively.
- i = i + 1
- If a convergence criterion for A, B, C is not satisfied, goto (*).
- Use P^(i+1)(n1, n2) as an approximation to P(n1, n2).
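A compact numerical sketch of this fixed-point iteration is given below in Python; the truncated state space, the convergence tolerance, and the way the elementary solutions P1, P2, P3 are supplied (as dictionaries over states) are assumptions made for the illustration.

```python
def ssp_iteration(P_elem, c, tol=1e-9, max_iter=1000):
    """P_elem: list [P1, P2, P3], each a dict {(n1, n2): probability}
    for one of the elementary systems QS1, QS2, QS3 (already normalized).
    c: the DI threshold. Returns the approximated P(n1, n2) and (A, B, C)."""
    events = [
        lambda n1, n2: abs(n1 - n2) < c,   # e1
        lambda n1, n2: n1 - n2 >= c,       # e2
        lambda n1, n2: n2 - n1 >= c,       # e3
    ]

    # Probability of event e_j under the elementary distribution P_k.
    def prob(k, j):
        return sum(p for (n1, n2), p in P_elem[k].items() if events[j](n1, n2))

    # Initial values taken from the first elementary system, as in the text.
    coeff = [prob(0, j) for j in range(3)]
    for _ in range(max_iter):
        new = [sum(coeff[k] * prob(k, j) for k in range(3)) for j in range(3)]
        if max(abs(a - b) for a, b in zip(coeff, new)) < tol:
            coeff = new
            break
        coeff = new

    states = set().union(*(P.keys() for P in P_elem))
    P = {s: sum(coeff[k] * P_elem[k].get(s, 0.0) for k in range(3)) for s in states}
    return P, tuple(coeff)
```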
3
Simulation Results and Comparisons
Three different transfer policies for load sharing between two FIFO servers have been analysed or simulated and compared: the difference-initiated (DI), the sender-initiated (SI), and the receiver-initiated (RI) technique. The analytical model of DI was described in section 2, while the models for the other two simulated techniques can be found in [1]. We mention here that only one job can be in transmission at any time for each direction. We assume Poisson arrival times, and exponential service and transfer times for all the above policies. The parameters used are the following: Ts = the threshold adopted for SI, Tr = the threshold adopted for RI, c = the threshold adopted for DI, λ1, λ2 = the arrival rates for the two servers, μ1, μ2 = the service rates for the two servers, β = the transfer rate. In order to have comparable systems corresponding to the three policies, we choose to have Ts = c and Tr = 0 in all cases. We compare the three above transfer policies concerning the average response time for a job completion (the total time spent in the two-server system by a single job). We have taken analytical (SSP method) or simulation results for the above performance measures for DI, including homogeneous and heterogeneous arrival times and variable β, for heavier and lighter workloads. The same results for SI and RI were taken from our simulation model. A simulation sketch of the DI mechanism is given below.
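The following Python sketch simulates the DI model as a continuous-time Markov chain (Poisson arrivals, exponential services and transfers) and estimates the mean response time via Little's law; the parameter values and the measurement approach are assumptions chosen for the example, not those of the paper's experiments.

```python
import random

def simulate_di(lam1, lam2, mu1, mu2, beta, c, t_end=200000.0, seed=1):
    """Two exponential servers with Difference-Initiated transfers:
    when |n1 - n2| >= c, jobs move from the longer to the shorter queue
    at rate beta.  Returns the estimated mean response time (Little's law)."""
    rng = random.Random(seed)
    n1 = n2 = 0
    t = 0.0
    area = 0.0                     # time-integral of (n1 + n2)
    while t < t_end:
        rates = [lam1, lam2,
                 mu1 if n1 > 0 else 0.0,
                 mu2 if n2 > 0 else 0.0,
                 beta if abs(n1 - n2) >= c else 0.0]
        total = sum(rates)
        dt = rng.expovariate(total)
        area += (n1 + n2) * dt
        t += dt
        # Pick which event fires, proportionally to its rate.
        u, event = rng.uniform(0.0, total), 0
        while u > rates[event]:
            u -= rates[event]
            event += 1
        if event == 0:
            n1 += 1
        elif event == 1:
            n2 += 1
        elif event == 2:
            n1 -= 1
        elif event == 3:
            n2 -= 1
        else:                       # transfer from the heavier to the lighter queue
            if n1 > n2:
                n1, n2 = n1 - 1, n2 + 1
            else:
                n1, n2 = n1 + 1, n2 - 1
    mean_jobs = area / t
    return mean_jobs / (lam1 + lam2)    # Little's law: E[T] = E[N] / total arrival rate

# Heterogeneous arrivals, moderate transfer cost.
print(simulate_di(lam1=0.8, lam2=0.2, mu1=1.0, mu2=1.0, beta=20.0, c=2))
```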
3.1
Comparative Results for Homogeneous Arrival Streams
First, we consider the lightly loaded conditions. Our results have shown that RI has the worst average response time. On the other hand, DI behaves better, especially for large β (which means a low transfer cost, see fig. 1). As the arrival rate grows, the differences between the performance measures of the three techniques decrease, but DI continues to behave a little better. The results for SI and RI are likely to be almost identical under heavily loaded conditions. We think that under even heavier conditions, RI should present better results compared to SI, confirming the results of [2]. DI performs better than the other two because of its symmetrical behaviour.
3.2
Comparative Results for Heterogeneous Arrival Streams
Considering the extremal case of fig. 2, we can see that DI continues to perform better than the other two techniques. DI balances the load between the two servers very well, and its results are very close to the results of the SI policy. On the other hand, RI presents very poor results because, by its nature, it cannot move a large number of jobs from the heavier to the lighter server. The explanation is that the other two policies can move jobs even when the lightly loaded server is not idle. According to the definition of RI and the threshold used in this case, a server first has to be idle in order to receive a job from another server.
Fig. 1. Comparative results for the average response times (homogeneous arrival times, β = 20).
Fig. 2. Comparative results for the average response time (heterogeneous arrival times).
References
1. S. P. Dandamudi. "The Effect of Scheduling Discipline on Sender-Initiated and Receiver-Initiated Adaptive Load Sharing in Homogeneous Distributed Systems", Technical Report TR-95-25, School of Computer Science, Carleton University, 1995.
2. D. L. Eager, E. D. Lazowska, and J. Zahorjan. "A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing", Performance Evaluation, Vol. 6, March 1986, pp. 53-68.
3. J. Garofalakis and P. Spirakis. "State Space Partition Modeling - A New Approximation Technique", Technical Report TR. 88.09.56, Computer Technology Institute, Patras, 1988.
Scheduling Data-Parallel Computations on Heterogeneous and Time-Shared Environments
Salvatore Orlando 1 and Raffaele Perego 2
1 Dip. di Matematica Appl. ed Informatica, Università Ca' Foscari di Venezia, Italy
2 Istituto CNUCE, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy
Abstract. This paper addresses the problem of load balancing data-parallel computations on heterogeneous and time-shared parallel computing environments, where load imbalance may be introduced by the different capacities of processors populating a computer, or by the sharing of the same computational resources among several users. To solve this problem we propose a run-time support for parallel loops based upon a hybrid (static + dynamic) scheduling strategy. The main features of our technique are the absence of centralization and synchronization points, the prefetching of work toward slower processors, and the overlapping of communication latencies with useful computation.
1 Introduction
It is widely held that distributed and parallel computing disciplines are converging. Advances in network technology have in fact strongly increased the network bandwidth of state-of-the-art distributed systems. A gap still exists with respect to communication latencies, but this difference too is now much less dramatic. Furthermore, most Massively Parallel Systems (MPSs) are now built around the same off-the-shelf superscalar processors that are used in high performance workstations, so that fine-grain parallelism is now exploited intra-processor rather than inter-processor. The convergence of parallel and distributed computing disciplines is also more evident if we consider the programming models and environments which dominate current parallel programming: MPI, PVM and High Performance Fortran (HPF) are now available for both MPSs and NOWs. This trend has a direct impact on the research carried out on the two converging disciplines. For example, parallel computing research has to deal with problems introduced by heterogeneity in parallel systems. This is a typical feature of distributed systems, but nowadays heterogeneous NOWs are increasingly used as parallel computing resources [2], while MPSs may be populated with different off-the-shelf processors (e.g. an SGI/Cray T3E system at the time of this writing may be populated with 300 or 450 MHz DEC Alpha processors). Furthermore, some MPSs, which were normally used as batch machines in space sharing, may now be concurrently used by different users as time-shared multiprogrammed environments (e.g. an IBM SP2 system can be configured in a way that allows users to specify whether the processors must be acquired as shared or dedicated resources). This paper focuses on one of the main problems programmers thus have to deal with on both parallel and distributed systems: the problem
of "system" load imbalance due to heterogeneity and time sharing of resources. Here we restrict our view to data-parallel computations expressed by means of parallel loops. In previous works we considered imbalances introduced by non-uniform data-parallel computations to be run on homogeneous, distributed-memory, MIMD parallel systems [11, 9, 10]. We devised a novel compiling technique for parallel loops and a related run-time support (SUPPLE). This paper shows how SUPPLE can also be utilized to implement loops in all the cases where load imbalance is not a characteristic of the user code, but is caused by variations in capacities of processing nodes. Note that much other research has been conducted in the field of run-time supports and compilation methods for irregular problems [13, 12, 7]. In our opinion, besides SUPPLE, many of these techniques can also be adopted when load imbalance derives from the use of a time-shared or heterogeneous parallel system. These techniques should also be compared with those specifically devised to face load imbalance in NOW environments [4,
8]. SUPPLE is based upon a hybrid scheduling strategy, which dynamically adjusts the workloads in the presence of variations of processor capacities. The main features of SUPPLE are the efficient implementation of regular stencil communications, the hybrid (static + dynamic) scheduling of chunks of iterations, and the exploitation of aggressive chunk prefetching to reduce waiting times by overlapping communication with useful computation. We report performance results of many experiments carried out on an SGI/Cray T3E and an IBM SP2 system. The synthetic benchmarks used for the experiments allowed us to model different situations by varying a few important parameters such as the computational grain and the capacity of each processor. The results obtained suggest that, in the absence of a priori knowledge about the relative capacities of the processors that will actually execute the program, the hybrid strategy adopted in SUPPLE yields very good performance. The paper is organized as follows. Section 2 presents the synthetic benchmarks and the machines used for the experiments. Section 3 describes our run-time support and its load balancing strategy. The experimental results are reported and discussed in Section 4, and, finally, the conclusions are drawn in Section 5.
2
Benchmarks
We adopted a. very simple benchmark program that resembles a very common pattern of parallelism (e.g. solvers for differential equations, simulations of physical phenomena, and image processing applications). The pattern is data-parallel and "regular", and thus considered e~sy to implenlent on homogeneous and dedicated parallel systems. In the benchmark a bidimensional array is updated iteratively on the b ~ i s of the old values of its elements, while array data referenced are modeled by a five-point stencil. The simple IIPF code illustrating thc benchmark is shown below. REAL ~(N1,N2) ~ B(NI~N2) !HPF$ TEMPLATE D(N1,N2) L.HPFt ~ISTP.Iet~T~ O(I~LOC~,eLOCK) !HPF$ ALIGN A ( i , j ) , B(i,J) WITH V ( i , j )
358 FORALL ( i : 2 : n l - l ,
j= 2:N2-1)
B(i,j) = Coap(ICi,j), ACi-l,j), ACi,j-i), A(i+i,j), A(I,J.}i)) r~ND FORALL A=B END DO
Note the III,OCK (listrilJution to exploit data locality by reducing off local-memory data references. The actual COluputation performed by the benchmark above is hidden by the fimction Comp(). We thus prepared several versions of the same benchmark where Comp() was replaced with a dummy computation characterized by known and fixed costs. Moreover, since it is important to observe the performance of our loop support when we increase the number of processors, for each different grid P of processors we modified the dimensions of the data set to keep the size of the block of data allocated to each processor constant. Finally, another feature that we changed during the tests is Iq_ITER, the number of iterations of the external sequential loop. This was done to simulate the behavior of real apl)lications such as solvers of differential equations, which require the same r)arallcl lool)s to bc cxecute(I many times, and image filtcring al)plications, which usually perform the update of the inl)ut image in jl,st one st(q). 3
The
SUPPLE
Approach
SUPPLE (SUPport for Parallel Loop Execution) is a portable run-time support for parallel loops [9, i1, 10]. It is written in C with calls to tile MPI library. Tile main innovative feature of SUPPLE [9] is its ability to allow data and computation to be dynamically migrated, wit]lout losing the ability to exploit all the static optimizations that can be adopted to efficiently implement stencil data references. Stencil implementation is s|.ra.ightforward, (luc to the regularity of the blocking data layout adopted. For each array involved SUPPLE allocates to each processor enough memory to host the block partition, logically subdivided into an inner and a perimeler region, and a surrounding ghost region. The ghost region is used to buffer the parts of the perimeter regions of the adjacent partitions that are owned by neighboring processors, and are accessed through non local references. The inner region contains data elements that can be computed without fetching external data, while to compute data elements belonging to the perimeter region these external data have to be waitc(l for. Loop iterations are statically assigned to each processor according to the owner compules rule, but, to overlap comnmnications with usefid computations, iterations are reordered [3]: the iterations that assign data items belonging to the inner region (wl,ich refer k)cal data only) are scheduled between the asynchronous sending of the perimeter rrgion to ,leighl)oring l)roc('ssors and the receiviug of the corresponding data into I.hc ghosl region. This si,atic scheduling ma.y bc cha,gt,d at run-time by migratillg iterations, but, in order to avoid the introduction of irregularities in the implementation of stencil data reference, only iterations updating the inner region can be migrated. We group these iterations into chunks of fixed size g, by statically tiling the inner region. SUPPLE migrates chunks and associated data tiles instead of single iterations. At the beginning, to reduce overheads, each processor statically executes its chunks, which are stored in a queue Q, hereafter local queue. Once a processor understands that
359 its local queue is becoming empty, it autonomously decides to start the dynamic part of its scheduling policy, It tries to balance the processor workloads by asking overloaded partners for migrating both chunks and corresponding data tiles. Note that, due to stencil d~d,a references, to allow l,he remote execution of a elnmk, the ,associated tile must be accompanied by a surrotmding area, whose size depends on the specific stencil features. Migrated clmnks and data tiles are stored by each receiving processor in a queue RQ, called remote. Since load balancing is started by underloaded processors, our technique can be classified as ~veeiver initiated [6, 14]. In the following we detail our hybrid scheduling algorithm. During the initial static phase, each processor only executes local chunks in Q and measures their computational cost. Note that, since the possible load imbalance may only derive from different "speeds" of the processors involved, chunks that will possibly be migrated and stored in RQ will be considered as having the same cost as the ones stored in Q. Thus, on the basis of the knowledge of the chunk costs, each processor estimates its cur~ent load by simply insl)ecting the size of its queues Q and RQ. When the estimated local load becomes Iowa," I,han a machine-dependent Threshold, each I)rocessor autonomously sl,arts tim dynamic I)arl, el the scheduling I.echnique and starts asking other processors for re,note chunks. Corresl)ondingly, a processor pj, which receives a migration request from a processor pi, will grant the request by moving some of its workload to pj only if its current load is higher than Threshold. To reduce the overheads which might derive from requests for remote chunks which cannot be served, each processor, when its current load becomes lower than Threshold, broadcasts a socalled termination message. Therefore the round-robin strategy used by underloaded processors to choose a partner to be ,'~sked for further work skips terminated processors. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate mnnber of chunks to I)e migrated. '17o this end, S U P P L E uses a modified Factorin 9 scheme [5], which is a Self Scheduling heuristics formerly proposed to address the elficient implementation of parallel loops on shared-nmnlory multiprocessors
Finally, the policies exploited by SUPPLE to manage data coherence and termination detection are also fully distributed and asynchronous. A full/empty-like technique [1] is used to ~synchronously manage the coherence of migrated data tiles. When processor pi sends a chunk b to pj, it sets a flag marking the data tiles associated with b as invalid. The next time Pl needs to access the same tiles, Pl checks the flag and, if the flag is still set,, waits for the updated data tiles fi'om node pj. As far as termination detection is concerned, the role of a processor in the parallel loop execution finishes when it Itas ah'eady received a termination message fi'om all the other processors, and I)oth its (lllellr Q a.lld ItQ are elnpty. In summary, unlike other proposals [12, 4], the dynamic scheduling policy of SUPPLE is fully distributed and based upon local knowledge about the local workload, and thus there is no need to synchronize tile processors in order to exchange updated information about tile global workload. Moreover, SUPPLE may also be employed for applications composed of a siugle parallel loop, such as filters for image processing. Unlike other proposals [13,8], it does not exploit past knowledge about the work-
360
load distribution at previous loop iterations, since dynamic scheduling decisions are asynchronously taken concurrently with usefifl computation. 4
Experimental Results
All experiments were conducted on au IBM SP2 system and an S G I / C r a y T3E. Note that both machines might be heterogeneous, since both can be equipped with processors of different cal)acities. The T3E can in fact be populated with DEC A l p h a processors of different generations 1, while 1BM enables choiccs from three distinct types of nodes - high, wide or thin - where the differences are in the number of per-node processors, tile type of processors, tile clock rates, the type and size of caches, and tile size of tile main memory, llowever, we used the Sl'2 (a 16 node system equipped with 66 Mllz I)()WEI{, 2 wide processors) as a homo:le,U'ous time-shared environment. ' l b simulate load imbalance we siml)ly ialmche(I some ('o|||pl~te-boun(I processes on a subset of nodes. On the other hand, we used the 'F3E (a system composed of 64 300 MIlz - DEC Alpha 21164 processors) as a sl)ace-sharcd heterogeneous system. Since all the nodes of our systeul are idcl|l.ical, wr had to simulate the presence of processors of different speeds by hJtroducing an extra cost in the COml)utatio|l performed by those processors considered "slow". Thus, if tile granularity of Comp() is p / t s e c (including the time to read and write the data) on the "fast" processors, and F is the factor of slowdown of ~he "slow" ones, the granularity of Comp() on the "slow" processors is ( F . I~) Imec. % prove the effectiveness of SUPPLE, we compared each S U P P L E implementation of e benchmark with an equivalent static and optimized implementation, like the one pie|ted by a very efficient I [ P F compiler. We l)rescnt several curves, all plotting an excculion lime ratio (ETR), i.e. the ratio of the time taken by the static scheduling vcrsiou of the benchmark over the time taken by the SUPPI~E hybrid scheduling version. Ilence, a ratio greater than one corresponds to an improvement in the total execution time obtained by adopting S U P P L E with respect to the static version. Each curve is relative to a distinct granularity p of Comp(), and plots the E T l / s as a function of the number of processors employed. The size of the d a t a set has been modified according to tile number of processors to keep tile size of the sub-hlock statically assigned to each processor constant.
4.1
Tixne-shared Environments
First of all, we show tile results obtained on tile SP2. Due to tile difficulty ill running probing experiments because of the unpredictability of the workloads, as well as for the exclusive use of some resources (ill 1)articular, resources as the SP2 high-performance switch a u d / o r the processors themselves) by other users, the results in this ease are not so exhaustive as those reported for the tests run on the T3E. As previously mentioned, on the SP2 we ran our experiments after the introduction of a synthetic load on a subset of the nodes used. In particular, we launched 4 c o m p u t e - b o u n d processes on I The Pittsburgh Supercomputing Center has a T3E system with 512 application processors of which half runs at 300 Mllz au(l half at 450 Milz.
361 3
IBM 81,2 (tP): SUPPLE w.t,I, i I l ~ i ~ , n g .
~ 1 ~ 4 ~ 6 ~ proel 3
p . 7.5 pae~: I I . 17.fi pzee I I . 21.0 pJee
IBM SP2 ~18): SUPPLE ..... i
W.r.t
. l ~ e Dr,hodullhg. Ioed,.40P 60% pee~z t j . S,SVe~ - . . - -
2
N ..............
,,
t.8 t
0.5
S
Io Number of pmeosi~rl
,s
(~)
~o
o
S
1o Number ol p r o c n l o r l
IS
(b)
Fig. 1. SP2 results: ETR results for experiments exploiting (a) the Ethernet, and (b) the high-performance switch
50% of the processing nodes employed, while the loads were close to zero on the rest of I,he nodes. This corresponds to "slow" proc~'ssors characterized by a. va.hm of F (factor of slowdow,) equal to 5. The two p{t~ts in I,'igure I show the ETRs ol)taincd for various I t and numbers of processors. The size of the sub-block owned by each processor was kept constant and equal to 512 • 512, while N_ITER was equal to 5. As regards the SUPPLE parameters, tim Thleshold value was set to 0.09 msec, and the chunk size g to 32 iterations. Figure 1.(a) shows the SP2 results obtained by using the Ethernet network and the IP protocol. Note that, even in this c~se as in the following ones, the performance improvement obtained with SUIqH,I~ increases in proportion to the size of the grain it. For It = 27 /tsec, ~n(I 4 processors, we obtained the best result: S U P P L E implementation is about t00% faster than the static counterpart. For smaller values of tt, due to the ovcrheads to migrate data, the load imbalance paid by the static version of the benchmark is not large enough to justify the adoption of a dynamic scheduling technique. Figure 1.(b) shows, on the other hand, the results obtained by running the parallel benchmarks in time-sharing on the SP2 by exploiting the US high-performance switch. 1)ue to the better conmmuication framework, in this c ~ e the E'I'R is favorable to SUIWI,I'~ even for smaller vah,,s of tt. 4.2
Heterogeneous Environments
All the expcrimctd,s regarding h(,terogencous envimmueuts were couducted on ~,ue 9' ' "t,3 " ;L,. ' lleterogeneity was simulated, so that we were able to perform a lot of tests, also experimenting different factors F of slowdown. lterative benchmark Figure 2 reports the results for the iterative benchmark, where the external sequential loop is repeated 20 times (N_ITER = 20). For all the tests we kept fixed tile size of tile sub-block assigned to each processor (512 x 512). Figures 2.(a) au(I 2.(b) are relative to a.n environment where only 25% of all processors are "slow".
362 C m y T3E: SUPPLE w,r.t, liege Iohodullno F . 2 on 25% p m e l
Cnly T3E: SUPPLE w.r.t elizUo Ichl~JIInll - F , 2 on 60% plo~z
-
I~ - 0 S pmg~
p . o space .--~-..
t l , 0.6 p * ~ - - . - p . 1.1 p l ~ z . . p . 2 , 2 p H ~ -- m--
p 9 1.1 pltO p . 2,2 p.ee
i.g
!
:, : ;:;: "77;;;777:77 7- 77
- .... "
0 I 1o
2o
30 40 Number of p'ocie.oro
60
so
1o
70
lo
~io 4o Nlllr~t of pmcllnoni
Eo
eo
(~)
(a) Ctlly ')'~E ~8*L~ppLE w.r.i. I illllg iichelilllllg. F . 4 on 26% flrocl
Clay T3E: ~IUPPLE w.t.I, i l i t k i l e t l ~ l . l l l n l l . F . 4 ell 5o% p l e r
4
4 I
36~
p - o.e ixl,ec --.-- p . 1,1 pseo 9 , . p . 2,2 ~se0 o
3.6 3
p . o.8 pwlm p . 1,1 p l i G p , 2.2psoo
3F
: : : : : . " :.; ;:-;.-::.:- :7;7;7;;.77 77 ili~: I
,
~
l.ls
:-;;:;--;7;;;-:;;7;;;..LIi;.L.;:.;.;L;;;.L;iI:I.i::I;171;;Y2.1L.IIT;.] t,ll F
t
tlO.S I-
o.6
o
I
1o
2o
9o 40 Number of p r o ~ o o o r l
(b)
so
so
7o
~o
io
2o
3O ,10 Number o / i ~ o c N z o m
I
(10
70
(d)
F i g . 2. T3E results: E T R results for various values of F and p
The factors F of slowdown of these processors are 2 and 4, respectively. Figures 2.(c) and 2.(d) refer to an environment where half of the processors employed are "slow'. Each curve in all these figurcs plots tile E T R for a benchmark characterized by a distinct it. WiL]l respect to the SP2 tests, in this case we were able to reduce y (I t now ranges I)etween 0.3 and 2.2 psec), always obtaining an encouraging performance with our S U P P L E suPl)ort. Note that such grains are actually very small: for /t equal to 0.3/tscc, about 85% of the total time is spent on memory accesses (to compute addresses and access d a t a element covering the five-point stencil), and only 15% on arithmetic el,oral, lolls. We believe tha.t the encouraging results obtained are due to the smaller overheads and latencies of the T3E network, as well as to the absence of tinie-sharing of the processor, thus specding up the responsiveness of "slow" processors to requests
COllling from "fast," OliOS, 'l'he reductions in granularity nlade it possible to enlarge the chunk size !l (!l = 128) without Iosiilg I,he elfocl,ivenoss of the dynalnic schedllling algorithni. The Threshold ranged between 0.02 slid/I.08 IllSec, where larger values were used for larger y. Looking at Figure 2, we can see that better results (SUPPLE execution times up to 3 times lower than those obtained with static scheduling) were obtained when the "slow" processors were only 25% of the total nmnber. The reason for this behavior is clear: in this case, we have more '
363 obtained with the static implementation are, on the other hand, almost independent of the percentage of "slow" processors. In fact, even if only one of the processors was "slow", its execution time would dominate the overall completion time. We also tested SUPPI,F, (m a homogeneous system (i.e. a balanced one) in order to ewduatc its overhead w.r.t, a static iml)lemeuta.tion, which, in this case, is optimal. 'l'he overhead is almost consta,ut for d a t a sets of the same sizes subdivided into a given number of chunks, but its influence becomes larger for smaller granularities because of the shorter execution time and the limited possibility of hiding comrnunication latencics with computations. Thus, for 11, -- 2.2 ILsec the two execution times are almost comparable, while for ii -- 0.3#scc, the static version of the benchmark becomes 60% faster than SUPPLE. We verified that the overhead introduced by S U P P L E is due to some undesired migratiou of chunks, and to the dynamic scheduling technique which enta.ils polling the network interface to check for incoming messages (even if these messages do not actually arrive).
Fig, 3. Work time and overheads for a static (a) and a SUPPLE (b) version of the benchmark
Finally, we instrumented the static and tile S U P P L E versions of a specific benchmark to evaluate the ratio betwee~t the executiou time spent on useful computations and ou dynamic scheduling overheads. The features of the benchmark used in this case are the following: a 20,18 • 2048 d a t a set distributed over a grid of 4 • 4 T 3 E processors, and an external sequential loop iterated for 20 times. The simulated unbalanced enviroumeut was characterized I)y 25(7~ of "slow" processors, with a slowdown factor of d. Figure ,3.(a) show the results ol)taiued by running the static implementation of this benchmark, it is worth ,oti~lg the work time o , the "slow" processors, which is 4 times the work time on the "fast" ones. The black 1)ortioJls of the bars show idle times on "fast" l)roccssors while waiting for border data, These idle times are thus due to communications implementing stencil d a t a references, where corresponding send/receive on fast/slow processors are not synchronized due to their different capacities. Figure 3.(b), on the other hand, shows the S U P P L E results on the same benchmark. Note the redistribution of workloads from "slow" to "fast" processors. Thanks to dynamic load balanciug, idle times disappeared, but we have larger scheduling overheads due to the communications used to dynamically migrate chunks. These overheads are
364 clearly larger on "slow" processors, which spend a substantial part of their execution time on giving away work and on receiving the results of migrated iterations.
;Z.6
Cmu T3E: BUPPt-E w.v.L ~
~ l l l
- F . 2 o~ ~1~%~
c r y ~ 3E: SUPPLE w.~.l. ~
( , t3~.)
4.6
p - o :* p * * o - - ~ - il . O e p i ~ i1 - 2 2 Illttr
9
..
.
.
, t h t ~ 4 ~ n o - F..4 o n ~1;'/. p r o ~ (1 ~ l r . . ~.OJlse~
4
9 36 3
! l.s O.6
o.s o 20
30 4O Ntn~nber oe p m e e l a c q |
SO
60
;'0
I l0
I 20
I 1 30 40 N m ~ : e r of I ~ O ~ N O r e
I 60
I e0
70
(b) Fig. 4. T3E results: ETlt results obtained rmming a sittgle iteration bcnchnmrk for various values of F and it
Single iteration benchmark As explained above, one of the advantages of S U P P L E is that it can also be used for balancing parallel loops that have only to be executed once. In this case some overheads cannot be overlapped, and at the end of loop execution "slow" processors have to wait for the results of iterations executed remotely. Figure 4 shows the encouraging ETRs obtained by our S U P P L E implementation w.r.t, the static one. Note that all the results plotted in the figure refer to unbalanced environments where only 25% of the processors are "slow". Moreover, due to the larger d a t a sets used for these tests, the ETILs are in some cases even more favorable for S U P P L E than the ones for iterative benchmarks.
5
Conclusions
We have discussed tile implementation on heterogeneous a n d / o r timed-shared parallel machines of regular a , d unifornl parallel loop kernels, with statically predictable stencil d a t a references. We have assumed that no information about the capacities of the processors involved is availal)le until run-time, and that, in time-shared environments, these capacitics may change during run I.ilue. 'lb implement the kernel benchmark we employed SUI'I)I,F,, a ru,-l.ime support tllat we had previously introduced to cOral)lie non-uniform loops. The tests were conducted on an S G l / C r a y T3E and an IBM SP2. We compared the S U P P L E results with a static implementation of the benchmark, where d a t a and computations are evenly distributed to the various processing nodes. The SP2, a parallel system t h a t can be used as a time-shared NOWs, was loaded with artificial computebound processes before running the tests. On the other hand, we needed to simulate a heterogeneous S G i / C r a y '1'31;;, i.e. a machine whose nodes may be equipped with
365 different off-the-shelf processors a n d / o r memory organization. The performance results were very encouraging. On the SP2, where half of the processors were loaded with 4 compute-bound processes, the S U P P L E version of the benchmark resulted at most 100% faster than the static one. On the T3E, depending on the amount of "slow" processors, on the nuulber of processors eniployed and the granularity of loop iterations, l,he SlJl~I'LF, version reached percentages of pcrfornlance improvement ranging between 20% and 270%. Further work has to be done to compare our solution with other dynamic scheduling strategies proposed elsewhere. More exllaustivc experiments with different benchmarks and dynaiuic variations of the system loads are also required to fully evaluate the proposal, llowever, we believe thai. hyhrid strategies like the one adopted by S U P P L E can be profitably exploited in many cases where locality exploitation and load balance must be solved at the same time. Moreover, our strategy can be e ~ i l y integrated in the conlpilatiou nlodel of a high level data parallel language.
References I. II, Alverson el, al, The Tera coniputer sysl,eni, lu Proe. oS the 1990 ACM Int. ConS, on Supercolpuling, pages 1-6, 1990. 2. A. Anurag, G. Edjlali, and J. Saltz. The Utility of Exploiting idle Workstations for Parallel Computation. [n Prec. of the 1997 ACM SIGMETRICS, July 1997. 3. S. Hiranandani, K. Kennedy, and C. Tseng. Evaluating Compiler Optimizations for Fortran D. J. o/Parallel and Distr. Comp., 21(1):27-45, April 1994. 4. S. |"lynn lluninlel, J. Schmidt, R. N. Ulna, and J. Wein. Load-Sharing in IIeterogeneous Systems via Welgllted Factoring. Ill Piwc. o/the 8th SPAA, July 1997. 5. S.F. ! huunlcl, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. os the ACM, 35(8):90-101, Aug. 1992. 6. V. Kumar, A.Y. Grama, and N. Rao Venipaty. Scalable Load Balancing Techniques for Parallel Computers. d. os Parallel and Dislr. Comp., 22:00-79, 1994. 7. J. Liu and V. A. Salctore. Self-Scheduling on Distributed-Memory Machines. In Prec. of Supercomputing '93, pages 814-823, 1993. 8. M. Colajanni M. Cermele and G. Necci. Dynamic Load Balancing of Distributed SPMD Computations with Explicit Message-Passing. In Pivc. oS Ihe IEEE Workshop on Hetelvyeueous Conipn/iug, pages 2-16, 1997. % S. Orlalido and R. Pcrego. SIJPPI,I.']: an Efficient liuu-Tinie. Support for Non-Unlfornl Parallel l,oops. Technical Report TR-17/'.)6, l)il)artiinento di Mat. Appl. ed lnforinatica, Universlth di Venezia, Dec. If){)0. To appear Oil J. o[" System Architecture. 10. S. Orlando and R. Percgo. A Coinparison of lnlplelnentation Stratt'gies for Non-Uniforni Data Parallel Compur 'l~<:hnical Report, TR-,q/97, Dipart.imento di Mat. Appl. ed ]n[orniatlca, UlilVci'sil,~ di Vcncz,ia, April 1997. Under revision for publieatlon on tile J. of Parallel and l.)istr. Conip. 11. S, Orlando and R. Perego. A Support for Non-Uniforni Parallel Loops and its Application to a Flame Simulation Code. lu Iu oS lhe ]tth hit. Symposium, IRREGULAR '97, pages 186-197, Paderborn, Germany, June 1997. LNCS 1253, Spinger-Verlag. 12. O. Plata and F. F. Rivera. Combining static and dynamic scheduling on distributedmemory multiprocessors. In Prec. of Ihe 19.94 A CM Int. Con/. on Sltpercomputing, pages 186--195, 1994.
366 13. J. Saltz et al. Runtime and Language Support for Compiling Adaptive Irregular Programs oll Distributed Memory Machines. Software Practice and Experience, 25(6):597-621, June 1995. t4. M.tI. Willebeek-beMair and A.P. Reeves. Strategies for Dynamic Load Balancing on Highly Parallel Comput.ers. IEEE Trans. on Parallel and Distr. Systems, 4(9):979-993, Scpl,. 1993.
A Lower Bound for Dynamic Scheduling of Data Parallel Programs
Fabricio Alves Barbosa da Silva 2, Luis Miguel Campos 1, Isaac D. Scherson 1,2
1 Information and Comp. Science, University of California, Irvine, CA 92697 U.S.A. {isaac,lcampos}@ics.uci.edu***
A b s t r a c t . Instruction Balanced Time Slicing (IBTS) allows multiple parallel jobs to be scheduled in a manner akin to the well-known gang scheduling scheme in parallel computers. IBTS however allows for time slices to change dynamically and, under dynamically changing workload conditions is a good non-clairvoyant scheduling technique when the parallel computer is time sliced one job at a time. IBTS-parallel is proposed here as a dynamic scheduling paradigm which improves on IBTS by allowing also dynamically changing space sharing of the computer's processors. IBTS-parallel is also non-clairvoyant and it is characterized under the competitive ratio metric. A lower bound on its performance is also derived.
1
Introduction
A solution to the dynamic parallel job scheduling problem is proposed together with its complexity analysis. The problem is one of defining how to share, in an optimal manner, a parallel machine among several parallel jobs. A job is defined as a set of data parallel threads. One important characteristic that makes the problem dynamic, as defined in [3], is the possibility of arrival of new jobs at arbitrary times. In the static case, the set of jobs to be executed is already defined when scheduling decisions are made, and arbitrary job arrivals are not permitted. In this paper we propose a new scheduling algorithm dubbed IBTS-Parallel, which is derived from the IBTS algorithm. IBTS stands for instruction balanced time slicing, and was originally defined in [6]. IBTS is a non-clairvoyant [3] scheduling algorithm designed to optimize the competitive ratio (CR) metric, also defined in [6][4]. In addition to the theoretical analysis of IBTS-Parallel, we present an experimental analysis performed using a general purpose event driven simulator developed by our research group. *** Supported in part by the lrvine Research Unit in Advanced Computing and NASA under grant ~NAG5-3692. t Professor Alain Greiner is gratefully acknowledged. t Supported by Capes, Brazilian Government, grant number 1897/95-11.
2 Previous Work
Various classifications for the parallel job scheduling problem have been suggested in the literature. The classification used here is based on the way in which computing resources are shared: temporal sharing, also known as time slicing or preemption; and space slicing, also called partitioning. These two classifications are orthogonal, and may lead to a taxonomy based on the possible options. Table 1 shows the scheduling policies adopted by commercial and research systems, and was borrowed from [2]. It is worth noting that there is a total lack of consensus on which of the scheduling policies shown in Table 1 is best. The problem is that the assumptions leading to and justifying the different schemes are usually quite different, which makes the comparison between different solutions difficult.
Fig. 1. Scheduling policies followed in current commercial and research systems
We address the scheduling problem by using the same assumptions (i.e. model and metrics) described by Subramaniam and Scherson in [6][4]. In [6], Subramaniam and Scherson studied a scheduling policy named Instruction Balanced Time Slicing (IBTS), which performs well under the competitive ratio metric [4]. Each job is allocated on all P processing elements (PE) of the machine for a finite quantum (time-slice). All time-quanta devoted to the various jobs are equal as measured in machine instructions (that is, all jobs are preempted after an equal number of machine instructions). Note that time quanta may vary in absolute value while still being equal when measured in number of instructions. A good analysis of gang scheduling, which is similar to IBTS but imposes an invariant constant time slice, can be found in [5]. In [4] the programming model used is the data parallel (V-RAM) model [1]. The scheduling problem was defined as an optimization problem, and the competitive ratio was the metric used as the objective function. The competitive ratio (CR) was defined as a function of the non-clairvoyance of the scheduler. The CR is based on the happiness [4] concept. The happiness metric attempts to capture the satisfaction of a job as a function of the scheduling decisions made by a scheduler. The CR is then the ratio between the happiness achieved by a knowledgeable malicious adversary and that achieved by a partially ignorant scheduler. IBTS is hence a non-clairvoyant algorithm, and the malicious adversary always has more information than the scheduler itself. It is that hidden information that is used by the adversary to keep the happiness as low as possible. By minimizing the CR the scheduler approaches the result that would be obtained by an adversary with global knowledge.
CR = Happiness(x, y') / Happiness(x, y)
(1)
In equation (1), x represents the input to the algorithm, y is the result obtained by the scheduler and y' is the result obtained by the adversary. Also:
Happiness(x, y) = ∫_0^T min_j p^j(τ) dτ
(2)
where p^j is the power delivered to job j, and T is the time at which the last job completes. In analogy with physics terminology, the power delivered to a job is defined as:
p = W / F    (3)
F is the running time of a job as a function of the number of statements and the mean completion time of each statement (the mean completion time is computed according to the type of the statement: local statements, remote access statements, etc., and is machine architecture dependent). W is the work, or the processor-time product in the ideal world, i.e. it is the product of the ideal number of processors for execution of the job and the running time assuming all statements are local statements. Equation (3) is valid for the single job case. For multiple jobs we have:
p^j = W^j / F^j
(4)
Another related definition is the inefficiency of running the job on a machine:
η = P_I F / W,    η ≥ 1
(5)
F is the running time of a job, W is the work and P_I is the number of processors actually allocated to the job. The most important result contained in [4][6] is that, of all non-clairvoyant schedulers, instruction-balanced time-slicing has the least CR, with the corresponding proof. As a consequence of the proof it was verified that the least possible CR is equal to η_max, which is the maximum inefficiency among all jobs that are running in a machine at a given moment.
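To make these definitions concrete, the following small C fragment (our own illustration, with invented job values) evaluates the inefficiency η = P_I F / W of each running job and returns the maximum; by the result just quoted, this η_max is the smallest competitive ratio any non-clairvoyant scheduler can hope for.

#include <stdio.h>

typedef struct {
    double P_I;   /* processors actually allocated to the job        */
    double F;     /* measured running time of the job                */
    double W;     /* ideal work: ideal processor count * ideal time  */
} Job;

double max_inefficiency(const Job *jobs, int n)
{
    double eta_max = 1.0;            /* eta >= 1 by definition (5)   */
    for (int i = 0; i < n; i++) {
        double eta = (jobs[i].P_I * jobs[i].F) / jobs[i].W;
        if (eta > eta_max)
            eta_max = eta;
    }
    return eta_max;
}

int main(void)
{
    Job jobs[] = { {16, 10.0, 120.0}, {8, 5.0, 30.0} };  /* example values */
    printf("eta_max = %.3f\n", max_inefficiency(jobs, 2));
    return 0;
}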
3 IBTS-parallel
In IBTS, each job is allocated to the whole machine and executes a predefined number of instructions before being preempted. However, not all jobs necessarily use all processors of the machine at all times. In order to further optimize space sharing, parallel scheduling policies should allocate multiple jobs simultaneously. We use the fact that IBTS is the non-clairvoyant algorithm with the least CR to propose a modified version of IBTS that also permits a better spatial allocation of the machine. Let us start by considering a machine with N processors in a MIMD architectural model as described in [4]. In our modified version of IBTS the machine is shared by more than one job at any given time. Jobs are preempted after all threads, running in parallel, execute a fixed number I of instructions. The resulting scheduling algorithm is dubbed IBTS-parallel. Many different spatial scheduling strategies are possible with IBTS-Parallel (first-fit, best-fit, etc.). IBTS-Parallel has at least the same performance as IBTS under the CR metric, as stated in the lemma below.
Lemma 1. The CR of IBTS-parallel is ≤ η_max.
Proof. It follows from [4][6] by verifying that the two properties of IBTS are also valid for IBTS-parallel: 1. At any time, an equal amount of power is supplied to all running jobs. 2. The job that finishes last incurs the maximum amount of work. Using the lemma above, we can state the main result of this section.
Theorem 1. η_max^{IBTS-par} ≤ η_max^{IBTS}
In other words, IBTS represents the worst case of IBTS-parallel.
Proof. As we are running multiple jobs in parallel, the execution time of each job will necessarily be smaller than if we run one job after another, as is the case in IBTS since one job allocates all processing elements of a machine. From the definition of inefficiency:
η = P_I F / W
(6)
P_I and W do not change from IBTS to IBTS-parallel. The time F is smaller or equal for all jobs under IBTS-parallel as compared to pure IBTS. So, all jobs have smaller or equal inefficiencies under IBTS-parallel as compared to pure IBTS, which makes the maximum inefficiency in IBTS-parallel smaller than or equal to the corresponding quantity in IBTS.
4 Simulation and Verification
To verify the results above, we used a general purpose event driven simulator, developed by our research group for studying a variety of related problems (e.g., dynamic scheduling, load balancing, etc). The simulator accepts two different formats for describing jobs. The first is a fully qualified DAG. The second is a set of parameters used to describe the job characteristics such as the computation/communication ratio. When the second form is used the actual communication type, timing and pattern are left unspecified and it is up to the simulator to convert this user specification into a DAG, using probabilistic distributions, provided by the user, for each of the parameters. Other parameters include the spawning factor for each thread, a thread life span, synchronization pattern, degree of parallelism (maximum number of threads that can be executed at any given time), depth of critical path, etc. Even though probabilistic distributions are used to generate the DAG, the DAG itself behaves in a completely deterministic way. Once the input is in the form of a DAG, and the module responsible for implementing a particular scheduling heuristic is plugged into the simulator, several experiments can be performed using the same input by changing some of the parameters of the simulation, such as the number of processing elements available or the topology of the network, among others. The outputs can be recorded in a variety of formats for later visualization. For this study we grouped parallel jobs in classes where each class represents a particular degree of parallelism (maximum number of threads that can be executed at any given time). We divided the workload into ten different classes with each class containing 50 different jobs. The arrival time of a job is described by a Poisson random variable with an average rate of two job arrivals per time slice. The actual job selection is done in a round robin fashion by picking one job per class. This way we guarantee the interleaving of heavily parallel jobs with shorter ones. We distinguish the classes of computation and of communication instructions in the various threads that compose a job. A communication forces the thread to be suspended until the communication is concluded. If the communication is concluded during the currently assigned time-slice the thread resumes execution. All threads are preempted only after executing 100 instructions (the duration of a time slice), with the caveat that any thread that executed one or more communication instructions will be preempted at the end of the time-slice regardless of how many instructions it was able to execute in the current time-slice. We used a factor of 0.001 communications per computation instruction. The practical implementation of IBTS-parallel was based on a greedy algorithm applied at the beginning of each workload change. During the workload change interval, called a cycle, the workload is obviously assumed constant. Thus, the eligible threads of queued jobs are allocated to processors using the first fit strategy for each time slice. Clearly, after all eligible threads are scheduled on a processor for some time slice (slot), the temporal sequence is repeated periodically until a workload change again occurs. We considered in simulations a machine with 1024 processors. Preliminary results are shown in Table 1. They have been normalized to show the total execution time of IBTS-parallel to be 100%.

Table 1. Preliminary experimental results

                            IBTS    IBTS-parallel
Total Running Time (%)      123.6   100
Total Idle Time (%)         41.9    28.2
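The greedy first-fit allocation described above can be sketched in C as follows. This is a simplified illustration of ours, not the authors' code: it packs whole jobs into time slices so that all eligible threads of a job share one slice, opening a new slice when the current ones cannot hold the job (each job is assumed to need at most NPROCS processors).

#define NPROCS    1024
#define MAX_SLOTS 4096

typedef struct { int id; int n_threads; } Job;

/* Returns the number of time slices (slots) in one cycle; free_in_slot[s]
 * is left holding the processors still idle in slice s. */
int first_fit_slots(const Job *jobs, int n_jobs, int free_in_slot[MAX_SLOTS])
{
    int n_slots = 0;
    for (int j = 0; j < n_jobs; j++) {
        int s = 0;
        while (s < n_slots && free_in_slot[s] < jobs[j].n_threads)
            s++;                             /* first slice with enough room */
        if (s == n_slots)
            free_in_slot[n_slots++] = NPROCS;  /* open a new slice if needed */
        free_in_slot[s] -= jobs[j].n_threads;
    }
    return n_slots;  /* the slice sequence repeats until the workload changes */
}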
It is important to dissect the value obtained for idle time. This value is the result of the following three factors:
1. Communications
2. Absence of ready threads
3. Lack of job virtualization
The first is a natural consequence of threads communicating among themselves. The second reflects the fact that jobs arrive and finish at random times and at any given instant there might not be any job ready to be scheduled. The last is a result of not allowing individual threads of a same job to be scheduled at different time-slices. This factor is, we believe, the chief reason behind the high idle time measured and should be greatly reduced by extending IBTS-parallel with job virtualization [4] techniques.
References
1. Blelloch, G. E.: Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA (1990)
2. Feitelson, D.: Job Scheduling in Multiprogrammed Parallel Systems. IBM Research Report RC 19970, Second Revision (1997)
3. Motwani, R., Phillips, S., Torng, E.: Non-clairvoyant scheduling. Theoretical Computer Science 130(1) (1994) 17-47
4. Scherson, I. D., Subramaniam, R., Reis, V. L. M., Campos, L. M.: Scheduling Computationally Intensive Data Parallel Programs. École Placement Dynamique et Répartition de Charge (1996) 39-61
5. Squillante, M. S., Wang, F., Papaefthymiou, M.: An Analysis of Gang Scheduling for Multiprogrammed Parallel Computing Environments. Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (1996) 89-98
6. Subramaniam, R.: A Framework for Parallel Job Scheduling. PhD Thesis, University of California at Irvine (1995)
A General Modular Specification for Distributed Schedulers
Gerson G. H. Cavalheiro¹†, Yves Denneulin² and Jean-Louis Roch¹
¹ Projet Apache‡, Laboratoire de Modélisation et Calcul, BP 53, F-38041 Grenoble Cedex 9, France
² Laboratoire d'Informatique Fondamentale de Lille, Cité Scientifique, F-59655 Villeneuve d'Ascq Cedex, France
Abstract. In parallel computing, performance is related both to algorithmic design choices at the application level and to the scheduling strategy. Concerning dynamic scheduling, general classifications have been proposed. They outline two fundamental units, related to control and information. In this paper, we propose a generic modular specification, based not on two but on four components. They and the interactions between them are precisely described. This specification has been used to implement various scheduling algorithms in two different parallel programming environments: PM2 (ESPACE) and Athapascan (APACHE).
1 Introduction
From various general classifications of scheduling systems for parallel architectures [2, 6], it is possible to extract two fundamental elements: a control unit to compute the distribution of work in a parallel machine and an information unit to evaluate the load of the machine. These units can implement different load balancing mechanisms, privileging either a (class of) machine architecture or a (class of) application structure. A dynamic scheduling system (DSS for short) is particularly interesting in case of an unpredictable behavior at execution time. Irregular applications and networks of workstations (NOW) are typical cases where the execution behavior is unknown a priori. Usually a DSS for a tuple {application, machine} is specified by an entry in a scheduling classification, and is implemented as a black box: it is difficult to tune it or reuse it for another application or machine without rewriting its code totally or partially. Some works in the literature discuss this problem and present possible solutions, like [4] where scheduling services are grouped in families and presented as function prototypes. In this paper we discuss the gap between the scheduling classifications and the implementation of DSSs. We present a modular structure for general scheduling
† CAPES-COFECUB Brazilian scholarship
‡ Supported by CNRS, INPG, INRIA et UJF
systems. This general architecture, presented as a framework, aims at avoiding the introduction of additional constraints to implement the different classes of scheduling. This framework introduces two new functional units, a work creation unit and an executor unit, and specifies an interface between the components.
2 General Organization of a Dynamic Scheduling System
In this work a DSS is implemented at software level to control the execution of an application on a parallel architecture. It was conceived to be independent from both application and hardware and to support various classes of scheduling algorithms. The DSS runs on a parallel machine composed of several computing nodes, each one having a local memory and at least one real processor. There is also a support for communications among them.
The notion of job. Units of work are produced at the application level; from them, jobs are built and inserted in a graph of precedence (DAG), denoted G. G evolves dynamically; it represents at each instant of execution the current state of the application jobs. Each node in G represents a job; each edge, a precedence constraint between two jobs. So, a job is the basic element to be scheduled; it corresponds to a finite sequence of instructions, and has properties to allow its manipulation by the scheduler. At a given instant of time, a job is in one of the following states: waiting, the job has precedence constraints in G; ready, the job can be executed; preempted, the execution was interrupted; running, the job's instructions are executed; or completed, the precedence constraints in G imposed by the job are relaxed.
Correction of a scheduling algorithm. The scheduling manages the execution of jobs on the machine resources. For this, any DSS has to respect the following properties, which define its semantics: (P1) for the application: only jobs in the ready state can be put in the running state by the DSS; (P2) for the computing resources: if some resources of the machine (nodes) are idle while there are jobs in the ready state, at least one job will be put by the DSS in the running state after a finite delay. This delay, which may sometimes be bounded, defines the liveness of the DSS. A DSS verifying those properties is said to be correct.
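The job life cycle and the property (P1) guard can be made concrete with a small sketch; the C fragment below is illustrative only (the types and function names are ours, not part of the specification).

typedef enum { WAITING, READY, PREEMPTED, RUNNING, COMPLETED } JobState;

typedef struct Job {
    int      id;
    JobState state;
    int      pending_preds;   /* unsatisfied precedence constraints in G */
} Job;

/* (P1): the DSS may start a job only when it is ready. */
int dss_start(Job *j)
{
    if (j->state != READY)
        return 0;             /* refuse: starting it would violate (P1)  */
    j->state = RUNNING;
    return 1;
}

/* When a job completes, relax the constraints it imposes in G. */
void dss_complete(Job *j, Job **children, int n_children)
{
    j->state = COMPLETED;
    for (int i = 0; i < n_children; i++)
        if (--children[i]->pending_preds == 0 && children[i]->state == WAITING)
            children[i]->state = READY;
}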
3 A Modular Architecture for Parallel Scheduling
The DSS is implemented as a framework based on four modules (Fig. 1): a Job Builder, to construct and manage a graph of jobs; a Scheduling Policy, where the scheduling algorithm is implemented; a Load Level, to evaluate the load of the machine; and an Executor, which implements a pool of threads used to execute the jobs. Each module defines an interface to its services. An access to one of them represents a read or a write operation, both accomplished in a nonblocking fashion. A read operation is a request for data; it allows a module to
~SchedulingPolIc~.. ~
::::re
"........
IP- ~
St at us changi ng
c.
I~ Executor
Fig. 1. Internal communication scheme between modules on the framework.
Job Builder unit. Its role is to build jobs fl'om the tasks defined by the application, eventually giving access to their dependencies. So, it manages the evolutions of G. This module is attached to the application to receive and handle the tasks creation requests. The Job Builder unit must be informed when a job terminates to update the states of the jobs in G. In Fig. 1, arrow G shows the task Executor sending a job termination event to the Job Builder. Scheduling Policy unit This is the core of the scheduling framework; it takes decisions about the placement of the jobs. In order to know how the application evolves, the Scheduling Policy receives from the Job Builder every change in G, e.g. when a job is created (arrow h in Fig. 1) or when its state changed (arrow B). To place the jobs to respect the load balancing scheme, the Scheduling Policy may consider the load on the parallel machine (arrow D). Notice that after taking the decisions, the load information must be updated (arrow D'). The Scheduling Policy must also react to two signals: an irregular load distribution detected by the Load Level (arrow C) and the requests of jobs to execute (arrow E) from the Executor. It can also create a new virtual processor (arrow F) or destruct one
(arrow F'). Executor unit. The Executor is implemented over a support that provides finegrained activities; it furnishes virtual processors, called threads, to process the execution of the application jobs. Each thread executes an infinite loop of two actions: 1) send a request (arrow g) of a job to the Scheduling Policy; and 2) execute sequentially the job's instructions. The load information is updated after and before the execution of each job (arrow H). Load Level unit. The Load Level unit collects the load informations in the computing system. These data are updated after the requests for write operations from the Scheduling Policy and Executor units. If an irregular distribution of the load in the machine is detected, this unit sends a message to the Scheduling Policy unit (arrow C). The definition of the load and the notion of irregular distribution depends on the Scheduling Policy unit.
376
4
Implemented Environments
This general specification has been used to build Athapascan-GS and GTLB, both DSSs for two distinct parallel programming environments. Athapascan-GS [5] was implemented on top of Athapascan-0 [1] for the Athapascan-1 macro dataflow language (INRIA APACHE Project, LMC laboratory) to support static and dynamic scheduling algorithms. Athapascan-GS can choose in the set of ready jobs the one that will be triggered in order to optimize a specific index of performance, like memory space, execution time or communication. The Generic Threads Load Balancer (GTLB) [3] defines a generic scheduler for highly irregular applications of tree search that belongs to the Branch&Bound family; it was implemented on the top of PM 2 (ESPACE project, LIFL laboratory) to support schedulers based on algorithms of mapping with migration. On both implementations, this design provides reusability opportunity in the development of specific scheduling algorithms.
5
Conclusion
In this paper, we have proposed a specification design to implement dynamic scheduling systems. This work completes classical classifications which analyze only two units of scheduling algorithms, dedicated respectively to the control and the load information. We claim that such a distinction is not precise enough; the major argument is that it does not take into account the relationships with the application (that produces tasks) and the execution (that consumes tasks). Two new modules are proposed: one dedicated to the work creation and another to the execution support, a protocol that specifies their interactions was also presented. Two examples of DSSs implementing this general structure were presented: Athapascan-GS and GTLB, both supporting various scheduling algorithms.
References 1. J. Briat, I. Ginzburg, M. Pasin and B. Plateau. Athapascan Runtime: Efficiency for Irregular Problems. In Proc. of the 3th Euro-Par Conference. Passau, Aug. 1997. 2. T.L. Casavant and J.G. Kuhl. A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems. [EEE Trans. Soft. Eng.. V. 14(2): 141-154, Fev. 1988. 3. Y. Dermeulin. Conception et ordonnancement des applications hautement irrgulires dans un contexte de paralllisme grain fin. PhD Thesis, Universit de Lille, Jan. 1998. 4. C. Jacqmot. Load Management in Distributed Computing Systems: Towards Adaptative Strategies. DII, Universit Catholique de Louvain, PhD Thesis, Louvain-laNeuve, Jan. 1996. 5. J.-L. Roch et all. Athapascan-1. Apache Project, Grenoble, http://www-apache, imag .fr Oct. 1997.
6. M.H. Willebeek-LeMair and A.P. Reeves. Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Trans. Par. and Dist. Syst. V. 4(9): 979-993, Sept. 1993.
Feedback Guided Dynamic Loop Scheduling: Algorithms and Experiments J. Mark Bull Centre for Novel Computing, Dept. of Computer Science, University of Manchester, M13 9PL, U.K. ~arkb@cs. man. ac. uk
A b s t r a c t . Dynamic loop scheduling algorithms can suffer from overheads due to synchronisation, loss of locality and small iteration counts. We observe that timing information from previous executions of the loop can be utilised to reduce these overheads. We introduce two new algorithms for dynamic loop scheduling which implement this type of feedback guidance, and report experimental results on a distributed shared memory architecture. Under appropriate circumstances, these algorithms are observed to give significant performance gains over existing loop scheduling techniques.
1
Introduction
Minimising load imbalance is a key activity in producing efficient implementations of applications on parallel architectures. Since loops are the most significant source of parallelism in m a n y applications, the scheduling of loop iterations to processors can be an i m p o r t a n t factor in determining performance. Most of the existing techniques for dynamic loop scheduling on shared m e m o r y machines are variants of, or are derived from, guided self-scheduling (GSS) [6]. These techniques share some c o m m o n characteristics--they assume no prior knowledge of the workload associated with the iterations of the loop, and they all divide the loop iterations into a set of chunks, where there are more chunks t h a n processors. The key observations that motivate the new algorithms presented here are the following: - Some prior knowledge of the workload associated with the iterations of the loop m a y be available, particularly if we can assume t h a t the workload has not changed too much since the previous execution of the loop. This is often the case in simulation of physical systems where the parallel loop is over a spatial domain and is executed at every time-step. - Dividing the loop iterations into more chunks t h a n processors can hurt performance, as it m a y incur overheads associated with additional synchronisation, loss of both inter- and intra-processor locality, and reduced efficiency of loop unrolling or pipelining. Our algorithms are designed to utilise knowledge about the workload derived from the measured execution time of previous occurrences of the loop with the aim of reducing the incurred overheads.
378
2 2.1
Feedback-Guided Scheduling Algorithms Self-scheduling
Algorithms
The simplest scheduling algorithm of all is block or static scheduling, which divides the number of the loop iterations by the number of processors as equally as possible. No attempt is made to dynamically balance the load, but overheads are kept to a minimum: no synchronisation is required and the chunk size is as large as possible. Of dynamic algorithms, the simplest is self-scheduling [8] in which a central queue of iterations is maintained. As soon as a processor finishes an iteration, it removes the next available iteration from the queue and executes it. Chunk self-scheduling [3] allows each processor to remove a fixed number k of iterations from the queue instead of one. This reduces the overheads at the risk of poorer load balance. Guided self-scheduling (GSS) [6] attempts to overcome the difficulty of choosing a chunk size by dynamically varying the chunk size as the execution of the loop progresses, starting with large chunks and making them progressively smaller. There is also the possibility that the load may not be well bManced-for example if the loop has N iterations and the first Nip iterations contain more than 1/pth of the workload. A number of algorithms have been proposed which are essentially sfinilar to GSS, and differ m~inly in the manner in which the chunk size is decreased as the algorithm progresses. These include adaptive guided self-scheduling [1], factoring [2], tapering [4] and trapezoid self-scheduling
[9]. Affinity scheduling [5] uses per-processor work queues in order to reduce contention between processors. Each processor initiMly has on its queue the iterations it would have been assigned under block scheduling. Another benefit of affinity scheduling is that it allows temporal locality of data across multiple executions of the parallel loop to be exploited. Variants of affinity scheduling to reduce synchronisation costs have been proposed in [7] and [10].
2.2
New Algorithms
In order to exploit timing information, we require access to a fast, accurate hardware clock. Since our objective is to keep the chunk size a large as possible, we only permit measurement of the execution time of whole chunks, and not of iterations within a chunk. The first Mgorithm we describe is a feedback guided version of block scheduling. We divide the loop iterations into p chunks, not necessarily of equal size, where p is the number of processors. On each execution of the parallel loop we measure the total execution time for the chunk of iterations assigned to each processor. Dividing this time by the number of iterations on each processor gives us a mean load per iteration for that chunk, and hence a piecewise constant approximation to the true workload. We then choose new bounds for each processor to use for the next execution of the loop, based on an equipartition of the area under this piecewise constant function. By keeping a running total of the
379
O(p)
areas under each constant piece, this can be achieved in time. This could be reduced to O(log2(p)), using parallel prefix operations, but this would only be worthwhile for large values of p. The second algorithm is a feedback guided version of affinity scheduling. On each execution of the parallel loop we measure the total execution time for the chunk of iterations initially assigned to each processor. We derive a mean load per iteration, and hence a pieeewise constant approximation to the true workload in the same way as for the feedback' guided block algorithm. Equipartitioning this approximation to the workload gives us the initial assignment of iterations for the next execution of the loop. This is similar in spirit to the algorithm of [7], which initialises each processor with the same number of iterations as it executed on the previous execution of the loop.
and subsequentlyexecutedby
dynamica]finity
3 3.1
Experimental Evaluation Test Problems
To test our algorithms we use a workload consisting of a Gaussian distribution whose midpoint oscillates sinusoidally, defined at the ith iteration and the tth loop execution by
{ (i-(N/2+Nsin(2~rt/v)/4~ 2} where N is the number of iterations, and v is the period of the oscillation. By varying r we can control how rapidly the load distribution changes from one execution of the loop to the n e x t - - a large value gives a nearly static load, a small value gives a rapidly varying one. We execute the parallel loop 1000 times with values of the period v varying between 10 and 1000. We use three different loop bodies. Loop Body 1 simply causes the processor to spin for a time where D is a fixed (wall clock) time period:
Dw~
for (i=0; i
) Loop Body 2 increments the ith element of an array a number of times proportional to w~: for (i=0; i
380 For small chunk sizes, more than one processor at a time may be updating the same cache line of a, resulting in a significant number of coherency misses in a cache coherent system. In Loop Body 3 each iteration updates elements in a column of a two-dimensional array using the corresponding element of a second array and its four nearest neighbours, where the number of elements updated is proportional to w~. for (i=l; i<=N; i++){ khi = N * load(i); for (k=l; k=
On alternate executions of the loop the roles of the two arrays are reversed: the elements of a are used to update b. This loop body is designed to test the Mgorithms' ability to exploit data affinity between subsequent executions. We can now define the four test problems used: Problem Problem Problem Problem 3.2
1 2 3 4
Loop Loop Loop Loop
body body body body
1, 1, 2, 3,
N N N N
= = = =
1000, D = 10 -4 seconds. 1000, D = 10 -3 seconds. 10000. 1000.
Results and Discussion
We ran each of the four test problems on 16 processors of a Silicon Graphics Origin 2000 system (with 195 MHz R10000 processors), using the following scheduling algorithms: block, feedback guided (FG) block, affinity, feedback guided affinity, simple self-scheduling (SS), guided self-scheduling (GSS) and trapezoid self-scheduling (TSS), recording the execution time Tp. We also ran the block scheduling algorithm on one processor to give a sequential reference time Ts. The computational cost of feedback guidance is low: each call to the hardware clock requires approximately 0.35#s and the computation of the new partition for 16 processors approximately 45#s. Figures 1 to 4 show the efficiency (computed as Ts/pTv) as a function of the period r of the oscillating workload. For Problem 1 the workload on each iteration is sufficiently small for the overheads of synchronised access to work queues to dominate the efficiency of the self-scheduling algorithms. The result of this is that simple self-scheduling (which requires O(N) accesses to the central queue), and affinity scheduling (which requires O(plog(g/p2)) accesses to each of the p local queues) perform no better than block scheduling. GSS with O(plog N) accesses and TSS with 4p accesses to the central queue respectively, are more efficient. Adding feedback guidance to affinity scheduling does nothing to reduce the total number of queue accesses, even though there are fewer accesses to queues of other processors, and the additional overhead of timing each chunk of iterations is incurred. FG block
381
,r'
"
|
1.2
:,;-:-"
6],;d
.
F G Bk~ck ~ ' - Affinity - o - FG Affinity --)~.... SS ~a-.G S S --.X6TSS -O---
/
0 98 [ -
0.6
.
.
.
.
.
.
9 b]dck" q , : - : - " F G Bh,ck - + - Affinity -E~- F G Affinity --X.... SS - ~ . . G S S -X--
.
1 0.8
"
"
'"
....
'
m.._---+ ..-+-"
.
~. _ .~. _+_____4
o- -- -- .o-- -- o -- -o-- >~- o -- -- --q- -- -r O.4
.
0,6
-~>- -- -- -e
-
.-+"
m 0.4
.4" 0.2
f
o
10
I(X) Workload period
,"........'- ~i&r~ F G Bh)ck Affinity FG Affinity SS GSS TSS
I 0.8
t 1000
Fig. 2. Problem 2 1
. . . . . . . . . .
-+--ID -")~ "" ~,--"~" -@--
.......
. . . . . . . .
10
Fig. 1. Problem 1 1.2
t . 100 Workload period
')
I(X)O
' '
+ _ . _ .+ A-- f ' ~ ' "
"
0.8
-
0,6
=
" i~),;d~ . . . . . . . . . .
,
/" .4."
']
FG Block - + - " Affinity - D - FG Affilfity --)~....
I
GSS --x-.
|
TSS -~---
"t
/
0,6
~:.~---,~::--o-;~-----o----- o----D--n----- ~
uJ
0.4
0.4 0.2
0,2
0
0 l0
i(X~
Workload period
Fig. 3. Problem 3
L0(K}
\ ...... I
< .... 10
100 Workload period
1000
Fig. 4. Problem 4
scheduling avoids all synchronisation costs, and for sufficiently slowly changing workload (~" _> 100) it is the most efficient of the algorithms tested. For Problem 2 the workload on each iteration is larger, and synchronisation overhead is less important. SS and affinity scheduling successfully balance the load, whereas GSS and TSS (with larger chunks) do not. Feedback guidance does not improve affinity scheduling, but again for sufficiently slowly changing workload (r >_ 200), FG block scheduling shows the best performance. For Problem 3, false-sharing induced communication is the most significant overhead. Both SS and GSS produce many small chunks, with consecutive chunks likely to be executed on different processors. Affinity scheduling also produces many small chunks, but there is a higher likelihood of consecutive chunks being executed on the same processor. TSS produces only 4p chunks, and thus it performs well even though it may not successfully balance the load. Apart from the fastest moving workload (r = 10), FG affinity scheduling is more efficient than affinity scheduling as it further increases the likelihood of consecutive chunks being executed on the same processor. FG block scheduling is again poor for rapidly evolving workload, as it does not balance the load well, but as r increases, the advantage of having only p chunks becomes significant. For these cases, it is by far the most efficient algorithm. For Problem 4, both synchronisation and communication overheads are important. None of the central queue algorithms perform well, as they do not ex-
382 ploit d a t a affinity between executions of the parallel loop. For rapidly changing workload (T < 20), affinity scheduling is the most efficient, but as the workload becomes more predictable, FG affinity scheduling gives the best performance. For small v, FG block scheduling is very poor, but as 7- increases, its performance approaches t h a t of feedback guided affinity scheduling.
4 Conclusions
We have described two algorithms which make use of feedback information to reduce the overheads associated with scheduling a parallel loop in a shared variable programming model. Our experiments show that under certain conditions, when there is sufficient correlation between the workload on successive executions of the parallel loop, these new algorithms will outperform traditional loop self-scheduling methods, sometimes by a considerable margin.
References 1. Eager, D.L. and Zahorjan, J. (1992) Adaptive Guided Self-Scheduling, Technical Report 92-01-01, Department of Computer Science and Engineering, University of Washington. 2. Hummel, S.F., Schonberg, E. and Flynn, L.E. (1992) Factoring: A Practical and Robust Method for Scheduling Parallel Loops, Communications of the ACM, vol. 35, no. 8, pp. 90-101. 3. Kruskal, C.P. and Weiss, A. (1985) Allocating Independent Subtasks on Parallel Processors, IEEE Trans. on Software Engineering, vol. 11, no. 10, pp. 1001-1016. 4. Lucco, S. (1992) A Dynamic Scheduling Method for Irregular Parallel Programs, in Proceedings of ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992, pp. 200-211. 5. Markatos, E.P. and LeBlanc, T.J. (1994) Using Processor Affinity in in Loop Scheduling on Shared Memory Multiproeessors, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400. 6. Polychronopoulos, C. D. and Kuck, D. J. (1987) Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers, C-36(12), pp. 1425-1439. 7. Subramaniam, S. and D.L. Eager (1994) Affinity Scheduling of Unbalanced Workloads, in Proceedings of Supercomputing '94, IEEE Comp. Soc. Press, pp. 214-226. 8. Tang, P. and Tew, P-C. (1986) Processor Self-Scheduling for Multiple Nested Parallel Loops, in Proceedings of 1986 Int. Conf. on Parallel Processing, pp. 528-535, St. Charles, IL. 9. Tzen, T.H. and Ni, L.M., (1993) Trapezoid Self-Scheduling Scheme for Parallel Computers, IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98. 10. Yah, Y., C. fin and X. Zhang (1997) Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems IEEE Trans. on Par. and Dist. Systems, vol. 8, no. 1, pp. 70-81.
Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations
Stefan Bischof, Ralf Ebner, and Thomas Erlebach
Institut für Informatik, Technische Universität München, D-80290 München, Germany
{bischof|ebner|erlebach}@in.tum.de
http://www{mayr|zenger}.in.tum.de/
Abstract. This paper studies load balancing issues for classes of problems with certain bisection properties. A class of problems has α-bisectors if every problem in the class can be subdivided into two subproblems whose weight (i.e. workload) is not smaller than an α-fraction of the original problem. It is shown that the maximum weight of a subproblem produced by Algorithm HF, which partitions a given problem into N subproblems by always subdividing the problem with maximum weight, is at most a factor of ⌈1/α⌉ · (1 − α)^(⌊1/α⌋−2) greater than the theoretical optimum (uniform partition). This bound is proved to be asymptotically tight. Two strategies to use Algorithm HF for load balancing distributed hierarchical finite element simulations are presented. For this purpose, a certain class of weighted binary trees representing the load of such applications is shown to have 1/4-bisectors. The maximum resulting load is at most a factor of 9/4 larger than in a perfectly uniform distribution in this case.
1 Introduction
Load balancing is one of the m a j o r research issues in the context of parallel computing. Irregular problems are often difficult to tackle in parallel because they tend to overload some processors while leaving other processors nearly idle. For these applications it is very important to find methods for obtaining a balanced distribution of load. Usually, the load is created by processes t h a t are part of a (parallel) application program. During the run of the application, these processes perform certain calculations independently but have to c o m m u n i c a t e intermediate results or other d a t a using message passing. One is usuMly interested in achieving a balanced load distribution in order to minimize the execution time of the application or to maximize system throughput. Load balancing problems have been studied for a huge variety of models, and m a n y different solutions regarding strategies and implementation mechanisms have been proposed. A good overview of recent work can be obtained f r o m [5], for example. [6] reviews ongoing research on d y n a m i c load balancing,
emphasizing the presentation of models and strategies within the framework of general classification schemes. In this paper we study load balancing for a very general class of problems. The only assumption we make is that all problems in the class have a certain bisection property. Such classes of problems arise, for example, in the context of distributed hierarchical finite element simulations. We show how our general results can be applied to numerical applications in several ways.
2 Using Bisectors for Load Balancing
In many applications a computational problem cannot be divided into many small problems as required for an efficient parallel solution directly. Instead, a strategy similar to divide and conquer is used repeatedly to divide problems into smaller subproblems. We refer to the division of a problem into two smaller subproblems as bisection. Assuming a weight function w that measures the resource demand, a problem p cannot always be bisected into two subproblems Pl and P2 of equal weight w(p)/2. For many classes of problems, however, there is a bisection method that guarantees that the weights of the two obtained subproblems do not differ too much. The following definition captures this concept more precisely. D e f i n i t i o n 1. Let 0 < a <_ 89 A class 7) of problems with weight function w : P -+ ~ + has a-bisectors if every problem p E 7) can be efficiently divided into two problems Pl e 7) and P2 E 79 with w(pl) + w(p2) = w(p) and w(pl), w(p2) E law(p); (1 - a)w(p)]. Note that this definition requires for the sake of simplicity that all problems in 7) can be bisected, whereas in practice this is not the case for problems whose weight is below a certain threshold. We assume, however, that the problem to be divided among the processors is big enough to allow further bisections until the number of subproblems is equal to the number of processors. This is a reasonable assumption for most relevant parallel applications. Figure 1 shows Algorithm HF, which receives a problem p and a number N of processors as input and divides p into N subproblems by repeated application of a-bisectors to the heaviest remaining subproblem. A perfectly balanced load distribution on N processors would be achieved if a problem p of weight w(p) was divided into N subproblems of weight exactly w ( p ) / N each. The following theorem gives a worst-case bound on the ratio between the maximum weight among the N subproblems produced by Algorithm HF and this ideal weight w ( p ) / N . All proofs are omitted in this paper due to space constraints but can be found in [1]. T h e o r e m 2. Let 7' be a class of problems with weight function w : 7) --+ ]~+ that has a-bisectors. Given a problem p E 7) and a positive integer N , Algorithm H F uses N - 1 bisections to partition p into N subproblems Pl, . . . , PN such that l<_i
' (l -
Input: problem p, positive integer N
begin
  P ← {p};
  while |P| < N do begin
    q ← a problem in P with maximum weight;
    bisect q into q1 and q2;
    P ← (P ∪ {q1, q2}) \ {q};
  end;
  output P;
end.
Fig. 1. Algorithm HF (Heaviest Problem First) It is not difficult to see that the bound given in Theorem 2 is asymptotically tight. This means that for sufficiently large N there exists a sequence of bisections such that the maximum load generated by Algorithm HF is arbitrarily close to this bound. For this purpose, the original problem is first split into a number of subproblems P l , . . . , Pn of equal weight. Each of Pl,. 9 Pn- 1 is then split into [ l / a ] subprobtems using a sequence of [1/aJ - 1 exact a-bisections, whereas Pn is split similarly but without doing the final bisection step. The construction outlined above suggests that the bound from Theorem 2 might not be tight for small values of N, however. Indeed, we can prove that for all a < 1/5 and for all N < i / a , Algorithm HF partitions a problem from a class of problems with a-bisectors into N subproblems such that the weight of the heaviest subproblem is at most w(p)(1 - a) y - 1 . Hence, the ratio between the maximum weight and the ideal average weight is bounded by N(1 - a ) N - l , which is smaller than [ l / a ] ( l - a ) [ 1 / ~ J - 2 for the range where this bound applies.
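For illustration, Algorithm HF admits a very direct implementation; the C sketch below is ours (not from the paper), uses a plain linear scan to find the heaviest subproblem, and assumes a class-specific bisect() routine realising the α-bisector.

typedef struct { double weight; /* ... problem data ... */ } Problem;

/* Assumed to split *in into two subproblems whose weights sum to in->weight. */
void bisect(const Problem *in, Problem *out1, Problem *out2);

/* Partition p into N subproblems, always bisecting the heaviest one
 * (N - 1 bisections in total). */
void algorithm_hf(Problem p, int N, Problem set[])
{
    int count = 1;
    set[0] = p;
    while (count < N) {
        int heaviest = 0;                       /* linear scan for max weight */
        for (int i = 1; i < count; i++)
            if (set[i].weight > set[heaviest].weight)
                heaviest = i;
        Problem q = set[heaviest], q1, q2;
        bisect(&q, &q1, &q2);                   /* one alpha-bisection        */
        set[heaviest] = q1;                     /* replace q by its two parts */
        set[count++] = q2;
    }
}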
3 Application of Algorithm HF to Distributed Finite Element Simulations
In this section, we present the application of Algorithm HF for load balancing in the field of distributed numerical simulations with the finite element (FE) method [2]. The FE method is used in statics analysis, for example, to calculate the response of objects under external load. In this paper, we consider a short cantilever under plane stress conditions, a problem from plane elasticity. The quadratic cantilever is uniformly loaded on its upper side. Its left side is fixed as shown in Figure 2, top left. The definition of the object's shape, its structural properties, and the boundary conditions lead to a system of elliptic partial differential equations. We substructure the physical domain of the cantilever recursively. With the method of [4], a tree data structure is built reflecting the hierarchy of the substructured domain (see Figure 2). This data structure contains a system of linear
Fig. 2. The example application equations Su = f for each tree node. Its stiffness matrix S determines the unknown displacement values in the vector u at the black points shown in Figure 2 (right), dependent on the external and reaction forces f . In the leaves, the system of equations is constructed by standard FE discretization and the so-called virtual displacement method. In an internal node, the system of linear equations is assembled out of the equation systems of its children. For the solution of all those equation systems, we use an iterative solver which traverses the tree several times, promoting displacements in top-down direction and reaction forces in bottom-up direction. Since the adaptive structure of the tree is not known a priori, a good load balancing strategy is essential before the parallel execution of the solving iterations. We assign a load value s = Cbnb(p) + Csn,(p) to each tree node p, with nb(p) black points on the border without boundary conditions, and ns (p) black points (not ghost points) belonging to the separator line of node p. The load value s models the computing time of node p, where the constants C~ and Cb are independent of node p, and Cs ~ 6C6. These values can be easily collected during the tree construction phase. In order to distribute the work load among the processors, we need to know the computing time of whole subtrees of the FE tree. So for each node p, we accumulate the load in the subtree with root p to get the weight value w(p):
w(p)
f s i f p is leaf /?(p) q- w(cl) § w(c~) i f p is root or internal node
where cl and c2 are the children of p. For a bisection step according to Algorithm HF, we remove the root node p from the tree of weight values. This yields two subtrees rooted at the children
387 cl and c2 of p, and the weight of the root node p is ignored for the remainder of the load balancing phase. The results of Theorem 2 are well approximated because i(p) (work load for the one-dimensional separator) is negligible compared to w(p) (work load for the two-dimensional domain) in our application with commensurately large FE trees. Algorithm HF chops N subtrees off the FE tree, each of which can be traversed in parallel by the iterative solver. These N subtrees contain the main part of the solving work and may be distributed over the available N processors, whereas the upper N - 1 tree nodes cannot exploit the whole number of processors, anyway. Another strategy divides the entire given FE tree into N partitions of approximately equal size by cutting edges (and not root nodes of subtrees). For this purpose, we set w(p) = ~(p) in each tree node. So this application of Algorithm HF takes into account that the main memory resources of the processors become the limiting factor if very high accuracy of the simulation is required. 4
Weighted
Trees
with
Good
Bisectors
Let T be the set of all rooted binary trees with node weights ~(v) satisfying: (1) e(p) _< t(cl) + g(c2) for nodes p with two children cl and c2 (2) g(p) > g(e) if e is a child of p The weight of a tree T = (V, E) in T is defined as w(T) = ~ v e y ~(v). This class T of binary trees models the load of applications in hierarchical finite element simulations, as discussed in Section 3. Recall that in these applications the domain of the computation is repeatedly subdivided into smaller subdomains. The structure of the domains and subdomains yields a binary tree in which every node has either two children or is a leaf. The resource demands (CPU and main memory) of the nodes in this FE tree are such that the resource demand at a node is at most as large as the sum of the resource demands of its two children. Note that (1) and (2) ensure that the two subtrees obtained from a tree in T by removing a single edge are also members of T. The following theorem shows that trees from the class T can be 88 by removal of a single edge unless the weight of the tree is concentrated in the root. T h e o r e m 3. Let T = ( V , E ) be a tree in T , and let r be its root. If ~(r) 3w(T), then there is an edge e 9 E such that the removal ore partitions T into
s btrees T1 and
with w(Tl),w(T ) 9
1 According to Theorem 2 a problem p from a class of problems that has ~bisectors can always be subdivided into N subproblems Pl, . . . , PN such that maxl
388 subtrees by cutting exactly N - 1 edges such that the maximum weight of the resulting subtrees is at most ~9 9 w(T) N "
Note that an optimal rain-max k-partition of a weighted tree (i.e., a partition with minimum weight of the heaviest component after removing k edges) can be computed in linear time [3]. These algorithms are preferable to our approach using Algorithm HF in the case of trees that are to be subdivided by removing a minimum number of edges. Since the heaviest subtree in the optimal solution does obviously not have a greater weight than the maximum generated by Algorithm HF, the bound from Corollary 4 still applies. 5
Conclusion
The existence of s-bisectors for a class of problems was shown to allow good load balancing for a surprisingly large range of values of c~. The m a x i m u m load achieved by Algorithm HF is at most a factor of [ l / a ] 9 (1 - a)L1/~J-2 larger than the theoretical optimum (uniform distribution). Load bMancing for distributed hierarchical FE simulations was discussed, and two strategies for applying Algorithm HF were presented. The first strategy tries to make the best use of the available parallelism, but requires that the nodes of the FE tree representing the load of the computation have good separators. The second strategy tries to partition the entire FE tree into subtrees with approximately equal load. For this purpose, it was proved that a certain class of weighted trees, which include FE trees, has 1/4-bisectors. Here, the trees are bisected by removing a single edge. Partitioning the trees by removing a minimum number of edges ensures that only a minimum number of communication channels of the application must be realized by network connections. Our results provide worst-case bounds on the maximum load achieved for applications with good bisectors in general and for distributed hierarchical finite element simulations in particular. For the latter application, we showed that the m a x i m u m resulting load is at most a factor of 9/4 larger than in a perfectly uniform distribution. At present we are implementing the load balancing methods proposed in this paper and integrating them into the existing finite element simulations software. We expect considerable improvements as compared to the static (compile-time) processor allocation currently in use. See [1] for first results. References 1. Stefan Bischof, Ralf Ebner, and Thomas Erlebach. Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations: Worst-case Analysis and Practical Results. SFB-Bericht 342/05/98 A, SFB 342, Institut ffir Informatik, Technische Universits Miinchen, 1998. h t t p ://wwmaayr. in. tum. d e / b e r i c h t e/1998/TUM-I9811 .ps. gz. 2. Dietrich Braess. Finite Elemente. Springer, Berlin, 1997. 2. iiberaa'beitete Auftage. 3. Greg N. Frederickson. Optimal Algorithms for Tree Partitioning. In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms SODA '91,
pages 168-177, New York, 1991. ACM Press.
389 4. Reiner Hiittl. Ein iteratives LSsungsverfahren bei der Finite-Element-Methode unter Verwendung yon rekursiver Substrukturierun9 und hierarchischen Basen. PhD thesis, Institut fiir Informatik, Technische Universit~.t Miirtchert, 1996. 5. Behrooz A. Shir&zi, All R. Hurson, and Krishna M. Kavi. Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, 1995. 6. Thomas Schnekenburger and Georg Stellner, editors. Dynamic Load Distribution for Parallel Applications. TEUBNER-TEXTE zur Informatik. Teubner Verlag, Stuttgart, 1997.
An Efficient Strategy for Task Duplication in Multiport Message-Passing Systems Dingchao Li 1, Yuji Iwahori 1, Tatsuya Hayashi 2 and Naohiro Ishii ~ 1 Educational Center for Information Processing 2 Department of Electrical and Computer Engineering 3 Department of Intelligence and Computer Science Nagoya Institute of Technology 1 iding0cent er .nitech. ac. jp
A b s t r a c t . important problem in compilers for parallel machines. In this paper, we present a duplication strategy for task scheduling in multiport message-passing systems. Through a performance gain analysis, we establish a condition under which duplicating a parent task of a task is beneficial. We also show that, by incorporating this strategy into two well-known priority-based scheduling algorithms, significant reductions in the execution time can be achieved.
1
Introduction
Scheduling is a sequential optimization technique used to exploit the parallelism inherent in programs. In this paper, we consider the problem of scheduling the tasks of a given program in a multiport message-passing system with the aim of minimizing the overall execution time of the program. Researchers have investigated two versions of this problem, depending on whether or not task duplication (or recomputation) is allowed. In general, for the same program, task scheduling with duplication produces a schedule with a smaller makespan (i.e., total execution time) than when task duplication is not allowed. Some duplication based scheduling algorithms have been introduced in [2,4, 5, 7]. However, all of these algorithms assume the availability of unlimited number of processors. If the number of processors available is less than the number of the processors they require, there could be a problem. In this paper, we aim to develop an efficient and practical algorithm with task duplication for scheduling tasks onto a bounded number of available processors. In this area, the work described in [8] is close to ours. The algorithm determines the tasks on the critical paths of a given program before scheduling and tries to select such tasks for duplication in every scheduling step. However, a critical path may be shortened and new critical paths may be generated during the scheduling process, because assigning a task and a parent task of the task to a single processor makes the communication overhead between them become zero. Therefore, our algorithm does not attempt to identify whether a task is critical. Instead, it dynamically analyzes the increase over the program execution time, created possibly by duplicating a parent task of a task, to guide a duplication decision. We will show
that incorporating this strategy into some known scheduling algorithms such as the mapping heuristic [9] and the dynamic level method [10] can improve their performance without introducing additional computational complexity.

The remainder of the paper is organized as follows: system and program models are described in Section 2; in Section 3, we present a duplication strategy for task scheduling. The performance evaluation of the strategy is discussed in Section 4. Finally, Section 5 gives some concluding remarks.

2 System and Program Models
We consider a system consisting of m identical programmable processors Pi (i = 1, ..., m), each with a private or local memory and fully connected by a multiport interconnection network. In this system, the tasks of a parallel program mapped to different processors communicate solely by message-passing. In every communication step, a processor can send distinct messages to other processors and simultaneously receive multiple messages from these processors. We assume that communication can be overlapped by computation and that the communication time can be neglected if two tasks are mapped to the same processor.

A parallel program is modeled as a weighted directed acyclic graph, or task graph. Let G = (Γ, A, μ, c) denote a task graph, where Γ is a finite set of tasks Ti (i = 1, ..., n), A is a set of directed arcs representing dependence constraints among tasks, μ is an execution time function whose value μ(Ti) (time units) is the execution time of task Ti, and c is a communication cost function whose value c(Ti, Tj) denotes the amount of communication from Ti to Tj. The arc (Ti, Tj) from Ti to Tj asserts that task Tj cannot start execution until the message from Ti arrives at the processor executing Tj. If (Ti, Tj) ∈ A, Ti is said to be a parent task of Tj and Tj is said to be a child task of Ti. In task graph G, tasks without parent tasks are known as entry tasks and tasks without child tasks are known as exit tasks. We assume, without loss of generality, that G has exactly one entry task and exactly one exit task.

For task Ti, let est(Ti) and ect(Ti) denote the earliest start and completion times, respectively. The earliest start and completion times can be determined in a top-down fashion, starting with the entry task and terminating at the exit task. Mathematically,
est(Ti) = min_{(Tj,Ti) ∈ A} max_{(Tk,Ti) ∈ A, k ≠ j} { ect(Tj), ect(Tk) + c(Tk, Ti) },    (1)

ect(Ti) = est(Ti) + μ(Ti).    (2)
Similarly, let lst(Ti) and lct(Ti) denote the latest start and completion times of Ti, respectively. The latest start and completion times of Ti can be given by
lct(Ti) = min_{(Ti,Tj) ∈ A} { lst(Tj) − c(Ti, Tj) },    (3)

lst(Ti) = lct(Ti) − μ(Ti).    (4)
Note that the latest start time of Ti indicates how long the start of Ti can be delayed without increasing the minimum execution time of the graph.
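As an illustration only (the paper gives no code), the following sketch computes these four timing attributes for a task graph held in a small ad hoc container. The class, function names and the tiny example graph are our own assumptions; the formulas follow equations (1)-(4) as reconstructed above.

from collections import defaultdict

class TaskGraph:
    def __init__(self):
        self.mu = {}                       # execution time mu(Ti) of each task
        self.c = {}                        # communication cost c(Ti, Tj) of each arc
        self.parents = defaultdict(list)   # Tj such that (Tj, Ti) in A
        self.children = defaultdict(list)  # Tj such that (Ti, Tj) in A

    def add_task(self, t, mu):
        self.mu[t] = mu

    def add_arc(self, u, v, cost):
        self.c[(u, v)] = cost
        self.parents[v].append(u)
        self.children[u].append(v)

def earliest_times(g, topo_order):
    # top-down pass: one parent may be colocated (its message cost is zeroed),
    # as in equations (1)-(2); topo_order is a topological ordering of the tasks
    est, ect = {}, {}
    for t in topo_order:
        ps = g.parents[t]
        if not ps:
            est[t] = 0
        else:
            candidates = []
            for j in ps:                   # choice of the colocated parent j
                arrival = [ect[j]] + [ect[k] + g.c[(k, t)] for k in ps if k != j]
                candidates.append(max(arrival))
            est[t] = min(candidates)
        ect[t] = est[t] + g.mu[t]
    return est, ect

def latest_times(g, topo_order, makespan):
    # bottom-up pass, as in equations (3)-(4)
    lct, lst = {}, {}
    for t in reversed(topo_order):
        cs = g.children[t]
        lct[t] = makespan if not cs else min(lst[j] - g.c[(t, j)] for j in cs)
        lst[t] = lct[t] - g.mu[t]
    return lst, lct

# hypothetical 4-task graph: T1 -> {T2, T3} -> T4
g = TaskGraph()
for t, w in [("T1", 2), ("T2", 3), ("T3", 1), ("T4", 2)]:
    g.add_task(t, w)
for u, v, w in [("T1", "T2", 5), ("T1", "T3", 4), ("T2", "T4", 6), ("T3", "T4", 2)]:
    g.add_arc(u, v, w)
order = ["T1", "T2", "T3", "T4"]
est, ect = earliest_times(g, order)
lst, lct = latest_times(g, order, makespan=max(ect.values()))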
3 Scheduling with Duplication
For the machine and program models described above, this section develops a duplication-based algorithm in an attempt to find a one-to-one mapping from each task to its assigned processor such that the program execution time is minimized. Our algorithm consists of two parts. The first part decides which ready task should be assigned to an idle processor. This part is heuristic in nature, and we can make the choice by applying one of the existing strategies such as HLF (Highest Level First), ETF (Earliest Task First), LP (Longest Path), LPT (Longest Processing Time) and CP (Critical Path). Given the task selected in the first part, the second part decides whether a parent of the task should be duplicated.
Fig. 1. The task duplication strategy. (a) The schedule before duplication. (b) The schedule after duplication.
Before discussing our task duplication strategy, we introduce cm to represent the current time of the event clock, P(Tu) to represent the processor that processes task Tu, and st(P(Tu), Tu) to represent the time when Tu starts execution on P(Tu). We also assume that processor Pk is idle at cm and that task Tw is selected for scheduling, as shown in Fig. 1. The start time at which task Tw can run on processor Pk in the case of no task duplication is thus given by
st_nocopy(Pk, Tw) = max_{(Tu,Tw) ∈ A, P(Tu) ≠ Pk} { cm, st(P(Tu), Tu) + μ(Tu) + c(Tu, Tw) }.    (5)
Consider the fact that if the start time of a task is later than the latest start time of the task, then the overall completion time of the whole graph will be increased. Thus, we can define the increase incurred due to the execution of task Tw on processor Pk as follows.
δ_nocopy(Pk, Tw) = st_nocopy(Pk, Tw) − lst(Tw).    (6)

Obviously, if δ_nocopy(Pk, Tw) ≤ 0, we should stop duplicating the parent tasks of Tw and assign Tw to processor Pk, since duplicating its parent tasks is unable to shorten the schedule length even if its start time is reduced. Assume now that δ_nocopy(Pk, Tw) > 0. To minimize the overall execution time, we need to consider task duplication for reducing this increase as much as possible. Let us use mt(Tu, Tw) to denote the time when the message for task Tw from Tu arrives at processor P(Tw); i.e., mt(Tu, Tw) = st(P(Tu), Tu) + μ(Tu) + c(Tu, Tw) if P(Tu) ≠ P(Tw). From Fig. 1, it is clear that we should select as the candidate for duplication a task Tv such that
mt(Tv, Tw) = max_{(Tu,Tw) ∈ A, P(Tu) ≠ Pk} { st(P(Tu), Tu) + μ(Tu) + c(Tu, Tw) }.
Further, let it(Pk) denote the time when processor Pk becomes idle. Thus, the time at which Tv can start execution on Pk can be calculated as below.
st(Pk, Tv) = max_{(Tu,Tv) ∈ A, P(Tu) ≠ Pk} { it(Pk), st(P(Tu), Tu) + μ(Tu) + c(Tu, Tv) }.    (7)
Consequently, in this case, the time at which task Tw can start execution on processor Pk is
st_copy(Pk, Tw) = max_{(Tu,Tw) ∈ A, P(Tu) ≠ Pk, u ≠ v} { cm, st(P(Tu), Tu) + μ(Tu) + c(Tu, Tw), st(Pk, Tv) + μ(Tv) }.    (8)
This produces an increase as follows.

δ_copy(Pk, Tw) = st_copy(Pk, Tw) − lst(Tw).    (9)
Comparing δ_copy(Pk, Tw) with δ_nocopy(Pk, Tw), we can finally establish the following condition: duplicating a parent task of Tw to the current processor should only occur if
δ_copy(Pk, Tw) < δ_nocopy(Pk, Tw).    (10)

The above duplication strategy clearly takes O(p) time in the worst case, where p is the maximum number of parent tasks of any task in G. Therefore, incorporating this strategy into some known scheduling algorithms such as the mapping heuristic [9] and the dynamic level method [10] does not introduce additional time complexity, since these algorithms generally require O(nm) time for selecting a ready task and an idle processor for assignment, where n is the number of tasks and m is the number of processors.
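To make the decision procedure concrete, the following sketch applies our reading of equations (5)-(10); it is not the authors' implementation. It reuses the task-graph container from the sketch in Section 2, and the dictionaries P, st, it and lst holding the current partial schedule are illustrative assumptions.

def should_duplicate(g, Tw, Pk, cm, P, st, it, lst):
    # P[u]: processor of task u; st[u]: start time of u on P[u];
    # it[p]: idle time of processor p; lst[u]: latest start time of u
    remote = [u for u in g.parents[Tw] if P[u] != Pk]
    if not remote:
        return False, cm                                  # nothing to duplicate
    arrival = {u: st[u] + g.mu[u] + g.c[(u, Tw)] for u in remote}
    st_nocopy = max([cm] + list(arrival.values()))        # equation (5)
    delta_nocopy = st_nocopy - lst[Tw]                    # equation (6)
    if delta_nocopy <= 0:
        return False, st_nocopy                           # no gain possible
    Tv = max(remote, key=arrival.get)                     # parent whose message arrives last
    st_v = max([it[Pk]] + [st[u] + g.mu[u] + g.c[(u, Tv)]
                           for u in g.parents[Tv] if P[u] != Pk])   # equation (7)
    st_copy = max([cm, st_v + g.mu[Tv]] +
                  [arrival[u] for u in remote if u != Tv])          # equation (8)
    delta_copy = st_copy - lst[Tw]                        # equation (9)
    duplicate = delta_copy < delta_nocopy                 # condition (10)
    return duplicate, (st_copy if duplicate else st_nocopy)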
4 Experimental Study
This section presents an experimental study of the effects of the task duplication strategy described in Section 3. By incorporating the strategy into the mapping heuristic (MH) and the dynamic level method (DL) without the global clock heuristic, we implemented two scheduling algorithms on a Sun workstation using the programming language C. Although we compared the performance of each enhanced algorithm with that of the original algorithm under various parameter settings, due to space limitations we show only some of the salient results here.

In our experiments, we generated 350 random task graphs and scheduled them onto a machine consisting of eight identical processors, fully interconnected through full-duplex interprocessor links. The randomly generated task graphs ranged in size from 40 to 160 nodes with increments of 20. Each graph was constructed by generating a directed acyclic graph and removing all transitive arcs. For each vertex, the execution time and the number of child tasks were picked from a uniform distribution whose parameters are input data. The communication time of each arc was determined in the same way. The communication-to-computation (C/C) ratio was set to 5.0, where the C/C ratio of a task graph is defined as the average communication time per arc divided by the average execution time per task. The performance analysis was based on the average schedule length obtained by each algorithm.

Tables 1 and 2 show the experimental results, where Impr. represents the percentage performance improvement of each enhanced algorithm compared with that of the original algorithm. It can be seen that the performance of each algorithm with duplication is consistently better than that of the algorithm without duplication. The performance improvement over the original mapping heuristic varies from 4.01% to 8.17%, and the performance improvement over the original dynamic level algorithm varies from 3.42% to 6.90%. These results confirm our expectation that the proposed duplication strategy is efficient and practical.

Table 1. Results from the mapping heuristics with and without duplication.
No. of graphs  No. of tasks  Avg. sche. leng. MH without dup.  Avg. sche. leng. MH with dup.  Impr. (%)
50             40            649.58                            609.64                         6.15
50             60            860.88                            826.34                         4.01
50             80            1093.38                           1023.16                        6.42
50             100           1310.74                           1231.68                        6.03
50             120           1515.46                           1401.70                        7.51
50             140           1711.06                           1583.58                        7.45
50             160           1962.92                           1802.51                        8.17
Table 2. Results from the dynamic level algorithms with and without duplication.

No. of graphs  No. of tasks  Avg. sche. leng. DL without dup.  Avg. sche. leng. DL with dup.  Impr. (%)
50             40            602.60                            572.41                         5.01
50             60            788.61                            761.66                         3.42
50             80            1037.65                           993.82                         4.22
50             100           1226.73                           1172.94                        4.38
50             120           1454.92                           1367.88                        5.98
50             140           1643.25                           1536.92                        6.47
50             160           1891.74                           1761.24                        6.90
5 Conclusions
We have presented a novel strategy for task duplication in multiport message-passing systems. The strategy is based on the analysis of the increase over the total program execution time, created possibly by duplicating a parent task of a task. Experimental results have shown that incorporating it into two well-known scheduling algorithms either produces the same assignments as the original algorithms, or produces better assignments. In addition, the costs, both in implementation effort and compile time, are very low. Future work includes a more complete investigation of the impact of varying the communication-to-computation ratio. Experimental studies on various multiprocessor platforms are also needed.
Acknowledgment

This work was supported in part by the Ministry of Education, Science and Culture under Grant No. 09780263, and by a grant from the Artificial Intelligence Research Promotion Foundation under contract number 9AI252-9. We would like to thank the anonymous referees for their valuable comments.
References
1. B. Kruatrachue and T.G. Lewis, Grain Size Determination for Parallel Processing, IEEE Software, pp. 23-32, Jan. 1988.
2. J.Y. Colin and P. Chrétienne, C.P.M. Scheduling with Small Communication Delays and Task Duplication, Operations Research, vol. 39, no. 4, pp. 680-684, July 1991.
3. Y.C. Chung and S. Ranka, Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors, Proc. of Supercomputing'92, pp. 512-521, 1992.
4. H.B. Chen, B. Shirazi, K. Kavi, and A.R. Hurson, Static Scheduling Using Linear Clustering with Task Duplication, Proc. of ISCA International Conference on Parallel and Distributed Computing and Systems, pp. 285-290, 1993.
5. J. Siddhiwala and L.F. Chao, Path-Based Task Replication for Scheduling with Communication Costs, Proc. of the 1995 International Conference on Parallel Processing, vol. II, pp. 186-190, 1995.
6. M.A. Palis, J. Liou and D.S.L. Wei, Task Clustering and Scheduling for Distributed Memory Parallel Architectures, IEEE Trans. on Parallel and Distributed Systems, vol. 7, no. 1, pp. 46-55, Jan. 1996.
7. S. Darbha and D.P. Agrawal, Optimal Scheduling Algorithm for Distributed-Memory Machines, IEEE Trans. on Parallel and Distributed Systems, vol. 9, no. 1, pp. 87-95, Jan. 1998.
8. Y.K. Kwok and I. Ahmad, Exploiting Duplication to Minimize the Execution Times of Parallel Programs on Message-Passing Systems, Proc. of the Sixth IEEE Symposium on Parallel and Distributed Processing, pp. 426-433, 1994.
9. H. El-Rewini and T.G. Lewis, "Scheduling Parallel Program Tasks onto Arbitrary Target Machines," J. Parallel and Distributed Computing 9, pp. 138-153, 1990.
10. G.C. Sih and E.A. Lee, "A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures," IEEE Trans. on Parallel and Distributed Syst., vol. 4, no. 2, pp. 175-187, 1993.
Evaluation of Process Migration for Parallel Heterogeneous Workstation Clusters

M.A.R. Dantas
Computer Science Department, University of Brasilia (UnB), Brasilia, 70910-900, Brazil
mardantas@computer.org
Abstract. It is recognised that the availability of a large number of workstations connected through a network can represent an attractive option for many organisations to provide application programmers with an alternative environment for high-performance computing. The original specification of the Message-Passing Interface (MPI) standard was not designed as a comprehensive parallel programming environment, and some researchers agree that the standard should be kept as simple and clean as possible. Nevertheless, a software environment such as MPI should somehow have a scheduling mechanism for the effective submission of parallel applications on networks of workstations. This paper presents the performance results and benefits of an alternative lightweight approach called Selective-MPI (S-MPI), which was designed to enhance the efficiency of the scheduling of applications on parallel workstation cluster environments.
1 Introduction
The standard message-passing interface known as MPI (Message Passing Interface) is a specification which was originally designed for writing applications and libraries for distributed memory environments. The advantages of a standard message-passing interface are portability, ease-of-use and the provision of a clearly defined set of routines to vendors so that they can design their implementations efficiently, or provide hardware and low-level system support.

Computation on workstation clusters using the MPI software environment might result in low performance, because there is no elaborate scheduling mechanism to efficiently submit processes to these environments. Consequently, although a cluster could have available resources (e.g. lightly-loaded workstations and memory), these parameters are not completely considered and the parallel execution might be inefficient. This is a potential problem interfering with the performance of parallel applications, which could be efficiently processed by other machines of the cluster.

In this paper we consider mpich [2] as a case study implementation of MPI. In addition, we use ordinary Unix tools; therefore, the concepts employed
in the proposed system can be extended to similar environments. An approach called Selective-MPI (S-MPI) [1] is an alternative design to provide MPI users with the same attractive features of the existing environment and covers some of the functions of a high-level task scheduler. Experimental results from demonstrator applications indicate that the proposed solution successfully enhances the performance of applications, with a significant reduction in elapsed time.

The present paper is organised as follows. In Section 2, we outline some details of the design philosophy of S-MPI. Experimental results are presented in Section 3. Finally, Section 4 presents conclusions.
2 S-MPI Architecture
Figure 1 illustrates the architecture of the S-MPI environment compared to the mpich implementation. The differences between the two environments are mainly the layer where the user interface sits and the introduction of a more elaborate scheduling mechanism.
Fig. 1. Architecture differences between S-MPI and mpich.
The S-MPI approach covers the functions of a high-level task scheduler: monitoring workstation workload, automatically compiling (or recompiling) the parallel code, matching process requirements with the resources available and, finally, dynamically selecting an appropriate cluster configuration to execute parallel applications efficiently.
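The paper does not show the scheduler's internals, but a speculative sketch of the host-selection step it describes might look as follows; the load-index values, the threshold and the output file format are illustrative assumptions, not the actual S-MPI interface.

def select_cluster(load_index, n_procs, threshold=1.0):
    # load_index: {hostname: average CPU run-queue length}; keep only
    # lightly-loaded workstations and pick the least loaded ones
    idle = sorted((load, host) for host, load in load_index.items() if load < threshold)
    return [host for _, host in idle[:n_procs]]

def write_machine_file(hosts, path="machines.selected"):
    # emit one candidate workstation per line for the MPI launcher
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")

hosts = select_cluster({"sun1": 0.2, "sun2": 2.5, "sgi1": 0.7, "sgi2": 0.1}, n_procs=3)
write_machine_file(hosts)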
3 Experimental Results
The configurations considered in this paper use a distributed parallel configuration composed of heterogeneous clusters of SUN and SGI (Silicon Graphics) workstations, in both cases using the mpich implementation. These workstations were private machines which could have idle cycles (i.e. considering a specific threshold of the average CPU queue length load index of the workstations) and memory available. The operating systems and hardware characteristics of these workstations are shown in Table 1.
Table 1. General characteristics of the SUN and SGI clusters.

                              SUN cluster       SGI cluster
Machine Architecture          SUN4u - Ultra     IP22
Clock (MHz)                   143-167           133-250
Data/Instruction Cache (KB)   64                32-1024
Memory (MB)                   32-250            32-250
Operating System              SOLARIS 5.5.1     IRIX 6.2
MPI version                   mpich 1.1         mpich 1.1
We have implemented two demonstrator applications with different ratios between computation and communication. The first application was a Gaussian algorithm to solve a system of equations. The second algorithm implemented was a matrix multiplication, which is also an essential component in many applications. Our algorithm is implemented using a master-slave interaction.

The improved performance trend of the S-MPI environment executing the matrix multiplication is presented in Figure 2. This figure demonstrates that S-MPI has a better performance than the MPI configurations. The SUN-MPI and SGI-MPI environments represent the use of the specific machines defined in the static machines.xxx files. The SUN/SGI-MPI configuration shows results of the use of the procgroup file. This static file was formed with the addresses of heterogeneous machines from both the SUN and SGI clusters. Finally, the results of SUN/SGI-SMPI illustrate the use of the automatic cluster configuration approach of S-MPI.

Figure 3 shows the enhanced performance of the S-MPI approach compared to MPI for solving a system of 1000 equations. This figure shows a downward trend of the elapsed time for the S-MPI environment in comparison to the conventional SUN and SGI MPI environments.
Fig. 2. Elapsed-time of the matrix multiplication on MPI and S-MPI environments.

Fig. 3. Elapsed-time of the Gaussian algorithm on MPI and S-MPI environments.

4 Conclusions

In this paper we have evaluated the performance results and benefits of a lightweight model called Selective-MPI (S-MPI). The system co-exists cleanly with the existing concepts of MPI and preserves the original characteristics of the software environment. In addition, S-MPI provides application programmers with a more transparent and enhanced environment to submit MPI parallel applications, without layers of additional complexity, for heterogeneous workstation clusters. Moreover, the concept clearly has advantages, providing users with the same friendly interface and ensuring that all available resources will be evaluated before configuring a cluster to run MPI applications.
References
1. M.A.R. Dantas and E.J. Zaluska. Efficient Scheduling of MPI Applications on Network of Workstations. Accepted paper, Future Generation Computer Systems Journal, 1997.
2. William Gropp, Patrick Bridges, Nathan Doss. Users' Guide to mpich, a Portable Implementation of MPI. Argonne National Laboratory, http://www.mcs.anl.gov, 1994.
Using Alternative Schedules for Fault Tolerance in Parallel Programs on a Network of Workstations

Dibyendu Das*
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, India
deedee@cse.iitkgp.ernet.in
1 Introduction
In this work a new method has been devised whereby a parallel application running on a cluster of networked workstations (say, using PVM) can adapt to machine faults or machine reclamation [1] in a transparent manner, without any program intervention. Failures are considered fail-stop in nature rather than Byzantine ones [4]. Thus, when machines fail, they are effectively shut out from the network, dying an untimely death. Such machines do not contribute anything further to a computation process. For simplicity, both failures and reclamation are referred to as faults henceforth.

Alternative schedules are used for implementing this adaptive fault tolerance mechanism. For parallel programs representable by static weighted task graphs, this framework builds a number of schedules statically at compile time, instead of a single one. Each such schedule has a different machine requirement, i.e. schedules are built for 1 machine, 2 machines and so on, up to a maximum of M machines. The main aim of computing alternative schedules is to have possible schedules to one of which the system can fall back for fault recovery during run-time. For example, if a parallel program starts off with a 3-machine schedule and one of its machines faults during run-time, the program adapts by switching to a 2-machine schedule computed beforehand. However, the entire program is not restarted from scratch. Recomputation of specific tasks and resending of messages, if required, are taken care of.
2 Alternative Schedule Generation Using Clustering with Cloning
Schedules in this work are generated using the concept of task clustering with cloning [2], [5], as cloning of tasks on multiple machines has a definite advantage as far as adapting to faults is concerned. The Duplication Scheduling Heuristic by Kruatrachue et al. [2], [3] has been used to generate alternative schedules with differing machine requirements (1, 2, 3, ...).

* The author acknowledges financial assistance in the form of a K. S. Krishan Fellowship.
3 The Schedule Switching Strategy
The major problem at run-time is to facilitate a schedule switch once one or more machines are thrown out of the cluster due to a fault. As the run-time system may need a new schedule to be invoked at a fault juncture, such a schedule should already exist on the machines. In addition, it is not known at compile time which cluster of the new schedule will be invoked on which machine. Once the set of completed tasks on each live machine is known, the schedule switch to an alternative schedule should take place in such a manner that the resulting program completes as quickly as possible. Also, the switching algorithm itself should be simple and fast.

3.1 Algorithm SchedSwitch
The algorithm for schedule switching is based on minimal weighted bipartite graph matching. The main idea is to shift from a currently executing thread (a cluster of a schedule executing on a machine) to a virtual cluster of an alternative schedule such that the overall scheduling time is reduced as much as possible. In order to do this, a complete weighted bipartite graph G = (V, E) is constructed, where V = M_live ∪ V_cl. M_live is the set of live machines (excluding the faulty ones) and V_cl is the set of clusters of an alternative schedule to which a schedule switch can take place. The set of edges is E = {(v1, v2) : v1 ∈ M_live, v2 ∈ V_cl}. Each edge is weighted, the weight being computed via the following strategy. The weight of matching cluster i of the pre-computed schedule with machine j is computed as follows: the not-yet-completed tasks of cluster i are scheduled on machine j as early as possible. While constructing this schedule, some of the tasks in cluster i may start late compared to their start times in the original alternative schedule. The maximum of these delays among the tasks in cluster i is computed, and the edge is annotated with this value. A minimum weighted bipartite matching algorithm is applied to the graph G thus created to determine the best mapping of clusters to machines.
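A compact sketch of the matching step follows; it is our own illustration (the paper gives no code), with the delay matrix assumed to be precomputed as described above and scipy's assignment solver standing in for a minimum-weight bipartite matching routine.

import numpy as np
from scipy.optimize import linear_sum_assignment

def sched_switch(delay):
    # delay[i][j] = maximum delay if live machine i adopts cluster j of the
    # alternative schedule; returns a minimum-weight machine-to-cluster mapping
    rows, cols = linear_sum_assignment(np.asarray(delay))
    return dict(zip(rows.tolist(), cols.tolist()))

# hypothetical 3 live machines x 3 clusters of the fallback schedule
mapping = sched_switch([[5, 9, 2],
                        [8, 1, 6],
                        [4, 7, 3]])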
3.2 Static and Run-Time Strategies for Schedule Switching
One of the major requirements for an adaptive program - be it for fault adaptation or machine reclamation - is that the adaptive mechanism must be reasonably fast. Keeping this in mind, two strategies have been designed for this framework: STSR (Static Time Switch Resolution) and RTSR (Run Time Switch Resolution). STSR is a static-time approach whereby the possible schedule switches are computed statically given certain fault configurations - usually single machine faults. RTSR is a run-time approach whereby the schedule switching algorithm is applied during adaptation. This is a more general approach compared to STSR.

STSR is a costly approach even if applied only for single machine failures, as it is not known at what point a machine may fail and hence all possibilities are enumerated and the possible schedule switches computed. The complexity of the STSR strategy is O(Cu V (V^2 + Cu log Cu)), where Cu denotes the maximum number of clusters formed by a schedule S in the set of schedules CS generated, and a maximum of O(V) nodes are present in the task graph. Though this method is compute-intensive, it allows very fast run-time adaptation. In addition, this algorithm is parallelizable, which can be used to reduce its running time. A modified strategy called STSRmod has also been developed. In this modified strategy, even when Cu is high, all possible schedule switches need not be considered. This is based on the observation that we get identical machine-to-cluster matchings over certain intervals of time, as the weights of the matchings change only on some discrete events, such as the completion of a task or the start of a new task on a machine. The complexity of STSRmod reduces to O(Cu V^3).

In RTSR, the switching algorithm is applied at run-time instead of compile time. Hence, the adaptation time is longer than in STSR. However, RTSR has greater flexibility, being able to handle a wider range of fault configurations. The complexity of RTSR is O(Cu log Cu + V^2).

4 Experiment
The simulation experiment uses a parallel application structured as an FFT graph, each node being of weight 2 units and each edge of 25 units. The bar graphs shown in Figure 1 are the scheduling times for the number of machines provided. Here, the maximum number of machines in use is 10 while the lowest is 2. The figure depicts how the total scheduling time Tsw (including faults) changes as machines fail one by one. The three plots shown are the plots of Tsw for machine failures with the inter-machine failure time set to 10, 20 and 40 time units respectively. Tsw includes the retransmission and recomputation costs. The lowest plot shows the change in Tsw with an inter-machine failure time of 10 units, the middle one shows the change for an inter-machine failure time of 20 units, and the topmost one for 40 units. The upper two plots do not extend in their entirety because the program gets over (i.e. finishes) before the later failures can take place. The lowest Tsw value plotted for P = 10 is the ultimate scheduling time if the 10-machine schedule is initially run and a fault occurs after 10 time units, forcing a switch to a 9-machine schedule. With the new 9-machine schedule, if the program runs for another 10 units before faulting, the Tsw using the 8-machine schedule is plotted next. The other plots have similar connotations but with different inter-machine failure times.

Fig. 1. Performance for multiple single machine faults for an FFT task graph.

5 Conclusion
In this work we have described the use of computing alternative schedules as a new framework which can easily be used to adapt to fluctuations that may either cripple the running of a program or substantially reduce its performance on a network of workstations. This strategy, hitherto unused, can be effectively employed to compute a set of good schedules off-line and to use one of them later at run-time, without the overhead of generating at run-time a good schedule which uses fewer machines. Thus, our strategy is a hybrid off-line approach coupled with on-line support for effective fault tolerance or machine eviction on a machine cluster without user intervention.
References
1. A. Acharya, G. Edjlali, and J. Saltz. The utility of exploiting idle workstations for parallel computation. In Proceedings of the ACM Sigmetrics Int'l Conf. on Measurement and Modeling of Computer Systems '97, Seattle, June 1997.
2. B. Kruatrachue and T. Lewis. Grain size determination for parallel programs. IEEE Software, pages 23-32, Jan 1988.
3. H. El-Rewini, T. Lewis, and H. H. Ali. Task Scheduling for Parallel and Distributed Systems. Englewood Cliffs, NJ: Prentice-Hall, 1994.
4. G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 1994.
5. M. A. Palis, J. Liou, and D. S. L. Wei. Task Clustering and Scheduling for Distributed Memory Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 1, pages 46-55, Jan 1996.
Very Distributed Media Stories: Presence, Time, Imagination

Glorianna Davenport
Interactive Cinema Group, MIT Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA 02139, [email protected]
”...the uncommon vista raises a pleasure in the imagination because it fills the soul with an agreeable surprise, gratifies its curiosity and gives it an idea of which it was not before possessed.” –Addison The Spectator 1712
1 Introduction
The action of stories is always grounded and contextualized in a specific place and time. For centuries, artists seeking places worthy of representation have found inspiration in both the natural landscape and in man-made surrounds. This inspiration traditionally dwells on the scenic aspects of place and situation, in styles ranging from photorealistic to impressionistic. Sometimes, as in Australian aboriginal ”dreamtime maps,” the real, the historical, and the spiritual components of a place are simultaneously depicted with equal weightings. Sometimes, as in road maps and contour maps, super-simplified representations are enhanced with integrated or overlaid technical measurements; constructed artifacts, such as roads and airports, share equal billing with natural landmarks, such as lakes and rivers. The scale, focus, point-of-view, and narrative content of landscapes are chosen and manipulated to suit the artist’s (and the audience’s) specific purposes: they embody affordances which exert great influence over a work’s final use. When we view the image of a broad, sweeping vista, we seldom notice every leaf on every tree: indeed, artists seldom provide us with this level of detail. Instead, our attention is drawn through and across a collection of landmarks, consciously and unconsciously brought together to serve as sensually satisfying iconic and symbolic representations of a more complex whole. This notion of bringing together and arranging selected elements into an integrated environment – customized and personalized for specific uses – has profound relevance to the digital systems of tomorrow.
2 Touchstones from the Past
In the 20th century, artists have increasingly moved away from strict representations of what they see in the world to formal and informal explorations of form and space. Many artists have challenged the limits of art as object by extending expression into the natural or man-made surround; often, fanciful landscape is used to proffer metaphors for engagement and imagination.

Towards the end of his life, Matisse – no longer able to hold a paintbrush – developed a technique by which he could produce flat paper cut-outs. By pasting these flat elements onto his hotel walls, Matisse was able to explore anew luminous dimensions of light in space. In these explorations, the artist inverts the objectness of sculpture and negates the boundaries of painting. In the flat frieze of bathers swimming in a band around his dining room, Matisse created an immersive space which is no less substantial and emotionally charged than his earlier works on canvas. The experience of actually dining in this customized and privileged environment – individually and collectively – is in itself a transformation of the familiar into a joyous, ethereal, unfamiliar collision of art and life.

Marcel Duchamp’s ”Etant Donnes” – the last major work of his life – at first appears to be a large, ramshackle wooden door. On closer examination, one discovers a peep-hole through which an allegorical 3D tableau (involving a nude, a waterfall, and a gas burner) is visible. Duchamp’s widow has steadfastly upheld his request that no-one be allowed to photograph or otherwise reproduce this voyeuristic vision; only those who physically travel to the museum and peek through the door cracks are rewarded. Even then, the visitor meets ”Etant Donnes” on a reticent footing: the piece is situated at the end of a corridor, and looks as if workmen may be doing construction behind it; to peer through cracks at a spectacular nude, while other people watch you, seems a bit tawdry. The limited views of a 3D scene present a shifting dynamic where information is selectively concealed or revealed.

In the early 70s, beleaguered by minimalism and America’s struggle with their values relative to the environment, artists such as Michael Heizer and Robert Smithson created mystical, monumental Earth Art. These enormous sculptural reworkings of the landscape ranged from huge gouges in the earth, to enormous decorative patterns of ecological flows, to fences across grand grazing terrains. Over time, these earthworks would erode and decay until they became indistinguishable from the natural surround.

These and other examples from the past provide a relevant touchstone for modern makers, particularly those who wish to create their own CD-ROMs, immersive interactive environments, or ”virtual reality” installations.
3 Art in Transition
Whatever its wellspring, the artistic imagination is shaped by skills, exposure to pre-existing conventions, the available materials, and worldly beliefs. In this
sense, the work is never precisely as we imagined; rather, it is a running compromise which mediates that inner thought with the outer reality of contemporary ideas, available technology, practical collaborations, and expressive powers of creative artists. Today’s digital technology is particularly empowering; unlike the passive, pastoral beauty of painted landscapes, digital environments are computational entities seething with activity, change, and movement. The hardware / software duality of these creations offers new forms of dynamic control over both the physical and metaphysical aspects of space, time, form, and experience. The marriage of content with production and delivery systems extends far beyond the technological; as we gain more experience with these systems, new sets of aesthetics, expectations, and challenges are emerging. A typical ”interactive multimedia” installation ties together devices which sense and track human activity; powerful computational engines; graphical displays; and, increasingly, high-capacity communications networks which interconnect people and resources across vast distances. At the core of these systems is some combination of software and hardware which attempts to glean the desires and motivations of the participant audience, remap them to the intentions of the primary author, and respond in meaningful ways. Many of these exploratory systems empower the audience to enter and traverse vast, sophisticated information landscapes via a ”driving” or ”flying” metaphor; many offer constructionist environments which invite the audience to add their own stories, objects, and other manifestations of desire and personality to the surround. The usefulness and expressive power of these systems depends not only on the technology – which is often ”invisible” to the audience – but also on the choice of overarching metaphor which conjoins and energizes the parts, relating them to specific story content. Today, we are no longer required to sit in front of a small computer screen, typing, pointing, and clicking our way through every kind of locally-stored content. Instead, we can use our voice, gaze, gesture, and body motion – whatever expressive form is most appropriate – to communicate our desires and intentions to the computational engine. These engines can reach across vast networks to find whatever material is needed, bring it back, and massage it on-the-fly to form personalized, customized presentations. Modern sensing devices and displays range from the familiar personal computer to large- scale, immersive environments of superb quality. A complex series of transactions underlies this process of interface, information retrieval, and presentation; as interactive systems rise to dominance, the problems and opportunities presented by these transactions – drawing from a broad base of networked resources owned by others – must ultimately be addressed by e-commerce.
4 Kinesics, Kinesthesia, and the Cityscape
In the physical world, natural vistas engage the human eye, stimulate the brain, and generate emotional responses such as joy, anxiety, or fear. These experiences
also produce a degree of kinesthesia: the body’s instinctive awareness and deep understanding of how it is moving through space. Film directors as well as makers of virtual reality scenarios seek to create transformational imagery which will spark a sensational journey through a constructed or simulated reality. Cinema captures or synthetically provides many sensory aspects of genuine body motion and audiovisual perception, but fails to represent others (such as muscle fatigue). As a result, the empathetic kinesthesia of film has proven to be extremely elusive and transitory; there is a limit to what the audience can experience emotionally and intellectually as they sit at the edge of the frame. By experimenting with lens-based recording technology, multilayered effects, and a variety of story elements, cinema presents an illusion of a world which offers sufficient cues and clues to be interpreted as ”real.” More recent experiments with remote sensing devices, theme-park ”thrill rides,” networked communities, constructionist environments, and haptic input/output devices have greatly expanded the modern experience-builder’s toolkit. In cinema, the perceiver is transported into a manufactured reality which combines representations of space or landscape with the actions of seemingly human characters. This reality must be conceived in the imagination of the film’s creators before it can be realized for the mind and heart of the spectator. The attachment of any director to a particular landscape – be it the desert in Antonioni’s ”The Passenger,” the futuristic city and under-city of Fritz Lang’s ”Metropolis,” or Ruttmann’s poetic ”Berlin: Symphony of a Great City” – is a function of personal aesthetic and symbolic intention. The ”holy grail” of cinematography may well be the realization of satisfying kinesthetic effects as we glide over ”Blade Runner’s” hyperindustrialized Los Angeles or descend into the deep caverns of Batman’s ultra-gothic Gotham City. Note that all of these examples draw upon intellectual properties subject to licensing fees for use: however, the rules and models of commerce governing the on-line delivery of this art remain substantially undefined.
5 The Landscape Transformed
Lured by the potential of a near infinite progression of information, artists are constructing digital environments which reposition the perceiver within the synthetic landscape of the story frame itself. Awarded agency and a role, the Arthur-like explorer sets out on a quest for adventure and knowledge. Positioned at the center of a real-time dynamic display, the perceiver navigates a landscape of choice in which information – text, HTML pages, images, sound, movies, or millions of polygons – is continuously rendered according to explicit rules to move in synch with the participant’s commands.

In 1993-1994, Muriel Cooper, Suguru Ishizaki, and David Small of MIT’s Visual Language Workshop designed an ”Information Landscape” in which local information (such as news) was situated on a topological map of the world. As the explorer pursued particular information, they found themselves hurtling across a vast, familiar landscape, or zooming down into an almost infinite well
of detail. In this work, the illusions of driving and flying are produced through the smooth scaling of the information. The perceiver’s body is at rest but, as she comes to understand the metaphor of motion, her brain makes the analogy to a phenomenon which is well known in our 20th-century world.
Fig. 1. Flavia Sparacino’s “City of News”
What new possibilities for navigational choice are offered when a system is able to parse full-body motion and gestures? Recently, Flavia Sparacino has created the ”City of News,” a prototype 3D information browser which uses human gesture as the navigational input. ”City of News” explores behavioral information space in which HTML pages are mapped onto a familiar urban landscape – a form of ”memory palace” where the location of particular types of information remains constant, but the specific content is forever changing as fresh data feeds in from the Internet. Still in development, this work raises interesting questions about organizational memory structures and the economics of information space. Over the past year, Sparacino has demonstrated ”City of News” in a desktop environment, in a small room-sized environment, and in a larger space which could accommodate floor and wall projection surfaces. As this environment develops, Sparacino will consider how two or more people might build and share such a personal information landscape.
6 Landscape as Harbinger of Story Form
Stories help us to make sense of the chaotic world around us. They are a means – perhaps the principal human cognitive means – by which we select, interpret,
reshape, and share our experiences with others. In order to evolve audiovisual stories in which the participant-viewer has significant agency, the artist seeks to conjoin story materials with relevant aspects of real-world experience. Sets, props, settings, maps, characters, and ”plot points” are just a few of the story elements which can be equipped with useful ”handles” which allow the audience to exert control. As these computer-assisted stories mature – just as cinema has – they will in turn serve as metaphoric structures against which we can measure the real world.

Today’s networking technology allows us to bring an audience ”together” without requiring them to be physically present in the same place at the same time: we call this art form ”very distributed story.” Regardless of its underlying spatio-temporal structure, the participant audience always perceives the playout of story as a linear experience. Thus, ”real time” is a crucial controller of the audience’s perceptions of ”story time.” In networked communications, the passing-on of information from one person to the next is characterized by time delays and differences in interpretation: what effect does this have on the communal experience of story? Long-term story structures such as suspense, expectation, and surprise depend on the linear, sequential revelation of knowledge over time. Context and the ”ticking clock” reveal absolute knowledge of a story situation – but only after a significant period of involvement with the narrative.

Beyond its consequential responses to the remotely-sensed desires and commands of its audience, very distributed story must embark upon a metaphysics of landscape. This is the focus of the ”Happenstance” system, currently under development by Brian Bradley, another of my graduate students at MIT. ”Happenstance” is a flexible storytelling testbed which expands the traditional literary and theatrical notions of Place and Situation to accommodate interactive, on-the-fly story construction. Important aspects of story content and context are made visible, tangible, and manipulable by systematically couching them within the metaphors of ecology, geology, and weather. Information-rich environments become conceptual landscapes which grow, change, and evolve over time and through use. Current information follows a natural cycle modeled after the Earth’s water cycle. Older information, history, and complex conceptual constructs – built up by the flow of data over time – are manifested in the rock and soil cycles. Directed inquiries, explorations of theory, and activities associated with the audience’s personal interests are captured and reflected by plant growth. As a result, information itself is imbued with sets of systemic, semi-autonomous behaviors which allow it to move and act intelligently within the story world or other navigable information spaces in ways which are neither tightly scripted nor random. In Bradley’s work, landscape and the broad forces of weather which sweep across it are the carriers of information as well as scenic elements: they are bound by the rules, cycles, and temporal behavior of ecological systems. Landscape, context, and new information flowing through the system work together to provide a ”stage” where story action can take place.
Fig. 2. The information “water cycle” of Brian Bradley’s “Happenstance”
As precious computational resources – known in the jargon as ”real estate” – meet specialized, continuous Information Landscapes, new and intuitive forms of navigation and content selection are beginning to emerge. The symbology of landscapes built by the gathering-together of scattered resources must hook seamlessly into a personalized, customizable transactional model capable of resolving higher-level ambiguities and vaguenesses. The movement of information through these story spaces must have consequence over time, just as the interactions of the participant audience must always elicit prompt and meaningful responses from the storytelling system.

As narrative finds its appropriate form in immersive electronic environments, the traditional classification of story – myth, history, satire, tragedy, comedy – must be extended to include hybridized forms which reflect the moment-to-moment desires and concerns of the audience. Rather than strictly mapping events to prototypical story action, we will evolve new story forms based on navigation of choice, constrained and guided by authorial will. Hopefully, in time and with practice, these metaphoric structures and maps can be – as Addison foretold – ”as pleasing to the Fancy as the Speculations of Eternity or Infinitude are to the Understanding.”
Dynamic and Randomized Load Distribution in Arbitrary Networks

J. Gaber and B. Toursel
L.I.F.L., Université des Sciences et Technologies de Lille 1, 59655 Villeneuve d'Ascq cedex, France
{gaber, toursel}@lifl.fr
Abstract. We present the analysis of a randomized load distribution algorithm that dynamically embeds arbitrary trees in a distributed network with an arbitrary topology. We model a load distribution algorithm by an associated Markov chain and we show that the randomized load distribution algorithm spreads any M-node tree in a distributed network with N vertices with load O(M/N + δ(ε)), where δ(ε) depends on the convergence rate of the associated Markov chain. Experimental results obtained by implementing these load distribution algorithms, to validate our framework, on the two-dimensional mesh of the massively parallel computer MasPar MP-1 with 16,384 processors, and on a network of workstations, are also given.
1 Introduction
We model the load distribution problem that dynamically maintains an evolving arbitrary tree in a distributed network as an on-line embedding problem. These algorithms are dynamic in the sense that the tree may start as one node and grow by dynamically spawning children. The nodes are incrementally embedded as they are spawned. Bhatt and Cai present in [1,2] a randomized algorithm that dynamically embeds an arbitrary M-node binary tree in an N-processor binary hypercube with dilation O(log log N) and load O(M/N + 1). Leighton et al. present in [3,4] two randomized algorithms for the hypercube and the butterfly. The first algorithm achieves, with high probability, a load O(M/N + log N) and, respectively, dilation 1 for the hypercube and dilation 2 for the butterfly. The second algorithm achieves, for the hypercube, a dilation O(1) and load O(M/N + 1) with high probability. Keqin Li presents in [5] an optimal randomized algorithm that achieves dilation 1 and load O(M/N) in linear arrays and rings. That algorithm concerns a model of random trees called the reproduction tree model [5]. The load distribution algorithm that we describe here works for any arbitrary tree (binary or not) on a general network. The analysis uses mathematical tools derived from both Markov chain theory and the numerical analysis of matrix iterative schemes.
2 The Problem
Consider the following needed notation. Let Pj be the jth vertex of the host graph and k a given integer. We define, for each vertex Pj, the set
V(Pj) = { n(j, 1), n(j, 2), ..., n(j, k) }.
This set defines a logical neighbourhood for Pj; n(j, δ) refers to the δ-th neighbour of Pj. The term logical neighbour denotes that an element of V(Pj) is not necessarily close to Pj in the distributed network.

Consider the behavior of the following mapping algorithm. At any instant in time, any task allocated to some vertex u that does not yet have k children can spawn a new child task. The newly spawned children must be placed on vertices satisfying the following conditions, as proposed by the paradigm of Bhatt and Cai: (1) without foreknowing how the tree will grow in the future, (2) without accessing any global information, and (3) once a task is placed on a particular vertex, it cannot be reallocated subsequently. Hence, process migration is disallowed and the placement decision must be implemented within the network in a distributed manner, locally and without any global information.

The mapping algorithm that we will describe is randomized and operates as follows. The children of any parent node v initially placed on a vertex Pj are randomly and uniformly placed on distinct vertices of the set V(Pj). The probability that a vertex in V(Pj) is chosen is 1/k. At the start, the root is placed on an initial vertex which can be fixed or randomly chosen. The randomness is crucial to the success of the algorithm, as will be seen in its analysis.
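As a rough illustration, the following simulation sketch (our own assumption, not code from the paper) grows a random k-ary tree and applies the placement rule above on a small hypothetical ring-shaped neighbourhood, reporting the resulting maximum load per vertex.

import random

def spread_tree(neighbourhood, root_vertex, total_nodes, k=2):
    # neighbourhood[j] is the set V(Pj) of k logical neighbours of vertex j
    load = {v: 0 for v in neighbourhood}
    frontier = [root_vertex]            # vertices hosting tasks that may still spawn
    load[root_vertex] += 1
    spawned = 1
    while frontier and spawned < total_nodes:
        pv = frontier.pop(random.randrange(len(frontier)))
        n_children = min(k, total_nodes - spawned)
        # children go to distinct vertices of V(pv), each chosen with probability 1/k
        for cv in random.sample(neighbourhood[pv], n_children):
            load[cv] += 1
            frontier.append(cv)
            spawned += 1
    return load                          # number of tree nodes mapped per vertex

# hypothetical logical neighbourhood on N = 8 vertices: V(Pj) = {(j+1) mod 8, (j+2) mod 8}
ring = {j: [(j + 1) % 8, (j + 2) % 8] for j in range(8)}
print(max(spread_tree(ring, root_vertex=0, total_nodes=1000).values()))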
3 Analysis
As was mentioned, each node spawned as a child of a node v located on a vertex Pj is assigned a random vertex in the set V(Pj). The probability that any particular vertex is chosen is 1/k. Since our process has the property that the choice of a destination vertex at any step ℓ depends only on the father's node, i.e., only on the state at step ℓ−1, we have a Markov chain whose state space is the set of the N vertices. We construct its transition probability matrix A = (a_ij), defined by

a_ij = 1/k if Pj ∈ V(Pi), and a_ij = 0 otherwise.
A node distribution can be considered as a probability distribution, and the mapping of the newly spawned nodes is the computation of one-step transition probabilities. Let p_t be an N-vector whose entry p_t(i) is the proportion of objects in state i at time t. We denote by p_0 the initial probability distribution. At step ℓ, we have p_ℓ = p_0 A^ℓ for all ℓ ≥ 1. We know that if the matrix A is regular, then the Markov chain has stationary transition probabilities, since lim_{t→∞} A^t exists. This means that for large values of ℓ, the probability of being in state i after ℓ transitions is approximately equal to 1/N (the ith entry of p̄), no matter what the initial state was. In other words, as a consequence of the asymptotic behavior of such a Markov process, the distribution converges to the uniform distribution, which allocates the same number of nodes to each vertex for very large M. Let p_ℓ be the probability distribution at step ℓ. We have

p_ℓ = p_{ℓ-1} A   and   p̄ = lim_{ℓ→∞} p_ℓ = (1/N, 1/N, ..., 1/N).
Let c(v) be the number of outcomes where vertex v is selected. We consider that the Markov chain reaches its stationary phase beginning from t(ε) (e.g., we choose the smallest t such that each entry of p_t is equal to 1/N + ε, where ε goes to 0). Denote by k^(ℓ) the number of nodes spawned at step ℓ. We have M = Σ_{ℓ=0}^{L} k^(ℓ), with L ∈ [t(ε), ∞[.

We have c(v) = Σ_ℓ p_ℓ(v) k^(ℓ), which becomes [6]

c(v) = Σ_{ℓ=0}^{t(ε)} (p_ℓ(v) − p̄(v)) k^(ℓ) + M/N,

and we obtain the vector a = Σ_{ℓ=0}^{t(ε)} (p_ℓ − p̄) k^(ℓ). We now need to bound a. As stated by Cybenko in [7], if γ is the subdominant eigenvalue modulus of A, then we have

||p_ℓ − p̄||^2 ≤ γ^{2ℓ} ||p_0 − p̄||^2,

and thus ||p_ℓ − p̄|| ≤ γ^ℓ ||p_0 − p̄|| and

||a|| = || Σ_{ℓ=0}^{t(ε)} k^(ℓ) (p_ℓ − p̄) || ≤ Σ_{ℓ=0}^{t(ε)} k^(ℓ) ||p_ℓ − p̄|| ≤ ||p_0 − p̄|| Σ_{ℓ=0}^{t(ε)} k^ℓ γ^ℓ.

Note that we have used the fact that the number of nodes spawned at any step ℓ is at most k^ℓ. We obtain

||a|| ≤ ||p_0 − p̄|| ((kγ)^{t(ε)+1} − 1) / (kγ − 1).

Recall that ||·||_∞ of a vector denotes its maximum entry. Denote by δ(ε) = ||a||_∞ the maximum entry of a. Since p_0 − p̄ = (1 − 1/N, −1/N, ..., −1/N), we have ||p_0 − p̄||_∞ = 1 − 1/N, and

δ(ε) ≤ (1 − 1/N) ((kγ)^{t(ε)+1} − 1) / (kγ − 1).    (1)
In view of (1), for small γ we obtain a small value δ(ε). Recall that γ ∈ [−1, 1]. For γ = 1/k + α, we compute the Taylor series expansion of δ(ε) at 1/k. This reveals that δ(ε) ≤ t(ε) + 1, where t(ε) is the rate of convergence of A (note that t(ε) is bounded). With the above analysis we obtain the following theorem.

Theorem 1. Given a randomized mapping algorithm of a process graph that is an arbitrary dynamic k-ary tree on any arbitrary topology, if the mapping's stochastic matrix associated with the logical neighbourhood is regular, then the number of nodes mapped on a single vertex of the network is O(M/N + δ(ε)), where δ(ε) = (1 − 1/N) ((kγ)^{t(ε)+1} − 1) / (kγ − 1), γ is the subdominant eigenvalue of the mapping's stochastic matrix, t(ε) is the number of steps necessary to converge within ε, M is the number of nodes and N is the number of vertices of the network.
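For concreteness, a small numerical sketch (again our own illustration, on the same hypothetical ring neighbourhood and assuming numpy is available) shows how the quantities appearing in Theorem 1 can be evaluated: the stochastic matrix A is built from V(Pj), its subdominant eigenvalue modulus γ is read off, and the bound M/N + δ(ε) is computed.

import numpy as np

def transition_matrix(neighbourhood, n_vertices, k):
    A = np.zeros((n_vertices, n_vertices))
    for i, neighbours in neighbourhood.items():
        for j in neighbours:
            A[i, j] = 1.0 / k                     # a_ij = 1/k if Pj in V(Pi)
    return A

def load_bound(neighbourhood, n_vertices, k, n_nodes, t_eps):
    A = transition_matrix(neighbourhood, n_vertices, k)
    moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    gamma = moduli[1]                             # subdominant eigenvalue modulus
    if abs(k * gamma - 1.0) < 1e-9:
        delta = (1 - 1.0 / n_vertices) * (t_eps + 1)   # limiting case kgamma = 1
    else:
        delta = (1 - 1.0 / n_vertices) * ((k * gamma) ** (t_eps + 1) - 1) / (k * gamma - 1)
    return n_nodes / n_vertices + delta           # bound of Theorem 1

ring = {j: [(j + 1) % 8, (j + 2) % 8] for j in range(8)}
print(load_bound(ring, 8, 2, n_nodes=1000, t_eps=20))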
4 Experimentations
We ran a parallel implementation of an algorithm wherein an arbitrary tree grows during the course of the computation (as in branch-and-bound search, divide-and-conquer or game tree evaluation) on a 128 x 128 mesh. The following table gives the experimental results in terms of the maximum load obtained when the randomized algorithm is (resp. is not) used to balance the load. Each line shows the maximum load obtained when we embed an arbitrary tree; the first column gives the total number of nodes of this tree.

Number of nodes   Load without randomized algorithm   Load with randomized algorithm   Ideal load
3123              93.00                               1.17                             1
55791             219.00                              4.59                             3.41
65135             4.00                                4.00                             3.98
99325             16.00                               6.82                             6.06
204383            64.00                               13.43                            12.47
731923            64.00                               45.86                            44.67
1640123           256.00                              102.59                           100.11
We ran, on a network of 8 stations with a multithreaded environment, a parallel implementation of an algorithm wherein an arbitrary binary tree grows during the course of the computation. The following table gives the effective load of each station obtained with and without the randomized algorithm. The value ToTal denotes the total number of nodes of the embedded arbitrary tree.

Load without randomized algorithm   Load with randomized algorithm   Ideal load (ToTal/N)
4171                                16106                            17271.50
4012                                16164                            17271.50
7302                                17050                            17271.50
7289                                16012                            17271.50
3605                                16627                            17271.50
3412                                16597                            17271.50
51722                               20394                            17271.50
56659                               19222                            17271.50
ToTal = 138172.00 (N = 8)
5 Conclusion

Theorem 1 establishes that, for a given mapping function, a simple randomized load distribution algorithm as described in Section 2 dynamically maintains an evolving arbitrary tree on a general distributed network with a load O(M/N + δ(ε)), where δ(ε) depends on the mapping function. This implies that we can easily compare mapping functions just by computing the eigenvalues of the associated adjacency matrix.
Acknowledgment

We are grateful to F.T. Leighton for the helpful comments and to K. Li for helpful discussions. We are also grateful to F. Chung [8], S. Bhatt and G. Cybenko for providing us with helpful papers and suggestions.
References
1. S.N. Bhatt and J.-Y. Cai. Take a walk, grow a tree. In Proc. 29th Annual IEEE Symposium on Foundations of Computer Science, IEEE CS, Washington, DC, pp. 469-478, 1988.
2. S. Bhatt and J.-Y. Cai. Taking random walks to grow trees in hypercubes. Journal of the ACM, 40(3):741-764, July 1993.
3. F.T. Leighton, M.J. Newman, A.G. Ranade, and E.J. Schwabe. Dynamic tree embeddings in butterflies and hypercubes. SIAM Journal on Computing, 21(4):639-654, August 1992.
4. F.T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann Publishers, 1992. Translated into French by P. Fraigniaud and E. Fleury, International Thomson Publishing France, 1995.
5. K. Li. Analysis of randomized load distribution for reproduction trees in linear arrays and rings. Proc. 11th Annual International Symposium on High Performance Computers, Winnipeg, Manitoba, Canada, July 10-12, 1997.
6. J. Gaber. Plongement et manipulations d'arbres dans les architectures distribuées. PhD thesis, LIFL, January 1998.
7. G. Cybenko. Dynamic load balancing for distributed memory architecture. Journal of Parallel and Distributed Computing, 7:279-301, 1989.
8. Fan R.K. Chung. Spectral Graph Theory. AMS, 1997.
Workshop 04 Automatic Parallelization and High-Performance Compilers Jean-Francois Collard Co-chairmen
Presentation This workshop deals with all topics concerning automatic parallelization techniques and the construction of parallel programs using high performance compilers. Topics of interest include the traditional fields of compiler technology, but also the interplay between compiler technology and communication libraries or run-time support. Of the 16 papers submitted to this workshop, 7 were accepted as regular papers, 2 as short papers, and 7 were rejected.
Organization The first session focuses on data placement and data access. In "Data Distribution at Run-Time: Re-Using Execution Plans," Beckmann and Kelly show how data placement optimization techniques can be made efficiently available in runtime systems by a mixed compile- and run-time technique. On the contrary, the approach by Kandemir et al. in "Enhancing Spatial Locality using Data Layout Optimizations" to improve cache performance in uni- and multi-processor systems is purely static. They propose an array restructuring framework based on a combination of hyper-plane theory and reuse vectors. However, when data structures are very irregular, such as meshes, the compiler alone can in general extract very little information. In "Parallelization of Unstructured Mesh Computations Using Data Structure Formalization," Koppler introduces a small description language for mesh structures which allows him to propose a special-purpose parallelizer for the class of applications he tackles. It is worth noticing the wide spectrum of techniques, ranging from completely static methods to purely runtime ones, explored in this field. This definitely illustrates the difficulty of the problem, and the three papers mentioned above make significant contributions asserted by real-life case studies. The second session starts with "Parallel Constant Propagation," where Knoop presents an extension to parallel programs of a classical optimization of sequential programs: constant propagation. Another classical sequential optimization, extended to parallel programs, is redundancy elimination [2]. It is well known, however, that redundancies can be an asset in the parallel setting. Eisenbiegler takes benefit of this property in his paper "Optimization of SIMD Programs
with Redundant Computations," and reports very encouraging execution time improvements. Finally, in "Exploiting Coarse Grain Parallelism from FORTRAN by Mapping it to IF1," Lachanas and Evripidou describe the parallelization of Fortran programs through conversion to single assignment. This work is also interesting for its smart use of two separately available tools: the front-end of Parafrase 2 and the back-end of the SISAL compiler. In the third session, Feautrier presents in "A Parallelization Framework for Recursive Tree Programs" a novel framework to analyze dependencies in programs with recursive data. It is most noteworthy that a topically related paper has recently been published elsewhere [3], illustrating that the analysis of programs with recursive structures currently is a matter of great interest. How to extend the static scheduling techniques crafted by the author [1] to this framework is an exciting issue. Like scheduling, tiling is a well-known technique to express the parallelism in programs at compile-time. In their paper "Optimal Orthogonal Tiling," Andonov, Rajopadhye and Yanev bring a new analytical solution to the problem of determining the tile size that minimizes the total execution time. A mixed compile- and run-time technique is addressed in the last paper. In "Enhancing the Performance of Autoscheduling in Distributed Shared Memory Multiprocessors," Nikolopoulos, Polychronopoulos and Papatheodorou present a technique to enhance the performance of autoscheduling, a parallel program compilation and execution model that combines automatic extraction of parallelism, dynamic scheduling of parallel tasks, and dynamic program adaptability to the machine resources.
Conclusion Looking back at the papers presented in this workshop, one may notice that the borderline between compile-time and run-time is getting blurred in data dependence techniques, data placement, and the exploitation of parallelism. In other words, intensive research is being conducted to benefit from the best of both worlds so as to cope with real-size applications. This orientation is very encouraging, and we can hope the end-user will soon benefit from the work presented in this workshop.
References
1. P. Feautrier. Some efficient solutions to the affine scheduling problem, part I, one dimensional time. Int. J. of Parallel Programming, 21(5):313-348, October 1992.
2. J. Knoop, O. Rüthing, and B. Steffen. Optimal code motion: Theory and practice. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(4):1117-1155, 1994.
3. Y. A. Liu. Dependence analysis for recursive data. In Int. Conf. on Computer Languages, pages 206-215, Chicago, Illinois, May 1998. IEEE.
Data Distribution at Run-Time: Re-using Execution Plans
Olav Beckmann and Paul H J Kelly
Department of Computing, Imperial College
180 Queen's Gate, London SW7 2BZ, U.K.
{ob3,phjk}@doc.ic.ac.uk
Abstract. This paper shows how data placement optimisation techniques which are normally only found in optimising compilers can be made available efficiently in run-time systems. We study the example of a delayed evaluation, self-optimising (DESO) numerical library for a distributed-memory multicomputer. Delayed evaluation allows us to capture the control-flow of a user program from within the library at run-time, and to construct an optimised execution plan by propagating data placement constraints backwards through the DAG representing the computation to be performed. In loops, essentially identical DAGs are likely to recur. The main concern of this paper is recognising opportunities where an execution plan can be re-used. We have adapted both conventional parallelising compiler techniques and hardware dynamic branch prediction techniques in order to ensure that our run-time optimisations need not perform any more work than a parallelising compiler would have to do unless there is a prospect of better performance.
1 Introduction
Parallel libraries have two major advantages as a parallel programming model. Firstly, they are convenient because user programs simply call library operators which hide all aspects of parallelism internally. Secondly, there is ample evidence [6] that compiled programming models do not yet get close to purpose-built libraries in terms of performance. It is therefore often worthwhile to invest in a highly optimised, machine-specific library of common numerical subroutines. Hitherto, the disadvantage with such libraries has been that opportunities for optimisation across library calls have been missed. In this paper, we study a delayed evaluation, self-optimising (DESO) vector-matrix library for a distributed-memory multicomputer. The idea of DESO is to delay actual execution of function calls for as long as possible. Evaluation is forced by the need to access array elements 1. Delayed evaluation then provides the opportunity at run-time to construct an execution plan which minimises redistribution by propagating data placement constraints backwards through the DAG representing the computation to be performed. We will study a detailed example in Section 2.
1 The most common reasons for accessing array elements are output and conditional tests. We will refer to statements where this happens as force points.
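The delayed-evaluation idea can be illustrated with a small sketch. The code below is illustrative only, in Python rather than the library's actual implementation language, and all class and function names are invented for this sketch, not the DESO library's real interface: operator calls merely record DAG nodes, and only a force point triggers (plan construction and) execution.

```python
# A minimal sketch (not the authors' library) of delayed evaluation:
# operator calls only record DAG nodes; a force point evaluates the DAG.
import numpy as np

class Node:
    def __init__(self, op, inputs=(), data=None):
        self.op, self.inputs, self.data = op, list(inputs), data

def leaf(a):            return Node('leaf', data=np.asarray(a, dtype=float))
def matvec(A, x):       return Node('matvec', [A, x])
def dot(x, y):          return Node('dot', [x, y])
def axpy(alpha, x, y):  return Node('axpy', [alpha, x, y])   # alpha*x + y

def force(node):
    """Force point (e.g. output or a convergence test): evaluate the DAG.
    A DESO-style library would first derive an execution plan by propagating
    data-placement constraints backwards through this DAG."""
    if node.op == 'leaf':
        return node.data
    args = [force(n) for n in node.inputs]
    if node.op == 'matvec': return args[0] @ args[1]
    if node.op == 'dot':    return float(args[0] @ args[1])
    if node.op == 'axpy':   return args[0] * args[1] + args[2]
    raise ValueError(node.op)

# One step of CG-like arithmetic, accumulated lazily and forced once:
A, p, x = leaf(np.eye(3)), leaf([1., 2., 3.]), leaf([0., 0., 0.])
q = matvec(A, p)
gamma = dot(q, p)
print(force(axpy(leaf(0.5), p, x)), force(gamma))
```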
Key issues. The main challenge in optimising at run-time is that the optimiser itself has to be very efficient. We achieve this by
- Working from aggregate loop nests, which have been optimised in isolation and which are not re-optimised at run-time. It is precisely the point of the library approach that we have invested in an implementation of selected operators which have been pre-optimised offline.
- Using a purely mathematical formulation for data distributions, which allows us to calculate, rather than search for, optimal placements. We will not expand on this aspect of our methodology here.
- Re-using execution plans for previously optimised DAGs. A value-numbering scheme is used to capture cases where this may be possible. The value numbers are used to index a cache of optimisation results, and we use a technique adapted from hardware dynamic branch prediction for deciding whether to further optimise DAGs we have previously encountered.
Context and Structure of this Paper. This paper builds on related work in the field of automatic data placement [8], run-time parallelisation [4, 9, 10] and conventional compiler and architecture technology [1,5]. In our earlier paper [3] we described the basic idea of a lazy, self-optimising parallel vector-matrix library. In this current paper, we extend that work by presenting the techniques we use to avoid re-optimisation of previously encountered problems. Below, we begin in Section 2 by discussing two alternative strategies for run-time optimisation. Following that, Section 3 presents our techniques for avoiding re-optimisation where appropriate. Section 4 shows performance results and Section 5 concludes.
2 Issues in Run-Time Optimisation
This section discusses and compares two different basic strategies for performing run-time optimisation. We will refer to them as "Forward Propagation Only" and "Forward And Backward Propagation". We use the conjugate gradient iterative algorithm for solving linear systems Ax - b = 0 to illustrate both strategies. The pseudocode for this algorithm can be found in Figure 2. We use the following terminology: n is the number of operator calls in a sequence, a the maximum arity of operators, m is the maximum number of different methods per operator. If we work with a fixed set of data placements, s is the number of different placements, and in DAGs, d refers to the degree of the shared node (see [8]) with maximum degree.
Forward Propagation Only. This is the only strategy open to us if we perform run-time optimisation of a sequence of library operators under strict evaluation. The strategy is illustrated in Figure 1: we optimise the placement of each node based on information purely about its ancestors. The total optimisation time for a sequence of operator calls under this strategy has linear complexity in the number of operators. However, as we have illustrated in Figure 1, the price we pay for using such an algorithm is that it may give a significantly suboptimal answer. This problem is present even for trees, but it is much worse for DAGs since shared nodes are not taken into account properly.
[Figure 1 shows the DAG built for the first CG iteration (computing ρ = r·r, q = A·p, γ = q·p and x = αp + x), with two annotations: "We now need to transpose either q or p in order to calculate γ. Suppose we choose p." and "We now require p to be aligned with x, but we have just transposed it, so we have to transpose back. We have made optimisation very easy by deciding the placement of each new node generated based simply on the placement of its immediate ancestor nodes. However, this can result in sub-optimal placements."]
Fig. 1. Run-time optimisation of the first iteration of the CG algorithm (see Figure 2) with Forward Propagation Only.
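The greedy character of Forward Propagation Only can be made concrete with a small sketch. The placement names and the cost model below are made up for illustration (the real library works with affine placement descriptors); the point is only that each node's placement is fixed from its already-placed ancestors and never revised for later consumers.

```python
# Sketch of Forward Propagation Only: choose each node's placement greedily
# from its immediate ancestors. Placements and costs are illustrative only.
PLACEMENTS = ('row_block', 'col_block')

def redistribution_cost(src, dst):
    return 0 if src == dst else 1          # toy cost model

def forward_propagation_only(dag):
    """dag: list of (node, ancestors) in topological order; returns placements."""
    placement = {}
    for node, ancestors in dag:
        best = min(
            PLACEMENTS,
            key=lambda p: sum(redistribution_cost(placement[a], p) for a in ancestors),
        )
        placement[node] = best              # fixed now; later uses cannot revise it
    return placement

# q = A.p then gamma = q.p: the greedy choice for q ignores how gamma and the
# later update of x will want p placed, which is where sub-optimality creeps in.
dag = [('A', []), ('p', []), ('q', ['A', 'p']), ('gamma', ['q', 'p'])]
print(forward_propagation_only(dag))
```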
Forward And Backward Propagation. Delayed evaluation gives us the opportunity to propagate placement constraint information backwards through a DAG, since we accumulate a full DAG before we begin to optimise. This type of optimisation is much more complex than Forward Propagation Only. Mace [8] has shown it to be NP-complete for general DAGs, but presents algorithms with complexity O((m + s^2)n) for trees and with complexity O(n x s^(d+1)) for a restricted class of DAG. The point to note here, though, is that Forward and Backward Propagation does give us the opportunity to find the optimal solution to a problem, provided we are prepared to spend the time required.
3 Re-using Execution Plans
The previous section has shown how delayed evaluation gives us the opportunity to derive optimal execution plans, but potentially at not insignificant cost. In real programs, essentially identical DAGs often recur. In such situations, our delayed evaluation, run-time approach is set to suffer a significant performance disadvantage over compile-time techniques unless we can reuse the results of previous optimisations we have performed.
r^(0) = b - A x^(0)
for i = 1, ..., i_max
    ρ_(i-1) = r^(i-1)T r^(i-1)
    if i = 1
        p^(1) = r^(0)
    else
        β_(i-1) = ρ_(i-1) / ρ_(i-2)
        p^(i) = r^(i-1) + β_(i-1) p^(i-1)
    endif
    q^(i) = A p^(i)
    α_i = ρ_(i-1) / p^(i)T q^(i)
    x^(i) = x^(i-1) + α_i p^(i)
    r^(i) = r^(i-1) - α_i q^(i)
    check convergence
end
Fig. 2. Left: Pseudocode for the conjugate gradient iterative algorithm. Right: Optimisation with Forward And Backward Propagation: we take account of the use of p in the update of x in deciding the correct placement for the earlier calculation of γ.
This section shows how we can ensure that our optimiser does not have to carry out any more work than an optimising compiler would have to do, unless there is the prospect of better performance than the compiler could deliver. We first discuss the problem of how to recognise a DAG, i.e. an optimisation problem, which we have encountered before, and then the issue of whether to further optimise, or re-optimise, such a DAG.
Recognising Opportunities for Reuse. The full optimisation problem is a large structure. To avoid having to traverse it for comparing with previously encountered DAGs, we derive a hashed "value number" [1, 5] for each node.
- Our value numbers have to encode data placements and placement constraints, not actual data values. For nodes which are already evaluated, we simply apply a hash function to the placement descriptor of that node. For nodes which are not yet evaluated, we have to apply a hash function to the placement constraints on that node.
- The key observation is that by seeking to take account of all placement constraints on a node, we are in danger of deriving an algorithm for calculating value numbers which has the same O-complexity as Forward and Backward Propagation optimisation algorithms: each node in a DAG can potentially exert a placement constraint over every other node.
- Our algorithm for calculating value numbers is therefore based on Forward Propagation Only: we calculate value numbers for unevaluated nodes by applying a hash function to the placement constraints deriving from their immediate ancestors.
- Since we do not store the "full DAG" information, we cannot easily detect hash conflicts. We return to this point shortly.
When to Re-Use and When to Optimise. Because our value numbers are calculated on Forward Propagation Only information, we have to address the problem of how to handle those cases where nodes which have identical value numbers are used in a different context later on; in other words, how to avoid the drawbacks of Forward Propagation Only optimisation. This is a branch-prediction problem, and we use a technique adapted from hardware dynamic branch prediction (see [7]) for predicting heuristically whether identical value numbers will result in identical future use of the corresponding node and hence identical optimisation problems.
Caching Execution Plans. Value numbers and 'dynamic branch prediction' together provide us with a fairly reliable mechanism for recognising the fact that we have encountered a node in the same context before. Assuming that we optimised the placement of that node when we first encountered it, our task is then simply to re-use the placement which the optimiser derived. We do this by using a "cache" of optimised placements, which is indexed by value numbers. Each cache entry has a valid-tag which is set by our branch prediction mechanism.
Competitive Optimisation. As we showed in Section 2, full optimisation based on Forward And Backward Propagation can be very expensive. Each time we invoke the optimiser on a DAG, we therefore only spend a limited time optimising that DAG. For a DAG which we encounter only once, this means that we only spend very little time trying to eliminate the worst redistributions. For DAGs which recur, our strategy is to gradually improve the execution plan used until our optimisation algorithm can find no further improvements. The final point we need to address is how to handle hash conflicts. The result of a hash conflict on a value number will be that we use a sub-optimal placement for a node which we had previously optimised. In order to detect this, our system has been instrumented to record the communication cost of executing a DAG under an optimised execution plan. An increase in this cost on a "re-use" of that plan indicates a hash conflict.
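The following sketch pulls these three mechanisms together: value numbers hashed from ancestor information, a plan cache indexed by them, a saturating counter in the style of a hardware branch predictor to decide whether to trust a cached plan, and a recorded communication cost to spot likely hash conflicts. All names, thresholds and data structures here are illustrative assumptions, not taken from the paper.

```python
# Sketch of the plan cache: DAG nodes get hashed "value numbers" computed from
# their ancestors' placements (Forward Propagation Only information), and a
# small saturating counter -- as in hardware branch predictors -- decides
# whether a cached plan is trusted or the optimiser should be run again.
def value_number(op, ancestor_value_numbers):
    return hash((op, tuple(ancestor_value_numbers)))

class PlanCache:
    def __init__(self, max_counter=3, trust_threshold=2):
        self.entries = {}                  # value number -> [plan, counter, cost]
        self.max, self.threshold = max_counter, trust_threshold

    def lookup(self, vn):
        entry = self.entries.get(vn)
        if entry and entry[1] >= self.threshold:
            return entry[0]                # re-use the cached execution plan
        return None                        # optimise (further) instead

    def record(self, vn, plan, observed_cost):
        entry = self.entries.setdefault(vn, [plan, 0, observed_cost])
        if observed_cost > entry[2]:
            # Higher communication cost than before hints at a hash conflict:
            # distrust the entry again.
            entry[:] = [plan, 0, observed_cost]
        else:
            entry[0] = plan
            entry[1] = min(entry[1] + 1, self.max)
            entry[2] = observed_cost

cache = PlanCache()
vn = value_number('matvec', [value_number('leaf', []), value_number('leaf', [])])
cache.record(vn, plan={'q': 'row_block'}, observed_cost=10)
cache.record(vn, plan={'q': 'row_block'}, observed_cost=10)
print(cache.lookup(vn) is not None)
```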
Summary. We use the full optimisation information, i.e. Forward and Backward Propagation, to optimise. We obtain access to this information by delayed evaluation. We use a scheme based on Forward Propagation Only, with linear complexity in program length, to ensure that we re-use the results of previous optimisations.
4 Performance
In this section, we show performance figures for our library on the Fujitsu AP3000 multicomputer here at Imperial College. As a benchmark we used the conjugate gradient iterative algorithm.
Table 1. Time in milliseconds for 20 iterations of Conjugate Gradient, with a convergence test every 10 iterations, on the AP3000, with 300MHz UltraSparc-2 nodes (average figures, outlying points were omitted). N denotes timings without any optimisation, O timings with optimisation but no caching, and C timings with optimisation and caching of optimisation results. Memory shows time spent in malloc() and free(), Overhead the cost of maintaining data distribution descriptors and suspended library calls at run-time. Speedup is the speedup due to optimisation and execution plan re-use.

      P    N     Comp.  Memory  Overh.  Comms.  Opt.  Total    Speedup
N     1    256   51.14  0.81    3.97    0.00    0.00  55.93
O     1    256   51.22  0.81    3.88    0.00    3.59  59.50    0.94
C     1    256   51.13  0.79    3.25    0.00    0.30  55.47    1.01
N     4    512   51.30  1.03    3.47    30.51   0.00  86.30
O     4    512   51.79  0.86    3.06    22.35   3.73  81.79    1.06
C     4    512   51.69  0.94    2.45    21.94   0.31  77.32    1.12
N     9    768   51.72  1.12    3.46    34.78   0.00  91.10
O     9    768   51.96  0.91    3.08    26.06   3.74  85.75    1.06
C     9    768   51.90  0.98    2.46    25.91   0.31  81.55    1.12
N     16   1024  52.05  1.03    3.48    45.37   0.00  101.93
O     16   1024  52.29  0.88    3.16    35.99   3.82  96.14    1.06
C     16   1024  52.13  0.95    2.49    35.65   0.31  91.53    1.11
The pseudo-code for the algorithm and the source code when implemented using our library were shown in Figure 2, and the timing data are in Table 1.
- Our optimiser avoids two out of three vector transpose operations per iteration. This cannot be seen from the data in Table 1; it was determined analytically and by tracing communication. Optimisation achieves a reduction in communication time of between 20% and 30%. We do not achieve more because a significant proportion of the communication in this algorithm is due to reduce-operations which are unaffected by our current optimisations.
- Run-time overhead and optimisation time are independent of the amount of parallelism used. We suspect that the slight difference is due to cache effects.
- The reason why run-time overhead is reduced by optimisation is that performing fewer redistributions also results in spending less time inspecting data placement descriptors. Caching achieves a further reduction in overhead; this is because it is cheaper to read placement descriptors from cache than to generate them by function calls.
- Without caching of optimisation results, we achieve an overall speedup of around 6%. On platforms which have less powerful processors than the 300MHz UltraSparc-2 nodes we use here, the cost of optimising afresh each time can easily outweigh the benefit of reduced communication.
- With caching of optimisation results, the time we spend optimising is negligible, and we achieve overall speedups of around 12%.
5 Conclusions
We have presented a technique for interprocedural data placement optimisation which exploits run-time control-flow information and is applicable in contexts where the calling program cannot be analysed statically. We present preliminary experimental evidence that the benefits can easily outweigh the run-time costs.
Related Work. There is a huge body of work on data mappings for regular problems; [2] is but one example. Our work relies on this in producing optimised implementations for library operators. However, the problem we seek to address in this paper is different: how to perform interprocedural optimisation over a sequence of such pre-optimised operators. Saltz et al. [10] address the basic problem of how to parallelise loops where the dependence structure is not known statically. Loops are translated into an inspector loop which determines the dependencies at run-time and constructs a schedule, and an executor loop which carries out the calculations planned by the inspector. Saltz et al. discuss the possibility of reusing a previously constructed schedule, but rely on user annotations for doing so. Ponnusamy et al. [9] propose a simple conservative model which avoids the user having to indicate to the compiler when a schedule may be reused. Benkner et al. [4] describe the reuse of parallel schedules via explicit directives in HPF+: REUSE directives for indicating that the schedule computed for a certain loop can be reused and SCHEDULE variables which allow a schedule to be saved and reused in other code sections. Value numbering schemes were pioneered by Ershov [5], who proposed the use of "convention numbers" for denoting the results of computations and avoiding having to recompute them. More recent work on this subject has been done, e.g., by Alpern et al. [1].
Run-Time vs. Compile-Time Optimisation. The particular example studied in this paper would have been amenable to compile-time analysis. We refer back to Section 1 and the arguments of convenience and efficiency we outlined there for why we parallelise this application via a parallel library; using a library then forces us to optimise at run-time. Further, we have shown that a run-time optimiser need not actually perform any more work than a compiler. To compare the quality of run-time and compile-time schedules, consider the loop below, assuming that there are no force-points inside the loop and that the loop is encountered a number of times, evaluation being forced after the loop each time.
for (i = 0; i < N; ++i) { if (...) ... }
This loop can potentially have 2^N control-paths. A compile-time optimiser would have to find one compromise execution plan for all invocations of the loop. With our approach, we optimise the actual DAG which is generated on each occasion. If the number of different DAGs is high, compile-time methods would probably have the edge over ours, since we cannot reuse execution plans. If, however, the number of different DAGs is small, our execution plans for the actual DAGs will be superior to compile-time compromise solutions, and by reusing them, we limit the time spent optimising.
Future work. The most exciting next step is to store cached execution plans persistently, so that they can be reused subsequently for this or similar applications. Although we can derive some benefit from exploiting run-time control-flow information, we have the opportunity to make run-time optimisation decisions based on run-time properties of data; we plan to extend this work to address sparse matrices shortly. The run-time system has to make on-the-fly data placement decisions. An intriguing question raised by this work is to compare this with an optimal off-line schedule.
Acknowledgements. This work was partially supported by the EPSRC, under the Futurespace and CRAMP projects (refs. GR/J 87015 and GR/J 99117). We thank the Imperial College Parallel Computing Centre for the use of their AP3000 machine.
References
1. B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equalities of variables in programs. In 15th Annual ACM Symposium on Principles of Programming Languages, pages 1-11, San Diego, California, Jan. 1988.
2. J. M. Anderson, S. P. Amarasinghe, and M. S. Lam. Data and computation transformations for multiprocessors. SIGPLAN Notices, 30(8):166-178, Aug. 1995.
3. O. Beckmann and P. H. J. Kelly. Runtime interprocedural data placement optimisation for lazy parallel libraries (extended abstract). In Lengauer et al., editors, Proceedings of Euro-Par '97, Passau, Germany, number 1300 in LNCS, pages 306-309. Springer Verlag, Aug. 1997.
4. S. Benkner, P. Mehrotra, J. Van Rosendale, and H. Zima. High-level management of communication schedules in HPF-like languages. Technical Report TR-97-46, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681, USA, Sept. 1997.
5. A. P. Ershov. On programming of arithmetic operations. Communications of the ACM, 1(8):3-6, 1958. Three figures from this article are in CACM 1(9):16.
6. W. D. Gropp. Performance driven programming models. In MPPM'97, Proceedings of the 3rd International Working Conference on Massively Parallel Programming Models, London, U.K., Nov. 1997. To appear.
7. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, California, 1st edition, 1990.
8. M. E. Mace. Storage Patterns in Parallel Processing. Kluwer, 1987.
9. R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In Proceedings of Supercomputing '93, Portland, Oregon, November 15-19, 1993, pages 361-370, New York, NY, USA, Nov. 1993. ACM Press.
10. J. H. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603-612, May 1991.
Enhancing Spatial Locality via Data Layout Optimizations
M. Kandemir 1, A. Choudhary 2, J. Ramanujam 3, N. Shenoy 2, P. Banerjee 2
1 EECS Dept., Syracuse University, Syracuse, NY 13244 ([email protected])
2 ECE Dept., Northwestern University, Evanston, IL 60208 ({choudhar, nagaraj, banerjee}@ece.nwu.edu)
3 ECE Dept., Louisiana State University, Baton Rouge, LA 70803 (jxr@ee.lsu.edu)
Abstract. This paper aims to improve locality of references by suitably choosing array layouts. We use a new definition of spatial reuse vectors that takes into account the memory layout of arrays. This capability creates two opportunities. First, it allows us to develop an array restructuring framework based on a combination of hyperplane theory and reuse vectors. Second, it allows us to observe the effect of different array layout optimizations on spatial reuse vectors. Since the iteration space based locality optimizations also change the spatial reuse vectors, our approach allows us to compare the iteration-space based and data-space based approaches in terms of their effects on spatial reuse vectors. We illustrate the effectiveness of our technique using an example from the BLAS library on the SGI Origin distributed shared-memory machine.
1 Introduction
In most computer systems, exploiting locality of reference is key to high levels of performance. It is well-known that caching of data whether it is private or shared improves memory latency as well as processor utilization. In fact, increasing the cache hit rates is one of the most important factors in reducing the average m e m o r y latency. Although cache hit rates in uniprocessors can be improved by optimizing the organization of the cache such as carefully selecting the cache size, line size and associativity, there are still tasks to be done from software. The programmers and compiler writers often attempt to modify the access patterns of a program so that the majority of accesses are made to the nearby memory. Several efforts have been aimed at iteration space transformations and scheduling techniques to improve locality [10, 9, 13]; these techniques improve d a t a locality indirectly as a result of modifying the iteration space traversal order. In this paper, we focus on an alternative approach to the data locality optimization problem. Unlike traditional compiler techniques, we focus directly on the data space, and attempt to transform data layouts so that better locality is obtained. We present a data restructuring technique which can improve cache performance in uniprocessors as well as multiprocessor systems. Our technique is based on the concept of reuse vectors [9, 13] and uses linear algebra techniques to help a compiler translate potential reuse in a given program into locality. In our
technique, we use data transformations expressed by linear full-rank transformation matrices. We also compare data transformations with linear loop transformations that can be expressed by non-singular transformation matrices. We are particularly interested in optimizing spatial locality, which in turn causes better utilization of the long cache lines that represent the current trend in cache architecture. This paper is organized as follows. In the next section, we outline the background and motivations for our work. Section 3 presents a memory layout representation based on hyperplane theory from linear algebra. In Section 4, we give a method to compute a reuse vector for a given layout expressed by hyperplanes. We also illustrate how this new definition of reuse vector can be employed to guide data layout transformations. In Section 5, we present our preliminary results on the syr2k example from the BLAS library. In Section 6 we review the related work and conclude the paper in Section 7.
2 Background
We represent each iteration of a loop nest of depth n by a loop iteration vector I = (i1, i2, ..., in) where ik is the value of the kth loop from the outermost. Each reference to an m-dimensional array U in such a loop nest is assumed to be an affine function of the iteration represented by I, i.e., I is mapped onto data element A_U I + o_U. Here A_U is an m x n matrix called the access matrix and the m-vector o_U is called the offset vector [13, 9]. Consider the following example.
Example 1:
do i = li, ui
  do j = lj, uj
    U(j+1,i-1) = V(i-1,j+1) + W(j-1,i+j+1) + X(j) + Y(i)
  enddo
enddo
Here, for array U, $A_U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $o_U = (1, -1)^T$. Two references to an array
are said to belong to a uniformly generated reference (UGR) set if they have the same access matrix [3]. We say that reuse exists during the execution of a loop nest if the same data element or nearby data elements (e.g., the elements stored consecutively in a column for a column-major array) are accessed more than once. Given a loop nest, it is important to know where data reuse exists. A simple way of detecting the loops which carry reuse is to derive a set of vectors called reuse vectors [13, 9]. The objective here is first to use vectors to represent directions in which reuse occurs, and then to transform the loop nest such that these vectors will have some desired properties, i.e., forms. Consider a reference with an access matrix $A_U$ and offset vector $o_U$. Two iterations I and J reference the same data element if $A_U I + o_U = A_U J + o_U \Rightarrow A_U (J - I) = 0 \Rightarrow A_U r_{Ut} = 0 \Rightarrow r_{Ut} \in Ker\{A_U\}$, where $r_{Ut} = J - I$. In this case we say that reference $A_U$ exhibits temporal reuse; that is, the same
data element is used by more than one iteration; $r_{Ut}$ is termed the temporal reuse vector and $Ker\{A_U\}$ is referred to as the temporal reuse vector space [13]. As an example, consider the reference X(j) in Example 1. The access matrix of this reference is (0, 1). Therefore, $r_{Xt} \in Ker\{(0, 1)\}$; that is, $r_{Xt} \in span\{(1, 0)^T\}$. This means that there is a temporal reuse carried by the i loop. That is, for a fixed j, successive iterations of the i loop access the same element of array X. For concreteness, we say that $(1, 0)^T$ is the temporal reuse vector for that reference. In general, the loop corresponding to the first non-zero element in a reuse vector is said to carry the associated reuse. It should be noted that all references that belong to a single UGR set have the same access matrix; therefore, they have the same temporal reuse vector. The reuse vector for the reference Y(i), on the other hand, is $(0, 1)^T$. Intuitively, a reuse vector such as $(0, 1)^T$ means that successive iterations of the innermost loop will access the same data element. In a sense, this is a desired reuse vector form, because the reuse is exploited in the innermost loop where it is most useful. On the other hand, a reuse vector such as $(1, 0)^T$ implies that the same data item is accessed in successive iterations of the outer loop. If the trip count of the innermost loop is very large, then it might be the case that between two uses the data item is flushed from the cache. In that case we say that the reuse has not been converted into locality. The main objective of compiler-based data reuse analysis can be cast as the problem of transforming programs such that as many reuses as possible will be converted into locality. In other terms, this means to transform programs such that the resulting codes will have desired reuse vector forms. In addition to temporal reuse, spatial reuse is very important in scientific codes. Assuming a column-major memory layout for arrays, a spatial reuse occurs when the same column of an array is accessed more than once. Consider Example 1 again. Assuming column-major memory layouts for all arrays, we note that successive iterations of the j loop will access elements in the same column of array U. In that case, we say that array U has spatial reuse with respect to the j loop. Similarly, successive values of the i loop access elements of the same column of array V; that is, array V has spatial reuse with respect to loop i. It should be noted that spatial reuse is defined with respect to a given memory layout, which is column-major in our case. The spatial reuse exhibited by array W, however, is slightly more complicated to analyze. For this array, elements of a given column c are accessed if and only if i + j = c - 1. In mathematical terms, in order for two iteration vectors I and J to access elements residing in the same column, they should satisfy the condition $A_{U_s} I = A_{U_s} J$, where $A_{U_s}$ is the matrix $A_U$ (for an array U) with the first row deleted [13]. From this condition, since $A_{U_s}$ represents a linear mapping, $A_{U_s}(J - I) = 0 \Rightarrow J - I \in Ker\{A_{U_s}\} \Rightarrow r_{U_s} \in Ker\{A_{U_s}\}$, where $r_{U_s} = J - I$. In this case, $r_{U_s}$ is termed the spatial reuse vector and $Ker\{A_{U_s}\}$ is referred to as the spatial reuse vector space [13]. When there is no confusion, we use $r_U$ instead of $r_{U_s}$ for the sake of clarity. The loop which corresponds to the first non-zero element of the generator vector of $Ker\{A_{U_s}\}$ is said to carry spatial reuse.
In Example 1, for array U, the spatial reuse is carried by j loop. In this
example, there exists a spatial reuse vector for each reference, and these are $v_U = (0, 1)^T$, $v_V = (1, 0)^T$, and $v_W = (1, -1)^T$. Recall that all these reuse vectors are with respect to column-major memory layout. It is possible that a reference may have more than one spatial reuse vector. In that case, these individual reuse vectors constitute a matrix called the spatial reuse matrix, which will be denoted in this paper by $R_U$ for an array U or for a reference to an array U when there is no confusion. Individual spatial reuse matrices from individual references to possibly different arrays constitute a combined reuse matrix [9]. The order of the vectors in the combined reuse matrix might be important. Like Li [9], we assume that they are prioritized from left to right according to the frequency of occurrence. An important attribute of a reuse vector is its height, which is defined as the number of dimensions from the first non-zero entry to the last entry. One way of increasing cache locality is to introduce more leading zeros in the reuse vector, called height reduction [9]. In this context, the best possible spatial reuse vector is $r = (0, 0, \ldots, 0, 1)^T$, meaning that the spatial reuse will be carried by the innermost loop in the nest. We emphasize the fact that temporal reuse can only be manipulated by iteration space (loop) transformations. Data space transformations cannot directly affect the temporal reuse exhibited by a reference. Since, in this paper, we are interested only in data layout transformations, we consider only spatial reuse vectors. In addition, we only focus on self-spatial reuses [13].
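The column-major spatial reuse vectors quoted above for Example 1 can be checked mechanically. The following sketch (an addition for illustration, using sympy for exact null-space bases; the paper does this computation by hand) deletes the first row of each access matrix and takes the null space:

```python
# Sketch: compute the column-major spatial reuse vectors of Example 1 by taking
# the null space of each access matrix with its first row deleted.
from sympy import Matrix

access = {
    'U': Matrix([[0, 1], [1, 0]]),   # U(j+1, i-1)
    'V': Matrix([[1, 0], [0, 1]]),   # V(i-1, j+1)
    'W': Matrix([[0, 1], [1, 1]]),   # W(j-1, i+j+1)
}

for name, A in access.items():
    A_s = A[1:, :]                   # delete the first row (column-major case)
    basis = A_s.nullspace()          # spatial reuse vector(s), up to sign/scale
    print(name, [tuple(v) for v in basis])
# U -> [(0, 1)], V -> [(1, 0)], W -> [(-1, 1)] (i.e. (1, -1) up to sign)
```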
3 Hyperplanes: An Abstraction for Array Layouts
This section reviews our work [6] on representing memory layouts of multidimensional arrays using hyperplanes. In an m-dimensional data space, a hyperplane can be defined as a set of tuples $\{(a_1, a_2, \ldots, a_m) \mid g_1 a_1 + g_2 a_2 + \cdots + g_m a_m = c\}$ where $g_1, g_2, \ldots, g_m$ are rational numbers called hyperplane coefficients and c is a rational number called the hyperplane constant [12]. For convenience, we use a row vector $g^T = (g_1, g_2, \ldots, g_m)$ to denote a hyperplane family (with different values of c), whereas g corresponds to the column vector representation. In a two-dimensional data space, we can think of a hyperplane family as parallel lines for a fixed coefficient set and different values of c. An important property of hyperplanes is that two data points (in our case, array elements) $d_1$ and $d_2$ belong to the same hyperplane g if
$g^T d_1 = g^T d_2.$    (1)
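The test in equation (1) is trivial to mechanise; the tiny sketch below (an illustration added here, not part of the paper) checks hyperplane membership for the column-major case discussed next:

```python
# Small sketch of equation (1): two array elements lie on the same layout
# hyperplane (and are assumed to have spatial locality) iff g.d1 == g.d2.
def same_hyperplane(g, d1, d2):
    return sum(gi * a for gi, a in zip(g, d1)) == sum(gi * a for gi, a in zip(g, d2))

g_column_major = (0, 1)              # hyperplanes are columns
print(same_hyperplane(g_column_major, (3, 5), (7, 5)))   # True: same column
print(same_hyperplane(g_column_major, (3, 5), (3, 6)))   # False: different columns
```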
For example, a hyperplane vector such as $(0, 1)^T$ indicates that two array elements belong to the same hyperplane as long as they have the same value for the column index (i.e., the second dimension); the value for the row index does not matter. We note that a hyperplane family can be used to partially define the memory layout of an array. For example, in the two-dimensional array case, a hyperplane family represented by (0, 1) indicates that the array in question is divided into columns such that each column corresponds to a hyperplane with a different c value (hyperplane constant). The data elements that have the same c value (that is, the elements which make up a column) are stored in memory consecutively (ordered by their column indices). We assume in a broader sense that two data elements which lie along a hyperplane with a specific c value have spatial locality. Another way of saying this is that two elements $d_1$ and $d_2$ have spatial locality if they satisfy equation (1). Notice that this definition of spatial locality is coarse and does not hold at array boundaries, but it is suitable for our purposes. In higher dimensions, we need more than one hyperplane. We refer the interested reader to [6] for an in-depth discussion of hyperplane-based layout representation. In the remainder of this paper, for the sake of simplicity, we mainly focus on two-dimensional arrays. Our results easily extend to higher dimensional arrays.
4 Reuse Vectors under Different Layouts
4.1 Determining Spatial Reuse Vectors
The definition of the spatial reuse vector presented earlier is based on the assumption that the memory layout for all arrays is column-major. In this section, we give a definition of the spatial reuse vector under a given memory layout. We start with the following theorem. The reader is referred to [7] for the proofs of all the theorems in this paper.
Theorem 1. Let g represent a memory layout for an array U, and $r_U$ the spatial reuse vector associated with a reference represented by the access matrix $A_U$. Then, the following equality holds between g, $r_U$ and $A_U$:
$g^T A_U r_U = 0$    (2)
If the memory layout of an array is represented by a matrix L (as in the three- or higher dimensional cases), then equation (2) should be satisfied for each row of L. This theorem gives us the relation between memory layouts and spatial reuse vectors and is very important. Notice that for a given $g^T$, in general, from $g^T A_U r_U = 0$, the spatial reuse vector $r_U$ can always be found as
$r_U \in Ker\{g^T A_U\}$    (3)
Let us consider the following example assuming row-major memory layout for all arrays:
Example 2:
do i = li, ui
  do j = lj, uj
    do k = lk, uk
      U(i+k,j+k) = V(i+j,i+k,j+k) + W(k,j,i)
    enddo
  enddo
enddo
The access matrices for this loop nest are
$A_U = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$, $A_V = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$, and $A_W = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$.
The row-major layout is represented by $g^T = (1, 0)$ for the 2-dimensional array and by $L = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ for the 3-dimensional arrays, and we find the following reuse matrices:
$R_U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & -1 \end{pmatrix}$, $R_V = \begin{pmatrix} 1 \\ -1 \\ -1 \end{pmatrix}$, and $R_W = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}$.
Most of the previous research has concentrated on optimizing $r_U$ given a fixed memory layout. For instance, if an array is stored in column-major order in memory, i.e., $g^T = (0, 1)$, from equation (2) $g^T A_U r_U = 0 \Rightarrow A_{U_s} r_U = 0$; this is the definition of the spatial reuse vector used previously [9, 13].
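The reuse matrices of Example 2 under relation (3) can be reproduced with a few lines of code. This sketch is an added illustration (sympy is used for exact null spaces; the paper derives these by hand), and the variable names are mine:

```python
# Sketch of relation (3) on Example 2: under a given layout (row-major here),
# each spatial reuse vector spans Ker{g^T A}.
from sympy import Matrix

A_U = Matrix([[1, 0, 1], [0, 1, 1]])                 # U(i+k, j+k)
A_V = Matrix([[1, 1, 0], [1, 0, 1], [0, 1, 1]])      # V(i+j, i+k, j+k)
A_W = Matrix([[0, 0, 1], [0, 1, 0], [1, 0, 0]])      # W(k, j, i)

g_2d = Matrix([[1, 0]])                              # row-major, 2-D array
L_3d = Matrix([[1, 0, 0], [0, 1, 0]])                # row-major, 3-D arrays

for name, L, A in (('U', g_2d, A_U), ('V', L_3d, A_V), ('W', L_3d, A_W)):
    print(name, [tuple(v) for v in (L * A).nullspace()])
# U -> a two-dimensional space containing (0,1,0) and (1,0,-1) (up to sign),
# V -> (1,-1,-1) up to sign, W -> (1,0,0)
```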
4.2 Reproducing the Effects of Iteration Space Transformations
In this subsection, we show how to reproduce the effect of a linear loop transformation by using linear data transformations instead. Li [9] shows that an iteration space transformation T transforms a reuse vector $v_U$ into $v_U' = T v_U$. In this paper, we treat temporal locality as a special case of spatial locality. That is, accessing the same element can be considered accessing the same column (for a column-major layout).
Theorem 2. Given a 'single' loop nest, if a reference is optimized for spatial locality using an iteration space transformation matrix T, the same effect can be obtained by a corresponding data transformation matrix M.
Consider the following matrix-multiplication example assuming column-major layout:
Example 3:
do i = li, ui
  do j = lj, uj
    do k = lk, uk
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    enddo
  enddo
enddo
The spatial reuse matrices are $R_C = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}$, $R_A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$, and $R_B = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}$.
By taking into account the frequency of occurrence of each vector, we can write a combined reuse matrix as $R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$. Here the leftmost column has the highest priority, whereas the rightmost column has the lowest. In this case, Li's algorithm [9] finds the following matrix to optimize the locality: $T = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$.
The new reuse matrix is $R' = TR = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$. Notice that the reuse vector of highest priority is optimized very well, i.e., its height is reduced to 1. The transformed program is as follows:
do u = lu, uu
  do v = lv, uv
    do w = lw, uw
      C(w,u) = C(w,u) + A(w,v) * B(v,u)
    enddo
  enddo
enddo
In the optimized program, the references C(w,u) and A(w,v) have spatial locality in the innermost loop (w-loop) whereas the reference B(v,u) has temporal locality in the innermost loop. Since, in this paper, we treat temporal locality in a loop as a special case of spatial locality, we conclude that in this program all three references have spatial locality in the innermost loop. Next we show how to obtain the effect of this transformation with data layout transformations. Notice that after the transformation T, the final spatial reuse matrices for the individual references are $R_C' = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$, $R_A' = \begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 1 & 0 \end{pmatrix}$, and $R_B' = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$.
Now, from each reuse matrix, we choose the most frequently used reuse vector (considering all reuse vectors in all reuse matrices), and use it as the target reuse vector for the associated array. In this example, the target reuse vector happens to be $(0, 0, 1)^T$ for all references.
For array C: $g^T A_C r_C' = (g_1, g_2) \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 0 \Rightarrow (g_1, g_2)^T \in Ker\{(0, 0)\}$
For array A: $g^T A_A r_A' = (g_1, g_2) \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 0 \Rightarrow (g_1, g_2)^T \in Ker\{(0, 1)\}$
For array B: $g^T A_B r_B' = (g_1, g_2) \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 0 \Rightarrow (g_1, g_2)^T \in Ker\{(1, 0)\}$
Thus, the array C can have any layout (say row-major), the array A should be row-major, and the array B should be column-major. Notice that if we assign these layouts, then we have spatial locality for all the references with respect to the innermost loop, a result that has also been obtained by the loop transformation T given earlier. From the previous discussion, we can conclude that given a single loop nest where each array has one UGR set [3], it is always possible to obtain the locality effect of a linear loop transformation by corresponding linear data transformations.
4.3 Optimizing for the Best Locality
So far, we have shown that under certain conditions it is possible to reproduce the effect of a loop transformation by data transformations. By studying the effect of a loop transformation on a spatial reuse vector, we can find a corresponding data transformation which can generate the same effect. Now, we go one step further and prove a stronger result.
Theorem 3. Given a loop nest where each array has one UGR set, it is always possible to optimize the array layouts such that each reference will have spatial locality with respect to the innermost loop.
Consider the following example with a column-major layout for all arrays.
Example 4:
do i = li, ui
  do j = lj, uj
    do k = lk, uk
      U(i,j,k) = U(i-1,j,k+1) + U(i,j,k-1) + V(j+1,k+1,i-1)
    enddo
  enddo
enddo
Ignoring the temporal reuses, the spatial reuse matrices for the individual arrays are $R_U = (1, 0, 0)^T$ and $R_V = (0, 1, 0)^T$. Therefore, the combined reuse matrix is $R = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}$. Notice that the first column occurs more frequently (three times, to be exact). The best iteration space transformation from the locality point of view (using Li's approach [9]) is $T = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}$, which would generate $R' = TR = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$. Unfortunately, due to the data dependence involving array U, this transformation is not legal. In search of a second best transformation, we have two options. For the first option, $T_1 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, which gives $R' = T_1 R = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{pmatrix}$. In the transformed nest, the spatial locality for array U is exploited in the second innermost loop (v-loop) and the spatial locality for array V is exploited in the outermost loop (u-loop).
for each array U ∈ 𝒰
    compute the access matrix A_U representing U
    i = n
    while (p(i) = 1)
        i = i - 1
    end while
    r = e_i
    compute G = Ker{(A_U r)^T}
    use the elements of the basis set of G as rows of the layout matrix for U
end for each
Fig. 1. Algorithm for optimizing spatial locality.
For the second option, $T_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, which gives $R' = T_2 R = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix}$. In the transformed nest, the spatial locality for array U is exploited in the outermost loop (u-loop) and the spatial locality for array V is exploited in the innermost loop (w-loop). We note that while T1 improves the spatial locality of U slightly, T2 improves the spatial locality for V. It is easy to see that data layout transformations can easily reproduce the effects of T1 and T2 (see [7]). However, neither T1 nor T2 (nor their data transformation counterparts) is able to optimize the spatial locality for both arrays in the innermost loop. The reason for the failure of the iteration space transformations is the data dependences which prevent the most desired loop permutation. We now show how a different data layout transformation can optimize the same nest. We use the best possible spatial reuse vector as the desired (or target) vector for both arrays. In other words, $r_U' = r_V' = (0, 0, 1)^T$. Then $g^T A_U r_U' = 0 \Rightarrow (g_1, g_2, g_3)^T \in Ker\{(0, 0, 1)\}$, meaning that the array U should have a memory layout such that the last dimension is the fastest changing dimension. The row-major layout is such a memory layout, with the layout matrix $L_U = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$. Similarly, $g^T A_V r_V' = 0 \Rightarrow (g_1, g_2, g_3)^T \in Ker\{(0, 1, 0)\}$, meaning that the memory layout of array V must be such that the second dimension is the fastest changing dimension. An example layout matrix which satisfies that is $L_V = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$.
Fig. 2. Execution times (in sec.) of syr2k with different layouts on SGI Origin.
In this case, for all the references the spatial reuse will be exploited in the innermost loop. This example clearly shows that the data transformations may be effective in some cases where the iteration space transformations fail.
4.4 Optimizing for Shared-Memory Multiprocessors
In the shared-memory multiprocessor case, we should take into account the issues related to parallelism as well. As a rule, to prevent a common form of false sharing, a loop which carries spatial reuse should not be parallelized [9]. False sharing occurs when two processors access the same coherence unit (at least one of them writes) without sharing a data element [4]. We assume that the parallelism decisions are made by a previous pass in the compilation process, and the information for a loop nest is available to our algorithm in the form of a vector p, where p(i) (the i-th element) is one if loop i is parallelized and zero otherwise. Our memory layout determination algorithm is given in Figure 1. In the figure, 𝒰 is the set of all arrays referenced in the nest. The symbol n denotes the number of loops in the nest, and e_i is a vector with all zero entries except for the i-th entry, which is 1. The algorithm assumes that there is no conflict (in terms of layout requirements) between references to a particular array. If there is a conflict, then a conflict resolution scheme should be applied [6]. For each array, a subset of all possible spatial reuse vectors, starting with the best possible vector, is tried. The sequence of trials corresponds to e_n, ..., e_2, e_1. A vector is rejected as the target spatial reuse vector if the associated loop is parallelized. Otherwise, the e_i with the largest i is selected as the target reuse vector. Once a spatial reuse vector is chosen, the remainder of the algorithm involves only the computation of a null set and a set of basis vectors for it. Since optimizing compilers for shared-memory multiprocessors are quite successful in parallelizing the outermost loops, the algorithm terminates quickly, and the spatial locality in the innermost loops is exploited without incurring severe false sharing.
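The algorithm of Figure 1 is short enough to sketch directly. The code below is an added illustration under stated assumptions: the input format (a dictionary of access matrices and a 0/1 parallelisation vector p) is invented for the sketch, and sympy is used for the null-space computation.

```python
# Sketch of the Figure 1 algorithm: pick the innermost non-parallelised loop i,
# take e_i as the target spatial reuse vector, and use a basis of
# Ker{(A_U e_i)^T} as the rows of U's layout matrix.
from sympy import Matrix, zeros

def choose_layouts(access_matrices, p):
    n = len(p)
    layouts = {}
    for name, A in access_matrices.items():
        i = n
        while i > 1 and p[i - 1] == 1:      # skip parallelised loops (p is 1-based here)
            i -= 1
        r = zeros(n, 1)
        r[i - 1] = 1                        # target reuse vector e_i
        constraint = (A * r).T              # layout rows g must satisfy g . (A r) = 0
        basis = constraint.nullspace()
        layouts[name] = Matrix([list(v.T) for v in basis]) if basis else None
    return layouts

# Example 4: U(i,j,k) and V(j+1,k+1,i-1), no loop parallelised.
acc = {'U': Matrix.eye(3), 'V': Matrix([[0, 1, 0], [0, 0, 1], [1, 0, 0]])}
print(choose_layouts(acc, p=[0, 0, 0]))
# U -> rows spanning {g : g3 = 0} (row-major), V -> rows spanning {g : g2 = 0}
```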
5 Experimental Results
We conducted experiments on an SGI Origin, which is a distributed shared-memory machine that uses 195MHz R10000 processors and has 32KB L1 data
cache and 4MB L2 unified cache. We used the syr2k code (in C) from BLAS to show that our data layout transformation framework can handle complicated access patterns. Assuming a column-major memory layout, all the arrays have poor locality. The version that is optimized by loop transformations (from [9], assuming column-major layouts) optimizes spatial locality for all the arrays, but has two drawbacks: first, the loop bounds and array subscript expressions are very complicated, and second, the temporal locality for array C in the original program has been converted to spatial locality. Our framework, on the other hand, decides suitable memory layouts. Figure 2 shows the performance for eight possible (permutation-based) layouts on an SGI Origin using 1024 x 1024 double matrices. The legend x-y-z means that the memory layouts for C, A and B are x, y and z respectively, where c means column-major and r means row-major. The layout-optimized version obtained by our framework (marked dopt) and the loop-optimized version (marked lopt) are also shown as the last two bars. Notice that our framework improves the performance substantially, and no dimension re-indexing (permutation-based layout transformation) can obtain this performance.
6 Related Work
The compiler work in optimizing loop nests for locality can be divided into two groups: (1) iteration space based optimizations and (2) data layout optimizations. In the first group, Wolf and Lam [13] present definitions of different types of reuses and propose an algorithm to maximize locality by evaluating a subset of legal loop transformations. In contrast, Li [9] uses the concept of 'reuse distance'. His algorithm constructs a combined reuse matrix and tries to reduce its 'height' using techniques from linear algebra. McKinley et al. [10] offer a unified optimization technique consisting of loop permutation, loop fusion and loop distribution. Our work on locality is different from those mentioned. First, we focus on data space transformations instead of iteration space transformations. Second, we use a new definition of spatial reuse vector whereas the reuse vectors used in [9] and [13] are oriented for a uniform memory layout. In the second group, O'Boyle and Knijnenburg [11] focus on restructuring the code given a data transformation matrix and show its usefulness in optimizing spatial locality; in contrast, we concentrate more on the problem of determining suitable layouts by taking into account false sharing as well. Anderson et al. [1] propose a data transformation technique--comprised of permutations and strip-mining--that restructures the data in the shared memory space such that the data for each processor are stored in nearby locations. In contrast, we focus on a larger space of data transformations. Cierniak and Li [2] and Kandemir et al. [5] propose optimization techniques that combine loop and data transformations in a unified framework, but restrict the transformation space. Even in the restricted space, they perform sort of exhaustive search. Jeremiassen and Eggers [4] use data transformations to reduce false sharing. Our framework also
considers reducing a common form of false sharing as well. Leung and Zahorjan [8] present an array restructuring framework to optimize locality in uniprocessors. Our work differs from theirs in several points: (1) our technique is based on an explicit representation of memory layouts, which is central to a unified loop and data transformation framework such as ours; (2) our technique finds optimal memory layouts in a single step rather than first determining a transformation matrix and then refining it for minimizing memory space using Fourier-Motzkin elimination; and (3) we explicitly take false sharing into account for multiprocessors.
7 Conclusions
In this paper, we have presented a definition of the reuse vector under a given memory layout. For this, we have represented the memory layout of a multidimensional array using hyperplane families. Then we have presented a relation between the layout representation and the spatial reuse vector. We have shown that this relation can be exploited in at least two ways. For a given layout and reference, we can find the spatial reuse vector, or for a given desired spatial reuse vector, we can determine the target layout. This second usage allows us to develop a data layout reorganization framework based on the existing compiler technology. Given a loop nest, our framework can optimize the memory layouts of arrays by considering the best possible reuse vectors.
References
1. J. Anderson, S. Amarasinghe, and M. Lam. Data and computation transformations for multiprocessors. Proc. 5th SIGPLAN Symp. on Principles and Practice of Parallel Programming, Jul. 1995.
2. M. Cierniak and W. Li. Unifying data and control transformations for distributed shared memory machines. Proc. SIGPLAN '95 Conf. on Programming Language Design and Implementation, Jun. 1995.
3. D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. J. Parallel and Distributed Computing, 5:587-616, 1988.
4. T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. Proc. 5th SIGPLAN Symp. on Principles and Practice of Parallel Programming, Jul. 1995.
5. M. Kandemir, J. Ramanujam, and A. Choudhary. Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines. Proc. 1997 International Conference on Parallel Architectures and Compilation Techniques, pp. 236-247, Nov. 1997.
6. M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A data layout optimization technique based on hyperplanes. TR CPDC-TR-97-04, Northwestern Univ.
7. M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Optimizing spatial locality in loop nests using linear algebra. Proc. 7th Workshop on Compilers for Parallel Computers, 1998.
8. S.-T. Leung and J. Zahorjan. Optimizing data locality by array restructuring. Technical Report TR 95-09-01, CSE Dept., University of Washington, Sep. 1995.
9. W. Li. Compiling for NUMA parallel machines. Ph.D. dissertation, Cornell University, 1993.
10. K. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 1996.
11. M. O'Boyle and P. Knijnenburg. Non-singular data transformations: Definition, validity, applications. Proc. 6th Workshop on Compilers for Parallel Computers, pp. 287-297, 1996.
12. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Parallel and Distributed Systems, 2(4):472-482, Oct. 1991.
13. M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. ACM SIGPLAN '91 Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
Parallelization of Unstructured Mesh Computations Using Data Structure Formalization
Rainer Koppler
Department of Computer Graphics and Parallel Processing (GUP), Johannes Kepler University, A-4040 Linz, Austria/Europe
koppler@gup.uni-linz.ac.at
Abstract. This paper introduces a concept for semi-automatic parallelization of unstructured mesh computations called data structure formalization. Unlike existing concepts it does not expect knowledge about parallelism but just enough knowledge about the application semantics such that a formal description of the data structure implementation can be given. The parallelization tool Parlamat uses this description for deduction of additional information about arrays and loops such that more efficient parallelization can be achieved than with general tools. We give a brief overview of our data structure modelling language and first experiences with Parlamat's capabilities by means of the translation of some real-size applications from Fortran 77 to HPF.
1 Introduction
Unstructured meshes are used in a large number of scientific codes, for example, in computational fluid dynamics and structural analysis codes. Unfortunately, these codes are hard to parallelize efficiently. Automatic parallelization cannot satisfy today's expectations due to restrictions of compile-time techniques. On the other hand, languages such as High Performance Fortran (HPF) [9] require considerable skill and efforts from programmers, in particular with real-size unstructured mesh codes. In this paper we narrow the gap between the above two approaches using a novel concept called data structure formalization. Unlike existing semi-automatic approaches our concept does not confront users with parallelism but just expects knowledge about the sequential application such that the user can provide a brief and formal description of the mesh data structure. We show that such a description provides enough information in order to let a software tool perform automatic array distribution and detect parallelism in loops containing indirect array accesses. We implemented a tool with such capabilities that utilizes the results for translation of the Fortran 77 code to HPF. The tool's back-end automatically inserts directives for array and loop distribution according to known HPF parallelization techniques. First experiences with the application of our concept to real-size codes gave encouraging results and justify development of special-purpose parallelization systems for particular application classes.
2 Parallelization of Mesh Computations
The main reason why today's parallelizing compilers fail to produce efficient results is the complexity of unstructured mesh implementations. For example, finite element meshes consist of several arrays, where some are used as index arrays and others store application data. First, this complexity inhibits automatic distribution of arrays in a block or cyclic fashion because each mesh partition yields a set of irregular array distributions. Run-time partitioning techniques [18] address this problem but they are expensive and require parallel loops to be known. Secondly, automatic detection of parallel loops in unstructured mesh codes is unfeasible using known techniques. Such codes are dominated by indirect array accesses, which prevent application of compile-time dependence tests. Consequently, most parallelizers leave loops containing indirect array accesses sequential. Some approaches allow speculative parallel execution [2] but require additional run-time analysis, which is costly in terms of run time and storage overhead. Therefore, researchers became aware of the fact that knowledge from the programmer must be specified to the compiler in order to achieve efficient parallelization. They designed extensions to sequential languages, where HPF is the most popular result. Further, several parallelization strategies for unstructured mesh codes using the family of HPF languages have been developed [5, 16]. However, these strategies still keep programmers far away from automatic parallelization. So far they have been applied only to small and simple kernels, and manual application to large-scale codes is time-consuming and error-prone [8]. Further, in most cases the HPF programmer and the code author are not the same person. The semi-automatic concept we describe in this paper provides a basis for automatic application of parallelization strategies for unstructured mesh codes. In our tool we use a parallelization strategy that makes use of facilities provided by HPF. We adopted basic ideas from existing work on this subject and defined a general framework for automatic application of these ideas and our own additions. Our strategy consists of code parallelization and mesh data preparation. Details on the two steps are beyond the scope of this paper and can be found elsewhere. In the first step directives for array and loop distribution are inserted [5, 16]. The second step consists of mesh partitioning and renumbering [16, 13].
3 Data Structure Formalization
As mentioned earlier automated compiler techniques cannot parallelize unstructured mesh computations efficiently due to lack of information that is accessible to a compiler. An obvious solution to this problem is specification of missing information to the compiler in a systematic way, for example, specification of the mesh data structure. McManus [13] considers systematic description of data structures impractical because in theory there are unlimited possibilities of implementing an unstructured mesh.
Although in theory there may be an unlimited variety of mesh implementations, in practice programmers prefer implementation variants that provide a good compromise between low storage overhead and simplicity and efficiency of algorithms. A comprehensive collection of different unstructured mesh codes showed us that in fact there exists one particular variant that is preferably used by Fortran 77 programmers. The main reasons for the dominance of this variant are legibility and efficiency of algorithms that are built up on it. A mesh implementation that follows this variant is shown in Fig. 1.
(figure: a small triangular mesh of six elements over seven vertices, stored in two arrays: INTEGER elts(3,6), holding the three vertex numbers of each triangle, and REAL coords(2,7), holding the x and y coordinates of each vertex)
Fig. 1. Implementation of a triangular mesh
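For readers less familiar with this storage scheme, the following minimal sketch transcribes the variant of Fig. 1 into Python (the concrete connectivity and coordinate values are illustrative placeholders, not the exact contents of the figure):

elts = [                      # 3 x 6: column e holds the vertex numbers of triangle e
    [1, 2, 4, 4, 5, 6],
    [2, 3, 2, 5, 6, 7],
    [4, 4, 5, 6, 7, 3],       # illustrative values only
]
coords = [                    # 2 x 7: x and y coordinate of every vertex
    [0.0, 1.0, 2.0, 0.5, 1.5, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
]

def vertex_of(element, k):
    """Indirect access as used throughout Sect. 4: k-th vertex of a triangle (1-based)."""
    return elts[k - 1][element - 1]

x_of_first_vertex = coords[0][vertex_of(3, 1) - 1]   # x coordinate of vertex 4 in this sketch

The index array elts drives all indirect accesses, while coords, like every data attribute, is addressed by vertex numbers; this separation is exactly what the modelling language introduced below captures.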
Thereupon we generalized the implemention variant for application to arbitrary mesh structures that can be embedded into the Entity-Relationship (ER) diagram shown in Fig. 2a. Further we developed a tiny modelling language that provides description of mesh structures and mapping of a mesh structure to variables of the application code. Whenever the data structure of an application can be described through this language, semantic knowledge about it can be specified to the compiler in machine-readable form. A remarkable property of our modelling language is that it does not confront the user with parallelism but just expects enough knowledge about the application semantics such that a "formal documentation" of parts of the sequential code can be given. Hence, the language is accessible to a broad range of users, even to those who are inexperienced in parallelization or who do not want to get in touch with parallelism. The rest of this section gives a brief overview of our data structure modelling language. First we describe some basic concepts and terms, which emerged during generalization of the implementation variant mentioned above. Then we show applications of the language by means of different mesh structures taken from real applications and benchmarks. A mesh consists of a number of entities - so-called objects, which belong to different object types, for example, vertex, edge, triangle, or tetrahedron. Mesh applications define a hierarchy between object types, that is, objects may consist of several subobjects with lower dimensionality. For example, a hexahedron can be viewed as a composition of eight vertices, twelve edges, or six quadratic faces. Each application defines a particular hierarchy, which reflects a subdiagram of the ER diagram shown in Fig. 2a. Mesh applications associate object types with specific properties - so-called attributes. For example, fluid dynamics applications usually associate vertices
438 with density, velocity, and energy [14], or a solid mechanics application may associate hexahedrons with normal stresses and shear stresses [12]. We call application-specific properties data attributes. Besides those, each object type that refers to one or several subobjects has a structure attribute. For example, the structure attribute of the triangle type may be a triplet of vertex numbers. The modelling language covers two major issues: (1) definition of applicationspecific object types with their attributes, which yields a subdiagram of the ER diagram shown in Fig. 2a, and (2) mapping of attributes to array variables of the application code. Application-specific object types are defined through derivation of pre-defined base types, which are enclosed by rectangles in Fig. 2a. A definition specifies the name of the derived type, a base type, possibly a subobject type, and a list of data attributes. Figure 3a shows a simplified mesh specification taken from a fluid dynamics application and Fig. 2b depicts its corresponding ER diagram. The pre-defined type Vertex provides the pre-defined data attributes x and y for the description of vertex coordinates.
Fig. 2. (a) Universal ER diagram (b) Application-specific ER diagram
Mapping of attributes to variables of the application code assumes a mesh implementation as it is shown in Fig. 1. Each attribute of object type T is implemented by an array, which is indexed by numbers of T objects. Figure 3b shows how attributes of the mesh specified in Fig. 3a can be mapped to the arrays given in Fig. 1.
(a) object NSVtx is Vertex with density, velocity, energy.
    object NSTri is Elem2D(NSVtx) with area.

(b) NSVtx(i).x -> coords(1,i).
    NSVtx(i).y -> coords(2,i).
    NSVtx(i).density -> d(i).
    ...
    NSTri(i) -> elts(:,i).
    NSTri(i).area -> a(i).

Fig. 3. Specification of a triangular mesh: (a) structure (b) attribute mapping
Some applications implement special treatment of objects that lie on the boundary of a mesh. Such objects may require additional attributes and may be collected in additional object lists. Our modelling language allows definition of boundary object types through derivation of the pre-defined base type Boundary. Figure 4a shows an example of a boundary vertex type. The array bndlst stores indices of all NSVtx objects that belong to the boundary vertex type SlippingVtx. Finally the language also provides support for applications that use hybrid meshes. The number of subobjects of a hybrid object type is not constant but must be specified for each object separately, for example, using an additional array or array section. Figure 4b shows the definition and mapping of a hybrid object type we found in a Poisson solver. The code uses mixed meshes consisting of tetrahedra and hexahedra.

(a) object SlippingVtx is Boundary(NSVtx) with info.
    SlippingVtx(i) -> bndlst(i).        ! bndlst contains NSVtx indices
    SlippingVtx(i).info -> bndinf(i).

(b) object FEMSolid is Elem3D(FEMNode). object FEMNode is Vertex.
    FEMSolid(i).# -> elts(1,i).         ! 1st value = number of FEMNode indices
    FEMSolid(i) -> elts(2:9,i).         ! room for up to eight FEMNode indices

Fig. 4. (a) Boundary object type (b) Hybrid object type
In this paper we describe our modelling language using simple examples. However, we have also studied mesh implementation strategies designed for multigrid methods [7] and dynamic mesh refinement [10] and observed that these strategies do not necessitate changes to the current language design.
4 The Parallelization System
Usually today's parallelizers are applicable to every program written in a given language and their analysis capabilities are restricted to the language level. By way of contrast, very few parallelizers have been developed that focus on special classes of applications, for example, unstructured mesh computations. Such systems receive more knowledge about the application than general-purpose parallelizers, thus they require almost no user intervention and can produce more efficient results. Parlamat (Parallelization Tool for Mesh Applications) is a special-purpose parallelizer, which uses our modelling language for specification of knowledge about the implementation of mesh data structures. It consists of a compiler that translates Fortran 77 codes to HPF and a tool for automatic mesh data preparation. The compiler uses the specification for deduction of knowledge for array distribution and dependence analysis and for recognition of particular programming idioms.
4.1 Additional Array Semantics
After Parlamat has carried out basic control and data flow analysis, it performs a special analysis that determines additional semantics for arrays of the application code. It examines the use of arrays in loops over mesh objects, in particular, array indexing and assignments to INTEGER arrays, and extends symbol table information by the following properties:
- Index type: Arrays that are indexed using a loop variable receive the object type associated with the loop variable as index type. Other arrays, for example, coords in Fig. 3b, receive this type using the mesh specification.
- Contents type: Index arrays that store object indices, which can be detected through analysis of assignments, receive the type of these objects as contents type. Other arrays, for example, elts in Fig. 3b, receive this type using the mesh specification.
- Uniqueness: Index arrays that do not contain repeated values are called unique. This property is set using knowledge from the mesh specification or through analysis of index array initializations in the application code. For example, elts, which is the structure array of an Elem2D type, is not unique by definition, which can be seen from Fig. 1.
The first two properties play an important role for array distribution. Arrays that show the same index type become aligned to each other. The distribution of object types is determined by the mesh partition. The third property is used by the dependence test. Before special analysis starts, the mesh specification is evaluated in order to determine the above properties for arrays occurring in the specification. After this, the analysis proceeds in an interprocedural manner, where properties are propagated from actual to dummy arguments and vice versa, as sketched below.
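A hedged sketch of how such symbol-table annotations might look (the class and field names are ours for illustration; they are not Parlamat's actual data structures):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ArrayInfo:
    index_type: Optional[str] = None     # object type whose numbers index the array
    contents_type: Optional[str] = None  # object type whose numbers the array stores
    unique: bool = False                 # True if the array never repeats a value

symbols = {
    "coords": ArrayInfo(index_type="NSVtx"),
    "elts":   ArrayInfo(index_type="NSTri", contents_type="NSVtx", unique=False),
    "bndlst": ArrayInfo(index_type="SlippingVtx", contents_type="NSVtx", unique=True),
}

def alignment_groups(syms):
    """Arrays with the same index type are aligned and hence distributed alike."""
    groups = {}
    for name, info in syms.items():
        groups.setdefault(info.index_type, []).append(name)
    return groups

# {'NSVtx': ['coords'], 'NSTri': ['elts'], 'SlippingVtx': ['bndlst']}
print(alignment_groups(symbols))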
4.2 Dependence Analysis
Dependence analysis [1] is essential in order to classify loops as sequential or parallel. For each array that is written at least once in a loop, a dependence test checks if two iterations refer to the same array element. Parlamat performs dependence analysis for loops over mesh objects. Analysis consists of a flow-sensitive scalar test, which detects privatizable scalars and scalar reductions, and a flow-insensitive test for index-typed arrays. Direct array accesses, as they appear in unstructured mesh codes, can be handled sufficiently by conventional tests. On the contrary, indirect array accesses must be passed on to a special test. The core of our test is a procedure that checks if two indirect accesses may refer to the same array element. The procedure performs a pairwise comparison of the dependence strings of both accesses. Loosely speaking, a dependence string describes the index chain of an indirect access as a sequence of contents-typed accesses that end up with the loop variable. Figure 5 shows two loops containing simple indirect accesses and
(a) DO i = 1, eltnum
        v1 = elts(1, i)
        v2 = elts(2, i)
    S1: d(v1) = ...                  [ (d(v1); elts(1,i); i) ]i1
    S2: ...   = ... d(v2) ...        [ (d(v2); elts(2,i); i) ]i2
    ENDDO

(b) DO i = 1, bndvtxnum
        v = bndlst(i)                ! see Fig. 4a
    S:  d(v) = ...                   [ (d(v); bndlst(i); i) ]i1, [ (d(v); bndlst(i); i) ]i2
    ENDDO

Fig. 5. Loop-carried dependences: (a) present, (b) absent
comparisons of their dependence strings. The notation [x]i stands for "value of expression x in iteration i". The procedure compares strings from the right to the left. In order to detect loop-carried dependences it assumes that i1 ≠ i2. Hence, in Fig. 5a [v1]i1 = [v2]i2 is true iff elts(1, i1) = elts(2, i2). At this point the procedure requires information about the uniqueness of elts. As mentioned above, elts is not unique by definition. Hence, the condition is true and the accesses d(v1) and d(v2) will definitely overlap. For example, with respect to the mesh shown in Fig. 1 this will occur for i1 = 3 and i2 = 2. Unless both accesses appear in reduction operations, the true dependence S1 δt S2 disallows parallelization. Figure 5b shows how Parlamat can disprove dependence using uniqueness. The array bndlst is the structure array of a boundary object type and is unique by definition, as a list of boundary objects does not contain duplicates. Hence, the loop can be parallelized, as the sketch below illustrates. With respect to the code where the loop was taken from, this is an important result, because the loop is enclosed by a time step loop and thus executed many times.
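A minimal sketch of this uniqueness argument (ours, not Parlamat's implementation; the function name is illustrative):

def may_overlap(same_index_array, unique):
    """Conservative core of the test: can idx(i1) == idx(i2) hold for i1 != i2?"""
    if same_index_array and unique:
        # A unique index array (e.g. bndlst) never repeats a value, so distinct
        # iterations reach distinct elements of d: no loop-carried dependence.
        return False
    # Otherwise (e.g. elts, where one vertex belongs to several triangles)
    # a loop-carried dependence must be assumed.
    return True

assert may_overlap(same_index_array=True, unique=True) is False   # Fig. 5b: parallel
assert may_overlap(same_index_array=True, unique=False) is True   # Fig. 5a: sequential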
5 Experiences
We consider special-purpose systems such as Parlamat primarily useful for parallelization of real-size application codes, where they help to save much work. On the other hand, even a special-purpose solution should be applicable to a broad range of applications. In order to keep this goal in mind we based the development of our solution on a set of various unstructured mesh codes, which comprises real-size codes as well as benchmarks. In this paper we report on experiences with three codes taken from this set:
1. The benchmark euler [11] is the kernel of a 2D fluid dynamics computation. We mention it here because at the moment its HPF version is the only example of a parallel unstructured mesh code that is accepted by one of today's HPF compilers.
2. The structural mechanics package FEM3D [12] solves linear elasticity problems in 3D using meshes consisting of linear brick elements and an element-wise PCG method.
3. The fluid dynamics package NSC2KE [14], which has been used in several examples throughout this paper, solves the Euler and Navier-Stokes equations in 2D with consideration of turbulence. It uses meshes consisting of triangles.
We regard the codes FEM3D and NSC2KE as real-size applications for two reasons. First, they are self-contained packages that solve real-world problems. Secondly, they consist of 1100 and 3000 lines of code, respectively. Table 1 gives a brief characterization of the three codes. The lengths of the data structure specifications show that specifications - even if they belong to large applications - usually fit into one screen page.
Table 1. Characterization of three unstructured mesh applications

application | data structure specification                          | DO loops
            | object types | boundary obj. types | attribute mappings | length (lines) | in parallel | sequential I/O | sequential others | total
euler       |      2       |                     |         7          |       11       |    5 (3)    |       2        |         1         |    8 (3)
FEM3D       |      2       |                     |        10          |       14       |   25 (5)    |       6        |        29         |   60 (5)
NSC2KE      |      3       |                     |        38          |       56       |   66 (17)   |      15        |        22         |  103 (21)
The table also reports on parallelization of DO loops. It shows how many loops of each application can be marked as parallel through safe use of the INDEPENDENT directive, sometimes after minor automatic restructuring. Some of the remaining loops cannot be parallelized because they contain I/O statements that refer to one and the same file. Numbers given in parentheses refer to loops that are executed many times, for example, within a time step loop. The statistics indicate that the knowledge accessible to Parlamat helps to achieve a high number of parallel loops for all three applications. Furthermore nearly all of the frequently executed loops can be parallelized, which should lead to good parallel speed-ups. Our experiences with the HPF versions of the above three codes are very limited due to several restrictions of today's HPF compilers. In order to handle arbitrary parallel loops containing indirect array accesses, an HPF compiler must support the inspector-executor technique [15]. Our attempts to compile the HPF version of euler using ADAPTOR 4.0 [4] and pghpf 2.2 [17] gave evidence that these compilers do not support this technique. Thereupon we switched to a prerelease of the HPF+ compiler vfc [3]. The HPF+ version could be compiled with vfc and executed on a Meiko CS-2 without problems. On the other hand, the HPF+ versions of FEM3D and NSC2KE were not accepted by vfc due to preliminary compiler restrictions. In spite of problems with HPF compilers we could verify the correctness of the transformations applied to FEM3D and NSC2KE. Due to the nature of the language, HPF programs can also be compiled with a Fortran 9X compiler and executed on a sequential computer. We tested Parlamat's code transformation and mesh data preparation facilities thoroughly on a single-processor workstation through comparison of the results produced by the HPF version and the
original version. The differences between the results are hardly noticeable and stem from the fact that renumbering of mesh objects changes the order of arithmetic operations with real numbers, which causes different rounding.
6 Related Work
McManus [13] and Hascoet [8] investigated automation concepts for another parallelization strategy for unstructured mesh computations. Unlike the HPF strategy described here, they focus on execution of the original code as node code on a distributed-memory computer after some changes such as communication insertion or adjustment of array and loop bounds have been made. It is obvious that this strategy is less portable than HPF. McManus considers some steps of his five-step strategy automatable within the framework of the CAPTools environment. However, such a general-purpose system requires extensive knowledge from the user in order to achieve efficient results. Hascoet has designed a special-purpose tool that provides semi-automatic identification of program locations where communication operations must be inserted. The tool requests information including the loops to be parallelized, association of loops with mesh entities, and a so-called overlap automaton.
7 Conclusion
In this paper we introduced a concept for semi-automatic parallelization of unstructured mesh computations. The core of this concept is a modelling language that allows users to describe the implementation of mesh data structures in a formal way. The language is simple and general enough to cover a broad variety of data structure implementations. We also showed that data structure specifications provide the potential for automatization of a strategy that translates Fortran 77 codes to HPF. In particular, we outlined their usefulness for efficient loop parallelization by means of enhanced dependence testing. Finally we reported on application of our concept to a set of unstructured mesh codes, with two real-size applications among them. Our experiences show that data structure formalization yields significant savings of work and provides enough knowledge in order to transform applications with static meshes into highly parallel HPF equivalents. Hence, they justify and recommend the development of special-purpose parallelizers such as Parlamat. We believe that the potential of data structure formalization for code parallelization has not been exploited yet by far. The current version of Parlamat utilizes semantic knowledge just for most obvious transformations. In particular, applications using dynamic mesh refinement provide several opportunities for exploitation of semantic knowledge for recognition of coarse-grain parallelism. We regard development of respective HPF parallelization strategies and concepts for automatic translation as interesting challenges for the future.
Acknowledgement. The author thanks Siegfried Benkner for access to the HPF+ compiler vfc.
References 1. Banerjee, U.: Dependence Analysis for Supercomputing. Reading, Kluwer Academic Publishers, Boston (1988) 2. Blume, W., Eigenmann, R., Hoeflinger, J., Padua, D., Petersen, P., Rauchwerger, L., Tu, P.: Automatic Detection of Parallelism. IEEE Parallel and Distributed Technology 2 (1994) 37-47 3. Benkner, S., Pantano, M., Sanjari, K., Sipkova, V., Velkow, B., Wender, B.: VFC Compilation System for H P F § - Release Notes 0.91. University of Vienna (1997) 4. Brandes, T.: ADAPTOR Programmer's Guide Version 4.0. GMD Technical Report, German National Research Center for Information Technology (1996) 5. Chapman, B., Zima, H., Mehrotra, P.: Extending HPF for Advanced Data-Parallel Applications. IEEE Parallel And Distributed Technology 2 (1994) 59-70 6. Cross, M., lerotheou, C., Johnson, S., Leggett, P.: CAPTools - semiautomatic parallelisation of mesh based computational mechanics codes. Proc. H P C N '94 2 (1994) 241-246 7. Das, R., Mavriplis, D., Saltz, J., Gupta, S., Ponnusamy, R.: The Design and Implementation of a Parallel Unstructured Euler Solver Using Software Primitives. AIAA Journal 32 (1994) 489-496 8. Hascoet, L.: Automatic Placement of Communications in Mesh-Partitioning Parallelization. ACM SIGPLAN Notices 32 (1997) 136-144 9. High-Performance Fortran Forum: High Performance Fortran Language Specification - Version 2.0. Technical Report, Rice University, TX (1997) 10. Kallinderis, Y., Vijayan, P.: Adaptive Refinement-Coarsening Scheme for ThreeDimensional Unstructured Meshes. AIAA Journal 31 (1993) 1140-1447 11. Mavriplis, D.J.: Three-Dimensional Multigrid for the Euler Equations. AIAA Paper 91-1549CP, American Institute of Aeronautics and Astronautics (1991) 824-831 12. Mathur, K.: Unstructured Three Dimensional Finite Element Simulations on Data Parallel Architectures. In: Mehrotra, P., Saltz, J., Voigt, R. (eds.): Unstructured Scientific Computation on Scalable Multiprocessors. MIT Press (1992) 65-79 13. McManus, K.: A Strategy for Mapping Unstructured Mesh Computational Mechanics Programs onto Distributed Memory Parallel Architectures. Ph.D. thesis,University of Greenwich (1996) 14. Mohammadi, B.: Fluid Dynamics Computations with NSC2KE - A User Guide. Technical Report RT-0164, INRIA (1994) 15. Mirchandaney, R., Saltz, J., Smith, R., Nicol, D., Crowley, K.: Principles of runtime support for parallel processors. Proc. 1988 ACM International Conference on Supercomputing (1988) 140-152 16. Ponnusamy, R., Hwang, Y.-S., Das, R., Saltz, J., Choudhary, A., Fox, G.: Supporting Irregular Distributions Using Data-Parallel Languages. IEEE Parallel and Distributed Technology 3 (1995) 12-14 17. The Portland Group, Inc.: pghpf User's Guide. Wilsonville, OR (1992) 18. Ponnusamy, R., Saltz, J., Das, R., Koelbel, Ch., Choudhary, A.: Embedding Data Mappers with Distributed Memory Machine Compilers ACM SIGPLAN Notices 28 (1993) 52-55
Parallel Constant Propagation
Jens Knoop
Universität Passau, D-94030 Passau, Germany
knoop@fmi.uni-passau.de, phone: ++49-851-509-3090, fax: ++49-...-3092
Abstract. Constant propagation (CP) is a powerful, practically relevant optimization of sequential programs. However, systematic adaptions to the parallel setting are still missing. In fact, because of the computational complexity paraphrased by the catch-phrase "state explosion problem", the successful transfer of sequential techniques is currently restricted to bitvector-based optimizations, which because of their structural simplicity can be enhanced to parallel programs at almost no costs on the implementation and computation side. CP, however, is beyond this class. Here, we show how to enhance the framework underlying the transfer of bitvector problems, obtaining the basis for developing a powerful algorithm for parallel constant propagation (PCP). This algorithm can be implemented as easily and as efficiently as its sequential counterpart for simple constants computed by state-of-the-art sequential optimizers.
1 Motivation
In comparison to automatic parallelization (cf. [15, 16]), optimization of parallel programs attracted so far only little attention. An important reason may be that straightforward adaptions of sequential techniques typically fail (cf. [10]), and that the costs of rigorous, correct adaptions are unacceptably high because of the combinatorial explosion of the number of interleavings manifesting the possible executions of a parallel program. However, the example of unidirectional bitvector (UBv) problems (cf. [2]) shows that there are exceptions: UBv-problems can be solved as easily and as efficiently as for sequential programs (cf. [9]). This opened the way for the successful transfer of the classical bitvector-based optimizations to the parallel setting at Mmost no costs on the implementation and computation side. In [6] and [8] this has been demonstrated for partial dead-code elimination (cf. [7]), and partial redundancy elimination (cf. [11]). Constant propagation (CP) (cf. [4, 3,12]), however, the application considered in this article, is beyond bitvector-based optimization. CP is a powerful and widely used optimization of sequential programs. It improves performance by replacing terms, which at compile-time can be determined to yield a unique constant value at run-time by that value. Enhancing the framework of [9], we develop in this article an algorithm for parallel constant propagation (PCP). Like the bitvector-based optimizations it can be implemented as easily and as efficiently as its sequential counterpart for simple constants (SCs) (cf. [4, 3, 12]), which are computed by state-of-the-art optimizers of sequential programs.
446 The power of this algorithm is demonstrated in Figure 1. It detects that y has the value 15 before entering the parallel statement, and that this value is maintained by it. Similarly, it detects that b, c, and e have the constant values 3, 5, and 2 before and after leaving the parallel assignment. Moreover, it detects that independently of the interleaving sequence taken at run-time d and x are assigned the constant values 3 and 17 inside the parallel assignment. Together this allows our algorithm not only to detect the constancy of the right-hand side terms at the edges 48, 56, 57, 60, and 61 after the parallel statement, but also the constancy of the right-hand side terms at the edges 22, 37, and 45 inside the parallel statement. We do not know of any other algorithm achieving this.
Fig. 1. The power of parallel constant propagation.
Importantly, this result is obtained without any performance penalty in comparison to sequential CP. Fundamental for achieving this is to conceptually decompose PCP to behave differently on sequential and parallel program parts. Intuitively, on sequential program parts our PCP-algorithm coincides with its sequential counterpart for SCs, on parallel program parts it computes a safe approximation of SCs, which is tailored with respect to side-conditions allowing
us as for unidirectional bitvector problems to capture the phenomena of interference and synchronization of parallel components completely without having to consider interleavings at all. Summarizing, the central contributions of this article are as follows: (1) Extending the framework of [9] to a specific class of non-bitvector problems. (2) Developing on this basis a PCP-algorithm, which works for reducible and irreducible control flow, and is as efficient and can as easily be implemented as its sequential counterpart for SCs. An extended presentation can be found in [5].
2 Preliminaries
This section sketches our parallel setup, which has been presented in detail in [9]. We consider parallel imperative programs with shared memory and interleaving semantics. Parallelism is syntactically expressed by means of a par statement. As usual, we assume that there are neither jumps leading into a component of a parallel statement from outside nor vice versa. As shown in Figure 1, we represent a parallel program by an edge-labelled parallel flow graph G with node set N and edge set E. Moreover, we consider terms t ∈ T, which are inductively built from variables v ∈ V, constants c ∈ C, and operators op ∈ Op of arity r ≥ 1. Their semantics is induced by an interpretation I = (D' ∪ {⊥, ⊤}, I0), where D' denotes a non-empty data domain, ⊥ and ⊤ two new data not in D', and I0 a function mapping every constant c ∈ C to a datum I0(c) ∈ D', and every r-ary operator op ∈ Op to a total, strict function I0(op): D^r → D, D =df D' ∪ {⊥, ⊤} (i.e., I0(op)(d1,...,dr) = ⊥ whenever there is a j, 1 ≤ j ≤ r, with dj = ⊥). Σ = {σ | σ: V → D} denotes the set of states, and σ⊥ the distinct start state assigning ⊥ to all variables v ∈ V (this reflects that we do not assume anything about the context of a program being optimized). The semantics of terms t ∈ T is then given by the evaluation function E: T → (Σ → D), inductively defined by: ∀t ∈ T ∀σ ∈ Σ.

    E(t)(σ) = σ(x)                                if t = x ∈ V
              I0(c)                               if t = c ∈ C
              I0(op)(E(t1)(σ), ..., E(tr)(σ))     if t = op(t1, ..., tr)
In the following we assume D' ⊆ C ⊆ T, i.e., the set of data D' is identified with the set of constants C. We introduce the state transformation function θe: Σ → Σ for an edge e ≡ x := t, where θe(σ)(y) is defined by E(t)(σ) if y = x, and by σ(y) otherwise. Denoting the set of all parallel program paths, i.e., the set of interleaving sequences, from the unique start node s to a program point n by PP[s,n], the set of states Σn being possible at n is given by Σn =df {θp(σ⊥) | p ∈ PP[s,n]}. Here, θp denotes the straightforward extension of θ to parallel paths. Based on Σn we can now define the set Cn of all terms yielding a unique constant value at program point n ∈ N at run-time:
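The following small sketch (ours, in Python) makes the evaluation function and its strictness in ⊥ concrete; treating any ⊤ operand as ⊤ is one common convention and is an assumption of the sketch, not part of the definition above:

BOTTOM, TOP = "bot", "top"

def evaluate(term, state):
    kind = term[0]
    if kind == "var":                      # t = x in V
        return state.get(term[1], BOTTOM)
    if kind == "const":                    # t = c in C
        return term[1]
    _, op, args = term                     # t = op(t1, ..., tr)
    vals = [evaluate(a, state) for a in args]
    if BOTTOM in vals:
        return BOTTOM                      # strictness in BOTTOM
    if TOP in vals:
        return TOP                         # convention chosen for this sketch
    return op(*vals)

state = {"a": 3, "d": 2}
plus = ("op", lambda x, y: x + y, [("var", "a"), ("var", "d")])
assert evaluate(plus, state) == 5
assert evaluate(("op", lambda x, y: x + y, [("var", "a"), ("var", "z")]), state) == BOTTOM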
Cn =df ⋃ {Cn^d | d ∈ D'} with Cn^d =df {t ∈ T | ∀σ ∈ Σn. E(t)(σ) = d} for all d ∈ D'. Unfortunately, even for sequential programs Cn is in general not decidable (cf. [12]). Simple constants, recalled next, are an efficiently decidable subset of Cn, which is computed by state-of-the-art optimizers of sequential programs.
3 Simple Constants
Intuitively, a term is a (sequential) simple constant (SC) (cf. [4]), if it is a program constant or all its operands are simple constants (in Figure 2, a, b, c and d are SCs, whereas e, though it is a constant of value 5, and f , which in fact is not constant, are not). Data flow analysis provides the means for computing SCs.
Fig. 2. Simple constants. (Two flow graphs (a) and (b) whose edges carry assignments such as a := 2, b := a, a := 3, c := 4, d := c-2, d := a+1, e := a+d, f := a+b*c; each program point is annotated with the simple constants valid there, e.g. a ↦ 3, b ↦ 2, c ↦ 4.)
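Why e = a + d is constant but not a simple constant can be seen with a small sketch (ours): the analysis merges the two branch states variable-wise in the flat lattice before evaluating a + d, so the cross-variable correlation is lost.

BOTTOM, TOP = "bot", "top"

def meet(x, y):                      # meet in the flat lattice of Fig. 3(a)
    if x == TOP: return y
    if y == TOP: return x
    return x if x == y else BOTTOM

def join_states(s1, s2):             # variable-wise meet where control flow merges
    return {v: meet(s1.get(v, TOP), s2.get(v, TOP)) for v in set(s1) | set(s2)}

left, right = {"a": 3, "d": 2}, {"a": 2, "d": 3}
merged = join_states(left, right)

# On every concrete path a + d = 5, but in the merged abstract state both
# operands are BOTTOM, so a + d cannot be reported as a constant.
assert all(s["a"] + s["d"] == 5 for s in (left, right))
assert merged["a"] == BOTTOM and merged["d"] == BOTTOM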
Data Flow Analysis. In essence, data flow analysis (DFA) provides information about the program states which may occur at a specific program point at run-time. Theoretically well-founded are DFAs based on abstract interpretation (cf. [1]). Usually, the abstract semantics, which is tailored for a specific problem, is specified by a local semantic functional [[ ]]: E → (L → L) giving abstract meaning to every program statement (here: every edge e ∈ E of a sequential or parallel flow graph G) in terms of a transformation function on a complete lattice (L, ⊓, ⊑, ⊥, ⊤). Its elements express the DFA-information of interest. Local semantic functionals can easily be extended to capture finite paths. This is the key to the so-called meet over all paths (MOP) approach. It yields the intuitively desired solution of a DFA-problem as it directly mimics possible program executions: it "meets" all informations belonging to a program path leading from s to a program point n ∈ N.
The MOP-Solution: ∀l0 ∈ L. MOP_l0(n) = ⊓ {[[p]](l0) | p ∈ P[s,n]}
Unfortunately, this is in general not effective, which leads us to the maximal fixed point (MFP) approach. Intuitively, it approximates the greatest solution
of a system of equations imposing consistency constraints on an annotation of the program with DFA-information with respect to a start information l0 ∈ L:

    mfp(n) = l0                                         if n = s
             ⊓ {[[(m,n)]](mfp(m)) | (m,n) ∈ E}          otherwise
The greatest solution of this equation system, denoted by mfp_l0, can effectively be computed if the semantic functions [[e]], e ∈ E, are monotonic and L is of finite height. The solution of the MFP-approach is then defined as follows: The MFP-Solution: ∀l0 ∈ L ∀n ∈ N. MFP_l0(n) =df mfp_l0(n). Central is now the following theorem relating both solutions. It gives sufficient conditions for the correctness and even precision of the MFP-solution (cf. [3, 4]): Theorem 1 (Safety and Coincidence). 1. Safety: MFP(n) ⊑ MOP(n), if all [[e]], e ∈ E, are monotonic. 2. Coincidence: MFP(n) = MOP(n), if all [[e]], e ∈ E, are distributive.
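The mfp equation system is usually solved with a worklist iteration; the following generic driver is ours, not the paper's code. It terminates for monotonic edge transformers over a lattice of finite height and can be instantiated with the SC functional defined below:

def mfp(nodes, edges, start, l0, transfer, meet, top):
    """edges: iterable of (m, n); transfer[(m, n)] maps lattice element to lattice element."""
    info = {n: top for n in nodes}
    info[start] = l0
    preds = {n: [(m, n2) for (m, n2) in edges if n2 == n] for n in nodes}
    worklist = list(nodes)
    while worklist:
        n = worklist.pop()
        if n == start:
            continue
        new = top
        for (m, _) in preds[n]:
            new = meet(new, transfer[(m, n)](info[m]))
        if new != info[n]:
            info[n] = new
            # re-examine all successors of n
            worklist.extend(dst for (src, dst) in edges if src == n)
    return info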
Computing Simple Constants. Considering D a flat lattice as illustrated in Figure 3(a), the set of states Σ together with the pointwise ordering is a complete lattice, too. The computation of SCs relies then on the local semantic functional [[ ]]sc: E → (Σ → Σ) defined by [[e]]sc =df θe for all e ∈ E, with respect to the start state σ⊥. Because the number of variables occurring in a program is finite, we have (cf. [3]):
D I. E ( t ) ( M F Fsc~ ( n ) ) = d }
Defining dually the set C~~,m~ -4f {t E T I 3 d E D'. C ( t ) ( M O P ~ ( n ) ) = d } induced by the MOP~-solution, we obtain by means of Theorem 1(1) (cf. [3]): T h e o r e m 2 ( S C - C o r r e c t n e s s ) . Vn E N. Cn _~ --nCSC'm~ _D -nCsc'mlP __--CnSe . a)
T
b)
9 ..
T
*"
r
.."
di. 1
...
d,.~ ...
d~,
I
..L
I
di
I
~ •
Fig. 3. a), b) Flat (data) lattices,
c) Extended lattice.
di+!
""
di. j oo.
...
d~%...
)
d~
I
450
4 Parallel Constant Propagation
In a parallel program, the validity of the SC-property for terms occurring in a parallel statement depends mutually on the validity of this property for other terms; verifying it cannot be separated from the interleaving sequences. The costs of investigating them explicitly, however, are prohibitive because of the wellknown state explosion problem. We therefore introduce a safe approximation of SCs, called strong constants (StCs). Like the solution of a unidirectional bitveetor (UBv) problem they can be computed without having to consider interleavings at all. The computation of StCs is based on the local semantic functional ~ ]stc : E--+ ( ~ - + Z ) defined by ~e]stc(a)=d$Ostc(o -) for all e E E and a E ~ , where O~tc denotes the state transformation function of Section 2, where, however, the evaluation function E is replaced by the simpler gstc, where gstc(t)(~r) is defined by I0 (c), if t = c E C, and • otherwise. The set of strong constants at n, denoted by Cstc is then analogously defined to the set of SCs at n. Denoting by ~'D=d] { Cstdld E D} U {IdD} the set of constant functions Cstd on D, d E D, enlarged by the identity IdD on D, we obtain: v
n
,
iemma
2.
Lemma 2. ∀e ∈ E ∀v ∈ V. [[e]]stc|v ∈ F_D \ {Cst_⊤}, where |v denotes the restriction of [[e]]stc to v.
4.1
Parallel Data Flow Analysis
As in the sequential case, the abstract semantics of a parallel program is specified by a local semantic functional [ ~ : E - + (s --~ s giving meaning to its statements in terms of functions on an (arbitrary) complete lattice s In analogy to their sequential counterparts, they can be extended to capture finite parallel paths. Hence, the definition of the parallel variant of the MOP-approach, the PMOP-approach and its solution, is obvious: T h e PMOP-Solution: Vlo E s PMOPlo(n) = ['-1{ [p]~(lo)IP E P P [ s , n ] }
451 The point of this section now is to show that as for UBv-problems, also for the structurally more complex StC-like problems the PMOP-solution can efficiently be computed by means of a fixed point computation.
StC-like problems. StC-like problems are characterized as UBv-problems by (the simplicity of) their local semantic functional [ ] : E --+ (1/9 --+//9). It specifies the effect of an edge e on a single component of the problem under consideration (considering StCs, for each variable v E V), where /D is a flat lattice of some underlying (finite) set /D ~ enlarged by two new elements _1_ and T (cf. Figure 3(a)), and where every function [ e ], e E E, is an element of ~K):df { Cstd I d E 1D} U {IdtD}. As shown before, SD is not closed under pointwise meet. The possibly resulting peak-functions, however, are here essential in order to always be able to model the effect when control flows together. Fortunately, this can be mimiced in a setting, where all semantic functions are constant ones (or the identity). To this end we extend/D to/Dx by inserting a second layer as shown in Figure 3(c), considering then the set of functions 9v~x =dr ( ~ \ { I d ~ } ) U {Ida)• } U { Cstdp I d E K)}. The functions CStdp, d E/D ~, are constant functions like their respective counterparts Cstd, however, they play the role of the peakfunctions p~, and in fact, after the termination of the analysis, they will be interpreted as peak-functions. Intuitively, considering StCs and synchronization, the "doubling" of functions allows us to distinguish between components of a parallel statement assigning a variable under consideration a unique constant value not modified afterwards along all paths (Cstd), and those doing this along some paths (Cstd~). While for the former components the variable is a total StC (or shorter: an StC) after their termination (as the property holds for all paths), it is a partial StC for the latter ones (as the property holds for some paths only). Keeping this separately in our data domain, gives us the handle for treating synchronization and interference precisely. E.g., in the program of Figure 1, d is a total StC for the parallel component starting with edge 15, and a partial one for the component starting with edge 35. Hence, d is an StC of value 3 after termination of the complete parallel statement as none of the parallel "relatives" destroys this property established by the left-most component because it is a partial StC of value 3, too, for the right-most component, and it is transparent for the middle one. On the other hand, d would not be an StC after the termination of the parallel statement, if, let's say, d would be a partial StC of value 5 of the right-most component. Obviously, all functions of ) ~ x are distributive. Moreover, together with the orderings ___seq and ~par displayed in Figure 4, they form two complete lattices with least element Cst• and greatest element CstT, and least element Cst• and greatest element Id~x, respectively. Moreover, Jr~)x is closed under function composition (for both orderings). Intuitively, the "meet" operation with respect to ~par models the merge of information in "parallel" join nodes, i.e., end nodes of parallel statements, which requires that all their parallel components terminated. Conversely, the "meet" operation modelling the merge of information at points, where sequential control flows together, which actually would be given by the pointwise (indicated below by the index "pw") meet-operation on
452 Sequential
=- ldn
Join .
.
.
.
Parallel
=- -
seq
,..
CSt i
9 ""
CStdp
"~176
. i. +
. "'"
Cst d
Cst
.
Cst i+
"~
Cstd+ ''"
Cst p
""
Cst p ,..
.
Cst p
I
par
CStd~
C td~ I
. . . . t. .
CStd~
9
l
~
i
"
"'"
Idvx
Join"
Cs' d
~'
"',
Cst•
f
Cstd
~.
CStdp
C td~ 1
Cst d
Cst
i
di~j
i ' i+
"'"
Cst d "'" ,+J
Cst~.
Fig. 4. The function lattices for "sequential" and "parallel join".
functions of Zn), is here equivalently modelled by the "meet" operation with respect to C seq. In the following we drop this index. Hence, ~ and C expand to [Tseq and C__seq,respectively. Recalling Lemma 2 and identifying the functions Cstdp, d 6 ]19', with the peak-functions Pd, we obtain for every sequential flow graph G and start information out of Z: L e m m a 3. V n e N. MOPpw (n) = MOP(n) = M F P ( n ) Lemma 3 together with Lemma 4, of which in [9] its variant for bitvector problems was considered, are the key for the efficient computation of the effects of "interference" and "synchronization" for StC-like problems. L e m m a 4 ( M a i n - L e m m a ) . Let fi : Y r ~ x - + } : ~ x , 1 < i < q, q E IV, be functions of Y ~ x . Then: 3 k 6 {1,...,q}. fq o ... o f2 o fl ----fk A V j e {k + 1, ..., q}. fj = Idl)x.
Interference. As for UBv-problems, also for StC-like problems the relevance of Main Lemma 4 is that each possible interference at a program point n is due to a single statement in a parallel component, i.e., due to a statement whose execution can be interleaved with the statement of n's incoming edge(s), denoted as n's interleaving predecessors IntPred(n) C_ E. This is implied by the fact t h a t for each e E IntPred(n), there is a parallel path leading to n whose last step requires the execution of e. Together with the obvious existence of a path to n that does not require the execution of any statement of IntPred(n), this implies that the only effect of interference is "destruction". In the bitvector situation considered in [9], this observation is boiled down to a predicate NonDestruct inducing the constant function "true" or "false" indicating the presence or absence of an interleaving predecessor destroying the property under consideration. Similarly, we here define for every node n E N the function Inter/erenceE# (n)=~/ [~{[ e ]le C IntPred(n) A ~ e ] r I d a } These (precomputable) constant functions Inter]erenceEff(n), n E N, suffice for modelling interference.
453
Synchronization. In order to leave a parallel statement, all parallel components are required to terminate. As for UBv-problems, the information required to model this effect can be hierarchically computed by an algorithm, which only considers purely sequential programs. The central idea coincides with that of interprocedural DFA (cf. [14]): one needs to compute the effect of complete subgraphs, in this case of complete parallel components. This information is computed in an "innermost" fashion and then propagated to the next surrounding parallel statement. In essence, this three-step procedure ~A is a hierarchical adaptation of the functional version of the MFP-approach to the parallel setting. Here we only consider the second step realizing the synchronization at end nodes of parallel statements in more detail. This step can essentially be reduced to the case of parallel statements G with purely sequential components G 1 , . . . , G k . Thus, the global semantics ~ Gi ]]* of the component graphs Gi can be computed as in the sequential case. Afterwards, the global semantics ~ G ~* of G is given by: ~[G]* = [-]p~T{ ~Gi 1" I i e {1,... ,k} }
(Synchronization)
Central for proving the correctness of this step, i.e., ~ G 4" coincides with the
PMOP-solution, is again the fact that a single statement is responsible for the entire effect of a path. Thus, it is already given by the projection of this path onto the parallel component containing the vital statement. This is exploited in the synchronization step above. Formally, it can be proved by Main Lemma 4 together with Lemma 2 and Lemma 3, and the identification of the functions CStdp with the peak-function Pd. The correctness of the complete hierarchical process then follows by a hierarchical coincidence theorem generalizing the one of [9]. Subsequently, the information on StCs on parallel program parts can be fed into the analysis for SCs on sequential program parts. In spirit, this follows the lines of [9], however, the process is refined here in order to take the specific meaning of peak-functions represented by Cstd~ into account.
4.2
The Application: Parallel Constant Propagation
We now present our algorithm for PCP. It is based on the semantic functional [ ]pcp : E ~ ( S -+ S), where [ e ] is defined by ~ e ]stc, if e belongs to a parallel statement, and by ~ e ]s~ otherwise. Thus, for sequential program parts, [ ]pep coincides with the functional for SCs, for parallel program parts with the functional for StCs. The computation of parallel constants proceeds then essentially in three steps. (1) Hierarchically computing (cf. procedure .4) the semantics of parallel statements according to Section 4.1. (2) Computing the data flow information valid at program points of sequential program parts. (3) Propagating the information valid at entry nodes of parallel statements into their components. The first step has been considered in the previous section. The second step proceeds almost as in the sequential case because at this level parallel statements are considered "super-edges", whose semantics (~[ ]~*) has been computed in the first step. The third step, finally, can similarly be organized as the corresponding step of the algorithm of [9]. In fact, the complete three-step procedure evolves as
454 an adaption of this algorithm. Denoting the final annotation computed by the PCP-algorithm by Ann pcp : N -+ ~, the set of constants detected, called parallel simple constants (PSCs), is given by
gPSC---{te TI3de n --d]
D'. g(t)(Ann pcp)
-~-
d)
The correctness of this approach, whose power has been demonstrated in the example of Figure 1, is a consequence of Theorem 3: T h e o r e m 3 ( P S C - C o r r e c t n e s s ) . Vn C N. gn D CP8c --
- - n
"
Aggressive PCP. P C P can detect the constancy of complex terms inside parallel statements as e.g. of y + b at edge 45 in Figure 1. In general, though not in this example, this is the source of second-order effects (cf. [13]). They can be captured by incrementally reanalyzing affected program parts starting with the enclosing parallel statement. In effect, this leads to an even more powerful, aggressive variant of PCP. Similar this holds for extensions to subscripted variables. We here concentrated on scalar variables, however, along the lines of related sequential algorithms, our approach can be extended to subscripted variables, too.
5
Conclusions
Extending the framework of [9], we developed a PCP-algorithm, which can be implemented as easily and as efficiently as its sequential counterpart for SCs. The key for achieving this was to decompose the algorithm to behave differently on sequential and parallel program parts. On sequential parts this allows us to be as precise as the algorithm for SCs; on parallel parts it allows us to capture the phenomena of interference and synchronization without having to consider any interleaving. Together with the successful earlier transfer of bitvector-based optimizations to the parallel setting, complemented here by the successful transfer of CP, a problem beyond this class, we hope that compiler writers are encouraged to integrate classical sequential optimizations into parallel compilers.
References 1. P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conf. Rec. ~th Symp. Principles of Prog. Lang. (POPL'77), pages 238 - 252. ACM, NY, 1977. 2. M. S. Hecht. Flow Analysis of Computer Programs. Elsevier, North-Holland, 1977. 3. J. B. Kam and J. D. Ullman. Monotone data flow analysis frameworks. Acta Informatica, 7:309- 317, 1977. 4. G. A. Kildall. A unified approach to global program optimization. In Conf. Rec. 1st Symp. Principles of Prog. Lang. (POPL '~3), pages 194 - 206. ACM, NY, 1973. 5. J. Knoop. Constant propagation in explicitly parallel programs. Technical report, Fak. f. Math. u. Inf., Univ. Passau, Germany, 1998. 6. J. Knoop. Eliminating partially dead code in explicitly parallel programs. TCS, 196(1-2):365 - 393, 1998. (Special issue devoted to Euro-Par'96).
455 7. J. Knoop, O. Riithing, and B. Steffen. Partial dead code elimination. In Proc. ACM SIGPLAN Conf. on Prog. Lang. Design and Impl. (PLDI'g4), volume 29,6 of ACM SIGPLAN Not., pages 147 - 158, 1994. 8. J. Knoop, B. Steffen, and J. Vollmer. Code motion for parallel programs. In Proc.
of the Poster Session of the 6th Int. Conf. on Compiler Construction (CC'96), 9.
pages 81 - 88. T R LiTH-IDA-R-96-12, 1996. J. Knoop, B. Steffen, and J. Vollmer. Parallelism for free: Efficient and optimal bitvector analyses for parallel programs. ACM Trans. Prog. Lang. Syst., 18(3):268 299, 1996. S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proc. Int. Conf. on Parallel Processing, Volume II, pages 105 - 113, 1990. E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Comm. ACM, 22(2):96 - 103, 1979. J. H. Reif and R. Lewis. Symbolic evaluation and the global value graph. In Conf. Ree. 4th Symp. Principles of Prog. Lang. (POPL '77), pages 104 - 118. ACM, NY, 1977. B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Conf. Rec. 15th Syrup. Principles of Pro9. Lang. (POPL '88), pages 2 - 27. ACM, NY, 1988. M. Sharir and A. Pnueli. Two approaches to interprocedural d a t a flow analysis. In S. S. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 7, pages 189 - 233. Prentice Hall, Englewood Cliffs, N J, 1981. M. Wolfe. High performance compilers for parallel computing. Addison-Wesley, NY, 1996. H. Zima and B. Chapman. Supercompilers for parallel and vector computers. Addison-Wesley, NY, 1991. -
10. 11. 12.
13.
14.
15. 16.
Optimization of SIMD Programs with Redundant Computations Jhrn Eisenbiegler Institut ffir Programmstrukturen und Datenorganisation,
Universits Karlsruhe, 76128 Karlsruhe, Germany eisen~ipd, info .uni-karlsruhe. de
A b s t r a c t . This article introduces a method for using redundant computations in automatic compilation and optimization of SIMD programs for distributed memory machines. This method is based on a generalized definition of parameterized data distributions, which allows an efficient and flexible use of redundancies. An example demonstrates the benefits of this optimization method in practice.
1
Introduction
edundant computations, i.e. computations that are done on more than one prossor, can improve the performance of parallel programs on distributed m e m o r y machines [2]. In the examples given in section 5, the execution time is improved by nearly one magnitude. Optimization methods (e.g. [5, 7]) which use redundant computations have two major disadvantages: First the task graph of the program has to be investigated node per node. Second the communication costs have to be estimated without knowledge about the final data distribution. To avoid these disadvantages, other methods as for example proposed in [3,4, 8] use parameterized data distributions. But, the data distributions given there do not allow redundant computations in a more flexible way than "all or nothing". Therefore, this article introduces a scheme for defining parameterized d a t a distributions, which allow the efficient and flexible use of redundant computations. Based on these definitions, this article describes a method for optimizing SIMD programs for distributed memory systems. This method is divided into three steps. As in [1], the first step uses heuristics (as described in [3,4, 8]) to propose locally good data distributions. However, these distributions are not used on the level of loops but on the level of parallel assignments. In the second step, additional data distributions (which lead to redundant computations) are calculated which connect subsequent parallel assignments in a way such that no communication is needed between them. The last step selects a globally optimal proposition for each parallel assignment. In contrast to [1], an efficient algorithm is described instead of 0-1 integer programming. The benefits of this optimization method are shown by the example of numerically solving of a differential equation.
2 Parameterized Data Distributions
In order to define parameterized data distributions, a definition for data distributions is required. Generally, a data element can exist on any number of processors and a processor can hold any number of data elements. Therefore, distributions have to be defined as relations between elements and processors:

Definition 1 (data distribution). A data distribution is a set D ⊆ N0 × N0 of tuples. (p, i) ∈ D iff element i is valid on processor p, i.e. element i has the correct value on processor p.

As an example, assume a machine with P = 4 processors and an array a with 15 elements. The set D = {(0, 0), (0, 1), (0, 2), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 13), (1, 14), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (3, 7), (3, 8), (3, 9), (3, 10), (3, 11)} is a data distribution where the elements 1, 2, 4, 5, 7, 8, 10, 11, 13, 14 are valid on exactly two out of four processors. For practical purposes in a compiler or for hand optimization of non-trivial programs, such data structures are too large. Therefore, parameterized data distributions are used. Each combination of parameters represents a certain data distribution. The correlation between the parameters and the data distributions they represent is the interpretation, i.e. an interpretation of a parameterized data distribution maps parameters to a data distribution as defined in Definition 1:

Definition 2 (parameterized data distribution). A parameterized data distribution space is a pair (S, I), where S is an arbitrary set and I : S → 2^(N0×N0) is an interpretation function. Every P ∈ S is a parameterized data distribution, i.e. (p, i) ∈ I(P) iff element i is valid on processor p.

Usually, the set S (the set of parameters) is a set of tuples whose elements describe the number of processors, block length etc. for each array dimension. In order to carry on with the example, regard the data distribution space (N0^3, I) with interpretation I:

    I(b, a, r) := {(p, i) | a + kbP + pb − r ≤ i < a + kbP + (p+1)b + r, k ∈ Z}
The parameter b is the block size, the parameter a the alignment, and the parameter r describes the redundancy. In this data distribution space, the data distribution D described above is parameterized with (3, −1, 1), because I(3, −1, 1) = D. Note that the correlation between processors and data elements is still relational, even if the interpretation I of the parameters is a function. Therefore Definition 2 is a generalization of former definitions (e.g. [3, 8]). The subset relation defines a semi-ordering for data distributions. This semi-ordering can be transferred to parameterized data distributions:

Definition 3 (semi-ordering). Let (S, I) be a parameterized data distribution space, P1, P2 ∈ S: P1 ≤ P2 iff I(P1) ⊆ I(P2).

Theorem 1. Let (S, I) be a parameterized data distribution space, P1, P2 ∈ S. There are no communication costs for changing the distribution of an array from P1 to P2 iff P1 ≥ P2.
Proof. Let P1 ≥ P2. Then every data element that must be valid on a processor under P2 is already valid there under P1, since I(P1) ⊇ I(P2). Therefore, no data elements have to be transferred and no communication costs occur. If P1 ≥ P2 does not hold, there is at least one data element valid on a processor in I(P2) which is not valid there in I(P1). This data element has to be transferred, generating communication costs.

The data distribution state of a program is described by the data distribution of each array in the program. This state is defined by a data distribution function, which maps arrays to parameterized data distributions.

Definition 4 (data distribution functions). Let ID be the name space of a program and (S, I) be a parameterized data distribution space. Then a function f : ID → S is called a data distribution function. Let f1 and f2 be two data distribution functions; f1 ≤ f2 iff ∀x ∈ ID : f1(x) ≤ f2(x).
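To make Definitions 1-3 and Theorem 1 concrete, the following fragment enumerates the interpretation I(b, a, r) for the small example machine and tests the semi-ordering that decides whether a redistribution is free. It is only an illustrative sketch in Python; the helper names (interpret, no_communication_needed) and the brute-force enumeration are ours, not part of the paper.

    # Sketch of Definitions 1-3 for a P-processor machine and an array of n elements.
    P, n = 4, 15

    def interpret(b, a, r):
        """I(b, a, r): set of (processor, element) pairs that are valid."""
        dist = set()
        for p in range(P):
            for i in range(n):
                # (p, i) is valid iff  a + k*b*P + p*b - r <= i < a + k*b*P + (p+1)*b + r  for some k
                for k in range(-1, n // (b * P) + 2):
                    lo = a + k * b * P + p * b - r
                    hi = a + k * b * P + (p + 1) * b + r
                    if lo <= i < hi:
                        dist.add((p, i))
                        break
        return dist

    def no_communication_needed(P1, P2):
        """Theorem 1: changing an array from P1 to P2 is free iff P1 >= P2, i.e. I(P1) contains I(P2)."""
        return interpret(*P2) <= interpret(*P1)

    # The example distribution D of the text is exactly I(3, -1, 1):
    assert (0, 10) in interpret(3, -1, 1) and (2, 6) in interpret(3, -1, 1)
    # Dropping the redundancy (r = 0) yields a smaller distribution, so the
    # redistribution (3, -1, 1) -> (3, -1, 0) costs nothing, while the reverse does:
    print(no_communication_needed((3, -1, 1), (3, -1, 0)))  # True
    print(no_communication_needed((3, -1, 0), (3, -1, 1)))  # False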
3 Parallel Array Assignments
With the definitions given in section 2, it is possible to distinguish between the input and the output data distribution of a parallel assignment. As an example, regard the data distribution space described above and the parallel assignment

    par i in [1, n − 1] x[i] := x[i − 1] + x[i] + x[i + 1];

If the output data distribution of x is to be (3, −1, 0), the input data distribution for x must be at least (3, −1, 1). This distinction is not possible with former definitions of parameterized data distributions, since it is not possible to describe the input data distribution there. In order to optimize data distributions for arrays, the execution of an SIMD program can now be seen as a sequence of array assignments. Every assignment has an input and an output data distribution function. Between two assignments the data distribution state has to be changed from the output data distribution of the first to a data distribution which is larger than or equal to the input data distribution of the second assignment. Note that there is no difference between the remapping of data and getting the necessary input for an assignment. In this article, the overlapping of communication and computation is not examined. However, this might be used for further optimization. A small sketch of the input-distribution computation for the example above follows.
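For the three-point stencil x[i-1] + x[i] + x[i+1], the required input distribution can be derived from the output distribution by widening the redundancy parameter by the stencil width. The sketch below assumes the (block, alignment, redundancy) space of section 2; the helper name required_input is ours.

    def required_input(output_dist, stencil):
        """Widen the redundancy of a (block, alignment, redundancy) distribution
        so that all right-hand-side accesses of a stencil are locally valid."""
        b, a, r = output_dist
        width = max(abs(off) for off in stencil)      # widest access offset
        return (b, a, r + width)

    # x[i] := x[i-1] + x[i] + x[i+1] with output distribution (3, -1, 0):
    print(required_input((3, -1, 0), stencil=(-1, 0, +1)))   # -> (3, -1, 1)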
4 Optimization

4.1 Proposition
In the first step of the optimization, heuristics propose locally good data distributions for each parallel assignment. This paper does not go into detail about searching locally good data distributions, since this problem is well-investigated (e.g. [3,8]). It is allowed to propose more than one data distribution for one parallel assignment using more than one proposition method to exploit the benefits of different heuristics.
The methods mentioned above only compute the output data distributions for the array written in the assignment. Therefore the corresponding input and output data distribution functions have to be calculated. If there are no data elements of an array needed in a data distribution function, this array is declared as undefined. As a result of this step, each parallel assignment is supplied with several proposals of pairs of input and output data distribution functions.
4.2 Redundancy
In the second step of optimization, additional propositions are introduced which connect subsequent parallel assignments in a way such that no communication is necessary. Let a1 and a2 be two subsequent parallel assignments and (i1, o1) and (i2, o2) pairs of input and output distribution functions proposed for a1 and a2, respectively. Let n be the array written in a1 and let d1 ∪ d2 denote a parameterized data distribution which is larger than d1 and d2. First, (i1', o1') is set to the pair of distribution functions corresponding to the proposal i2(n) ∪ o2(n) for n in a1 (as above). Hence, no communication is needed to redistribute n from o1'(n) to i2(n). Second, to avoid communication for other arrays, (i1', o1') is set to
    o1'(x) = o1(x) ∪ i2(x)   and   i1'(x) = i1(x) ∪ i2(x)   for x ≠ n,

while for x = n the pair keeps the values fixed in the first step.
As a result of this calculation, o1' is larger than i2, and therefore no communication is needed to redistribute from o1' to i2. However, in most cases some data elements have to be calculated redundantly on more than one processor for a1. The method can be applied to all combinations of proposals for a1 and a2. In particular, it can be applied to cost tuples for a2 which were themselves proposed this way. This leads to (potential) sequences in the program where no communication is needed. At the end of this step, again all parallel assignments are supplied with several pairs of input and output data distribution functions.
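A compact sketch of this construction, with distribution functions modelled as dictionaries and a join operator standing in for the ∪ on parameterized distributions, might look as follows. All names are ours, and the join shown only covers the simple case where block size and alignment agree; a general join for the distribution space would need more work.

    def redundancy_proposal(i1, o1, i2, n, join):
        """Build (i1', o1') for assignment a1 so that its output already covers the
        input i2 of the following assignment a2 (no communication between a1 and a2).
        i1, o1, i2 map array names to parameterized distributions; join(d1, d2)
        returns a distribution larger than both."""
        i1p, o1p = dict(i1), dict(o1)
        for x in i2:
            if x == n:                      # the array written by a1: fixed in the first step
                continue
            o1p[x] = join(o1[x], i2[x]) if x in o1 else i2[x]
            i1p[x] = join(i1[x], i2[x]) if x in i1 else i2[x]
        return i1p, o1p

    def join_bar(d1, d2):
        """A possible join for the (block, alignment, redundancy) space of Sect. 2:
        if block size and alignment agree, take the larger redundancy."""
        (b1, a1, r1), (b2, a2, r2) = d1, d2
        assert (b1, a1) == (b2, a2)
        return (b1, a1, max(r1, r2))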
4.3 Configuration
The configuration step tries to select the globally optimal data distribution for each parallel assignment from the local proposals. If the program is not oblivious (see [5]), it cannot be guaranteed that the optimal selection is found, due to the lack of compile-time knowledge. In order to find an optimal solution for arbitrary programs, the configuration method by Moldenhauer [6] is adopted. The proposals for the parallel assignments are converted to cost tuples of the form (i, c, o), where i and o are the proposed input and output distribution functions and c is the cost of computing this parallel assignment with the proposed distributions. To find an optimal solution for the whole program, these sets of cost tuples are combined to sets of cost tuples for larger program blocks. Let a1 and a2 be two subsequent program blocks and M1 and M2 be the sets of cost tuples proposed or calculated for these blocks. If a1 is executed with (i1, c1, o1) and a2 is executed with (i2, c2, o2), then a1; a2 is executed with (i, c, o) = (i1, c1, o1) ◦ (i2, c2, o2) where

    i(x) = i1(x)   if i1(x) or o1(x) is defined
           i2(x)   else, if i2(x) is defined

    c = c1 + c2 + redistribution_costs(o1, i2)

    o(x) = o2(x)   if o2(x) is defined
           o1(x)   else

For further combination with other program blocks, the cost tuples with minimal costs for all combinations of input and output data distribution functions which can occur using the proposals are calculated, i.e. the set of cost tuples M := M1 ◦ M2 for the sequence a1; a2 is

    M := M1 ◦ M2 := {(i, c, o) | (i, c, o) is cost-minimal in {(i, c, o) = (i1, c1, o1) ◦ (i2, c2, o2) | (ik, ck, ok) ∈ Mk}}
Rules for other control constructs can be given analogously and are omitted in this article. Using these rules, the set of cost tuples for the whole program can be calculated. In the set for the whole program the element with minimal costs is selected. For oblivious programs it represents the optimal selection from the proposals. For non-oblivious programs heuristics have to be used for the combination of the sets. To obtain the optimal proposals for each parallel assignment, the origin of the tuples has to be remembered during the combination step. This configuration problem is not in APX, i.e. no constant-factor approximation can be found. Accordingly, the algorithm described above is exponential. But in contrast to the solution proposed in [1], it is only exponential in the number of array variables simultaneously alive. Since this number is small in real applications, the method is feasible.
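The combination operator ◦ and the per-(input, output) minimization can be written down directly. The following sketch mirrors the rules above; the dictionary representation of distribution functions, the caller-supplied redistribution_costs and the function names are ours.

    def combine(t1, t2, redistribution_costs):
        """(i1, c1, o1) o (i2, c2, o2) for the sequence a1; a2."""
        (i1, c1, o1), (i2, c2, o2) = t1, t2
        i = dict(i1)                              # i(x) = i1(x) where defined
        for x in i2:
            if x not in i1 and x not in o1:       # otherwise a1 already covers x
                i[x] = i2[x]
        o = dict(o1)
        o.update(o2)                              # o(x) = o2(x) if defined, else o1(x)
        c = c1 + c2 + redistribution_costs(o1, i2)
        return (i, c, o)

    def combine_sets(M1, M2, redistribution_costs):
        """M1 o M2: keep only the cost-minimal tuple per (input, output) combination."""
        best = {}
        for t1 in M1:
            for t2 in M2:
                i, c, o = combine(t1, t2, redistribution_costs)
                key = (frozenset(i.items()), frozenset(o.items()))
                if key not in best or c < best[key][1]:
                    best[key] = (i, c, o)
        return list(best.values())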
Fig. 1. Solving a differential equation: execution time T[s] over the number of points n for the optimized program and the block-wise distributed version.
5 Practical Results
The optimization method described above is implemented in a prototype compiler. This section shows the optimization results compared to results that can be achieved if no redundancy is used and data distribution is not done for parallel assignments but for loops only. All measurements were done on an 8 processor Parsytec PowerXplorer. Figure 1 shows the runtimes for solving a differential equation in n points. If no redundant computations are allowed, a blockwise distribution is optimal. Therefore the optimized program is compared to a version using blockwise distribution. Especially for small input sizes, the optimized program is nearly one magnitude faster than the blockwise distributed version.
6 Conclusion and Further Work
This article introduces a formal definition for parameterized data distributions that allow redundancies. This makes it possible to transfer the concept of redundant computations to the world of parameterized data distributions such that compilers can use redundant computations efficiently. Based on these definitions, a method for optimizing SIMD programs on distributed systems is given. This method allows combining several local optimization methods for computing local proposals. Furthermore, redundant computations are generated to save communication. For oblivious programs the method selects the optimal proposals for each array assignment. The practical results show a large gain in performance. In the future, different data distribution spaces, i.e. parameters and their interpretations, have to be examined to select suitable ones. Additionally, optimal redistribution schemes for these data distribution spaces should be selected.
References

1. R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer programming. In M. Cosnard, G. R. Gao, and G. M. Silberman, editors, Parallel Architecture and Compilation Techniques (A-50). IFIP, Elsevier Science B.V. (North-Holland), 1994.
2. J. Eisenbiegler, W. Löwe, and A. Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC'97), pages 149-155. IEEE Computer Society Press, March 1997.
3. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1992.
4. P. Z. Lee. Efficient algorithms for data distribution on distributed memory parallel computers. IEEE Transactions on Parallel and Distributed Systems, 8(8):825-839, August 1997.
5. W. Löwe, W. Zimmermann, and J. Eisenbiegler. Optimization of parallel programs with expensive communication. In EUROPAR '96, Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 602-610, 1996.
6. H. Moldenhauer. Kostenbasierte Konfigurierung für Programme und SW-Architekturen. Logos Verlag, Berlin, 1998. Dissertation, Universität Karlsruhe.
7. C. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322-328, 1990.
8. M. Philippsen. Optimierungstechniken zur Übersetzung paralleler Programmiersprachen. Number 292 in VDI Fortschritt-Berichte, Reihe 10: Informatik. VDI-Verlag GmbH, Düsseldorf, 1994. Dissertation, Universität Karlsruhe.
Exploiting Coarse Grain Parallelism from FORTRAN by Mapping it to IF1

Adrianos Lachanas and Paraskevas Evripidou

Department of Computer Science, University of Cyprus, P.O. Box 537, CY-1687 Nicosia, Cyprus
Tel: +357-2-338705 (FAX 339062), skevos@turing.cs.ucy.ac.cy
Abstract. FORTRAN, a classical imperative language, is mapped into IF1, a machine-independent dataflow graph description language with Single Assignment Semantics (SAS). Parafrase 2 (P2) is used as the front-end of our system. It parses the source code, generates an intermediate representation and performs several types of analysis. Our system extends the internal representation of P2 with two data structures: the Variable to Edge Translation Table and the Function to Graph Translation Table. It then proceeds to map the internal representation of P2 into IF1 code. The generated IF1 is then processed by the back-end of the Optimizing SISAL Compiler that generates parallel executables on multiple target platforms. We have tested the correctness and the performance of our system with several benchmarks. The results show that even on a single processor there is no performance degradation from the translation to SAS. Furthermore, it is shown that on a range of processors, reasonable speedup can be achieved.
1 Introduction
The translation of imperative languages to Single Assignment Semantics was thought to be a hard, if not an impossible task. In our work we tackled this problem by mapping Fortran to a Single Assignment Intermediate Form, IF1. Mapping Fortran to SAS is a worthwhile task by itself. However, our work has also targeted the automatic and efficient extraction of parallelism. In this paper we present the compiler named Venus that transforms a classical imperative language (Fortran, C) into an inherently parallel applicative format (IF1). The Venus system uses a multi-language precompiler (Parafrase 2 [5]) and a multi-platform post compiler (Optimizing SISAL Compiler [4]) in order to simplify the translation process and to provide multi-platform support [1]. The P2 intermediate representation of the source code is processed by Venus to generate a dataflow description of the source program. The IF1 dataflow graph description is then passed to the last stage, the Optimizing SISAL Compiler (OSC), which performs optimizations and produces parallel executables. Since IF1 is machine independent and OSC has been implemented successfully on many major parallel computers, the final stage can generate executables for a wide range of computer platforms [4].
Our system maps the entire Fortran 77 semantics into IF1. The only construct that is not yet supported is I/O. We have tested our system with a variety of benchmarks. Our analysis has shown that there is no performance degradation from the mapping to SAS. Furthermore, the SAS allows for more efficient exploitation of parallelism. In this paper we concentrate on the description of the mechanism for mapping coarse grain Fortran constructs into SAS.
2 Compilation Environment
The Target Language (IF1): IF1 is a machine independent dataflow graph description language that adheres to single assignment semantics. It was developed as the intermediate form for the compilation of the SISAL language. As a language, IF1 is based on hierarchical acyclic graphs with four basic components: nodes, edges, types, and graph boundaries. The IF1 Graph Component is an important tool in IF1 for procedural abstraction. Graph boundaries are used to encapsulate related operations. In the case of compound nodes, the graph serves to separate the operations of the various functional parts. Graph boundaries can also delimit functions. The IF1 Edge Component describes the flow of data between nodes and graphs, representing explicit data dependencies. The IF1 Node Component represents operations in the IF1 dataflow graph. Nodes are referenced using their label, which must be unique within the graph. All nodes are functions, and have both input and output ports.

The front-end, Parafrase 2 [5], is a source to source multilingual restructuring compiler that translates Fortran to Cedar Fortran. P2 invokes different passes to transform the program to an internal representation: in the form of a parse tree, symbol table, call graph and dependency graph [5]. P2 does dependency analysis, variable alias analysis, and loop parallelization analysis. Venus starts by traversing the parse tree of P2. The basic module of the parse tree is the function and the subroutine. Each such module contains a parameter list, a declarations list and a linked list of statements/expressions. For each module we generate an FGTT and update it with the parameter and declarations lists. Then we proceed to generate a new scope for the VETT. We then continue to update the VETT with all variables in the declaration list. For each statement in the statement/expression list we generate a new IF1 node and its corresponding edges and update the VETT accordingly.

The back-end, the Optimizing SISAL Compiler [2], is used to produce executable code from the IF1 programs. OSC was chosen for this project since it has been successfully implemented for performance machines like Sequent, SGI, HP, Cray, Convex and most UNIX based workstations [2].
3 Fine and Medium Grain Parallelism
The problem when generating code on the basic (expression) level is the single assignment restriction. This is a concern when Fortran variables are translated into IF1 edges [1]. In Fortran a variable is a memory location that can have many
values stored in it over time, whereas IF1 edges carry single values. An IF1 edge is a single assignment. Our system handles this problem on the basic block level using a structure called Variable to Edge Translating Table (VETT) [3]. The VETT always allows the translator to identify the node and port whose output represents the current value of a particular variable. When a variable is referenced in an expression, the VETT is accessed to obtain the current node and port number of the variable. The VETT fields do not need to change, since the value of the variable is not changed. When a variable is referenced on the left hand side (LHS) of an assignment, the node and port entry in the VETT is changed to represent the new node and port that assigned/carried the new value. An UPDATED flag is set in the VETT entry to show that the variable has changed value within the scope. A variable write consists only of updating the VETT entries (SN, SP) of that variable.

In IF1, arrays can be accessed sequentially or in parallel. Whether an array is accessed sequentially or in parallel is decided by Parafrase 2. During Sequential LHS Access, the dereferencing nodes that expose the location of the old element, the nodes that provide the value for the new element, and the reconstruction nodes that combine the deconstructed pieces of the array together with the new element value build a new array. The mechanism used in Parallel RHS Access is similar to that used in sequential array access. Multiple values of the array elements are generated and are forwarded to IF1 nodes within the compound node. Parallel LHS Access is done using a loop reduction node. Values for the array elements are generated by the parallel loop iterations and are gathered into a new array.
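Conceptually, a VETT is a per-scope map from a variable name to the (source node, source port) pair that currently carries its value, plus the UPDATED flag; a read only consults the map, a write only overwrites the entry. A minimal sketch in Python follows, with field and method names of our own choosing rather than the Venus implementation.

    class VETT:
        """Variable to Edge Translating Table: variable -> (source node, source port, updated)."""
        def __init__(self):
            self.entries = {}                      # name -> [SN, SP, updated]

        def declare(self, name, node, port):
            self.entries[name] = [node, port, False]

        def read(self, name):
            """RHS reference: return the node/port whose output edge carries the value."""
            sn, sp, _ = self.entries[name]
            return sn, sp

        def write(self, name, node, port):
            """LHS reference: the new node/port now carries the value; flag as updated."""
            self.entries[name] = [node, port, True]

    # Translating  a = b + c  into an Add node labelled 12:
    vett = VETT()
    vett.declare("b", node=3, port=1)
    vett.declare("c", node=7, port=1)
    # edges: (3,1)->(12,1) and (7,1)->(12,2); the Add node's output then replaces a's entry
    vett.write("a", node=12, port=1)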
4 Coarse Grain Parallelism
Fortran limits the amount of parallelism we can exploit because of its parameter interface. In Fortran, all input parameters of a module are passed by reference. This allows the body of the function to have access to the input parameter and change its value.
4.1 Mapping of Functions and Subroutines into IF1 Graphs
Fortran modules are interrelated parts of code that have common functionality. Translating modules into IF1 graphs requires additional information about the module, like: the type of the module (function or subroutine), the input parameters, the output parameters and the source-destination node-port of each parameter. The translator handles this problem using a structure named Function to Graph Translating Table (FGTT). The FGTT is generated in two passes. During the First Pass the translator builds the FGTT and stores all the information about a module, such as: the name of the module, the type of the module (function or subroutine) and the common variables to which the module has access (see Figure 1). In IF1 there are no global variables. The translator removes all common statements and replaces
Fig. 1. FGTT functionality at coarse grain parallelism. The figure shows the FGTT entry built for the example subroutine sample(pphd, ppld, pfdh): its type (subroutine), its label, the input parameter list (pphd, ppld, pfdh with IDs 1-3), the output parameter list (IDs 1 and 2) and the function call list (foo, goo).
Fig. 2. (a) The IF1 graph of a Fortran function. Output port 1 is reserved for the function return value. (b) The IF1 graph of a Fortran subroutine.
them with input and output parameters. The values from a Fortran module are returned explicitly or implicitly. The function return values are exported at the first port of the IF1 graph (Figure 2a), whereas the other values will be returned on higher ports. In the first pass the parameters that are assigned in the function body are flagged as updated. In the Second Pass the FGTT is processed. A function's parameter can also be changed by a function call that takes that parameter as input. If the parameter is changed by the function call, it should be returned by the IF1 graph. For each function, the translator accesses the function call list with its input parameters and detects which parameters are changed by each function call. The parameters returned by a module are only those that have been changed in the function body. This reduces the data dependencies in the dataflow model, because when a function parameter does not change its value can be fetched by the nodes ascending the function call node. Fine and medium grain parallelism in the function body is translated with the aid of the VETT as described in Section 3. To create output edges from the graph, the translator traverses the FGTT output parameter list. For each output parameter the translator finds its corresponding VETT entry and creates an output edge whose source node/port values are the corresponding VETT entry's source node/port field values (SN, SP).
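The two passes over the FGTT can be summarized as follows: the first pass records, per module, the parameters, the directly assigned (updated) parameters and the call list; the second pass propagates the updated flags through the calls, and only updated parameters become outputs of the module's IF1 graph. The sketch below uses hypothetical field names of our own and reproduces the situation of Figure 1.

    class FGTT:
        """Function to Graph Translating Table entry for one Fortran module."""
        def __init__(self, name, kind, params):
            self.name, self.kind = name, kind          # kind: 'function' or 'subroutine'
            self.params = params                        # parameter names, in order
            self.updated = set()                        # params assigned in the body (pass 1)
            self.calls = []                              # (callee name, argument names)

    def propagate_updates(fgtts):
        """Pass 2: a parameter also counts as updated if it is passed to a call
        that updates the corresponding formal parameter of the callee."""
        changed = True
        while changed:                                   # iterate to a fixed point
            changed = False
            for f in fgtts.values():
                for callee_name, args in f.calls:
                    callee = fgtts[callee_name]
                    for formal, actual in zip(callee.params, args):
                        if formal in callee.updated and actual in f.params \
                                and actual not in f.updated:
                            f.updated.add(actual)
                            changed = True
        # only updated parameters are returned by the module's IF1 graph
        return {f.name: [p for p in f.params if p in f.updated] for f in fgtts.values()}

    # Example of Fig. 1: sample calls foo(pphd) and goo(pphd, pfdh)
    sample = FGTT("sample", "subroutine", ["pphd", "ppld", "pfdh"])
    sample.updated = {"pphd", "ppld"}
    foo = FGTT("foo", "function", ["b"])
    goo = FGTT("goo", "function", ["b", "c"])
    sample.calls = [("foo", ["pphd"]), ("goo", ["pphd", "pfdh"])]
    print(propagate_updates({"sample": sample, "foo": foo, "goo": goo})["sample"])
    # -> ['pphd', 'ppld']   (pfdh is only read, so it need not be returned)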
4.2 Structured Control Flow: Nested IFs
Structured control flow of Fortran is implemented using the IF1 compound node "Select". The Select node consists of at least three subgraphs: a selector subgraph to determine which code segment to execute, and at least two alternative subgraph branches (corresponding to the THEN and ELSE blocks), each containing one code branch.
Fig. 3. Select node handling: (a) else statement is present, (b) else statement is missing. Each panel shows the FORTRAN code, the resulting Select graph, and the VETT after scanning the THEN part and the ELSE part.
A compound node constitutes a new IF1 scope. The code within the Fortran IF statement is scanned twice by Venus. During the first pass, the translator builds IF1 code for each alternative branch and updates the "then" or "else" flag in the VETT (see Figure 3a, b) for the variables being updated. For example, if a variable is updated on the THEN branch, its "then" flag in the VETT is set to 1 (True). For variables updated on the ELSE branch, their "else" flag in the VETT is set to 1. In the second pass the translator rebuilds the IF1 code for each branch and generates output edges. A variable within the compound node scope that has the "then" or "else" flag in the VETT set to 1 is returned by the compound node. Variables that were flagged as updated have their VETT entries changed in order to reflect the new values exported by the compound node.
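The two-pass handling of a nested IF can be pictured as a small routine: the first pass only records which variables each branch updates, the second pass emits the branch subgraphs, and every flagged variable is exported by the Select node and re-bound in the VETT. This is a simplified sketch with names of our own, not the Venus code.

    def translate_if(then_stmts, else_stmts, vett, emit_branch):
        """Two-pass translation of a structured IF into an IF1 Select compound node.
        then_stmts / else_stmts: lists of (lhs, rhs) assignments; vett: outer-scope
        table (dict: name -> (node, port)); emit_branch: builds one branch subgraph."""
        flags = {}

        # pass 1: scan both branches and only record which variables each branch updates
        for lhs, _ in then_stmts:
            flags.setdefault(lhs, {"then": False, "else": False})["then"] = True
        for lhs, _ in else_stmts:
            flags.setdefault(lhs, {"then": False, "else": False})["else"] = True

        # pass 2: rebuild the IF1 code of each branch and generate its output edges
        emit_branch("then", then_stmts)
        emit_branch("else", else_stmts)

        # every variable with a "then" or "else" flag set is returned by the Select node;
        # its VETT entry is redirected to the compound node's output port
        exported = [name for name, f in flags.items() if f["then"] or f["else"]]
        for port, name in enumerate(exported, start=1):
            vett[name] = ("Select", port)
        return exported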
5 Performance Analysis
Our benchmark suite consists of a number of Livermore Loops and several benchmarks from weather applications. In Table 1 we present the execution time on a single processor of the Livermore Loops written in Fortran, Sisal and IF1 generated from the Venus translator. As shown in the table the loops execute in time similar to their Sisal equivalent and Fortran. This shows that there is no performance degradation when we map Fortran to SAS. Furthermore the results
show that Venus can produce as efficient code from Fortran as OSC does from SISAL. The results obtained from a multiprocessor Sequent Symmetry S81 and an HP Convex system are summarized in Table 2. (12L/20L are collections of 12 and 20 Livermore Loop subroutines. A, E, F, G correspond to the vtadv, cloudy, hzadv, depos modules from the Eta-dust code for regional weather prediction and dust accumulation. Eta-dust is a product of the National Meteorological Center and the University of Belgrade. B, C, D correspond to the modules simple, ricard, yale, which are benchmarks taken from weather prediction Fortran codes.) The results are summarized in terms of the speedup gained on a number of processors (ranging from 2 to 6) over 1 processor. They show that in all cases Venus was able to produce IF1 code with respectable speedups.
Table 1. Execution time of Livermore Loops on a Sun OS/5.

                     Serial execution time in sec (number of iterations)
Livermore     Fortran                   Fortran to IF1            SISAL
Loop #         300     600    1000       300     600    1000       300     600    1000
 1            72.3   118.5   179.9      70.1   115.3   162.8      70.4   112.3   177.5
 2            30.1    42.2    58.7      30.9    42.1    60.4      30.4    40.8    55.7
 3            57.3    96.6   149.1      60.2    92.9   144.7      54.8    93.2   148.1
 4           369.7   738.0  1228.2     550.7   742.1  1220.7     361.2   735.4  1199.9
 5            54.6    83.6   122.2      55.7    83.9   130.4      51.7    83.1   118.3
 6            26.2    35.4    57.3      26.5    33.3    59.1      24.9    32.5    58.7
 7            53.8    71.7    95.9      54.4    70.2    99.3      50.9    70.5    99.6
 8           119.7   134.5   154.7     123.4   154.2   162.1     102.4   127.9   145.2
 9            42.2    74.5   117.5      42.3    74.9   121.4      38.7    72.9   114.7
10            32.6    51.4    74.3      31.4    52.3    72.6      30.5    50.4    72.8
11            50.9    83.6   127.5      52.3    82.6   125.3      47.4    85.7   131.4
12            33.2    52.1    77.3      33.7    53.7    74.2      32.5    52.6    74.6
12L          784.8                     731.4
20L         1746.1                    1672.8
6 Conclusions
In this paper we presented Venus, a system that maps Fortran programs into Single Assignment Semantics, with special emphasis on the extraction of parallelism. Venus is able to process Fortran 77 with the exception of the I/O constructs. We have developed the necessary framework for the mapping and developed two special data structures for implementing the mapping. The overall system is robust and has been thoroughly tested both on a uniprocessor and on
Table 2. Speedup gained from Fortran-to-IF1 programs over a range of processors

        Livermore Loop #
CPUs      2    6    7    8   10   14   15   16   17   19   20   21  12L  20L
 2      2.1  1.8  2.1  1.9  1.6  2.0  1.8  1.3  1.4  1.9  1.2  1.9  1.8  1.6
 3      3.0  2.1  2.9  2.6  2.3  2.9  2.7  1.6  1.8  2.7  1.4  3.1  2.7  2.5
 4      4.0  2.5  3.9  3.7  3.2  4.0  3.5  2.1  2.2  3.4  1.5  3.8  3.4  3.2
 5      4.2  2.8  4.9  4.2  3.9  4.9  4.7  2.6  2.2  4.0  1.7  4.9  4.0  3.7
 6      5.9  3.2  5.9  4.7  4.5  5.7  5.3  4.2  3.5  4.6  2.1  5.7  4.7  4.3

        Weather benchmarks
CPUs      A    B    C    D    E    F    G
 2      1.4  1.7  1.6  1.7  2.0  1.5  2.4
 3      1.7  2.5  2.3  2.3  2.6  1.9  3.0
 4      2.2  3.1  3.2  2.9  3.4  2.2  4.1
a multiprocessor. The results have shown that there is no performance penalty when we map from Fortran to SAS. Furthermore we can exploit the same level of parallelism as we do from equivalent programs written in a functional language. This paper has presented the case for combining the benefits of imperative syntax with functional semantics by transforming structured sequential Fortran code to the single assignment intermediate format, IF1. Furthermore, this project couples OSC with Fortran 77. Thus, it has the potential of exposing a much larger segment of the scientific community to portable parallel programming with OSC.
References

1. Ackerman, W. B., "Dataflow languages." In S. S. Thakkar, editor, Selected Reprints on Dataflow and Reduction Architectures, Computer Society Press, Washington D.C., 1987.
2. Cann, D. C., "The Optimizing SISAL Compiler." Lawrence Livermore National Laboratory, Livermore, California.
3. Evripidou, P. and Robert, J. B., "Extracting Parallelism in FORTRAN by Translation to a Single Assignment Intermediate Form." Proceedings of the 8th IEEE International Parallel Processing Symposium, April 1994.
4. Feo, J. T., and Cann, D. C., "A report on the Sisal language project." Journal of Parallel and Distributed Computing, 10, (1990) 349-366.
5. Polychronopoulos, C. D., "Parafrase 2: an environment for parallelizing, partitioning, synchronizing and scheduling programs on multiprocessors." Proceedings of the 1989 International Conference on Parallel Processing, 1989.
6. Polychronopoulos, C. D., and Girkar, M. B., "Parafrase-2 Programmer's Manual", February 1995.
7. Skedzielewski, S., "SISAL." In B. K. Szymanski, editor, Parallel Functional Languages and Compilers, Addison-Wesley, Menlo Park, CA, 1991.
8. Skedzielewski, S., Glauert, J., "IF1: An intermediate form for Applicative languages." Technical Report TR M-170, University of California - Lawrence Livermore National Laboratory, August 1985.
HPcc as High Performance Commodity Computing on Top of Integrated Java, CORBA, COM and Web Standards G.C. Fox, W. Furmanski, T. Haupt, E. Akarsu, and H. Ozdemir Northeast Parallel Architectures Center, Syracuse University, Syracuse NY, USA [email protected], [email protected] http://www.npac.syr.edu Abstract. We review the growing power and capability of commodity computing and communication technologies largely driven by commercial distributed information systems. These systems are built from CORBA, Microsoft’s COM, JavaBeans, and rapidly advancing Web approaches. One can abstract these to a three-tier model with largely independent clients connected to a distributed network of servers. The latter host various services including object and relational databases and of course parallel and sequential computing. High performance can be obtained by combining concurrency at the middle server tier with optimized parallel back end services. The resultant system combines the needed performance for large-scale HPCC applications with the rich functionality of commodity systems. Further the architecture with distinct interface, server and specialized service implementation layers, naturally allows advances in each area to be easily incorporated. We illustrate how performance can be obtained within a commodity architecture and we propose a middleware integration approach based on JWORB (Java Web Object Broker) multi-protocol server technology. Examples are given from collaborative systems, support of multidisciplinary interactions, proposed visual HPCC ComponentWare, quantum Monte Carlo and distributed interactive simulations.
1 Introduction
We believe that industry and the loosely organized worldwide collection of (freeware) programmers is developing a remarkable new software environment of unprecedented quality and functionality. We call this DcciS - Distributed commodity computing and information System. We believe that this can benefit HPCC in several ways and allow the development of both more powerful parallel programming environments and new distributed metacomputing systems. In the second section, we define what we mean by commodity technologies and explain the different ways that they can be used in HPCC. In the third and critical section, we define an emerging architecture of DcciS in terms of a conventional 3 tier commercial computing model, augmented by distributed object and component technologies of Java, CORBA, COM and the Web. This is followed in
sections four and five by more detailed discussion of the HPcc core technologies and high-level services. In this and related papers [5], we discuss several examples to address the following critical research issue: can high performance systems - called HPcc or High Performance Commodity Computing - be built on top of DcciS. Examples include integration of collaboration into HPcc; the natural synergy of distributed simulation and the HLA standard with our architecture; and the step from object to visual component based programming in high performance distributed computing. Our claim, based on early experiments and prototypes, is that HPcc is feasible but we need to exploit fully the synergies between several currently competing commodity technologies. We refer to our approach towards HPcc, based on integrating several popular distributed object frameworks, as Pragmatic Object Web and we describe a specific integration methodology based on a multi-protocol middleware server, JWORB (Java Web Object Request Broker).
2 Commodity Technologies and Their Use in HPCC
The last three years have seen an unprecedented level of innovation and progress in commodity technologies driven largely by the new capabilities and business opportunities of the evolving worldwide network. The web is not just a document access system supported by the somewhat limited HTTP protocol. Rather it is the distributed object technology which can build general multi-tiered enterprise intranet and internet applications. CORBA is turning from a sleepy heavyweight standards initiative to a major competitive development activity that battles with COM, JavaBeans and new W3C object initiatives to be the core distributed object technology. There are many driving forces and many aspects to DcciS but we suggest that the three critical technology areas are the web, distributed objects and databases. These are being linked and we see them subsumed in the next generation of “object-web” [1] technologies, which is illustrated by the recent Netscape and Microsoft version 4 browsers. Databases are older technologies but their linkage to the web and distributed objects, is transforming their use and making them more widely applicable. In each commodity technology area, we have impressive and rapidly improving software artifacts. As examples, we have at the lower level the collection of standards and tools such as HTML, HTTP, MIME, IIOP, CGI, Java, JavaScript, Javabeans, CORBA, COM, ActiveX, VRML, new powerful object brokers (ORB’s), dynamic Java clients and servers including applets and servlets, and new W3C technologies towards the Web Object Model (WOM) such as XML, DOM and RDF. At a higher level collaboration, security, commerce, multimedia and other applications/services are rapidly developing using standard interfaces or frameworks and facilities. This emphasizes that equally and perhaps more importantly than raw technologies, we have a set of open interfaces enabling distributed modular software development. These interfaces are at both low and high levels and
the latter generate a very powerful software environment in which large preexisting components can be quickly integrated into new applications. We believe that there are significant incentives to build HPCC environments in a way that naturally inherits all the commodity capabilities so that HPCC applications can also benefit from the impressive productivity of commodity systems. NPAC's HPcc activity is designed to demonstrate that this is possible and useful so that one can achieve simultaneously both high performance and the functionality of commodity systems. Note that commodity technologies can be used in several ways. This article concentrates on exploiting the natural architecture of commodity systems but more simply, one could just use a few of them as "point solutions". This we can term a "tactical implication" of the set of the emerging commodity technologies and illustrate below with some examples:

– Perhaps VRML, Java3D or DirectX are important for scientific visualization;
– Web (including Java applets and ActiveX controls) front-ends provide convenient customizable interoperable user interfaces to HPCC facilities;
– Perhaps the public key security and digital signature infrastructure being developed for electronic commerce, could enable more powerful approaches to secure HPCC systems;
– Perhaps Java will become a common scientific programming language and so effort now devoted to Fortran and C++ tools needs to be extended or shifted to Java;
– The universal adoption of JDBC (Java Database Connectivity), rapid advances in Microsoft's OLEDB/ADO transparent persistence standards and the growing convenience of web-linked databases could imply a growing importance of systems that link large scale commercial databases with HPCC computing resources;
– JavaBeans, COM, CORBA and WOM form the basis of the emerging "object web" which analogously to the previous bullet could encourage a growing use of modern object technology;
– Emerging collaboration and other distributed information systems could allow new distributed work paradigms which could change the traditional teaming models in favor of those for instance implied by the new NSF Partnerships in Advanced Computation.

However probably more important is the strategic implication of DcciS which implies certain critical characteristics of the overall architecture for a high performance parallel or distributed computing system. First we note that we have seen over the last 30 years many other major broad-based hardware and software developments – such as IBM business systems, UNIX, Macintosh/PC desktops, video games – but these have not had profound impact on HPCC software. However we suggest the DcciS is different for it gives us a world-wide/enterprise-wide distributing computing environment. Previous software revolutions could help individual components of a HPCC software system but DcciS can in principle be the backbone of a complete HPCC software system – whether it be for some
global distributed application, an enterprise cluster or a tightly coupled large scale parallel computer. In a nutshell, we suggest that “all we need to do” is to add “high performance” (as measured by bandwidth and latency) to the emerging commercial concurrent DcciS systems. This “all we need to do” may be very hard but by using DcciS as a basis we inherit a multi-billion dollar investment and what in many respects is the most powerful productive software environment ever built. Thus we should look carefully into the design of any HPCC system to see how it can leverage this commercial environment.
3 Three Tier High Performance Commodity Computing
Fig. 1. Industry 3-tier view of enterprise Computing
We start with a common modern industry view of commodity computing with the three tiers shown in fig 1. Here we have customizable client and middle tier systems accessing “traditional” back end services such as relational and object databases. A set of standard interfaces allows a rich set of custom applications to be built with appropriate client and middleware software. As indicated on figure, both these two layers can use web technology such as Java and Javabeans, distributed objects with CORBA and standard interfaces such as JDBC (Java Database Connectivity). There are of course no rigid solutions and one can get “traditional” client server solutions by collapsing two of the layers together. For instance with database access, one gets a two tier solution by either incorporating custom code into the “thick” client or in analogy to Oracle’s PL/SQL, compile the customized database access code for better performance and incorporate the compiled code with the back end server. The latter like the general 3-tier solution, supports “thin” clients such as the currently popular network computer.
The commercial architecture is evolving rapidly and is exploring several approaches which co-exist in today’s (and any realistic future) distributed information system. The most powerful solutions involve distributed objects. Currently, we are observing three important commercial object systems - CORBA, COM and JavaBeans, as well as the ongoing efforts by the W3C, referred by some as WOM (Web Object Model), to define pure Web object/component standards. These have similar approaches and it is not clear if the future holds a single such approach or a set of interoperable standards. CORBA is a distributed object standard managed by the OMG (Object Management Group) comprised of 700 companies. COM is Microsoft’s distributed object technology initially aimed at Window machines. JavaBeans (augmented with RMI and other Java 1.1 features) is the “pure Java” solution - cross platform but unlike CORBA, not cross-language! Finally, WOM is an emergent Web model that uses new standards such as XML, RDF and DOM to specify respectively the dynamic Web object instances, classes and methods. Legion is an example of a major HPCC focused distributed object approach; currently it is not built on top of one of the three major commercial standards. The HLA/RTI [2] standard for distributed simulations in the forces modeling community is another important domain specific distributed object system. It appears to be moving to integration with CORBA standards. Although a distributed object approach is attractive, most network services today are provided in a more ad-hoc fashion. In particular today’s web uses a “distributed service” architecture with HTTP middle tier servers invoking via the CGI mechanism, C and Perl programs linking to databases, simulations or other custom services. There is a trend toward the use of Java servers with the servlet mechanism for the services. This is certainly object based but does not necessarily implement the standards implied by CORBA, COM or Javabeans. However, this illustrates an important evolution as the web absorbs object technology with the evolution from low- to high-level network standards: – from HTTP to Java Sockets to IIOP or RMI – from Perl CGI Script to Java Program to JavaBean distributed object As an example consider the evolution of networked databases. Originally these were client-server with a proprietary network access protocol. In the next step, Web linked databases produced a three tier distributed service model with an HTTP server using a CGI program (running Perl for instance) to access the database at the backend. Today we can build databases as distributed objects with a middle tier JavaBean using JDBC to access the backend database. Thus a conventional database is naturally evolving to the concept of managed persistent objects. Today as shown in fig 2, we see a mixture of distributed service and distributed object architectures. CORBA, COM, Javabean, HTTP Server + CGI, Java Server and servlets, databases with specialized network accesses, and other services co-exist in the heterogeneous environment with common themes but disparate implementations. We believe that there will be significant convergence as a more uniform architecture is in everyone’s best interest.
Fig. 2. Today’s Heterogeneous Interoperating Hybrid Server Architecture. HPcc involves adding to this system, high performance in the third tier.
We also believe that the resultant architecture will be integrated with the web so that the latter will exhibit distributed object architecture shown in fig 3. More generally the emergence of IIOP (Internet Inter-ORB Protocol), CORBA2, rapid advances with the Microsoft’s COM, DCOM, and COM+, and the realization that both CORBA and COM are naturally synergistic with Java is starting a new wave of “Object Web” developments that could have profound importance. Java is not only a good language to build brokers but also Java objects are the natural inhabitants of object databases. The resultant architecture in fig 3 shows a small object broker (a so-called ORBlet) in each browser as in Netscape’s current plans. Most of our remarks are valid for all these approaches to a distributed set of services. Our ideas are however easiest to understand if one assumes an underlying architecture which is a CORBA or Javabean distributed object model integrated with the web. We wish to use this service/object evolving 3-tier commodity architecture as the basis of our HPcc environment. We need to naturally incorporate (essentially) all services of the commodity web and to use its protocols and standards wherever possible. We insist on adopting the architecture of commodity distribution systems as complex HPCC problems require the rich range of services offered by the broader community systems. Perhaps we could “port” commodity services to a custom HPCC system but this would require continued upkeep with each new upgrade of the commodity service. By adopting the architecture of the commodity systems, we make it easier to track their rapid evolution and expect it will give high functionality HPCC systems, which will naturally track the evolving Web/distributed object worlds. This requires us to enhance certain services to get higher performance and to incorporate new capabilities such as high-end visualization (e.g. CAVE’s) or massively parallel systems where needed. This is the essential research challenge for HPcc for we must not only enhance performance where needed but do it in a way that is preserved as we evolve the basic commodity systems.
Fig. 3. Integration of Object Technologies (CORBA) and the Web
We certainly have not demonstrated clearly that this is possible but we have a simple strategy that we will elaborate in ref. [5] and sec. 5. Thus we exploit the three-tier structure and keep HPCC enhancements in the third tier, which is inevitably the home of specialized services in the object-web architecture. This strategy isolates HPCC issues from the control or interface issues in the middle layer. If successful we will build an HPcc environment that offers the evolving functionality of commodity systems without significant re-engineering as advances in hardware and software lead to new and better commodity products. Returning to fig 2, we see that it elaborates fig 1 in two natural ways. Firstly the middle tier is promoted to a distributed network of servers; in the “purest” model these are CORBA/ COM/ Javabean object-web servers, but obviously any protocol compatible server is possible. This middle tier layer includes not only networked servers with many different capabilities (increasing functionality) but also multiple servers to increase performance on an given service. The use of high functionality but modest performance communication protocols and interfaces at the middle tier limits the performance levels that can be reached in this fashion. However this first step gives a modest performance scaling, parallel (implemented if necessary, in terms of multiple servers) HPcc system which includes all commodity services such as databases, object services, transaction processing and collaboratories. The next step is only applied to those services with insufficient performance. Naively we “just” replace an existing back end (third tier) implementation of a commodity service by its natural HPCC high performance version. Sequential or socket based messaging distributed simulations are replaced by MPI (or equivalent) implementations on low latency high
bandwidth dedicated parallel machines. These could be specialized architectures or “just” clusters of workstations. Note that with the right high performance software and network connectivity, workstations can be used at tier three just as the popular “LAN consolidation” use of parallel machines like the IBM SP-2, corresponds to using parallel computers in the middle tier. Further a “middle tier” compute or database server could of course deliver its services using the same or different machine from the server. These caveats illustrate that as with many concepts, there will be times when the relatively clean architecture of fig 2 will become confused. In particular the physical realization does not necessarily reflect the logical architecture shown in fig 2.
4 Core Technologies for High Performance Commodity Systems

4.1 Multidisciplinary Application
We can illustrate the commodity technology strategy with a simple multidisciplinary application involving the linkage of two modules A and B – say CFD and structures applications respectively. Let us assume both are individually parallel but we need to link them. One could view the linkage sequentially as in fig 4, but often one needs higher performance and one would "escape" totally into a layer which linked decomposed components of A and B with high performance MPI (or PVMPI). Here we view MPI as the "machine language" of the higher-level commodity communication model given by approaches such as WebFlow from NPAC. There is the "pure" HPCC approach of fig 5, which replaces all commodity web communication with HPCC technology. However there is a middle ground between the implementations of figures 4 and 5 where one keeps control (initialization etc.) at the server level and "only" invokes the high performance back end for the actual data transmission. This is shown in fig 6 and appears to obtain the advantages of both commodity and HPCC approaches for we have the functionality of the Web and where necessary the performance of HPCC software. As we wish to preserve the commodity architecture as the baseline, this strategy implies that one can confine HPCC software development to providing high performance data transmission with all of the complex control and service provision capability inherited naturally from the Web.
4.2 JavaBean Communication Model
We note that JavaBeans (which are one natural basis of implementing program modules in the HPcc approach) provide a rich communication mechanism, which supports the separation of control (handshake) and implementation. As shown below in fig 7, Javabeans use the JDK 1.1 AWT event model with listener objects and a registration/call-back mechanism.
Fig. 4. Simple sequential server approach to Linking Two Modules
Fig. 5. Full HPCC approach to Linking Two Modules
Fig. 6. Hybrid approach to Linking Two Modules
Fig. 7. JDK 1.1 Event Model used by (inter alia) Javabeans
JavaBeans communicate indirectly with one or more "listener objects" acting as a bridge between the source and sink of data. In the model described above, this allows a neat implementation of separated control and explicit communication with listeners (a.k.a. sink control) and source control objects residing in the middle tier. These control objects decide if high performance is necessary or possible and invoke the specialized HPCC layer. This approach can be used to advantage in "run-time compilation" and resource management with execution schedules and control logic in the middle tier and libraries such as MPI, PCRC and CHAOS implementing the determined data movement in the high performance (third) tier. Parallel I/O and "high-performance" CORBA can also use this architecture. In general, this listener model of communication provides a virtualization of communication that allows a separation of control and data transfer that is largely hidden from the user and the rest of the system. Note that current Internet security systems (such as SSL and SET) use high functionality public keys in the control level but the higher performance secret key cryptography in bulk data transfer. This is another illustration of the proposed hybrid multi-tier communication mechanism.
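The essence of this listener-based separation can be illustrated independently of the JavaBeans API: the data source fires an event to a registered control listener, and only that listener decides whether the bulk transfer goes over the commodity path or is delegated to a high performance channel. The sketch below is written in Python for brevity; the class and method names are ours and merely stand in for the JavaBeans registration/call-back machinery.

    class SinkControl:
        """Middle-tier listener: receives the control event and chooses the transfer path."""
        def __init__(self, hp_channel, commodity_channel, threshold):
            self.hp, self.commodity, self.threshold = hp_channel, commodity_channel, threshold

        def data_ready(self, descriptor):
            # control (handshake) stays in the middle tier ...
            channel = self.hp if descriptor["bytes"] > self.threshold else self.commodity
            # ... only the actual data movement is delegated to the chosen tier-3 channel
            channel.transfer(descriptor)

    class Source:
        """Data source bean: knows nothing about how the transfer is carried out."""
        def __init__(self):
            self.listeners = []

        def add_listener(self, listener):          # registration/call-back mechanism
            self.listeners.append(listener)

        def fire(self, descriptor):
            for l in self.listeners:
                l.data_ready(descriptor)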
4.3 JWORB based Middleware
Enterprise JavaBeans that control, mediate and optimize HPcc communication as described above need to be maintained and managed in a suitable middleware container. Within our integrative approach of Pragmatic Object Web, a CORBA based environment for the middleware management with IIOP based control protocol provides us with the best encapsulation model for EJB components. Such middleware ORBs need to be further integrated with the Web server based middleware to assure smooth Web browser interfaces and backward compatibility with CGI and servlet models. This leads us to the concept of JWORB (Java Web
Object Request Broker)[6] - a multi-protocol Java network server that integrates several core services (so far dispersed over various middleware nodes as in fig 2) within a single uniform middleware management framework. An early JWORB prototype has been recently developed at NPAC. The base server has HTTP and IIOP protocol support as illustrated in fig 8. It can serve documents as an HTTP Server and it handles the IIOP connections as an Object Request Broker. As an HTTP server, JWORB supports base Web page services, Servlet (Java Servlet API) and CGI 1.1 mechanisms. In its CORBA capacity, JWORB is currently offering the base remote method invocation services via CDR based IIOP and we are now implementing higher level support such as the Interface Repository, Portable Object Adapter and selected Common Object Services.
Fig. 8. Overall architecture of the JWORB based Pragmatic Object Web middleware

During the startup/bootstrap phase, the core JWORB server checks its configuration files to detect which protocols are supported and it loads the necessary protocol classes (Definition, Tester, Mediator, Configuration) for each protocol. The Definition interface provides the necessary Tester, Configuration and Mediator objects. The Tester object inspects the current network package and decides how to interpret this particular message format. The Configuration object is responsible for the configuration parameters of a particular protocol. The Mediator object serves the connection. New protocols can be added simply by implementing the four classes described above and by registering a new protocol with the JWORB server. After JWORB accepts a connection, it asks each protocol handler object whether it can recognize this protocol or not. If JWORB finds a handler which can serve the connection, it delegates further processing of the connection stream
to this protocol handler. The current algorithm looks at each protocol according to its order in the configuration file. This process can be optimized with a randomized or prediction based algorithm. At present, only HTTP and IIOP messaging is supported and the current protocol is simply detected based on the magic anchor string value (GIOP for IIOP and POST, GET, HEAD etc. for HTTP). We are currently working on further extending JWORB with the DCE RPC protocol and an XML co-processor so that it can also act as a DCOM and WOM/WebBroker server.
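The dispatch on the magic anchor string can be pictured with a few lines of pseudo-server code (a simplified sketch, not the actual JWORB implementation; handler names are ours): the server peeks at the first bytes of a new connection and hands the stream to the first registered handler that recognizes them.

    def detect_protocol(first_bytes, handlers):
        """Ask each registered protocol handler, in configuration order, whether it
        recognizes the connection; return the first one that does."""
        for handler in handlers:
            if handler.test(first_bytes):
                return handler
        return None

    class IIOPHandler:
        def test(self, data):
            return data.startswith(b"GIOP")            # IIOP messages start with the GIOP magic

    class HTTPHandler:
        def test(self, data):
            return data.split(b" ", 1)[0] in (b"GET", b"POST", b"HEAD")

    handlers = [HTTPHandler(), IIOPHandler()]          # order taken from the configuration file
    print(detect_protocol(b"GET /index.html HTTP/1.0", handlers).__class__.__name__)  # HTTPHandler
    print(detect_protocol(b"GIOP\x01\x00", handlers).__class__.__name__)              # IIOPHandler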
5 Commodity Services in HPcc
We have already stressed that a key feature of HPcc is its support of the natural inclusion into the environment of commodity services such as databases, web servers and object brokers. Here we give some further examples of commodity services that illustrate the power of the HPcc approach.
5.1 Distributed Collaboration Mechanisms
The current Java Server model for the middle tier naturally allows one to integrate collaboration into the computing model, and our approach allows one to “re-use” collaboration systems built for the general Web market. Thus one can, without any special HPCC development, address areas such as computational steering and collaborative design, which require people to be integrated with the computational infrastructure. In fig 9, we define collaborative systems as integrating client side capabilities together. In steering, these are people with analysis and visualization software. In engineering design, one would also link design (such as CATIA or AutoCAD) and planning tools. In both cases, one would need the base collaboration tools such as white-boards, chat rooms and audio-video conferencing. If we are correct in viewing collaboration (see Tango [10, 11] and Habanero [12]) as sharing of services between clients, the 3 tier model naturally separates HPCC and collaboration and allows us to integrate into the HPCC environment the very best commodity technology, which is likely to come from larger fields such as business or (distance) education. Currently, commodity collaboration systems are built on top of the Web, and although emerging facilities such as Work Flow imply approaches to collaboration, these are not yet defined from a general CORBA point of view. We assume that collaboration is sufficiently important that it will emerge as a CORBA capability to manage the sharing and replication of objects. Note that CORBA is a server-server model and “clients” are viewed as servers (i.e. they run ORBs) by outside systems. This makes the object-sharing view of collaboration natural, whether the application runs on the “client” (e.g. a shared Microsoft Word document) or on the back-end tier, as in the case of a shared parallel computer simulation.
Fig. 9. Collaboration in today’s Java Web Server implementation of the 3 tier computing model. Typical clients (on top right) are independent but Java collaboration systems link multiple clients through object (service) sharing

5.2 Object Web and Distributed Simulation
The integration of HPCC with distributed objects provides an opportunity to link the classic HPCC ideas with those of DoD’s distributed simulation DIS or Forces Modeling FMS community. The latter do not make extensive use of the Web these days, but they have a commitment to CORBA with their HLA (High Level Architecture) and RTI (Runtime Infrastructure) initiatives. Distributed simulation is traditionally built with distributed event driven simulators managing C++ or equivalent objects. We suggest that the Object Web (and parallel and distributed ComponentWare described in sec. 5.3) is a natural convergence point for HPCC and DIS/FMS. This would provide a common framework for time stepped, real time and event driven simulations. Further, it will allow one to more easily build systems that integrate these concepts, as is needed in many major DoD projects – as exemplified by the FMS and IMT DoD computational activities which are part of the HPCC Modernization program. HLA is a distributed object technology with the object model defined by the Object Model Template (OMT) specification and including the Federation Object Model (FOM) and the Simulation Object Model (SOM) components. HLA FOM objects interact by exchanging HLA interaction objects via the common Run-Time Infrastructure (RTI) acting as a software bus similar to CORBA. The current HLA/RTI follows a custom object specification, but DMSO’s longer term plans include transferring HLA to industry via an OMG CORBA Facility for Interactive Modeling and Simulation. At NPAC, we are anticipating these developments and we are building a prototype RTI implementation in terms of Java/CORBA objects using the JWORB middleware [7]. RTI is given by some 150 communication and/or utility calls, packaged as 6 main management services: Federation Management, Object Management, Declaration Management, Ownership Management, Time Management, Data Distribution Management, and one general purpose utility service.
Fig. 10. Overall architecture of the Object Web RTI - a JWORB based RTI prototype recently developed at NPAC
Our design shown in fig 10 is based on 9 CORBA interfaces, including 6 Managers, 2 Ambassadors and the RTIKernel. Since each Manager is mapped to an independent CORBA object, we can easily provide support for distributed management by simply placing individual managers on different hosts. Communication between simulation objects and the RTI bus is done through the RTIambassador interface. Communication between the RTI bus and the simulation objects is done through their FederateAmbassador interfaces. The simulation developer writes/extends FederateAmbassador objects and uses the RTIambassador object obtained from the RTI bus. The RTIKernel object knows the handles of all manager objects and creates the RTIambassador object upon a federate request. The simulation obtains the RTIambassador object from the RTIKernel, and from then on all interactions with the RTI bus are handled through the RTIambassador object. The RTI bus calls back (asynchronously) the FederateAmbassador object provided by the simulation, and in this way the federate receives the interactions/attribute updates coming from the RTI bus. Although coming from the DoD computing domain, RTI follows generic design patterns and is applicable to a much broader range of distributed applications, including modeling and simulation but also collaboration, on-line gaming or visual authoring. From the HPCC perspective, RTI can be viewed as a high level object based extension of low level messaging libraries such as PVM or MPI. Since it supports shared object management and publish/subscribe based multicast channels, RTI can also be viewed as an advanced collaboratory framework, capable of handling both multi-user and multi-agent/multi-module distributed systems.
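The federate-side flow described above can be summarized in a short sketch. The class and method names (RTIKernel, RTIambassador, FederateAmbassador, receive_interaction, publish, join) are illustrative stand-ins, not the actual HLA/RTI or CORBA signatures.

    # Sketch of the federate / RTI bus handshake described above; all names are
    # hypothetical placeholders for the CORBA interfaces.
    class FederateAmbassador:
        """Written/extended by the simulation developer; called back by the RTI bus."""
        def receive_interaction(self, interaction):
            print("federate received:", interaction)

    class RTIambassador:
        """Obtained from the RTIKernel; the federate's only way to talk to the RTI bus."""
        def __init__(self, bus):
            self.bus = bus
        def publish(self, interaction):
            self.bus.route(interaction)

    class RTIKernel:
        """Knows the handles of all managers; creates an RTIambassador per federate."""
        def __init__(self):
            self.federates = []
        def join(self, fed_ambassador):
            self.federates.append(fed_ambassador)
            return RTIambassador(self)
        def route(self, interaction):
            for fed in self.federates:      # asynchronous callbacks in the real RTI
                fed.receive_interaction(interaction)

    # Usage: the simulation provides its FederateAmbassador and gets an RTIambassador back.
    kernel = RTIKernel()
    rti = kernel.join(FederateAmbassador())
    rti.publish({"attribute": "position", "value": 42})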
5.3 Visual HPCC ComponentWare
HPCC does not have a good reputation for the quality and productivity of its programming environments. Indeed, one of the difficulties with adoption of parallel systems is the rapid improvement in performance of workstations and recently PC’s, with much better development environments. Parallel machines do have a clear performance advantage, but for many users this is more than counterbalanced by the greater programming difficulties. We can give two reasons for the lower quality of HPCC software. Firstly, parallelism is intrinsically hard to find and express. Secondly, the PC and workstation markets are substantially larger than HPCC and so can support a greater investment in attractive software tools such as the well-known PC visual programming environments. The DcciS revolution offers an opportunity for HPCC to produce programming environments that are both more attractive than current systems and more competitive than previous HPCC programming environments with those being developed by the PC and workstation world. Here we can also give two reasons. Firstly, the commodity community must face some difficult issues as it moves to a distributed environment, which has challenges where in some cases the HPCC community has substantial expertise. Secondly, as already described, we claim that HPCC can leverage the huge software investment of these larger markets.
Fig. 11. System Complexity (vertical axis) versus User Interface (horizontal axis) tracking of some technologies
In fig 11, we sketch the state of object technologies for three levels of system complexity – sequential, distributed and parallel – and three levels of user (programming) interface – language, components and visual. Industry starts at the top left and moves down and across the first two rows. Much of the current commercial activity is in visual programming for sequential machines (top right box) and distributed components (middle box). Crossware (from Netscape) represents
an initial talking point for distributed visual programming. Note that HPCC already has experience in parallel and distributed visual interfaces (CODE and HenCE as well as AVS and Khoros). We suggest that one can merge this experience with Industry’s Object Web deployment and develop attractive visual HPCC programming environments as shown in fig 12. Currently NPAC’s WebFlow system [9][12] uses a Java graph editor to compose systems built out of modules. This could become a prototype HPCC ComponentWare system if it is extended with the modules becoming JavaBeans and with integration with CORBA. Note that the linkage of modules would incorporate the generalized communication model of fig 7, using a mesh of JWORB servers to manage a resource pool of distributed HPcc components. An early version of such a JWORB based WebFlow environment, illustrated in fig 13, is in fact operational at NPAC, and we are currently building the Object Web management layer including the Enterprise JavaBeans based encapsulation and communication support discussed in the previous section. Returning to fig 1, we note that as industry moves to distributed systems, it is implicitly taking the sequential client-side PC environments and using them in the much richer server (middle-tier) environment which traditionally had more closed proprietary systems.
Fig. 12. Visual Authoring with Software Bus Components
We will then generate an environment such as fig 12, including object broker services and a set of horizontal (generic) and vertical (specialized application) frameworks. We do not yet have much experience with an environment such as fig 12, but suggest that HPCC could benefit from its early deployment, without the usual multi-year lag behind the larger industry efforts for PC’s. Further, the diagram implies a set of standardization activities (establish frameworks)
and new models for services and libraries that could be explored in prototype activities.
Fig. 13. Top level view of the WebFlow environment with JWORB middleware over Globus metacomputing or NT cluster backend
5.4 Early User Communities
In parallel with refining the individual layers towards a production quality HPcc environment, we have started testing our existing prototypes such as WebFlow, JWORB and WebHLA on selected application domains. Within the NPAC participation in the NCSA Alliance, we are working with Lubos Mitas in the Condensed Matter Physics Laboratory at NCSA on adapting WebFlow for Quantum Monte Carlo simulations [13]. This application is illustrated in figures 14 and 15 and can be characterized as follows. A chain of high performance applications (both commercial packages such as GAUSSIAN or GAMESS and custom developed codes) is run repeatedly for different data sets. Each application can be run on several different (multiprocessor) platforms, and consequently, input and output files must be moved between machines. Output files are visually inspected by the researcher; if necessary, applications are rerun with modified input parameters. The output file of one application in the chain is the input of the next one, after a suitable format conversion. The high performance part of the backend tier is implemented using the GLOBUS toolkit [14]. In particular, we use MDS (metacomputing directory services) to identify resources, GRAM (globus resource allocation manager) to allocate resources including mutual, SSL based authentication, and GASS (global access to secondary storage) for high performance data transfer. The high performance part of the backend is augmented with a commodity DBMS (servicing the
Fig. 14. Screendump of an example WebFlow session: running Quantum Simulations on a virtual metacomputer. Module GAUSSIAN is executed on Convex Exemplar at NCSA, module GAMESS is executed on SGI Origin2000, data format conversion module is executed on Sun SuperSparc workstation at NPAC, Syracuse, and file manipulation modules (FileBrowser, EditFile, GetFile) are run on the researcher’s desktop
Permanent Object Manager) and an LDAP-based custom directory service to maintain geographically distributed data files generated by the Quantum Simulation project. The diagram illustrating the WebFlow implementation of the Quantum Simulation is shown in fig 15. Another large application domain we are currently addressing is DoD Modeling and Simulation, approached from the perspective of the FMS and IMT thrusts within the DoD Modernization Program. We have already described the core effort on building the Object Web RTI on top of JWORB. This is associated with a set of more application- or component-specific efforts such as: a) building a distance training space for some mature FMS technologies such as SPEEDES; b) parallelizing and CORBA-wrapping some selected computationally intense simulation modules such as CMS (Comprehensive Mine Simulator at Ft. Belvoir, VA); c) adapting WebFlow to support visual HLA simulation authoring. We refer to such a Pragmatic Object Web based interactive simulation environment as WebHLA [8], and we believe that it will soon offer a powerful modeling and simulation framework, capable of addressing the new challenges of DoD computing in the areas of Simulation Based Design, Testing, Evaluation and Acquisition.
Fig. 15. WebFlow implementation of the Quantum Simulations problem
References
[1] Robert Orfali and Dan Harkey, Client/Server Programming with Java and CORBA, Wiley, Feb ’97, ISBN: 0-471-16351-1.
[2] High Level Architecture and Run-Time Infrastructure, DoD Modeling and Simulation Office (DMSO), http://www.dmso.mil/hla
[3] Geoffrey Fox and Wojtek Furmanski, “Petaops and Exaops: Supercomputing on the Web”, IEEE Internet Computing, 1(2), 38-46 (1997); http://www.npac.syr.edu/users/gcfpetastuff/petaweb
[4] Geoffrey Fox and Wojtek Furmanski, “Java for Parallel Computing and as a General Language for Scientific and Engineering Simulation and Modeling”, Concurrency: Practice and Experience 9(6), 415-426 (1997).
[5] Geoffrey Fox and Wojtek Furmanski, “Use of Commodity Technologies in a Computational Grid”, chapter in book to be published by Morgan-Kaufmann, edited by Carl Kesselman and Ian Foster.
[6] Geoffrey C. Fox, Wojtek Furmanski and Hasan T. Ozdemir, “JWORB - Java Web Object Request Broker for Commodity Software based Visual Dataflow Metacomputing Programming Environment”, NPAC Technical Report, Feb 98, http://tapetus.npac.syr.edu/iwt98/pm/documents/
[7] G. C. Fox, W. Furmanski and H. T. Ozdemir, “Java/CORBA based Real-Time Infrastructure to Integrate Event-Driven Simulations, Collaboration and Distributed Object/Componentware Computing”, in Proceedings of Parallel and Distributed Processing Technologies and Applications PDPTA ’98, Las Vegas, Nevada, July 13-16, 1998, http://tapetus.npac.syr.edu/iwt98/pm/documents/
[8] David Bernholdt, Geoffrey Fox, Wojtek Furmanski, B. Natarajan, H. T. Ozdemir, Z. Odcikin Ozdemir and T. Pulikal, “WebHLA - An Interactive Programming and Training Environment for High Performance Modeling and Simulation”, in Proceedings of the DoD HPC 98 Users Group Conference, Rice University, Houston, TX, June 1-5, 1998, http://tapetus.npac.syr.edu/iwt98/pm/documents/
[9] D. Bhatia, V. Burzevski, M. Camuseva, G. Fox, W. Furmanski and G. Premchandran, “WebFlow – A Visual Programming Paradigm for Web/Java based coarse grain distributed computing”, Concurrency: Practice and Experience 9, 555-578 (1997), http://tapetus.npac.syr.edu/iwt98/pm/documents/
[10] L. Beca, G. Cheng, G. Fox, T. Jurga, K. Olszewski, M. Podgorny, P. Sokolowski and K. Walczak, “Java enabling collaborative education, health care and computing”, Concurrency: Practice and Experience 9, 521-534 (1997), http://trurl.npac.syr.edu/tango
[11] Tango Collaboration System, http://trurl.npac.syr.edu/tango
[12] Habanero Collaboration System, http://www.ncsa.uiuc.edu/SDG/Software/Habanero
[13] Erol Akarsu, Geoffrey Fox, Wojtek Furmanski and Tomasz Haupt, “WebFlow - High-Level Programming Environment and Visual Authoring Toolkit for High Performance Distributed Computing”, paper submitted to Supercomputing 98, http://www.npac.syr.edu/users/haupt/ALLIANCE/sc98.html
[14] Ian Foster and Carl Kesselman, Globus Metacomputing Toolkit, http://www.globus.org
A Parallelization Framework for Recursive Tree Programs

Paul Feautrier*

Laboratoire PRISM, Université de Versailles St-Quentin, 45 Avenue des Etats-Unis, 78035 Versailles Cedex, France
Abstract. The automatic parallelization of "regular" programs has encountered a fair amount of success due to the use of the polytope model. However, since most programs are not regular, or are regular only in parts, there is a need for a parallelization theory for other kinds of programs. This paper explores the suggestion that some "irregular" programs are in fact regular on other data and control structures. An example of this phenomenon is the set of recursive tree programs, which have a well defined parallelization model and dependence test.
1 A Model for Recursive Tree Programs
The polytope model [7, 3] has been found a powerful tool for the parallelization of array programs. This model applies to programs that use only DO loops and arrays with affine subscripts. The relevant entities of such programs (iteration space, data space, execution order, dependences) can be modeled as Z-polytopes, i.e. as sets of integral points belonging to bounded polyhedra. Finding parallelism depends on our ability to answer questions about the associated Z-polytopes, for which task one can use well known results from mathematics and operations research. The aim of this paper is to investigate whether there exist other program models for which one can devise a similar automatic parallelization theory. The answer is yes, and I give as an example the recursive tree programs, which are defined in sections 1.4 and 1.5. The relevant parallelization framework is presented in section 2. In the conclusion, I point to unsolved problems for the recursive tree model, and suggest a search for other examples of parallelization frameworks.

1.1 An Assessment of the Polytope Model
The main lesson of the polytope model is that the suitable level of abstraction for discussing parallelization is the operation, i.e. an execution of a statement.

* e-mail: Paul.Feautrier@prism.uvsq.fr
In the case of DO loop programs, operations are created by loop iterations. Operations can be named by giving the values of the surrounding loop counters, arranged as an iteration vector. These values must be within the loop bounds. If these bounds are affine forms in the surrounding loop counters and constant structure parameters, then the iteration vector scans the integer points of a polytope, hence the name of the model. To achieve parallelization, one has to find subsets of independent operations. Two operations are dependent if they access the same memory cell, one at least of the two accesses being a write. This definition is useful only if operations can be related to the memory cells they access. When the data structures are arrays, this is possible if subscripts are affine functions of iteration vectors. The dependence condition translates into a system of linear equations and inequations, whose unknowns are the surrounding loop counters. There is a dependence iff this system has a solution in integers. These observations can be summarized as a set of requirements for a parallelization framework:
1. We must be able to describe, in finite terms, the set of operations of a program. This set will be called the control domain in what follows. The control domain must be ordered.
2. Similarly, we must be able to describe a data structure as a set of locations, and a function from locations to values.
3. We must be able to associate sets of locations to operations through the use of address functions.
The aim of this paper is to apply these prescriptions to the design of a parallelization framework for recursive tree programs.

1.2 Related Work
This section follows the discussion in [5]. The analysis of programs with dynamic data structures has been carried out mainly in the context of pointer languages like C. The first step is the identification of the type of data structures in the program, i.e. the classification of the pointer graph. The main types are trees (including lists), DAGs and general graphs. This can be done by static analysis at compile time [4], or by asking the programmer for the information. This paper uses the second solution, and the data structures are restricted to trees. The next step is to collect information on the possible values of pointers. This is done in a static way in the following sense: the sets of possible pointer values are associated not to an operation but to a statement. These sets will be called regions here, by analogy to the array regions [9] in the polytope model. Regions are usually represented as path expressions, which are regular expressions on the names of structure attributes [8]. Now, a necessary (but not sufficient) condition for two statements to be in dependence is that two of their respective regions intersect. It is easy to see that this method incurs a loss of information which may prevent parallelization in important cases. The main contribution of this paper is to improve the precision of the analysis for a restricted category of recursive tree programs.
1.3 Basic Concepts and Notations
The main tool in this paper is the elementary theory of finite state automata and rational transductions. A more detailed treatment can be found in Berstel's book [1]. In this paper, the basic alphabet is ℕ (the set of non negative integers), ε is the zero-length word and the point (.) denotes concatenation. In practical applications, the alphabet is always some finite subset of ℕ. A finite state automaton (fsa) is defined in the usual way by states and labelled transitions. One obtains a word of the language generated (or accepted) by an automaton by concatenating the labels on a path from an initial state to a terminal state. A rational transduction is a relation on ℕ* × ℕ* which is defined by a generalized sequential automaton (gsa): an fsa whose edges are labelled by input and output words. Each time an edge is traversed, its input word is removed at the beginning of the input, and its output word is added at the end of the output. Gsa are also known as Moore machines. The family of rational transductions is closed under inversion (simply reverse the elements of each digram), concatenation and composition [1]. A regular language can also be represented as a regular expression: an expression built from the letters and ε by the operations of concatenation (.), union (+) and Kleene star. This is also true for gsa, with letters replaced by digrams. The domain of a rational transduction is a regular language whose fsa is obtained by deleting the second letter of each edge label. There is a similar construction for the range of a rational transduction. From one fsa c, one may generate many others by changing the set of initial or terminal states: c(s; ) is deduced from c by using s as the unique initial state, c(; t) has t as its unique terminal state, and in c(s; t) both the initial and terminal states have been changed. Since rational transductions are defined by automata, similar operations can be defined for them.

1.4 The Control Domain of Recursive Programs
The following example is written in a C-like language which will be explained presently:

    BOOLEAN tree leaf;
    int tree value;
    void main(void) {
    5 :   sum([]);
    }
    void sum(address I) {
    1 :   if(! leaf[I]) {
    2 :     sum(I.1);
    3 :     sum(I.2);
    4 :     value[I] = value[I.1] + value[I.2];
          }
    }
value is a tree whose nodes hold integers, and leaf is a Boolean tree. The problem is to sum up all integers on the leaves, the final result being found at the root of the tree. Addresses will be discussed in Sect. 1.5. Let us consider statement 4. Execution of this statement results from a succession of recursive calls to sum, the statement itself being executed as a part of the body of the last call. Naming these operations is achieved by recording where each call to sum comes from: either line 2 or 3. The required call strings can be generated by a control automaton whose states are the function names and the basic statements of the program. The state associated to main is initial and the states associated to basic statements are terminal. There is a transition from state p to state q if p is a function and q is a statement occurring in the body of p. The transition is labelled by the label of q in the body of p.
Fig. 1. The Control Automaton of sum
As the example shows, if the statements in each function body are labeled by ascending numbers, the execution order is exactly the lexicographic ordering on the call strings.
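This property can be checked mechanically on the example. The small sketch below (not part of the original paper) enumerates the call strings of statement 4 for a complete binary tree of a given, assumed, depth and verifies that execution order coincides with lexicographic order.

    # Enumerate the call strings (iteration words) of statement 4 in the sum example,
    # in execution order, for a complete binary tree of the given depth.
    def call_strings_of_stmt4(depth, prefix=(5,)):
        if depth == 0:            # leaf[I] holds: statement 4 is not executed
            return
        # body of sum: label 2 = left recursive call, 3 = right call, 4 = the addition
        yield from call_strings_of_stmt4(depth - 1, prefix + (2,))
        yield from call_strings_of_stmt4(depth - 1, prefix + (3,))
        yield prefix + (4,)

    words = list(call_strings_of_stmt4(2))
    # The sequential execution order is exactly lexicographic order on call strings:
    assert words == sorted(words)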
1.5 Addressing in Trees
Remark first that most tree algorithms in the literature are expressed recursively. Observe also that in the polytope model, the same mathematical object is used as the control space and the set of locations of the data space. Hence, it seems quite natural to use trees as the preferred data structure for recursive programs. In a tree, let us number the edges coming out of a node from left to right by consecutive integers. The name of node n is then simply the string of edge numbers which are encountered on the unique path from the root of the tree to n. The name of the root is the zero-length string, ε. This scheme dates back at least to the Dewey Decimal Notation as used by librarians the world over. The set of locations of a tree structure is thus ℕ*, and a tree object is a partial function from ℕ* to some set of values, as for instance the integers, the floating point numbers or the characters.
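A tree object in this addressing scheme is just a partial map from integer strings to values, which can be made concrete with a short sketch; the representation of locations as tuples of integers and the particular tree contents are assumptions made for illustration.

    # A tree whose locations are integer strings (tuples of ints) in Dewey notation.
    value = {
        (): 0,                                   # the root, named by the zero-length string
        (1,): 0, (2,): 0,
        (1, 1): 3, (1, 2): 4, (2, 1): 5, (2, 2): 6,   # leaf values
    }
    leaf = {addr: addr not in {(), (1,), (2,)} for addr in value}

    def tree_sum(I=()):
        """The sum example: store at node I the sum of the leaves below it."""
        if not leaf[I]:
            tree_sum(I + (1,))
            tree_sum(I + (2,))
            value[I] = value[I + (1,)] + value[I + (2,)]

    tree_sum()
    assert value[()] == 3 + 4 + 5 + 6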
Address functions map operations to locations, i.e. integer strings to integer strings. The natural choice for them is the family of rational transductions [1]. Consider again the above example. Notice the global declarations for the trees value and leaf. address is the type of integer strings. In line 4, such addresses are used to access value. The second address, for instance, is built by postfixing the integer 1 to the value of the address variable I. This variable is initialized to ε at line 5 of main. If the call at line 2 (resp. 3) of sum is executed, then a 1 (resp. 2) is postfixed to I. The outcome of this discussion is that at entry into function sum, I comes either from line 2 or 3 or 5, hence the regular equation:
I = (5, ε) + I.(2, 1) + I.(3, 2),
whose solution is the regular expression: I = (5, ε).((2, 1) + (3, 2))*. Similarly, the second address in line 4 is given by the following rational transduction: (5, ε).((2, 1) + (3, 2))*.(4, 1). I conjecture that the reasoning that has been used to find the above address functions can be automated, but the details have not been worked out yet. It seems probable that this analysis will succeed only if the operations on addresses are suitably limited. Here, the only allowed operator is postfixing by a word of ℕ*. This observation leads to the definition of a toy language, T, which is similar to C, with the following restrictions:
- No pointers are allowed. They are replaced by addresses. The only data structures are scalars (integers, floats and so on) and trees thereof. Trees are always global variables.
- Addresses can only be used as local variables or function parameters. No function may return an address.
- The only control structures are the conditionals and the function calls, possibly recursive. No loops or goto are allowed.
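To make the address functions concrete, the sketch below (not from the paper) applies the transduction (5, ε).((2, 1) + (3, 2))*.(4, 1) to an iteration word; representing the transduction as a table of single-letter digrams is a simplifying assumption that works because this particular expression has no branching structure.

    # Apply the address transduction of the second reference in line 4:
    #   (5, e).((2, 1) + (3, 2))*.(4, 1)
    # Input: a call string naming an execution of statement 4.
    # Output: the Dewey address read by value[I.1].
    STEP = {2: (1,), 3: (2,)}            # digrams (2, 1) and (3, 2) of the starred part

    def address_of_second_reference(call_string):
        assert call_string[0] == 5 and call_string[-1] == 4
        out = ()                          # digram (5, e): the leading 5 emits nothing
        for letter in call_string[1:-1]:
            out += STEP[letter]           # (2, 1) or (3, 2)
        return out + (1,)                 # digram (4, 1)

    # The operation named 5.2.3.4 (main -> sum -> sum -> statement 4)
    # has I = 1.2, so its second reference reads value[1.2.1]:
    assert address_of_second_reference((5, 2, 3, 4)) == (1, 2, 1)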
2 Dependence Analysis of T

2.1 Parallelization Model
When parallelizing static control programs, one has first to decide the shape of the parallel version. One usually distinguishes between control parallelism, where operations executed in parallel are instances of different statements, and data parallelism, where parallelism is found among iterations of the same statement. In recursive programs, repetition of a statement is obtained by enclosing it in the body of a recursive function, as for example in the program of Fig. 2. Suppose it is possible to decide that the operations associated to S and all operations generated by the call to foo are independent. The parallel version in Fig. 3 (where {^ ... ^} is used as the parallel version of { ... } [6]) is equivalent to the sequential original. The degree of parallelism of this program
    void foo(x) {
      S;
      if (p) foo(y);
    }

Fig. 2. Sequential version

    void foo(x) {
      {^ S;
         if (p) foo(y);
      ^}
    }

Fig. 3. Parallel version
is of the order of the number of recursive calls to foo, which probably depends on the data set size. This is data parallelism expressed as control parallelism. A possible formalization is the following. Let us consider a function foo and the statements {S1, ..., Sn} of its body. The statements are numbered in textual order, and i is the label of Si. Tests in conditional statements are to be considered as elementary, and must be numbered as they occur in the program text. Let us construct a synthetic dependence graph (SDG) for foo. The vertices of the SDG are the statements of foo. There is a dependence edge from Si to Sj, i < j, iff there exist three iteration words u, v, w such that:
- u is an iteration of foo;
- both u.i.v and u.j.w are iterations of some terminal statements Sk and Sl;
- u.i.v and u.j.w are in dependence.
Observe that u is an iteration word for foo, hence belongs to the language generated by c(;foo). Similarly, v is an iteration of Sk relative to an iteration of Si, hence belongs to c(Si;Sk), and w ∈ c(Sj;Sl). As a consequence, the pair (u.i.v, u.j.w) belongs to the following rational transduction:
h = c(;foo)=.<i, j>.c(Si;Sk).c(Sj;Sl)^{-1}

in which, if a is an automaton, then a= is the transduction obtained by setting each output word equal to the corresponding input word. This formula also uses an automaton as a transduction whose output words have zero length. Similarly, the inverse of an automaton is used as a transduction whose input words have zero length. Si and Sj may also access local scalar variables, for which the dependence calculation is trivial. Besides, one must remember to add control dependences, from the test of each conditional to all statements in its branches. Lastly, dependences between statements belonging to opposite arms of a conditional are to be omitted. Once the SDG is computed, a parallel program can be constructed in several well known ways. Here, the program is put in series/parallel form by topological sorting of the SDG. As above, I will use the C-EARTH version of the fork...join construct, {^ ... ^}. The run time exploitation of this kind of parallelism is a well known problem [6].
2.2 The Dependence Test
Computing dependences for the body of function foo involves two distinct algorithms. The first (or outermost) one enumerates all pairs of references which are to be checked for dependence. This is a purely combinatorial algorithm, of polynomial complexity, which can be easily reconstructed by the reader. The inner algorithm has to decide whether there exist three strings x, y, w such that:

(x, y) ∈ h, (x, w) ∈ f, (y, w) ∈ g,

where the first term expresses the fact that x and y are iterations of Sk and Sl which are generated by one and the same call to foo, the second and third ones expressing the fact that both x and y access location w. f and g are the address transductions of Sk and Sl. The first step is to eliminate w, giving (x, y) ∈ k = g^{-1} ∘ f. k is a rational transduction by the Elgot and Mezei theorem [2]. Hence, the pair (x, y) belongs to the intersection of the two transductions h and k. Deciding whether h ∩ k is empty is clearly equivalent to deciding whether l ∩ = is empty, where = is the equality relation and l = k^{-1} ∘ h. Deciding whether the intersection of two transductions is empty is a well known undecidable problem [1]. It is possible however to solve it by a semi-algorithm, which is best presented as a (one person) game. A position in the game is a triple (u, v, p) where u and v are words and p is a state of l. The initial position is (ε, ε, p0), where p0 is the initial state of l. A position is a win if u = v = ε and if p is terminal. A move in the game consists in selecting a transition from p to q in l with label (x, y). The outcome is a new position (u', v', q) where u' and v' are obtained from u.x and v.y by deleting their common prefix. A position is a loss if u and v begin with distinct letters: in such a case, no amount of postfixing can complete u and v to equal strings. There remain positions in which either u or v or both are ε. Suppose u = ε. Then, for success, v must be a prefix of a string in the domain of l when starting from p. This can be tested easily, and, if the check fails, then the position again is a loss. The situation is symmetrical if v = ε. This game may have three outcomes: if a win can be reached, then by restoring the deleted common prefixes, one reconstructs a word u such that (u, u) ∈ l, hence a solution to the dependence problem. If all possible moves have been explored without reaching a win, then the problem has no solution. Lastly, the game can continue for ever. One possibility is to put an upper bound on the number of moves. If this bound is reached, one decides that, in the absence of a proof to the contrary, a dependence exists. The following algorithm explores the game tree in breadth-first fashion.
Algorithm D.
1. Set D = ∅ and L = [(ε, ε, p0)] where p0 is the initial node of l.
2. If L = ∅, stop. There is no dependence.
3. Extract the leftmost element of L, (u, v, p).
4. If (u, v, p) ∈ D, restart at step 2.
5. If u = v = ε and if p is terminal, stop. There is a dependence.
6. If both u ≠ ε and v ≠ ε, the position is a loss. Restart at step 2.
7. If u = ε and if v is not a prefix of a word in the domain of l(p; ), restart at step 2.
8. If v = ε and if u is not a prefix of a word in the range of l(p; ), restart at step 2.
9. Add (u, v, p) to D. Construct all the positions which can be reached in one move from (u, v, p) and add them on the right of L. Restart at step 2.

Since the exploration is breadth-first, it is easy to prove that if there is a dependence, then the algorithm will find it. This algorithm has been implemented as a stand alone program in Objective Caml. The user has to supply the results of the analysis of the input program, including the control automaton, the address transductions, and the list of statements with their accesses. The program then computes the SDG. All examples in this paper have been processed by this pilot implementation. As far as my experience goes, the case where algorithm D does not terminate has never been encountered.
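A compact sketch of Algorithm D follows (it is not the Objective Caml implementation mentioned above). Representing l as a map from states to lists of (input word, output word, next state) transitions is an assumption, and the pruning steps 7 and 8 are omitted for brevity, so the sketch may explore more positions than necessary and relies more on the move bound.

    from collections import deque

    def algorithm_D(transitions, initial, terminals, max_moves=10000):
        """Breadth-first exploration of the game on l.
        transitions: state -> list of (x, y, next_state), with x, y tuples of ints.
        Returns True if a dependence is found or the bound is exhausted
        (the conservative answer), False if there is provably no dependence."""
        def strip(u, v):
            i = 0
            while i < len(u) and i < len(v) and u[i] == v[i]:
                i += 1
            return u[i:], v[i:]

        D, L, moves = set(), deque([((), (), initial)]), 0
        while L:
            u, v, p = L.popleft()                       # step 3: leftmost element
            if (u, v, p) in D:
                continue                                # step 4
            if u == () and v == () and p in terminals:
                return True                             # step 5: dependence found
            if u != () and v != ():
                continue                                # step 6: a loss
            D.add((u, v, p))                            # step 9
            for x, y, q in transitions.get(p, []):
                L.append(strip(u + x, v + y) + (q,))
            moves += 1
            if moves > max_moves:
                return True                             # give up: assume a dependence
        return False                                    # step 2: no dependence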
2.3 sum Revisited
Consider the problem of parallelizing the body of sum. There are already control dependences from statement 1 to 2, 3, and 4. The crucial point is to prove that there are no dependences from 2 to 3. One has to test one output dependence from value[I] to itself, two flow dependences from value[I] to value[I.1] and value[I.2], and two symmetrical anti-dependences. Let us consider for instance the problem of the flow dependence from value[I] to value[I.1]. The l transduction begins in the following way:

l = ((2,2)+(3,3))*.(3,2) ....
Algorithm D finds that there is no way of crossing the (3, 2) edge without generating distinct strings. Hence, there is no dependence. On the other hand, algorithm D readily finds a dependence from 2 to 4. All in all, the SDG of sum is given by Fig. 4, to which corresponds the parallel program in Fig. 5, a typical case of parallel divide-and-conquer.

3 Conclusion and Future Work
I have presented here a new framework in which to analyze recursive tree programs. The main differences between the present method and the more usual pointer alias analysis are:
- Data structures are restricted to trees, while in alias analysis, one has to determine the shape of the structures. This is a weakness of the present approach.
Fig. 4. The SDG of sum

    void sum(address I)
    { 1 : if(! leaf[I])
          {
            {^ 2 : sum(I.1);
               3 : sum(I.2)
            ^};
            4 : value[I] = value[I.1] + value[I.2];
          }
    }

Fig. 5. The parallel version of sum
- In T, the operations on addresses are limited to postfixing, which, translated into the language of pointers, corresponds to the usual pointer chasing.
- The analysis is operation oriented, meaning that addresses are associated to operations, not to statements. This allows one to get more precise results when computing dependences.
Pointer alias analysis can be transcribed in the present formalism in the following way. Observe that, in the notations of Sect. 2.2, the iteration word x belongs to Domain(h), which is a regular language, hence w belongs to f(Domain(h)), which is also regular. This is the region associated to Sk. Similarly, w is in the region g(Range(h)). If the intersection of these two regions is empty, there is no dependence. When compared to the present approach, the advantage of alias analysis is that the regions can be computed or approximated without restrictions on the source program. It is easy to prove that when f(Domain(h)) ∩ g(Range(h)) = ∅ then l is empty. It is a simple matter to test whether this is the case. The present implementation reports the number of cases which can be decided by testing for the emptiness of l, and the number of cases where Algorithm D has to be used. In a T implementation of the merge sort algorithm, there were 208 dependence tests. Of these, 24 were found to be actual dependences, 34 were solved by region intersection, and 150 required the use of algorithm D. While extrapolating from this example alone would be jumping to conclusions, it gives at least an indication of the relative power of the region intersection and of algorithm D. Incidentally, the SDG of merge sort was found to be of the same shape as the SDG of sum, thus leading to another example of divide-and-conquer parallelism. The T language as it stands clearly needs a sequential compiler, and a tool for the automatic construction of address relations. Some of the petty restrictions of Sect. 1.5 can probably be removed without endangering dependence analysis. For instance, having trees of structures or structures of trees poses no difficulty. Allowing trees and subtrees as arguments to functions would pose the usual
aliasing problems. A most useful extension would be to allow trees of arrays, as found for instance in some versions of the adaptive multigrid method. How is T to be used? Is it to be another programming language, or is it better used as an intermediate representation when parallelizing pointer programs as in C or ML or Java? The latter choice would raise the question of translating C (or a subset of C) to T, i.e. translating pointer operations to address operations. Another problem is that T is static with respect to the underlying set of locations. It is not possible, for instance, to insert a cell in a list, or to graft a subtree to a tree. Is there a way of allowing that kind of operation? Lastly, trees are only a subset of the data structures one encounters in practice. I envision two ways of dealing, e.g., with DAGs and cyclic graphs. Adding new address operators, for instance a prefix operator π(a1, ..., an) = (a1, ..., an-1), allows one to handle doubly linked lists and trees with an upward pointer. The other possibility is to use other mathematical structures as a substrate. Finitely presented monoids or groups come immediately to mind, but there might be others.
References
1. Jean Berstel. Transductions and Context-Free Languages. Teubner, Stuttgart, 1979.
2. C.C. Elgot and J.E. Mezei. On relations defined by generalized finite automata. IBM J. of Research and Development, 47-68, 1965.
3. Paul Feautrier. Automatic parallelization in the polytope model. In Guy-René Perrin and Alain Darte, editors, The Data-Parallel Programming Model, pages 79-103, Springer, 1996.
4. Rakesh Ghiya and Laurie J. Hendren. Is it a tree, a dag or a cyclic graph? In PoPL'96, pages 1-15, ACM, 1996.
5. Joseph Hummel, Laurie J. Hendren, and Alexandru Nicolau. A general dependence test for dynamic, pointer based data structures. In PLDI'94, pages 218-229, ACM Sigplan, 1994.
6. Laurie J. Hendren, Xinan Tang, Yingchun Zhu, Shereen Ghobrial, Guang R. Gao, Xun Xue, Haiying Cai, and Pierre Ouellet. Compiling C for the EARTH multithreaded architecture. Int. J. of Parallel Programming, 25(4):305-338, August 1997.
7. Christian Lengauer. Loop parallelization in the polytope model. In Eike Best, editor, CONCUR'93, LNCS 715, pages 398-416, Springer-Verlag, 1993.
8. James R. Larus and Paul N. Hilfinger. Detecting conflicts between structure accesses. In PLDI'88, pages 31-34, ACM Sigplan, 1988.
9. Rémi Triolet, François Irigoin, and Paul Feautrier. Automatic parallelization of FORTRAN programs in the presence of procedure calls. In Bernard Robinet and R. Wilhelm, editors, ESOP 1986, LNCS 213, Springer-Verlag, 1986.
Optimal Orthogonal Tiling

Rumen Andonov 1, Sanjay Rajopadhye 2, and Nicola Yanev 3

1 LIMAV, University of Valenciennes, France. [email protected]
2 Irisa, Rennes, France. Sanjay.Rajopadhye@irisa.fr
3 University of Sofia, Bulgaria. [email protected]
Abstract. Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to orthogonal tiling: uniform dependency programs with (hyper) parallelepiped shaped iteration domains which can be tiled with hyperplanes parallel to the domain boundaries. Our formulation includes many machine and program models used in the literature, notably the BSP programming model. We resolve the optimization problem analytically, yielding a closed form solution.
1 Introduction
Tiling the iteration space [17, 7, 14] is a common method for improving the performance of loop programs on distributed memory machines. It may be used as a technique in parallelizing compilers as well as in performance tuning of parallel codes by hand (see also [6, 13, 10, 15]). A tile in the iteration space is a (hyper) parallelepiped shaped collection of iterations to be executed as a single unit, with the communication and synchronization being done only once per tile. Typically, communications are performed by send/receive calls, which also serve as synchronization points. The code for the tile body contains no communication calls. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: tile shape optimization [5], and tile size optimization [2, 6, 9, 13] (some authors also attempt to resolve both problems under some simplifying assumptions [14, 15]). In this paper, we address the tile size problem, which, for a given tile shape, seeks to choose the size (length along each dimension of the hyper parallelepiped) so as to minimize the total execution time. We assume that the dependencies are uniform, the iteration space (domain) is an n-dimensional hyper-rectangle, and the tile boundaries are parallel to the domain boundary (this is called orthogonal tiling). A sufficient condition for this is that, in all dependence vectors, all non-zero terms have the same sign. Whenever orthogonal tiling is possible, it leads to the simplest form of code; indeed most compilers do not even implement any other tiling strategy. We show here that when orthogonal tiling is possible the tile sizing problem is easily solvable even in its most general, n-dimensional case.
Our approach is based on the two step model proposed by Andonov and Rajopadhye for the 2-dimensional case in [2], which was later extended to 3 dimensions in [3]. In this model we first abstract each tile by two simple parameters: the tile period, 𝒫, and the inter-tile latency, ℒ. We then "instantiate" the abstract model by accounting for specific architectural and program features, and analytically solve the resulting non-linear optimization problem, yielding the desired tile size. In this paper we first extend and generalize our model to the case where an n-dimensional iteration space is implemented on a k-dimensional (hyper) toroid (for any 1 <= k <= n-1). We also consider a more general form of the specific functions for ℒ and 𝒫. These functions are general enough to not only include a wide variety of machine and program models, but also to be usable with the BSP model [11, 16], which is gaining wide acceptance as a well founded theoretical model for developing architecture independent parallel programs. Being an extension of previous results, the emphasis of this paper is on the new points in the mathematical framework and on the relations of our model to others proposed in the literature. More details about the definitions, notations and their interpretations, as well as about some practical aspects of the experimental validations, are available elsewhere [1-3]. The remainder of this paper is organized as follows. In the following section we develop the model and formulate the abstract optimization problem. In Section 3 we instantiate the model for specific machine and program parameters, and show how our model subsumes most of those used in the literature. In Section 4 we resolve the problem for the simpler HKT model, which assumes that the communication cost is constant, independent of the message volume (and is also a sub-case of the BSP model, where the network bandwidth is very high). Next, in Section 5 we resolve the more general optimization problem. We present our conclusions in Section 6.
2 Abstract Model Building
We first develop an analytical performance model for the running time of the tiled program. We introduce the notation required as we go along. The original iteration space is an N1 × N2 × ... × Nn hyper-rectangle, and it has (at least n linearly independent) dependency vectors, d1, ..., dm. The non-zero elements of the dependency vectors all have the same sign (say positive). Hence, orthogonal tiling is possible (i.e., it does not induce any cyclic dependencies between the tiles). Let the tiles be x1 × x2 × ... × xn hyper-rectangles, and let qi = Ni/xi be the number of tiles in the i-th dimension. The tile graph is the graph where each node represents a tile and each arc represents a dependency between tiles, and can be modeled by a uniform recurrence equation [8] over a q1 × q2 × ... × qn hyper-rectangle. It is well known that if the xi's are large as compared to the elements of the dependency vectors (in general this implies that the feasible value of each xi is bounded from below by some constant; for the sake of clarity, we assume that this is 1), then the dependencies between the tiles are unit vectors (or binary linear combinations thereof, which can be neglected
for analysis purposes without any loss of generality). A tile can be identified by an index vector z = [z1, ..., zn]. Two tiles, z and z', are said to be successive if z - z' is a unit vector. A simple analysis [7] shows that the earliest "time instant" (counting a tile execution as one time unit) that tile z can be executed is t_z = z1 + ... + zn. We map the tiles to p1 × p2 × ... × pk processors, arranged in a k-dimensional hyper-toroid, for k < n (this is just an abstract machine architecture; our machine model is independent of the topology, since we will later assume that communication time is independent of distance and/or contention). The mapping is by projection onto the subspace spanned by k of the n canonical axes. To visualize this mapping, first consider the case when k = n-1, where tiles are allocated to processors by projection along (without loss of generality) the n-th dimension, i.e., tile z is executed by processor [z1, ..., z_{n-1}]. This yields a (virtual) array of q1 × q2 × ... × q_{n-1} processors, each one executing all the tiles in the n-th dimension. In general pi <= qi, so this array is emulated by our p1 × p2 × ... × p_{n-1} hyper-toroid by using multiple passes (there are qi/pi passes in the i-th dimension). The case when k < n-1 is modeled by simply letting p_{k+1} = ... = p_{n-1} = 1. The period, 𝒫, of a tile is defined as the time between executions of corresponding instructions in two successive tiles (of the same pass) mapped to the same processor. The latency, ℒ, between tiles is defined to be the time between executions of corresponding instructions in two successive tiles mapped to different processors. Depending on the volume of the data transmitted and the nature of the program dependencies, the latency may be different for different dimensions of the hyper-toroid, and we use ℒi (for i = 1 ... k) to denote the latency in the i-th dimension. We assume that by the time the first processor, 1 = [1 ... 1], finishes computing its macro column, at least one of the "last" processors (i.e., one of [1, ..., pi, ..., 1], for 1 <= i <= k) has finished its first tile (there is no loss of generality in this assumption: were it not true, the processors would be idle between passes, and this would lead to a sub-optimal solution; this has been proved for 2 and 3 dimensions [2, 3] and the arguments carry over). In this case, processor 1 can immediately start another pass. Let W = N1·N2·...·Nn denote the total computation volume, v = x1·x2·...·xn be the tile volume and P = p1·p2·...·pk be the total number of processors. The total number of tiles is q1·q2·...·qn = W/v. Let p̃i denote pi - 1, P̃k = Σ_{i=1}^{k} p̃i, and v_max = (N1·N2·...·Nn)/P. Let us first analyze a single pass. Each processor must execute qn tiles, and this takes qn𝒫 time. However, the last processor (i.e., the one with coordinates [p1, p2, ..., pk]) cannot start immediately, because of the dependencies of the tile graph. Indeed, processor [p1, 1, ..., 1] can only start at time (p1 - 1)ℒ1, processor [p1, p2, ..., 1] at time (p1 - 1)ℒ1 + (p2 - 1)ℒ2, and hence processor [p1, p2, ..., pk] can only start its first pass at time Σ_{i=1}^{k} p̃iℒi. There are W/v tiles, of which Pqn are executed in each pass, and hence there are W/(Pvqn) passes. Because there is no idle time between passes, the last pass starts
at time (W/(Pvqn) - 1)𝒫qn. Thus, the last processor starts executing its last macro column at time instant (W/(Pvqn) - 1)𝒫qn + Σ_{i=1}^{k} p̃iℒi. It takes another 𝒫qn time, and hence the total running time and the corresponding optimization problem are as follows:
T(x) = (W/(Pv))·𝒫 + Σ_{i=1}^{k} p̃iℒi    (1)
Prob. 1: Minimize (1) in the feasible space R given below (recall that the lower bounds may be other than 1, based on the dependencies of the original recurrence):

R = { x ∈ Zⁿ | 1 <= xi <= ui }    (2)

where ui = Ni/pi for i = 1 ... k, and ui = Ni for k < i <= n.
3 Machine and Program Specific Model
We now "instantiate" Prob. 1 for a specific program and machine architecture. The code executed for a tile is the standard loop:

    repeat
      receive(v1); receive(v2); ...; receive(vk);
      compute(body);
      send(v1); send(v2); ...; send(vk);
    end;
where vi denotes the message transmitted in the i-th dimension. We will now determine 𝒫 and ℒ. Our development is based on Andonov & Rajopadhye [1], and uses standard assumptions about the low level behavior of the architecture and program [4]. The sole difference is that each tile now makes k system calls to send (and receive) messages. A tile depends directly on its n neighbors in each dimension. The volume of data transfer along the i-th dimension is proportional to the (hyper) surface of the tile in that dimension, i.e., Π_{j≠i} xj = v/xi. In the first k dimensions, this corresponds to an inter-processor communication, whereas in the dimensions k+1 ... n, the transfer is achieved through local memory. Hence the period of a tile can be written as follows:

𝒫 = k(βs + βr) + 2τc Σ_{i=1}^{k} v/xi + αv    (3)
Here βs (resp. βr) is the overhead of the send (resp. receive) system call, τc is the time (per byte) to copy from user to system memory, and α is the computation time for a single instance of the loop body (we neglect the overhead of setting up the loop for each tile, as well as cache effects). Similarly, if 1/τt is the network bandwidth, the latency is as follows (see [2] for details):

ℒi = kβs + 2τc Σ_{j=1}^{k} v/xj + αv + τt v/xi    (4)
Note that the βr's are subtracted because the receive occurs on a different processor, and when the sender and receiver are properly synchronized, the calls are overlapped. Substituting in Eqn. (1) and simplifying, we obtain

T_k(x) = k(βs + βr)W/(Pv) + (αv + kβs)P̃k + 2τc (W/P + vP̃k) Σ_{i=1}^{k} 1/xi + τt v Σ_{i=1}^{k} p̃i/xi + αW/P    (5)
3.1 Simplifying Assumptions and Particular Cases
The model of (5) is very general. One may want to specialize it for a number of reasons, say rendering the final optimization problem more tractable, or modeling a certain class of architectures or computations. It turns out that many of these specializations simply consist of choosing the parameters appropriately in the above function. The HKT model, first used by Hiranandani et al. [6], corresponds to setting βr = τc = τt = 0 (a slightly more general version consists of letting βr be nonzero). This model assumes that communication cost is independent of the message size, but is dominated by the startup time(s) βs (and βr). At first sight this may seem an oversimplification. However, in addition to making the mathematical problem more tractable, it is not far from the truth, as corroborated by other authors [12, 13]. Indeed, experimental as well as analytic evidence [1] shows that on machines such as the Intel Paragon the more accurate models yield no observable difference in the predictions. With τc = τt = 0, we obtain the HKT cost function:

T_k(v) = αW/P + k(βs + βr)W/(Pv) + (αv + kβs)P̃k    (6)
The BSP model [11,16] has been proposed as a formal model for developing architecture independent parallel programs. It is a bridge between the PRAM model which is general but somewhat unrealistic, and machine-specific models which lead to lack of portability and predictability of performance. Essentially, the computation is described in terms of a sequence of "super-steps" executed in parallel by all the processors. A super-step consists of (i) some local computation, (ii) some communication events launched during the super-step, and (iii) a synchronization which ensures that the communication events are completed. The time for a super-step is the sum of the times for each of the above activities.
This is very similar to our tile model: indeed, if we simply set τt = βr = 0 and βs = ℓ/k (where ℓ is the BSP synchronization cost) we obtain the running time of the program under the BSP model. With this simplification, we obtain the BSP cost function as follows:

T_k(x) = αW/P + ℓW/(Pv) + (αv + ℓ)P̃k + 2τc (W/P + vP̃k) Σ_{i=1}^{k} 1/xi    (7)
In the BSP model, the communication startup cost is replaced by the synchronization cost. However, in our general cost function (5) we incur the startup cost k times. As a result, if we take a particular case of the BSP model where the network bandwidth is extremely high (the communication time is dominated by the synchronization cost), then this high bandwidth BSP cost function is given as follows (it is similar but not identical to the HKT model):
T_k(v) = αW/P + ℓW/(Pv) + (αv + ℓ)P̃k    (8)
With some other simple modifications, our cost function can also model the overlap of communication and computation, which is often used as a performance tuning strategy. This is not detailed here due to space constraints.

4 Solution for HKT and High Bandwidth BSP Models
In this section, we will focus only on the simple models (6, 8). Our main results are that the optimal tile volume and the dimension of the optimal virtual architecture can be determined in closed form. These results serve two important purposes, in spite of the apparent simplicity of the model. First, they are valid for a number of current machines where the communication latency and network bandwidth are both relatively high (such as the Intel Paragon, and a number of similar machines as well as networks of workstations). Second, they give a good indication of our solution method for the more general results. We first solve the problem for the HKT model, and then show how the high bandwidth BSP model follows almost directly.

4.1 Optimal Tile Volume
It is easy to see that the running time (6) depends only on the tile volume and is a strictly convex function of the form T_k(v) = A/v + Bv + C, which attains its optimal value of C + 2√(AB) at v̂ = √(A/B). By substituting we obtain

    v̂ = √( k(β_r + β_s)·W / (α·P·P_k) )        (9)

    T̂_k = αW/P + k(β_r + β_s)·P_k + 2√( k(β_r + β_s)·α·W·P_k / P )        (10)

The optimal solution will be as given above if v̂ is a feasible tile volume. Now, observe that each x_i is bounded from above by N_i/p_i and hence v ≤ W/P. Since W grows much faster than P, we have 1 ≤ v̂ ≤ W/P asymptotically. Hence we have the following result.

Theorem 1. The optimal tile volume and the corresponding running time for the HKT model are given by (9-10).
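To make the closed form concrete, the small sketch below (written in Python, with the symbol names taken from the reconstruction of (9)-(10) above, so they should be read as assumptions rather than the authors' code) computes the optimal tile volume and the corresponding running time, clamping the volume to the feasible range.

    import math

    def hkt_optimal_tile(alpha, beta, W, P, P_k, k):
        """Optimal tile volume/time for the convex cost T_k(v) = A/v + B*v + C,
        with A = k*beta*W/P, B = alpha*P_k, C = alpha*W/P + k*beta*P_k
        (beta stands for beta_r + beta_s)."""
        A = k * beta * W / P
        B = alpha * P_k
        C = alpha * W / P + k * beta * P_k
        v_opt = math.sqrt(A / B)                 # unconstrained optimum, cf. (9)
        v_opt = min(max(v_opt, 1.0), W / P)      # clamp to the feasible range
        T_opt = A / v_opt + B * v_opt + C        # equals (10) when unconstrained
        return v_opt, T_opt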
4.2 Optimal Architecture
So far, we have assumed that k is a fixed constant, as are each of the p_i's. In practice, we typically have P processors, but the values of the p_i's are not specified, and neither is k; we now solve this problem. Note that the dominant term in (10) is αW/P, the "ideal" parallel time with no overhead. The ideal architecture will seek to minimize the overhead, whose dominant term is O(√(k·P_k·W/P)). From (10) it can be easily deduced that for a given k and P, the optimal architecture is the one that minimizes P_k, i.e., the torus with p_i = P^{1/k}. Substituting this in (10), the coefficient of the overhead term is proportional to the square root of the function f(k, P) = k²P^{1/k} − k². Thus T̂_k is minimized for the value of k that minimizes f(k, P), i.e., k* = 0.625 ln P. Since f(k, P) is monotonically decreasing up to k* (for each P) and monotonically increasing thereafter, we have the following result.

Corollary 1. If n ≤ ⌈k*⌉ the optimal architecture is a balanced n−1 dimensional torus, and for n > ⌈k*⌉ it is either a balanced ⌈k*⌉ or a ⌊k*⌋ dimensional torus, depending respectively on whether f(⌈k*⌉, P) is smaller than f(⌊k*⌋, P) or not.

It is thus clear that as the number of processors grows, the optimal architecture tends towards an n−1 dimensional hyper-torus. However, k* grows logarithmically with P, and for a limited number of processors (most practical cases), the optimum may not be n−1 dimensional. Indeed, the sensitivity of k* with respect to the number of processors P is illustrated by the following: for up to 25 processors, the optimal architecture is a linear array, from 25-130 it is 2-dimensional, for 130-650 processors it is 3-dimensional, etc. The extension of these results to the high bandwidth BSP model (Eqn 8) is straightforward, and indeed the mathematical treatment is a little simpler. The solution is the same as that given by (9-10), but with k, β_r, and β_s respectively equal to 1, 0 and ℓ. The sole subtle difference is that the function f(k, P) is kP^{1/k} − k, and hence k* = +∞, i.e., f(k, P) is monotonically decreasing. Hence, the optimal architecture is always a balanced n−1 dimensional torus. Finally, note that the optimization of the processor architecture yields a second order improvement--it does not affect the dominant term αW/P.
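As an illustration of this architecture-selection argument, the sketch below evaluates the (reconstructed) function f(k, P) for a balanced torus and picks the better of ⌊k*⌋ and ⌈k*⌉, in the spirit of Corollary 1; the exact form of f and the helper names are assumptions made for this example.

    import math

    def best_virtual_architecture(P, n):
        # Coefficient governing the overhead term for a balanced k-torus
        # (reconstructed form; an assumption for this illustration).
        def f(k):
            return k * k * (P ** (1.0 / k) - 1.0)
        k_star = 0.625 * math.log(P)                  # continuous minimiser
        lo = max(1, min(n - 1, math.floor(k_star)))
        hi = max(1, min(n - 1, math.ceil(k_star)))
        k_best = lo if f(lo) <= f(hi) else hi         # Corollary 1 style choice
        side = P ** (1.0 / k_best)                    # balanced torus: p_i = P**(1/k)
        return k_best, [side] * k_best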
5 Solution for the BSP Cost Function
We now solve our optimization problem for the BSP cost function (7) in the feasible space specified by (2). We will show that our problem can be decomposed into two special cases. The first case is very similar to the HKT model, but the second one is more complicated; for it we first solve the corresponding unconstrained optimization problem and then determine where the constrained solution lies. Finally, we show that the second solution is globally optimal asymptotically.

Lemma 1. The minimum of (7) over conv(R) is attained at either x_i = N_i/p_i for i = 1...k, or x_i = 1 for i = k+1...n.

Proof. Let v be an arbitrary feasible volume and let there exist two indices l, m with l ≤ k and k+1 ≤ m ≤ n such that x_l < N_l/p_l and x_m > 1. Then by increasing x_l, decreasing x_m and keeping their product constant, the function (7) strictly decreases, and we obtain the needed result.

Based on this, we have to look for the solution in the two regions of R corresponding to the above two conditions. Let R_1 = R ∩ {x_i = N_i/p_i for i = 1...k}, and R_2 = R ∩ {x_i = 1 for i = k+1...n}.

5.1 Case I
The cost function in region R_1 can be simplified to

    T_k(x) = (α + 2τ_c·N^k)·W/P + ℓ·P_k + ℓ·W/(Pv) + (α + 2τ_c·N^k)·P_k·v        (11)
where N^k = Σ_{i=1..k} p_i/N_i. We can now use the same reasoning as in Section 4, and obtain the optimal volume and running time as follows:

    v̂ = √( ℓ·W / (P·ω) )   and   T̂_k = (α + 2τ_c·N^k)·W/P + ℓ·P_k + 2√( ℓ·W·ω / P )        (12)
where ω = (α + 2τ_c·N^k)·P_k. As before, appropriate values for x_{k+1} ... x_n in R_1 that yield the optimal volume are all equivalent. Observe that in (12) we have a factor 2τ_c·N^k in the dominant term, and this turns out to be important as we shall see later.

5.2 Case II

Let us now consider region R_2. Here, v = Π_{i=1..k} x_i, but the cost function remains the same as (7), except that we have only k variables to solve for. We obtain the solution in two steps.
Unconstrained Optimization. We first solve the problem in the entire positive orthant, without any constraints. This can be formulated as follows. Prob. 2: Minimize (7) in the feasible space R_+ = { [x_1 ... x_k]^T | x_i ≥ 0 }.

Let H_v be the hyperboloid defined by {x ∈ R_+ | Π_{i=1..k} x_i = v}. Observe that the set of families H_v is a partition of R_+, i.e., R_+ = ∪_v H_v, and hence min_{R_+} T_k(x) = min_v min_{H_v} T_k(x). Thus, we first minimize (7) for a given tile volume v (i.e., over a given hyperboloid H_v) and then choose the volume. Now, observe that for a fixed v, (7) is of the form A + B·Σ_i 1/x_i, whose minimum over H_v is attained at x_1 = x_2 = ... = x_k. Hence, for a given tile volume v, the optimal tile shape is (hyper) cubic with x_i = v^{1/k} for each i = 1...k. Thus Σ_{i=1..k} v/x_i = k·v^{(k−1)/k},
and we can define

    f(v) = min_{H_v} T_k(x) = A/v + B·v + 2τ_c·k·P_k·v^{(k−1)/k} + k·D·v^{−1/k} + αW/P + ℓ·P_k        (13)
where A = ℓW/P, B = αP_k, and D = 2τ_c·W/P. Now we have to determine the optimal tile volume by minimizing f(v) in the feasible space 1 ≤ v ≤ v_max. It can easily be shown that f(v) is a unimodal function and attains its minimum at the root of f'(v) = 0. Unfortunately this is not easily obtained in closed form, but it is quite well approximated by the root of the function h(v) = −(A+D)/v² + B + 2τ_c(k−1)P_k, whose zero is at √((A+D)/(2τ_c(k−1)P_k + B)). Thus, we obtain the following (approximate) optimal solution of the unconstrained problem:

    v̂ ≈ √( γ·W/P )   and   T̂_k ≈ αW/P + 2kτ_c·(W/P)·(γ·W/P)^{−1/(2k)} + O(√(W/P))        (14)

where γ = (ℓ + 2τ_c)/((2τ_c(k−1) + α)·P_k). Of course we could always determine an exact solution if needed, but we have found the approximate solution to be reasonable in practice. Moreover, as we shall see later, it illustrates some interesting points. Also recall that in the problem as we have solved it so far, we assume that k and the p_i's are fixed, as is the choice of which of the N_i's to map to the processor space.

Constrained Optimization. We now address the question of the restrictions on the optimal solution imposed by the feasibility constraints (2), namely 1 ≤ x_i ≤ u_i. Note that the unconstrained (global) optimal solution is on the intersection of the line defined by the vector 1 and the hyperboloid H_v̂, where v̂ is the optimal tile volume (14). If this intersection is outside the feasible space (2) we will need to solve the constrained optimization problem. We have observed that the optimal running time is extremely sensitive to the tile volume, and much less dependent on the particular values of x_i (indeed this is predicted by the HKT model). We have the following cases.
Case A: v̂ > v_max. In this case, the optimal volume hyperboloid does not intersect the feasible space, and according to the properties of the function f(v) the optimal tile size is given by x_i = N_i for i = 1...k. However, note that since v̂ = O(√(W/P)) while v_max = O(W/P), this case is unlikely. Case B: v̂ ≤ v_max. Now, H_v̂ has a non-empty intersection with (2) and so our heuristic is to choose a solution by moving one of the x_i's such that we move within the feasible region.
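Since f(v) is unimodal on [1, v_max], its minimizer can also be located numerically rather than through the closed-form approximation; the following generic ternary-search helper is a hypothetical utility sketched for illustration, not part of the paper.

    def minimize_unimodal(f, lo, hi, tol=1e-6):
        """Ternary search for the minimiser of a unimodal function f on [lo, hi]."""
        while hi - lo > tol:
            m1 = lo + (hi - lo) / 3.0
            m2 = hi - (hi - lo) / 3.0
            if f(m1) < f(m2):
                hi = m2
            else:
                lo = m1
        return (lo + hi) / 2.0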
5.3 Where is the Global Optimum?
The optimal solutions for the two cases are given, respectively, by (12) for region R_1 and by (14) for region R_2. Most authors [6, 13, 12] have only considered region R_1, either implicitly by not posing the problem in full generality, or by erroneously claiming that the solution is always in R_1. This is incorrect because the final solution will always asymptotically be in R_2, due to the additional factor 2τ_c·N^k in the dominant term in (12) as compared to (14). The following simple example illustrates the prediction of Lemma 1: for n = 3, k = 2, p_1 = p_2 = 4, α = ℓ = τ_c = 10^{-6}, N_1 = 50, N_2 = N_3 = 1000, the optimal time in R_1 is 1.8 while that in R_2 is 1.5. More illustrative instances with real data can be found in [3].
6 Conclusions
We addressed the problem of finding the tile size that minimizes the running time of SPMD programs. We formulated a discrete non-linear optimization problem using first an abstract model and then a specific machine model. The resulting cost function is general enough to subsume most of those in the literature, including the BSP model. We then analytically solved the resulting discrete non-linear optimization problem, yielding the desired solution. There are a number of open questions. The first one is the direct extension to the non-orthogonal case (when the tile boundaries cannot be parallel to the domain boundaries). We have addressed this elsewhere (for the 2-dimensional case) and formulated a non-linear optimization problem [2], but a closed form solution is not available. Finally, experimental validation on a number of target machines is the subject of our ongoing work.

References
1. R. Andonov, H. Bourzoufi, and S. Rajopadhye. Two-dimensional orthogonal tiling: from theory to practice. In International Conference on High Performance Computing, pages 225-231, Tiruvananthapuram, India, December 1996. IEEE.
2. R. Andonov and S. Rajopadhye. Optimal orthogonal tiling of 2-d iterations. Journal of Parallel and Distributed Computing, 45(2):159-165, September 1997.
3. R. Andonov, N. Yanev, and H. Bourzoufi. Three-dimensional orthogonal tile sizing problem: Mathematical programming approach. In ASAP 97: International Conference on Application-Specific Systems, Architectures and Processors, pages 209-218, Zurich, Switzerland, July 1997. IEEE Computer Society Press.
4. S. Bokhari. Communication overheads on the Intel iPSC-860 Hypercube. Technical Report Interim Report 10, NASA ICASE, May 1990.
5. P. Boulet, A. Darte, T. Risset, and Y. Robert. (Pen)-ultimate tiling? INTEGRATION, the VLSI Journal, 17:33-51, 1994.
6. S. Hiranandani, K. Kennedy, and C.-W. Tseng. Evaluating compiler optimizations for Fortran D. Journal of Parallel and Distributed Computing, 21:27-45, 1994.
7. F. Irigoin and R. Triolet. Supernode partitioning. In 15th ACM Symposium on Principles of Programming Languages, pages 319-328. ACM, January 1988.
8. R. M. Karp, R. E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. JACM, 14(3):563-590, July 1967.
9. C.-T. King, W.-H. Chou, and L. Ni. Pipelined data-parallel algorithms: Part I--concept and modelling. IEEE Transactions on Parallel and Distributed Systems, 1(4):470-485, October 1990.
10. C.-T. King, W.-H. Chou, and L. Ni. Pipelined data-parallel algorithms: Part II--design. IEEE Transactions on Parallel and Distributed Systems, 1(4):486-499, October 1990.
11. W. F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, volume 1000, pages 46-61. Springer Verlag, 1995.
12. H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjustment in compiling general DOACROSS loop nests. In International Conference on Supercomputing, pages 270-279, Barcelona, Spain, July 1995. ACM.
13. D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, pages xx-yy, St. Charles, IL, August 1994. IEEE.
14. J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for non shared-memory machines. In Supercomputing 91, pages 111-120, 1991.
15. R. Schreiber and J. Dongarra. Automatic blocking of nested loops. Technical Report 90.38, RIACS, NASA Ames Research Center, August 1990.
16. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
17. M. J. Wolfe. Iteration space tiling for memory hierarchies. Parallel Processing for Scientific Computing (SIAM), pages 357-361, 1987.
Enhancing the Performance of Autoscheduling in Distributed Shared Memory Multiprocessors* Dimitrios S. Nikolopoulos, Eleftherios D. Polychronopoulos and Theodore S. Papatheodorou High Performance Computing Architectures Laboratory Department of Computer Engineering and Informatics University of Patras Rio 26500, Patras, Greece e-mail: {dsn, edp, tsp}@hpclab.ceid.upatras.gr
Abstract. Autoscheduling is a parallel program compilation and execution model that uniquely combines three features: automatic extraction of loop and functional parallelism at any level of granularity, dynamic scheduling of parallel tasks, and dynamic program adaptability on multiprogrammed shared memory multiprocessors. This paper presents a technique that enhances the performance of autoscheduling in Distributed Shared Memory (DSM) multiprocessors, targeting mainly medium and large scale systems, where poor data locality and excessive communication impose performance bottlenecks. Our technique partitions the application Hierarchical Task Graph and maps the derived partitions to clusters of processors in the DSM architecture. Autoscheduling is then applied separately for each partition to enhance data locality and reduce communication costs. Our experimental results show that partitioning achieves remarkable performance improvements compared to a standard autoscheduling environment and a commercial parallelizing compiler.
1 Introduction
Distributed Shared Memory (DSM) multiprocessors have evolved as a powerful and viable platform for parallel computing. Unfortunately, automatic parallelization for these systems becomes often an onerous duty, requiring numerous manual optimizations and substantial effort from the programmer to sustain high performance [3]. The development of programming models and compilation techniques that exploit the advantages and overcome the architectural bottlenecks of DSM systems remains a challenge. Among several parallel compilation and execution models proposed in the literature, autoscheduling [13] is one of the most promising approaches. Autoscheduling provides a compilation framework which merges the control flow and data flow models to efficiently exploit multiple levels of loop and functional parallelism. The compilation framework is coupled with an execution environment * This work was supported by the European Commission under ESPRIT IV project No. 21907 (NANOS)
that schedules parallel programs dynamically on a variable number of processors, by controlling the granularity of parallel tasks at runtime. Coordination between the compiler and the execution environment achieves effective integration of multiprocessing with multiprogramming. In this paper, we identify sources of performance degradation for an autoscheduling environment running on a DSM multiprocessor. Performance bottlenecks arise as a consequence of poor caching performance and excessive communication through the interconnection network. We present a technique that attempts to resolve these problems by augmenting autoscheduling with partitioning and clustering. We partition the application task graph in subgraphs with independent dataflows and map the derived subgraphs to processor clusters. The topology of the clusters conforms to the system architecture. Partitioning and clustering are performed dynamically at runtime. After mapping partitions to clusters, autoscheduling is applied separately to each partition to extract and schedule the parallelism. Using application and synthetic benchmarks, we demonstrate that partitioning reduces significantly the execution time of autoscheduled programs on a 32-processor SGI Origin2000. For the same set of benchmarks, partitioned autoscheduling outperforms the native SGI loop parallelizer, even in situations where standard autoscheduling fails to do so. The rest of this paper is organized as follows: Section 2 provides some background on autoscheduling and discusses its performance limitations. Section 3 presents our technique along with some implementation issues. Section 4 describes our evaluation framework and Section 5 provides detailed experimental results. We summarize our conclusions in Section 6.

2 Autoscheduling and Performance Limitations
An autoscheduling environment consists of a parallelizing compiler, a multithreading runtime library and an enhanced operating system scheduler. The compiler applies control and data dependence analysis to produce an intermediate program representation called the Hierarchical Task Graph (HTG) [5]. Programs are represented with a hierarchical structure consisting of simple and compound nodes. Simple nodes correspond to basic blocks of sequential code. Compound nodes represent tasks amenable to parallelization, such as parallel loops and parallel sections. The levels of the hierarchy are determined by the nesting of loops and correspond to different levels of exploitable parallelism. Nodes are connected with arcs representing execution precedences. A boolean expression with a set of firing conditions is associated with each node. When this expression evaluates to TRUE, all data and control dependences of the node are satisfied and the corresponding task is enabled for execution. The compiler generates parallel code by means of a user-level runtime system which creates and schedules lightweight execution flows called nano-threads [13]. The granularity of the generated parallelism is controlled dynamically at runtime. A communication mechanism between the runtime system and the operating
system scheduler keeps parallel programs informed of processor availability and enables dynamic program adaptability to varying execution conditions [14]. Each compound node in the HTG is annotated as a candidate source of parallelism. The runtime system decomposes all the compound nodes which contain enough parallelism to utilize the processors available at a given execution instance. If several compound nodes coexist at the same level of the hierarchy, parallelism can be extracted from all of them simultaneously. In that case, nano-threads originating from different compound nodes are executed in an overlapped manner across the processors. As an example, consider the HTG of the multiplication of two matrices Y, Z with complex entries, calculated as:

    X = YZ = (A + jB)(C + jD) = (AC − BD) + j(BC + AD)        (1)
The computation consists of four independent matrix by matrix multiplications, followed by an addition and a subtraction. Each of these tasks is a compound node with inherent loop parallelism. If a sufficient number of processors is available at runtime, autoscheduling may execute the matrix multiplications simultaneously and each matrix multiplication in parallel, to exploit two levels of parallelism. Overlapped execution of nano-threads originating from different compound nodes can have undesirable effects on a DSM multiprocessor. This is particularly true when each compound node individually uses all the available processors. Three performance implications may arise in this case: increase of conflict and capacity cache misses, heavy traffic in the interconnection network, and higher runtime overhead. Bursts of conflict and capacity misses occur when the working sets of concurrently executing compound nodes do not fit in the caches, thus forcing the caches to alternate frequently between different working sets. This happens as processors are switched between nano-threads originating from different compound nodes. Overlapped execution of compound nodes stresses the interconnection network, due to excessive data communication. The runtime system overhead is higher, since multiple processors are forced to access heavily shared ready queues and memory pools, in order to create and schedule nano-threads. This side-effect exacerbates contention and communication traffic due to remote memory accesses [11].
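The two-level task structure of this example can be pictured with the following small sketch: the four products form the outer (functional) level and each NumPy call stands in for an inner data-parallel loop. It only illustrates the shape of the task graph and is not the nano-threads runtime.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def complex_matmul(A, B, C, D):
        """X = (A + jB)(C + jD): four independent products (outer level),
        each of which is itself a data-parallel task (inner level)."""
        with ThreadPoolExecutor(max_workers=4) as pool:
            ac = pool.submit(np.matmul, A, C)
            bd = pool.submit(np.matmul, B, D)
            bc = pool.submit(np.matmul, B, C)
            ad = pool.submit(np.matmul, A, D)
            real = ac.result() - bd.result()   # subtraction task
            imag = bc.result() + ad.result()   # addition task
        return real, imag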
3 Performance Enhancements
Our method to alleviate the problems presented in Sect. 2 while preserving the semantics of autoscheduling, is to decompose the program HTG into partitions with independent data flows and the processor space of the program into clusters, with a one-to-one mapping between HTG partitions and processor clusters. After partitioning and clustering, autoscheduling is applied separately for each partition. Parallelism from each partition is extracted to utilize the processors of the cluster to which the partition is mapped. Processor clusters serve as memory
locality domains for the partitions. This means that the working set of a partition is stored in physical memory that lies within the corresponding cluster, in order to reduce memory latency during the execution of the partition. The communication required for a partition is performed within the cluster, thus putting less pressure on the interconnection network and avoiding long distance memory accesses. Furthermore, data dependent partitions use the same processors to preserve locality. If a partition P is data dependent on a partition Q, our mechanism executes P either in a superset or a subset of the cluster to which Q is mapped. The relative size of the clusters is determined with the following procedure: compound nodes that can be simultaneously executed are detected in the HTG and grouped in sets. The computational requirements of each compound node are estimated with execution profiling and then weighed against the aggregate computational requirements of the compound nodes that belong to the same set. Processors are distributed proportionally to each compound node according to its weighed computational requirements. The actual size of the clusters is set at runtime, and depends on their precomputed relative size and processor availability. Let P be the number of processors available to the application at an execution instance, n the number of concurrently executing compound nodes at that instance and r_i, i = 1...n, the computational requirement of each compound node, expressed in CPU time. Then the processor set is partitioned into n clusters c_i, i = 1...n, and each cluster receives P_{c_i} processors, where:

    P_{c_i} = max{ 1, (r_i / Σ_{j=1..n} r_j) · P }
(2)
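A minimal sketch of this proportional allocation is given below; the rounding rule is an assumption, since Eq. 2 only fixes the proportionality and the minimum of one processor per cluster.

    def cluster_sizes(requirements, available):
        """Distribute `available` processors over the concurrently executing
        compound nodes in proportion to their CPU-time requirements r_i,
        giving each cluster at least one processor (cf. Eq. 2)."""
        total = sum(requirements)
        return [max(1, round(r * available / total)) for r in requirements]

    # e.g. cluster_sizes([10.0, 10.0, 10.0, 10.0], 32) -> [8, 8, 8, 8]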
If the number of available processors does not exceed the degree of functional parallelism exposed by the application--expressed as the number of concurrently executing compound nodes--then our technique degenerates to standard autoscheduling. Otherwise, autoscheduling is applied within the bounds of each cluster by extracting the data and/or functional parallelism of the associated HTG partition. The clusters can be dynamically reconfigured at runtime. Load imbalances between clusters are not considered. We rather rely on the proportional allocation of processors to clusters to handle these cases. However, load imbalances within the clusters across parallel loops or parallel sections are handled by the user-level scheduler, which employs affinity scheduling and work stealing [15]. The topology of the clusters is kept consistent with the underlying system architecture. For DSM architectures, clusters are built incrementally using the Symmetric Multiprocessing (SMP) nodes as the basic building block. The interconnection network is exploited to form clusters with SMP nodes which are physically close to each other. The clusters can shrink or grow at runtime, either by splitting, or by joining directly connected clusters. The entire mechanism is implemented at user-level. Partitions are detected from the HTG, and communicated to the runtime system through directives. Static information about the relative processor requirements of each partition is also communicated to the runtime system. The actual cluster sizes are computed
at runtime using Eq. 2. The infrastructure for mapping the partitions to processors is provided by a hierarchy of ready queues [2]. The hierarchy corresponds to a tree with a global system queue at the root, per-processor local queues at the leaves and a number of intermediate levels of ready queues. Mapping of partitions to clusters is performed with Distributed Hierarchical Control (DHC) [4]. When a compound node is assigned to a cluster of size p, the corresponding nano-thread is mapped to a ready queue at the root of a subtree that has at least p processors at the leaves. The processor that creates the nano-thread searches the hierarchy upwards until it finds a queue from which a tree of the appropriate size emanates. Nested nano-threads created by the outermost nano-thread are mapped to the queues that reside at the leaves of the subtree. Processor clusters are implicitly configured using DHC, without invoking the operating system to construct the memory locality domains. As an example, Fig. 1.(a) illustrates the task graph of complex matrix multiplication. Following our scheme, the processor set that executes the computation is initially partitioned in four clusters of equal size, and a matrix multiplication is executed separately in each cluster. Within a cluster, the matrix multiplication is autoscheduled on a number of processors equal to the size of the cluster. When the subtraction and addition tasks are activated, pairs of neighbouring clusters are joined to form two clusters of doubled size and execute the tasks. Figures 1.(b) and 1.(c) demonstrate executions of the task graph under standard autoscheduling and our scheme respectively. Circles indicate nano-threads, while their fill patterns indicate the compound nodes from which the nano-threads originate. Under standard autoscheduling, the execution sequence of nano-threads from the concurrently executing compound nodes is determined from the serialization of processor accesses to the ready queues and may change between subsequent executions of the program. With partitioning, compound nodes are still executed concurrently, but the parallelism extracted from each partition is scheduled in isolation within a cluster. Figure 2 illustrates a hierarchy of ready queues for an 8-processor system and the scheduling strategy for the complex matrix multiplication example.
Fig. 1. Complex matrix multiplication task graph and execution with standard and partitioned autoscheduling. The symbols ×, +, − indicate matrix-by-matrix operations.
Fig. 2. Hierarchy of ready queues and enqueing of complex matrix multiplication tasks.
4 Evaluation Framework
We evaluate our technique with a number of application and synthetic benchmarks. The application benchmarks presented here were used in a previous study of autoscheduling [9]. This gives us the opportunity for a quantitative analysis of our results in direct comparison with existing work in the field. The purpose of the synthetic benchmarks is to demonstrate the performance implications of autoscheduling and the ability of our method to overcome them. The application benchmarks include the complex matrix multiplication kernel, the kernel of a Fourier-Chebyschev spectral computational fluid dynamics code [7], and an industrial molecular dynamics code [16]. All benchmarks have two levels of exploitable parallelism, namely a constant degree of functional parallelism at the outer lelel and data parallelism at the inner level. The complex matrix multiplication and the cfd kernels use matrices of size 256x256. The molecular dynamics kernel consists of 14 parallel functions, each one executing a parallel loop with 1000 iterations. More detMls on the structure of these benchmarks, as well as their performance under autoscheduling on a bus-based multiprocessor can be found in [9]. Our first synthetic benchmark consists of 8 parallel dot products. Each dot product is computed with two vectors of 64 Kilobytes of double precision elements. A detailed description of the benchmark can be found in [11]. Two levels of functional and data parallelism are exploitable, similarly to the application benchmarks. The datafiow of this benchmark favours processor clustering to maintain locality. The benchmark experiences increased conflict cache misses due to the alignment of data in memory, and a low ratio of computation to overhead because of the fine thread granularity. The second synthetic benchmark consists of four parallel point LU factorizations, followed by two parallel additions of the factorized matrices. The matrix size used is 256 x 256. Autoscheduling is expected to lead to conflicts between elements of different matrices in the caches, and increased remote cache invalidations due to the all to all communication pattern of the computation. Partitioning is expected to save cache misses and localize communication.
Table 1. Maximum speedups attained for the benchmarks with the three parallelization schemes.

                            Loop Parallel    Autoscheduled    Partitioned Autosch.
    Benchmark               Smax      p      Smax      p      Smax      p
    synthetic dot product   1.42      4      1.82      4       3.52      8
    synthetic LU            7.06     16      7.34     16      15.72     32
    cmm kernel              7.14     16      8.51     16      10.08     16
    cfd kernel              5.59     12      6.22     12       7.31     12
    md kernel               1.13     14      1.03     14       1.34     28
We performed our experiments on a SGI Origin2000 [6], equipped with 32 MIPS R1000 processors. We implemented partitioning and clustering using NthLib [8]. NthLib is a multithreading runtime library designed to support autoscheduling. The library is currently under development in the context of the E S P R I T project NANOS [10]. NthLib provides lightweight threads, created and scheduled entirely at user-level. The runtime system translates H T G representations into multithreading code, while attempting to preserve desirable properties such as data locality and load balancing. At the same time, NthLib communicates with the operating system to enable application adaptivity to the runtime conditions. Automatic parallelization was performed with HPF-like directives and the Parafrase-2 compiler [1, 12]. Three versions of each benchmark were used in the experiments: (a) a loop parallel version produced by the native SGI parallelizing compiler and exploiting a single level of parallelism from loops; (b) a autoscheduled version, which exploits multiple levels of loop and functional parallelism; and (c) a partitioned autoscheduled version which uses autoscheduling enhanced with partitioning and clustering. All benchmarks were compiled with the -03 optimization flag and executed in a dedicated environment.
5 Performance Results
In this section, we present detailed results from our experiments. Table 1 shows the maximum speedups attained for the benchmarks presented in Sect. 4 with the three parallelization schemes. The table also shows the number of processors on which the maximum speedup is attained in each case. The obtained speedups are justified as follows: For the synthetic dot product benchmark, poor speedup is a consequence of increased runtime overhead and cache conflicts. This benchmark scales better only if the problem size and thread granularity are significantly increased. The molecular dynamics kernel exhibits poor speedups due to poor data locality and high irregularity in the native code. Our results for this code agree with those reported in [9]. The other three codes exhibit good speedups, with respect to the structure of the computations, the platform architecture and the problem size. The primary conclusion drawn from Table 1 is that in all cases, the m a x i m u m attainable speedup is obtained from the partitioned autoscheduled versions. The
498 partitioned autoscheduled versions of the synthetic benchmarks and the molecular dynamics kernel scale up to a higher number of processors than both the autoscheduled and the loop parallel versions. The autoscheduled versions have always better speedups than the loop parallel versions with the exception of the molecular dynamics kernel. These results are consistent with the results in [9] and strengthen the argument that autoscheduling exploits unstructured parallelism more effectively. All benchmarks achieve better speedups and some of them do scMe up to 32 processors if the problem sizes and thread granularities are adequately increased. However, we insisted in using small problem sizes to stress the ability of the runtime system to handle fine-grain parallelism, a prerequisite for the applicability of autoscheduling. We believe that parallel sections and loops with such a fine granularity are frequently met in scientific codes, and the question whether parallelizing these code fragments or not is worth the effort remains often dangling. Autoscheduling is a practical scheme for exploiting this kind of parallelism from scientific applications and our experiments intend to demonstrate this feature. A more thorough study of the codes as well as architecture-specific optimizations that could improve the speedups on our experimental platform are out of the scope of this paper. Figure 3 illustrates the normalized execution times of the benchmarks when executed on a variable number of processors. Normalization is performed against the sequential execution time excluding the parallelization overhead. The conclusions concerning the relative performance of the three parallelization methods, agree with those derived from Table 1. The partitioned autoscheduled versions exhibit good behaviour even when executed on 32 processors, despite the fact that most benchmarks do not scale up to this point. Compared to loop parallelization and standard autoscheduling, our technique improves speedup by a factor of two and 36% respectively for the complex matrix multiplication. The corresponding numbers for the CFD kernel are 55% and 450/o. For the synthetic dot product benchmark, partitioned autoscheduling is the only mechanism that achieves any speedup on 32 processors. Two benchmarks that exhibit interesting behaviour are the synthetic LU benchmark and the molecular dynamics kernel. In the former, autoscheduling gives only marginal improvements compared to loop parallelization, while in the latter, autoscheduling results in performance degradation. In both cases, partitioning overcomes these limitations. The speedup of the synthetic LU benchmark is almost doubled with partitioning. The molecular dynamics kernel enjoys a noticeable speedup improvement of 19%. In order to gain a better insight into our results, we performed experiments that measured memory performance under the three parallelization schemes. We used the hardware counters of MIPS R10000 and the p e r f e x utility to obtain various performance statistics. Figure 4 presents a summary of the results from these experiments. The charts show the total number of cache misses and remote invalidations for each benchmark when executed on 32 processors. The results indicate clearly that partitioned autoscheduling improves memory performance
Fig. 3. Normalized execution times of the benchmarks
by reducing cache misses and interconnection network traffic. Cache misses and remote invalidations are on average halved with partitioning and clustering.

6 Conclusions
This paper demonstrated that although autoscheduling exploits potentially more parallelism, it is still amenable to enhancements in order to achieve high performance on DSM multiprocessors. We presented a simple and easy to implement technique which takes advantage of data and communication locality by applying partitioning and clustering to autoscheduled programs. The results proved that autoscheduling remains a strong choice for exploiting multilevel parallelism when the mechanisms that extract and schedule the parallelism are aware of the target architecture. Several issues of partitioned autoscheduling need further investigation. Topics for further research might be the automation of the clustering procedure using program analysis in the compiler, and the evaluation of partitioned autoscheduling under multiprogramming conditions. Our current work is oriented towards these directions. Acknowledgements. We would like to thank Constantine Polychronopoulos for his support and valuable comments, the European Center for Parallelism in Barcelona (CEPBA) for providing us access to their Origin2000, all our partners in the NANOS project and George Tsolis for his help in improving the appearance of the paper.
Fig. 4. Memory performance of the benchmarks
References 1. E. Ayguad~, X. Martorell, J. Labarta, M. Gonzhlez, and N. Navarro, Exploiting Parallelism through Directives on the Nano- Threads Programming Model, Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing, Minneapolis, Minnesota, August 1997. 2. S. Dandamundi, Reducing Run Queue Contention in Shared Memory Multiprocessors, IEEE Computer, Vol. 30(3), pp. 82-89, March 1997. 3. R. Eigenmann, J. Hoeflinger and D. Padua, On the Automatic Parallelization of Perfect Benchmarks, IEEE Transactions on Parallel and Distributed Systems, Vol. 9(1), pp. 5-23, January 1998. 4. D. Feitelson and L. Rudolph, Distributed Hierarchical Control for Parallel Processing, IEEE Computer, Vol. 23(5), pp. 65-79, May 1990. 5. M. Girkar and C. Polychronopoulos, Automatic Extraction of Functional Parallelism from Ordinary Programs, IEEE Transactions on Parallel and Distributed Systems, vol. 3(2), pp. 166-178, March 1992. 6. J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, Proceedings of the 24th Annual International Symposium on Computer Architecture, pp. 241-251, Denver, Colorado, June 1997. 7. S. Lyons, T. Hanratty and J. MacLaugtdin, Large-scale Computer Simulation of Fully Developed Channel Flow with Heat Transfer, International Journal of Numerical Methods for Fluids, vol. 13, pp. 999-1028, 1991. 8. X. Martorell, J. Labarta, N. Navarro and E. Ayguad~, A Library Implementation of the Nano-Threads Programming Model, Proceedings of Euro-Par'96, pp. 644-649, Lyon, France, August 1996. 9. J. Moreira, On the Implementation and Effectiveness of Autoscheduling for SharedMemory Multiprocessors, PhD Thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1995.
10. NANOS Consortium, Nano-Threads Programming Model Specification, ESPRIT Project No. 21907 (NANOS), Deliverable M1.D1, July 1997, also available at http://www.ae.upc.es/NANOS
11. D. Nikolopoulos, E. Polychronopoulos and T. Papatheodorou, Efficient Runtime Thread Management for the Nano-Threads Programming Model, Proceedings of the IPPS/SPDP'98 Workshop on Runtime Systems for Parallel Programming, LNCS vol. 1388, pp. 183-194, Orlando, Florida, March/April 1998.
12. C. Polychronopoulos, M. Girkar, M. Haghighat, C. Lee, B. Leung and D. Schouten, Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing and Scheduling Programs, International Journal of High Speed Computing, Vol. 1(1), 1989.
13. C. Polychronopoulos, N. Bitar and S. Kleiman, Nano-Threads: A User-Level Threads Architecture, CSRD Technical Report 1297, University of Illinois at Urbana-Champaign, 1993.
14. E. Polychronopoulos, X. Martorell, D. Nikolopoulos, J. Labarta, T. Papatheodorou and N. Navarro, Kernel-Level Scheduling for the Nano-Threads Programming Model, Proceedings of the 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 1998.
15. E. Polychronopoulos and T. Papatheodorou, Dynamic Bisectioning Scheduling for Scalable Shared-Memory Multiprocessors, Technical Report LHPCA-010697, University of Patras, June 1997.
16. B. Quentrec and C. Brot, New Method for Searching for Neighbors in Molecular Dynamics Computations, Journal of Computational Physics, Vol. 13, pp. 430-432, 1975.
Workshop 05+15 Distributed Systems and Databases Lionel Brunie and Ernst Mayr Co-chairmen
Distributed Systems
The 15 papers in this workshop have been selected from the submissions to two areas: five from parallel and distributed databases, and another ten from distributed systems and algorithms. Parallel and distributed database systems are critical for many application domains such as high-performance transaction systems, data warehousing and interoperable information systems. A number of solutions are commercially available, but many open problems remain. Future database systems must support flexible and adaptive approaches for data allocation, load balancing and parallel query processing, both at the DML and at the transaction program level. Distributed databases have to meet new challenges by the Internet and by the integration into workflow management systems. New application areas such as decision support, data mining, text handling imply new requirements with respect to functionality and performance. Interprocessor communication has become fast enough that distributed systems can be used to solve highly parallel problems in addition to more traditional distributed problems (such as client-server applications). These distributed systems range from a local area network of homogeneous workstations to coordinated heterogenous workstations and supercomputers. Algorithmic and architectural solutions to problems from the fields of distributed and parallel processing (as well as new solutions) can often be applied or adapted to these kinds of systems. Typical examples are implementations of shared memory abstractions on top of message-passing systems, scheduling parallel applications on distributed heterogeneous systems, mechanism and abstractions for fault tolerance, and algorithms to provide elementary system functions and services. The first session on Distributed Systems deals with networks of processors, connected via various type of networks, like hypercubic networks or tree networks. The talks are concerned with issues of load balancing, with synchronization (mutual exclusion) between neighboring processors, and with fault propagation. The second session also addresses several areas. The first pair of talks discusses efficient protocols for inexpensive networks of workstations, the following paper presents some refinements of causality relations between non-atomic poset events in a distributed environment, and the last talk presents a new proposal to communicate channels over channels, an idea interesting for instance for mobile computing systems.
In the third session on Distributed Systems, some practical solutions to efficiency problems are presented. The first talk describes an SCI-based implementation of shared memory and the operating system support for it, while the next paper presents a simple and efficient solution for garbage collection in an environment with task migration. Finally, a proposal based on active messages for improving communication efficiency in local networks of workstations gets discussed. We would like to express our sincere appreciation to the other members of the Program Committee, Prof. Paul Spirakis, Prof. Friedemann Mattern, and Dr. Marios Mavronicolas, for their invaluable help in the entire selection process. We would also like to thank the numerous reviewers for their time and effort.

Parallel and Distributed Databases
The distributed and parallel databases session is composed of 5 papers. 2 papers concern parallel databases while 3 papers address distributed databases topics. Though object oriented databases are widely accepted as a very promising paradigm, only a few papers have addressed the parallelization of object oriented queries. The first paper is one of them. Authors propose new algorithms for the paralllel processing of collection intersection join queries, i.e. queries which check for whether there is an intersection between collection attributes (e.g. sets, lists, arrays...). Performance evaluation of these algorithms is very promising. With the huge development of the Internet, information retrieval (IR) has received a new attention. In this framework, symmetric multiprocessors, by exploiting the parallelism implicit in the queries, allow processing tremendous loads. The paper by Lu et al. investigate how to balance hardware and software resources to exploit a symmetric multiprocessor devoted to IR. Last papers explore three of the main issues in distributed database management systems : update protocols, query optimization and concurrency control. The paper by Pedone et al. proposes to use atomic broadcast primitives to ensure query termination in replicated databases. Experiments demonstrate that this approach shows a better throughput and response time than traditional atomic commitment methods. F. Najjar and Y. Slimani give a novel insight on distributed query optimization based on semi-joins. Using dynamic programming techniques, the optimization heuristics they propose allow exploiting all the parallelism inherent to the query. Finally, Boukerche and al. investigate how to adapt virtual time techniques, classicaly used in distributed systems and VLSI design, to concurrency control problems in distributed databases. They show this new approach outperforms multiversion timestamp techniques as the number of conflicts or transaction size increases. In summary, this session proposes a very exciting panel of papers which give new insights on some of the hottest issues in parallel and distributed databases.
Collection-Intersect Join Algorithms for Parallel Object-Oriented Database Systems David Taniar 1 and J. Wenny Rahayu 2 1 Monash University - GSCIT, Churchill, Vic 3842, Australia David. Yaniar9 cit. monash, edu. au La Trobe University, Dept. of Computer Sc. 8z Comp. Eng., Bundoora, Vic 3083, Australia wenny@latcs i. lat. oz. au
A b s t r a c t . One of the differences between relational and object-oriented databases (OODB) is that attributes in OODB can of a collection type (e.g. sets, lists, arrays, bags) as well as a simple type (e.g. integer, string). Consequently, explicit join queries in OODB may be based on collection attributes. One form of collection join queries in OODB is collectionintersect join queries, where the joins are based on collection attributes and the queries check for whether there is an intersection between the two join collection attributes We propose two algorithms for parallel processing of collection-intersect join queries. The first one is based on sortmerge, and the second is based on hash. We also present two data partitioning methods (i.e. simple replication and "divide and partial broadcast") used in conjunction with the parallel collection-intersect join algorithms. The parallel sort-merge algorithm can only make use of the divide and partial broadcast data partitioning, whereas the parallel hash algorithm may have a choice which of the two data partitioning to use.
1
Introduction
In Object-Oriented Databases (OODB), although path expression between classes m a y exist, it is sometimes necessary to perform an explicit join between two or more classes due to the absence of pointer connections or the need for value matching between objects. Furthermore, since objects are not in a normal form, an attribute of a class may have a collection as a domain. Collection attributes are often mistakenly considered merely as set-valued attributes. As the m a t t e r of fact, set is just one type of collections. There are other types of collection. The Object Database Standard ODMG [1] defines different kinds of collections: particularly set, list/array, and bag. Consequently, object-oriented join queries may also be based on attributes of any collection type. Such join queries are called collection join queries [9]. Our previous work reported in [9] classified three different types of collection join queries, namely: collection-equi join, collectionintersect join, and sub-collection join. In this paper, we would like to focus on collection-intersect join queries. We are particularly interested in formulating parallel algorithms for processing such queries. The algorithms are non-trivial to
506
parallel object-oriented database systems, since most conventional join algorithms (e.g. hybrid hash join, sort-merge join) deal with single-valued attributes and hence most of the time they are not suitable to handle collection join queries. Collection-intersect join queries are queries that join two classes based on an attribute of a collection type. The join predicates check for whether there is an intersection between the two collection join attributes. An intersect predicate can be written by applying an intersection between the two sets and comparing the intersection result with an empty set. It is normally in a form of ( a t t r l i n t e r s e c t a t t r 2 ) != s e t ( n i l ) . Attributes a t t r l and a t t r 2 are of type set. If one or both of them are of type bag, they must be converted to sets. Suppose the attribute editor-in-chief of class Journal and the attribute program-chair of class Proceedings are of type sets of Person. An example of a collection-intersect join is to retrieve pairs of Journal and Proceedings, where the program-chairs of a conference are intersect with the editors-in-chief of a journal. The query expressed in OQL (Object Query Language) [1] can be written as follows:
Select A, B From A in Journal, B in Proceedings Where (A.editor-in-chief intersect B.program-chair)
!= s e t ( n i l )
As can be clearly seen, the intersection join predicates involve the creation of intermediate results through an intersect operator. The result of the join predicate cannot be determined without the presence of the intermediate collection result. This predicate processing is certainly not efficient. In a collection-intersect join query, the original predicate has to produce an intermediate set before it can be compared with an empty set. This process checks the smaller set twice: once for the intersection, and once for the equality comparison. Therefore, optimization algorithms for efficient processing of such queries are critical if one wants to improve query processing performance. In this paper, we present two parallel join algorithms for collection-intersect join queries. The primary intention is to solve the inefficiency imposed by the original join predicates. An interest in parallel OODB among the database community has been growing rapidly, following the popularity of multiprocessor servers and the maturity of OODB. The merging of parallel technology and OODB has shown promising results [3], [5], [7], [11]. However, most research done in this area has concentrated on path expression queries with pointer chasing. Explicit join processing exploiting collection attributes has not been given much attention. The rest of this paper is organized as follows. Section 2 explains two data partitioning methods for parallel collection-intersect join algorithms. Section 3 describes a proposed parallel join algorithm for collection-intersect join queries based on a sort-merge technique. Section 4 introduces another algorithm, which is based on a hash technique. Finally, section 5 draws the conclusions and explains the future work.
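The inefficiency can be seen in a small sketch: evaluating the predicate literally builds the intersection first, whereas the test can stop at the first common element. The helper below is illustrative only and is not taken from the paper.

    def collections_intersect(c1, c2):
        """Collection-intersect predicate: true as soon as one common element
        is found, without materialising the intermediate intersection that the
        literal (attr1 intersect attr2) != set(nil) formulation would build."""
        probe = set(c1)                  # bags/lists are reduced to a set
        return any(x in probe for x in c2)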
507 2
Data
Partitioning
Parallel join algorithms are normally decomposed into two steps: data partitioning and local join. Data partitioning creates parallelism, as it divides the data to multiple processors, so that the join can then be performed locally in each processor without interfering others. For collection-intersect join queries, it is not possible to have non-overlap partitions, due to the nature of collections which may be overlapped. Hence, some data needs to be replicated. Two non-disjoint partitioning methods are proposed. The first is a simple replication based on the value of the element in each collection. The second is a variant of Divide and Broadcast [4], called "Divide and Partial Broadcasf'.
2.1
Simple Replication
Using a simple replication technique, each element in a collection is treated as a single unit, and is totally independent of other elements within the same collection. Based on the value of an element in a collection, the object is placed into a particular processor. Depending on the number of elements in a collection, the objects that own the collections may be placed into different processors. When an object is already placed at a particular processor based on the placement of an element, if another element in the same collection is also to be placed at the same place, no object replication is necessary. As a running example, consider the data shown in Figure 1. Suppose class A and class B are Journal and Proceedings, respectively. Both classes contain a few objects shown by their OIDs (e.g., objects a to i are Journal objects and objects p to w are Proceedings objects). The join attributes are editor-in-chief of Journal and program-chair of Proceedings; and are of type collection of Person. T h e OID of each person in these attributes are shown in the brackets. For example a(250,75) denotes a Journal object with OID a and the editors of this journal are Persons with OIDs 250 and 75. Figure 2 shows an example of a simple replication technique. The b o l d printed elements are the elements which are the basis for the pin'cement of those objects. For example, object a(250, 75) in processor 1 refers to a placement for object a in processor 1 because of the value of element 75 in the collection. And also, object a(250, 75) in processor 3 refers to a copy of object a in processor 3 based on the first element (i.e., element 250). It is clear that object a is replicated to processors 1 and 3. On the other hand, object i(80, 70) is not replicated since both elements will place the object at the same processor, that is processor 1.
2.2
D i v i d e and Partial Broadcast
The Divide and Partial Broadcast algorithm, shown in Figure 3, proceeds in two steps. The first step is a divide step, and the second step is a partial broadcast step. We divide class B and partial broadcast class A. The divide step is explained as follows. Divide class B into n number of partitions. Each partition of class B is placed in a separate processor (e.g. partition B1 to processor 1, partition
508
Class A (Journal)
Class B (Proceedings)
a(250, 75) b(210, 123) c(125, 181) d(4, 237) e(289, 290) f(150, 50, 250) g(270) h(190, 189, 170) i(80, 70)
p(123, 210) q(237) r(50, 40) s(125, 180) t(50, 60) u(3, 1, 2) v(100, 102, 270) w(80, 70)
Proceedings OIDs Journal OIDs
program-chair OIDs editor-in-chief OIDs
Fig. 1. Sample data
Class B
Class A
a(250, 75) d(4, 237) f(150, 50, 250) i(80, 70)
r(50, 40) t(50, 60) u(3, 1, 2) w(80, 70)
Processor 1 (range 0-99)
b(210, 123) c(125, 181) f(150, 50, 250) h(190, 189, 170) a(250, 75) b(210, 123) d(4, 237) e(289,290) f(150, 50, 250) g(270)
p(123, 210) s(125, 180) v(100, 102,270)
Processor 2 (range 100-199)
p(123,210) q(237)
Processor 3 (range 200-299)
Fig. 2. Simple replication
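A compact sketch of the simple replication step, using the same value ranges as the example above, is given below; the range-to-processor rule and the function name are assumptions made for illustration.

    def simple_replication(objects, num_procs, range_size=100):
        """Place each (oid, collection) pair on every processor that owns the
        value range of at least one of its elements; an object is replicated
        only when its elements fall into different ranges (cf. Fig. 2)."""
        partitions = [dict() for _ in range(num_procs)]
        for oid, coll in objects:
            for elem in coll:
                p = min(elem // range_size, num_procs - 1)   # 0-based processor index
                partitions[p][oid] = coll                    # at most one copy per processor
        return partitions

    # journals = [('a', [250, 75]), ('i', [80, 70])]
    # simple_replication(journals, 3)   # 'a' lands on processors 1 and 3, 'i' only on 1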
509 B2 to processor 2, etc). Partitions are created based on the largest element of each collection. For example, object p(123, 210); the first object in class B, is partitioned based on element 210, as element 210 is the largest element in the collection. Then, object p is placed on a certain partition, depending on the partition range. For example, if the first partition is ranging from the largest element 0 to 99, the second partition is ranging from 100 to 199, and the third partition is ranging from 200 to 299, then object p is placed in partition B3, and subsequently in processor 3. This is repeated for M1 objects of class B.
Procedure DividePartialBroadcast Begin Step 1 (divide): i. Divide class B based on the largest element in each collection. 2. For each partition of B (i = I, 2 .... , n) Place partition Bi to processor i End For Step 2 (partial broadcast): 3. Divide class A based on the smallest element in each collection. 4. For each partition of A (i = I, 2 . . . . . n) Broadcast partition Ai to processor i to n End For End Procedure
Fig. 3. Divide and Partial Broadcast Algorithm
The partial broadcast step can be described as follows. First, partition class A based on the smMlest element of each collection. Clearly, this partitioning m e t h o d is exactly the opposite of that in the divide step. Then for each partition Ai where i-=1 to n, broadcast partition Ai to processors i to n. This broadcasting technique is said to be partial, since the broadcasting goes down as the partition number goes up. For example, partition A1 is basically replicated to all processors, partition A2 is broadcast to processor 2 to n only, and so on. In regard to the load of each processor, the load of the last processor may be the heaviest, as it receives a full copy of class A and a portion of class B. T h e load goes down as class A is divided into smaller size (e.g., processor 1). Load balanced can be achieved by applying the same algorithm to each partition but with a reverse role of A and B; that is, divide A and partial broadcast B. It is beyond the scope of this paper to evaluate the Divide and Partial Broadcast partitioning method. This has been reserved for future work. Some preliminary results have been reported in [10].
3
Parallel SORT-MERGE Join Algorithm
The partitioning strategy for the parallel sort-merge collection-intersect join query algorithm is based on the Divide and Partial Broadcast technique. The use of Divide and Partial Broadcast is attractive for collection joins because of the nature of collections, where disjoint partitions without replication are often not achievable. After data partitioning is completed, each processor has its own data. The join operation can then be done independently. The overall query results are the union of the results from each processor. In the local joining process, each collection is sorted. Sorting is done within collections, not among collections. After the sorting process is completed, the merging process starts. We use a nested loop structure to compare the two collections of the two operand objects. Figure 4 shows the parallel sort-merge join algorithm for collection-intersect join queries.
Program Parallel-Sort-Merge-Collection-Intersect-Join
Begin
  Step 1 (data partitioning):
    Call DividePartialBroadcast
  Step 2 (local joining): In each processor
    a. Sort phase
       For each object a(c1) and b(c2) of class A and B, respectively
          Sort collections c1 and c2
       End For
    b. Merge phase
       For each object a(c1) of class A
          For each object b(c2) of class B
             Merge collections c1 and c2
             If TRUE Then
                Concatenate objects a and b into query result
             End If
          End For
       End For
End Program
Fig. 4. Parallel Sort-Merge Collection-Intersect Join Algorithm
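A hedged sketch of the local joining step of Figure 4 may help: the code below (our own illustration, not the paper's implementation) sorts each collection once and then merges two sorted collections to test the intersection predicate, which is exactly the condition checked inside the nested loops.

# Illustrative local sort-merge step for collection-intersect predicates.
def collections_intersect(c1, c2):
    a, b = sorted(c1), sorted(c2)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True              # at least one common element: predicate is TRUE
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False

def local_sort_merge_join(class_a, class_b):
    # class_a, class_b: dict OID -> collection of element OIDs
    return [(x, y) for x, ca in class_a.items()
                   for y, cb in class_b.items()
                   if collections_intersect(ca, cb)]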
4
Parallel HASH Join Algorithm
In this section we introduce a parallel join algorithm based on a hash method for collection-intersect join queries. Like the previous Parallel Sort-Merge algorithm, Parallel-Hash algorithm is also divided into data partitioning and local join phases. However, unlike the parallel sort-merge algorithm, data partitioning
for the parallel hash algorithm is available in two forms: Divide and Partial Broadcast and Simple Replication. Once data partitioning is complete, each processor has its own data, and hence the local join process can proceed. The local join process itself is divided into two steps: hash and probe. The hashing is carried out on one class, whereas the probing is performed on the other class. The hashing part basically runs through all elements of each collection in one class. The probing part is done in a similar way, but is applied to the other class. Figure 5 shows the pseudo-code of the parallel hash join algorithm for collection-intersect join queries.
Program Parallel-Hash-Collection-Intersect-Join
Begin
  Step 1 (data partitioning):
    Divide and Partial Broadcast version: Call DivideAndPartialBroadcast partitioning
    Simple Replication version: Call SimpleReplication partitioning
  Step 2 (local joining): In each processor
    a. Hash
       For each object a(c1) of class A
          Hash collection c1 into a hash table
       End For
    b. Probe
       For each object b(c2) of class B
          Hash and probe collection c2 into the hash table
          If there is any match Then
             Concatenate object b and the matched object a into query result
          End If
       End For
End Program
Fig. 5. Parallel Hash Collection-Intersect Join Algorithm
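For comparison, a minimal sketch of the hash and probe phases of Figure 5 follows (again our own illustrative code; the paper does not prescribe a hash-table API). The hash phase indexes every element of each class-A collection, and the probe phase looks up the elements of each class-B collection.

# Illustrative hash/probe step for the collection-intersect predicate.
from collections import defaultdict

def local_hash_join(class_a, class_b):
    table = defaultdict(set)
    for a_oid, coll in class_a.items():          # hash phase: all elements of class A
        for element in coll:
            table[element].add(a_oid)
    result = set()
    for b_oid, coll in class_b.items():          # probe phase: elements of class B
        for element in coll:
            for a_oid in table.get(element, ()):
                result.add((a_oid, b_oid))       # concatenate the matching objects
    return result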
5
Conclusions and Future Work
The need for join algorithms especially designed for collection-intersect join queries is clear, as collection-intersect join predicates normally require intermediate results to be generated before the final predicate results can be determined. This is certainly not optimal. In this paper, we presented two algorithms especially designed for collection-intersect join queries in object-oriented databases, namely the Parallel Sort-Merge and Parallel Hash algorithms. Data partitioning
methods, which create parallelism, are based on either simple replication or "divide and partial broadcast". Parallel Sort-Merge can make use of only the divide and partial broadcast method, whereas Parallel Hash can choose between the two partitioning methods. Once a data partitioning method is applied, the local join is carried out by either a sort-merge operator (in the case of the parallel sort-merge algorithm) or a hash function (in the case of parallel hash). The local join process is therefore quite straightforward, as each processor performs sequential sort-merge or hash operations. Hence, the critical element is the data partitioning method. Our future work includes evaluating the two data partitioning methods (and possibly other partitioning methods) for parallel collection-intersect join algorithms, since data partitioning plays an important role in the overall efficiency of parallel collection join algorithms.
References
1. Cattell, R.G.G. (ed.), The Object Database Standard: ODMG-93, Release 1.1, Morgan Kaufmann, 1994.
2. DeWitt, D.J. and Gray, J., "Parallel Database Systems: The Future of High Performance Database Systems", Communications of the ACM, 35(6), pp. 85-98, 1992.
3. Kim, K-C., "Parallelism in Object-Oriented Query Processing", Proceedings of the Sixth International Conference on Data Engineering, pp. 209-217, 1990.
4. Leung, C.H.C. and Ghogomu, H.T., "A High-Performance Parallel Database Architecture", Proceedings of the Seventh ACM International Conference on Supercomputing, pp. 377-386, Tokyo, 1993.
5. Leung, C.H.C. and Taniar, D., "Parallel Query Processing in Object-Oriented Database Systems", Australian Computer Science Communications, 17(2), pp. 119-131, 1995.
6. Mishra, P. and Eich, M.H., "Join Processing in Relational Databases", ACM Computing Surveys, 24(1), pp. 63-113, March 1992.
7. Taniar, D. and Rahayu, W., "Parallelization and Object-Orientation: A Database Processing Point of View", Proceedings of the Twenty-Fourth International Conference on Technology of Object-Oriented Languages and Systems TOOLS ASIA'97, Beijing, China, pp. 301-310, 1997.
8. Taniar, D. and Rahayu, J.W., "Parallel Collection-Equi Join Algorithms for Object-Oriented Databases", Proceedings of the International Database Engineering and Applications Symposium IDEAS'98, IEEE Computer Society Press, Cardiff, UK, July 1998.
9. Taniar, D. and Rahayu, J.W., "A Taxonomy for Object-Oriented Queries", a book chapter in Current Trends in Database Technology, Idea Group Publishing, in press (1998).
10. Taniar, D. and Rahayu, J.W., "Divide and Partial Broadcast Method for Parallel Collection Join Queries", High-Performance Computing and Networking, P. Sloot et al. (eds.), Lecture Notes in Computer Science 1401, Springer-Verlag, pp. 937-939, 1998.
11. Thakore, A.K. and Su, S.Y.W., "Performance Analysis of Parallel Object-Oriented Query Processing Algorithms", Distributed and Parallel Databases 2, pp. 59-100, 1994.
Exploiting Atomic Broadcast in Replicated Databases* Fernando Pedone**, Rachid Guerraoui, and André Schiper. Département d'Informatique, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Abstract. Database replication protocols have historically been built on top of distributed database systems, and have consequently been designed and implemented using distributed transactional mechanisms, such as atomic commitment. We argue in this paper that this approach is not always adequate to efficiently support database replication and that more suitable alternatives, such as atomic broadcast primitives, should be employed instead. More precisely, we show in this paper that fully replicated database systems, based on the deferred update replication model, have better throughput and response time if implemented with an atomic broadcast termination protocol than if implemented with atomic commitment.
1
Introduction
In the database context, replication techniques based on the deferred update model have received increasing attention in the past years [6]. According to the deferred update model, transactions are processed locally at one server (i.e., one replica manager) and, at commit time, are forwarded for certification to the other servers (i.e., the other replica managers). Deferred update replication techniques offer many advantages over immediate update techniques, which synchronise every transaction operation across all servers. Among these advantages, one may cite: (a) better performance, by gathering and propagating multiple updates together, and localising the execution at a single, possibly nearby, server (thus reducing the number of messages in the network), (b) more flexibility, by propagating the updates at a convenient time (e.g., during the next dial-up connection), (c) better support for fault tolerance, by simplifying server recovery (i.e., missing updates may be requested from other servers), and (d) lower deadlock rate, by eliminating distributed deadlocks [6]. Nevertheless, deferred update replication techniques have two limitations. Firstly, the termination protocol used to propagate the transactions to other servers for certification is usually an atomic commitment protocol (e.g., a 2PC * Research supported by the EPFL-ETHZ DRAGON project and OFES under contract number 95.0830, as part of the ESPRIT BROADCAST-WG (number 22455). ** On leave from Colégio Técnico Industrial, University of Rio Grande, Brazil
algorithm), whose cost directly impacts transaction response time. Secondly, the certification procedure that is usually performed at transaction termination time consists in aborting all conflicting transactions.¹ Such a certification procedure typically leads to a high abort rate if conflicts are frequent, and hence impacts transaction throughput. In the context of client-server distributed systems, most replication schemes (that guarantee strong replica consistency) are based on atomic broadcast communication [3, 9]. An atomic broadcast communication primitive enables messages to be sent to a set of processes, with the guarantee that the processes agree on the set of messages delivered, and on the order according to which the messages are delivered [7]. With this guarantee, consistency is trivially ensured if every operation on a replicated server is distributed to all replicas using atomic broadcast. Although several authors have mentioned the possibility of using atomic broadcast to support replication schemes in a database context (e.g., [3, 11]), little work has been done in that direction. In this paper, we show that atomic broadcast can successfully be used to improve the performance of the database deferred update replication technique. In particular, we show that, for different resilience scenarios, the deferred update replication technique based on atomic broadcast provides better transaction throughput and response time than a similar scheme based on atomic commitment. The paper is organised as follows. In Section 2, we present the replicated database model, and we recall the principle of the deferred update replication technique. In Section 3, we describe a variation of this replication technique based on atomic broadcast. In Section 4, we compare the transaction throughput and response time of deferred update replication based on atomic broadcast with the throughput and response time of deferred update replication based on classical atomic commitment. Section 5 summarises the main contributions of the paper and discusses some research perspectives.
2 The Deferred Update Technique
2.1 Replicated Database Model
We consider a replicated database system composed of a set of processes Π = {p1, p2, ..., pn}, each one executing on a different processor, without shared memory and without access to a common clock. Each process has a replica of the database and plays the role of a replica manager. Processes may fail by crashing, and can recover after a crash. We assume that processes do not behave maliciously (i.e., we exclude Byzantine failures). If a process p is able to execute requests at a certain time τ (i.e., p did not fail, or p has failed but recovered), we say that the 1 Note that we do not consider here replication protocols that allow replica divergence and require reconciliations. We focus on replication schemes that guarantee one-copy serializability [2].
process p is up at time τ. Otherwise the process p is said to be down at time τ. We say that a process p is correct if there is a time after which p is forever up.² Processes execute transactions, which are sequences of read and write operations followed by a commit or abort operation. Transactions are submitted by client processes executing on any processor (with or without a replica of the database). Our correctness criterion for transaction execution is one-copy serializability (1SR) [2]. Committed transactions are represented as Ti, Tj and Tk. The set Hp(τ) contains all transactions committed at process p by time τ. Non-committed transactions are represented as tm and tn. When a transaction tm commits, it is represented as Ti, where i indicates the serial order of tm relative to the already committed transactions in the database.
The Deferred Update
Principle
In the deferred update technique, a transaction passes through some well-defined states. It starts in the executing state, when its read and write operations are locally executed by the process where it initiated. When the transaction requests the commit, it passes to the committing state and is sent to the other processes. When received by a process, the transaction is also in the c o m m i t t i n g state. A transaction in a process remains in the c o m m i t t i n g state until its fate is known by the process (i.e., commit or abort). The executin~ and c o m m i t t i n g states are transitory states, whereas the commit and abort states are final states. To summarise the principle of the deferred update replication technique, we consider a particular transaction t,~ executing in some process Pi. Hereafter, the readset ( R S ) and writeset ( W S ) are the sets of d a t a items that a transaction reads and writes, respectively, during its execution. 1. Transaction tm is initiated and executed at process pi. Read and write operations request the appropriate locks, t h a t are locally granted according to the strict two phase locking rule. During this phase, transaction tm is in the executing state. 2. W h e n transaction t,~ requests the c o m m i t , tm passes to the c o m m i t t i n g state. Its read locks are released (the write locks are maintained by pi until t m reaches a final state), and process pi then triggers the termination protocol for t,~: the updates performed by tm (e.g., the redo log records), as well as its readset and writeset, are sent to all the other processes. 3. As part of the termination protocol, t,~ is certified by every process. T h e certification test guarantees one-copy serializability. The outcome of this test is the c o m m i t or abort of tin, according to concurrent transactions t h a t 2 The notion of forever up is a theoretical assumption to guarantee that correct processes do useful computation. This assumption prevents cases where processes fail and recover successively without being up enough time to make the system evolve. Forever up means, for example, from the beginning until the end of a termination protocol.
516 executed in other processes. We will come back to this issue in Sections 3 and 4. 4. If tm passes the certification test, all its write locks are requested and its updates are applied to the database. Hereafter, transaction tm is represented as ~ . Transactions in the execution state whose locks conflict with T/'s write locks are aborted for ~ ' s sake. During the time when a process p is down, p m a y miss some transactions by not participating in their termination protocol. However, as soon as process p is up again, p catches up with another process t h a t has seen all transactions in the system. This recovery procedure depends on the implementation of the termination protocol (see [10] for a detailed discussion).
3 3.1
A Replication
Scheme
Based
on Atomic
Broadcast
Atomic Broadcast
An atomic broadcast primitive enables to send messages to a set of processes, with the guarantee that all processes agree on the set of messages delivered and the order according to which the messages are delivered [7]. More precisely, atomic broadcast ensures that (i) if some process delivers message m then every correct process delivers m (uniform atomicity); (ii) no two processes deliver any two messages in different orders (order); and (iii) if a process broadcasts message m and does not fail, then every correct process eventually delivers m
(termination). It is i m p o r t a n t to notice that the properties of atomic broadcast are defined in terms of message delivery and not in terms of message reception. Typically, a process first receives a message, then performs some c o m p u t a t i o n to guarantee the atomic broadcast properties, and then delivers the message. The notion of delivery captures the concept of irrevocability (i.e., a process must not forget that it has delivered a message). In the following, we use the expression delivering a transaction tm to mean delivering the message that contains transaction tin. 3.2
A Termination
Protocol Based on Atomic Broadcast
We describe now the termination protocol for the deferred u p d a t e replication technique based on atomic broadcast. On delivering a c o m m i t t i n g transaction, each process certifies the transaction and, in case of success, c o m m i t s it. Once the transaction is delivered and certified successfully, it passes to the c o m m i t state and its writes are processed. In order for a process to certify a committing transaction tin, it has to know which transactions conflict with tin. The notion of conflict is defined by the precedes relation between transactions and the operations issued by the transactions. A transaction Tj precedes another transaction tin, denoted Tj --+ tin, if Tj c o m m i t t e d before tm started its execution. The relation Tj 74 t,~ (not Tj --+ tin)
517 means that Tj committed after tm started its execution. Based on these definitions, we say that a transaction Tj conflicts with tm if (1) Tj 74 tm and (2) t~n and Tj have conflicting operations. 3 More precisely, process pj performs the following steps after delivering tin. 1. Certification test. Process pj aborts t m if there is any committed transaction Tj that conflicts with tm and that updated data items read by tin. 2. Commitment. If tm is not aborted, it passes to the commit state (hereafter tm will be expressed as the committed transaction T~, where i represents the sequential order of transaction tin, and H;~ = H~j U {Ti}); process pj tries to grant all ~ ' s write locks, and processes its updates. There are two cases to consider. (a) There is a transaction tn in execution at pj whose read or write locks conflict with Ti's write locks. In this case tn is aborted by pj and, therefore, all tn's read and write locks are released. (b) There is a transaction tn, that is executed locally at pj and requested the commit, but is delivered after ~ . If tn executed locally at pj, it has all write locks on the data items it updated. If tn commits, its writes will supersede T/'s (i.e., the ones that overlap) and, in this case, Ti need neither request these write locks nor process the updates over the database. This is similar to Thomas' Write Rule [12]. If tn is later aborted (i.e., it does not pass the certification test), the database should be restored to a state without tn, for instance, by applying Ti's redo log entries to the database. Given a committing transaction t,~, the set of transactions {Tj I Tj 74 tin} (see certification test above) is determined as follows. Given a process p, let lastp(r) represent the last delivered and committed transaction in p at local time r. A transaction tm that starts its execution in process p at time r0 has to be checked at certiilcation time r e against tile transactions {~a.,t,(ro)+l,. 9 T~a,t,(T,)}. The information lastp(ro) is broadcast together with the writes ofTi, and lastp (l-p) is determined by each process p at certification time (see [10] for more details).
4
Atomic Broadcast
vs.
Atomic Commit
In this section we describe the deferred update technique based on atomic commit and compare it with the deferred update technique based on atomic broadcast using two criteria: transaction throughput and transaction response time. 4.1
The Termination Protocol Based on Atomic Commit
With an atomic commit implementation of the deferred update termination protocol, no total order knowledge is avMlable for the processes and a conflict has to be resolved by aborting all conflicting transactions. We describe below the termination protocol based on atomic commit. Two operations conflict if they are issued by different transactions, access the same data item and at least one of them is a write.
518 1. Certification test. A process can make a decision on the commit or abort of a transaction tm when it knows that all the other processes also know about tin. This requires an interaction among processes that ensures that when a transaction is certified, all committing transaction are taken into account. On certifying transaction t,,, process pj decides for its abort if there is a transaction t~ that does not precede t,~ (t~ 76 tin) and with which t,~ has conflicting operations. 2. Commitment. If t,~ is not aborted, it passes to the commit state and pj tries to grant all its write locks and processes its updates. Transactions that are being locally executed in pj and have read or write locks on data items updated by t,~ are aborted. In order to certify and commit transactions, the system has to detect transactions that executed concurrently in different processes. For example, in [1] this is done associating vector clocks timestamps with local events (e.g., execution or commit of a transaction). These timestamps have the property of being ordered if the events are causally related. Each time a process communicates to another, it sends its vector clock.
4.2
Transaction T h r o u g h p u t and R e s p o n s e T i m e
The values presented in Figure 1 were obtained by means of a simulation involving a database with 10000 data items replicated in 5 database processes. Each transaction has a readset and a writeset with 10 different data items and each data item has the same probability of being accessed. These experiments consider concurrent committing transactions (i.e., transactions that have already requested the commit). The results show that total order can indeed produce better results. In the following we compare implementations of atomic broadcast to implementations of atomic commitment. Due to lack of space, we do not describe how atomic broadcast can be implemented (see [10] for a detailed discussion). The measures shown in Figure 2 were obtained with Sparc 20 workstations (running Solaris 2.3), an Ethernet network using the T C P protocol, transactions (messages and logs) of 1024 bytes, and failure free runs (the most frequent ones in practice). The measures convey an average time to deliver a transaction message. These results show that atomic broadcast protocols have a better performance than atomic commit, when compared under the same degree of resilience.
5
Concluding
Remarks
In a recent paper, Cheriton and Skeen expressed their scepticism about the adequateness of atomic broadcast to support replicated databases [4]. The reasons that were raised were (1) atomic broadcast replication schemes consider individual operations, whereas in databases, operations are gathered inside transactions, (2) atomic broadcast does not guarantee uniform atomicity (e.g., a server
519 I
'
P
",
,,
i.
160
_ _
155 0.9
150 145
0.8
140 0.7
135 130
0.6
125 0.5
120 115
0.4 0.3
/ , I0
II0 , i 15 20 Transaclionsls
Fig. l. Throughput
i 25
1(15 30
3
4
5
6 7 Number of Nodes
8
9
10
Fig. 2. Response time
could deliver a message and crash, with the other servers not even receiving it), whereas uniform atomicity is fundamental in distributed databases (all processes, even those that have later crashed, must agree on whether or not to commit a transaction), and (3) atomic broadcast is usually considered in a crash-stop process model (i.e., a process that crashes never recovers), whereas in databases processes are supposed to recover after a crash. In this paper, we have considered an atomic broadcast primitive in a crash-recovery model that guarantees uniformity, i.e., if one process delivers a transaction, all the others also deliver it. We have shown how that primitive can efficiently be used to propagate transaction control information in a deferred update model of replication. The choice of that replication model was not casual, as some recent research has shown that immediate update models are not viable due to the nature of the applications or the side effects of synchronisation [6]. Indirectly, this paper points out the fact that existing database replication protocols are built on top of distributed database systems, using mechanisms that were developed to deal with distributed information, but not necessarily designed with replication in mind. A flagrant example of this is the atomic commitment mechanism, which we have shown can be favourably replaced by an atomic broadcast primitive, providing better throughput and response time. Our performance figures help demystify a common (mis)belief that total order atomic broadcast primitives are too expensive when compared to traditional transactional primitives, and so inappropriate for high performance systems. As already stated in [13], we also believe that replication can be successfully applied for performance, and this can be achieved without sacrificing consistency (e.g., as in [5]) or making semantic assumptions about the transactions (e.g., as in [6,
8]). It is important however to notice that the replication scheme based on atomic broadcast presented in this paper is based on two important assumptions: (1) processes certify and commit transactions sequentially, and (2) the database is fully replicated. The first assumption can be bypassed since concurrency between transactions that come from the Certifier is allowed if the transactions have disjoint write sets. The second assumption about a fully replicated
database is reasonable in traditional closed information systems, but is inappropriate for an open system with a large number of nodes or a large database. In such systems, replication can only be partial (i.e., processes store only subsets of the database), and the extent to which atomic broadcast can be useful in this context is an open issue.
References
1. D. Agrawal, A. El Abbadi, and R. Steinke. Epidemic algorithms in replicated databases. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Tucson, Arizona, 12-15 May 1997.
2. P. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
3. K. P. Birman and R. van Renesse. Reliable Distributed Computing with the ISIS Toolkit. IEEE Press, 1994.
4. D. Cheriton and D. Skeen. Understanding the Limitations of Causally and Totally Ordered Communication. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, Asheville, North Carolina, December 1993.
5. D. J. Dehnolino. Strategies and techniques for using Oracle 7 replication. Technical report, Oracle Corporation, 1995.
6. J. N. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996.
7. V. Hadzilacos and S. Toueg. Distributed Systems, 2nd ed., chapter 3, Fault-Tolerant Broadcasts and Related Problems. Addison-Wesley, 1993.
8. H. V. Jagadish, I. S. Mumick, and M. Rabinovich. Scalable versioning in distributed databases with commuting updates. In Proceedings of the Thirteenth International Conference on Data Engineering, April 7-11, 1997, Birmingham, U.K., pages 520-531. IEEE Computer Society Press, April 1997.
9. M. F. Kaashoek and A. S. Tanenbaum. Group communication in the Amoeba distributed operating system. In 11th International Conference on Distributed Computing Systems, pages 222-230, Washington, D.C., USA, May 1991. IEEE Computer Society Press.
10. F. Pedone, R. Guerraoui, and A. Schiper. Exploiting atomic broadcast in replicated databases. Technical Report No 98/258, Swiss Federal Institute of Technology, Lausanne, Switzerland, 1998. Available at http://lsewww.epfl.ch/pedone.
11. A. Schiper and M. Raynal. From group communication to transactions in distributed systems. Communications of the ACM, 39(4):84-87, April 1996.
12. R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. on Database Systems, 4(2):180-209, June 1979.
13. P. Triantafillou. High availability is not enough. In J.-F. Paris and H. G. Molina, editors, Proceedings of the Second Workshop on the Management of Replicated Data, pages 40-43, Monterey, California, November 1992. IEEE Computer Society Press.
The Hardware/Software Balancing Act for Information Retrieval on Symmetric Multiprocessors Zhihong Lu Kathryn S. McKinley Brendon Cahoon Department of Computer Science University of Massachusetts Amherst, MA 01003
{zlu, mckinley, cahoon}@cs.umass.edu
Abstract. Web search engines, such as AltaVista and Infoseek, handle tremendous loads by exploiting the parallelism implicit in their tasks and using symmetric multiprocessors to support their services. The web searching problem that they solve is a special case of the more general information retrieval (IR) problem of locating documents relevant to the information need of users. In this paper, we investigate how to exploit a symmetric multiprocessor to build high performance IR servers. Although the problem can be solved by throwing lots of CPU and disk resources at it, the important questions are how much of which hardware and what software structure is needed to effectively exploit hardware resources. We have found, to our surprise, that in some cases adding hardware degrades performance rather than improves it. We show that multiple threads are needed to fully utilize hardware resources. Our investigation is based on InQuery, a state-of-the-art full-text information retrieval engine.
1
Introduction
As information explodes across the Web and elsewhere, people increasingly depend on search engines to help them find information. Web searching is a special case of the more general information retrieval (IR) problem of locating documents relevant to the information need of users. In this paper, we investigate how to balance hardware and software resources to exploit a symmetric multiprocessor (SMP) architecture to build high performance IR servers. Our IR server is based on InQuery [2, 3], a state-of-the-art full-text information retrieval engine that is widely used in Web search engines, large libraries, companies, and governments such as Infoseek, the Library of Congress, the White House, West Publishing, and Lotus [5]. Our work is novel because it investigates a real, proven effective system under a variety of realistic workloads and hardware configurations on an SMP architecture. The previous research investigates either the IR system on massively parallel processing (MPP) architectures, or only a subset of the system on SMP architectures such as the disk system, or it compares the cost factors of SMP architectures with other architectures. (See [6]
for a more thorough comparison with the related work.) Our results provide insights for building high performance IR servers for searching the Web and other environments using a symmetric multiprocessor. The remainder of this paper is organized as follows. The next section describes the implementation of our parallel IR server and simulator. Section 3 presents results that demonstrate the system scalability and hardware/software balancing with respect to multiple threads, CPUs, and disks. Section 4 summarizes our results and concludes.
2 A Parallel Information Retrieval Server
This section describes the implementation of our parallel IR server and simulator. We begin with a brief description of the InQuery retrieval engine [2, 3, 5]. We next present the features we model and summarize our validation of the simulator against the multithreaded implementation.
InQuery Retrieval Engine. InQuery is one of the most powerful and advanced full-text information retrieval engines in commercial or government use today [5]. It uses an inference network model, which applies Bayesian inference networks to represent documents and queries, and views information retrieval as an inference or evidential reasoning process [2, 3]. The inference networks are implemented as inverted files. In this paper, we use "collection" to refer to a set of documents, and "database" to refer to an indexed collection. The InQuery server supports a range of IR commands such as query, document, and relevance feedback. The three basic IR commands we model are: query, summary, and document. A query command requests documents that match a set of terms. A query response consists of a list of top ranked document identifiers. A summary command consists of a set of document identifiers. A summary response includes the document titles and the first few sentences of the documents. A document command requests a document using its document identifier. The response includes the complete text of the document.
The Parallel IR Server. To investigate the balance between hardware and software in an IR system on a symmetric multiprocessor, we implemented a parallel IR server using InQuery as the retrieval engine. The parallel IR server exploits parallelism as follows: (1) it executes multiple IR commands in parallel by multithreading; and (2) it executes one command against multiple partitions of a collection in parallel. To expedite our investigation of possible system configurations, characteristics of IR collections, and the basic IR system performance, we implement a simulator with numerous system parameters, such as the number of CPUs, threads, disks, collection size, and query characteristics. Table 1 presents all the parameters and the values we use in our experiments. The simulation model is driven by empirical timing measurements from the actual system. For queries, summaries, and documents, we measure CPU and
Table 1. Experimental Parameters
  Query Number (QN): 1000
  Terms per Query (TPQ): average 2, shifted negative binomial distribution with (n, p, s) = (4, 0.8, 1)
  Query Term Frequency (QTF): observed distribution from real query sets
  Client Arrival Pattern (lambda): Poisson process; 6, 30, 60, 90, 120, 150, 180, 240, 300 requests per minute
  Collection Size (CSIZE): 1, 2, 4, 8, 16 GB
  Disk Number (DISK): 1, 2, 4, 8, 16
  CPU Number (CPU): 1, 2, 3, 4
  Thread Number (TH): 1, 2, 4, 8, 16, 32
3
Experiments
and
Results
This section explores how software and hardware configurations affect system scalability with respect to multiple threads, CPUs, and disks. We start with a base system that consists of one thread, CPU, and disk. This system is disk bound. We improve the performance of our IR server through better software: multithreading; and with additional hardware: CPUs and disks. We demonstrate the system scalability using two sets of experiments. In the first set of experiments, we explore the effects of threading on system scalability. In the second set of experiments, we explore system scalability by increasing the collection size from 1 GB to 16 GB. When multiple disks exist, we use a round-robin strategy to distribute the collection and its index over disks. We assume the client arrival rate is a Poisson process. Each client issues a query and waits for response. For each query, the server performs two operations: query evaluation and retrieving the corresponding summaries. Since users typically enter short queries, we experiment with a query set that consists of 1000 short queries, with an average of 2 terms per query that mimic those found in the query set down loaded from the Web server for searching the 103rd Congressional Record [4]. All experiments measure response time, CPU and disk utilization, and determine the largest arrival rate at which the system supports a response time under 10 seconds. We chose 10 seconds arbitrarily as our cutoff point for a reasonable response time.
3.1
Threading
This section examines how the software structure, i.e., number of threads, affects system scMability. Figure 1 illustrates how the average response time and
524 Q N - ~ P ~ Q T F CSIZEFPI_~ISI~ q TH 1 0 0 0 / 2 | O b s , ]1 G B / 1 1 4 ]Var!ed
QN F P q Q T F CSIZEFPLIDISIq T H
[1000121Obs. I 1 G B / 1 / ~
|Va.ried I
6O ~,
60
![2 threads-+--] ]4 threads -E~--.
I I
50
8
' ' ' ~ 4 thread 8 threads 16 threads 32 threads
50 40
, ~" ' - ' , ~t -e--~ ~::' -+--[ t~; -E:]-- # ~f.; -.x..... / ,~:
30 rr
/ I
g
10
0
/f":
0 20 40 60 80 100 120 140 160 180 Average Arrive) R a t e ( r e q u e s t s per minute)
20 40 60 80 100 120 140 160 180 per minute)
A v e r a g e Arrival R a t e ( r e q u e s t s
(a)
(b)
QN [rPQ[QTFCSIZEFPU~ISI ~ TH
QN F P ~ Q T F CSIZEFPUpISIq TII 1 1 0 0 0 | 2 | O b s . I 4 GB 1 2 . l 4~Varied,I
i~000|~(O~s.. I ~ GB i 2 ! 6O
i
,
i
[
,
4/Varied I ,
4 thread ~ 8 threads --P-.
50
16 t h r e a d s
40
-8--
i
[ [
g
/ /
~o
,
,
,
~ thread -,--
50 - 8threads -~-~16 threads -m-.-
32 th reads x ~ / / / S i
,
32threads .........1
,o
f
,
~ [
J
t
i ' ]
/' /
/
[
/
i
3O 20
/
ec g
0 ~
'
~
"
O 20 40 60 60 100 120 140 160 180 Average Arrival R a t e (requests per minute)
20 10 O
120
0 20 40 60 80 700 140 160 180 Average Arrival R a t e ( r e q u e s t s per minute)
(d) Configuration (a) A = 120 [=resource TH=21 T H = 4 CPU 34.4% 36.2% DISK 92.7% 97.30% (e) hardware
Configuration (b) h -~ 150 TH=41 TH~-8 72.6% 89.2% 51.1% 62.8% utilization at some
Configuration (c) Configuration (d) ,k = 189 ,k = 150 TH=S] TH----16 TII----S T H - - 1 6 ' 55.1% 56.7% ~/3 3% 85.2% 77.4% 79.4~o 68 0% 79.3% interested data points
Fig. 1. Performance as the number of threads increases
resource utilization changes as the number of threads increases with varying number of CPUs and disks. In all the configurations, the average response time improves significantly as the number of threads increases until either the disk or the CPU is overutilized (see Figure 1 (a) and (b)). Too few threads limits the system's ability to achieve its peak performance. For example in configuration (c), using 4 threads only supports 120 requests per minute for a response time under 10 seconds, while using 16 threads supports more than 180 requests per minute under the same hardware configuration. When either the CPU or the disk is a bottleneck, the system needs fewer threads to reach its peak performance. When CPUs and disks are well balanced (configuration (c) and (d)), the necessary number of threads is influenized more by the number of disks than the collection size. In both configuration (c) and configuration (d), the system achieves its peak performance using 16 threads. Additional threads do not bring further improvement.
525
3.2 Increasing the collection size This section examines system scalability and hardware balancing as the collection size increases from 1 GB to 16 GB. In order to examine different hardware configurations, we consider two disk configurations as the collection size increases: fixing the number of disks, and adding disks. We vary the number of CPUs in each disk configuration. Figure 2 illustrates the average response time and resource utilization when the collection is distributed over 16 disks which means each disk stores a database for 1/16 of the collection. Partitioning the collection over 16 disks illustrates when the system is CPU bound (see Figure 2(c)). Although performance degrades as the collection size increases, the degradation is closely related to the CPU utilization. With 1 CPU where the CPU is overutilized for 1 GB and 60 requests per minute, increasing the collection size from 1 GB to 16 GB decreases the largest arrival rate at which the system supports a response time under 10 seconds by a factor of 10 (see Figure 2(a)). With 4 CPUs where CPUs are overutilized for 1 GB and 180 requests per minute, the performance degrades much more gracefully (see Figure 2(b)). Increasing the collection size from 1 GB to 16 GB only decreases the largest arrival rate for a response time under 10 seconds by a factor of 3 for 4 CPUs. Figure 3 illustrates the average response time and the resource utilization when the number of disks varies with the collection size and each disk stores a database for 1 GB of data. The system produces response times better than 1 GB for 2 GB using 1 CPU and 8 GB using 4 CPUs. A single CPU system thus handles a 2 GB collection faster than a 1 GB collection and a 4 CPU system handles a 2, 4, or 8 GB collection faster than a 1 GB collection in our configuration. The performance improves because work related to retrieving summaries is distributed over the disks such that each disk handles less work, relieving the disk bottleneck. By examining the utilization of CPU and disks in Figure 3(c), we see that the performance improves until the CPUs are overutilized. In the example of the single CPU system, the CPU is overutilized for a 4 GB collection. For a 2 GB collection distributed over 2 disks, the system handles 27.8% more requests than for a 1 GB collection on 1 disk. By comparing Figure 2(c) and Figure 3(c), we find that the CPU utilization is more closely related to the number of disks rather than the collection size. We also find that adding disks degrades system performance when CPUs and disks are not well balanced. For example, for a 8 GB collection, partitioning over 8 disks using 4 CPUs results in 67.5% CPU utilization (see Figure 3(c)), while partitioning over 16 disks results in 88.0% CPU utilization (see Figure 2(c)) due to the additionM overhead to access each disk. In this configuration, a system with 16 disks performs worse than 8 disks because the CPUs are a bottleneck. 4 Conclusion In this paper, we investigate building a parallel information retrieval server using a symmetric multiprocessor to improve the system performance. Our results show that we need more than one thread to fully utilize hardware resources (4 to
526 QN TPQ 1000 2 60
QTF CSIZE CPU DISK TH O b s . V a r i e d V a r i e d 16 1 6 ' Ji'
;''
' /
I? ~B i } o
50 " i i
~=
40
ii
. . . . .
~'=
30 "H
ikG
:I
20 i!813 10
~
0
'
'
'
'
>~ <
J
!8GB!
30 i t
! J ~
i
I-: a. I n-
/
i
3o
/
2o o
/8 GB
i
i
"
,/i
i
:.
;
i
,:
12 G
.,,...)~ ....o
: : . -D
,
(b) C P U = 4
=
'
,
i~
]
I-:
=~
o. nr
i
w h e n ?~ = 120
F i g , 2, A v e r a g e r e s p o n s e t i m e a n d resource utilization for a collection dist r i b u t e d over 16 disks
,
J
, /1GB
'
/16 GB/
3o
~ l i /
20
to
i .2
<
,i
'
/
/
/
i
i ii i
J]
/87s
/
40'
i
::"
I
:
9 ' i ~~,,I ii
.
/
/:
::
. .
:
-
,~'4Gs
.. '/ ..-'"
-
0 = 20 40 60 80 100 120 140 160 180 Average Arrival Rate (requests per minute)
(b) C P U = 4
Size of Collection
( s i z e / 1 6 G B p e r disk) 1 G B 2 G B 4 G B 8 G B 16 G B 96.6% 97.4% 98.1% 98.6% 99.0% 21.6% 18.4% 14,7% 11.9% 9 . 8 % 3 8 . 9 % 4 9 . 7 % 68.0% 88.0% 9 3 . 5 % 3 4 . 9 % 3 7 . 8 % 4 1 . 1 % 42.6% 3 7 . 4 %
(c) R e s o u r c e u t i l i z a t i o n
' ,"
I f
so
J
...... ~.".::..~-:.~:::..~ =--: - - - _.w-:-,-~, 20 40 60 80 100 120 140 160 180 Average Arrival Rate (requests per minute)
Num. ReCPUssource CPU 1 DISK CPU 4 DISK
]
l
=
~4GB:
"
.,.
|
.~
-
!16GB ]
40
/
(a) C P U = 1 60
80
f
[;
, 2 GB
0 20 40 60 80 100 120 140 160 180 Average Arrival Rate (requests per minute)
(a) C P U = 1
oo~
';
' /1GB
/
/
'..r _=-:.-"
0
0 20 40 60 80 100 120 140 160 180 Average Arrival Rate (requests per minute)
'
"
i i ! x ~, ..-""
10
:' ,[
: ] .:4GB /
i
20
rr :I
'
i
I-ii
40
L
~
= i ' i16GB .
|i
~
E~
, t
g
60 50 Fi
o.
Q:
QTF CSIZE[ CPU ]DISK]TH O b s . V a r i e d Varied~/arie<~ 16
I GB
i4,GB /
i
QN TPQ 1000 2
Size of Collection
Num. ReCPUssource 1 GB CPU 36.2% 1 DISK]97.4% C P U 9.0% 4 DISK 97.4%
(1 G B p e r disk) 2 GB 4 GB 8 GB 75.3%'94,3% 98.1% 82.2% 43.5%'20.6% 19.0% 3 5 . 7 % , 6 7 . 5 % 82.9% 66.7%'57.1%
16 G B 99.0% 9.8~ 93.5% 37.4%
(c) R e s o u r c e u t i l i z u t i o n w h e n A = 120
Fig. 3. Average response time and resource utilization when the number of disks varies with the size of the collection
16 threads for the configurations we explored). We also show that adding hardware components can improve the performance, but these components must be well balanced, since the IR workload performs significant amounts of both I/O and CPU processing. Our results show that we can search more data with no loss in performance in many instances. Although performance eventually degrades as the collection size increases, it degrades very gracefully if we keep the hardware utilization balanced. Our results also show that system performance is more strongly related to the number of disks than to the collection size.
Acknowledgment. This material is based on work supported in part by the National Science Foundation, the Library of Congress and the Department of Commerce under cooperative agreement number EEC-9209623. Kathryn S. McKinley is supported by an NSF CAREER award CCR-9624209. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors. We thank Bruce Croft, Jamie Callan, James Allan, and Chris Small for their support of this work. We also thank all the developers of InQuery without whose efforts this work would not have been possible.
References
1. B. Cahoon and K. S. McKinley. Performance evaluation of a distributed architecture for information retrieval. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 110-118, Zurich, Switzerland, August 1996.
2. J. P. Callan, W. B. Croft, and J. Broglio. TREC and TIPSTER experiments with INQUERY. Information Processing & Management, 31(3):327-343, 1995.
3. J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert System Applications, Valencia, Spain, September 1992.
4. W. B. Croft, R. Cook, and D. Wilder. Providing government information on the internet: Experiences with THOMAS. In The Second International Conference on the Theory and Practice of Digital Libraries, Austin, TX, June 1995.
5. InQuery. http://ciir.cs.umass.edu/info/highlights.html.
6. Zhihong Lu, Kathryn S. McKinley, and Brendon Cahoon. The hardware/software balancing act for information retrieval on symmetric multiprocessors. Technical Report TR98-25, University of Massachusetts, Amherst, 1998.
The Enhancement of Semijoin Strategies in Distributed Query Optimization F. Najjar and Y. Slimani. Dept. Informatique, Faculté des Sciences de Tunis, Campus Universitaire, 1060 Tunis, Tunisie. yahya.slimani@fst.rnu.tn
Abstract. We investigate the problem of optimizing distributed queries by using semijoins in order to minimize the amount of data communication between sites. The problem is reduced to that of finding an optimal semijoin sequence that locally fully reduces the relations referenced in a general query graph before processing the join operations.
1
Introduction
The optimization of general queries in a distributed database system is an important and challenging research issue. The problem is to determine a sequence of database operations which process the query while minimizing some predetermined cost function. Join is a frequently used database operation. It is also the most expensive, specifically in a distributed database system; it may involve large communication costs when the relations are located at different sites. Hence, instead of performing joins in one step, semijoins [1] are performed first to reduce the size of the relations so as to minimize the data transmission cost for processing queries [2]. In the next step, joins are performed on the reduced relations. The join of two relations R and S on an attribute A is denoted by R ⋈A S, while the semijoin from R to S on an attribute A is denoted by S ⋉A R. Thus, S ⋉A R is defined as follows: (i) project R on the join attribute A (i.e., R(A)); (ii) ship R(A) to the site containing S; (iii) join S with R(A). The transmission cost of sending S to the site containing R for the join R ⋈A S can thus be reduced. There are two main methods to process a join operation between two relations. One is called the nondistributed join, where a join is performed between two unfragmented relations. The other is called the distributed join, where the join operation is performed between the fragments of relations. As pointed out in [5], the problem of query processing has been proved to be NP-hard. This fact justifies the necessity of resorting to heuristics. The remainder of this paper is organized as follows: preliminaries are given in Section 2. Section 3 defines the main characteristics of two semijoin-based query optimization heuristics; then, we present and discuss the join query optimization in a fragmented database. Finally, Section 5 concludes the paper.
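A small sketch may make the semijoin definition concrete (Python, ours; tuples are modelled as dictionaries and network shipping is of course elided): only the projection R(A) needs to travel to the site of S.

# Illustrative semijoin S <| R on attribute A.
def semijoin(S, R, attr):
    r_values = {t[attr] for t in R}          # (i) project R on A, (ii) ship R(A)
    return [t for t in S if t[attr] in r_values]   # (iii) keep only matching S tuples

R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"A": 1, "y": "s1"}, {"A": 3, "y": "s2"}]
print(semijoin(S, R, "A"))                   # only the S tuple with A = 1 survives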
2
Preliminaries
A join query graph can be denoted by a graph G = (V, E), where V is the set of relations and E is the set of edges. An edge (Ri, Rj) ∈ E if there exists a join predicate on some attribute of Ri and Rj. Without loss of generality, only cyclic query graphs are considered. In addition, all attributes are renamed in such a way that two join attributes have the same name if and only if they have a join predicate between them. The relations referenced in the query are assumed to be located at different sites. The query problem is simplified to the estimation of the data statistics and the optimization of the transmission order, so that the total data transmission is minimized. We denote by |S| the cardinality of a relation S. Let wA be the width of an attribute A and wRi be the width of a tuple of Ri. The size of the total amount of data in Ri can then be denoted by ||Ri|| = wRi |Ri|. For notational simplicity, we use |A| to denote the extant domain of the attribute A. Ri(A) denotes the set of distinct values of the attribute A appearing in Ri. For each semijoin Rj ⋉A Ri, a selectivity factor ρiA = |Ri(A)| / |A| is used to predict the reduction effect. After the execution of Rj ⋉A Ri, the size of Rj becomes ρiA ||Rj||. Moreover, it is important to verify that a semijoin Rj ⋉A Ri is profitable, i.e., that the cost incurred by this semijoin, wA |Ri(A)|, is less than the reduction it brings (called the benefit), which is computed in terms of avoided future transmission cost, ||Rj|| - ρiA ||Rj||. The profit is set to be (benefit - cost).
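Under these definitions, the profitability test for a candidate semijoin Rj ⋉A Ri can be written as below (an illustrative helper of our own; the benefit term follows the reduction formula above).

# Cost/benefit test for a candidate semijoin Rj <|_A Ri (profile-driven).
def semijoin_profit(w_A, distinct_Ri_A, domain_A, w_Rj, card_Rj):
    rho = distinct_Ri_A / domain_A           # selectivity factor rho_iA = |Ri(A)| / |A|
    cost = w_A * distinct_Ri_A               # shipping the projection Ri(A)
    benefit = w_Rj * card_Rj * (1 - rho)     # avoided future transmission of Rj
    return benefit - cost                    # the semijoin is profitable when this is positive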
3
Nondistributed Join Method
In this section, we propose two heuristics. The first, namely one-phase Parallel Semi Joins, 1-PSJ, determines a set of parallel semijoins. The second, namely Hybrid A* heuristic, HA*, finds a general sequence of semijoins, which is a combination of parallel and sequential semijoins.
3.1
1-PSJ
We say that Ri is fully locally reduced if {j / Ri ⋉A Rj is feasible}. We denote by RDi = {j / Ri ⋉
3.2
Hybrid A*
The well-known A* algorithm can be used to determine a sequence of semijoin reducers [6] for distributed query processing. The key issue of the A* algorithm is to derive a heuristic function which can intelligently guide the search for a sequence of semijoins. In the A* algorithm, the search is controlled by a heuristic function f with two arguments: the cost of reaching p from the initial node (the original query graph with its corresponding profile), and the cost of reaching the goal node from p. Accordingly, f(p) = g(p) + h(p), where g(p) estimates the minimum cost of a trajectory from the initial state to p, and h(p) estimates the minimum cost from p to the goal state. The node p chosen for expansion (i.e., whose immediate successors will be generated) is the one which has the smallest value f(p) among all generated nodes that have not been expanded so far. In order to derive a general sequence of semijoins, for a node p, g(p) = g(q) + Σ_{j ∈ APi} cost(Ri ⋉ Rj) + ||Ri'||, where p is an immediate successor of q, and
Ri' denotes the resulting relation after performing the applicable semijoins on the original relation Ri. The function h is defined as the sum of the sizes of the remaining relations such that the effect of the total reduction (with respect to neighboring relations) gives the best estimation: h(p) = Σ_k (Σ_j cost(Rk ⋉ Rj) + ||Rk'||), where Rk is not yet reduced. Example 1: Consider the following join query: Select A, D from R1, R2, R3, R4 where (R1.A = R3.A) and (R1.B = R2.B) and (R2.C = R3.C) and (R3.D =
R4.D). We suppose that R1, R2, R3 and R4 are located at different sites. The corresponding query graph and profile are given in Figure 1 and Table 1, respectively.
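Before working through the example, the skeleton below shows the kind of best-first search that HA* performs, with f = g + h guiding node expansion (illustrative only; cost, estimate_h and expand are placeholders standing in for the g and h functions defined above, not the paper's implementation).

# Minimal best-first (A*-style) search over semijoin sequences.
import heapq

def hybrid_a_star(initial_state, is_goal, expand, estimate_h):
    open_list = [(estimate_h(initial_state), 0, initial_state, [])]
    counter = 1                                   # tie-breaker so states are never compared
    while open_list:
        f, _, state, seq = heapq.heappop(open_list)
        if is_goal(state):
            return seq                            # general sequence of semijoins found
        for semijoin, cost, next_state in expand(state):
            g = (f - estimate_h(state)) + cost    # recover g(p) and add the semijoin's cost
            heapq.heappush(open_list,
                           (g + estimate_h(next_state), counter, next_state, seq + [semijoin]))
            counter += 1
    return None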
Fig. 1. Join query graph for example 1.
1-PSJ finds the set {R1 ⋉ R2, R3 ⋉ R1, R3 ⋉ R4}, with a total transmission cost to the final site R2 of 18,370. HA*, in contrast, finds the general sequence of semijoins R1 ⋉ R2, {R3 ⋉ R1', R3 ⋉ R4}, with a cost of 16,681. To gain more insight into the performance of the 1-PSJ and HA* heuristics, simulations were carried out on different queries for n (ranging from 5 to 12) relations
Table 1. Profile table for Example 1.
Ri    |Ri|   X   wX   |Ri(X)|
R1    1190   A   2    830
             B   1    850
R2    3440   B   1    850
             C   3    900
R3    2152   A   2    800
             C   3    900
             D   1    720
R4    3100   D   1    700
involved in each query. For comparison purposes, in addition to 1-PSJ and HA*, we also apply the original method, OM, which consists of sending all the relations directly to the final site. In Figure 2, it is apparent that as the number of relations increases (n > 8), the HA* heuristic becomes better than 1-PSJ. When n ≥ 9, HA* outperforms the other heuristics significantly (the cost reduction is about 45%).
Fig. 2. Impact of the number of relations on transmission cost.
4
Distributed Join Method
A relation can be horizontally partitioned (declustered) into disjoint subsets, called fragments, distributed across several sites. We associate with each fragment its qualification, which is expressed by a predicate describing the common properties of the tuples in that fragment. One major issue in developing a horizontal fragmentation technique is determining the criteria to be used to guide the fragmentation. A major difficulty is that there are no known significant criteria that can be used to partition relations horizontally. In the context of our study, we suggest a bipartition of each relation Ri, such that a relation is divided into mutually exclusive fragments. To represent the
fragments more specifically, we propose the following formula: |Ri| = a|Ri| + (1 - a)|Ri| = |Ri1| + |Ri2|, where a is a rational number ranging from 0 to 1 and Ri1, Ri2 are the fragments of Ri. The above fragmentation satisfies the three conditions [2], completeness, reconstruction, and disjointness, which ensure a correct horizontal fragmentation. Note that bipartitioning can be applied to a relation repeatedly. To estimate the cost of an execution plan, the database profile may include the following statistics: |Rij| denotes the cardinality of fragment j of relation Ri, and |Rij(A)| represents the number of distinct values of attribute A in that fragment. When semijoins are used in a fragmented database system, they have to be performed in a relation-to-fragment manner, so that they do not cause the elimination of contributive tuples. At each site containing a fragment Rjk to be reduced, we proceed as follows: (i) every fragment of Ri¹ must participate in reducing Rjk; so, find the optimal set of applicable semijoins and send the values of the semijoin attributes from each fragment of Ri to Rjk; (ii) merge the fragments of Ri before eliminating any tuple of Rjk.
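The bipartition itself is straightforward to sketch (illustrative helper, ours): the tuples of Ri are split into two disjoint fragments of a|Ri| and (1 - a)|Ri| tuples.

# Sketch of the a-factor bipartition of a relation (list of tuples).
def bipartition(relation, a):
    cut = int(round(a * len(relation)))
    return relation[:cut], relation[cut:]    # Ri1, Ri2: complete, disjoint, and reconstructible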
Rij        |Rij|       X    Wx    |Rij(X)|
R11/R12    119/1071    A    2     90/809
                       B    1     91/818
R21/R22    344/3096    B    1     91/818
                       C    3     97/872
R31/R32    215/1936    A    2     87/782
                       C    3     97/872
                       D    1     79/800
R41/R42    310/2790    D    1     76/683
The optimal general sequence is: {R12 ⋉ R22, R12 ⋉ (R31 ∪ R32)}, {R21 ⋉ R11, R21 ⋉ (R31 ∪ R32)}, R42 ⋉ R31, {R32 ⋉ (R11 ∪ R12)}, {R32 ⋉ (R21 ∪ R22), R32 ⋉ R42}; it incurs a cost of 13,959, which is less than that of the nondistributed join method. A general conclusion is that the communication cost is substantially reduced if we use a "good fragmentation". In the absence of a formal definition of a good fragmentation, we can approximate it by the a-factor. In effect, we have noted that a good choice of this criterion leads to a good fragmentation. Figure 2 shows the effect of the a-factor on the communication cost for a given query in which the number of relations is constant and a is varied. (1) Ri such that Rj ⋉ Ri is applied in the query.
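To make the role of the a-factor concrete, here is a small sketch of ours (not from the paper); it simply splits a relation's profile statistics according to a given a, under our own simplifying assumption that distinct attribute values split roughly in proportion to the cardinalities.

  #include <stdio.h>

  /* Illustrative bipartition of a relation profile by an a-factor in (0,1):
   * |Ri1| = a*|Ri|, |Ri2| = (1-a)*|Ri|.  Proportional splitting of distinct
   * values is an approximation of ours, not the paper's estimator.          */
  struct frag_profile { long card; long distinct; };

  static void bipartition(long card, long distinct, double a,
                          struct frag_profile *f1, struct frag_profile *f2) {
      f1->card = (long)(a * card);
      f2->card = card - f1->card;            /* completeness + disjointness */
      f1->distinct = (long)(a * distinct);
      f2->distinct = distinct - f1->distinct;
  }

  int main(void) {
      struct frag_profile r1, r2;
      bipartition(2152, 800, 0.1, &r1, &r2); /* values in the style of R3   */
      printf("fragment 1: card=%ld distinct=%ld\n", r1.card, r1.distinct);
      printf("fragment 2: card=%ld distinct=%ld\n", r2.card, r2.distinct);
      return 0;
  }

With a = 0.1 the cardinalities come out close to the 215/1936 split shown for R31/R32 in Table 2, which is the kind of skewed bipartition the a-factor is meant to control.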
5 Conclusion
In this paper, we proposed two distributed query processing strategies for join queries that use the semijoin as a query processing tactic. For both strategies we presented new heuristics that "intelligently" guide the search and return a general reducer sequence of semijoins. For the distributed join strategy, we also proposed a technique to bipartition each relation assuming a fixed a-factor.
References
1. P.A. Bernstein, N. Goodman, E. Wong, C. Reeve, and J.B. Rothnie. Query Processing in a System for Distributed Databases (SDD-1). ACM TODS, vol. 6(4), Dec. 1981, pp. 602-625.
2. S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1985.
3. M-S. Chen and P.S. Yu. Combining Join and Semijoin Operations for Distributed Query Processing. IEEE TKDE, vol. 5(3), Jun. 1993, pp. 534-542.
4. F. Najjar, Y. Slimani, S. Tlili, and J. Boughizane. Heuristics to determine a general sequence of semijoins in distributed query processing. Proc. of the 9th IASTED Int. Conf., PDCS, Washington D.C. (USA), Oct. 1997, pp. 354-359.
5. C. Wang and M-S. Chen. On the Complexity of Distributed Query Optimization. IEEE TKDE, vol. 8(4), Aug. 1996, pp. 650-662.
6. H. Yoo and S. Lafortune. An Intelligent Search Method for Query Optimization by Semijoins. IEEE TKDE, vol. 1(2), Jun. 1989, pp. 226-237.
Virtual Time Synchronization in Distributed Database Systems Using a Cluster of Workstations*. Azzedine Boukerche 1, Timothy E. LeMaster 2, Sajal K. Das 1, and Ajoy Datta 2. 1 Department of Computer Sciences, University of North Texas, Denton, TX, USA. 2 Department of Computer Science, University of Nevada, Las Vegas, NV, USA
Abstract. Virtual Time is the fundamental concept in optimistic synchronization schemes. In this paper, we investigate the efficiency of a distributed database management system, synchronized by virtual time, using a LAN-connected collection of SPARC workstations. To the best of our knowledge, there is little data reporting VT performance in DDBMS. The experimental results are compared with a multiversion timestamp ordering (MVTO) concurrency control method, a widely used technique in distributed databases, and approximately a 25-30% reduction in the response time is observed when using the VT synchronization scheme. Our results demonstrate that the VT synchronization approach is a viable alternative concurrency control method for distributed database systems and outperforms the MVTO method.
1 Introduction
Virtual Time is the fundamental concept in optimistic synchronization schemes. Many applications for virtual time have been proposed, including parallel simulations, distributed systems, and VLSI design problems. The Time Warp (TW) mechanism [4, 5] is a technique to implement virtual time by allowing application programs to cancel previously scheduled events. While conservative synchronization methods [3] rely on blocking to avoid violation of dependency constraints, Time Warp relies on detecting synchronization errors at run time and on recovery using rollback mechanisms. The principal advantages of Time Warp over more conventional blocking-based synchronization protocols are that Time Warp offers the potential for greater exploitation of concurrency and parallelism and, more importantly, greater transparency of the synchronization mechanism to the programmer. The latter is due to the fact that Time Warp is less reliant on application-specific information regarding which computations depend on one another. Despite the fact that research on virtual time has been ongoing for the past decade, there is little data reporting the performance of distributed database systems
* Part of this research is supported by Texas Advanced Technology Program grant TATP-003594031
(DDBMS) synchronized by virtual time. Jefferson [4, 5] has suggested that virtual time could be used to synchronize DDBMS. Therefore, a question of considerable pragmatic interest is whether virtual time is really a viable alternative concurrency control method for distributed database systems and whether it can provide good performance. In this paper, our primary goal is to describe an implementation of virtual time synchronization in distributed database management systems and to study its performance. This is an important step in determining the practicality of virtual time synchronized database systems. We then compare the performance of our method with a Multiversion Timestamp Ordering (MVTO) concurrency control method [1]. The empirical results demonstrate that the VT approach outperforms the MVTO one.
2 Virtual Time Database Synchronization
We view a database system as a collection of objects (sequential processes) that execute concurrently and communicate through timestamped messages. Each object is capable of both computation and communication, and it has an associated virtual clock which acts as the simulation clock for that object. Now, let us examine the components and the structure of a database management system. There are four major components of a DDBMS at each site: transactions, the transaction manager (TM), the database manager (DM), and data. Transactions are generated by users. The transaction manager supervises the interaction between users and the DDBMS, while the database manager controls access to the data and provides an interface between the user (via the query) and the file system. It is also responsible for controlling the simultaneous use of the database and for maintaining its integrity and security. There is no formal distinction between transaction and data objects. While data objects respond only to Read/Write messages, transaction objects receive messages directly from external users. Objects communicate with each other through Request and Response primitives which send messages. A Request either reads or writes a data value, while a Response returns a data value. After an object issues a Request, it blocks to await a response. The possibility of deadlock exists since two objects within a transaction could request data values from each other at the same time; each would block, waiting for the other to respond to the Request. The virtual time scheme prevents deadlock between transactions, since all transactions are executed in timestamp order and all transaction timestamps are unique. Virtual time can be viewed as a global, one-dimensional temporal coordinate system imposed on a distributed system to measure computational progress and define synchronization [4, 5]. There are two important properties of virtual time. First, it may increase or decrease with respect to real time; there is no restriction on the amount or frequency by which it changes. Second, the virtual times generated must form a total (or partial) ordering under the relation "less than."
A manager (or process) may send a message to any other manager at any time. There are no restrictions placed upon potential communication paths between processes, and no fixed or static declarations are required before execution begins. The communication medium is assumed to be reliable, but messages are not required to arrive in the order they are sent. Indeed, occasionally different transactions cause an unordered message to be received by an object. When this occurs, rollback is used for synchronization. However, instead of rolling back the entire transaction, only certain actions within the transaction are rolled back: those that are causally connected to the errant message. This is where the virtual time synchronization approach diverges from the traditional one, in which the entire transaction is either aborted and retried or is blocked. Virtual time incorporates several different data structures and control algorithms to produce optimistic processing. The most noteworthy are the Time Warp based mechanism to implement virtual time, and the use of rollback as the fundamental synchronizer in the system. The algorithms include the transaction and database managers: the former is responsible for generating and submitting requests while the latter processes the requests. The rollback mechanism is used by both managers. In this paper, we compare VT with the MVTO version based on Bernstein's algorithm [1]. A complete description of VT and MVTO can be found in [2].
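As a rough illustration of the rollback idea described above (this sketch is ours, not the implementation of [2], and it omits anti-messages and full state saving), an object can keep a small history of processed requests; when a straggler arrives, i.e., a message whose timestamp is smaller than the local virtual time, only the suffix of actions that are causally after it is undone and reprocessed.

  #include <stdio.h>

  #define HIST 64

  struct msg { long ts; int value; };

  struct object {
      long lvt;                   /* local virtual time                     */
      struct msg hist[HIST];      /* processed requests, in timestamp order */
      int n;
  };

  static void process(struct object *o, struct msg m) {  /* placeholder work */
      o->hist[o->n++] = m;
      o->lvt = m.ts;
  }

  static void receive(struct object *o, struct msg m) {
      if (m.ts >= o->lvt) { process(o, m); return; }
      /* straggler: roll back every processed request with timestamp >= m.ts */
      int keep = o->n;
      while (keep > 0 && o->hist[keep - 1].ts >= m.ts) keep--;
      struct msg redo[HIST];
      int nredo = o->n - keep;
      for (int i = 0; i < nredo; i++) redo[i] = o->hist[keep + i];
      o->n = keep;
      o->lvt = keep ? o->hist[keep - 1].ts : 0;
      process(o, m);                               /* reprocess in order     */
      for (int i = 0; i < nredo; i++) process(o, redo[i]);
  }

  int main(void) {
      struct object o = { 0 };
      receive(&o, (struct msg){10, 1});
      receive(&o, (struct msg){20, 2});
      receive(&o, (struct msg){15, 3});  /* straggler: only ts=20 is redone  */
      printf("lvt=%ld, history length=%d\n", o.lvt, o.n);
      return 0;
  }

The point of the sketch is that, unlike aborting or blocking an entire transaction, only the causally affected actions are re-executed.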
3 Simulation Experiments
All experiments were conducted on a 12-processor, bus-based, message-passing cluster of Sun Microsystems SPARCstation 1 workstations (1). The SPARCstations were connected into a local area network (LAN) via Ethernet. It should be pointed out that in distributed database systems there are more structures than in distributed simulation environments; see [2] for more details. A simple database model was used. Each database manager was responsible for a specified number of items. Whenever a transaction requested Read or Write access to an item, it was delayed 50 milliseconds to simulate the physical I/O operations. The goals of our experiments are to study the viability of virtual time as a concurrency control method in database systems, and to compare its relative performance to the multiversion ordering synchronization scheme. In our model, we made use of several database/transaction sizes and inter-transaction delays, and we varied the number of processors from 4 to 12. The experimental data which follow were obtained by averaging several trial runs. Every test run required between 8 and 50 minutes of actual time to complete all experiments; the average length of the experiments was 24 minutes. Space limitations preclude us from including all of our experiments. Hence, we present our results below in the form of graphs of response time and throughput. The response time is defined as the average amount of wall clock time per processor required for a transaction manager to successfully submit a
(1) We use a distributed memory environment.
fixed number of transactions. The throughput is defined as K * Nt / T_resp, where T_resp represents the total response time of all processors, K is the number of processors used, and Nt is the number of transactions.
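As a quick worked example with illustrative numbers of our own (not taken from the experiments): with K = 8 processors, Nt = 120 transactions and a total response time T_resp = 240 seconds, the throughput would be 8 * 120 / 240 = 4 transactions per second.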
Fig. 1. Response Time Vs. Database Size
Fig. 2. Response Time Vs. Number of Processors
In our experiments, we analyzed the effect of database size on the performance (response time and throughput) of the DDBMS synchronized by virtual time. We made use of 8 processors and varied the database size from 1 to 20. The transaction size was set to 7% of the database size, all Read requests were updated with a probability of 25%, and the inter-transaction delay was set to zero. In Fig. 1, we present the response time as a function of database size for both synchronizations, VT and MVTO. The results show that MVTO and VT exhibit about the same response time when the database size is less than 12. When we increase the database size from 12 to 20, VT responds faster than MVTO; we observe about a 25% reduction in the response time for a database size of 20. In the next set of experiments, in order to reduce the response time of the DDBMS, we ran some preliminary tests in which we varied the number of processors from 4 to 12. The results obtained for the response time are shown in Figure 2. As expected, we also observe that as we increase the number of processors, the response time increases for both synchronization schemes. Furthermore, we see that VT performs better than MVTO; for example, we see a reduction of the response time of 25-30% with the VT scheme when compared to the MVTO one. Similar results were obtained for the throughput metric [2].
References
1. P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
2. A. Boukerche, T. LeMaster, Sajal K. Das, and A.K. Datta, "Virtual Time Synchronization in Distributed Systems", Tech. Rep. TR-98, University of North Texas.
3. A. Boukerche and C. Tropper, "Parallel Simulation on the Hypercube Multiprocessor", Distributed Computing, Springer Verlag, 1993.
4. D. Jefferson and A. Motro, "The Time Warp Mechanism for Database Concurrency Control," International Conference on Data Engineering, 1986, pp. 474-481.
5. D.R. Jefferson, "Virtual Time," ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, 1985, pp. 404-425.
Load Balancing and Processor Assignment Statements. C. Rodríguez, F. Sande, C. León, I. Coloma, A. Delgado. Dpto. Estadística, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
Abstract. This paper presents extensions of processor assignment statements, the language constructs to support them, and the methodology for their implementation on distributed memory computers. The data exchange produced by these algorithms follows a generalization of the concept of a hypercubic communication pattern. Due to the efficiency of the introduced weighted dynamic hypercubic communication patterns, the performance of programs built using the methodology is comparable to that obtained using classical message passing programs. The concept of a balanced processor assignment statement eases the portability of PRAM algorithms to multicomputers.
1 Balanced Processor Assignment Statements
Algorithms for the PRAM [1] model are usually described using a parallel statement like:
  for all x in m..n in parallel do instruction(x)
It is essential to differentiate between the concepts of processor activation and processor assignment statements. The for all construction is used to denote processor assignment statements, which are usually charged with constant time, although authors avoid explaining this point. Load balancing is a fundamental issue in parallel programming. It is straightforward to implement nested processor assignment statements using cyclic and block distribution mappings. However, the division of the processors among the tasks in groups of equal sizes, provided by cyclic and block processor distribution policies, may produce an unfair balance, especially for irregular problems. Any divide and conquer algorithm can be used to illustrate this idea: since the complexities of the subproblems resulting from the division of the problem may be considerably different, splitting the current group of processors into groups of equal size will likely lead to work load imbalance. One way to solve this inefficiency is to extend the language with a statement allowing the programmer to modify the processor distribution policy. This can be achieved by introducing a simple variant of the for all statement that provides extra heuristic information w(x), regarding the suitable number of processors required for each task instruction(x):
  for all x in a..b with weight w(x) in parallel do instruction(x)
The effect of this new "balanced parallel assignment statement" is to adjust the number of processors according to the weights attached to each instance of
instruction(x). An example of this kind of construct can be found in the La Laguna C system (llc). The La Laguna C tool consists of a set of C macros and functions extending the library model for message passing (PVM, MPI, Inmos C, etc.). It provides not only load balancing but also many other features such as processor virtualization, nested parallelism, and support for collective operations. Let us first consider the implementation in llc of the well-known quicksort algorithm. A first approach to the solution consists in using a classical processor assignment statement; in this case, the workload assigned to each processor depends on the selected pivot. The implementation of this solution in llc is what appears in Figure 1, substituting in line 6 the call to PARWEIGHTVIRTUAL by a call to the macro PARVIRTUAL.
1 void qs(int first, int last) {
2   int w1, w2, size1, size2, i, j;
3   if (first < last) {
4     partition(&i, &j, first, last);
5     size1 = w1 = (j - first + 1); size2 = w2 = (last - i + 1);
6     PARWEIGHTVIRTUAL(w1, qs(first, j), (A + first), size1, w2, qs(i, last), (A + i), size2);
7   }
8 }
Fig. 1. Balancing the size of the groups according to the load
Figure 1 shows the use of one of the llc balanced parallel processor assignment statements, the macro PARWEIGHTVIRTUAL. When the macro PARWEIGHTVIRTUAL at line 6 is executed, the set of available processors is divided into two subsets, one executing the first call (second parameter) and the other executing the second one (sixth parameter). The third and fourth parameters, A + first and size1, state that the result of the call to the function qs(first, j) on the first segment consists of the size1 integers beginning at A + first. The two last parameters, A + i and size2, describe the fact that the result of the second call consists of the size2 integers starting at A + i. The first and fifth parameters of the call (w1 and w2) provide heuristic information about the number of processors required by the corresponding sorting task; in this case they are simply the sizes of the subintervals to sort. At the end of the parallel call, the two sets of processors exchange their results and rejoin to form the previous set. To increase performance, instead of using virtualization, the programmer can take control of the number of available processors through the use of the llc variable NUMPROCESSORS. The former example is a Replicated problem. We define a Replicated problem (or parallel algorithm) as one in which the initial input data have to be replicated in all the processors and the solution is also required in all the processors.
 1 void qs(int *size) {
 2   int totalSize, i, j, pivot, sum, s;
 3   if (NUMPROCESSORS > 1) {
 4     BALANCE(&totalSize, (*size), A, decision, getWeight);
 5     if (totalSize > NUMPROCESSORS) {
 6       REDUCEBYADD(sum, A[*size / 2]);
 7       pivot = sum / NUMPROCESSORS;
 9       PAR(part(*size - 1, pivot, &i, &j, &s), A + i, s, revPart(*size - 1, pivot, &i, &j, &s), A + i, s);
10       *size = j + 1 + s;
11       SPLIT(qs(size), qs(size)); }
12     else qsSeq(0, (*size - 1)); }
13   else qsSeq(0, (*size - 1));
14 }
Fig. 2. Distributed quicksort
The example in Figure 2 presents the use of processor assignment statements in a Distributed problem. A Distributed problem is one in which both the input data and the results are distributed among the processors. The initial array is distributed among the processors and the goal is to have the items sorted according to the processor name. Through the all-to-all reduction REDUCEBYADD, the common pivot is chosen in lines 6 and 7. The current set of processors is divided into two subsets of equal size using the PAR macro in line 9: one executes the classical partition procedure part, and the other executes a reversed version of partition, revPart, that stores the elements greater than the pivot in the first part of the array (from 0 to j) and stores the s items smaller than the pivot between i and size - 1. The results are interchanged and the size of the array is updated in line 10. The recursive call to the macro SPLIT divides the group using the same processor distribution policy as the former call to the PAR macro. The work load is equilibrated by the call to the macro BALANCE at line 4. This La Laguna C macro generalizes the dimension exchange algorithm [4] for hypercubes. The parameters decision and getWeight are provided by the user: when called, getWeight returns a measure of the difficulty of the problem, and, from the input weights of two arrays of problems, decision decides the number of items in the array of problems exceeding the ideal. Furthermore, BALANCE returns in totalSize the sum of the sizes of the arrays in the current group.
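The core of a weighted processor assignment is simply a proportional split of the current group. The following is our own sketch (it is not part of llc) of how a group of P processors could be divided between two tasks with weights w1 and w2, and of how each processor decides which side it belongs to; the rounding rule and the guarantee of non-empty subsets are our own choices.

  #include <stdio.h>

  /* The first p1 processors of the current group execute task 1,
   * the remaining P - p1 execute task 2.                                  */
  static int split_point(int P, int w1, int w2) {
      int p1 = (int)((long)P * w1 / (w1 + w2));
      if (p1 < 1) p1 = 1;                  /* keep both subsets non-empty  */
      if (p1 > P - 1) p1 = P - 1;
      return p1;
  }

  int main(void) {
      int P = 8, w1 = 300, w2 = 100;       /* e.g. sub-array sizes as weights */
      int p1 = split_point(P, w1, w2);
      for (int name = 0; name < P; name++)
          printf("processor %d -> task %d\n", name, name < p1 ? 1 : 2);
      return 0;
  }

With equal weights this degenerates into the even split performed by PAR; with skewed weights, as in the quicksort example, the larger subproblem receives proportionally more processors.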
2 Weighted Dynamic Hypercube Topologies
When nested unbalanced or balanced assignment statements are executed on a distributed system, each processor computes its partners in the other subsets created. Let us consider a set of nested balanced processor assignment statements:
for all x1 in 1..n1 with weight w1(x1) in parallel do
  for all x2 in 1..n2 with weight w2(x2) in parallel do
    ...
      for all xm in 1..nm with weight wm(xm) in parallel do
        instruction(x1, x2, ..., xm)
The resulting communication pattern becomes a sort of distorted dynamic hypercube. Each nested balanced processor assignment statement creates a "new dimension" in the current set H of available processors. In the j-th nested call, the set H is partitioned into a collection of nj subsets F = {H1, ..., Hnj} whose cardinalities are distributed according to the weights of the statements. Each processor p in each of the newly created sets Ht has at least one partner or neighbor in each of the other subsets in the collection F - {Ht}. In this way a graph is associated with the nesting of balanced processor assignments, in which links join partner processors. The way partners are selected in the other groups is called a partnership policy. The family of topologies defined this way is called weighted dynamic hypercubes. Although the formulas for the algorithm to support nested load-balanced processor assignment statements are far more convoluted than those for the unbalanced case, they still reduce to elementary arithmetic and logical operations. This simplicity and the good performance of current multicomputers on weighted hypercube connection patterns explain the efficiency of the implementation.
3 Computational Results
Our experiments were performed on an IBM SP2 computer and a Silicon Graphics Origin 2000. The IBM SP2 nodes (Power2-thin2 processors) are connected through a synchronous multistage packet switch with wormhole routing providing a real data transfer rate of 22 Mb/s [5]. The Origin 2000 is a 32 + 32 R10000-processor (196 MHz) machine with 8 GB main memory, 288 GB disk and a 25.08 Gflop/s theoretical peak. For each number of processors and for each algorithm, the columns labeled S in Table 1 show the average speedup over ten experiments against the corresponding best sequential algorithm. The first and fifth pairs of columns of the table compare the results obtained for the llc parallel quicksort algorithm of Figure 1 (label BALVIRT) against a message passing parallel quicksort algorithm for a hypercube multicomputer due to Professor P. B. Hansen [2] (label BH). The Hansen algorithm achieves a perfect load balance through the use of the find procedure due to Hoare [3]; this procedure divides the array to sort into two equal halves. On both architectures the La Laguna C times are slightly better than those obtained by the Hansen algorithm. This advantage can be explained by the different approach taken in the implementation of nested parallelism. While nested parallelism produces a processor assignment in the llc version, the implicit solution used by the Hansen algorithm roughly corresponds to a processor
activation. The constant complexity of the processor assignment algorithm used by the llc system is smaller than the logarithmic complexity of the processor activation algorithm. No inefficiency is introduced by the methodology in spite of the ease it brings to the expression of parallel algorithms.
Table 1. Speedups and speedup standard deviations for the parallel quicksort. IBM SP2 and Origin 2000, 7M integer array.
                 BH           UNBAL        MANUAL       BALAN        BALVIRT
        PROCS    S     SSD    S     SSD    S     SSD    S     SSD    S     SSD
SP2     2        1.49  0.05   1.27  0.17   1.53  0.05   1.23  0.20   1.15  0.18
        4        2.02  0.08   2.94  0.37   2.15  0.10   2.41  0.39   2.24  0.35
        8        2.42  0.12   2.77  0.66   2.66  0.14   3.45  0.31   3.29  0.29
        16       2.62  0.20   3.48  1.01   2.99  0.15   4.72  0.30   4.29  0.29
Origin  2        1.59  0.04   1.27  0.16   1.61  0.04   1.27  0.16   1.23  0.16
        4        2.20  0.08   1.95  0.31   2.31  0.08   2.46  0.33   2.37  0.32
        8        2.68  0.13   2.74  0.52   2.88  0.13   3.60  0.24   3.50  0.22
Since all the N elements of the array to sort have to be received by processor 0, the size N of the array constitutes a lower bound on the Time_par of any parallel sorting algorithm. Therefore, the speedup of any replicated sorting algorithm is bounded by the logarithm of the array size.
Theorem 1. If N is the size of the array to sort, then it holds:
  SpeedUp = N log(N) / Time_par <= N log(N) / N = log(N)
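As a quick numerical illustration of this bound (our own arithmetic, taking logarithms base 2): for the 7M-integer arrays of Table 1, N = 7 * 2^20, so log2(N) = 20 + log2(7), roughly 22.8; no replicated sorting algorithm can therefore exceed a speedup of about 23 on such inputs, regardless of the number of processors.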
This theorem explains the poor speedup in Table 1. As a consequence, the only way to increase efficiency is to consider huge arrays, much larger than the ones considered in these experiments. The results in the four rightmost speedup columns compare three different load balance approaches. The UNBAL label corresponds to the algorithm in Figure 1 but substituting in line 6 the unbalanced processor assignment statement PAR; the pivot is selected as the average of three randomly selected elements, so the amount of work assigned to each processor depends on the selected pivot. The MANUAL label stands for an llc program that uses Hoare's find procedure and the unbalanced processor assignment statement; the programmer takes responsibility for load balance. The results of the algorithm of Figure 1 using the PARWEIGHT construct instead of PARWEIGHTVIRTUAL are presented under the label BALAN. Performance obtained using balanced processor assignment statements is even better than that reached using a hand-written, specific load balancing algorithm designed by an experienced programmer. Label BALVIRT corresponds to the algorithm in Figure 1; processor virtualization does not introduce appreciable inefficiency. Columns labeled SSD present, for each
algorithm, the speedup standard deviation of the experiments. Although the average speedups of the UNBAL entries are comparable to the MANUAL entries, the standard deviation shows the unpredictability of its behavior. At the other extreme, the MANUAL case presents the smallest deviation. The BALAN and BALVIRT versions keep the standard deviation in an acceptable range; when the number of processors grows, the standard deviation remains almost constant. The Origin 2000 scales slightly better than the IBM SP2. The average speedup values for the distributed parallel quicksort algorithm presented in Figure 2 with eight processors are 6.66 and 6.90 for the IBM SP2 and the Silicon Graphics Origin respectively. These values were obtained as the average of 10 experiments with 16M integer arrays randomly generated according to a uniform distribution. The speedups show a better behaviour than for the replicated problems.
4 Conclusions
A methodology to efficiently implement balanced nested processor assignment statements in constant time was presented. The algorithm to implement the nesting of such statements can be used to build tools providing the capability to mix parallelism and recursion. A non-negligible advantage is that their performance can be predicted using the model presented in [5]. The computational results prove that the efficiency of this approach is comparable to that obtained using specific message-passing algorithms. From an efficiency point of view, the results obtained for the different approaches considered are quite similar, so the advantage of the proposed paradigms must not be seen from this perspective. The most important benefit resides in the fact that it provides a comfortable frame to translate PRAM algorithms to standard message-passing libraries without loss of performance.
Acknowledgements. This research has been done using the resources at the Centre de Computació i Comunicacions de Catalunya (CESCA-CEPBA).
References
1. Fortune S., Wyllie, J.: Parallelism in Random Access Machines. STOC. (1978) 114-118
2. Hansen P. B.: Do hypercubes sort faster than tree machines?. Concurrency: Practice and Experience 6 (2) (1994) 143-151
3. Hoare, C. A. R.: Proof of a Program: Find. Communications of the ACM. 14 39-45
4. Murphy, T.A., Vaughan J.: On the Relative Performance of Diffusion and Dimension Exchange Load Balancing in Hypercubes. Proceedings of the PDP'97. IEEE Computer Society Press. (1997) 29-34
5. Rodríguez, C., Roda J.L., Morales, D. G., Almeida, F.: h-relation Models for Current Standard Parallel Platforms. Proceedings of the Euro-Par'98. Springer Verlag LNCS. (1998)
Mutual Exclusion Between Neighboring Nodes in a Tree That Stabilizes Using Read/Write Atomicity. Gheorghe Antonoiu and Pradip K. Srimani. Department of Computer Science, Colorado State University, Ft. Collins, CO 80523
Abstract. Our purpose in this paper is to propose a new protocol that can ensure mutual exclusion between neighboring nodes in a tree-structured distributed system, i.e., under the given protocol no two neighboring nodes can execute their critical sections concurrently. This protocol can be used to run a serial model self-stabilizing algorithm in a distributed environment that accepts as atomic operations only send a message, receive a message, and update a state. Unlike the scheme in [1], our protocol does not use time-stamps (which are basically unbounded integers); our algorithm uses only bounded integers (actually, the integers can assume only the values 0, 1, 2 and 3) and can be easily implemented.
1 Introduction
Because of the popularity of the serial model and the relative ease of its use in designing new self-stabilizing algorithms, it is worthwhile to design lower-level self-stabilizing protocols such that an algorithm developed for a serial model can be run in a distributed environment. This approach was used in [1] and can be compared with the layered approach used in network protocol stacks. The advantage of such a lower-level self-stabilizing protocol is that it makes the job of the self-stabilizing application designer easier; one can work with the relatively easier serial model and does not have to worry about message management at the lower level. Our purpose in this paper is to propose a new protocol that can be used to run a serial model self-stabilizing algorithm in a distributed environment that accepts as atomic operations only send a message, receive a message, and update a state. Unlike the scheme in [1], our protocol does not use time-stamps (which are basically unbounded integers); our algorithm uses only bounded integers (actually, the integers can assume only the values 0, 1, 2 and 3) and can be easily implemented. Our algorithm is applicable for distributed systems whose underlying topology is a tree. It is interesting to note that the proposed protocol can be viewed as a special class of self-stabilizing distributed mutual exclusion protocol. In traditional distributed mutual exclusion protocols, self-stabilizing or non self-stabilizing (for references on self-stabilizing distributed mutual exclusion protocols see [2], and for non self-stabilizing distributed mutual exclusion protocols see [3, 4]), the objective is to ensure that only one node in the system can execute its critical
section at any given time (i.e., critical section execution is mutually exclusive from all other nodes in the system). The objective in our protocol, as in [1], is to ensure that a node executes its critical section mutually exclusive from its neighbors in the system graph (as opposed to all nodes in the system), i.e., multiple nodes can execute their critical sections concurrently as long as they are not neighbors of each other; in the critical section the node executes an atomic step of a serial model self-stabilizing algorithm.
2 Model
We model the distributed system, as used in this paper, by an undirected graph G = (V, E). The nodes represent the processors while the symmetric edges represent the bidirectional communication links. We assume that each processor has a unique id. Each node x maintains one or more local state variables and one or more local state vectors (one vector for each local variable) that are used to store copies of the local state variables of the neighboring nodes. Each node x maintains an integer variable Sx denoting its status; the node also maintains a local state vector LSx where it stores the copies of the status of its neighbors (this local state vector contains dx elements, where dx is the degree of the node x, i.e. x has dx many neighbors); we use the notation LSx(y) for the local state vector element of node x that keeps a copy of the local state variable Sy of neighbor node y. The local state of a node x is defined by its local state variables and its local state vectors. A node can both read and write its local state variables and its local state vectors; it can only read (and not write) the local state variables of its neighboring nodes, and it neither reads nor writes the local state vectors of other nodes. A configuration of the entire system, or a system state, is defined to be the vector of local states of all nodes. Next, our model assumes the read/write atomicity of [2] (as opposed to the composite read/write atomicity of [5, 6]). An atomic step (move) of a processor node consists of either reading a local state variable of one of its neighbors (and updating the corresponding entry in the appropriate replica vector), or some internal computation, or writing one of its local state variables; any such move is executed in finite time. Execution of a move by a processor may be interleaved with moves by other nodes; in this case the moves are concurrent. We use the notation N(x) = {nx[1], ..., nx[dx]} to denote the set of neighbors of node x. The process executed by each node x consists of an infinite loop of a finite sequence of Read, Conditional critical section and Write moves. Definition 1. An infinite execution is fair if it contains an infinite number of
actions for any type. Remark 1. The purpose of our protocol is to ensure m u t u a l exclusion between neighboring nodes. Each node x can execute its critical section iff the predicate ~ is true at node x. Thus, in a legitimate system state, m u t u a l exclusion is ensured iff ~ ~ Vy l y is a neighbor of x, - ~ y
547 i.e., as long as node x executes its CS, no neighbor y of node z can execute its CS ( ~ I (Y is a neighbor of x) is false).
3 Self-Stabilizing Balance Unbalance Protocol for a Pair of Processes
We present an approach that does not use any shared variables, unlike that in [2]. The structure of the two processes A and B is the same, i.e., the infinite loop at each processor consists of an atomic read operation, critical section execution if a certain predicate is true, and an atomic write action. Sa and LSa(b) are two local variables maintained by process A (LSa(b) is the variable maintained at process A to store a copy of the state of its neighbor process B); process A can write on both Sa and LSa(b) and process B can only read Sa. Similarly, Sb and LSb(a) are local variables of process B: process B can write on both Sb and LSb(a) and process A can only read Sb. The difference is that the variables are now ternary, i.e., they can assume the values 0, 1 or 2. The proposed algorithm is shown in Figure 1:
Process A:  Ra: LSa(b) = Sb;   CSa: if (Sa = 0) then Execute CS;   Wa: if (LSa(b) = Sa) then Sa = (Sa + 1) mod 3
Process B:  Rb: LSb(a) = Sa;   CSb: if (Sb = 1) then Execute CS;   Wb: Sb = LSb(a)
Fig. 1. Self-Stabilizing Balance Unbalance Protocol for a Pair of Processes
Since each of the processes A and B executes an infinite loop, after a Ra action the next "A" type action is W~, after a Wa action the next "A" type action is Ra, after a Rb action the next "B" type action is Wb, after a Wb action the next "B" type action is Rb and so on.
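To make the interleaving concrete, the following is a minimal simulation sketch of ours (it is not part of the paper): the two ternary-variable processes of Figure 1 are driven by a random scheduler that respects the per-process alternation of read and write moves, starting from an arbitrary state. The printout of simultaneously enabled critical sections can only occur before stabilization.

  #include <stdio.h>
  #include <stdlib.h>

  int main(void) {
      /* arbitrary (possibly illegitimate) initial state */
      int Sa = 2, LSa_b = 0, Sb = 1, LSb_a = 2;
      int a_reads_next = 1, b_reads_next = 1;   /* each process alternates R and W */
      srand(42);
      for (int step = 0; step < 1000; step++) {
          if (rand() % 2) {                      /* process A moves */
              if (a_reads_next) LSa_b = Sb;                    /* Ra */
              else if (LSa_b == Sa) Sa = (Sa + 1) % 3;         /* Wa */
              a_reads_next = !a_reads_next;
          } else {                               /* process B moves */
              if (b_reads_next) LSb_a = Sa;                    /* Rb */
              else Sb = LSb_a;                                 /* Wb */
              b_reads_next = !b_reads_next;
          }
          /* CS predicates: A may enter iff Sa == 0, B iff Sb == 1 */
          if (Sa == 0 && Sb == 1)
              printf("step %d: both CS predicates hold (possible only before stabilization)\n", step);
      }
      printf("final state: Sa=%d LSb(a)=%d Sb=%d LSa(b)=%d\n", Sa, LSb_a, Sb, LSa_b);
      return 0;
  }

The random coin flips here only approximate a fair execution; the theorems below do not depend on any particular scheduler.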
Remark 2. An execution of the system is an infinite execution of the processes A and B and hence an execution of the system contains an infinite number of each of the actions from the set {Ra, Wa, Rb, Wb}; thus, the execution is fair. The system m a y start from an arbitrary initial state and the first action in the execution of the system can be any arbitrary one from the set {Ra, Wa, Rb, Wb}. Note t h a t the global system state is defined by the variables S~ and LS~ (b) in process A and the variables 5;5 and LSb(a) in process B.
Remark 3. When a process makes a move (the system executes an action), the system state m a y or m a y not be changed. For example, in a system state where Sa ~s LSa(b), the move Wa does not modify the system state, i.e., the system remains in the same state after the move.
548 D e f i n i t i o n 2. A move (action) that modifies the system state is called a m o d ifying action. Our objective is to show that the system, when started from an arbitrary initial state (possibly illegitimate), converges to a legitimate state in finite time (after finitely many actions by the processes). We introduce a new binary relation. D e f i n i t i o n 3. We use the notation x ~- y, if x = y or x = (y + 1) mod 3, where x , y E Z3.
Remark 4. The relation ⪰ is neither reflexive, nor transitive, nor symmetric. For example, 1 ⪰ 0, 2 ⪰ 2, 2 ⋡ 0, 2 ⪰ 1, 0 ⋡ 1, etc. Definition 4. Consider the ordered sequence of variables (Sa, LSb(a), Sb, LSa(b)); a system state is legitimate if (i) Sa ⪰ LSb(a) ∧ LSb(a) ⪰ Sb ∧ Sb ⪰ LSa(b) and (ii) at most one pair of successive variables in the previous sequence is unequal.
Example 1. For example, {Sa = 1, LSa(b) = 0, Sb = 0, LSb(a) = 1} is a legitimate state while {Sa = 2, LSa(b) = 1, Sb = 1, LSb(a) = 0} is not. Theorem 1. In a legitimate state the two processes A and B execute their critical sections in a mutually exclusive way, i.e., if process A is executing its CS then process B cannot execute its CS and vice versa, i.e. Sa = 0 ⇒ Sb ≠ 1 and
Sb=I~Sar Proof. The proof is obvious since in a legitimate state S~ ~" LSb (a) A LSb (a) Sb ASb ~- LS~ (b) and at most one pair of successive variables in the sequence can be unequal. T h e o r e m 2. Any arbitrary move from the set { Ra, Wa, Rb, Wb } made in a legitimate state of the system leads to a legitimate state after the move.
Pro@ Since there are only four possible moves, it is easy to check the validity of the claim. For example, consider the move Ra ; the variable LS~(b) is affected; if LSa(b) = Sb before the move, this move does not change the system state; if LSa(b) ~ Sb before the move, then the system state before the move must satisfy Sa = LSb(a) = Sb (since it is legitimate) and after the move it will satisfy S~ = LSb(a) = Sb = LS~(b) (hence, the resulting state is legitimate). L e m m a 1. Any infinite fair execution contains infinitely many modifying actions (see Definition 2).
Proof. By contradiction. Assume that after a finite number of moves Sa does not change its value anymore. Then after a complete loop executed by process B, LSb(a) = Sa and Sb = Sa. In the next loop the process A must move, which contradicts our assumption. It is easy to see that if Sa changes its value infinitely many times, any other variable changes its value infinitely many times.
549 L e m m a 2. For any given fair execution and for any initial state, a state such that three variables from the set {Sa, LSD(a), Sb, LS~(b)} are equal each other
is reached in a finite number of moves. Proof. Consider the first move that modifies the state of Sa. After this move S~ • LS~ (b). To change again the value of Sa, the LSa (b) variable must change its value and become equal to Sa. But LS~ (b) always takes the value of Sb. Since S~ change its value infinitely many times a state such that Sa = LSa(b) and LSa(b) = Sb is reached in a finite number of moves. L e m m a 3. For any given fair execution and for any initial state, a state such
that Sa 7s LS~(b), Sa 7s Sb, Sa 7s LSb(a), is reached in a finite number of moves. Proof. Since we use addition modulo 3, the variables Sa, LSb(a), Sb, LS~(b), can have values in the set Z3 = {0, 1, 2). When three variables from the set {Sa,LSb(a),Sb, LSa(b)) are equal to each other, Lemma 2, there is a value i E Z3 such that Sa 7s i, LS~(b) 7s i, S b r i, LSb(a) 7s i. When LSb(a), Sb, LSa(b) change their values, they only copy the value of one variable in the set {S~, LSb(a), Sb, LSa(b)}. Thus, when S~ reaches for the first time the value i, the condition Sa 7~ LS~(b), S~ 7s Sb, Sa 7s LSb(a) is met. T h e o r e m 3. For any given fair execution, the system starting from any arbi-
trary state reaches a legitimate state in finitely many moves. Proof. The system reaches a state such that Sa ≠ LSa(b), Sa ≠ Sb, Sa ≠ LSb(a) in a finite number of moves (Lemma 3). Let Sa = i ∈ Z3. Since LSb(a) ≠ i, Sb ≠ i and LSa(b) ≠ i, Sa cannot change its value until LSa(b) becomes equal to i, LSa(b) cannot become equal to i until Sb becomes equal to i, and Sb cannot become equal to i until LSb(a) becomes equal to i. Thus, Sa cannot modify its state until a legitimate state is reached.
4 Self-Stabilizing Mutual Exclusion Protocol for a Tree Network Without Shared Memory
Consider an arbitrary rooted tree; the tree is rooted at node r. We use the notation dx for the degree of node x, nx[j] for the j-th neighbor of node x, N(x) = {nx[1], ..., nx[dx]} for the set of neighbors of node x, C(x) for the set of children of x, and Px for the parent of node x; since the topology is a tree, each node x knows its parent Px, and for the root node r, Pr is Null. As before, each node x maintains a local state variable Sx (readable by its neighbors) and a local state vector LSx used to store the copies of the states of its neighbors; we use the notation LSx(y) to denote the component of the state vector LSx that stores a copy of the state variable Sy of node y, for all y ∈ N(x). All variables are now modulo 4 integers (we explain the reason later). We assume that each node x maintains a height variable Hx such that Hr = 0 for the root node and, for all x ≠ r, Hx is
550 II
R~:.
LS~(n~[1]).
Root node r = S,,~[~];
]]R~: gs~(n~[d~]) = S~[a~]; CS~: i f S~ = 0 t h e n Execute Critical Section;
Wr: i f (Ay~e(~)(S~ = LS,.(y))) then S~=S~+I
rood 4;
Leaf node y LS~(P~) = Sp~; .[CSy: if di(y) t h e n Execute Critical Section; RI:
Ifw~: II n~: : CSx:
s~ = LS~(P,); Internal Node x nSx(n~[1]) = S,~x[1]
i f ~(x) t h e n Execute Critical Section
t h e n Sx : RH~(P~);
Fig. 2. Protocol for an Arbitrary Tree the number of edges in the unique path from node x to the root node. It is easy to see that if the root node sets Hr = 0 and any other node x sets its H~ to H p x q- 1 (level of its parent plus 1), the height variables will correctly indicate the height of each node in the tree after a finite number of moves, starting from any illegitimate values of those variables. To avoid cluttering the algorithm, we do not include the rules for handling H~ variable in our algorithm specification. As before, the root node, internal nodes as well as the leaf nodes execute infinite loops of reading the states of neighbor(s), executing critical sections (if certain predicate is satisfied) and writing its new state. The protocols (algorithms) for root, internal nodes and leaf nodes are shown in Figure 2 where the predicate
9 (x) is: O(x) --- (S~ = 0 A (H~ is even)) V (S~ : 2 A (Hx is odd)) Note, as before, the state of a node x is defined by the variable S~: and the vector LSx; the global system state is defined by the local states of all participating nodes in the tree. D e f i n i t i o n 5. Consider a link or edge (x, y) such that node x is the parent of node y in the given tree. The state of a link (x, y), in a given system state, is
551
defined to be the vector (S,, LSy(X), Sy, LS~(y)). The state of a link is called l e g i t i m a t e iff Sx ~ LSy(x) A LSy(x) ~ Sy A Sy ~- LSx(y) and at most one pair
of successive variables in the vector ( S~ , L Sy ( x ) , Sy , n S~ ( y ) ) are unequal. D e f i n i t i o n 6. The system is in a legitimate state if all links are in a legitimate
state. T h e o r e m 4. For an arbitrary tree in any legitimate state, no two neighboring processes can execute their critical sections simultaneously.
Proof. In a legitimate state (when the H variables at nodes have stabilized) for any two neighboring nodes x and y, we have (either L~ is even ~ Ly is odd), or (L~ is odd ~ Ly is even). Hence, O(x) and O(y) are simultaneously (concurrently) true iff S~ = 2 and Sy = 0 or S~ = 0 and Sy = 2. But since link (x,y) is in a legitimate state, such condition can not be met; hence, two neighboring nodes cannot execute their critical sections simultaneously. L e m m a 4. Consider an arbitrary link (x, y). The node x modifies the value of
the variable Sx infinitely many times, if and only if the node y modifies the value of the variable Sy infinitely many times. Proof. If node x modifies Sx finitely many times, then after a finite number of moves the value of Sx is not modified anymore. T h e next complete loop of node after the last modification of makes (x) = and (x) is not be modified by subsequent moves. Hence, after at most one modification, Sy remains unchanged. Conversely, if node y modifies Sy finitely many times then after a finite number of moves the value of S~ is not modified anymore. The next complete loop of node x after the last modification of Sy, makes LS~ (y) = Sy and LS~ (y) is not be modified by subsequent moves. Hence, after at most one modification, S~ remains unchanged. L e m m a 5. For any fair execution, variable S~ at root node r is modified infinitely many times.
Proof. By contradiction. Assume that the value of Sr is modified finitely many times. Then, after a finite number of moves, the value Sy for each child y of r will not be modified anymore, L e m m a 4. Repeating the argument, after a finite number of moves no S or L S variables for any node in the tree m a y be modified. Consider now the leaf nodes. Since the execution is fair, the condition Sz = LSz(Pz) must be met for each leaf node z. If this condition is met for leaf nodes it must be met for the parents of leaf nodes too. Repeating the argument, the condition must be met by all nodes in the tree. But, if this condition is met, the root node r modifies its Sr variable in its next move, which is a contradiction. L e m m a 6. For any fair execution and for any node x, the variable S~ is mod-
ified infinitely many times.
552
Proof. If node x modifies its variable S~ finitely many times, its parent, say node z, must modify its variable Sz only finitely many times, L e m m a 4. Repeating the argument, the root node also modifies its variable Sr finitely m a n y times, which contradicts Lemma 5. L e m m a 7. Consider an arbitrary node z (~ r). If all links in the path from r
to z are in a legitimate state, then these links remain in a legitimate state after an arbitrary move by any node in the system. Proof Let (x, y) be an link in the path from r to z. Since (x, y) is in a legitimate state, we have S~ ~_ nSy (x) A LSy (x) N_-Sy A Sy ~ LS~ (y) and at most one pair of successive variables in the sequence (S~, LSy(x), Sy, nS~(y)) are unequal. We need to consider only those system moves (executed by nodes x and y) that can change the variables Sx, LSy (x), Sy, LS~ (y). Considering each of these moves, we check that legitimacy of the link state is preserved in the resulting system state. The read moves (that update the variables LSy(X), LS~(y)) obviously preserve legitimacy. To consider the move Wx, we look at two cases differently: C a s e 1 (x = r): When Wz is executed, S~ can be modified (incremented by 1) only when S~ = LS~(y). Thus, since the state is legitimate, a W~ move can increment S~ only under the condition S~ = LSy(X) = Sy : LS~(y) and after ~.he move the link (x, y) remains in a legitimate state. C a s e 2 (x • r): Since the link (t, x) (where node t is the parent of x, i.e., : P~) is in a legitimate state, we have LS~ (t) = S~ or LS~ (t) = (S~ + 1) mod 4. ,/hen W~ is executed, S~ can be modified only by setting its value equal to LS~(t); hence, after the move the link (x, y) remains in a legitimate state. L e m m a 7 shows that if a path of legitimate links from the root to a node is created the legitimacy of the links in this path is preserved for all subsequent states. The next lemma shows that a new link is added to such a path in finite time. L e m m a 8. Let (x, y) be a link in the tree. If all links in the path from root node
to node x are in a legitimate state, then the link (x, y) becomes legitimate in a finite number of moves. Proof. First, we observe that the node y modifies the variable Sy infinitely m a n y times, L e m m a 6. Then, we use the same argument as in the proof of Theorem 3 to show that the link (x, y) becomes legitimate after a finite number of moves. T h e o r e m 5. Starting from an arbitrary state, the system reaches a legitimate
state in finite time (in finitely many moves). Proof. The first step is to prove that all links from the root node to its children become legitimate after a finite number of moves. This follows from the observation that each child x of the root node modifies its S variable infinitely many times and from an argument similar to the argument used in the proof of Theorem 3. Using Lemma 8, the theorem follows.
553
References 1. M. Mizuno and H. Kakugawa. A timestamp based transformation of self-stabifizing programs for distributed computing environments. In Proceedings of the lOth International Workshop on Distributed Algorithms (WDAG'96), volume 304-321, 1996. 2. S. Dolev, A. Israeli, and S. Moran. Self-stabilization of dynamic systems assuming only read/write atomicity. Distributed Computing, 7:3-16, 1993. 3. M. Raynal. Algorithms ]or Mutual Exclusion. MIT Press, Cambridge MA, 1986. 4. P. K. Srimani and S. R. Das, editors. Distributed Mutual Exclusion Algorithms. IEEE Computer Society Press, Los Alamitos, CA, 1992. 5. M. Flatebo, A. K. Datta, and A. A. Schoone. Self-stabilizing multi-token tings. Distributed Computing, 8:133-142, 1994. 6. S. T. Huang and N. S. Chen. Self-stabilizing depth-first token circulation on networks. Distributed Computing, 1993. 7. E, W. Dijkstra. Solution of a problem in concurent programming control. Communication of the ACM, 8(9):569, September 1965. 8. L, Lamport. A new solution of Dijkstra's concurrent programming problem. Communications oJ the ACM, 17(8):107-118, August 1974. 9. L. Lamport. The mutual exclusion problem: Part II - statement and solutions. Journal of the ACM, 33(2):327-348, 1986. 10. H. S. M. Kruijer. Self-stabilization (in spite of distributed control) in treestructured systems. Inf. Processing Letters, 8(2):91-95, 1979.
Irreversible D y n a m o s in Tori * P. Flocchini 1, E. Lodi 2, F. Luccio 3, L. Pagli 3, N. Santoro 4 1 Universit@ de Montreal, Canada 2 Universit~ di Siena, Italy 3 Universit~ di Pisa, Italy 4 Carleton University, Canada
A b s t r a c t . We study the dynamics of majority-based distributed systems in presence of permanent faults. In particular, we are interested in the patterns of initial faults which may lead the entire system to a faulty behaviour. Such patterns are called dynamos and their properties have been studied in many different contexts. In this paper we investigate dynamos for meshes with different types of toroidal closures. For each topology we establish tight bounds on the number of faulty elements needed for a system break-down, under different majority rules.
1
Introduction
Consider the following repetitive process on a synchronous network G: initially each vertex is in one of two states (colors), black or white; at each step, all vertices simultaneously (re)color themselves either black or white, each according to the color of the "majority" of its neighbors (majority rule). Different processes occur depending on how majority is defined (e.g., simple, strong, weighted) and on whether or not the neighborhood of a vertex includes that vertex. The problem is to study the initial configurations (assignment of colours) from which, after a finite number of steps, a monochromatic fixed point is reached, that is, all vertices become of the same colour (e.g., black). The initial set of black vertices is called dynamo (short for "dynamic monopoly") and their study has been introduced by Peleg [17] as an extension of the study of monopolies. The dynamics of majority rules have been extensively studied in the context of cellular automata, and much effort has been concentrated on determining the asymptotic behaviors of different majority rules on different graph structures. In particular, it has been shown that, if the process is periodic, its period is at most two. Most of the existing research has focused, for example, on the study of the period-two behavior of symmetric weighted majorities on finite {0, 1 } - and { 0 , . . . , p}-colored graphs [9, 19], on the number of fixed points on finite {0, 1}colored rings [1, 2, 10], on finite and infinite {0, 1}-colored lines [12,13], on the behaviors of infinite, connected {0, 1 }-colored graphs [14]. Furthermore, dynamic majority has been applied to the immune system and to image processing [1, 8]. Although the majority rule has been extensively investigated, not much is known regarding dynamos. Some results are known in terms of catastrophic fault * This work has been supported in part by MURST, NSERC, and FCAR.
555 patterns which are dynamos based on "one-sided" majority for infinite chordal rings (e.g., [5, 15, 20]). Further results are known in the study of monopolies, that is dynamos for which the system converges to all black in a single step [3, 4, 16]. Other more subtle definitions have been posed. An irreversible d y n a m o is one where the initial black vertices do not change their colour regardless of their neighbourhood. This is opposed to reversible dynamos, for which a vertex may switch colour several times according to a changing neighborhood. Among the latter,monotone reversible dynamos are the ones for which black vertices remain Mways black because the neighborhood never forces them to turn white. Recently, some general lower and upper bounds on the size of monotone dynamos have been estabilished in [17], and a characterization of irreversible dynamos has been given for chordal rings in [7]. In this paper we consider irreversible dynamos in tori and we focus on their dimension, that is, the minimum number of initial black elements needed to reach the fixed point. The motivation for irreversible dynamos comes from faulttolerance. Initial black vertices correspond to permanent faulty elements, and the white correspond to non-faulty. The faulty elements can induce a faulty behavior in their neighbors: if the majority of its neighbors is faulty (or has a faulty behavior), a non-faulty element will exhibit a faulty behavior and will therefore be indistinguishable from a faulty one. Irreversible dynamos are precisely those patterns of permanent initial faults whose occurrence leads the entire system to a faulty behavior (or catastrophe). In addition to its practical importance and theoretical interest, the study of irreversible dynamos gives insights on the class of monotone dynamos. In particular, all lower bounds established on the size of irreversible dynamos are immediately lower bounds for the monotone case. The torus is one of the simplest and most natural way of connecting processors in a network. We consider different types of tori: the toroidal mesh (the classieM architecture used in VLSI), the torus cordalis (also known as doubleloop interconnection networks), and the torus serpentinus (e.g., used by ILIAC IV). For each of these topologies we derive lower and upper bounds on the dimensions of dynamos. For a summary of results see table 1. The upper bounds are constructive, that is we derive the initial black vertices constituting the dynamo, and we also analyze the completion time (i.e., the time necessary to reach the fixed point). Limiting the discussion to meshes with toroidal connections avoids to examine border effects for some vertices, as it would occur in simple meshes. Our results and techniques can be easily adapted to simple meshes. In the following, due to space limitations, all the proofs have been omitted; the interested reader is referred to [6].
2 Basic Definitions
Let us consider an m × n mesh M and denote with v_{i,j}, 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, a vertex of M.
Table 1. Bounds on the size of irreversible dynamos for tori of m × n vertices, with M = max{m, n}, N = min{m, n}, and H, K = m, n or H, K = n, m (choose the alternative that yields stricter bounds).

                     Simple majority                                Strong majority
                     Lower bound                 Upper bound        Lower bound       Upper bound
Toroidal mesh        ⌈(m+n)/2⌉ − 1               ⌈(m+n)/2⌉ − 1      ⌈(mn+1)/3⌉        ⌈H/3⌉(K+1)
Torus cordalis       ⌈N/2⌉                       ⌊N/2⌋ + 1          ⌈(mn+1)/3⌉        ⌈m/3⌉(n+1)
Torus serpentinus    ⌊M/2⌋ or ⌈N/2⌉              ⌊N/2⌋ + 1          ⌈(mn+1)/3⌉        ⌈H/3⌉(K+1)
                     (two cases, see Theorem 5)
The differences among the considered topologies consist only in the way that border vertices (i.e., v_{i,0}, v_{i,n−1} with 0 ≤ i ≤ m−1, and v_{0,j}, v_{m−1,j} with 0 ≤ j ≤ n−1) are linked to other processors. The vertices v_{i,n−1} on the last column are usually connected either to the opposite ones on the same rows (i.e., to v_{i,0}), thus forming ring connections in each row, or to the opposite ones on the successive rows (i.e., to v_{i+1,0}), in a snake-like way. The same linking strategy is applied for the last row. In the toroidal mesh, rings are formed in rows and columns; in the torus cordalis there are rings in the columns and snake-like connections in the rows; finally, in the torus serpentinus there are snake-like connections in rows and columns. Formally, we have:

Definition 1 Toroidal Mesh. A toroidal mesh of m × n vertices is a mesh where each vertex v_{i,j}, with 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, is connected to the four vertices v_{(i−1) mod m, j}, v_{(i+1) mod m, j}, v_{i,(j−1) mod n}, v_{i,(j+1) mod n} (mesh connections).
Definition 2 Torus Cordalis. A torus cordalis of m × n vertices is a mesh where each vertex v_{i,j}, with 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, has mesh connections except for the last vertex v_{i,n−1} of each row i, which is connected to the first vertex v_{(i+1) mod m, 0} of row i + 1. Notice that this torus can be seen as a chordal ring with one chord.

Definition 3 Torus Serpentinus. A torus serpentinus of m × n vertices is a mesh where each vertex v_{i,j}, with 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, has mesh connections, except for the last vertex v_{i,n−1} of each row i, which is connected to the first vertex v_{(i+1) mod m, 0} of row i + 1, and for the last vertex v_{m−1,j} of each column j, which is connected to the first vertex v_{0,(j−1) mod n} of column j − 1.
Majority will be defined as follows:

Definition 4 Irreversible-majority rule. A vertex v becomes black if the majority of its neighbours are black. In case of a tie, v becomes black (simple majority) or keeps its colour (strong majority). In the tori, simple (respectively strong) irreversible majority asks for at least two (respectively three) black neighbours.

We can now formally define dynamos:

Definition 5 A simple (respectively strong) irreversible dynamo is an initial set of black vertices from which an all-black configuration is reached in a finite number of steps under the simple (respectively strong) irreversible-majority rule.

Simple and strong majorities will be treated separately because they exhibit different properties and require different techniques.
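To make the two rules concrete, the following sketch (an illustration added here, not part of the original paper; the grid dimensions and the simple/strong flag are assumed names) performs one synchronous step of the irreversible majority rule on an m × n toroidal mesh: a white vertex needs at least two (simple) or three (strong) black neighbours to turn black, and black vertices never revert.

```c
#include <string.h>

#define M 9
#define N 8

/* One synchronous step of the irreversible majority rule on an M x N
 * toroidal mesh.  grid[i][j] == 1 means black (faulty), 0 means white.
 * Simple majority: a tie (2 of the 4 neighbours black) suffices;
 * strong majority: 3 black neighbours are required.  Black vertices
 * are irreversible.  Returns the number of vertices that changed. */
int majority_step(int grid[M][N], int simple)
{
    int next[M][N];
    int changed = 0;

    memcpy(next, grid, sizeof(next));
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            if (grid[i][j]) continue;            /* already black */
            int black =
                grid[(i + M - 1) % M][j] + grid[(i + 1) % M][j] +
                grid[i][(j + N - 1) % N] + grid[i][(j + 1) % N];
            int threshold = simple ? 2 : 3;
            if (black >= threshold) {
                next[i][j] = 1;
                changed++;
            }
        }
    }
    memcpy(grid, next, sizeof(next));
    return changed;
}
```

Iterating this step until no vertex changes colour, and then checking whether every vertex is black, gives a direct test of whether an initial black set is a dynamo; only the neighbourhood computation would change for the torus cordalis or serpentinus.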
3 Irreversible Dynamos with Simple Majority
Network behaviour changes drastically if we pass from simple to strong majority. We start our study from the former case.
3.1 Toroidal Mesh
Consider a toroidal mesh with a set T of m × n vertices, m, n > 2. Each vertex has four neighbours, hence two black neighbours are enough to colour a white vertex black. Let S ⊆ T be a generic subset of vertices, and let R_S be the smallest rectangle containing S. The size of R_S is m_S × n_S. If S is all black, a spanning set for S (if any) is a connected black set σ(S) ⊇ S derivable from S by consecutive applications of the simple majority rule. We have:
Proposition 1 Let S be a black set with m_S ≤ m − 1 and n_S ≤ n − 1. Then, any (not necessarily connected) black set B derivable from S is such that B ⊆ R_S.
Proposition 2 Let S be a black set. The existence of a spanning set σ(S) implies |S| ≥ ⌈(m_S + n_S)/2⌉.

From Propositions 1 and 2 we immediately derive our first lower bound result:

Theorem 1 Let S be a simple irreversible dynamo for a toroidal mesh m × n. We have: (i) m_S ≥ m − 1, n_S ≥ n − 1; (ii) |S| ≥ ⌈(m + n)/2⌉ − 1.
To build a matching upper bound we need some further results.
Proposition 3 Let S be a black set such that a spanning set σ(S) exists. Then a black set R_S can be built from S (note: R_S is then a spanning set of S).
Definition 6 An alternating chain C is a sequence of adjacent vertices starting and ending with black. The vertices of C alternate black and white; however, if C has even length there is exactly one pair of consecutive black vertices somewhere.

Theorem 2 Let S be a black set consisting of the black vertices of an alternating chain C, with m_S = m − 1 and n_S = n − 1. Then the whole torus can be coloured black starting from S, that is, S is a simple irreversible dynamo.

An example of an alternating chain of proper length, and the phases of the algorithm, are illustrated in Fig. 1. From Theorem 2 we have:
Fig. 1. The initial set S, the spanning set σ(S) and the smallest rectangle R_S containing S.
Corollary 1 Any m × n toroidal mesh admits a simple irreversible dynamo S with |S| = ⌈(m + n)/2⌉ − 1.

Note that the lower bound of Theorem 1 matches the upper bound of Corollary 1. From the proof of this corollary we see that a dynamo of minimal cardinality can be built on an alternating chain. Furthermore we have:

Corollary 2 There exists a simple irreversible dynamo S of minimal cardinality starting from which the whole toroidal mesh can be coloured black in ⌈(m + n)/2⌉ steps.

An example of such a dynamo is shown in Fig. 2. An interesting observation is that the colouring mechanism of Corollary 2 also works for asynchronous systems. In this case, however, the number of steps loses its significance.
3.2 Torus Cordalis
When studied on a torus cordalis, simple irreversible majority is quite easy. We now show that any simple irreversible dynamo must have size at least ⌈n/2⌉, and that there exist dynamos of almost optimal size.
Fig. 2. Examples of dynamos requiring ⌈(m + n)/2⌉ steps.
Theorem 3 Let S be a simple irreversible dynamo for a torus cordalis m × n. We have: |S| ≥ ⌈n/2⌉.

Theorem 4 Any m × n torus cordalis admits a simple irreversible dynamo S with |S| = ⌊n/2⌋ + 1. Starting from S, the whole torus can be coloured black in ⌊m/2⌋·n + 3 steps.

An example of such dynamos is given in Fig. 3.
Fig. 3. Simple irreversible dynamos of ⌊n/2⌋ + 1 vertices for tori cordalis, with n even and n odd.
3.3 Torus Serpentinus
Since the torus serpentinus is symmetric with respect to rows and columns, we assume without loss of generality that m ≥ n, and derive lower and upper bounds on the size of any simple irreversible dynamo. For m < n, simply exchange m with n in the expressions of the bounds. Consider a white cross, that is, a set of white vertices arranged as in Fig. 4, with height m and width n. The parallel white lines from the square of nine vertices at the center of the cross to the borders of the mesh are called rays. Note that each vertex of the cross is adjacent to three other vertices of the same cross, thus implying:
Fig. 4. The white cross configuration.
Proposition 4 In a torus serpentinus the presence of a white cross prevents a simple irreversible dynamo from existing.
We then have:

Proposition 5 If a torus serpentinus contains three consecutive white rows, any simple irreversible dynamo must contain at least ⌊n/2⌋ vertices.

Based on Proposition 5 we obtain the following lower bounds:

Theorem 5 Let S be a simple irreversible dynamo for a torus serpentinus m × n. Depending on the relative magnitudes of m and n, we have either
1. |S| ≥ ⌊m/2⌋, or
2. |S| ≥ ⌈n/2⌉.
The upper bound for the torus serpentinus is identical to the one already found for the torus cordalis. In fact we have:

Theorem 6 Any m × n torus serpentinus admits a simple irreversible dynamo S with |S| = ⌊n/2⌋ + 1. Starting from S, the whole torus can be coloured black in ⌊m/2⌋·n + 3 steps.
4 Irreversible Dynamos with Strong Majority
A strong majority argument allows us to derive a significant lower bound valid for the three considered families of tori (simply denoted by tori). Since these tori have a neighbourhood of four, three adjacent black vertices are needed to colour a white vertex black under strong majority. We have:

Theorem 7 Let S be a strong irreversible dynamo for a torus m × n. Then |S| ≥ ⌈(mn + 1)/3⌉.

We now derive an upper bound also valid for all tori.
Theorem 8 Any m × n torus admits a strong irreversible dynamo S with |S| = ⌈m/3⌉(n + 1). Starting from S, the whole torus can be coloured black in ⌊n/2⌋ + 1 steps.

An example of such a dynamo is shown in Fig. 5.
Fig. 5. A strong irreversible dynamo for the toroidal mesh, torus cordalis or torus serpentinus (m = 9, n = 8, |S| = ⌈m/3⌉(n + 1) = 27).
For particular values of m and n the bound of Theorem 8 can be made stricter for the toroidal mesh and the torus serpentinus. In fact, these networks are symmetrical with respect to rows and columns, hence the pattern of black vertices reported in Fig. 5 can be turned by 90 degrees, still constituting a dynamo. We immediately have:

Corollary 3 Any m × n toroidal mesh or torus serpentinus admits a strong irreversible dynamo S with |S| = min{⌈m/3⌉(n + 1), ⌈n/3⌉(m + 1)}.
References
1. Z. Agur. Fixed points of majority rule cellular automata with application to plasticity and precision of the immune system. Complex Systems, 2:351-357, 1988.
2. Z. Agur, A. S. Fraenkel, and S. T. Klein. The number of fixed points of the majority rule. Discrete Mathematics, 70:295-302, 1988.
3. J.-C. Bermond, D. Peleg. The power of small coalitions in graphs. In Proc. of 2nd Coll. on Structural Information and Communication Complexity, 173-184, 1995.
4. J.-C. Bermond, J. Bond, D. Peleg, S. Perennes. Tight bounds on the size of 2-monopolies. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 170-179, 1996.
5. R. De Prisco, A. Monti, L. Pagli. Efficient testing and reconfiguration of VLSI linear arrays. To appear in Theoretical Computer Science.
6. P. Flocchini, E. Lodi, F. Luccio, L. Pagli, N. Santoro. Irreversible dynamos in tori. Technical Report TR-98-05, Carleton University, School of Computer Science, 1998.
7. P. Flocchini, F. Geurts, N. Santoro. Dynamic majority in general graphs and chordal rings. Manuscript, 1997.
8. E. Goles, S. Martinez. Neural and Automata Networks: Dynamical Behavior and Applications. Maths and Applications, Kluwer Academic Publishers, 1990.
9. E. Goles and J. Olivos. Periodic behavior of generalized threshold functions. Discrete Mathematics, 30:187-189, 1980.
10. A. Granville. On a paper by Agur, Fraenkel and Klein. Discrete Mathematics, 94:147-151, 1991.
11. N. Linial, D. Peleg, Y. Rabinovich, M. Sachs. Sphere packing and local majority in graphs. In Proc. of 2nd ISTCS, IEEE Comp. Soc. Press, pp. 141-149, 1993.
12. G. Moran. Parametrization for stationary patterns of the r-majority operators on 0-1 sequences. Discrete Mathematics, 132:175-195, 1994.
13. G. Moran. The r-majority vote action on 0-1 sequences. Discrete Mathematics, 132:145-174, 1994.
14. G. Moran. On the period-two-property of the majority operator in infinite graphs. Transactions of the American Mathematical Society, 347(5):1649-1667, 1995.
15. A. Nayak, N. Santoro, R. Tan. Fault intolerance of reconfigurable systolic arrays. Proc. 20th Int. Symposium on Fault-Tolerant Computing, 202-209, 1990.
16. D. Peleg. Local majority voting, small coalitions and controlling monopolies in graphs: A review. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 152-169, 1996.
17. D. Peleg. Size bounds for dynamic monopolies. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 151-161, 1997.
18. S. Poljak. Transformations on graphs and convexity. Complex Systems, 1:1021-1033, 1987.
19. S. Poljak and M. Sura. On periodical behavior in societies with symmetric influences. Combinatorica, 1:119-121, 1983.
20. N. Santoro, J. Ren, A. Nayak. On the complexity of testing for catastrophic faults. In Proc. of 6th Int. Symposium on Algorithms and Computation, 188-197, 1995.
MPI-GLUE: Interoperable High-Performance MPI Combining Different Vendor's MPI Worlds

Rolf Rabenseifner
Rechenzentrum der Universität Stuttgart, Allmandring 30, D-70550 Stuttgart
rabenseifner@… — http://www.hlrs.de/people/rabenseifner/
Abstract. Several metacomputing projects try to implement MPI for homogeneous and heterogeneous clusters of parallel systems. MPI-GLUE is the first approach which exports nearly full MPI 1.1 to the user's application without losing the efficiency of the vendors' MPI implementations. Inside each MPP or PVP system the vendor's MPI implementation is used. Between the parallel systems a slightly modified TCP-based MPICH is used, i.e. MPI-GLUE is a layer that combines different vendors' MPIs by using MPICH as a global communication layer. Major design decisions within MPI-GLUE and other metacomputing MPI libraries (PACX-MPI, PVMPI, Globus and PLUS) and their implications for the programming model are compared. The design principles are explained in detail.
1 Introduction
The development of large-scale parallel applications for clusters of high-end MPP or PVP systems requires efficient and standardized programming models. The message passing interface MPI [9] is one solution. Several metacomputing projects try to solve the problem that different vendors' MPI implementations are mostly not interoperable. MPI-GLUE is the first approach that exports nearly full MPI 1.1 without losing the efficiency of the vendors' MPI implementations inside each parallel system.
2 Architecture of MPI-GLUE
MPI-GLUE is a library exporting the standard MPI 1.1 to parallel applications on clusters of parallel systems. MPI-GLUE imports the native MPI library of the system's vendor for the communication inside a parallel system. Parallel systems in this sense are MPP or PVP systems, but also homogeneous workstation clusters. For the communication between such parallel systems MPI-GLUE uses a portable TCP-based MPI library (e.g. MPICH). In MPI-GLUE the native MPI library is also used by the portable MPI to transfer its byte messages inside of each parallel system. This design allows any homogeneous or heterogeneous combination of any number of parallel or non-parallel systems. The details are shown in Fig. 1.
Fig. 1. Software architecture: on each parallel system the application calls the standardized MPI interface of the MPI-GLUE layer, which uses the vendor's native MPI for local communication and a portable TCP-based MPI for the global communication between the systems.
3 Related Work
The PACX-MPI [1] project at the computing center of the University of Stuttgart follows a similar approach to MPI-GLUE. At Supercomputing '97, PACX-MPI was used to demonstrate that MPI-based metacomputing can solve large-scale problems. Two CRAY T3E with 512 processors each were combined to run a flow solver (URANUS) which predicts the aerodynamic forces and high temperatures acting on space vehicles when they reenter the Earth's atmosphere [2]. Furthermore, a new world record in simulating an FCC crystal with 1.399.440.000 atoms was reached on that cluster [11]. PACX-MPI uses a special module for the TCP communication, in contrast to the portable MPI in MPI-GLUE. The functionality of PACX-MPI is limited by this module to a small subset of MPI 1.1. The TCP communication does not directly exchange messages between the MPI processes. On each parallel system two special nodes are used as concentrators for incoming and outgoing TCP messages. PACX-MPI can compress all TCP messages to enlarge bandwidth. PVMPI [4] has been developed at UTK, ORNL and ICL, and couples several MPI worlds by using PVM-based bridges. The user interface does not meet the MPI 1.1 standard, but is similar to a restricted subset of the functionality of MPI_Comm_join in MPI 2.0 [10]. It is impossible to merge the bridge's inter-communicator into a global intra-communicator over all processes. Therefore, one cannot execute existing MPI applications with PVMPI on a cluster of parallel systems. In the PLUS [3] project at PC2 at the University of Paderborn a bridge has been developed too. With this bridge, MPI and PVM worlds can be combined. Inefficient TCP implementations of some vendors have been substituted by an own protocol based on UDP. It is optimized for wide area networks. As in PACX-MPI, the communication is concentrated by a daemon process on each parallel system. The interface has the same restrictions as described for PVMPI; moreover, only a restricted subset of datatypes is valid. PLUS is part of the MOL (Metacomputer Online) project.
In the Globus [5] project and in the I-WAY [7] project, MPICH [8] is used on top of the multi-protocol communication library NEXUS [6]. NEXUS is designed as a basic module for communication libraries like MPI. On the parallel system itself, MPICH is used instead of the vendor's MPI library.
4 Design Decisions and Related Problems
Several decisions must be taken in the design of an MPI library usable for heterogeneous metacomputing. Major choices and the decisions in the mentioned projects are reviewed in this section.

The glue approach versus the portable multi-protocol MPI implementation: The goals of high efficiency and full MPI functionality can be achieved by the following alternatives (Fig. 2): (a) The glue layer exports an MPI interface to the application and imports the native MPIs in each parallel system and a global communication module which allows communication between parallel systems. This general glue approach is used in MPI-GLUE, PACX-MPI, PVMPI and PLUS. (b) In the portable multi-protocol MPI approach (used in Globus) each parallel system has its own local communication module. A common global communication module enables the communication between the parallel systems. The MPI library uses both modules. In this approach the global communication module is usually very small in comparison with the glue approach: it only allows byte message transfers. The main disadvantage of this approach is that the performance of the native MPI library cannot be used.

Global world versus dynamically created bridges: The standards MPI 1.1 and MPI 2.0 imply two metacomputing alternatives: (a) MPI_COMM_WORLD collects all processes (in MPI-GLUE, PACX-MPI and Globus), or (b) each partition of the cluster has a separate MPI_COMM_WORLD and inter-communicators connect the partitions (this does not conform to MPI 1.1; used in PVMPI and PLUS).

Global communication using an own module versus importing a portable MPI: The global communication module in the glue approach can be (a) a portable TCP-based MPI implementation (in MPI-GLUE), or (b) a special TCP-based module (in PACX-MPI, PVMPI and PLUS). With (a) it is possible to implement the full MPI 1.1 standard without reimplementing most of MPI's functionality inside of this special module.
Fig. 2. The glue approach (left) and the multi-protocol MPI approach (right).
Using daemons or not using daemons: The global communication can be (a) concentrated by daemons or (b) done directly. In the daemon-based approach (a) the application processes send their messages on the local network (e.g. by using the native MPI) to a daemon, which transfers them on the global network (e.g. with TCP) to the daemon of the target system. From there the messages are received by the application processes within the local network of the target system. In the direct-communication approach (b) the work of the daemons is done by the operating system and each node communicates directly via the global network. With the daemon-based approach the application can fully parallelize computation and communication. The application is blocked only by native MPI latencies although in the background slow global communication is executed. Currently only the direct-communication approach is implemented in MPI-GLUE, but it would be no problem to add the daemon-based approach. A detailed discussion can be found in [12].

The first progress problem: All testing and blocking MPI calls that have to look at processes connected by the local network and at other processes connected by the global network should take into account that the latencies for probing the local and the global network are different. On clusters of MPPs the ratio can be 1:100. Examples are a call to MPI_Recv with source=MPI_ANY_SOURCE, or a call to MPI_Waitany with requests handling local and global communication. More a workaround than a solution are asymmetric polling strategies: if polling is necessary, then only in every n-th round does the central polling routine inquire the state of the global interface, while in each round the local interface is examined; n is given by the ratio mentioned above. Without this trick the latency for local communication may be expanded to the latency of the global communication. With this trick the latency for global communication may in the worst case be twice the latency of the underlying communication routine. This first progress problem arises in all systems that use no daemons for the global communication (MPI-GLUE, PVMPI configured without daemons, and Globus). In MPI-GLUE the asymmetric polling strategy has been implemented. In Globus it is possible to specify different polling intervals for each protocol.
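The asymmetric polling strategy can be sketched as follows (the probe functions and the ratio constant are hypothetical placeholders, not MPI-GLUE's actual internal interface): the fast local interface is examined in every round, the slow global interface only in every POLL_RATIO-th round.

```c
/* Hypothetical probe functions for the fast local (native MPI) and the
 * slow global (TCP-based MPI) interface; each returns 1 and fills *src
 * when a matching message is pending. */
extern int mpl_iprobe_any(int *src);
extern int mpx_iprobe_any(int *src);

/* Assumed ratio between global and local probe latency (~100 on a
 * cluster of MPPs). */
#define POLL_RATIO 100

/* Block until a message is available on either interface.
 * Returns 0 if it arrived locally (MPL), 1 if globally (MPX). */
int wait_any_source(int *src)
{
    unsigned round = 0;
    for (;;) {
        if (mpl_iprobe_any(src))          /* local: checked every round */
            return 0;
        if (round % POLL_RATIO == 0 &&    /* global: every POLL_RATIO-th round */
            mpx_iprobe_any(src))
            return 1;
        round++;
    }
}
```

This bounds the extra delay seen by global messages to roughly one additional polling period while keeping local receives as fast as the native MPI allows.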
567 using sends in the global M P I and these sends are buffered at the sender and the sender is blocked in the native collective operation. This problem can be solved in daemon-based systems by sending all messages (with a destination in another parallel system) immediately to the daemons and buffering t h e m there (implemented in PACX-MPI and PLUS). It can also be solved by modifying the central probe and dispatching routine inside the native MPI. T h e analogous problem (how to give the native M P I a chance to m a k e progress, while the glue layer blocks in a global collective routine) usually only exists, if one uses an existing MPI as a global communication module, b u t in M P I - G L U E this problem has been solved, because the global M P I itself is a multi-protocol implementation which uses the local M P I as one protocol stack.
5 Design Details
The notation "MPI" is used for the MPI interface exported by this glue layer and for the glue layer software. The notation "MPX" is used for the imported TCP-based MPI 1.1 library, in which all names are modified and begin with "MPX_" instead of "MPI_". MPX allows the exchange of messages between all processes. The notation "MPL" is used for the native (local) MPI 1.1 library. All names are modified ("MPL_" instead of "MPI_"). All MPL_COMM_WORLDs are disjoint and in the current version each process is member of exactly one MPL_COMM_WORLD (that may have any size ≥ 1). Details are given in the table below and in [12]. In the table "X" and "L" are abbreviations for MPX and MPL.

MPI constant, handle, routine | implemented by using | remarks
ranks                         | X                    | i.e. the MPI ranks are equal to the MPX ranks.
group handling                | X                    | i.e. group handles and handling are implemented with MPX.
communicators                 | L and X              | and roots-communicator (see Sec. Optimization of collective operations), and a mapping of the L ranks to the X ranks and vice versa.
datatypes                     | L and X              |
requests                      | L or X, or delayed   | MPI_IRECVs with MPI_ANY_SOURCE and communicators spanning more than one MPL region store all arguments of the IRECV call into the handle's structure. The real RECV is delayed until the test or wait calls.
sends, receives               | L or X               | receives only with known source partition.
receives with ANY_SOURCE:     |                      |
  - blocking                  | L and X              | with asymmetric polling strategy.
  - nonblocking               | GLUE                 | postponed until the wait and test routines, see requests above.
wait and test                 | L or X or L&X        | MPI_Testall: testing inside partial areas of the communicator may complete request handles, but the interface definition requires that either none or all requests are completed. Therefore (only for Testall), requests may be MPL/MPX-finished, but not MPI-finished.
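The rank mapping and the send dispatch described in the table can be pictured with the following sketch; the names rank_map, MPL_Send and MPX_Send are illustrative stand-ins for the renamed native and portable libraries, not the actual MPI-GLUE code.

```c
/* Illustrative glue-layer bookkeeping: the MPI rank equals the MPX rank,
 * and for every MPX rank the layer records which MPL world that process
 * belongs to and its rank inside that world. */
typedef struct {
    int mpl_world;   /* index of the native MPI partition */
    int mpl_rank;    /* rank within that partition        */
} rank_map_t;

extern rank_map_t rank_map[];   /* indexed by global (MPX/MPI) rank */
extern int my_mpx_rank;

extern int MPL_Send(const void *buf, int count, int dtype, int dest_local);
extern int MPX_Send(const void *buf, int count, int dtype, int dest_global);

/* Glue-layer point-to-point send: same partition -> native MPI (L),
 * other partition -> portable TCP-based MPI (X). */
int glue_send(const void *buf, int count, int dtype, int dest)
{
    if (rank_map[dest].mpl_world == rank_map[my_mpx_rank].mpl_world)
        return MPL_Send(buf, count, dtype, rank_map[dest].mpl_rank);
    return MPX_Send(buf, count, dtype, dest);
}
```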
The MPI process start and MPI_Init are mapped to the MPL process start, MPL_Init, the MPX process start and MPX_Init. The user starts the application processes on the first parallel system with its MPL process start facility. MPI_Init first calls MPL_Init and then MPX_Init. The latter is modified to reflect that the parallel partition has already been started by MPL. Based on special arguments at the end of the argument list and based on the MPL ranks, the first node recognizes inside of MPX_Init its role as MPICH's big master. It interactively starts the other partitions on the other parallel systems or uses a remote shell method (e.g. rsh) to invoke the MPL process start facility there. There as well, the MPI_Init routine first invokes MPL_Init and then MPX_Init. Finally MPI_Init initializes the predefined handles. MPI_Finalize is implemented in the reverse order.

Optimization of collective operations: It is desirable that metacomputing applications do not synchronize processes over the TCP links. This implies that they should not call collective operations on global communicators, especially those operations which imply a barrier synchronization (barrier, allreduce, reduce_scatter, allgather(v), alltoall(v)). The following schemes help to minimize the latencies if the application still uses collective operations over global communicators. In general, a collective routine is mapped to the corresponding routine of the native MPI if the communicator lies completely inside one local MPL world. To optimize MPI_Barrier, MPI_Bcast, MPI_Reduce and MPI_Allreduce if the communicator belongs to more than one MPL world, there is, for each MPI communicator, an additional internal roots-communicator that combines the first process of every underlying MPL communicator by using MPX; e.g. MPI_Bcast is implemented by a call to the native Bcast in each underlying MPL communicator and one call to the global Bcast in the roots-communicator. A sketch of this two-level scheme is given below.
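A minimal sketch of the two-level broadcast in standard MPI, under the simplifying assumptions that the root is process 0 of its partition and that the roots-communicator exists only on the partition leaders (MPI_COMM_NULL elsewhere); the real MPI-GLUE code handles arbitrary roots and the MPL/MPX split internally:

```c
#include <mpi.h>

/* Two-level broadcast with root 0: first a broadcast among the first
 * processes of each partition (roots_comm, the analogue of the global
 * MPX connection), then a broadcast inside each partition (local_comm,
 * the analogue of one MPL world). */
void glue_bcast(void *buf, int count, MPI_Datatype type,
                MPI_Comm local_comm, MPI_Comm roots_comm)
{
    if (roots_comm != MPI_COMM_NULL)             /* partition leaders only */
        MPI_Bcast(buf, count, type, 0, roots_comm);
    MPI_Bcast(buf, count, type, 0, local_comm);  /* inside each partition  */
}
```

Only one message per partition crosses the slow global network; all remaining traffic stays on the fast local interconnect.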
6 Status, Future Plans, and Acknowledgments
MPI-GLUE implements the full MPI 1.1 standard except for some functionalities. Some restrictions result from the glue approach: (a) it is impossible to allow an unlimited number of user-defined operations; (b) messages must be and need to be received with datatype MPI_PACKED if and only if they are sent with MPI_PACKED. Some other (hopefully minor) functionalities are still waiting to be implemented. A detailed list can be found in [12]. MPI-GLUE is portable. Current test platforms are SGI, CRAY T3E and Intel Paragon. Possible future extensions are the implementation of MPI 2.0 functionalities, optimization of further collective routines, and integrating a daemon-based communication module inside the used portable MPI. I would like to thank the members of the MPI-2 Forum and of the PACX-MPI project as well as the developers of MPICH who have helped me in the discussion which led to the present design and implementation of MPI-GLUE. And I would like to thank Oliver Hofmann for implementing a first prototype in his diploma thesis, Christoph Grunwald for implementing some routines during his practical course, especially some wait, test and topology functions, and Matthias Müller for testing MPI-GLUE with his P3T-DSMC application.
7 Summary
MPI-GLUE achieves interoperability for the message passing interface MPI. For this it combines existing vendors' MPI implementations, losing neither full MPI 1.1 functionality nor the vendors' MPIs' efficiency. MPI-GLUE targets existing MPI applications that need clusters of MPPs or PVPs to solve large-scale problems. MPI-GLUE enables metacomputing by seamlessly expanding MPI 1.1 to any cluster of parallel systems. It exports a single virtual parallel MPI machine to the application. It combines all processes in one MPI_COMM_WORLD. All local communication is done by the local vendor's MPI and all global communication is implemented with MPICH directly, without using daemons. The design has only two unavoidable restrictions, which concern user-defined reduce operations and packed messages.
References
1. Beisel, T., Gabriel, E., Resch, M.: An Extension to MPI for Distributed Computing on MPPs. In Marian Bubak, Jack Dongarra, Jerzy Wasniewski (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, LNCS, Springer, 1997, 75-83.
2. Bönisch, T., Rühle, R.: Portable Parallelization of a 3-D Flow-Solver. Parallel Comp. Fluid Dynamics '97, Elsevier, Amsterdam, to appear.
3. Brune, M., Gehring, J., Reinefeld, A.: A Lightweight Communication Interface for Parallel Programming Environments. In Proc. High-Performance Computing and Networking HPCN'97, LNCS, Springer, 1997.
4. Fagg, G.E., Dongarra, J.J.: PVMPI: An Integration of the PVM and MPI Systems. Department of Computer Science Technical Report CS-96-328, University of Tennessee, Knoxville, May 1996.
5. Foster, I., Kesselman, C.: Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications (to appear).
6. Foster, I., Geisler, J., Kesselman, C., Tuecke, S.: Managing Multiple Communication Methods in High-Performance Networked Computing Systems. J. Parallel and Distributed Computing, 40:35-48, 1997.
7. Foster, I., Geisler, J., Tuecke, S.: MPI on the I-WAY: A Wide-Area, Multimethod Implementation of the Message Passing Interface. Proc. 1996 MPI Developers Conference, 10-17, 1996.
8. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22, 1996.
9. MPI: A Message-Passing Interface Standard. Message Passing Interface Forum, June 12, 1995.
10. MPI-2: Extensions to the Message-Passing Interface. Message Passing Interface Forum, July 18, 1997.
11. Müller, M.: Weltrekorde durch Metacomputing. BI 11+12, Regionales Rechenzentrum Universität Stuttgart, 1997.
12. Rabenseifner, R.: MPI-GLUE: Interoperable high-performance MPI combining different vendor's MPI worlds. Technical report, Computing Center University Stuttgart. http://www.hlrs.de/people/rabenseifner/publ/mpiglue_tr_apr98.ps.Z
High Performance Protocols for Clusters of Commodity Workstations

P. Melas and E. J. Zaluska
Electronics and Computer Science, University of Southampton, U.K.
pm95r@ecs.soton.ac.uk, [email protected]
Abstract. Over the last few years technological advances in microprocessor and network technology have improved dramatically the performance achieved in clusters of commodity workstations. Despite those impressive improvements, the cost of communication processing is still high. Traditional layered network protocols fail to achieve high throughputs because they access data several times. Network protocols which avoid routing through the kernel can remove this limit on communication performance and support very high transmission speeds which are comparable to the proprietary interconnects found in Massively Parallel Processors.
1 Introduction
Microprocessors have improved their performance dramatically over the past few years. Network technology has over the same period achieved impressive performance improvements in both Local Area Networks (LAN) and Wide Area Networks (WAN) (e.g. Myrinet and ATM) and this trend is expected to continue for the next several years. The integration of high-performance microprocessors into low-cost computing systems, the availability of high-performance LANs and the availability of common programming environments such as MPI have enabled Networks Of Workstations (NOWs) to emerge as a cost-effective alternative to Massively Parallel Processors (MPPs) providing a parallel processing environment that can deliver high performance for many practical applications. Despite those improvements in hardware performance, applications on NOWs often fail to observe the expected performance speed-up. This is largely because the communication software overhead is several orders of magnitude larger than the hardware overhead [2]. Network protocols and the Operating System (OS) frequently introduce a serious communication bottleneck, (e.g. traditional UNIX layering architectures, TCP/IP, interaction with the kernel, etc). New software protocols with less interaction between the OS and the applications are required to deliver low-latency communication and exploit the bandwidth of the underlying intercommunication network [9]. Applications or libraries that support new protocols such as BIP [4], Active Messages [6], Fast Messages [2], U-Net [9] and network devices for ATM and
Myrinet networks demonstrate a remarkable improvement in both latency and bandwidth, which are frequently directly comparable to MPP communication performance. This report investigates commodity "parallel systems", taking into account the internode communication network in conjunction with the communication protocol software. Latency and bandwidth tests have been run on various MPPs (SP2, CS2) and a number of clusters using different workstation architectures (SPARC, i86, Alpha), LANs (10 Mbit/s Ethernet, 100 Mbit/s Ethernet, Myrinet) and operating systems (Solaris, NT, Linux). The work is organised as follows: a review of the efficiency of the existing communication protocols, and in particular TCP/IP, from the NOW perspective is presented in the first section. A latency and bandwidth test of various clusters and an analysis of the results is presented next. Finally, an example specialised protocol (BIP) is discussed, followed by conclusions.
2 Clustering with TCP/IP over LANs
The task of a communication sub-system is to transfer data transparently from one application to another application, which could reside on another node. TCP/IP is currently the most widely-used network protocol in LANs and NOWs, providing a connection-oriented reliable byte stream service [8]. Originally TCP/IP was designed for WANs with a relatively high error rate. The philosophy behind the TCP/IP model was to provide reliable communication between autonomous machines rather than a common resource [6]. In order to ensure reliability over an unreliable subnetwork, TCP/IP uses features such as an in-packet end-to-end checksum, the "SO_NDELAY" flag, the "Time-to-live" IP field, packet fragmentation/reassembly, etc. All these features are useful in WANs, but in LANs they represent redundancy and consume vital computational power [6]. Several years ago the bandwidth of the main memory and the disk I/O of a typical workstation was an order of magnitude faster than the physical network bandwidth. The difference in magnitude was invariably sufficient for the existing OS and communication protocol stacks to saturate network channels such as 10 Mbit/s Ethernet [3]. Despite hardware improvements, workstation memory access time and internal I/O bus bandwidth have not increased significantly during this period (the main improvements in performance have come from caches and a better understanding of how compilers can exploit the potential of caches). Faster networks have thus resulted in the gap between the network bandwidth and the internal computer resources being considerably reduced [3]. The fundamental design objective of traditional OS and protocol stacks at that time was reliability, programmability and process protection over an unreliable network. In a conventional implementation of TCP/IP a send operation involves the following stages of moving data: data from the application buffer are copied to the kernel buffer, then packet forming and calculation of the headers and the checksum take place, and finally packets are copied into the network
interface for transmission. This requires extra context switching between applications and the kernel for each system call, additional copies between buffers and address spaces, and increased computational overhead [7].
Fig. 1. a) Conventional TCP/IP implementation, b) MPI on top of the TCP/IP protocol stack, c) the BIP protocol stack approach.
As a result, clusters of powerful workstations still suffer a degradation in performance even when a fast interconnection network is provided. In addition, the network protocols in common use are unable to fully exploit all of the hardware capability, resulting in low bandwidth and high end-to-end latency. Parallel applications running on top of communication libraries (e.g. MPI, PVM, etc.) add an extra layer on top of the network communication stack (see Fig. 1). Improved protocols attempt to minimise the critical communication path of application-kernel-network device [4, 6, 2, 9]. Additionally, these protocols exploit advanced hardware capabilities (e.g. network devices with enhanced DMA engines and co-processors) by moving as much as possible of their functionality into hardware. Applications can also interact directly with the network interface, avoiding any system calls or kernel interaction. In this way processing overhead is considerably reduced, improving both latency and throughput.
3 Latency and Bandwidth Measurements
In order to estimate and evaluate the effect of different communication protocols on clusters, a ping-pong test measuring both peer-to-peer latency and bandwidth between two processors over MPI was written [2]. The latency test measures the time a node needs to send a sequence of messages and receive back the echo. The bandwidth test measures the time required for a sequence of back-to-back messages to be sent from one node to another. In both tests receive operations were posted before the send ones. The message size in tests always refers to the payload. Each test is repeated many times in order to reduce any clock jitter, first-time and warm-up effects, and the median time from each measurement is presented in the results. Various communication models [1, 5] have been developed in order to evaluate communication among processors in parallel systems. In this paper we follow a
linear approach with a start-up time α (constant per-segment cost) and a variable per-byte cost β. The message length at which half the maximum bandwidth is achieved (n_{1/2}) is an important indication as well.

    T(n) = α + β · n        (1)
Parameters that can influence tests and measurements are also taken into account for each platform in order to analyse the results better; i.e., measurements on non-dedicated clusters have also been made to assess the effect of interference with other workload.
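A minimal sketch of such a ping-pong measurement written against the standard MPI C interface is shown below; the message length, repetition count and output format are illustrative choices, not the exact benchmark code used for the measurements reported here.

```c
#include <mpi.h>
#include <stdio.h>

#define REPS   1000
#define MSGLEN 1024          /* payload in bytes (illustrative) */

int main(int argc, char **argv)
{
    char buf[MSGLEN] = {0};
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            /* post the receive for the echo before sending */
            MPI_Irecv(buf, MSGLEN, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Send(buf, MSGLEN, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Irecv(buf, MSGLEN, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGLEN, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / REPS;             /* round-trip time   */
        printf("latency   %g us\n", 0.5e6 * rtt);  /* one-way latency   */
        printf("bandwidth %g MB/s\n", 2.0 * MSGLEN / rtt / 1e6);
    }
    MPI_Finalize();
    return 0;
}
```

Fitting the measured times for several message lengths to T(n) = α + β·n then yields the start-up cost α, the asymptotic bandwidth 1/β and n_{1/2} = α/β.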
Tests on MPPs (SP2 and CS2). Latency and bandwidth tests have been run on the SP2 and CS2 at the University of Southampton and on the SP2 at Argonne National Laboratory. The results of the MPP tests are used as a benchmark to analyse and compare the performance of NOW clusters. On both SP2 machines the native IBM MPI implementation was used, while on the CS2 the mpich 1.0.12 version of MPI was used. The difference in performance between the two SP2 machines is due to the different type of high-performance switch (TB2/TB3). The breakpoint at 4 KB on the SP2 graph is due to the change in protocol used for sending small and large messages in the MPI implementation.

The NT cluster. The NT cluster is composed of 8 DEC Alpha based workstations and uses a dedicated interconnection network based around a 100 Mbit/s Ethernet switch. The Abstract Device Interface makes use of the TCP/IP protocol stack that NT provides. Results on latency and bandwidth are not very impressive (Fig. 2), because of the experimental version of the MPI implementation used. The system was unable to make any effective use of the communication channel, even on large messages. An unnecessarily large number of context switches resulted in very low performance.
Fig. 2. Latency and bandwidth on NOW clusters.
The Linux cluster. This cluster is built around Pentium-Pro machines running Linux 2.0.x. Tests have been run using both the 10 Mbit/s channel and FastEthernet with MPI version mpich 1.1. A poor performance was observed, similar to the NT cluster, although in this case the non-dedicated interconnection network affected some of the results. Although its performance is slightly better than the NT cluster, the system fails to utilise the underlying hardware efficiently. In Fig. 2 we can see a break point in the bandwidth plots close to 8 KB due to the interaction between TCP/IP and the OS.

The Solaris cluster. The Solaris cluster uses a non-dedicated 10 Mbit/s Ethernet segment running SunOS 5.5.1 and the communication library is mpich 1.0.12. This cluster has the best performance among the clusters for 10 Mbit/s Ethernet. Both SPARC-4 and ULTRA SPARC nodes easily saturated the network channel and pushed the communication bottleneck down to the Ethernet board. The non-dedicated intercommunication network had almost no effect on the results.

The Myrinet cluster. This cluster is built around Pentium-Pro machines running Linux 2.0.1, with an interconnection based on a Myrinet network. The communication protocol is the Basic Interface for Parallelism (BIP), an interface for network communication which is targeted towards message-passing parallel computing. This interface has been designed to deliver to the application layer the maximum performance achievable by the hardware [3]. Basic features of this protocol are a zero-copy protocol, user application level direct interaction with the network board, system call elimination and exploitative use of memory bandwidth. BIP can easily interface with other protocol stacks and interfaces such as IP or APIs (see Fig. 1). The cluster is flexible enough to configure the API either as a typical TCP/IP stack running over 10 Mbit/s Ethernet, TCP/IP over Myrinet, or directly on top of BIP. In the first case the IP protocol stack on top of the Ethernet network provides similar performance to the Linux cluster discussed above, although its performance is comparable to the Sun cluster. Latency for zero-size messages has dropped to 290 µs and the bandwidth is close to 1 MB/s for messages larger than 1 KB. Changing the physical network from Ethernet to Myrinet via the TCP/BIP protocol provides a significant performance improvement. Zero-size message latency is 171 µs and the bandwidth reaches 18 MB/s, with an n_{1/2} figure below 1.5 KB. A further change to the configuration enables the application (MPI) to interact directly with the network interface through BIP. The performance improvement in this case is impressive, pushing the network board to its design limits. Zero-length message latency is 11 µs and the bandwidth exceeds 114 MB/s with an n_{1/2} message size of 8 KB. Figure 3 shows the latency and bandwidth graphs for those protocol stack configurations. A noticeable discontinuity at message sizes of 256 bytes reveals the different semantics between short and long message transmission modes.
Fig. 3. Latency and bandwidth on a Myrinet cluster with page alignment.
4 Analysis of the Results
The performance difference between normal NOW clusters and parallel systems is typically two or more orders of magnitude. A comparable performance difference can sometimes be observed between the raw performance of a NOW cluster and the performance delivered at the level of the user application (the relative latency and bandwidth performance of the above clusters is presented in Fig. 2). With reference to the theoretical bandwidth of a 10 Mbit/s Ethernet channel with TCP/IP headers, we can see that only the Sun cluster approaches close to that maximum. The difference in computation power between the Ultra SPARC and SPARC-4 improves the latency and bandwidth of short messages. On the other hand, the NT and Linux clusters completely fail to saturate even a 10 Mbit/s Ethernet channel, and an order of magnitude improvement in the interconnection channel to FastEthernet does not improve throughput significantly. In both clusters the communication protocol implementation drastically limits performance. It is worth mentioning that both software and hardware for the Solaris cluster nodes come from the same supplier; therefore the software is very well tuned and exploits all the features available in the hardware. By contrast, the open market in PC-based systems requires software to be more general and thus less efficient. Similar relatively low performance was measured on the Myrinet cluster using the TCP/IP protocol stack. The system exploits a small fraction of the network bandwidth and the latency improvement is small. Replacing the communication protocol with BIP, the Myrinet cluster is then able to exploit the network bandwidth and the application-level performance becomes directly comparable with MPPs such as the SP2, T3D, CS2, etc. As can be seen from Fig. 4, the Myrinet cluster, when compared to those MPPs, has significantly better latency over the whole range of the measurements made. In bandwidth terms, for short messages up to 256 bytes Myrinet outperforms all the other MPPs. Then for messages up to 4 KB (which is the breakpoint of the SP2 at Argonne) the SP2 has a higher performance, but after this point the performance of Myrinet is again better.
Table 1. Latency and bandwidth results.

Configuration      Cluster       H/W       min Lat.   max BW      n_{1/2}
MPI over TCP/IP    NT/Alpha      FastEth.  673 µs     120 KB/s    100
MPI over TCP/IP    Linux/P.Pro   Ethernet  587 µs     265 KB/s    1.5 K
MPI over TCP/IP    Linux/P.Pro   FastEth.  637 µs     652 KB/s    1.5 K
MPI over TCP/IP    SPARC-4       Ethernet  660 µs     1.03 MB/s   280
MPI over TCP/IP    ULTRA         Ethernet  1.37 ms    1.01 MB/s   750
MPI over TCP/IP    Linux/P.Pro   Ethernet  280 µs     1 MB/s      300
MPI over TCP/BIP   Linux/P.Pro   Myrinet   171 µs     17.9 MB/s   1.5 K
MPI over BIP       Linux/P.Pro   Myrinet   11 µs      114 MB/s    8 K
Fig. 4. Comparing latency and bandwidth between a Myrinet cluster and MPPs.
5 Conclusion
In this paper the importance and the impact of some of the most common interconnection network protocols for clusters have been examined. The TCP/IP protocol stack was tested with different cluster technologies and comparisons made with other specialised network protocols. Traditional communication protocols utilising the critical application-kernel-network path impose an excessive communication processing cost and cannot exploit any advanced features of the hardware. Conversely, communication protocols that enable applications to interact directly with network interfaces, such as BIP, Fast Messages, Active Messages, etc., can improve the performance of clusters considerably. The majority of parallel applications use small messages to coordinate program execution, and for this size of message the latency overhead dominates the transmission time. In this case the communication cost cannot be hidden by any programming model or technique such as overlapping or pipelining. Fast communication protocols that deliver a drastically reduced communication latency are necessary for clusters to be used effectively in a wide range of scientific and commercial parallel applications as an alternative to MPPs.
ACKNOWLEDGEMENTS. We acknowledge the support of the Greek State Scholarships Foundation, Southampton University Computer Services, Argonne National Laboratory and LHPC in Lyon.
References
1. J. J. Dongarra and T. Dunigan. Message-passing performance of various computers. Technical Report UT-CS-95-299, Department of Computer Science, University of Tennessee, July 1995.
2. Mario Lauria and Andrew Chien. MPI-FM: High performance MPI on workstation clusters. Journal of Parallel and Distributed Computing, 40(1):4-18, January 1997.
3. Loïc Prylli and Bernard Tourancheau. BIP: A new protocol design for high performance networking on Myrinet. Technical report, LIP-ENS Lyon, September 1997.
4. Loïc Prylli. Draft: BIP messages user manual for BIP 0.92. Technical report, Laboratoire de l'Informatique du Parallélisme, Lyon, 1997.
5. PARKBENCH Committee: Report-1, assembled by Roger Hockney (chairman) and Michael Berry (secretary). Public international benchmarks for parallel computers. Technical Report UT-CS-93-213, Department of Computer Science, University of Tennessee, November 1993.
6. Steven H. Rodrigues, Thomas E. Anderson, and David E. Culler. High-performance local-area communication with fast sockets. In USENIX 1997 Annual Technical Conference, January 6-10, 1997, Anaheim, CA, pages 257-274, Berkeley, CA, USA, January 1997. USENIX.
7. S. Pakin, M. Lauria, A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing '95, San Diego, California, 1995.
8. W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison Wesley, Reading, 1995.
9. Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: a user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), volume 29, pages 303-316, 1995.
Significance and Uses of Fine-Grained Synchronization Relations

Ajay D. Kshemkalyani
Dept. of Electrical & Computer Engg. and Computer Science, University of Cincinnati, P.O. Box 210030, Cincinnati, OH 45221-0030, USA
[email protected]
Abstract. In a distributed system, high-level actions can be modeled by nonatomic events. Synchronization relations between distributed nonatomic events have been proposed to allow applications a fine choice in specifying synchronization conditions. This paper shows how these fine-grained relations can be used for various types of synchronization and coordination in distributed computations.
1 Introduction
High-level actions in several distributed application executions are realistically modeled by nonatomic poset events (where at least some of the component atomic events of a nonatomic event occur concurrently), for example, in distributed multimedia support, coordination in mobile computing, distributed debugging, navigation, planning, industrial process control, and virtual reality. It is important to provide these and emerging sophisticated applications a fine level of granularity in the specification of various synchronization/causality relations between nonatomic poset events. In addition, [20] stresses the theoretical and practical importance of the need for such relations. Most of the existing literature, e.g., [1, 2, 4, 5, 7, 10, 13, 15, 18-20], does not address this issue. A set of causality relations between nonatomic poset events was proposed in [11, 12] to specify and reason with a fine-grained specification of causality. This set of relations extended the hierarchy of the relations in [9, 14]. An axiom system on the proposed relations was given in [8]. The objective of this paper is to demonstrate the use of the relations in [11, 12]. The following poset event structure model is used, as in [4, 9, 10, 12, 14-16, 20]. (E, ≺) is a poset such that E represents points in space-time which are the primitive atomic events, related by the causality relation ≺, which is an irreflexive partial ordering. Elements of E are partitioned into local executions at a coordinate in the space dimensions. In a distributed system, E represents a set of events and is discrete. Each local execution E_i is a linearly ordered set of events in partition i. An event e in partition i is denoted e_i. For a distributed application, points in the space dimensions correspond to the set of processes (also termed nodes), and E_i is the set of events executed by process i. Causality between events at different nodes is imposed by message communication.
High-level actions in the computation are nonempty subsets of E. Formally, if ℰ denotes the power set of E and 𝒜 (≠ ∅) ⊆ (ℰ − ∅), then each element A of 𝒜 is termed an interval or a nonatomic event. It follows that if A ∩ E_i ≠ ∅, then (A ∩ E_i) has a least and a greatest event. 𝒜 is the set of all the sets that represent a higher-level grouping of the events of E that is of interest to an application.

Table 1. Some causality relations [9].

Relation r   Expression for r(X, Y)
R1           ∀x ∈ X ∀y ∈ Y, x ≺ y
R1'          ∀y ∈ Y ∀x ∈ X, x ≺ y
R2           ∀x ∈ X ∃y ∈ Y, x ≺ y
R2'          ∃y ∈ Y ∀x ∈ X, x ≺ y
R3           ∃x ∈ X ∀y ∈ Y, x ≺ y
R3'          ∀y ∈ Y ∃x ∈ X, x ≺ y
R4           ∃x ∈ X ∃y ∈ Y, x ≺ y
R4'          ∃y ∈ Y ∃x ∈ X, x ≺ y
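Given a causality test between atomic events (for instance one implemented with vector timestamps), the four unprimed relations of Table 1 can be evaluated directly from their quantifier form. The sketch below is illustrative; the event_t type and the precedes() predicate are assumptions, not definitions from the paper.

```c
#include <stdbool.h>

/* An atomic event identified by (process, local sequence number);
 * precedes(a, b) is assumed to implement the causality test a < b,
 * e.g. via vector timestamps (not shown here). */
typedef struct { int proc; int seq; } event_t;
extern bool precedes(event_t a, event_t b);

/* R1: forall x in X, forall y in Y : x precedes y */
bool R1(const event_t *X, int nx, const event_t *Y, int ny) {
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            if (!precedes(X[i], Y[j])) return false;
    return true;
}

/* R2: forall x in X, exists y in Y : x precedes y */
bool R2(const event_t *X, int nx, const event_t *Y, int ny) {
    for (int i = 0; i < nx; i++) {
        bool found = false;
        for (int j = 0; j < ny && !found; j++)
            found = precedes(X[i], Y[j]);
        if (!found) return false;
    }
    return true;
}

/* R3: exists x in X, forall y in Y : x precedes y */
bool R3(const event_t *X, int nx, const event_t *Y, int ny) {
    for (int i = 0; i < nx; i++) {
        bool all = true;
        for (int j = 0; j < ny && all; j++)
            all = precedes(X[i], Y[j]);
        if (all) return true;
    }
    return false;
}

/* R4: exists x in X, exists y in Y : x precedes y */
bool R4(const event_t *X, int nx, const event_t *Y, int ny) {
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            if (precedes(X[i], Y[j])) return true;
    return false;
}
```

The primed variants R2' and R3' are obtained by swapping the order of quantification, which matters only for poset (as opposed to linear) intervals.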
The relations in [9] formed an exhaustive set of causality relations to express the possible interactions between a pair of linear intervals. These relations R1–R4 and R1'–R4' from [9] are expressed in terms of the quantifiers over X and Y in Table 1. Observe that R2' and R3' are different from R2 and R3, respectively, when applied to posets. Table 2 gives the hierarchy and inclusion relationships of the causality relations R1–R4. Each cell in the grid indicates the relationship of the row header to the column header. The notation for the inclusion relationship between causality relations is as follows. The inclusion relation "is a subrelation of" is denoted '⊑' and its inverse is '⊒'. For relations r1 and r2, we define r1 ∥ r2 to be (r1 ⋢ r2 ∧ r2 ⋢ r1). The relations {R1, R2, R3, R4} form a lattice hierarchy ordered by ⊑.

Table 2. Inclusion relationships between relations of Table 1 [9]. Each cell gives the relation of the row header to the column header.

       R1   R2   R3   R4
R1     =    ⊑    ⊑    ⊑
R2     ⊒    =    ∥    ⊑
R3     ⊒    ∥    =    ⊑
R4     ⊒    ⊒    ⊒    =
The relations in [9] formed a comprehensive set of causality relations to express all possible interactions between a pair of linear intervals using only the ≺ relation between atomic events, and extended the partial hierarchy of relations of [14]. However, when the relations of [9] are applied to a pair of poset intervals, the hierarchy they form is incomplete. Causality relations between a pair of nonatomic poset intervals were formulated by extending the results of [8, 9] to nonatomic poset events [11, 12]. The relations form a "comprehensive" set of causality relations between nonatomic poset events using first-order predicate logic and only the ≺ relation between atomic events, and fill in the existing partial hierarchy formed by the relations in [9, 14]. A relational algebra for the relations in [11, 12] is given in [8]. Given any relation(s) between X and Y, the relational algebra allows the derivation of conjunctions, disjunctions, and negations of all other relations that are also valid between X and Y, as well as between Y and X. Section 2 reviews the hierarchy of causality relations [11, 12]. Section 3 gives the uses of each of the relations. Some uses of the relations include modeling various forms of synchronization for group mutual exclusion, initiation of nested computing, termination of nested processing, and monitoring the start of a computation. Section 4 gives concluding remarks. The results of this paper are included in [8].
2 Relations between Nonatomic Poset Events
Previous work on linear intervals and time durations, e.g., [1,2,4,5], identifies an interval by the instants of its beginning and end. Given a nonatomic poset interval, one needs to define counterparts for the beginning and end instants. These counterparts serve as "proxy" events for the poset interval just as the events at the beginning and end of linear intervals such as time durations serve as proxies for the linear interval. The proxies identify the durations on each node in which the nonatomic event occurs. Two possible definitions of proxies are (i) L_X = {e_i ∈ X | ∀e_i' ∈ X, e_i ≼ e_i'} and U_X = {e_i ∈ X | ∀e_i' ∈ X, e_i ≽ e_i'}, and (ii) L_X = {e ∈ X | ∀e' ∈ X, e ⊁ e'} and U_X = {e ∈ X | ∀e' ∈ X, e ⊀ e'}. Assume that one definition of proxies is consistently used, depending on context and application. Fig. 1 depicts the proxies of X and Y. There are two aspects of a relation that can be specified between poset intervals. One aspect deals with the determination of an appropriate proxy for each interval. A proxy for X and Y can be chosen in four ways corresponding to the relations in {R1, R2, R3, R4}. From Table 1, it follows that these four relations form a lattice ordered by ⊆. The second aspect deals with how the atomic elements of the chosen proxies are related by causality. The chosen proxies can be related by the eight relations R1, R1', R2, R2', R3, R3', R4, R4' of Table 1, which are renamed a, a', b, b', c, c', d, d', respectively, to avoid confusion with their original names used for the first aspect of specifying the relations between poset intervals. The inclusion hierarchy among the six distinct relations forms a lattice ordered by ⊆; see Table 3. The two aspects of deriving causality relations, described above, are combined to define the relations. The lattice of relations {R1*, R2*, R3*, R4*} between proxies of X and Y, and the lattice of relations {a, a', b, b', c, c', d, d'} between
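As an illustration only (not from the paper), under definition (ii) the proxies are simply the minimal and maximal elements of X with respect to ≺, and can be computed directly from a precedence matrix; the sketch below reuses the prec[][] representation introduced earlier.

#include <stdbool.h>

#define MAX_EV 64

/* Definition (ii) of the proxies:                                          */
/*   L_X = minimal elements of X (no element of X strictly precedes them),  */
/*   U_X = maximal elements of X (they strictly precede no element of X).   */

static int lower_proxy(bool prec[MAX_EV][MAX_EV],
                       const int *X, int nx, int *LX)
{
    int n = 0;
    for (int i = 0; i < nx; i++) {
        bool minimal = true;
        for (int j = 0; j < nx && minimal; j++)
            if (prec[X[j]][X[i]]) minimal = false;
        if (minimal) LX[n++] = X[i];
    }
    return n;                        /* number of events in the proxy L_X   */
}

static int upper_proxy(bool prec[MAX_EV][MAX_EV],
                       const int *X, int nx, int *UX)
{
    int n = 0;
    for (int i = 0; i < nx; i++) {
        bool maximal = true;
        for (int j = 0; j < nx && maximal; j++)
            if (prec[X[i]][X[j]]) maximal = false;
        if (maximal) UX[n++] = X[i];
    }
    return n;                        /* number of events in the proxy U_X   */
}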
Fig. 1. Poset events X and Y and their proxies.
Table 3. Full hierarchy of relations of Table 1 [9]. Relations R1, R1', R2, R2', R3, R3', R4, R4' of Table 1 are renamed a, a', b, b', c, c', d, d', respectively. Relations in the row and column headers are defined between X and Y; each cell gives the relation of the row header to the column header. Relation names and their quantifiers for x ≺ y: a (= a'): ∀x∀y (= ∀y∀x); b: ∀x∃y; b': ∃y∀x; c: ∃x∀y; c': ∀y∃x; d (= d'): ∃x∃y (= ∃y∃x).

        a    b    b'   c    c'   d
  a     =    ⊂    ⊂    ⊂    ⊂    ⊂
  b     ⊃    =    ⊃    ∥    ∥    ⊂
  b'    ⊃    ⊂    =    ∥    ∥    ⊂
  c     ⊃    ∥    ∥    =    ⊂    ⊂
  c'    ⊃    ∥    ∥    ⊃    =    ⊂
  d     ⊃    ⊃    ⊃    ⊃    ⊃    =
Fig. 2. Hierarchy of relations in [11, 12].
the elements of the proxies, give a product lattice of 32 relations over 𝒜 × 𝒜 to express r(X, Y). The resulting set of poset relations, denoted ℛ, is given in Table 4. The relations in ℛ form a lattice of 24 unique elements as shown in Fig. 2. ℛ is comprehensive using first-order predicate logic and only the ≺ relation between atomic events. Relation R?#(X, Y) means that proxies of X and Y are chosen as per ?, and events in the proxies are related by #. The two relations in [14], viz., → and ⇢, correspond to R1a and R4d, respectively, whereas the relations in [9] listed in Table 1 correspond to these relations as follows: R1 = R1', R2, R2', R3, R3', R4 = R4' correspond to R1a, R2b, R2b', R3c, R3c', R4d, respectively.
3 Significance of the Relations
The hierarchy of causality relations is useful for applications which use nonatomicity in reasoning and modeling and need a fine level of granularity of causality relations to specify synchronization relations and their composite global predicates between nonatomic events. Such applications include industrial process control applications, distributed debugging, navigation, planning, robotics, diagnostics, virtual reality, and coordination in mobile systems. The hierarchy provides a range of relations, and an application can use those that are relevant and useful to it. The relations and their composite (global) predicates provide a precise handle to express a naturally occurring, or enforce a desired, fine-grained level of causality or synchronization in the computation. A specific meaning of each relation and a brief discussion of how it can be enforced is now given.
Table 4. Relations r(X, Y) in ℛ [11,12]. Each relation is specified by its quantifiers for x ≺ y.

  R1a: ∀x ∈ U_X ∀y ∈ L_Y        R1a' (= R1a): ∀y ∈ L_Y ∀x ∈ U_X
  R1b: ∀x ∈ U_X ∃y ∈ L_Y        R1b': ∃y ∈ L_Y ∀x ∈ U_X
  R1c: ∃x ∈ U_X ∀y ∈ L_Y        R1c': ∀y ∈ L_Y ∃x ∈ U_X
  R1d: ∃x ∈ U_X ∃y ∈ L_Y        R1d' (= R1d): ∃y ∈ L_Y ∃x ∈ U_X
  R2a: ∀x ∈ U_X ∀y ∈ U_Y        R2a' (= R2a): ∀y ∈ U_Y ∀x ∈ U_X
  R2b: ∀x ∈ U_X ∃y ∈ U_Y        R2b': ∃y ∈ U_Y ∀x ∈ U_X
  R2c: ∃x ∈ U_X ∀y ∈ U_Y        R2c': ∀y ∈ U_Y ∃x ∈ U_X
  R2d: ∃x ∈ U_X ∃y ∈ U_Y        R2d' (= R2d): ∃y ∈ U_Y ∃x ∈ U_X
  R3a: ∀x ∈ L_X ∀y ∈ L_Y        R3a' (= R3a): ∀y ∈ L_Y ∀x ∈ L_X
  R3b: ∀x ∈ L_X ∃y ∈ L_Y        R3b': ∃y ∈ L_Y ∀x ∈ L_X
  R3c: ∃x ∈ L_X ∀y ∈ L_Y        R3c': ∀y ∈ L_Y ∃x ∈ L_X
  R3d: ∃x ∈ L_X ∃y ∈ L_Y        R3d' (= R3d): ∃y ∈ L_Y ∃x ∈ L_X
  R4a: ∀x ∈ L_X ∀y ∈ U_Y        R4a' (= R4a): ∀y ∈ U_Y ∀x ∈ L_X
  R4b: ∀x ∈ L_X ∃y ∈ U_Y        R4b': ∃y ∈ U_Y ∀x ∈ L_X
  R4c: ∃x ∈ L_X ∀y ∈ U_Y        R4c': ∀y ∈ U_Y ∃x ∈ L_X
  R4d: ∃x ∈ L_X ∃y ∈ U_Y        R4d' (= R4d): ∃y ∈ U_Y ∃x ∈ L_X
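The two aspects encoded in Table 4, choosing the proxies by group and then applying a quantifier pattern to them, compose mechanically. The sketch below is an illustration, not code from [8,11,12]; it reuses the quantifier checkers (r1, r2, r2p, r4, here playing the roles of a, b, b', d) and the proxy routines sketched earlier. Group 1 pairs U_X with L_Y, group 2 U_X with U_Y, group 3 L_X with L_Y, and group 4 L_X with U_Y.

/* Combine the two aspects of Table 4: pick the proxies for the group and   */
/* then apply one of the quantifier patterns to the chosen proxies.         */
typedef bool (*quant_fn)(bool prec[MAX_EV][MAX_EV],
                         const int *P, int np, const int *Q, int nq);

static bool poset_relation(bool prec[MAX_EV][MAX_EV],
                           const int *X, int nx, const int *Y, int ny,
                           int group,        /* 1..4, selects the proxies    */
                           quant_fn quant)   /* r1 (=a), r2 (=b), r2p (=b'),
                                                r4 (=d), ...                 */
{
    int PX[MAX_EV], PY[MAX_EV];
    int npx = (group <= 2) ? upper_proxy(prec, X, nx, PX)    /* U_X */
                           : lower_proxy(prec, X, nx, PX);   /* L_X */
    int npy = (group == 1 || group == 3)
                           ? lower_proxy(prec, Y, ny, PY)    /* L_Y */
                           : upper_proxy(prec, Y, ny, PY);   /* U_Y */
    return quant(prec, PX, npx, PY, npy);
}

/* Example: R2b'(X, Y) is poset_relation(prec, X, nx, Y, ny, 2, r2p). */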
In the following discussion, the "X computation" and "Y computation" refer to the computation performed by the nonatomic events X and Y, respectively. A proxy of X is denoted X̂. We first consider the significance of the groups of relations R*a(X, Y), R*b(X, Y), R*b'(X, Y), R*c(X, Y), R*c'(X, Y), and R*d(X, Y). Each group deals with a particular proxy X̂ and Ŷ.
- R*a(X, Y): All events in Ŷ know the results of the X computation (if any) up to all the events in X̂. This is a strong form of synchronization between X and Y.
- R*b(X, Y): For each event in X̂, some event in Ŷ knows the results of the X computation (if any) up to that event in X̂. The Y computation may then exchange information about the X computation up to X̂ among the nodes participating in the Y computation.
- R*b'(X, Y): Some event in Ŷ knows the results of the X computation (if any) up to all events in X̂. These relations are useful when it is sufficient for one node in N_Ŷ to detect a global predicate across all nodes in N_X̂. If the event in Ŷ is at a node that behaves as the group leader of N_Ŷ, then it can either inform the other nodes in N_Ŷ or make decisions on their behalf.
- R*c(X, Y): All events in Ŷ know the results of the X computation (if any) up to some common event in X̂. This group of relations is useful when it is sufficient for one node in N_X̂ to inform all the nodes in N_Ŷ of its state,
such as when all the nodes in N_X̂ have a similar state. If the node at which the event in X̂ occurs has already collected information about the results/states of the X computation up to X̂ from other nodes in N_X̂ (thus, that node behaves as the group leader of X̂), then the events in Ŷ will know the states of the X computation up to X̂.
- R*c'(X, Y): Each event in Ŷ knows the results of the X computation (if any) up to some event in X̂. If it is important to the application, then the state at each event in X̂ should be communicated to some event in Ŷ.
- R*d(X, Y): Some event in Ŷ knows the results of the X computation (if any) up to some event in X̂. The nodes under consideration, at which the events in Ŷ and X̂, respectively, occur, may be the group leaders of N_Ŷ and N_X̂, respectively. This group leader of N_X̂ may have collected relevant state information from other nodes in N_X̂, and conveys this information to the group leader of N_Ŷ, which in turn distributes the information to all nodes in N_Ŷ.
The above significance of each group of relations applies to each individual relation of that group. The specific use and meaning of each of the 24 relations in ℛ is given next. We do not restrict the explanation that follows to any specific application.
R1*(X, Y): This group of relations deals with U_X and L_Y. Each relation signifies a different degree of transfer of control for synchronization, as in group mutual exclusion (gmutex), from the X computation to the Y computation.
- R1a(X, Y): The Y computation at any node in N_Y begins only after that node knows that the X computation at each node in N_X has ended, e.g., a conventional distributed gmutex in which each node in N_Y waits for an indication from each node in N_X that it has relinquished control.
- R1b(X, Y): For every node in N_X, the final value of its X computation is known by (or its mutex token is transferred to) some node in N_Y before that node in N_Y begins its Y computation. Thus, nodes in N_Y collectively (but not individually) know the final value of the X computation before the last among them begins its Y computation. This is a weak version of synchronization/gmutex.
- R1b'(X, Y): Before beginning its Y computation, some node in N_Y knows the final value of the X computation at each node in N_X. This is a weak version of synchronization/gmutex (but stronger than R1b) with the property that at least one node in N_Y cannot begin its Y computation until the final value of the X computation at each node in N_X is known to it.
- R1c(X, Y): The final value of the X computation at some node in N_X is known to all the nodes in N_Y before they begin their Y computation. This is a weak form of synchronization/gmutex which is useful when it suffices for a particular node in N_X to grant all the nodes in N_Y gmutex permission to proceed with the Y computation; this node in N_X may be the group leader of N_X, or simply all the nodes in N_X have the same final local state of the X computation within this application.
- R1c'(X, Y): Each node in N_Y begins its Y computation only after it knows the final value of the X computation of some node in N_X. This is a weak form of synchronization/gmutex (weaker than R1c) which requires each node in N_Y to receive a final value (or gmutex token) from at least one node in N_X before starting its Y computation. This relation is sufficient for some applications such as those requiring that at most one (additional) process be admitted to join those in the critical section when one process leaves it.
- R1d(X, Y): Some node in N_Y begins its Y computation only after it knows the final value of (or receives a gmutex token from) the X computation at some node in N_X. This is the weakest form of synchronization/gmutex.
R2*(X, Y): This group of relations deals with U_X and U_Y. The relations can signify various degrees of synchronization between the termination of computations X and Y, where X is nested within Y or X is a subcomputation of Y. Alternately, Y could denote activity at processes that have already spawned X activity in threads, and Y can complete only after X completes.
- R2a(X, Y): The Y computation at any node in N_Y can terminate only after that node knows the final value of (or termination of) the X computation at each node in N_X. This is a strong synchronization before termination, between X and Y.
- R2b(X, Y): For every node in N_X, the final value of its X computation is known by at least one node in N_Y before that node in N_Y terminates its Y computation. Thus, all the nodes in N_Y collectively (but not individually) know the final values of the X computation before they terminate their Y computation. This is a weak synchronization before termination.
- R2b'(X, Y): Before terminating its Y computation, some node in N_Y knows the final value of the X computation at all nodes in N_X. This is a stronger synchronization before termination than R2b, wherein at least one node in N_Y cannot terminate its Y computation without knowing the final state of the X computation at all nodes in N_X. This suffices for all applications in which it is adequate for one node in N_Y to detect the termination of the X computation at each node in N_X before that node terminates its Y computation.
- R2c(X, Y): The final value of the X computation at some node in N_X is known to all the N_Y nodes before they terminate the Y computation. This is a weak form of synchronization. The pertinent node in N_X could represent a critical thread in the X computation, or could be the group leader of N_X that represents the X computation at all nodes in N_X.
- R2c'(X, Y): Each node in N_Y terminates its Y computation only after it knows the final value of the X computation at some node in N_X. This is a weak form of synchronization before termination (weaker than R2c), but is adequate when all the nodes in N_X are performing a similar X computation.
- R2d(X, Y): Some node in N_Y terminates its Y computation after it knows the final value of the X computation at some node in N_X. This is a weak form of synchronization; however, if the concerned nodes in N_X and N_Y are the respective group leaders of the X and Y computations and, respectively,
collect/distribute information from/to their groups, then a strong form of synchronization can be implicitly enforced because when Y terminates, it is known to each node in N_Y that the X computation has terminated.
R3*(X, Y): This group of relations deals with L_X and L_Y. The relations can signify various degrees of synchronization between the initiation of computations X and Y, where Y is nested within X or Y is a subcomputation of X. Alternately, X could denote activity at processes that have already spawned Y activity in threads.
- R3a(X, Y): The Y computation at any node in N_Y begins after that node knows the initial values of the X computation at each node in N_X. This is a strong form of synchronization between the beginnings of the X and Y computations.
- R3b(X, Y): For each node in N_X, the initial state of its X computation is known to some node in N_Y before that node in N_Y begins its Y computation. Thus, all the nodes in N_Y collectively (but not individually) know the initial state of the X computation. This synchronization is sufficient when the forked Y computations at each node in N_Y are only loosely coupled and should not know each others' initial states communicated by the X computation, while at the same time ensuring that the initial state of the X computation at each node in N_X is available to at least one node in N_Y before it commences its Y computation.
- R3b'(X, Y): Before beginning its Y computation, some node in N_Y knows the initial state of the X computation at all the nodes in N_X. Thus the Y computation at this node can run a parallel X computation for fault-tolerance, or can be made an entirely deterministic function of the inputs to the X computation. The subject node in N_Y can coordinate the Y computation of the other nodes in N_Y. This synchronization is weaker than R3a but stronger than R3b.
- R3c(X, Y): The initial state of the X computation at some node in N_X is known to all the nodes in N_Y before they begin their Y computation. This is a weak synchronization; however, it is adequate when the subject node in N_X has forked all the threads that will perform Y, and behaves as the group leader of X that initiates the nested computation Y.
- R3c'(X, Y): Each node in N_Y begins its Y computation only after it knows the initial state of the X computation at some node in N_X. Thus each node executing the computation Y has its Y computation forked or spawned by some node in N_X and its Y computation corresponds to a nested branch of X. The nodes in N_Y may not know each others' initial values for the Y computation; the X computations at (some of) the N_X nodes have semi-independently forked the Y computations at the nodes in N_Y.
- R3d(X, Y): Some node in N_Y begins its Y computation only after it knows the initial state of the X computation at some node in N_X. This is a weak form of synchronization in which only one node in N_X and one node in N_Y coordinate their respective initial states of their local X and Y computations. However, if the node in N_X that initiated the X computation forks off
the main thread for the Y computation, then this form of synchronization between the initiations of X and Y is adequate to have Y as an entirely nested computation within X.
R4*(X, Y): This group of relations deals with L_X and U_Y. The relations signify different degrees of synchronization between a monitoring computation Y that knows the initial values with which the X computation begins, and then the monitoring computation Y terminates.
- R4a(X, Y): The Y computation at any node in N_Y terminates only after that node knows the initial values of the X computation at each node in N_X. This is a strong form of synchronization between the start of X and the end of Y.
- R4b(X, Y): For every node in N_X, the initial state of its X computation is known by at least one node in N_Y before that node in N_Y terminates its Y computation. Even if there is no exchange of information in the Y computation about the state of the X computation at individual nodes in N_X, this relation guarantees that when Y completes, the (initial) local states at each of the N_X nodes are collectively (but not individually) known by N_Y.
- R4b'(X, Y): Before terminating its Y computation, some node in N_Y knows the initial state of the X computation at all the nodes in N_X. This node in N_Y can detect if an initial global predicate of the X computation across the nodes in N_X is satisfied, before it terminates its Y computation. If this node in N_Y is a group leader, it can then inform the other nodes in N_Y to terminate their Y computations.
- R4c(X, Y): The initial state of the X computation at some node in N_X is known to all the nodes in N_Y before they terminate their Y computation. This weak synchronization is adequate for applications where all the N_X nodes start their X computation with similar values. Alternately, if the node in N_X behaves as a group leader, it can first detect the initial global state of the X computation and then inform all the nodes in N_Y.
- R4c'(X, Y): Each node in N_Y terminates its Y computation only after it knows the initial state of the X computation at some node in N_X. This is a weaker form of synchronization than R4c because the states of all nodes in N_X may not be observed before the nodes in N_Y terminate their Y computation. But this will be adequate for applications in which each node in N_X is reporting the same state/value of the X computation, and each node in N_Y simply needs a confirmation from some node in N_X before it terminates its Y computation. For example, a mobile host (an N_Y node) may simply need a confirmation from some base station (an N_X node) before it exits its Y computation.
- R4d(X, Y): Some node in N_Y terminates its Y computation after it knows the initial state of the X computation at some node in N_X. This weak form of synchronization is sufficient when the group leader of X, which is responsible for kicking off the rest of X, informs some node (or the group leader) of the monitoring distributed program Y that computation X has successfully begun.
Consider enforcing group mutual exclusion (gmutex) between two groups of processes G1 and G2. Multiple processes of either group, but not both groups, are permitted to access some critical resources, such as distributed database records, at any time. Relations R1* represent different degrees of gmutex that can be enforced, as explained for R1*(X, Y) earlier in this section. Also, the strongest form of gmutex, R1a, can also be enforced by R1b', R1c, and R1d, if the communicating nodes in G1/G2 are the respective group leaders. Thus, for R1b', the nodes in G1 communicate their states (gmutex tokens) to the group leader of G2 which then collects all these states (gmutex tokens), and distributes them within G2. For R1c, the group leader of G1 collects all the gmutex tokens from G1, then informs all the nodes in G2. For R1d, the group leader of G1 collects all the gmutex tokens from G1, then informs the group leader of G2 which then informs all the nodes in G2. The above four ways of expressing distributed mutual exclusion provide a choice in trade-offs of (i) knowledge of membership of G1 and/or G2, by members and/or group leaders, within each group and across groups (this is further complicated with mobile processes), (ii) different delay or number of message exchange phases to achieve gmutex (it is critical to have rapid exchange of access rights to the distributed database), (iii) different number of messages exchanged to achieve gmutex (bandwidth is a constraint, particularly with the use of crypto techniques), and (iv) fault-tolerance implications (critical to sensitive applications).
4 Concluding Remarks
We showed the uses of fine-grained synchronization relations by applications that use nonatomicity in modeling actions and need a fine degree of granularity to specify synchronization relations and their composite global predicates. We showed the specific meaning and significance of each relation. Some uses of the relations in a distributed system include modeling various forms of synchronization for group mutual exclusion, initiation of nested computing, termination of nested computing, and monitoring the start of a computation. The synchronization between any X and Y computations can be performed at any "synchronization barrier" for the X and Y computations of any application. For example, R2*(X,Y) synchronization can be performed at a barrier to ensure that subcomputation X which is nested in subcomputation Y completes before subcomputation Y completes, following which R3*(Y, X) synchronization can be performed to kick off another nested subcomputation X within Y. This barrier synchronization may involve blocking of processes which need to wait for expected messages, analogous to the barrier synchronization for multiprocessor systems [17]. The synchronization relations provide a precise handle to express various types of synchronization conditions in first-order logic. Each synchronization performed satisfies a global predicate in the execution. (A classification of some types of global predicates is given in [3,6].) Complex global predicates can be formed by logical expressions on such synchronization relations. It is an inter-
esting problem to classify the global predicates that can be specified using the synchronization relations. Observe that performing a synchronization (corresponding to one of the relations) involves message passing, and hence also provides a direct way to evaluate global state functions and global predicates involving the nodes participating in the synchronization. Thus, performing the synchronization enables the detection of global predicates. Also, global predicates can be enforced by initiating the synchronization only when some local predicates become true. Identifying the types of global predicates that can be detected or enforced using the synchronization relations is an interesting problem. The design of algorithms to enforce global predicates is also a topic for study.
References
1. Linear time, branching time, and partial orders in logics and models of concurrency, J. W. de Bakker, W. P. de Roever, G. Rozenberg (Eds.), LNCS 354, Springer-Verlag, 1989.
2. J. van Benthem, The Logic of Time, Kluwer Academic Publishers, (1st ed. 1983), 2nd ed. 1991.
3. R. Cooper, K. Marzullo, Consistent detection of global predicates, ACM/ONR Workshop on Parallel and Distributed Debugging, 163-173, May 1991.
4. C. A. Fidge, Timestamps in message-passing systems that preserve partial ordering, Australian Computer Science Communications, Vol. 10, No. 1, 56-66, Feb. 1988.
5. P. C. Fishburn, Interval Orders and Interval Graphs: A Study of Partially Ordered Sets, J. Wiley & Sons, 1985.
6. V. Garg, B. Waldecker, Detection of weak unstable predicates in distributed programs, IEEE Transactions on Parallel and Distributed Systems, 5(3), 299-307, March 1994.
7. W. Janssen, M. Poel, J. Zwiers, Action systems and action refinement in the development of parallel systems, In J. C. Baeten, J. F. Groote (Eds.), Concur'91, LNCS 527, Springer-Verlag, 298-316, 1991.
8. A. Kshemkalyani, Temporal interactions of intervals in distributed systems, TR-29.1933, IBM, Sept. 1994.
9. A. Kshemkalyani, Temporal interactions of intervals in distributed systems, Journal of Computer and System Sciences, 52(2), 287-298, April 1996. (Contains some parts of [8].)
10. A. Kshemkalyani, Framework for viewing atomic events in distributed computations, Theoretical Computer Science, 196(1-2): 45-70, April 1998.
11. A. Kshemkalyani, Relative timing constraints between complex events, 8th IASTED Conf. on Parallel and Distributed Computing and Systems, 324-326, Oct. 1996.
12. A. Kshemkalyani, Synchronization for distributed real-time applications, 5th Workshop on Parallel and Distributed Real-time Systems, IEEE CS Press, 81-90, April 1997.
13. L. Lamport, Time, clocks, and the ordering of events in a distributed system, CACM, 21(7), 558-565, July 1978.
14. L. Lamport, On interprocess communication, Part I: Basic formalism, Part II: Algorithms, Distributed Computing, 1:77-101, 1986.
15. F. Mattern, Virtual time and global states of distributed systems, Parallel and Distributed Algorithms, North-Holland, 215-226, 1989.
16. F. Mattern, On the relativistic structure of logical time in distributed systems, In: Datation et Controle des Executions Reparties, Bigre, 78 (ISSN 0221-525), 3-20, 1992.
17. J. Mellor-Crummey, M. Scott, Algorithms for scalable synchronization on shared-memory multiprocessors, ACM Transactions on Computer Systems, 9(1): 21-65, Feb. 1991.
18. E. R. Olderog, Nets, Terms, and Formulas, Cambridge Tracts in Theoretical Computer Science, 1991.
19. A. Rensink, Models and Methods for Action Refinement, Ph.D. thesis, University of Twente, The Netherlands, Aug. 1993.
20. R. Schwarz, F. Mattern, Detecting causal relationships in distributed computations: In search of the holy grail, Distributed Computing, 7:149-174, 1994.
A Simple Protocol to Communicate Channels over Channels

Henk L. Muller and David May
Department of Computer Science, University of Bristol, UK
henkm@cs.bris.ac.uk, dave@cs.bris.ac.uk
Abstract. In this paper we present the communication protocol that we use to implement first class channels. Ordinary channels allow data communication (like CSP/Occam); first class channels allow communicating channel ends over a channel. This enables processes to exchange communication capabilities, making the communication graph highly flexible. In this paper we present a simple protocol to communicate channels over channels, and we show that we can implement this protocol cheaply and safely. The implementation is going to be embedded in, amongst others, ultra mobile computer systems. We envisage that the protocol is so simple that it can be implemented at hardware level.
1 Introduction
The traditional approach to communication in concurrent and parallel programming languages is either very flexible, or very restricted. A language like Occam [1], based on CSP [2], has a static communication graph. Processes are connected by means of channels. Data is transported over channels synchronously, meaning that input and output operations wait for each other before data can be communicated. In contrast, communication in concurrent object oriented languages is carried out via object identifiers. Once a process obtains an identifier of another object, it can communicate with that object, and any object can communicate with that object. Similarly, languages based on the π-calculus [3] use many-to-one communications. We have recently developed a form of communication that lies between the purely static and very dynamic communication constructs. We enforce communication over point-to-point channels, just like Occam, but instead of having a fixed communication graph we allow channels to be passed between processes. This gives us a high degree of flexibility. A channel can be seen as a communication capability to or from a process. This is close to the NIL approach [4]. These flexible channels can be extremely useful in many application areas. Allowing channel ends to be passed like ordinary data gives us an elegant paradigm to encode the (mobile) software required for wearable computers. Multi-user games often require communication capabilities to be exchanged between nodes in the system. A similar type of channels has been successfully used for audio processing [5], and mobile channels appear to be a natural communication medium for continuous media.
Fig. 1. A system, with two nodes 'D' and 'O', and a channel connecting ports a and b. (i) the channel is local on 'O'; ii) port b is passed to node 'D', stretching the channel; iii) port a is transferred to 'D', the channel snaps back.
In all of these applications, one wants to be able to pass around a communication capability from one part of a system to another. Having one-to-one communication channels as opposed to one-to-many communication models (which many concurrent OO languages offer as their default mechanism) makes it easier for the programmer to reason about the program behaviour, while it simplifies and speeds up the implementation. This is especially important when we are going to implement these protocols at hardware level. In this paper we discuss the implementation of the mobile channel model. In Section 2 we first give a brief overview of the high level paradigm that we use to model these flexible channels. After that, we discuss the protocols that we have developed to implement the channels in Section 3. In Section 4 we present some performance figures of our implementation.
2 Moving Ports Paradigm
The programming paradigm that we have developed, Icarus, is designed to enable the development of mobile distributed software. In this section we give a brief description of the Icarus programming model, concentrating on the communication aspects; for an in-depth discussion of Icarus we refer to a companion paper [6]. Icarus processes communicate by means of channels. Like CSP and Occam channels, Icarus channels are point-to-point, and uni-directional. That means, at any moment in time, a channel has exactly one process that may input from the channel, and one process that may output to the channel. We call these channel ends ports, so each channel has an associated input port and output port. The two ports that are connected by a channel are called companion ports. In contrast with CSP and Occam, Icarus ports are first-class. This means that one can declare a variable of the type "input port of integers", and assign it a channel end (the type system requires this to be the input-end of an integer channel). Alternatively one can declare a "channel of input port of integers", and communicate a port over this channel. Because an Icarus port can be passed around between processes, a flexible communication structure can be created. There are two ways to look at mobile channels and ports. The easiest way to visualise mobile channels is to view a channel as a rubber pipe connecting two ports. Wherever the ports are moved, the channel connects them and transports data from the input port to the output port when required. The grey line in
{ chan !int d ; chan ?int c ;
  par {
    { chan z ; c ! z ; d ! z ; }
    || { ?int x ; int b ; c ? x ; x ? b ; }
    || { !int y ; d ? y ; y ! 1 ; }
  }
}

Fig. 2. Example Icarus program. Left: the code. Right: the execution, the processes before, during and after communication over c and d (1: initial configuration; 2: the ports of z are moving; 3: the ports have arrived in x and y).
3 The Protocol for Moving Ports
The protocol used to communicate ports over channels is simple. We present the protocol bottom up: we start with the basic communication of simple values, then
Fig. 3. Part of the port data structure, with two companion ports.
we discuss the sending of ports itself, and we finally consider the difficult cases such as sending two companion ports simultaneously. We assume that the protocol is implemented on top of a reliable in-order packet delivery mechanism.
3.1 Communicating Single Values
The port data structure consists of an integer that denotes the index of the companion port, an optional field to denote the remote address of the companion port (not used for a port with a local companion), a field that gives the state of the port, and some additional fields for storing process identifications. Two companion ports are shown in Figure 3. Communicating single values is implemented conventionally, using the same algorithms as used in, for example, the Transputer family [7]. The channel can be in one of three states, IDLE, INPUTTING and OUTPUTTING. This uniquely defines the state of both inputting and outputting processes. The IDLE state denotes that no process wishes to communicate over the channel. The INPUTTING state denotes that a process has stopped for input and is waiting for communication, but that no process is ready for outputting yet. Conversely, the OUTPUTTING state denotes that a process is ready for output but that no process is ready for input. There is no state for two processes being ready for input and output, as that state is resolved immediately when the second process performs its input/output operation. The state of a channel is represented as the state of its two ports. It is sufficient to store the state INPUTTING on the output port, and to store the state OUTPUTTING on the input port. This ensures that a check of the local port suffices to determine the state of the channel. If we want to support a select-operation, we will need an extra state, SELECTING, which indicates that a process is willing to input. Upon communication the selecting process will perform some extra work to clean other ports which have been placed in the state SELECTING.
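A minimal sketch in C of how the port record and the local three-state rendezvous described above might look follows; the field and function names are invented for this sketch and are not taken from the Icarus run-time.

/* Channel/port states of Sect. 3.1 (local companion); names as in the text. */
enum port_state { IDLE, INPUTTING, OUTPUTTING, SELECTING };

typedef struct port {
    int   companion_index;   /* index of the companion port                  */
    int   companion_node;    /* node address of the companion, -1 if local   */
    int   state;             /* one of the state constants (Sect. 3.1/3.2)   */
    int   waiting_process;   /* identification of a blocked process, if any  */
    void *pending_data;      /* data parked at the rendezvous                */
} port_t;

/* Local synchronous output (sketch): complete the rendezvous if a reader    */
/* is already waiting on this channel, otherwise publish OUTPUTTING on the   */
/* companion (input) port and block until a reader arrives.                  */
static void local_output(port_t *ports, int out, void *data)
{
    port_t *p         = &ports[out];
    port_t *companion = &ports[p->companion_index];
    if (p->state == INPUTTING) {        /* INPUTTING is stored on the output port */
        companion->pending_data = data;
        p->state = IDLE;
        /* wake_up(companion->waiting_process);  -- scheduler call omitted   */
    } else {
        companion->state = OUTPUTTING;  /* OUTPUTTING is stored on the input port */
        companion->pending_data = data;
        /* block(current_process);  -- resumed when the input completes      */
    }
}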
3.2 Communicating Simple Values Off-Node
Off-node communication is synchronous, but has been optimised to hide latency as much as possible. A process that can output data will immediately send the data on the underlying network, and then wait for a signal that the data has been consumed before continuing. Note that the process stops after outputting, until the data has been input; this is essential when implementing the full port-moving protocol.
Because the port-state of a remote companion port is not directly accessible, a port-state is stored at both nodes. Ports with a remote companion can be in the states REMOTE_IDLE, REMOTE_INPUTTING, REMOTE_OUTPUTTING and REMOTE_READY. The inputting port can be REMOTE_IDLE (no process willing to input at this moment), REMOTE_INPUTTING (a process is willing to input and waiting) or REMOTE_READY (no process is willing to input, but data has been sent and is waiting to be input). The outputting port can be in the state REMOTE_IDLE (no process is willing to output at this moment) or REMOTE_OUTPUTTING (a process has output data and is waiting for the data to be input). When implementing a select-operation, a port may get into a state REMOTE_SELECTING to denote that a process is waiting for input from more than one source (note that our select-operation only allows selective input; we do not support a completely general select-operation [8]).
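The off-node states and the eager, latency-hiding output could be sketched as follows, reusing the port_t record from the earlier sketch; send_packet and the message-type constants are placeholders standing in for the reliable in-order packet layer assumed by the protocol, not an actual API.

#include <stddef.h>

/* Port states when the companion is remote (Sect. 3.2).                     */
enum remote_port_state {
    REMOTE_IDLE = 100,    /* offset so they stay distinct from local states   */
    REMOTE_INPUTTING,     /* a local process waits for data                   */
    REMOTE_OUTPUTTING,    /* data sent eagerly, waiting for the consume signal*/
    REMOTE_READY,         /* data arrived, no local process has input it yet  */
    REMOTE_SELECTING,     /* selective input from several sources             */
    GONE                  /* the port is being moved to another node          */
};

/* Placeholder declaration for the reliable, in-order packet layer.           */
enum { MSG_DATA = 1, MSG_CONSUMED = 2 };
void send_packet(int node, int msg_type, int port_index,
                 const void *payload, size_t len);

/* Remote synchronous output: send eagerly to hide latency, then block        */
/* until the inputting side signals that the data has been consumed.          */
static void remote_output(port_t *p, const void *data, size_t len)
{
    send_packet(p->companion_node, MSG_DATA, p->companion_index, data, len);
    p->state = REMOTE_OUTPUTTING;
    /* block(current_process);  -- woken when a MSG_CONSUMED packet arrives   */
}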
3.3 Communicating Ports Off-Node
Locally, within a node, a port can be simply an index in an array. Within a node, ports can be freely moved around from one process to another simply by communicating this index value. We discuss transfers of ports to another node in four steps:
1. Transfer of a port from a source node to a destination node, where the companion port is on the source node (channel stretches).
2. Transfer of a port from a source node to a destination node, where the companion port is on a third node.
3. Transfer of a port from a source node to a destination node, where the companion port is on the destination node (channel snaps back).
4. Difficult cases (simultaneous transfers of two companion ports).
1. Transfer of a port from a source node to a destination node, where the companion port is on the source node. In this case the companion port is local, and stays behind. According to Section 3.1, the port that will be sent must be in the state IDLE, INPUTTING, or OUTPUTTING. Because ports are associated with exactly one process, we can be sure that the port that is going to be sent is not active. That is, no process can be inputting or outputting on the port which is being sent: if a process was inputting or outputting on this port, then the process cannot send the port (it could only attempt to send the port over itself; the type system prevents this from happening). Because the port being sent is not active, it is sufficient to send some identification of the companion port to the destination node. The identification that we send consists of a local index number, and the node address of the source (for example an IP-address and IP port-number). This tuple (local index, node address) is a globally unique identification of the port. Upon receiving the identification, a new port is allocated, and the companion port index and node address are stored in this port. The port-state is initialised to REMOTE_IDLE, for the port was inactive. The second step is to inform the
Fig. 4. Steps to send a port to a remote node. Grey arrows are channels. (1: send companion port-ID (137.222.102.50, 2314, 13); 2: send ID of allocated port (137.222.103.28, 1127, 34).)
companion node of the newly created port index, and the companion port's state is set to REMOTE_IDLE, REMOTE_INPUTTING, or REMOTE_OUTPUTTING. After the companion port is updated, step three is to delete the original port. Even though the port sent was inactive, it was possible that at most one message was waiting on it; in that case the data is forwarded in order to preserve the invariant that the data has been sent to the remote node. Finally, in the fourth step an acknowledgement is sent to the destination node, signalling that the companion port is ready. This four step process is summarised in Figure 4. Between steps 1 and 2 is a transitional period when the companion port is not connected to anything. For this reason, the companion port-state is set to GONE in step 1. Any I/O operation between steps 1 and 2 will wait until this transient state disappears.
2. Transfer of a port from a source node to a destination node, where the companion port is on a different, third, node. This case is slightly more difficult, and we need a generalised version of the previous protocol. This generalised protocol has four steps, outlined in Figure 5. In step 1 the global identification of the companion port is sent from 'S' to 'D'. In step 2 the newly allocated port-index is sent from the destination node 'D' to the companion node 'C'. This will allow the companion port to be updated. Step 3 will acknowledge to 'S' that the port has been transferred, allowing the source node to delete the original port. Finally step 4 is an acknowledgement to 'D' signalling that the port is ready for use. Performing these four steps in this order will ensure that any message that might have been sent to the source node is collected and forwarded to the destination node. In the worst case, a message might be output on the companion port, just after the source node has initiated the transfer. This message will be delivered to the source node while the port is in state GONE. The data will be kept here until step 3 of the protocol, whereupon it is transferred to the destination node.
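As a rough illustration of the four-step transfer (not the actual implementation), the steps can be encoded as message types, with step 1 disconnecting the port in transit and shipping its companion's global identification to the destination; the sketch reuses port_t, GONE and send_packet from the earlier sketches.

/* Message types for the four transfer steps of Sect. 3.3 (names invented).  */
enum xfer_msg {
    XFER_COMPANION_ID,   /* step 1: source -> destination                    */
    XFER_NEW_INDEX,      /* step 2: destination -> companion node            */
    XFER_DELETE_OLD,     /* step 3: companion node -> source                 */
    XFER_ACK_READY       /* step 4: acknowledgement to the destination       */
};

struct port_id { int node; int index; };     /* globally unique port name    */

/* Step 1, on the source node: the port being moved is disconnected while    */
/* in transit (state GONE); the identification of its companion is shipped   */
/* to the destination, which allocates a new port and drives steps 2 to 4.   */
static void start_port_transfer(port_t *moved, int dest_node)
{
    struct port_id companion = { moved->companion_node, moved->companion_index };
    moved->state = GONE;
    send_packet(dest_node, XFER_COMPANION_ID, -1, &companion, sizeof companion);
    /* Any data arriving for the GONE port is held here and forwarded in     */
    /* step 3, when the original port is finally deleted.                    */
}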
Fig. 5. Steps to send a port in a system involving three nodes. (1: send companion ID; 2: ID of new port; 3: delete port; 4: acknowledge.)
Note that the case with two nodes discussed in the previous section is a special case of this four step protocol. If the companion port resides on the source node, then step 3 is a local operation.
3. Transfer of a port from a source node to a destination node, where the companion port is on the destination node. Because the port is being sent to the node where the companion resides, we must ensure that both companion ports are in a 'local' state, allowing data transfer to be optimised for local transport. We use the same protocol as before. Because the destination node and the companion node are now one and the same, step 2, informing the companion node of its new status, is now a local activity. We still need to execute steps 3 and 4, telling the source node that the port can be deleted, and acknowledging the completion. These steps take care of the situation in which the port sent was originally in the state REMOTE_READY. The companion had sent data to the input-port, and the data was waiting to be picked up. Before the port is deleted, we forward the data to the destination node. Both companions are now on the destination node, so we change the input port-state to OUTPUTTING, and keep the data in the outputting process until a process is ready to input it. The data can be safely kept in the outputting process because it had not been allowed to proceed.
598
Fig. 6. Ports over ports over ports. two companion ports are sent simultaneously, then both ports must have been IDLE, and no data will be sent until both have arrived at their destinations. The most difficult situation is the case where a port is sent, and this port is actually a port of ports, carrying a port. In Figure 6 we have sketched a situation with 4 nodes (A, B, C and D) and 6 ports (a, b, c, d, e and f). The ports are linked up as denoted by the grey lines. Outputting port d over e is a normal output operation as discussed before. Similarly, outputting port b over c to d is a normal operation. If the transfers of b and d happen simultaneously, then b is actually making a double hop, from node B via node C to node D. Indeed, there may be an arbitrarily long finite list of ports to be moved, only restricted by the number of '?' in the port type definition (the typing system guarantees it to be finite). This forwarding causes a serious complication in the protocol (around 20% of the code), and we are considering whether it is worthwhile to prohibit this.
3.4 Security
The protocol described above is not secure. Most notably, an attacker can forge a message for step 2 of the protocol, and deliver it to a node, in order to obtain a port in the process. The solution is to extend our protocol with an extra step. When executing step 1 of the protocol, we have to send a clearance message to the companion port, informing it that a move is imminent. The companion node will only execute step 2 of the protocol when the source node has sent this clearance. This provides security against unknown processes taking channel ends, but on the assumption that the underlying network enforces a unique naming scheme, and secure delivery of messages.
3.5 Discussion
The protocol has been optimised so that the frequent operations, such as sending ordinary data or ports, have high performance. There are communication patterns which will perform badly. The worst case is a program where a process outputs a large array over a port, while there is no process willing to input this data. If the input-port is now sent from one process to the next, then the d a t a is shipped with it. If the input port were to be sent many times, we would waste bandwidth and increase latency. We could optimise this case by not sending d a t a until required, but this would seriously impair performance in the more common case, when data is communicated between two remote nodes (incurring double latency on each transferral).
Yo-yo: send a port up and down between two nodes (node a and node b). Diabolo: send both ports up and down.

  Bouncing between          Yo-yo      Diabolo
  Bristol-a / Bristol-a     7.70 µs    16 µs
  Bristol-a / Bristol-b     4.7 ms     9.9 ms
  Bristol-a / Amsterdam     322 ms     660 ms

Fig. 7. The two test programs that we measured, with round trip times.
4 Results
The protocol has been implemented in C, to implement Icarus on a network of workstations. The implementation of the protocol takes less than 1000 lines of C, underlining its simplicity. We use UDP (User Datagram Protocol) for the underlying network, with an extra layer to guarantee reliable in-order delivery. We have tested the protocol with many different applications. A class of students has written multi user games in Icarus. To convince the reader that our implementation works over the Internet, we show timings of the two test programs displayed in Figure 7. The first test program, Yo-yo, keeps one port at a fixed node, and sends the companion port forwards and backwards to a process on a second node. In this program the channel must stretch and snap back every time. The second test program, Diabolo, stretches the protocol with some more demanding communication. Two pairs of processes are running on two nodes; each pair of processes sends one of two companion ports forwards and backwards between two nodes. We have measured three configurations: one where all processes are running within the same node (which measures communications inside the run-time support only), a configuration where we run the processes on two machines on the same LAN, and a configuration where we run the processes on two distant machines. The performance is linearly related to the average latency between the nodes involved, and to the number of messages that is needed for a round trip. This is clearly shown in the timings of Yo-yo versus Diabolo for the three configurations.
5 Conclusions
In this paper we presented a simple protocol that allows us to treat communication channels as first class objects. The programming paradigm allows processes to communicate over point-to-point channels. Values communicated can either
be simple values (like integers or arrays of integers), or ports (ends of channels). Because the paradigm allows a synchronous implementation, we have been able to produce a simple implementation. Whenever a port is transferred, the port is known to be idle (because ports are point to point and synchronous), so we do not have to transfer any state around. We have developed a four step protocol to move a communication port from one node to another. Step one carries the port identifier, creating a new port on the destination node. Step two informs the companion port of its new companion. Step three informs the source that the original port is to be deleted. Step four allows the newly created port to start communication. These steps may be local or remote, depending on the relative positions of the source, destination, and companion nodes. If the two connected ports end up in the same node, the channel snaps back to within the node, and the implementation switches to a highly efficient local communication protocol. The applications of this protocol lie in many areas: multi user games (where players receive channels that allow them to communicate with other players), continuous media switchboards (where a channel carrying, for example, video signals is handed out to two terminals), and wearable computing (establishing communication channels between parts of a changing environment). We are currently integrating this protocol in a wearable computer system, and its applications in continuous media.

References
1. Inmos Ltd. Occam-2 Reference Manual. Prentice Hall, 1988.
2. C. A. R. Hoare. Communicating Sequential Processes. Communications of the ACM, 21(8):666-677, August 1978.
3. R. Milner, J. Parrow, and D. Walker. A Calculus of Mobile Processes, I. Information and Computation, 100(1):1-40, Sept. 1992.
4. R. Strom and S. Yemini. The NIL Distributed Systems Programming Language: A Status Report. SIGPLAN Notices, 20(5):36-44, May 1985.
5. R. Kirk and A. Hunt. MIDAS-MILAN: An Open Distributed Processing System for Audio Signal Processing. Journal of the Audio Engineering Society, 44(3):119-129, Mar. 1996.
6. D. May and H. L. Muller. Icarus language definition. Technical report, Department of Computer Science, University of Bristol, January 1997.
7. M. D. May, P. W. Thompson, and P. H. Welch, editors. Networks, Routers & Transputers. IOS Press, 1993.
8. G. N. Buckley and A. Silberschatz. An Effective Implementation for the Generalised Input-Output Construct of CSP. ACM Transactions on Programming Languages and Systems, pp. 224, Apr. 1983.
SciOS: Flexible Operating System Support for SCI Clusters*

Povl T. Koch** and Xavier Rousset de Pina
SIRAC Laboratory, INRIA Rhône-Alpes, ZIRST - 655, avenue de l'Europe, 38330 Montbonnot Saint-Martin (France)
scios@inrialpes.fr
Abstract. The bottleneck for many parallel and distributed applications on networks of workstations is the high cost of communication on traditional network interfaces. Memory-mapped network interfaces provide latencies of a few microseconds and bandwidths close to the maximum of the local I/O bus. A major drawback with the current operating system support is that applications have to deal directly with data consistency and the allocation and placement of pinned physical memory. We propose a more elaborate interface, called SciOS, that provides a shared-memory abstraction where physical memory is treated like in a NUMA architecture. To lower the average memory access times, we use a relaxed memory model and dynamic page migration and replication combined with use of idle remote memory instead of disk swap. We describe the SciOS programming model and describe the issues for its implementation on an SCI cluster.
1 Introduction
Traditional network interfaces and operating systems often require that every communication is through expensive calls to the operating system, includes extensive buffer handling, and requires memory copies between address spaces to enforce protection and provide flexibility and reliability. These interface designs typically result in high message latencies, bandwidth limitations, and high software overheads of communication. This hurts the performance of a wide range of parallel and distributed applications. Many new networking technologies are based on virtual memory-mapped interfaces and have bandwidths in the range of Gigabits per second, message latencies of a few microseconds, and hardware-based reliable delivery and congestion control [1,6,8,14]. After the setup of remote memory mappings, all communication is handled through the processor's normal load and store instructions
* This work was supported in part by the Region Rhône-Alpes in the framework of the Emergence Research Program (C.R. E901010001)
** Supported in part by the European Commission, the TMR Program. Also associated with University of Copenhagen (Denmark)
or system managed DMA with very little overhead on the sending and receiving processors. These types of interfaces have been shown to be well suited for message-passing interfaces [10] and parallel programming languages [9] because of their low latencies and the hardware support of zero-copy protocol implementations. Our approach is to use the user-level load and store interface of memory-mapped networks as a means to build a Non-Coherent NUMA architecture and to provide a coherent shared memory system through operating system mechanisms. A major performance issue for NUMA architectures is that applications have to deal with the high performance difference between local and remote memory references. On Cache-Coherent NUMA machines (CC-NUMA), remote accesses are typically 2-3 times more expensive than local access times. Therefore, the physical placement of data is crucial to the performance of many applications. Each application has to deal explicitly with this locality issue which can become complex when dealing with dynamic sharing patterns. Mechanisms have been implemented to dynamically migrate and replicate shared memory pages [2,3,5,16] to increase memory locality. For memory-mapped network interfaces, the difference between local and remote memory access is much higher than for CC-NUMA machines because the remote memory accesses have to pass by the I/O bus. In Sect. 2, we describe the capabilities of memory-mapped networks together with the current operating system support provided for message-passing interfaces and its limitations. Section 3 describes the main features and implementation issues of our system, SciOS, based mainly on a coherent shared memory abstraction. Section 4 describes related work and Sect. 5 concludes the paper.
2 Memory-Mapped Network Implementations
The implementation of memory-mapped networks is based on a processor's user-level access to the network interface placed on the I/O bus. Although the I/O bus, e.g., PCI, is slower than the memory bus, it allows more general implementations that can be used on a wide range of architectures. We describe below the main mechanisms of the PCI-SCI adapter and the operating system support provided for it.
2.1 Dolphin's PCI-SCI Adapter
Dolphin's PCI-SCI adapter is based on an IEEE standard called Scalable Coherent Interface (SCI) [11] which specifies the implementation of a coherent shared memory abstraction and a low-level message passing interface. The cache coherence is optional and is typically not implemented on the I/O-based SCI implementations we are considering. SCI specifies a shared 64-bit physical address space. The high order 16 bits of an SCI address designate a node and the remaining 48 bits allow addressing of physical memory within each node. The
The PCI-SCI adapter [6] provides a number of mechanisms to map remote memory, to send and receive low-level messages, to generate remote interrupts, and to provide atomic operations on shared memory. SCI is based on a ring topology, and Dolphin's implementation uses 200 MB/s bidirectional point-to-point links and SCI switches which allow the interconnection of rings. The PCI-SCI adapter does not implement the cache consistency mechanisms of the SCI standard. Each application will therefore have to deal with issues such as when to flush or invalidate processor caches, buffers in the PCI-SCI adapter, etc. Below, we focus on the adapter's ability to map remote memory using a standard 32-bit PCI bus. To share physical memory between nodes, a process can allocate local physical memory. The physical memory is accessed locally through a normal virtual-to-physical memory mapping. For a process on a remote node to access the shared memory, the process must set up two mappings using the operating system. (1) The remote memory is mapped into the address space of the local PCI bus. This is done by updating an address translation table (ATT) in the PCI-SCI adapter with the node ID and physical address of the remote memory. (2) The operating system then maps the specific address range of the PCI bus into the process's virtual address space through manipulation of its page tables. This mapping also enforces protection since the process can only access mapped memory and only in the right mode (read/write/execute). The mappings are shown in Fig. 1, where Process A has remotely mapped a physical memory page allocated and locally mapped by Process B on another node.
Fig. 1. Memory mappings for physical memory and PCI-SCI adapter.

2.2 Driver Support for the PCI-SCI Adapter
The driver support for the PCI-SCI adapter is structured as an operating-system-independent module which has different interfaces for each operating system and adapter version. The main primitives directly represent the hardware functions presented in Sect. 2.1. The management of physical memory and memory mappings is implemented using device files. DMA functionality is provided through standard read and write calls. A number of reliability functions are provided to shield nodes from hardware failures.
All physical memory for a segment is marked as pinned so it is not placed on disk by the operating system. This is because the PCI bus only accesses the memory using physical addresses and the PCI-SCI adapter does not have TLB functionality for incoming requests. An application can create a shared segment of between 8 KB and 512 KB which is allocated in contiguous physical memory. The allocation follows certain alignment requirements because the address translation table entries in the PCI-SCI adapter are aligned at 512 KB boundaries. The number of address translation table entries is 8,192, which can map up to 4 GB of remote memory; this is more than sufficient. The main limitation is instead how much address space the adapter is allowed to use on the PCI bus. Currently, 128 MB is reserved on the PCI bus for the PCI-SCI adapter, which corresponds to only 256 active ATT entries. For the mapping of small memory blocks, e.g., for special synchronization mappings, a whole ATT entry must be used to map only a few physical pages. ATT entries must therefore be considered a scarce resource which limits the number of remote mappings. After some allocation and de-allocation of physical memory, it can be hard for the operating system to find contiguous and correctly aligned physical memory. Depending on the operating system, this can mean that only a limited number of segments can be created. For memory not perfectly allocated, more ATT entries are therefore needed. (The next-generation PCI-SCI adapter has more ATT entries and allows mapping at a granularity smaller than 512 KB, thereby reducing the alignment requirements.) Each segment has a unique key which is used to name the segment when it is mapped from a remote node. When the segment is mapped to the local PCI bus, the mmap call can map the segment into the virtual address space of a process. The processor cache is disabled for the region of the PCI bus that is used by the PCI-SCI adapter. This means that all load and store operations have to pass through the PCI-SCI adapter, i.e., loading a remote location recently stored to by the same processor results in PCI and SCI traffic. Since writes can arrive out of order, special flush operations or remote reads can be used as store barriers to guarantee some memory consistency. Although the allocation, deallocation, mapping, and unmapping of segments is completely dynamic, the location of a segment is fixed during its lifetime. This is useful for simple, fixed sharing patterns, e.g., when implementing a message-passing interface. Since the bandwidth of remote writes is much higher than that of remote reads, a receive buffer should be fixed on the receiving node to obtain the best performance. Performance may suffer seriously for segments that have a dynamic sharing pattern.
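To make the two-step mapping concrete, the following sketch shows how a user process could create a local segment and map a remote one through such a device-file interface. The device path, the ioctl command names, and the request structure are illustrative assumptions, not the actual Dolphin driver API; only the general pattern (device file, key-based segment naming, mmap into the process address space) follows the description above.

/* Hypothetical sketch of the device-file interface described above.
 * Device path, ioctl commands, and request struct are assumptions,
 * not the real Dolphin driver API.  For brevity, the exporting and
 * the importing side are shown in one program; in reality they run
 * on different nodes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SCI_DEV             "/dev/sci0"   /* assumed device file   */
#define SCI_CREATE_SEGMENT  0x5c01        /* assumed ioctl numbers */
#define SCI_CONNECT_SEGMENT 0x5c02

struct sci_segment_req {                  /* assumed request layout */
    unsigned int key;                     /* cluster-wide segment name  */
    unsigned int node_id;                 /* owner node (for connect)   */
    size_t       size;                    /* 8 KB .. 512 KB, per the text */
};

int main(void)
{
    int fd = open(SCI_DEV, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Exporter side: allocate pinned, contiguous physical memory
     * and give it a cluster-wide key. */
    struct sci_segment_req req = { .key = 42, .node_id = 0, .size = 512 * 1024 };
    if (ioctl(fd, SCI_CREATE_SEGMENT, &req) < 0) { perror("create"); return 1; }
    void *local = mmap(NULL, req.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Importer side (on another node): name the segment by key; the
     * driver fills an ATT entry and maps the PCI window. */
    struct sci_segment_req rreq = { .key = 42, .node_id = 1, .size = 512 * 1024 };
    if (ioctl(fd, SCI_CONNECT_SEGMENT, &rreq) < 0) { perror("connect"); return 1; }
    unsigned char *remote = mmap(NULL, rreq.size, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);

    if (remote != MAP_FAILED) {
        /* Remote stores now go through plain store instructions; writes
         * may arrive out of order, so a read back (or an explicit flush)
         * acts as a store barrier when ordering matters. */
        remote[0] = 0xab;
        munmap(remote, rreq.size);
    }
    if (local != MAP_FAILED)
        munmap(local, req.size);
    close(fd);
    return 0;
}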
3 The SciOS Prototype
The SciOS prototype provides a flexible coherent shared memory abstraction on a cluster of workstations connected by a load/store-based memory-mapped network interface. The goal is to minimize the cost of all memory references for different sharing patterns.
3.1 The SciOS Programming Model
SciOS provides a large, flat, shared address space. The abstraction for sharing is a memory block, which is a contiguous range of virtual memory. To allow efficient implementations of the coherent shared memory, the SciOS model is based on lazy release consistency [12]. This is a relaxed consistency model that allows many low-level optimizations such as the delaying of remote writes and the delivery of remote writes out of order. The observation behind release consistency is that correctly synchronized applications only need to have shared memory coherent while inside critical sections. For an application to see a sequentially consistent memory, as in uniprocessor systems, all accesses to shared data are required to be bracketed with appropriate acquire and release operations when respectively entering and leaving a critical section (a usage sketch is given at the end of this subsection). These operations are provided by the SciOS implementation. Traditional synchronization primitives such as locks can be implemented directly using the corresponding acquire/release operations, while a barrier entry can use the release operation and a barrier departure the acquire operation. To allow the application to exercise some control over the underlying memory system, SciOS supports five different memory protocols. The protocols range from a simple, fixed memory placement as described in Sect. 2.2 to a more complex implementation with migration and replication of shared memory blocks:

FIXED_INIT Physical pages are pinned and fixed on the node that allocates them. Virtual addresses are coordinated so the same memory block is mapped at the same addresses in all sharing processes. Implementations of message-passing interfaces and applications with multiple-producers/single-consumer patterns can benefit from using this memory protocol.

FIRST_TOUCH Physical pages are allocated and pinned on the first node that accesses them. After being allocated, a physical page is fixed on the node, as with the FIXED_INIT protocol. This protocol is targeted at parallel applications with a mostly-write sharing pattern or little active sharing.

MIGRATION Physical pages are allocated as in the FIRST_TOUCH protocol, but they may migrate individually between the nodes depending on the sharing pattern. Mechanisms ensure that thrashing (ping-pong) is minimized. Sequential sharing patterns can benefit from this protocol.

MIG_REP This is an extension of the MIGRATION protocol where physical pages are replicated when read by multiple nodes. Coherency mechanisms assure that the memory model is respected. Sharing patterns with mostly-read shared data can use this protocol.

GLOBAL This is similar to the MIG_REP protocol, but to globally minimize swap activity, decisions are coordinated with the virtual memory system and idle remote memory is used as swap space instead of a local disk. This protocol is primarily targeted at applications where each node is memory-bound ("out-of-core" applications).
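The bracketing discipline imposed by lazy release consistency can be illustrated with a minimal sketch. The names scios_acquire and scios_release are placeholders for whatever entry points the SciOS library actually exports; only the requirement that shared accesses be enclosed by acquire/release pairs is taken from the text.

/* Sketch of the acquire/release bracketing required by lazy release
 * consistency.  scios_acquire()/scios_release() are assumed names,
 * not the documented SciOS API. */
void scios_acquire(void *lock);   /* assumed: entering a critical section */
void scios_release(void *lock);   /* assumed: leaving a critical section  */

struct account { long balance; };

/* Shared data must only be touched between acquire and release;
 * outside the brackets remote writes may still be delayed or
 * delivered out of order. */
void deposit(struct account *shared, void *lock, long amount)
{
    scios_acquire(lock);          /* pulls in writes released by other nodes */
    shared->balance += amount;
    scios_release(lock);          /* makes our writes visible at the next
                                     acquire performed on another node      */
}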
3.2 Our SciOS Testbed
We are currently implementing the SciOS prototype on a cluster of six uniprocessor Intel nodes running the Linux operating system. The nodes all have a standard 32-bit PCI bus. Each node has a single PCI-SCI adapter connected to our SCI switches. When the machines are connected back-to-back, we observe that the latency for remote 4-byte, non-cached accesses is 2.4 microseconds for writes and 4.6 microseconds for reads. Remote access bandwidth is measured at approximately 10 MB/s for reads and 35-70 MB/s for writes for packet sizes as small as 512 bytes and when using all hardware optimizations. A detailed performance analysis of this adapter type can be found in [15]. Dolphin's PCI-SCI driver runs as a Linux kernel module. SciOS is being implemented as another module that uses a few basic functions in the driver module. For SciOS's memory blocks, we use the normal UNIX file abstraction, which can be implemented using Linux's Virtual File System (VFS) facility. After the SciOS module has been loaded and a directory mounted, an application can use the standard UNIX system calls open, ioctl, mmap, munmap, and close on memory blocks. The kernel directs these system calls, along with other kernel events such as page faults, swap in, and swap out, for all SciOS files to our SciOS module. This way, we can implement SciOS at the kernel level with full control of the memory blocks and the associated mappings and physical memory. It also allows us to manage the memory blocks with standard UNIX commands like touch, rm, and ls.
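A minimal sketch of this file-based interface is given below. The mount point, the ioctl command, and the protocol constant are assumptions made for illustration; only the open/ioctl/mmap/munmap/close life cycle and the idea of selecting one of the five protocols per memory block follow the description above.

/* Sketch of the UNIX-file view of a SciOS memory block.  Mount point,
 * ioctl command, and protocol identifier are assumed, not the real
 * SciOS interface. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SCIOS_SET_PROTOCOL 0x5300      /* assumed ioctl command       */
#define SCIOS_MIG_REP      4           /* assumed protocol identifier */

int main(void)
{
    size_t len = 1 << 20;              /* 1 MB shared memory block */

    /* Memory blocks appear as files under the mounted SciOS directory,
     * so they can also be listed or removed with ls and rm. */
    int fd = open("/scios/matrix", O_RDWR | O_CREAT, 0666);
    if (fd < 0) { perror("open"); return 1; }

    /* Choose the memory protocol before mapping (assumed interface). */
    int proto = SCIOS_MIG_REP;
    if (ioctl(fd, SCIOS_SET_PROTOCOL, &proto) < 0)
        perror("ioctl");

    double *m = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    m[0] = 3.14;                       /* page faults on this mapping are
                                          routed to the SciOS module, which
                                          decides to map remotely, migrate,
                                          or replicate the page            */
    munmap(m, len);
    close(fd);
    return 0;
}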
3.3 Issues for Implementing SciOS
In this section we will discuss the implementation issues for SciOS on our Linux cluster. The main SciOS data structure is a global table which holds information about each node and a directory for all the SciOS files. When a file is created, a page table holds information about the state of the physical memory allocated for it. In the SciOS implementation, we are dealing with the following four issues: (1) dynamic management of the scarce ATT entries in the PCI-SCI adapter, (2) allowing the use of normal, swappable physical memory for SciOS files, (3) the implementation of NUMA techniques for migration and replication of shared memory, and (4) the integration of the NUMA techniques with a remote swap mechanism. These issues are discussed below. The PCI-SCI driver described in Sect. 2.2 simply refuses further remote mappings when there are no more ATT entries available. In SciOS, applications can always map remote memory because ATT entries are shared dynamically, like the entries in the processor's TLB. When there is no ATT entry available for a new remote mapping, SciOS will invalidate all the virtual memory mappings for an already used ATT entry and reuse it. When the old mapping is used again, an ATT entry will be allocated in the same way. Since the PCI-SCI adapter does not provide statistics on the extent to which an ATT entry is used, we are forced to use a simple FIFO policy.
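A minimal sketch of such a FIFO reuse policy is shown below, assuming the 256 usable entries of the current adapter configuration; the data structure and the invalidation hook are simplified placeholders for the real driver state.

/* Sketch of FIFO reuse of ATT entries; entry count and fields are
 * illustrative, not the actual SciOS data structures. */
#define ATT_ENTRIES 256                 /* active entries with a 128 MB window */

struct att_entry {
    int           in_use;
    unsigned int  node_id;              /* remote node of the mapping */
    unsigned long remote_phys;          /* remote physical address    */
};

static struct att_entry att[ATT_ENTRIES];
static unsigned int next_victim;        /* FIFO hand */

/* Placeholder: tear down every virtual-memory mapping that goes
 * through this ATT entry so no process can touch stale memory. */
static void invalidate_mappings(int idx) { (void)idx; /* zap PTEs here */ }

int att_alloc(unsigned int node, unsigned long phys)
{
    int i;

    for (i = 0; i < ATT_ENTRIES; i++)   /* free entry available? */
        if (!att[i].in_use)
            goto claim;

    /* No free entry: evict in FIFO order, since the adapter gives us
     * no usage statistics to drive anything smarter (e.g. LRU). */
    i = next_victim;
    next_victim = (next_victim + 1) % ATT_ENTRIES;
    invalidate_mappings(i);

claim:
    att[i].in_use = 1;
    att[i].node_id = node;
    att[i].remote_phys = phys;
    return i;                           /* index programmed into the adapter */
}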
To alleviate the problem of allocating and aligning pinned physical memory for the PCI-SCI adapter, it is possible to reserve a large part of physical memory during the startup of the Linux kernel. Since such a static reservation is hard to size optimally and does not allow sharing the physical memory with other applications, SciOS uses normal swappable physical memory. When the operating system is running out of physical memory, a "least recently used" page will be placed on disk so the page can be used by another process. SciOS is notified about such events (swap-out) for the pages it has allocated. All remote mappings for the page must then be invalidated to avoid wrongly accessing the data of another application. An invalidation request must therefore be sent to all nodes that map the page, and the page cannot be placed on disk until all acknowledgments have been received. When a remote process accesses an invalidated remote mapping, the page must be brought back from disk into physical memory and the remote mapping must then be reestablished. The NUMA techniques are reflected in SciOS's MIGRATION and MIG_REP protocols. During a page fault, two possibilities exist for the MIGRATION protocol: establish a remote mapping or migrate the page to allow local accesses. If the page has migrated a lot in the past, it is frozen for a period of time [5] and a remote mapping is established to avoid a ping-pong effect. Otherwise, the page is migrated to the node that accesses it. For the MIG_REP protocol, a third possibility exists: a page that is not frozen can be replicated on a read access. Replication is preferred over migration if the page has not been modified recently. On write accesses, page replicas must be invalidated to maintain memory coherency. In addition to the page fault decisions, a daemon periodically collects sharing information and may redo migration/replication decisions. Normal NUMA replication protocols eagerly invalidate all replicas before a write access can be allowed in order to enforce sequential consistency. With lazy release consistency [12], the invalidation of remote replicas can be postponed until the remote nodes perform an acquire operation. We will use this lazy technique for implementing the MIG_REP protocol. On current architectures, access to remote memory is much faster than to local disk [7], even with traditional networks. By avoiding disk swap activity, performance can be gained for applications with high memory requirements or when multiprogramming each SciOS node. With the GLOBAL protocol, SciOS takes the amount of idle physical memory on each node into consideration when pages are swapped out and when deciding when to migrate and replicate pages. If there is no idle physical memory on a node, SciOS can decide to map a page remotely instead of migrating or replicating it. When the operating system wants to swap out a page, SciOS can simply discard replica pages. Instead of placing the page on disk, it can be migrated to a node with a remote mapping and idle physical memory. In cases where the mapping nodes have no idle memory or when the page is not remotely mapped, the page can be migrated to other nodes with idle memory. Only when all physical memory is used on all nodes will SciOS place pages on disk. To support the GLOBAL protocol, each node places information about its amount of idle memory in the global table.
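The page-fault policy just outlined might be structured roughly as follows. The thresholds, the fields of the page descriptor, and the helper names are assumptions for illustration; only the decision order (frozen pages are mapped remotely, recently written pages migrate, otherwise MIG_REP replicates on reads) mirrors the text.

/* Sketch of the MIGRATION/MIG_REP page-fault decision.  All constants
 * and struct fields are assumed, not the actual SciOS implementation. */
#include <time.h>

enum fault_action { REMOTE_MAP, MIGRATE, REPLICATE };

struct scios_page {
    time_t frozen_until;      /* freeze window after ping-pong (assumed) */
    int    recent_migrations; /* migration counter (assumed)             */
    time_t last_write;        /* time of last modification (assumed)     */
};

enum fault_action on_page_fault(struct scios_page *pg, int is_write,
                                int mig_rep_protocol, time_t now)
{
    /* A page that bounced too often is frozen for a while and is
     * simply mapped remotely to stop the ping-pong effect. */
    if (now < pg->frozen_until)
        return REMOTE_MAP;

    /* Under MIG_REP, read faults on pages that were not modified
     * recently are served by replication. */
    if (mig_rep_protocol && !is_write && now - pg->last_write > 1 /* s */)
        return REPLICATE;

    /* Otherwise migrate the page to the faulting node.  A write fault
     * on a replicated page would also invalidate the other copies
     * (not shown) to keep the release-consistent memory coherent. */
    pg->recent_migrations++;
    if (pg->recent_migrations > 4)      /* assumed ping-pong threshold */
        pg->frozen_until = now + 2;     /* freeze for 2 s (assumed)    */
    return MIGRATE;
}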
4 Related Work
Our work draws on the research in the area of operating system support for distributed shared memory (DSM) abstractions for workstation clusters and NUMA architectures. Numerous DSM systems implement a coherent shared memory abstraction on workstation clusters without any hardware support for remote memory access. Munin [4] implements multiple memory protocols to support different sharing patterns, as we will do in SciOS. The remote load/store capabilities of the PCI-SCI adapter widen the design space for SciOS compared to traditional DSM systems. The use of page migration and replication can lower memory access times and reduce load on the interconnect [3, 13]. Ibel et al. have implemented a global address space at the user level where the application manages a pool of preallocated pinned segments and mappings of remote segments. Because SciOS is implemented in the kernel, we can share the physical memory and remote mappings between applications/processes. Verghese et al. [16] also take the amount of idle physical memory into consideration when replicating pages, as SciOS's GLOBAL protocol does. Feeley et al.'s Global Memory Service [7] uses idle physical memory to reduce expensive disk swaps, but it does not deal with the consistency of write-shared pages as SciOS's GLOBAL protocol does.

5 Summary
Instead of using memory-mapped network interfaces as a means to provide efficient message passing, we use them as a non-coherent NUMA architecture. We have presented SciOS, a system which provides a programming model based on a coherent shared memory that facilitates the programming of an SCI cluster. Since the placement of physical memory is important for performance, SciOS provides five different protocols so that the programmer is able to control the behavior of the underlying memory system. Mechanisms such as migration and replication of shared pages will lower the average memory access times. A contribution of SciOS is a global memory protocol that tightly integrates the shared memory abstraction with the swap functionality of the virtual memory system. This will provide much better performance for applications with high memory requirements because the physical memory on all nodes is used as a cache before pages are placed on disk. By using only normal swappable physical memory and dynamically sharing address translation table entries, the SciOS implementation removes many of the restrictions found in the current driver support. We are currently implementing SciOS on an Intel cluster using Dolphin's PCI-SCI adapter.

Acknowledgments. Kaare Lochsen and Hugo Kohmann from Dolphin Interconnect Solutions (Norway) willingly granted us access to the source code for the PCI-SCI adapter and answered all of our questions. Roger Butenuth from Universität Paderborn (Germany) kindly gave us access to his Linux port of the PCI-SCI driver.
References
1. Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, and Edward W. Felten. Virtual memory mapped network interface for the SHRIMP multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 142-153, April 1994.
2. William J. Bolosky, Robert P. Fitzgerald, and Michael L. Scott. Simple but effective techniques for NUMA memory management. In Proceedings of the 12th ACM Symposium on Operating System Principles, pages 19-31, December 1989.
3. Edouard Bugnion, Scott Devine, and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 143-156, October 1997.
4. John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating System Principles, October 1991.
5. A. Cox and R. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 14th ACM Symposium on Computer Architecture, December 1990.
6. Dolphin Interconnect Solutions. PCI-SCI cluster adapter specification, May 1996. Version 1.2. See also http://www.dolphinics.no.
7. Michael J. Feeley, William E. Morgan, Frederic H. Pighin, Anna R. Karlin, Henry M. Levy, and Chandramohan A. Thekkath. Implementing global memory management in a workstation cluster. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 201-212, December 1995.
8. Marco Fillo and Richard B. Gillett. Architecture and implementation of MEMORY CHANNEL. Digital Technical Journal, 9(1):27-41, 1997.
9. Richard B. Gillett. Memory channel network for PCI. In IEEE Micro, volume 16, pages 12-18, February 1996.
10. Maximilian Ibel, Klaus E. Schauser, Chris J. Scheiman, and Manfred Weis. High-performance cluster computing using SCI. In Hot Interconnects Symposium V, August 1997.
11. IEEE. IEEE Standard for Scalable Coherent Interface (SCI). 1992. Standard 1596.
12. Peter Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter USENIX Conference, pages 115-132, January 1994.
13. Richard P. Larowe Jr., Carla Schlatter Ellis, and Laurence S. Kaplan. The robustness of NUMA memory management. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 137-151, October 1991.
14. Evangelos P. Markatos and Manolis G.H. Katevenis. Telegraphos: High-performance networking for parallel processing on workstation clusters. In Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA), February 1996.
15. Knut Omang. Performance of a cluster of PCI based UltraSparc workstations interconnected with SCI. In Proceedings of Network-Based Parallel Computing, Communication, Architecture, and Applications (CANPC'98), number 1362 in Lecture Notes in Computer Science, pages 232-246, January 1998.
16. Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 279-289, October 1996.
Indirect Reference Listing: A Robust Distributed GC*
José M. Piquer, Ivana Visconti
Universidad de Chile, Casilla 2777, Santiago, Chile
jpiquer@dcc.uchile.cl
Abstract. Reference Listing is a distributed Garbage Collection (GC) algorithm which replaces Reference Counts by a list of the sites holding references to a given object. It has been successfully implemented on many systems during the last years, and some extensions to it have been proposed to collect cycles. However, Reference Listing is difficult to extend to an environment supporting migration, and it must remain blocked during a reference duplication to avoid race conditions on the reference lists. On the other hand, Indirect GCs are a family of algorithms which use an inverse tree to group all the references to an object, supporting migration and in-transit references. This paper presents a new algorithm, called Indirect Reference Listing, which extends Reference Listing to the tree, supporting migration, in-transit references (without blocking) and some degree of fault tolerance. The GC has been implemented on a distributed Lisp system, and the execution time overhead (compared to a Reference Count) is under 3%.
1 Introduction
Distributed Reference Counting algorithms (Lermen and Maurer [8]), (Bevan [1]), (Watson and Watson [21]) and (Piquer [10]) are being replaced by Reference Listing algorithms (Shapiro, Dickman and Plainfossé [17]; Birrel et al. [2]) in order to tolerate failures, as the latter algorithms are more robust. On the other hand, some extensions to Reference Listing have been proposed to recover cycles based on back tracing (Fuchs [5]; Maheshwari and Liskov [9]) and on partial tracing (Rodrigues and Jones [15]). As in Reference Counting, migration, in-transit references and the mixture of increment and decrement messages (now called add and delete) can cause problems. Indirect Garbage Collection (Piquer [12]) is a family of algorithms based on a diffusion tree of the references, eliminating the need for two kinds of messages (only decrements exist) and supporting migration. However, the tree structure was very fragile and did not support any kind of failure.

* This work has been partially funded by Dirección de Investigación, Fac. Cs. Físicas y Matemáticas, U. de Chile
The environment in which our distributed garbage collector is designed to run is a loosely-coupled multi-processor system with independent memories and a reliable point-to-point message-passing system. In general, we will suppose that message passing is very expensive and that remote pointers already carry many fields of information concerning the remote processor, object identifier, etc. In this kind of system, we can accept losing some memory (e.g., adding fields to the remote pointers) if extra messages can be avoided when requiring remote access or when doing garbage collection. In this paper, we present a new GC, integrating Reference Listing with the Indirect GC family. This algorithm has been implemented on a distributed Lisp environment (Piquer [11]) showing a minimal overhead. Furthermore, we will discuss how Reference Listing can be used to enhance the failure tolerance of the system. The paper is organized as follows: section 2 presents the abstract model of a distributed system with remote pointers used throughout the paper. Section 3 presents the family of indirect garbage collectors and the detailed algorithm of Indirect Reference Listing. Section 4 discusses the behavior of the algorithm in the presence of failures and section 5 compares it with other related work. Finally, section 6 presents the conclusions.

2 The Model
The basic model we will use in this paper is composed of the set of processors P, communicating through a reliable message-passing system with finite but not bounded delays. The set of all objects is noted O. Given an object o ∈ O, it always resides at one and only one processor p_i ∈ P: p_i is called the owner of o. Every object always has one and only one valid owner at a given moment in time. The set of all the remote pointers to an object o is denoted RP(o). It is assumed that an object can migrate, thus changing its owner. However, this operation is considered an atomic operation, and objects are never in transit between two processors. This assumes a strong property of the underlying system, but in general a distributed system blocks every access to a migrating object, and the effect is the same. During a migration operation, a local object is transformed into a remote one and vice-versa. All the other remote pointers in the system remain valid after a migration as they refer to the object, not to a processor. A remote pointer to an object o is called an o-reference 1. For every object o ∈ O, RP(o) includes all of the existing o-references, including those contained in messages already sent but not yet received. This means that asynchronous remote pointer sending is allowed. In-transit remote pointers are considered to be always accessible. An o-reference is a symbolic pointer, usually implemented via an object identifier. It points to the object o, not to its owner.

1 The o-reference terminology and the basis of this model were proposed by Lermen and Maurer ([8]), Piquer ([10]) and Tel and Mattern ([20]).
Only remote pointers are considered in this model, so the local pointers to o are not included in RP(o) and they are not called o-references. In fact, this paper just ignores local references to objects, assuming that only distant pointers exist. The only acceptable values inside an object are remote pointers. The model also requires that, given an object o, every processor can hold at most one o-reference. Remote pointers are usually handled this way: if there are multiple pointers at a processor to the same remote object o, they pass by local indirection. This indirection represents for us the only o-reference at the processor. This is just for simplicity of the model, but it is not a requisite for the algorithms presented, although it simplifies some implementations. In our model then, an object containing a remote pointer to another object o is seen as an object containing a local pointer to an o-reference. The four primitive operations supported are:

1. Creation of an o-reference. A processor p_i, owner of an object o, transmits an o-reference to another processor p_j. The operation happens each time the owner of o sends a message with an o-reference to another processor.
2. Duplication of an o-reference. A processor p_j, which already has an o-reference to an object o (not being its owner), transmits the o-reference to another processor p_k.
3. Deletion of an o-reference. A processor p_j discards an o-reference. This means that the o-reference is no longer locally accessible 2.
4. Migration of an object o. A processor p_j, holding an o-reference to an object o located at a distant processor p_i, becomes the new owner of o. In general, this operation implies the following actions: the old owner transforms its local object o into an o-reference and the new owner transforms its o-reference into a local object o. This operation preserves the number of elements of RP(o) since it just swaps an object o and an o-reference. A migration also generates many Duplication operations: one for each remote pointer contained in the migrating object, from the old to the new owner. The migration protocol is independent of the GC algorithm. However, the GC protocol must support asynchronous owner changes, with in-transit messages, without losing self-consistency.

We suppose that every o-reference creation is performed via one of the above operations. Remote pointer redirection is not supported, so an o-reference always points to the same object o. The minimal set of primitive operations supported is powerful enough to model most existing distributed systems. Our simplified model enables a very simple exposition of the main ideas behind the algorithms but not their detailed implementation, which is usually very complex and system dependent. In fact, the model includes an underlying local GC: objects contain every remote pointer accessible from them. The Delete operation provides the interface to be used by the local GC to signal a remote pointer that is no longer locally accessible.
613 pointer accessible from them. The Delete operation provides the interface to be used by the local GC to signal a remote pointer that is no longer locally accessible.
3 Indirect Reference Listing
In general, a distributed GC uses the information kept in the pointers by the underlying system to send its messages to the referenced object. This can be performed using the object finder, or just using the owner field of the pointer and using forward pointers (Fowler [4]). An Indirect GC is any GC modified to use a new data structure, exclusively managed and used by the GC, to send its messages: every object o has a distributed inverted tree associated to it, containing every host holding an o-reference. The o-references are extended with a new field Parent for the tree pointers. The object o is always at the root of the tree (see Fig. 1).

3.1 The GC Family
Any GC can be transformed into an Indirect GC by simply replacing the messages sent directly to the object with messages sent through the inverted tree. This has two advantages: the GC is independent from the object manager, and it supports migrations, as will be shown later. The inverted tree structure represents the diffusion tree of the pointer throughout the system. This structure contains every o-reference and the object o itself. When an o-reference is created or duplicated, the destination node is added as a child of the sender. If the destination node already had a valid Parent for this o-reference, it refuses the second Parent to avoid cycles in the structure. When an o-reference is deleted, the corresponding GC algorithm will delete the o-reference from the tree only if it was a leaf. If not, a special o-reference (called a zombie o-reference) with all the GC fields, but no object access fields, is kept in the structure until it becomes a leaf. The zombie o-references are a side effect of the inverted tree: if one leaf of the sub-tree is still locally accessible, the GC needs zombie o-references to reach the object at the root. Obviously, when the root is the only node in the tree, there are no more o-references in the system, and the object itself is garbage. The Parent field contains a processor identifier 3. If the tree structure is always preserved (during every remote pointer operation), the GC can work on it without depending on the object finder. It is enough to use the Parent fields to send the GC messages to the ancestors in the tree, and following them recursively we can reach the object at the root, without using the underlying system.

3 This optimization cannot be applied if we accept multiple o-references to the same object from the same processor. The Parent field is in fact a pointer to a remote pointer, but our simplified model allows this to be implemented as a pointer to a processor.
Fig. 1. A migration in the inverted tree
Migration is also allowed, by simply making a root change in the tree structure, which is trivial in an inverted tree (see Fig. 1). On the other hand, the object finder itself does not need to use the Parent field. It can access the object by means of the owner fields, or whatever other fields are convenient.

3.2 Indirect Reference Listing
Indirect Reference Counting (IRC) was originally presented in (Piquer [10]). In a different context, this algorithm was also independently discovered at the same time by other authors (Ichisugi and Yonezawa [6]), (Rudalics [16]). In (Tel and Mattern [20]) it is shown that IRC is equivalent to the Dijkstra and Scholten ([3]) termination detection algorithm. Basically, IRC installs the reference counts at each node of the diffusion tree, counting the children of each one. To implement Indirect Reference Listing (IRL), we will simply replace the counters by child lists, containing the site identifier of each node to which a reference has been sent. The changes to IRC in the implementation are minimal:
- Creation and Duplication
  • At p_i, when an o-reference is sent to p_j, p_j is added to Ref_list(x) (x being the object o for a Creation or the original o-reference for a Duplication) 4.
  • At p_j, upon reception of an o-reference from p_i, if it was already known, a delete message is sent to p_i. If it is newly installed, the o-reference is created with Parent ← p_i and Ref_list ← NIL.
- Deletion
  • When an o-reference is deleted, it is marked as Deleted.
  • Upon reception of a delete message for x from p_j, p_j is deleted from Ref_list(x).
  • When an o-reference is marked as Deleted and its Ref_list = NIL, a delete message is sent to Parent(o-reference), and the o-reference is reclaimed. This test must be made upon reception of a delete message for an o-reference and upon o-reference deletion.

Since we assume in the model that every processor holds at most one o-reference per object o, a deleted o-reference with a non-null Ref_list is a zombie o-reference. If the same o-reference is received again in a message, a delete message must be sent to the o-reference sender, because the zombie already has a valid Parent.

- Migration
  The migration of an object o from p_i to p_j means a change of the root in the diffusion tree. This operation is trivial on an inverted tree if the old root is known. For IRL, the lists must also be kept consistent, so the new owner (p_j) has a new child (adding p_i to its Ref_list) and has no Parent (sending a delete to its old Parent). The migration costs one delete message, plus some work to be done locally at the respective processors, p_i and p_j.

4 If p_j was already in Ref_list(x), this operation will generate a duplicate. It is safe to do so, because a delete message could be in transit from p_j. It is a transient situation only, because p_j will refuse a second Parent for o, generating a delete message.
When the Ref_list of an object is NIL, the object can be collected. An inverted tree with reference lists is shown in Fig. 2.
Fig. 2. Indirect Reference Listing
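The per-reference state and the delete-message handling of IRL can be summarized in a short sketch. The list helpers and the send_delete primitive are assumed runtime support, and error handling is omitted; the rules themselves are transcribed from the algorithm above.

/* Sketch of IRL state and message handling; helper names and the
 * messaging primitive are assumptions, not part of the published
 * algorithm. */
#include <stdlib.h>

typedef int site_t;

struct site_list { site_t site; struct site_list *next; };

struct oref {                      /* one o-reference (or the object at the root) */
    int              is_root;      /* true on the owner node              */
    int              deleted;      /* marked Deleted, possibly a zombie   */
    site_t           parent;       /* ancestor in the diffusion tree      */
    struct site_list *ref_list;    /* children the reference was sent to  */
};

void send_delete(site_t to);       /* assumed message primitive */

static void list_add(struct site_list **l, site_t s)
{
    struct site_list *n = malloc(sizeof *n);
    n->site = s; n->next = *l; *l = n;
}

static void list_remove_one(struct site_list **l, site_t s)
{
    for (; *l; l = &(*l)->next)
        if ((*l)->site == s) { struct site_list *d = *l; *l = d->next; free(d); return; }
}

/* Creation/Duplication, sender side: remember the child. */
void on_send(struct oref *x, site_t dest) { list_add(&x->ref_list, dest); }

/* A Deleted o-reference with an empty Ref_list sends one delete
 * message up the tree and can then be reclaimed. */
static void try_reclaim(struct oref *x)
{
    if (x->deleted && x->ref_list == NULL && !x->is_root)
        send_delete(x->parent);    /* then free the o-reference (not shown) */
}

void on_local_delete(struct oref *x) { x->deleted = 1; try_reclaim(x); }

void on_delete_message(struct oref *x, site_t from)
{
    list_remove_one(&x->ref_list, from);
    try_reclaim(x);                /* at the root, Ref_list == NIL means
                                      the object itself is garbage       */
}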
IRL is very straightforward to implement, but like Reference Counting, it lacks the ability to collect cycles. However, the extensions proposed for Reference Listing to collect cycles could also be applied here. Back tracing could be implemented following the reference lists through the tree. IRL uses the inverted tree to send messages, so even though the Parent pointers are modified during migration (the other operations are only allowed to initialize them, not to change them), the reference lists never move. Objects migrate alone and they get the local o-reference's Ref_list. Therefore, any in-transit delete message will always arrive at the correct destination. IRL was implemented in a distributed Lisp system, and compared to IRC it only adds a 3% overhead on execution time. Of course, the main overhead is in space, depending on the total number of o-references. However, an interesting point worth noting is that IRL is backward-compatible with IRC (if decrement and delete messages are considered equivalent), and thus some sites could be using IRL while others use IRC in a correct distributed GC.
4 Failures
Using Reference Lists, it is possible to resist some failures. IRL is a little different, so we will examine in turn: a node wanting to leave the system, a node crash, and message loss or duplication.

4.1 Shutdown
One important problem for Indirect GCs is that it is not obvious how a node can leave the system once it has diffusion trees traversing it. In large distributed systems, it is common for a node to need a shutdown, for maintenance or load reasons. It is important to be able to execute a shutdown at the node, deleting all o-references and migrating all its remotely pointed objects. For Indirect GCs, zombie o-references must also be eliminated. In IRC, a zombie o-reference could not be deleted, because its children need it to exist. On the other hand, we knew how many children it had, but not who they were. With IRL, we know the children identifiers, so we can start an intermediate node deletion algorithm (suppose p_i is shutting down):

- foreach x zombie o-reference
  • p_i sends (adopt, Ref_list(x)) to Parent(x)
- At p_j, upon reception of (adopt, list) for x, foreach p in list
  • Add p to Ref_list(x), send (new_parent, o-reference) to p
- At p_k, upon reception of new_parent for x from p_j:
  • send a delete to Parent(x), set Parent(x) ← p_j.

Processor p_i waits until there are no more zombie o-references before finally shutting down. Any processor implementing IRL can shut down cleanly using this protocol. If the other nodes are only using IRC, this protocol still works. If the node willing to leave does not implement IRL, this protocol cannot be engaged. Another possible use of this protocol is to implement intermediate node deletion in the tree, avoiding zombie o-references completely: instead of marking an o-reference as deleted, the Delete operation could send an adopt message to the Parent.
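A sketch of the three message handlers involved in this shutdown protocol is given below; the message primitives and the list helper are assumed, and the iteration over zombie o-references on the leaving node is left out.

/* Sketch of the adopt/new_parent/delete handlers of the shutdown
 * protocol; all primitives are assumed runtime support. */
typedef int site_t;
struct site_list { site_t site; struct site_list *next; };
struct oref {
    int deleted;                    /* zombie = deleted but non-empty list */
    site_t parent;
    struct site_list *ref_list;
};

/* Assumed messaging primitives and list helper. */
void send_adopt(site_t to, struct site_list *children);
void send_new_parent(site_t to, site_t new_parent);
void send_delete(site_t to);
void list_add(struct site_list **l, site_t s);

/* On the node p_i that is shutting down: hand every zombie's children
 * over to its parent, then wait until no zombies remain. */
void shutdown_zombie(struct oref *z)
{
    send_adopt(z->parent, z->ref_list);
}

/* On the parent p_j: adopt the orphaned children and tell each of
 * them who their new parent is. */
void on_adopt(struct oref *x, site_t self, struct site_list *children)
{
    for (struct site_list *c = children; c; c = c->next) {
        list_add(&x->ref_list, c->site);
        send_new_parent(c->site, self);
    }
}

/* On a child p_k: detach from the leaving node and re-attach to p_j. */
void on_new_parent(struct oref *x, site_t new_parent)
{
    send_delete(x->parent);         /* old parent drops us from its list */
    x->parent = new_parent;
}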
4.2 Crashes
Supposing that every node is informed of a node crash, we would like to have a protocol to rebuild a consistent tree for every object accessed through the failed site. Every site can search its Ref_lists for the failed site and delete the references coming from it. However, indirect references could also exist, and the failed site could also be the Parent of some subtrees. These subtrees need to be rebuilt. One alternative is to generate an error on access, and then to ask for the object again, or to try to move the subtrees to the root. In any case, the involved nodes can now detect this situation and handle it.

4.3 Message Loss/Duplication
Reference Lists are usually used to tolerate message loss and duplication. IRL does not support these behaviors, because we accept multiple copies of a site in a Ref_list to protect in-transit references. If it is necessary to deal with unreliable transports, exchanging the Ref_lists as proposed by Shapiro, Dickman and Plainfossé [17] is probably the best solution.
5 Related Work

- A reference listing algorithm using indirect references to provide fault tolerance is SSP (Stub-Scion Pairs) (Shapiro, Dickman and Plainfossé [17]; Plainfossé [14]). This algorithm is based on replacing the reference counts by a site list, containing every site to which a duplication has been sent. The reference lists are exchanged between sites to keep the information up to date. This extra information provides tolerance to duplication and disorder if a timestamp is added to the messages. A prototype implementation has shown a 10% overhead compared to plain IRC (Plainfossé and Shapiro [13]), but supporting message failures. Indirect Reference Listing is simpler and more efficient, while being more fragile in the presence of failures.
- Using Reference Lists to tolerate failures was originally proposed by Birrel et al. [2] and has been implemented in Sun's Remote Method Invocation for Java (Sun [19]). It does not support migration and must block the sender of a duplicated reference until an acknowledgment is received from the owner. IRL is not as strong in the presence of failures, but provides a more efficient solution to the problem.
- Fuchs [5] and Maheshwari and Liskov [9] propose back tracing techniques based on Reference Listing. Basically, they perform a tracing GC backwards, from the objects to their roots. Reference Lists are useful for that, because they give us the list of sites pointing to our site. However, inside local spaces the problem is harder, because an o-reference does not know which object contains it. Both papers propose techniques to deal with that issue. These techniques can be equally applied to IRL, because by following the Ref_list pointers the tree can be traversed from the root to the leaves (which is a back trace in this case).
6 Conclusions
We propose a new distributed GC algorithm, mixing Reference Listing with Indirect Garbage Collection: IRL. The algorithm is very simple and straightforward to implement, and backward compatible with IRC. The advantage is that a node can leave the system using a shutdown protocol, zombie o-references can be deleted, and site crashes can be handled. Furthermore, back tracing can be applied to IRL, extending it to collect cycles (which is not possible with plain IRC). An implementation in a distributed Lisp system shows an execution overhead under 3% and demonstrates the ability to inter-operate between IRC and IRL. Compared with related work, IRL is simpler, supports migration and in-transit references, and does not block a site during duplication. On the other hand, it is more robust than plain IRC.
References
1. Bevan, D. I.: "Distributed Garbage Collection Using Reference Counting," LNCS 259, PARLE'87 Proceedings Vol. II, Eindhoven, Springer Verlag, June 1987.
2. Birrel, A., Evers, D., Nelson, G., Owicki, S. and Wobber, E.: "Distributed Garbage Collection for Network Objects," Technical Report 116, DEC-SRC, 1993.
3. Dijkstra, E. W. and Scholten, C. S.: "Termination Detection for Diffusing Computations," Information Processing Letters, Vol. 11, N. 1, August 1980.
4. Fowler, R. J.: "The Complexity of Using Forwarding Addresses for Decentralized Object Finding," in Proc. 5th Annual ACM Symp. on Principles of Distributed Computing, pp. 108-120, Alberta, Canada, August 1986.
5. Fuchs, M.: "Garbage Collection on an Open Network," LNCS 986, Proc. 1995 International Workshop on Memory Management (IWMM'95), Springer Verlag, pp. 251-265, Kinross, UK, September 1995.
6. Ichisugi, Y. and Yonezawa, A.: "Distributed Garbage Collection using group reference counting," Tech. Report 90-014, Dept. Information Science, Univ. of Tokyo, 1990.
7. Lang, B., Queinnec, C. and Piquer, J.: "Garbage Collecting the World," 19th ACM Conference on Principles of Programming Languages 1992, Albuquerque, New Mexico, January 1992, pp. 39-50.
8. Lermen, C. W. and Maurer, D.: "A Protocol for Distributed Reference Counting," Proc. 1986 ACM Conference on Lisp and Functional Programming, Cambridge, Massachusetts, August 1986.
9. Maheshwari, U. and Liskov, B.: "Collecting Distributed Garbage Cycles by Back Tracing," Proc. 1997 ACM Symp. on Principles of Distributed Computing (PODC'97), Santa Barbara, California, August 1997, pp. 239-248.
10. Piquer, J.: "Indirect Reference Counting: A Distributed GC," LNCS 505, PARLE'91 Proceedings Vol. I, pp. 150-165, Springer Verlag, Eindhoven, The Netherlands, June 1991.
11. Piquer, J.: "A Re-Implementation of TransPive: Lessons from the Experience," Proc. Parallel Symbolic Languages and Systems (PSLS'95), LNCS 1068, Vol. I, pp. 310-329, Springer Verlag, Beaune, France, October 1995.
12. Piquer, J.: "Indirect Distributed Garbage Collection: Handling Object Migration," ACM Trans. on Programming Languages and Systems, V. 18, N. 5, September 1996, pp. 615-647.
13. Plainfossé, D. and Shapiro, M.: "Experience with a Fault-Tolerant Garbage Collector in a Distributed Lisp System," LNCS 637, Proc. 1992 International Workshop on Memory Management (IWMM'92), Springer Verlag, pp. 116-133, St-Malo, France, September 1992.
14. Plainfossé, D.: "Distributed Garbage Collection and Referencing Management in the Soul Object Support System," PhD Thesis, U. de Paris VI, France, June 1994.
15. Rodrigues, H. and Jones, R.: "A Cyclic Distributed Garbage Collector for Network Objects," LNCS 1151, Proc. 10th Workshop on Distributed Algorithms (WDAG'96), Springer Verlag, pp. 123-140, Bologna, Italy, October 1996.
16. Rudalics, M.: "Implementation of Distributed Reference Counts," Tech. Report, RISC, J. Kepler Univ., Linz, Austria, 1990.
17. Shapiro, M., Dickman, P. and Plainfossé, D.: "Robust, Distributed References and Acyclic Garbage Collection," ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, August 1992.
18. Shapiro, M., Dickman, P. and Plainfossé, D.: "SSP Chains: Robust, Distributed References Supporting Acyclic Garbage Collection," INRIA Res. Report 1799, November 1992.
19. Sun Microsystems Computer Corporation: "Java Remote Method Invocation Specification," http://sunsite.unc.edu/java, November 1996.
20. Tel, G. and Mattern, F.: "The Derivation of Distributed Termination Detection Algorithms from Garbage Collection Schemes," ACM Trans. on Programming Languages and Systems, Vol. 15, N. 1, January 1993, pp. 1-35.
21. Watson, P. and Watson, I.: "An Efficient Garbage Collection Scheme for Parallel Computer Architectures," LNCS 259, PARLE Proceedings Vol. II, Eindhoven, Springer Verlag, June 1987.
Active Ports: A Performance-Oriented Operating System Support to Fast LAN Communications
G. Chiola, G. Ciaccio
DISI, Università di Genova, via Dodecaneso 35, 16146 Genova, Italy
{chiola,ciaccio}@disi.unige.it http://www.disi.unige.it/project/gamma/
Abstract. The Genoa Active Message MAchine (GAMMA) is an efficient communication layer for 100base-T clusters of Personal Computers running Linux. It is based on Active Ports, a communication mechanism derived from Active Messages. GAMMA Active Ports deliver excellent communication performance at user level (latency 12.7 µs, maximum throughput 12.2 MByte/s, half-power point reached with 192 byte long messages), thus enabling cost-effective cluster computing on 100base-T. Despite being implemented at kernel level in the Linux OS, the performance numbers of GAMMA Active Ports are much better than those of many other LAN-oriented communication layers, including so-called "user-level" ones (e.g. U-Net).
1 Introduction
Networks of workstations (NOWs) or, even better, clusters of Personal Computers (PCs) networked by inexpensive commodity interconnects, potentially offer a cost-effective support for parallel processing. The only obstacle to overcome is the inefficiency caused by the OS layers that sedimented in the communication architecture during past eras of computer and communication technology. Thus the issue of enhancing or by-passing OS kernels with performance-oriented support for inter-process communication has become crucial. So far this issue has been addressed in two ways, which we call the "efficient OS support" approach and the "user-level" approach. In the first approach the messaging system is supported by the OS kernel with a small set of flexible and efficient low-level communication mechanisms and by simplified communication protocols carefully designed according to a performance-oriented approach. Issues like choosing the right communication abstraction and implementing such an abstraction efficiently are of primary concern here. The "efficient OS support" architecture fits nicely into the structure of modern OSs providing protected access to the communication devices. This way multi-user, multi-tasking features need not be limited or modified. The "user-level" approach aims at improving performance by minimizing the OS involvement in the communication path in order to obtain a closer integration between the application and the communication device: the use of
system calls is minimized, allowing direct access to the network interface whenever possible. Here the challenge is to provide direct access to the communication devices without violating the OS protection model. In order for such an approach not to compromise protection in the access to the communication devices, either a single-user environment or some form of gang scheduling is usually required. These two alternatives have their own obvious drawbacks. A third solution is to leverage programmable NICs which can run the necessary support for device multiplexing in place of the OS kernel. Currently this implies much higher hardware costs. The best known representative of the "user-level" approach is U-Net. With U-Net, user processes are given direct protected access to the network device with no virtualization. Any communication layer as well as standard interface must move to user-level programming libraries. U-Net runs on an ATM network of SPARCstations [7], where a communication co-processor located on the NIC executes proper firmware in order to multiplex a pre-defined number of virtual "endpoints" over the NIC itself, with no OS involvement. In the Fast Ethernet emulation of U-Net [8], the low-cost NIC does not provide a programmable co-processor. The host CPU itself multiplexes the endpoints over the NIC by means of proper OS support, thus actually following an "efficient OS support" rather than a "user-level" architecture. The "user-level" architecture is believed to deliver better communication performance than the "efficient OS support" architecture by avoiding OS involvement. Such a belief is now contradicted by our experience, at least in the case of commodity off-the-shelf devices.
2 Active Ports in GAMMA
The Genoa Active Message MAchine (GAMMA) [3] is a fast messaging system running on a 100base-T cluster of Pentium PCs running Linux and equipped with either 3COM 3C595-TX Fast Etherlink or 3COM 3C905-TX Fast Etherlink XL PCI 100base-T adapters. Porting GAMMA to DEC "tulip"-based and Intel EtherExpressPro NICs is in progress. Following the "efficient OS support" principle, with GAMMA the Linux kernel has been enhanced with a communication layer implemented as a small set of additional light-weight system calls (that is, system calls with no intervention of the scheduler upon return) and a custom NIC driver. Most of the communication layer is embedded in the Linux kernel at the NIC driver level, the remaining part being placed in a user-level programming library. All the multi-user, multi-tasking functionalities of the Linux kernel have been preserved. All the communication software of GAMMA has been carefully developed according to a performance-oriented approach, extensively described in [3]. The adoption of a communication mechanism that we call Active Ports allowed a "zero copy" protocol, with no intermediate copies of messages along the whole user-to-user communication path.
With GAMMA, each process owns a number of (currently 255) active ports. GAMMA active ports are inspired by the Thinking Machines' CMAML Active Message library for the CM-5. Each active port can be used to exchange messages with another process in the same or in a different process group. GAMMA directly supports SPMD parallel programming with automatic replication of the same process on different computation nodes. However, plain MIMD programming is allowed as well, since sender and receiver need not share a code address space. Indeed the message handlers used with active ports are designated by the receiver, not by the sender as in Generic Active Messages [6]. Prior to using an active port for outbound communications, the owner process must bind it to a destination, in the form of a triple (process group, process instance, destination active port). To send a message through an active port, the process has to invoke a "send" light-weight system call which traps to the GAMMA device driver. Each message is then transmitted as a sequence of one or more Ethernet frames. Following a "zero copy" policy, frames are arranged directly into the NIC's transmit FIFO by DMA of each frame body directly from the sender process' data structure, with no intermediate buffering in kernel memory space. Frame headers are pre-computed at port binding time. As soon as the last frame of a message has been written to the NIC's transmit FIFO, the "send" system call returns and the sender process resumes user-level activity. Active ports can be used to broadcast messages as well. The Ethernet broadcast service is exploited directly, so that one single physical transmission occurs. Efficient barrier synchronization is also implemented using this facility [2]. For inbound communications, prior to using a given active port L, the owner process must bind L to an application-defined function called the receiver handler as well as to an application-defined final destination data structure (spanning a contiguous virtual memory region) where each message incoming through port L is to be stored. The final destination must be pinned down in physical RAM. At this point the local instance of the GAMMA driver knows in advance the final destination in user space for messages incoming through port L. Therefore, when the GAMMA device driver detects frame arrivals for port L, the frame payloads can be copied directly to their final destination, with no temporary storage in kernel memory space and regardless of the scheduling state of the receiver process at arrival time. Moreover, as soon as the last piece of a message has been delivered to port L, the GAMMA driver can immediately run the receiver handler bound to L so as to allow real-time management of the message itself. The receiver handler is run on the interrupt stack. If the receiver process is not currently running, a temporary context switch is performed by the GAMMA driver so as to ensure that the handler is run in the receiver context. The possibility of running the receiver handler in real time is crucial in order to prevent subsequent messages incoming to port L from overlapping the current one when this is to be avoided.
Indeed a typical receiver handler: 1) integrates the current message into the main thread of the destination process; 2) notifies the message reception to the main thread itself; 3) re-binds port L to a fresh final destination in user space for the next incoming message.
The GAMMA implementation of Active Ports relies on a very simplified communication protocol with neither error recovery nor flow control. However, the only source of frame corruption in modern LANs is frame collisions on shared networks. This may be avoided either by using switched networks or by explicitly scheduling communications at the application level. Flow control is unnecessary as long as the network is the throughput bottleneck. The flexibility of GAMMA allows either using its simplified and efficient protocol "as is", or building more complex and reliable higher-level protocols using the GAMMA error detection features [3].
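The usage pattern of Active Ports described above can be sketched as follows. The function names are illustrative placeholders rather than the actual GAMMA library calls; they only mirror the bind/send/receiver-handler structure just described, with a double-buffered final destination re-bound by the handler.

/* Sketch of the Active Ports usage pattern; ap_bind_out, ap_bind_in
 * and ap_send are assumed wrapper names, not the GAMMA API. */
#include <stddef.h>

#define PORT 7

static double recv_buf[2][1024];   /* final destinations, pinned by the system   */
static volatile int arrived;       /* notification flag polled by the main thread */
static int cur;                    /* which buffer the next message lands in      */

/* Assumed light-weight system call wrappers. */
void ap_bind_out(int port, int group, int instance, int dest_port);
void ap_bind_in(int port, void (*handler)(void), void *final_dest, size_t len);
void ap_send(int port, const void *data, size_t len);

/* Receiver handler: runs on the interrupt stack as soon as the last
 * frame of a message has been delivered into recv_buf[cur]. */
static void handler(void)
{
    /* step 2: notify the main thread that a message is complete */
    arrived++;
    /* step 3: re-bind the port to a fresh final destination so the next
     * message cannot overwrite the one just received (step 1, integrating
     * the data, is left to the main thread in this sketch). */
    cur ^= 1;
    ap_bind_in(PORT, handler, recv_buf[cur], sizeof recv_buf[cur]);
}

int main(void)
{
    double msg[1024] = { 0 };

    /* Outbound: bind the port to (group, instance, remote port) once, then
     * each send is a single trap; frames are DMAed straight from msg[]. */
    ap_bind_out(PORT, /*group*/ 0, /*instance*/ 1, /*dest port*/ PORT);

    /* Inbound: declare the final destination and the handler in advance so
     * incoming frame payloads can be copied straight into recv_buf. */
    ap_bind_in(PORT, handler, recv_buf[cur], sizeof recv_buf[cur]);

    ap_send(PORT, msg, sizeof msg);

    while (!arrived)
        ;                          /* spin until the handler has run */
    return 0;
}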
2.1 Error Handlers for User-Defined Error Recovery
As a by-product of the simplification of the communication protocol and of the elimination of intermediate copies of messages, GAMMA provides neither message acknowledgement nor explicit flow control. However:

- In a LAN environment with good quality wiring, the only source of frame corruption is collision with other frames in the case of shared Ethernet. Such a possibility could be eliminated by using a switch instead of a repeater hub. In any case, frames corrupted by collision have a bad CRC and are discarded by the receiver NIC driver.
- If the receiver handler on the receiver side of a communication runs quickly enough (as is usually the case), both the RAM-to-NIC transfer rate on the sender side and the NIC-to-RAM transfer rate on the receiver side are much greater than the Fast Ethernet transfer rate. Hence, flow is implicitly controlled by simply testing whether there is enough room in the transmit FIFO of the sender's NIC before writing a frame.

The choice of GAMMA is to implement some error detection mechanisms and to allow applications to have their own error management policies if required. This is supported by allowing applications to bind each active port to an error handler. An error handler is similar to a receiver handler but runs in case of communication errors rather than on message arrivals. Error handlers can implement higher-level reliable communication protocols, if really needed for a given application.
3 Communication Performance and Conclusions
According to our average estimates based on a "ping-pong" microbenchmark, the GAMMA messaging system exhibits an unprecedented one-way user-to-user latency (delay for a zero-byte message) as low as 12.7 µs, with a maximum throughput of 12.2 MByte/s (98% of the Fast Ethernet raw asymptotic bandwidth) for long (500 kByte) messages. The half-power point is achieved with messages as short as 192 bytes. Such numbers enable cost-effective parallel computing on 100base-T clusters of PCs [1, 4, 5].
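As a rough consistency check (our own back-of-the-envelope estimate, not part of the original measurements), a simple linear cost model relates the reported latency and asymptotic bandwidth to the half-power message length:

$T(n) \approx t_0 + \frac{n}{B_\infty}, \qquad n_{1/2} = t_0 \cdot B_\infty \approx 12.7\,\mu\mathrm{s} \times 12.2\,\mathrm{MByte/s} \approx 155\ \mathrm{bytes},$

which has the same order of magnitude as the measured 192 bytes; the difference reflects the fact that short messages do not yet reach the asymptotic per-byte cost.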
The improvement achieved by GAMMA with respect to Linux TCP/IP sockets on similar hardware (latency 150 µs, maximum throughput 7 MByte/s) is more than one order of magnitude for latency and 71% for bandwidth. A latency improvement of more than a factor of two has been obtained with respect to the U-Net "user-level" architecture on similar or superior hardware (U-Net latency is 30 µs on 100base-T and 35.5 µs on 140 Mbps ATM). Therefore GAMMA delivers better communication performance than U-Net while providing a higher-level and more flexible communication abstraction with the same quality of service. As a conclusion, we contradict the common belief that only a "user-level" communication architecture may yield a satisfactory answer to the demand for efficient inter-process communication in a low-cost LAN environment: the improved application-to-device integration yielded by the "user-level" approach, and the corresponding performance gain, is negligible, given that communication hardware cannot be closely integrated with the host CPU in commodity-based NOWs. Much better results can be obtained by optimizing the communication protocol and choosing a suitable communication abstraction. Indeed, the GAMMA Active Ports are a flexible communication abstraction which preserves the Linux multi-user multi-tasking environment while still allowing the best communication performance, based on traditional kernel-level protection mechanisms and low-cost commodity interconnects.
References
1. G. Chiola and G. Ciaccio. Architectural Issues and Preliminary Benchmarking of a Low-cost Network of Workstations based on Active Messages. In Proc. 14th Conf. on Architecture of Computer Systems (ARCS'97), Rostock, Germany, Sept. 1997.
2. G. Chiola and G. Ciaccio. Fast Barrier Synchronization on Shared Fast Ethernet. In Proc. 2nd Int. Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'98), LNCS 1362, pages 132-143, Feb. 1998.
3. G. Ciaccio. Optimal Communication Performance on Fast Ethernet with GAMMA. In Proc. Workshop PC-NOW, IPPS/SPDP'98, pages 534-548, Orlando, FL, April 1998. LNCS 1388.
4. G. Ciaccio and V. Di Martino. Porting a Molecular Dynamics Application on a Low-cost Cluster of Personal Computers running GAMMA. In Proc. Workshop PC-NOW, IPPS/SPDP'98, pages 524-533, Orlando, FL, April 1998. LNCS 1388.
5. G. Ciaccio, V. Di Martino, and P. Lanucara. Porting the Flame Front Propagation Problem on GAMMA. In Proc. HPCN Europe '98, Amsterdam, The Netherlands, April 1998.
6. D. Culler, K. Keeton, L.T. Liu, A. Mainwaring, R. Martin, S. Rodriguez, K. Wright, and C. Yoshikawa. Generic Active Message Interface Specification. Tech. Report, White Paper of the NOW Team, CS Dept., U. California at Berkeley, 1994.
7. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proc. 15th ACM Symp. on Operating Systems Principles (SOSP'95), Copper Mountain, CO, Dec. 1995. ACM Press.
8. M. Welsh, A. Basu, and T. von Eicken. Low-latency Communication over Fast Ethernet. In Proc. Euro-Par'96, Lyon, France, August 1996.
Workshop 6+16+18 Languages Henk Sips, Antonio Corradi and Murray Cole Co-chairmen
Three workshops (6, 16, and 18) have programming languages and models as their central theme. Workshop 6 focuses on the use of object-oriented paradigms in parallel programming; Workshop 16 has the design of parallel languages as its primary focus; and Workshop 18 deals with programming models and methods. Together they present a nice overview of current research in these areas.
Object Oriented Programming. Object-oriented (OO) technology has received renewed stimulus from the ever-increasing usage of the Web and Internet technology. This globalisation has enlarged the number of potentially involved users to such an extent as to suggest a reconsideration of the available environments and tools. It has also motivated the attempt to clarify all debatable and ambiguous points, not all of which are of practical and immediate application. On the one hand, several issues connected to program correctness and semantics are still unclear. In particular, the introduction of concurrency and parallelism within an object framework is still subject to discussion, in its verification and modelling. The same holds for the aggregation of different objects and behaviours in predetermined patterns. How to accommodate several execution capacities and resources within the same object or pattern still requires work and new proposals. In any case, there is much to be done, both in the abstract area and in the applied field. On the other hand, the distributed framework has not only introduced examples in need of practical solutions and environments, but has also forced a reflection on all applicable models, from traditional ones, such as the client-server model and RPCs, to less traditional ones, such as agent models. The growth of Java and CORBA, and their capacity to attract implementors and resources, produces unifying perspectives, with the possibility of offering a new integrated framework in which to solve most common problems. This begins to make it possible to create generally available components that can really be employed, reducing the incentive to redesign from scratch. This conference session is an occasion to expose some of the hot topics and research going on in the OO area to an enlarged audience working in parallelism. And, even if the OO community has many occasions to meet and many forums in which to exchange opinions, Euro-Par seems a particular opportunity both to present experiences and to receive contributions, with possibilities of cross-fertilisation. The five papers presented in the conference explore several of the most strategic directions of evolution of the OO area.
The first paper, "Dynamic Type Information in Process Types" by Puntigam, uses the process as an example of an object with a dynamic type. The goal is to make all the checks typical of static types possible even in the dynamic case: the process is modelled as an active object that, depending on its state, is capable of accepting different messages from different clients. The presented model is a refinement of previous work by the same author and is based on a calculus of objects that communicate by asynchronous message passing. The third paper of the session, by Gehrke, "An Algebraic Semantics for an Abstract Language with Intra-Object-Concurrency," addresses the problems of intra-object concurrency, working with a process algebra method. The goal is to introduce a formal semantics for intra-object concurrency in OO frameworks where active processes can be distinguished from passive objects. Let us recall that this is the Java assumption. The topics of the other papers are all connected to Java. The paper by Launay and Pazat, "Generation of distributed parallel Java programs", and the fifth paper, by Giavitto, De Vito and Sansonnett, "A Data Parallel Java Client-Server Architecture for Data Field Computations", address the point of enlarging the usage of the Java framework. Launay and Pazat propose a framework capable of transparently distributing the Java components of an application onto the available target architecture. Giavitto, De Vito and Sansonnett apply their effort to making Java usable in the data-parallel paradigm, for client-server applications. The fourth paper, by Lorcy and Plouzeau, "An object-oriented framework for managing the quality of service of distributed applications", addresses the quality of service problem for interactive applications. The authors elaborate on the known concept of 'contract', with new considerations and insight.
Programming Languages.
Up to now, data parallel languages have been the most successful attempt to bring parallel programming closer to the application programmer. The most serious attempt has been the definition of High Performance Fortran (HPF) as an extension of Fortran 90. Several commercial compilers are available today. However, user experience indicates that for irregularly structured problems the current definition is often inadequate. The first two papers in the HPF session deal with this problem. The paper by Brandes and Germain, called "A tracing protocol for optimizing data parallel irregular computations", describes a dynamic approach that allows the user to specify which data has to be traced for modifications. The second paper, by Brandes, Bregier, Counilh, and Roman, proposes a programming style for irregular problems close to that for regular problems. In this way compile-time and run-time techniques can be readily combined. A recent language extension for shared-memory programming that has caught a lot of attention is OpenMP. OpenMP is a kind of reincarnation of the old PCF programming model. The paper by Chapman and Mehrotra describes various ways in which HPF and OpenMP can be combined to form a powerful programming system. Also the Java language can be fruitfully used as a basis
for parallel programming. The paper by Carpenter, Zhang, Fox, Li, Li, and Wen outlines a conservative set of language extensions to Java to support an SPMD style of programming. General parallel programming languages have a hard time obtaining optimal performance for specific cases. If the application domain is restricted, better performance can be obtained by using a domain-specific parallel programming language. The paper by Spezzano and Talia describes the language CARPET, intended for programming cellular automata systems. Finally, the paper by Hofstadt presents the integration of task parallel extensions into a functional programming language. This approach is illustrated by a branch and bound example.
Programming Models and Languages.
Producing correct software is already a difficult task in the sequential context. The challenge is compounded by the conceptual complexity of parallelism and the requirement for high performance. Building on the foundations laid in Workshop 7 in the preceding instantiation of Euro-Par, Workshop 18 focuses on programming and design models that abstract from low-level programming techniques, present software developers with interfaces that reduce the complexity of the parallel software construction task, and support correctness issues. It is also concerned with methodological aspects of developing parallel programs, particularly transformational and calculational approaches, and associated ways of integrating cost information into them. The majority of papers this year work from a "skeletal" programming perspective, in which syntactic restrictions are used both to raise the conceptual level at which parallelism is invoked and to constrain the resulting implementation challenge. Mallet's work links the themes of program transformation (here viewed as a compilation strategy) and cost analysis, using symbolic methods to choose between distribution strategies. His source language is the by now conventional brew of nested vectors and the map, fold, scan skeleton family, while the cost analysis borrows from the polytope volume techniques of the Fortran parallelization world, an interesting and encouraging hybrid. Skillicorn and colleagues work with the P3L language as source and demonstrate that the use of BSP as an implementation mechanism enables a significant simplification of the underlying optimisation problem. The link is continued in the work of Osoba and Rabhi, in which a skeleton abstracting the essence of the multigrid approach benefits from the portability and cost predictability of BSP. In contrast, Keller and Chakravarty work with the well-known data-parallel language NESL, introducing techniques which allow the existing concept of "flattening transformations" (which allow efficient implementation of the nested parallel structures expressible in the source language) to be extended to handle user-defined recursive types, and in particular parallel tree structures. Finally, Vlassov and Thorelli apply the ubiquitous principle of simplification through abstraction to the design of a shared memory programming model.
In summary, we expect from these sessions a fruitful discussion of the hot topics in concurrency and parallelism in several areas. This enlarged exchange of ideas can have an impact on advances of the discipline in the whole distributed and concurrent computing field.
Workshop 1 Support Tools and Environments Chris Wadsworth and Helmar Burkhart Co-chairmen
Support Tools and Environments. Programming tools are a controversial topic: on the one hand, everyone agrees that good tools are needed for productive software development, and a plethora of tools has been built in the past; on the other hand, many of the tools developed are not used (maybe not even by their developers), and it is not easy for companies offering tools for high-performance computing to survive. In order to improve the situation, both scientific activities, such as workshops and conferences discussing technical challenges, and organizational activities, such as consortia and software tool repositories, are needed. The one-day workshop Support Tools and Environments consists of a special session on PSTPA projects (see the separate introduction by Chris Wadsworth) and five additional papers concentrating on performance-oriented tools and on debugging and monitoring. Peter Kacsuk (KFKI, Hungary), Vaidy Sunderam (Emory University, USA), and Chris Wadsworth (Rutherford Appleton Lab, UK) have been co-organizers of this workshop. I would like to thank them and the reviewers for all the help and work which made this event possible.
Introduction to PSTPA Session
Background and Aims. Despite the forecast demand for the high performance computing power of parallel machines, the fact remains that programming such machines is still an intellectually demanding task. While there have been some significant recent advances (e.g. the evolution and widespread adoption of the PVM and MPI message passing standards, and the development of High Performance Fortran), there remains an urgent need for a greater understanding of the fundamentals, methods, and methodology of parallel software development to support a broad application base and to engender the next generation of parallel software tools. The PSTPA programme seeks in particular:
– to capitalise on UK strengths in research into the foundations of generic software tools, languages, application generators, and program transformation techniques for parallel computing,
– to widen the application programmer base and to improve productivity by an order of magnitude,
– to maximise the exploitation potential of tools developed in the programme,
– to maximise architectural independence of the tools and to maximise the leverage of open systems standards,
– to focus on easy-to-use tools which can provide particular benefit to the engineering, industrial, and information processing user sectors, and
– to prove the effectiveness of the tools and the value of the standards to potential users with appropriate demonstrations.
A long term goal is to hide the details of parallelism from the user, and this programme also aims to address this. In the short term, however, it remains necessary to provide tools which facilitate the exploitation of parallelism explicitly at an appropriate level for the community of application programmers. These tools will help in the expression, modelling, measurement, and management of parallel software, as much as possible in a portable fashion. The scope of the programme thus covers the spectrum from incremental advances in the present generation of tools to research that will engender the next generation of tools. In the medium term (five years) it is envisaged that tools will become increasingly 'generative', e.g. application generators, and 'performance-oriented', at least for particular application domains, e.g. for databases, embedded systems, or signal and image processing. The effectiveness of the tools in practice is a key objective. 15 projects (listed below) have been supported by the programme, including both projects to aid the porting of existing software and ones that are researching methods and tools to reduce the effort in developing new applications software. Each project is academically led, with one or more (up to 7 in one case) industrial or commercial partners (not listed). At the time of writing (May 1998), 9 of the projects have finished, with the remaining 6 finishing over the next 12 months.
Papers. The six papers in this PSTPA Special Session are from different projects and have been refereed and accepted as regular submissions for EuroPar'98. It is pleasing to note that two of these — by Hill et al (Oxford) and by Delaitre et al (Westminster) — have been chosen as Distinguished Papers for EuroPar'98. The first paper, "Process Migration and Fault Tolerance of BSPlib Programs Running on a Network of Workstations" by Hill, Donaldson and Lanfear, builds on the achievements of the BSP (Bulk Synchronous Parallelism) model over recent years. The PSTPA project has been developing a programming and run-time environment for BSP to extend the practical methodology. Hill et al describe a checkpointing technique that enables a parallel BSP job to adapt itself continually to run on the least loaded machines in the network, and present experimental results for a network of workstations. Distributed techniques for determining global load are also addressed. It is seen that the 'superstep' structure of BSP provides natural times, when communication is quiescent, for taking checkpoints.
The authors conclude that fault tolerance and process migration are achievable in a transparent way for BSP programs. The next paper, "A Parallel-System Design Toolset for Vision and Image Processing" by Fleury, Sarvan, Downton and Clarke, is targeted at real-time, data-dominated systems such as those typically found in embedded systems, in vision, and in signal and image processing. Many systems in these areas are naturally built as software pipelines, each stage of which can contain internal parallelism. The authors identify a generic form of pipelines of processor farms (PPF) as of wide interest. The paper summarises the PPF methodology and goes on to describe the development of a toolset for the construction, analysis, and tuning of pipeline farms. Performance prediction for the tuning phase is by simulation, using a simulation visualizer, augmented by analytic prediction for some known standard distributions. It is seen that a key benefit is that pipeline farms may be crafted quantitatively for desired performance characteristics (minimum latency, maximum throughput, minimum cost, etc). Image processing is also the target domain for "Achieving Portability and Efficiency through Automatic Optimisation: an Investigation in Parallel Image Processing" by Crookes et al. They present the EPIC model, a set of portable programming abstractions for image processing, and outline an automatically optimising implementation. This uses a coprocessor structure and an optimiser which extends the coprocessor's instruction set for compositions of built-in instructions as they are used, on a program by program basis. The technique removes most forms of inefficiency that otherwise arise from a straightforward abstract-machine based implementation. The paper concludes that the EPIC model is a portable software platform for its domain and that the implementation demonstrates portability without loss of efficiency for this domain. The design and construction of a performance-oriented environment is addressed in "EDPEPPS: A Toolset for the Design and Performance Evaluation of Parallel Applications" by Delaitre et al. Their environment is designed for a rapid prototyping approach based on a design-simulate-analyse cycle, including graphical tools for design, simulation, and performance visualisation and prediction. The simulator has been developed and refined over several projects, with results in the paper showing good comparisons against measured execution times. The present toolset is built around PVM; however, the modular structure of the simulation model makes it reconfigurable. Databases are an area in which parallel processing is already being exploited but which remains in need of performance-oriented tools for accurate modelling and prediction. In "Verifying a Performance Estimator for Parallel DBMSs", Dempster et al address issues in predicting the performance characteristics of parallel database systems, verifying the results by both simulation and process algebra. There is particular application to topics such as application sizing, capacity planning, and data placement. The paper describes a performance estimation tool, STEADY, designed to predict maximum transaction throughput, resource utilisation, and response times. Close accuracy is verified for predictions for
throughput and resource utilisation. Predictions for response times are less accurate as yet, particularly as the transaction workload increases. Finally, "Generating Parallel Applications of Spatial Interaction Models" by Davy and Essah characterises application generators as a high-level approach to generating programs, typically by specifying an input model and parameters, usually for a particular application domain. Davy and Essah point to a wide range of applications of Spatial Interaction Models (SIM) in the social sciences and describe opportunities for parallelism in the model evaluation and in the application. Practical experience of building an application generator for SIM applications is presented, with results showing that the run-time overheads (compared to hand-coded versions) are less than 2% for a variety of model types and numbers of processors. Parallelism is encapsulated within the application generator, hiding the need for knowledge of it from the user, who needs only to specify a SIM application by defining its parameters.
List of Projects. The 15 projects with title, leader, and email contact are as follows:
A Portable Coprocessor Model for Parallel Image Processing. Prof Danny Crookes, Queen's University Belfast; email: [email protected]
Portable Software Tools for the Parallelisation of Computational Mechanics Software. Prof Mark Cross, University of Greenwich; email: [email protected]
An Application Generator for Spatial Interaction Modelling on a Scalable Computing Platform. Prof Peter Dew, University of Leeds; email: [email protected]
Portable Software Components. Dr Peter Dzwig, QMW College London; email: [email protected]
Parallel Software Tools for Embedded Signal Processing Applications. Prof Andy Downton, University of Essex; email: [email protected]
A Distributed Application Generator in the Search/Optimisation Domain. Dr Hugh Glaser, University of Southampton; email: [email protected]
Automatic Generation of Parallel Visualisation Modules. Dr Terry Hewitt, University of Manchester; email: [email protected]
An Integrated Environment for Modelling High-Performance Parallel Database Systems. Prof Tony Hey, University of Southampton; email: [email protected]
A BSP Programming Environment. Prof Bill McColl, University of Oxford; email: [email protected]
Automatic Checking of Message-Passing Programs. Dr Denis Nicole, University of Southampton; email: [email protected]
The Inference of Data Mapping and Scheduling Strategies from Fortran 90 Programs on the Cray T3D. Dr Alex Shafarenko, University of Surrey; email: [email protected]
A Framework for Distributed Application. Prof Philip Treleaven, University College London; email: [email protected]
occam for All. Prof Peter Welch, University of Kent; email: [email protected]
Application Sizing, Capacity Planning and Data Placement for Parallel Database Systems. Prof Howard Williams, Heriot-Watt University; email: [email protected]
An Environment for the Design and Performance Evaluation of Portable Parallel Software. Prof Steve Winter, University of Westminster; email: [email protected]
A Tracing Protocol for Optimizing Data Parallel Irregular Computations
Thomas Brandes 1 and Cécile Germain 2
1 Institute for Algorithms and Scientific Computing (SCAI), GMD, Schloß Birlinghoven, D-53754 St. Augustin, Germany, e-mail: brandes@gmd.de
2 Laboratoire de Recherche en Informatique (LRI-CNRS), Université Paris-Sud, F-91405 Orsay Cedex, France, e-mail: [email protected]
Abstract. High Performance Fortran (HPF) is the de facto standard language for writing data parallel programs. In the case of applications that use indirect addressing on distributed arrays, HPF compilers have limited capabilities for optimizing such codes on distributed memory architectures, especially for optimizing communication and reusing communication schedules across subroutine boundaries. This paper describes a dynamic approach for optimizing unstructured communication in codes with indirect addressing. The basic idea is that runtime data reflecting the communication patterns will be reused if possible. The user has only to specify which data in the program has to be traced for modifications. The experiments and results show the effectiveness of the chosen approach.
1 Introduction
Data parallel languages such as High Performance Fortran [9, 10] have been designed to make efficient and high-level parallel programming possible for distributed memory machines. As calculations in computationally intensive codes mostly occur in loops or in intrinsic calls involving loops, a challenging problem is to handle efficiently loops with indirectly referenced distributed arrays. For irregular parallel loops, complex data structures, called schedules, are required for accessing remote items of distributed arrays and for communication optimization. The schedules cannot be worked out at compile time, but only at run time, when the values of the indirection arrays are known. The code design which first builds the schedule, then uses it to carry out the actual communication and computation, has been coined the inspector/executor scheme; it was pioneered very early [16, 12], and developed in the PARTI [6, 2] and CHAOS [17] libraries. It is the state-of-the-art design for academic compilers [13, 11, 18, 4] as well as commercial ones [15]. The preprocessing associated with the inspector creates a large run-time overhead, because it involves a large amount of computation and all-to-all communication. Hence the inspector/executor scheme targets irregular loops which
exhibit temporal locality, where the schedule can be reused. The schedule reusing problem can be addressed at compile time, through data-flow analysis [6, 8]. The main drawbacks of this approach are the limits of inter-procedural analysis and the wall of separate compilation. More precisely, real codes with indirect addressing often require inter-procedural analysis [1, 7, 5]. At the other extreme, language support makes the concept of a schedule visible to the end-user. The HPF+ project [3] proposes gather and scatter clauses referring to explicit schedule variables, or a reuse clause for independent loops. The user is responsible for tracking the modifications, but through high-level conditional constructs. Reusing schedules across procedure calls relies on the Fortran 90 save attribute for the schedule variable. The seminal paper [14] proposes a schedule reusing method based on time-stamps. Each modification of an array is mirrored by an update of the time-stamp, and the time-stamps are recorded at inspector invocation. A schedule may be reused if all relevant arrays have the same time-stamps at the point of reuse as the recorded ones. As no directive is provided, all array modifications have to be tracked. Moreover, array descriptors are shared among all arrays having the same structure parameters, so that an indirection array may be considered modified without necessity. In this paper, we propose a new protocol to handle temporal locality. It has been implemented and evaluated in the framework of the ADAPTOR HPF compilation system [4], which already supported the inspector/executor scheme. The protocol is called URB, because the run-time system can either Use, Refresh or Build a schedule for each parallel loop. With some help from the user, URB is able to track changes in the irregular loops exactly: a new schedule is built only when either an irregular parallel loop is structurally different from the previously executed ones, or when the indirection arrays have been modified. The structural information includes the distribution, shape and size of the arrays, and is completely handled by the compiler and the run-time system. The intervention from the user is limited to providing an attribute for each indirection array, through a directive. This attribute indicates that all modifications of this array have to be recorded at run time, and allows procedure calls to be handled without inter-procedural analysis. Finally, the tracking scheme is conservative in the sense that, when tracing directives are missing, the generated code simply rebuilds a new schedule for each execution of an irregular loop. Hence, codes using the trace directive and other codes (e.g. library routines) can safely be mixed. The rest of the paper is organized as follows. Section 2 describes the realization of indirect addressing of distributed arrays via the inspector/executor scheme. Section 3 introduces the protocol that is needed for tracking indirection arrays and shows how it can be realized. Section 4 presents performance results for kernels and a real application, and we conclude in Section 5.
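To make the setting concrete, the following fragment is a minimal, hypothetical HPF example (ours, not taken from the paper): the indirection array edge is not modified inside the time loop, so the communication schedule built for the gather in the first iteration could in principle be reused in all later ones.

  subroutine relax(x, y, edge, n, nsteps)
    integer, intent(in)    :: n, nsteps
    real,    intent(inout) :: x(n), y(n)
    integer, intent(in)    :: edge(n)      ! indirection array, values assumed in 1..n
  !HPF$ DISTRIBUTE (BLOCK) :: x, y, edge
    integer :: step, i
    do step = 1, nsteps
       ! irregular gather: x(edge(i)) may live on a remote processor, so a
       ! communication schedule is needed; edge is loop-invariant here
       forall (i = 1:n) y(i) = y(i) + x(edge(i))
    end do
  end subroutine relax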
2 Unstructured Communication
In the following, we introduce a formal definition of the communication schedule that is built up during the inspector phase. We discuss it for a typical gather operation. The array T specifies the home of the iteration where the data is needed:

forall (I \in I_T, M(I))   T(I) = A(L(I))
Such an indirect addressing will be noted $\Sigma_{T,A,L,M} = \{ I_T, d_T, I_A, d_A, L, M \}$, the inspector info, where $I_T$ and $I_A$ specify the index spaces of the arrays T and A, and $d_T$ and $d_A$ their distributions onto the available processors P. M is a predicate defining the subset $S_T$ that is the iteration space of the parallel loop. L stands for the indirect addressing of the array A. All indices and arrays can be multi-dimensional ($I = I_1, \ldots, I_n$), and L stands for a set of integer arrays $L_1, \ldots, L_m$:

$d_T : I_T \rightarrow P$,  $d_A : I_A \rightarrow P$
$M : I_T \rightarrow \{\mathrm{true}, \mathrm{false}\}$,  $L : S_T (\subset I_T) \rightarrow I_A$,  $I \mapsto L(I)$
$S_T := \{ I \in I_T \mid M(I) = \mathrm{true} \} \subseteq I_T$
We assume that the function L is specified by an integer array or by a set of integer arrays $L_j$ and that they are aligned with the array T. In other words, L(I) is available on the processor that owns T(I). Also the predicate M, if available as a logical array, will not involve any communication. For every processor pair (p, q), p ≠ q, the set $C_{p \leftarrow q}$ specifies which elements of $I_A$ processor p needs from processor q:
$C_{p \leftarrow q} := \{ (I, L(I)) \mid I \in S_T,\ d_T(I) = p,\ d_A(L(I)) = q \}$

$C_{p \leftarrow q} \subseteq d_T^{-1}(p) \times d_A^{-1}(q)$, so for every $(I, K) \in C_{p \leftarrow q}$, I is a local index of T on processor p and K is a local index of A on processor q. Data movement from processor q to processor p will be necessary if $C_{p \leftarrow q}$ is not empty. In the following, the collection of sets $C_{p \leftarrow q}$, $\mathcal{S}_{T,A,L,M} = \{ C_{p \leftarrow q} \}$, is called the communication schedule. Since L is an arbitrary, data-dependent mapping, the computation cannot be done by closed formulas, as it might be in the case of structured communication.

  for q in P : C_{p<-q} := {} end for
  for I in d_T^{-1}(p) ∩ S_T
      q := d_A(L(I))
      C_{p<-q} := C_{p<-q} + {(I, L(I))}
  end for

  for q in P, p ≠ q
      send C_{p<-q} to processor q
      recv C_{q<-p} from processor q
  end for

Fig. 1. Schedule building algorithms (left: local construction of the sets C_{p<-q}; right: index exchange)
  Gather (left part of Fig. 2):
  for q in P, p ≠ q, C_{q<-p} ≠ {}
      compute V_{q<-p} (local on p)
      send V_{q<-p} to processor q
  end for
  for q in P, p ≠ q, C_{p<-q} ≠ {}
      recv V_{p<-q} from processor q
      for (I, v) in V_{p<-q} set T(I) = v
  end for

  Scatter (right part of Fig. 2):
  for q in P, p ≠ q, C_{p<-q} ≠ {}
      compute W_{p<-q} (local on p)
      send W_{p<-q} to processor q
  end for
  for q in P, p ≠ q, C_{q<-p} ≠ {}
      recv W_{q<-p} from processor q
      for (v, K) in W_{q<-p} set A(K) = v
  end for

Fig. 2. Executor algorithms

Each processor p has to compute the sets C_{p<-q} for all processors q, as shown in Fig. 1, left part. By an all-to-all communication, the processors exchange the information about which data they need from each other (index exchange, Fig. 1, right part). This is necessary in order to make sure that the other processors know which values they have to send. The inspector info together with its schedule constitutes the inspector data. In the executor phase, the processors use the communication schedule of the inspector data to carry out the communication for gathering or scattering data. For this purpose the values of T and A, belonging to a domain X, are transferred:
$val_A : I_A \rightarrow X$,  $V_{p \leftarrow q} := \{ (I, val_A(K)) \mid (I, K) \in C_{p \leftarrow q} \}$
Figure 2 (left part) shows the algorithm which is executed on each processor p to gather the values of the distributed array A. For a scatter operation, forall (I \in I_T, M(I)) A(L(I)) = T(I), the communication schedule can be used in the same way to realize the necessary communication (Fig. 2, right part):
$val_T : I_T \rightarrow X$,  $W_{p \leftarrow q} := \{ (val_T(I), K) \mid (I, K) \in C_{p \leftarrow q} \}$

3 The URB Protocol

3.1 Dynamic Reuse of Communication Schedules
Let $\Sigma_{T,A,L,M} = \{ I_T, d_T, I_A, d_A, L, M \}$ be an indirect addressing for which a communication schedule $\mathcal{S}_{T,A,L,M}$ has been computed. This communication schedule can be reused if the sets $C_{p \leftarrow q}$ have not changed. This is the case if $I_T$, $I_A$, $d_T$, $d_A$, L and M have not changed. This model points out that the inspector does not depend on the actual values of the arrays T and A, but only on their shape and distribution. Thus, an inspector has the potential to be reused, for instance, for different sections of the same array, or for aligned arrays. Our URB protocol is based on run-time analysis. With this dynamic approach, it is decided at run time whether a communication schedule can be reused or not. Therefore, inspectors are kept in a database. Two problems must be addressed: first, tracing the modifications of the indirection and mask arrays, and second, avoiding the explosion of the database while keeping useful information.
The penalty of this approach is that modifications of the indirection arrays must be traced. In the presence of a mask, this is also necessary for the corresponding arrays within the mask.
3.2 Principle of Dynamic Tracing
The previous formulation shows the inspector database management as an instance of a cache management problem. The URB protocol is based on this analogy. To allow tracing, the descriptor of an indirection array L includes a dirty flag. Each write to L sets its dirty flag. The schedule can thus be reused if and only if the dirty flag of L is clear. The approach has to guarantee that every update or every possible update of L will set its dirty flag. Furthermore, if the dirty flag of an array L is set, all inspectors in which this array is involved become invalid. The entries for these inspectors can be freed, avoiding the explosion of the database. In fact, the previous scheme fails if an indirection array Lj is involved in more than one indirect assignment. The reason is simply that the flag of Lj may be shared by many inspectors, but that its status with respect to these inspectors should not be shared. Hence, the descriptor of a traced array actually includes not only one dirty flag, but a dirty mask, with a flag for each possible inspector. Each write to the array sets all flags of the dirty mask, while each inspector update clears only the corresponding flag in the dirty mask. The overall run-time algorithm for building or retrieving schedules is sketched in Fig. 3. Even when the schedule is reusable, the actual schedule may need refreshing. This means that all the hard work of computing ownership and local addresses is done, but that a translation term from the base address has to be taken into account. The inspector data is then updated with the refreshed schedule.
  Schedule GetSchedule (InspectorInfo I)
  begin
      var InspectorData : D        /* InspectorData = InspectorInfo + Schedule */
      D := DBSearch (I)
      L := IndirectionArrayDescriptor (I)
      if (D == NULL) or DirtyMask (L, D) then
          DBFree (D)
          D := InspectorBuild (I)
          DirtyMaskClear (L, D)
      elseif not EqualBaseAddress (D, I) then
          ScheduleRefresh (D, I)
      endif
      return (Schedule (D))
  end

Fig. 3. The URB protocol
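As an assumed illustration (ours, with made-up names) of why a single dirty flag per array is not enough, consider two irregular assignments that share the same indirection array:

  subroutine two_gathers(a, b, x, y, col, n)
    integer, intent(in)  :: n
    real,    intent(in)  :: a(n), b(n)
    real,    intent(out) :: x(n), y(n)
    integer, intent(in)  :: col(n)          ! shared indirection array, values in 1..n
  !HPF$ DISTRIBUTE (BLOCK) :: a, b, x, y, col
    integer :: i
    forall (i = 1:n) x(i) = a(col(i))       ! inspector/schedule S1
    forall (i = 1:n) y(i) = b(col(i))       ! inspector/schedule S2
    ! With one shared flag, rebuilding S1 after a write to col would clear the
    ! flag, and S2 could then be taken as valid although it is stale; the dirty
    ! mask gives each inspector its own bit, cleared only by its own rebuild.
  end subroutine two_gathers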
3.3 Tracing of Arrays
Tracing clearly belongs to the compiler, which has to insert the code setting the dirty flags as a side effect of a write to certain arrays. The requirement for tracing is described by the user directive !ADP$ TRACE. This directive has Fortran attribute semantics and follows the Fortran 90 rules of scoping. By default, arrays are not traced. To ensure correctness, this case must lead to the conservative scheme where the schedule is computed at each access. If at least one indirection array is not traced, no entry in the database is created. The schedule is built, used and destroyed immediately after its use. An array must be flagged dirty when it is changed. The syntactic constructs that can modify an array are limited and can be analyzed at compile time: assignments with this array as a left-hand side, redistributions, allocations and deallocations. Thus, the trace attribute is compatible with the DYNAMIC and ALLOCATABLE attributes, but not with the POINTER or TARGET attribute, because the compiler would not be able to record modifications of such arrays.
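A hedged sketch of how the directive might be used follows; the attribute-style spelling "!ADP$ TRACE :: col" is our assumption based on the description above, not a syntax taken from the paper.

  subroutine sweep(a, b, col, n, k)
    integer, intent(in)    :: n, k          ! assume 1 <= k <= n
    real,    intent(in)    :: a(n)
    real,    intent(out)   :: b(n)
    integer, intent(inout) :: col(n)        ! indirection array, values in 1..n
  !HPF$ DISTRIBUTE (BLOCK) :: a, b, col
  !ADP$ TRACE :: col                        ! ask the compiler to trace writes to col
    integer :: i
    forall (i = 1:n) b(i) = a(col(i))       ! first access: schedule built, dirty bit cleared
    forall (i = 1:n) b(i) = a(col(i))       ! col unchanged: the schedule is reused
    col(k) = col(1)                         ! an assignment to col sets its dirty flags
    forall (i = 1:n) b(i) = a(col(i))       ! the next access rebuilds the schedule
  end subroutine sweep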
3.4 Procedure Handling
We want the URB protocol to be compatible with the general HPF framework, which does not include the tracing directive at the present time. Thus, all combinations of traced and not traced for actual and dummy arguments must be considered. The run-time system copies information from the descriptor of the actual argument into the descriptor of the dummy argument at the entry of the subprogram, and vice versa when leaving the subprogram. Four cases must be considered, depending on whether the actual and the dummy arguments have or do not have the trace attribute. If neither has it, we are done. If the actual and the dummy arguments both have it, the modifications of the array will be traced across the subroutine boundaries, by copying the dirty mask forth and back. If the actual does not have the attribute and the dummy has it, the modifications of the dirty mask of the dummy inside the procedure have no impact on the actual: reusing of schedules where this array is involved is only possible during the lifetime of this subroutine call. Finally, procedures without an explicit or implicit interface, or with an explicit or implicit interface that does not give the trace attribute to an output dummy argument, are assumed to modify the actual argument. Thus, on the calling side, the compiler will set the dirty flag of the corresponding array when the procedure returns. With this scheme, the case of library subroutines that modify an indirection array and that are not aware of the trace attribute is correctly handled.
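For the case where both the actual and the dummy argument carry the trace attribute, a minimal sketch (our own, assuming the same attribute-style directive syntax as above) could look as follows:

  module gather_mod
  contains
    subroutine gather_step(a, b, edge, n)
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n)
      real,    intent(out) :: b(n)
      integer, intent(in)  :: edge(n)       ! traced dummy argument
  !HPF$ DISTRIBUTE (BLOCK) :: a, b, edge
  !ADP$ TRACE :: edge
      integer :: i
      forall (i = 1:n) b(i) = a(edge(i))    ! schedule can survive repeated calls
    end subroutine gather_step
  end module gather_mod

If the caller does not declare the trace attribute for the actual argument, or calls the routine without an explicit interface, the conservative rule described above applies and the array is simply flagged dirty on return.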
4 Performance Results
All experiments were performed on an IBM SP2 equipped with P2SC nodes and a maximal bandwidth of 80 MB/s per node. We used the user space protocol and the hardware timer of the Power2 architecture. Measurements are averages.
4.1 Basic Performance
We studied the basic performance of the URB protocol for a typical gather operation, A(I) = B(L(I)), in order to make evident the various components of the inspector/executor scheme. Three schemes of indirect addressing L have been considered: identity (i), a cyclic shift (c), and uniform random values (r). The random and shift distributions stand here for two typical degrees of irregularity, heavy and light. Identity is useful to measure pure overhead. It also describes data-dependent locality, where no communication is needed because the indirection array operates inside disjoint index subsets. Domain decomposition methods are a typical example of this situation. First, a breakdown of the various components of the inspector and of the executor, for the random scheme, on an 8-processor configuration and a 4K array of reals (Table 1), shows that the only potential overhead of the URB protocol, namely searching the database, is negligible. It also highlights the cost of the inspector step.

Table 1. Timing of unstructured communication (seconds, 8 processors).

             inspector step   executor step
  no reuse   0.16 sec         0.07 sec
  reuse      0.25e-4 sec      0.07 sec

[Fig. 4 plot: one-way time (seconds, logarithmic scale) versus array size (2K to 1M real elements) for the identity, shift and random schemes, with and without schedule reuse; see the caption below.]
Fig. 4. Performance effects of schedule reusing: 16 processors. Curves labeled ie are the non-reuse case, and curves labeled ex are the reuse case. Figure 4 presents a detailed comparison of timings with and without schedule reuse, on a 16 processors configuration. The random scheme displays a nearly
linear difference (constant speedup) between the reuse and non-reuse cases, because of the index exchange, which in this case has exactly the same volume as the data movement. The identity scheme, in the reuse case, describes the minimal cost of an indirect operation, which includes reading the schedule and the local copy between the user arrays. The shift scheme, in the reuse case, has only the additional overhead of each processor sending one 4-byte message. This overhead is very significant for small arrays, but becomes negligible for large arrays, where the performance converges with that of the identity case. The shift and identity schemes have very similar performance in the non-reuse case, and display the overhead of building the schedule (ownership and offset computations).
4.2 Performance Results for a Real Application
The AEROLOG computational fluid dynamics software developed by MATRA is devoted to the study of compressible fluid flows around complex geometries. The kernel of the numerical solver (called CLASS, for Conservative LAws Systematic Solver) is common to all physical models and is based on a finite volume scheme that is second-order accurate in time and space. Within the European Esprit project PHAROS, this code has been ported to HPF [5]. The HPF parallelization strategy is based on the coarse-grain parallelism given by the decomposition of the mesh data into sub-domains. In one time step, the time increments for the physical variables are computed. The routines called within one time step can be sorted into two groups, the local routines and the boundary routines. The local routines are called independently over sub-domains. This group typically represents up to 90% of the total CPU time of the serial code. The local computations scale well, as no communication is necessary. The boundary routines perform the boundary conditions. These routines make intensive use of indirect addressing and involve dependences between data belonging to different sub-domains.

Table 2. Execution times in seconds (304200 mesh points, 20 iterations).
                          P=1      P=2      P=4      P=8      P=16
  local                   190.46    95.84    46.68    23.43    12.86
  boundary, replicated     11.24    35.99    38.53    42.97    52.19
  boundary, compress       12.03    11.49    11.56    13.24    14.05
  boundary, notrace       165.65    96.88    52.43    28.66    21.12
  boundary, trace          34.42    18.72    12.16     7.73     5.88
In the initial HPF version, the whole mesh data involved in the boundary computations, as well as the boundary computations themselves, are replicated on all processors. This avoids unstructured communication, but the replication is very expensive. In the tuned HPF version, the mesh data is packed before the replication. This reduces the communication volume, but the scalability is still limited.
The third version distributes the mesh data and thus requires unstructured communication. Performance results are very poor if the communication schedule is not reused (notrace). Only the reuse of communication schedules (trace) gives acceptable and scalable performance results. With ADAPTOR, the user obtains this performance gain simply by inserting some TRACE directives in the HPF code. Table 2 shows the performance results for a medium industrial test case with 16 sub-domains and 304200 total mesh points, executing 20 iterations.
5 Conclusions
In this paper, we have presented a strategy to effectively implement an inspector/executor paradigm where communication schedules can be reused. It requires no data-flow or inter-procedural analysis at compile time. It will also reuse communication schedules in situations where even the best compile-time analysis might fail, e.g. if the indirection arrays are modified under a mask. Although the method described here requires a new language construct, the TRACE directive, the user does not have to track the multiple execution paths. This should be the task of the compiler and of the run-time system. The realization of our concept in the ADAPTOR HPF compilation system showed that it can be easily integrated in an existing HPF compiler that supports the inspector/executor scheme. The compiler itself has only to deal with the new TRACE attribute. The runtime system requires support for reusing communication schedules and the implementation of a database for these schedules. For the tracing of arrays with the TRACE attribute, the array descriptor has to be extended by a dirty mask. Tracing can be tracked across subroutine boundaries if an array or a section of an array is passed via such a descriptor. This is generally the case for ADAPTOR, but tracking fails if the HPF compiler passes an array only by a pointer. The work described here demonstrates two new ideas. The first is that a schedule can be associated with more than one parallel indirect assignment, allowing aggressive optimization even in dynamic applications. The second is the analogy of schedule reusing with cache management. We believe that the ideas of the URB protocol might also be useful for other optimizations in an HPF compiler, especially in the context of indirect and dynamic distributions, and for the efficient support of DOACROSS loops with irregular dependencies. In particular, the fact that the URB protocol is very efficient on the identity case has an important application: the HPF 2.0 Approved Extensions deal with data-dependent locality through the coupling of the ON HOME and RESIDENT directives. Schedule reusing is simpler, for the user and the compiler, and more versatile, because the same code can be kept unmodified whether data-dependent locality exists or not. The improvement of the current strategy as well as the investigation of benefits for other optimizations will be part of our future work.
Acknowledgments We acknowledge the CRI and the CNUSC for access to their IBM SP2.
References
1. G. Agrawal and J. Saltz. Interprocedural communication optimizations for distributed memory compilation. In Languages and Compilers for Parallel Computing, pages 1-16, Aug. 1994.
2. G. Agrawal, A. Sussman, and J. Saltz. Compiler and run-time support for structured and block-structured applications. In Supercomputing, pages 578-587. IEEE, 1993.
3. S. Benkner and H. Zima. Definition of HPF+ Rel. 2. Technical report, HPF+ Consortium, 1997.
4. T. Brandes and F. Zimmermann. ADAPTOR - A Transformation Tool for HPF Programs. In K. Decker and R. Rehmann, editors, Programming Environments for Massively Parallel Distributed Systems, pages 91-96. Birkhäuser Verlag, Apr. 1994.
5. T. Brandes, F. Zimmermann, C. Borel, and M. Brédif. Evaluation of High Performance Fortran for an Industrial Computational Fluid Dynamics Code. In Proceedings of VECPAR 98, Porto, Portugal, June 1998. Accepted for publication.
6. R. Das et al. Communication optimization for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, (22):462-478, 1994.
7. C. Germain, J. Laminie, M. Pallud, and D. Etiemble. An HPF Case Study of a Domain-Decomposition Based Irregular Application. In PACT'97. LNCS, Sep. 1997.
8. M. Gupta, E. Schonberg, and H. Srinivasan. A unified data-flow framework for optimizing communication in data-parallel programs. IEEE Trans. on Parallel and Distributed Systems, 7(7):689-704, 1996.
9. High Performance Fortran Forum. High Performance Fortran Language Specification. Rice Univ., Nov. 1994. Version 1.1.
10. High Performance Fortran Forum. High Performance Fortran Language Specification, Oct. 1997. Version 2.0.
11. S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. CACM, 35(8):66-80, Aug. 1992.
12. S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12:415-422, 1991.
13. C. Koelbel, P. Mehrotra, and J. V. Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd Symp. on Principles and Practice of Parallel Programming, pages 177-186. ACM, 1990.
14. R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In ACM Int. Conf. on Supercomputing, pages 361-370, 1993.
15. Portland Group, Inc. PGHPF User's Guide. Manual Release 2.2, PGI, 1997.
16. R. Mirchandaney et al. Principles of run-time support for parallel processing. In ACM Int. Conf. on Supercomputing, pages 140-152, 1988.
17. S. D. Sharma et al. Run-time and Compile-time Support for Adaptive Irregular Problems. In Supercomputing'94, pages 99-106. IEEE, 1994.
18. H. Zima and B. Chapman. Compiling for distributed-memory systems. Proceedings of the IEEE, Feb. 1993.
Contribution to Better Handling of Irregular Problems in HPF2
Thomas Brandes 1, Frédéric Brégier 2, Marie Christine Counilh 2, Jean Roman 2
1 GMD/SCAI, Institute for Algorithms and Scientific Computing, German National Research Center for Computer Science, Schloss Birlinghoven, PO Box 1379, 53754 St. Augustin, Germany
2 LaBRI, ENSERB and Université Bordeaux I, 33405 Talence Cedex, France
Abstract. In this paper, we present our contribution for handling irregular applications with HPF2. We propose a programming style of irregular applications close to the regular case, so that both compile-time and run-time techniques can be more easily performed. We use the well-known tree data structure to represent irregular data structures with hierarchical access, such as sparse matrices. This algorithmic representation avoids the indirections coming from the standard irregular programming style. We use derived data types of Fortran 90 to define trees and some approved extensions of HPF2 for their mapping. We also propose a run-time support for irregular applications with loop-carried dependencies that cannot be determined at compile time. Then, we present the TriDenT library, which supports distributed trees and provides run-time optimizations based on the inspector/executor paradigm. Finally, we validate our contribution with experimental results on IBM SP2 for a sparse Cholesky factorization algorithm.
Introduction
High Performance Fortran ( H P F ) [i0]is the current standard language for writing data parallel programs for shared and distributed memory parallel architectures. H P F is well suited for regular applications; however, many scientific and engineering applications (fluid dynamics, structural m e c h a n i c s , . . . ) use irregular d a t a structures such as sparse matrices. Irregular data need a special representation in order to save storage requirements and computation time, and the d a t a accesses use indirect addressing through pointers stored in index arrays (see for example SPARSKIT [14]). In these applications, patterns of computations are generally irregular and special data distributions are required to both provide high data locality for each processor and good load balancing. Then, compiletime analysis is not sufficient to determine data dependencies, location of d a t a and communication patterns; indeed, they are only known at run-time because they depend on the input data. Therefore, it is currently difficult to achieve good performance with these irregular applications. The second version of the language, H P F 2 [10], introduces new features for efficient irregular data distributions and for d a t a locMity specification in irregular computations. To deal with the lack of compile-time information on irregular
640 codes, run-time compilation techniques based on the inspector/executor method have been proposed and widely used. Major works include PARTI [16] and CHAOS [12] libraries used in Vienna Fortran and Fortran90D [13] compilers, and PILAR library [11] used in PARADIGM compiler [2]. They are efficient for solving iterative irregular problems in which communication and computation phases alternate. RAPID [9] is another runtime system based on a computation specification library for specifying irregular data objects and tasks manipulating them. It is not limited to iterative computations and can address algorithms with loop-carried dependencies. In order to reduce the overhead due to these run-time techniques, the technique called sparse-array rolling (SAR) [15] exploits compile-time information about the representation of distributed sparse matrices for optimizing d a t a accesses. In the Vienna Fortran Compilation System [6], a new directive (SPARSE directive) [15] specifies the sparse matrix structure used in the program to improve the performance of existing sparse codes. Bik and Wijshoff [3] propose another approach based on a "sparse compiler" that converts a code operating on dense matrices and annoted with sparsity related information into an equivalent sparse code. Our approach consists of combining compile-time analysis and run-time techniques, but in a way that differs from the previously cited works. We propose a programming style of irregular applications close to the regular case so that both compile-time and run-time techniques can be more easily performed. In order to do so, we use the well-known tree data structure to represent sparse matrices and, more generally, irregular data structures with a hierarchical access. This Mgorithmic representation masks the implementation structure and avoids the indirections coming from the standard irregular programming style. Like in the SAR approach [15], the compiler knows more information about the distributed data, but unlike this approach, information naturally result from the representation of the irregular data structure itself. Our approach is not restricted to sparse matrices but it necessitates a coding of irregular applications using our tree structure. In order to increase portability, we use Derived Data Types (DDT) of Fortran 90 to define trees and hence irregular d a t a structures, and we use some approved extensions of HPF2 for their mapping. In all the following of this paper, we will refer to our language proposition as H P F 2 / T r e e . We Mso propose a run-time support for irregular applications with loop-carried dependencies that cannot be determined at compile-time. Each iteration is performed on a subset of processors only known at run-time. This support is based on an inspector/executor scheme which builds the processor sets, the loop indices for each processor and generates the necessary communications. So, the first step of our work has been to develop a library, called T r i D e n T , which supports distributed trees and implements the above-mentioned optimizations. The performance of this library have been demonstrated on irregular applications such as sparse Cholesky factorization. The following step of this work is to provide an HPF2/Tree compiler; we are currently working on the integration of the TriDenT library in the A D A P T O R compilation platform [5].
This paper (see [4] for a more detailed version) is organized as follows. In Section 2, we give a short background on a sparse column Cholesky factorization algorithm which will be our illustrative and experimental example for the validation of our approach. Section 3 presents HPF2/Tree and Section 4 describes the TriDenT library. Section 5 gives an experimental study on an IBM SP2 for sparse Cholesky factorization applied to real-size problems, and provides an analysis of the performance. Finally, Section 6 gives some perspectives of this work.

2 An Illustrative Example
The Cholesky factorization of sparse symmetric positive definite matrices is an extremely important computation arising in many scientific and engineering applications. However, this factorization step is quite time-consuming and is frequently the computational bottleneck in these applications. Consequently, it is a particularly interesting example for our study. The goal is to factor a sparse symmetric positive definite n × n matrix A into the form A = LL^T, with L lower triangular. Two steps are typically performed for this computation. First, we perform a symbolic factorization to compute the non-zero structure of L from the ordering of the unknowns in A; this ordering, for example using a nested dissection strategy, must reduce the fill-in and increase the parallelism in the computations. This (irregular) data structure is allocated and its initial non-zero coefficients are those of A. In the pseudo-code given below, this step is implicitly contained in the instruction L = A. From this symbolic factorization, one can deduce for each k the two following sets, defined as the sparsity structure of row k and the sparsity structure of column k of L:

   Struct(Lk.) = {j < k such that lkj ≠ 0}
   Struct(L.k) = {i > k such that lik ≠ 0}.

Second, the numerical factorization computes the non-zero coefficients of L in the data structure. This step, which is the most time-consuming, can be performed by the following sparse column-Cholesky factorization algorithm (see for example [7, 1] and included references).

   1. L = A
   2. for k = 1 to n do
   3.   for j in Struct(Lk.) do
   4.     for i in Struct(L.j), i >= k do      ! cmod(k,j,i)
   5.       lik = lik - lij * lkj
   6.   lkk = sqrt(lkk)                        ! cdiv(k)
   7.   for i in Struct(L.k) do
   8.     lik = lik / lkk

where cmod(k, j, i) represents the modification of the rows of column k by the corresponding terms in column j, and cdiv(k) represents the division of column k by the scalar sqrt(lkk). This algorithm is said to be a left-looking algorithm, since at each stage it accesses needed columns to the left of the current column in the matrix. It is also referred to as a fan-in algorithm, since the basic operation is to combine the effects of multiple previous columns on a single subsequent column.
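To fix the arithmetic of cmod and cdiv, the same computation can be written as a dense left-looking loop nest; the following sketch is ours (it ignores sparsity entirely) and assumes the lower triangle of L initially holds that of A.

      DO K = 1, N
         DO J = 1, K-1                        ! cmod(K,J): update column K by column J
            DO I = K, N
               L(I,K) = L(I,K) - L(I,J) * L(K,J)
            END DO
         END DO
         L(K,K) = SQRT(L(K,K))                ! cdiv(K): scale column K
         DO I = K+1, N
            L(I,K) = L(I,K) / L(K,K)
         END DO
      END DO

The sparse algorithm performs exactly these updates, restricted to the non-zero entries recorded in the Struct sets.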
If we consider now the parallelism induced by sparsity and achieved by distributing the columns on processors (thereby exploiting the parallelism of the outer loop of instruction 2), we can see that a given column k depends only on the columns belonging to Struct(Lk.); so, we have loop-carried dependencies in this algorithm. In order to take advantage of this structural parallelism, we use an irregular distribution called subtree-to-subcube mapping, which leads to an efficient reduction of communication while keeping a good load balance between processors; this mapping is computed algorithmically from the sparse data structure for L (see for example [7] and included references).
3 HPF2/Tree
3.1 Representation of Irregular Data Structures with Trees
In this paper, we consider irregular data structures with a hierarchical access and we propose the use of a tree data structure for their representation. For example, the sparse matrix used in the previous column-oriented Cholesky algorithm must be a data structure with a hierarchical column access, and the user can represent it by a tree with three levels numbered from 0 to 2. The k-th node on level 1 represents the k-th column. Its p-th son represents the p-th non-zero element, defined by its value and row number. Fig. 1 gives a sparse matrix and its corresponding tree. In order to distinguish the levels, we number the columns with roman numbers and the rows with arabic numbers.
[Fig. 1 (caption below) shows a small sparse matrix and the three-level tree that represents it: the root at level 0, one node per column at level 1, and one node per non-zero element at level 2. The corresponding declaration is:]

      type level2
        integer ROW    ! row number
        real    VAL    ! non zero value
      end type level2

      type level1
        type (level2), allocatable :: COL(:)
      end type level1

      type (level1), allocatable :: A(:)
!ADP$ TREE A
Fig. 1. A Sparse Matrix and its HPF2/Tree Declaration.
3.2 Tree Representation and TREE Directive in HPF2
Trees are embedded in the HPF2 programming language by using the Derived Data Types (DDT) of Fortran 90. For every level of a tree (except level 0), a DDT must be defined. It contains all the data type definitions for this level and (except for the level with the greatest number) the declaration of an array variable whose type is the one associated with the following level. A tree is then declared as an array of the DDT type associated with level 1 (cf. Fig. 1). The TREE directive distinguishes tree variables from other variables using DDTs because these variables are used in a restricted way; in particular, pointer
attributes of Fortran 90 and recursion are not allowed. So, our tree data structures must not be recursively defined and cannot be used, for example, to represent quadtree decompositions of matrices [8]. However, this limitation leads to an efficient implementation of trees by the compiler (cf. 4.1). Accesses to a specific tree element or to a subtree are performed according to the usual Fortran 90 notation, as shown below. The advantage of this programming style is the analogy between the level notion for trees and the classical dimension notion for arrays used in regular applications. Then, the compiler can perform the classical optimizations based on dependence analysis much more easily than with the indirection arrays currently used in irregular applications.
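For instance, with the tree A declared in Fig. 1 (the index names K and P below are ours), element and subtree accesses read:

      A(K)%COL(P)%VAL      ! value of the P-th non-zero element of column K
      A(K)%COL(P)%ROW      ! its row number
      A(K)%COL(:)%ROW      ! the whole sparsity structure of column K (a subtree access)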
3.3 Distribution of Trees
In order to support irregular applications, HPF2/Tree includes the GEN_BLOCK and INDIRECT distribution formats and allows the mapping of the components of DDTs according to the HPF 2.0 approved extensions [10]. The constraints for the mapping of derived type components allow the mapping of structure variables at only one level, the reference level. For the two distributions of tree A given in Fig. 2, the reference level is level 1. In HPF2/Tree, the levels preceding the reference level are replicated, while the levels following it are distributed according to the distribution of this reference level. This distribution implies an implicit alignment of data with the data of the reference level. This data locality is an important advantage of tree distribution, and a compiler can exploit it to generate efficient code. This is clearly more difficult to obtain when a sparse storage format, such as the CSC format [14], is used, because the indirection and data arrays are of different sizes and their distribution formats also differ.
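As an illustration (our own sketch; block_sizes and map_array are hypothetical mapping arrays), the tree A of Fig. 1 could be distributed at its reference level with either format:

!HPF$ PROCESSORS P(3)
!HPF$ DISTRIBUTE A(GEN_BLOCK(block_sizes)) ONTO P   ! contiguous groups of columns per processor
!     -- or, for an arbitrary column-to-processor assignment --
!HPF$ DISTRIBUTE A(INDIRECT(map_array))    ONTO P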
Fig. 2. Tree Distributions with 3 Processors.
3.4 Application to the Sparse Fan-In Cholesky Algorithm
This section describes an HPF2/Tree code (cf. Program 3 (a)) for the sparse fan-in Cholesky algorithm presented in Section 2. This code uses the tree A declared in Fig. 1 and another tree, named B, used to store the dependencies between columns. The first value of B(K)%COL(:)%DEP is K, and the other ones are the column numbers given by Struct(Lk.). In Section 4.3, we will show that these data can be computed during an appropriate inspection phase, so that this tree B can be avoided. The distribution of tree A uses a specific INDIRECT distribution of the nodes on level 1 according to a subtree-to-subcube mapping (cf. Section 2). Then, the tree B is aligned with the tree A using the HPF ALIGN directive. In the outer K loop, the ON directive asserts that the statements at iteration K are performed by the set of processors that own at least one column with a non-zero value in row K. The internal L loop performs a reduction inside this set of processors using the NEW variable TMP_VAL specified in the ON directive. Each iteration L is performed by the processor which owns the column with number J = B(K)%COL(L)%DEP; so, the computation of the contributions for column K is distributed. Moreover, a compiler might identify that this reduction uses an all-to-one scheme, due to the instruction following the loop. The tree notation, which avoids indirection in the code, and the data locality and alignment provided by the distribution of the tree make the compiler able to extract the needed communications outside this internal loop (for the A(K)%COL(:)%ROW and B(K)%COL(:)%DEP variables). This would not be possible using a sparse storage format (such as CSC), due to the indirections.
Version (a): HPF2/Tree

      type Blevel2
        integer DEP    ! Column number (in B(K)) which modifies A(K)
      end type Blevel2
      type Blevel1
        type (Blevel2), allocatable :: COL(:)
      end type Blevel1
      type (Blevel1), allocatable :: B(:)

!ADP$ TREE B
!HPF$ DISTRIBUTE A(INDIRECT(map_array))
!HPF$ ALIGN B WITH A
      DO K = 1, size(A)
!HPF$ ON HOME (A(B(K)%COL(:)%DEP)), NEW(TMP_VAL) BEGIN
        allocate (TMP_VAL(size(A(K)%COL)))
        TMP_VAL(:) = 0.0
!HPF$   INDEPENDENT, REDUCTION (TMP_VAL)
        DO L = 2, size(B(K)%COL)
!HPF$     ON HOME (A(B(K)%COL(L)%DEP)) BEGIN
            J = B(K)%COL(L)%DEP
            CMOD(TMP_VAL(:), A(K)%COL(:)%ROW, A(J)%COL(:)%VAL, A(J)%COL(:)%ROW)
!HPF$     END ON
        END DO
        A(K)%COL(:)%VAL = A(K)%COL(:)%VAL - TMP_VAL(:)
        CDIV(A(K)%COL(:)%VAL)
!HPF$ END ON
      END DO

Version (b): HPF2/Tree + ON HOME Extension

      DO K = 1, size(A)
!HPF$ ON HOME(A(J), J = 1:K, ANY(A(J)%COL(:)%ROW) .EQ. K), NEW(TMP_VAL) BEGIN
        ! Active processor set
        ! Local array for the reduction
        allocate (TMP_VAL(size(A(K)%COL)))
        TMP_VAL(:) = 0.0
!HPF$   INDEPENDENT, REDUCTION (TMP_VAL)
        DO J = 1, K-1
!HPF$     ON HOME(A(J), ANY(A(J)%COL(:)%ROW) .EQ. K) BEGIN
            ! Only for contribution columns
            CMOD(TMP_VAL(:), A(K)%COL(:)%ROW, A(J)%COL(:)%VAL, A(J)%COL(:)%ROW)
!HPF$     END ON
        END DO
        A(K)%COL(:)%VAL = A(K)%COL(:)%VAL - TMP_VAL(:)   ! Implies an all-to-one reduction
        CDIV(A(K)%COL(:)%VAL)
!HPF$ END ON
      END DO

Program 3. Two Versions of the Fan-In Cholesky Algorithm.

However, this example also shows that run-time techniques are required, because the set of active processors specified by the ON directive cannot be known until run-time. In the following section, we describe the TriDenT run-time system, which provides support for trees and for such irregular active processor sets.
4 The TriDenT Library
TriDenT is a library which consists of two parts: a set of routines for the manipulation of trees, and another one for the optimization of computations and communications in irregular processor sets based on the inspector/executor paradigm.
We currently use this library to validate the tree approach and our optimization proposals by writing HPF2 codes with explicit calls to TriDenT primitives. Furthermore, the TriDenT library has been designed to be integrated as easily as possible in the compilation platform ADAPTOR [5] in order to automatically translate HPF2/Tree codes into SPMD codes with calls to TriDenT primitives. ADAPTOR also supports HPF2 features and is based on the DALIB (Distributed Array LIBrary) run-time system for distributed arrays. The following sections present the two components of the TriDenT library.
4.1 Support for Distributed Trees
TriDenT implements trees by using an array representation. The TREE directive allows the compiler to generate one-dimensional arrays for the implementation of the Derived Data Types associated with the tree; this implementation consists of one array for each scalar variable of a DDT and of some pointer arrays (cf. Fig. 4). Hence, TriDenT provides efficient access functions to tree elements and to subtrees; moreover, the array implementation makes it possible to take advantage of potential cache effects.
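For the tree A of Fig. 1, a plausible flattened layout (our reading of Fig. 4; the names NNZ, NCOL and COL_PTR are hypothetical) would be:

      integer ROW(NNZ)          ! one array per scalar field of level 2
      real    VAL(NNZ)
      integer COL_PTR(NCOL+1)   ! pointer array: COL_PTR(K) gives the position in
                                ! ROW/VAL of the first element of column K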
Fig. 4. Indirect Distribution and Local Data for Processor a.

The distribution of trees (cf. 3.3) is performed by using the technique described in [13]. It consists in distributing all the data arrays, for the levels from the reference level onwards, with the GEN_BLOCK distribution format according to the tree distribution format. Then, the processor local data are stored in a contiguous memory space (cf. Fig. 4 for the indirect distribution of Fig. 2).
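Continuing the sketch above (again with hypothetical names), the data arrays of the flattened representation would then be distributed so that each processor's columns occupy a contiguous block:

!HPF$ PROCESSORS P(3)
!HPF$ DISTRIBUTE VAL(GEN_BLOCK(sizes)) ONTO P   ! sizes(Q): number of non-zeros owned by processor Q
!HPF$ DISTRIBUTE ROW(GEN_BLOCK(sizes)) ONTO P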
4.2 Support for Irregular Processor Subsets
TriDenT is a run-time system which integrates the active processor set notion introduced in HPF2 by the ON directive. This directive specifies how computation is partitioned among processors and therefore may affect the efficiency of the computation. TriDenT thus includes context push and pop as well as reduction and broadcast operations for active processor sets. Moreover, if the ON directive includes an indirection (e.g. ON HOME(A(INDIR(1:4)))), the set can only be determined at run-time and so an inspection phase is necessary. TriDenT provides further optimizations for handling an ON directive used inside a global DO loop with loop-carried dependencies which can be determined before this loop. Our main goal is to improve the efficiency of the global computation (inspection and execution), even if the executor is used only once. The inspector consists of three parts.
The Set Inspector analyzes the ON directive of the global loop in order to create in advance the processor subsets associated with each iteration. This avoids, during the execution of the global loop, the synchronizations due to the set-building messages. In order to allow an automatic generation of these irregular sets, we propose to extend the ON directive by specifying the algorithmic properties to be verified by the variables in the HOME clause (cf. Program 3 (b)). The Loop Index Inspector determines the useful loop indices for each processor. The Communication Inspector inspects two kinds of communications: those to extract out of the global loop, and across-iteration communications. Moreover, the executor optimizes the communications by splitting the send and receive operations and integrating them in the code so as to minimize the synchronizations.
4.3 Application to the Sparse Fan-In Cholesky Algorithm
We briefly describe the optimizations achieved by the inspection mechanisms on the HPF2/Tree code for the sparse fan-in Cholesky algorithm given in Program 3 (b). This code uses the HOME clause extension introduced in Section 4.2 and avoids the effective storage of the dependence tree B used in Program 3 (a) (whose size is of the same order of magnitude as A(:)%COL(:)%ROW). The Set Inspector involved by the extended ON directive for the global loop directly generates, from the A(:)%COL(:)%ROW data (and so without the tree B), the processor set for each iteration K (ACTIVE_PROC(K)). In this example, the Loop Index Inspection for the inner loop makes it possible to avoid the scan of A(J)%COL(:)%ROW, and to directly use the valid iteration index sets for each processor. The compiler can identify A(K)%COL(:)%ROW as loop invariant, and can detect that these data must be sent to the ACTIVE_PROC(K) set; then the Communication Inspector can extract the emission outside the global loop and keep the reception inside iterations.
5 Experimental Results
We validate our contribution with experimental results for a sparse fan-in Cholesky factorization on an IBM SP2 with 16 processors. The test matrix is an n × n sparse matrix with n = 65024, obtained from a 2D-grid finite element problem. This matrix is distributed with an INDIRECT mapping according to the subtree-to-subcube distribution (cf. Section 2). We compare the three versions described in Fig. 5. The A version uses the tree B (cf. Program 3 (a)), whereas the B and C versions use the proposed extended HOME clause (cf. Program 3 (b)). In all the following, the global time is the sum of the inspection and execution times. All these versions use the Processor Set Inspection. We have verified the effectiveness of this basic inspection by comparing the execution of A with another version which does not use the processor subsets (35 seconds against 460 seconds for the global time with 16 processors).
  Name | Irregular Set Insp. | Loop Index Insp. | Communication Insp. | Dep. (T)ree or (O)N HOME Ext.
   A   |        Yes          |        No        |         No          |             T
   B   |        Yes          |        Yes       |         No          |             O
   C   |        Yes          |        Yes       |         Yes         |             O

Fig. 5. Characteristics of the Three Versions of the Fan-In Cholesky Factorization.

Fig. 6 presents the relative efficiency with regard to version A, first for the execution time only, and second for the global time. The first graph demonstrates the improvement obtained: from 2 to 8% for the Loop Index Inspection (B), and from 3 to 13% with the Communication Inspection added (C). The second graph illustrates that the global time, even for only one executor phase, can be better than for the reference version A. It appears that the Loop Index Inspection leads to an interesting improvement (4% for B). However, the cost of the Communication Inspection is only just recovered by the executor on 16 processors; nevertheless, one can expect a greater improvement with more processors. In all our experiments, we can notice that inspection times are proportional to the global time. This property is important for the scalability of our execution support. For example, this time represents between 9% and 15% for version C, depending on the number of processors.
[The two graphs of Fig. 6 (caption below) plot relative efficiency (in percent) against the number of processors (1 to 16) for the curves (A) Processor Set, (B) Set and Loop Indices, and (C) Set, Loop Indices and Communications.]
Fig. 6. Relative Efficiencies (Execution Time and Global Time).

These experimental results thus validate our run-time system. Of course, if the factorization step is executed many times, then the inspector cost will be negligible compared with the gain achieved by multiple executions. See [4] for other measures which validate this approach.
6 Conclusion and Perspectives
In this paper, we present a contribution to the better handling of irregular problems in an HPF2 context. We use a tree representation to take data or distribution irregularity into account. This can help the compiler to perform standard analysis and optimizations. We also introduce an extension of the ON HOME directive in order to enable automatic irregular processor set creation, and we propose a run-time support based on the inspection/execution paradigm for active processor sets, loop indices and communications, for algorithms with loop-carried dependencies. Our approach is validated by experimental results for sparse Cholesky factorization.
Our future work is to extend the properties of our trees (geometrical distributions such as BRS or MRD [15], flexible trees in order to enable reallocations and redistributions), and to improve the performance of our Inspector/Executor system (use of the task parallelism induced by the loops). Finally, we are currently working on the integration of TriDenT in the compilation platform ADAPTOR.
References

1. C. Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. A Comparison of Three Column Based Distributed Sparse Factorization Schemes. In Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.
2. P. Banerjee, J. A. Chandy, M. Gupta, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The PARADIGM Compiler for Distributed-Memory Message Passing Multicomputers. In the First International Workshop on Parallel Processing, Bangalore, India, December 1994.
3. A. J. C. Bik and H. A. G. Wijshoff. Simple Quantitative Experiments with a Sparse Compiler. In A. Ferreira, J. Rolim, Y. Saad and T. Yang, editors, Proc. of Third International Workshop, IRREGULAR'96, volume 1117 of Lecture Notes in Computer Science, pages 249-262, Springer, August 1996.
4. T. Brandes, F. Brégier, M. C. Counilh, and J. Roman. Contribution to Better Handling of Irregular Problems in HPF2. Technical Report RR 120598, LaBRI, May 1998.
5. T. Brandes and F. Zimmermann. Programming Environments for Massively Parallel Distributed Systems, chapter ADAPTOR - A Transformation Tool for HPF Programs, pages 91-96. In K. M. Decker and R. M. Rehmann, editors, Birkhauser Verlag, April 1994.
6. B. Chapman, S. Benkner, R. Blasko, P. Brezany, M. Egg, T. Fahringer, H. M. Gerndt, J. Hulman, B. Knaus, P. Kutschera, H. Moritsch, A. Schwald, V. Sipkova, and H. Zima. Vienna Fortran Compilation System, User's Guide Edition, 1993.
7. K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, 1990.
8. J. D. Frens and D. S. Wise. Auto-blocking Matrix-Multiplication or Tracking BLAS3 Performance with Source Code. Technical Report 449, Computer Science Department, Indiana University, December 1996.
9. C. Fu and T. Yang. Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 42:143-156, 1997.
10. HPF Forum. High Performance Fortran Language Specification, January 1997. Version 2.0.
11. A. Lain. Compiler and Run-time Support for Irregular Computations. PhD thesis, Illinois, 1996.
12. S. S. Mukherjee, S. D. Sharma, M. D. Hill, J. R. Larus, A. Rogers, and J. Saltz. Efficient Support for Irregular Applications on Distributed-Memory Machines. In ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), July 1995.
13. R. Ponnusamy, Y. S. Hwang, R. Das, J. Saltz, A. Choudhary, and G. Fox. Supporting Irregular Distributions in Fortran 90D/HPF Compilers. IEEE Parallel and Distributed Technology, 1995. Technical Report CS-TR-3268 and UMIACS-TR-94-57.
14. Y. Saad. SPARSKIT: a Basic Tool Kit for Sparse Matrix Computations - Version 2. Technical report, CSRD, University of Illinois, June 1994.
15. M. Ujaldon, E. L. Zapata, B. M. Chapman, and H. P. Zima. New Data-Parallel Language Features for Sparse Matrix Computations. In Proc. of 9th IEEE International Parallel Processing Symposium, Santa Barbara, California, April 1995.
16. J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani. Distributed Memory Compiler Design for Sparse Problems. IEEE Transactions on Computers, 44(6), 1995.
OpenMP and HPF: Integrating Two Paradigms*

Barbara Chapman¹ and Piyush Mehrotra²

¹ VCPC, University of Vienna, Vienna, Austria.
² ICASE, MS 403, NASA Langley Research Center, Hampton VA 23681.
barbara@vcpc.univie.ac.at, pm@icase.edu
Abstract. High Performance Fortran is a portable, high-level extension of Fortran for creating data parallel applications on non-uniform memory access machines. Recently, a set of language extensions to Fortran and C based upon a fork-join model of parallel execution was proposed; called OpenMP, it aims to provide a portable shared memory programming interface for shared memory and low latency systems. Both paradigms offer useful features for programming high performance computing systems configured with a mixture of shared and distributed memory. In this paper, we consider how these programming models may be combined to write programs which exploit the full capabilities of such systems.
1 Introduction
Recently, high performance systems with physically distributed memory, and multiple processors with shared memory at each node, have come onto the market. Some have relatively low latency and may support at least moderate levels of shared memory parallelism. This has motivated a group of hardware and compiler vendors to define OpenMP, an interface for shared memory programming which they hope to see adopted by the community. OpenMP [2] supports a model of shared memory parallel programming which is based to some extent on PCF [3], an earlier effort to define a standard interface, but increases the power of its parallel constructs by permitting parallel regions to include subprograms. HPF was designed to exploit data locality and does not provide features for utilizing shared memory in a system; OpenMP provides the latter, however it does not support data locality. In this paper we consider how these paradigms might be combined to program distributed memory multiprocessing systems (DMMPs) which may require both of these. We assume that such a system consists of a number of interconnected processing nodes, or simply nodes, each of which is associated with a specific physical memory module and one or more general-purpose processors which share this memory and execute a program's instructions at run time: we emphasize this two-level structure by calling such systems SM-DMMPs.

* This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-97046 while both authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
We first consider interfacing the paradigms, enabling each of them to be used where appropriate. However, there is also potential for a tighter integration of at least some features of HPF with those offered by OpenMP. We therefore examine an approach which combines HPF data mappings with OpenMP constructs, permitting exploitation of shared memory whilst coordinating the mapping of data and work to the target machine in a manner which preserves data locality. This paper is organized as follows. Section 2 briefly describes the features of each paradigm, and in Section 3 we consider an interface between them based upon the HPF extrinsic mechanism. We speculate on the potential for a closer integration of the two, and illustrate each of them with a multiblock application example.
2 A Brief Comparison of HPF and OpenMP
High Performance Fortran (HPF) is a set of extensions to Fortran, designed to facilitate data parallel programming on a wide range of architectures [1]. It presents the user with a conceptual single thread of control. HPF directives allow the programmer to specify the distribution of data across processors, thus providing a high level description of the program's data locality, while the compiler generates the actual low-level parallel code for communication and scheduling of instructions. HPF also defines directives for expressing loop parallelism and simple task parallelism, where there is no interaction between tasks. Mechanisms for interfacing with other languages and programming models are provided. OpenMP is a set of directives, with bindings for Fortran 77 and C/C++, for explicit shared memory parallel programming ([2]). It is based upon the fork-join execution model, in which a single master thread begins execution and spawns worker threads to perform computations in parallel as required. The user must specify the parallel regions, and may explicitly coordinate threads. The language thus provides directives for declaring such regions, for sharing work among threads, and for synchronizing them. An order may be imposed on variable updates and critical regions defined. Threads may have private copies of some of the program's variables. Parallel regions may differ in the number of threads which execute them, and the assignment of work to threads may also be dynamically determined. Parallel sections easily express other forms of task parallelism; interaction is possible via shared data. The user must take care of any potential race conditions in the code, and must consider in detail the data dependences which need to be respected. Both paradigms provide a parallel loop with private data and reduction operations, yet these differ significantly in their semantics. Whereas the HPF INDEPENDENT loop requires data independence between iterations (except for reductions), OpenMP requires only that the construct be specified within a parallel region in order to be executed by multiple threads. This implies that the user has to handle any inter-iteration data dependencies explicitly. HPF permits private variables only in this context; otherwise, variables may not have different values in different processes. OpenMP permits private variables - with potentially differ-
ent values on each thread - in all parallel regions and work-sharing constructs. This extends to common blocks, private copies of which may be local to threads. Even subobjects of shared common blocks may be private. Sequence association is permitted for shared common block data under OpenMP; for HPF, this is the case only if the common blocks are declared to be sequential. The HPF programming model allows the code to remain essentially sequential. It is suitable for data-driven computations, and on machines where data locality dominates performance. The number of executing processes is static. In contrast, OpenMP creates tasks dynamically and is more easily adapted to a fluctuating workload; tasks may interact in non-trivial ways. Yet it does not enable the binding of a loop iteration to the node storing a specific datum. Thus each programming model has specific merits with respect to programming SM-DMMPs, and provides functionality which cannot be expressed within the other.
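The contrast between the two loop constructs can be made concrete with a small sketch (ours, not taken from either standard's examples):

! HPF: the directive asserts that the iterations are independent.
!HPF$ INDEPENDENT, NEW(TMP)
      DO I = 1, N
         TMP  = 2.0 * B(I)
         A(I) = TMP
      END DO

! OpenMP: the loop is merely work-shared among threads; any
! inter-iteration dependences are the user's responsibility.
!$OMP PARALLEL DO PRIVATE(TMP) SHARED(A, B)
      DO I = 1, N
         TMP  = 2.0 * B(I)
         A(I) = TMP
      END DO
!$OMP END PARALLEL DO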
3 Combining HPF with OpenMP
In this section we examine two ways of combining HPF and OpenMP such that we can exploit the features of both approaches. Note that the logical processors of HPF can be associated with either the nodes or with the individual processors of an SM-DMMP. In the former case, the user must rely on automatic shared memory parallelization to exploit all processors on a node. In the latter case, a typical implementation would create separate address spaces for each individual processor, and even communication between two processors on the same node will require an explicit data transfer.
3.1 OpenMP as an HPF Extrinsic Kind
It is a stated goal of HPF to interoperate with other programming paradigms. The language specification defines the extrinsic mechanism for interfacing to program units which are written in other languages or do not assume the HPF model of computation. The programming language and computation model must be defined. An interface specification must be provided for each extrinsic procedure. This approach keeps the two models completely separate, which implies that the compilers can also be kept distinct; in particular, the OpenMP compiler does not need to know anything about HPF data distributions. For the HPF calling program, execution should proceed just as if an HPF subprogram was invoked. HPF provides language interfaces for Fortran (90/95), Fortran 77 and C. We consider only Fortran programs and OpenMP constructs in this paper.¹ HPF provides the pre-defined programming models global, serial and local for use with extrinsics. It also permits additional models so long as conceptually one copy of the called procedure executes initially. That is the case for an OpenMP program. We define an extrinsic model below which allows us to invoke OpenMP subroutines and functions from within an HPF program. We also indicate how an alternative model might be defined.

¹ The two Fortran language versions differ in the assumptions on sequence and storage association, particularly when arguments are passed to subroutines.
The OpenMP extrinsic model. We base the OpenMP extrinsic kind upon the predefined local model. It assumes that the processors defined in an HPF program are associated with nodes of the target system (the number of nodes, not the total number of processors, is returned by HPF's NUMBER_OF_PROCESSORS function). An OpenMP extrinsic procedure called under this model is realized by creating a single local procedure on each node (or HPF processor). These execute concurrently and independently, exclusively within their respective node, until they return to the HPF calling program upon completion. The calling program blocks until all local procedures have terminated. With the extrinsic kind OpenMP, therefore, each local procedure is an independent OpenMP routine, initially executed by a master thread on a processor of the node it is invoked on. It will create worker threads on the same node when a parallel region is encountered; no communication is possible with a local OpenMP procedure on another node. Only the processors and memory on that node are available to it. An explicit interface will describe the HPF mapping of data required by the routine; any remapping required is performed before the OpenMP extrinsic procedure is called. Each invocation will access the local part of distributed data only. Scalar data arguments are available on each node. Only data visible to the master thread will be returned: if they are scalar, the value must be the same on each node. The OpenMP routine may not return to the caller from a parallel region or from within a synchronization construct, and may not have alternate entries. An OpenMP library routine will return information related only to the procedure executing on the node where it is invoked. This programming model might be very useful for systems with high latency between nodes, such as clusters of workstations, where it is important to map data to the node which needs them. It permits use of a finer grain of parallelism to exploit the processors within each such node once the data are in place. However, it is also restrictive: an OpenMP extrinsic routine can only exploit parallelism within a single node, since other nodes, and their data, are not visible to it.

An alternative extrinsic model. Other extrinsic kinds may be defined which conform to the requirements of HPF. In particular, an alternative can be defined which permits an OpenMP routine to utilize an entire platform, i.e. it begins with a single master thread which is cognizant of all processors executing the program. The team of threads created to execute a parallel region in the OpenMP code may thus span the entire machine. Since the OpenMP routine will not understand HPF data mappings, all distributed arguments must be mapped to the processor which will initiate the routine during its setup. This model enables the user to switch between programming models, depending upon which is most suitable for some region of the program. However, it does not enable OpenMP to take advantage of HPF's data locality, and the initial overhead might be large. Its practical value is therefore not clear.
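As a minimal illustration of the first (local) model, anticipating the fuller example of Sect. 4.1, an HPF caller might declare and invoke such a routine as follows (this is our own sketch; WORK and X are hypothetical names):

      INTERFACE
         EXTRINSIC ('OPENMP', 'LOCAL') SUBROUTINE WORK(A)
           REAL A(:)
!HPF$      DISTRIBUTE A(BLOCK)
         END SUBROUTINE WORK
      END INTERFACE
      REAL X(1000)
!HPF$ DISTRIBUTE X(BLOCK)
      CALL WORK(X)     ! one local OpenMP procedure starts on each node,
                       ! each seeing only its local segment of X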
3.2 Merging HPF and OpenMP
An interface between two paradigms obviously does not permit full exploitation of their features. The OpenMP extrinsic may be satisfactory for SM-DMMP architectures with high latency. However, the alternative model is less appealing as an approach to programming large low-latency SM-DMMPs, which may benefit from both data locality and the flexibility of a shared memory model. We thus consider augmenting the OpenMP directives with HPF directives directly, so that both shared-memory parallelism and data locality can be expressed within the same routines. OpenMP directives can then describe the parallel computation, while HPF directives may map data to the nodes where it is used and bind the execution of loop iterations to nodes storing distributed data.

Data Mappings. The HPF data mapping directives were carefully defined to affect a program's performance, but not its semantics. Thus they can be easily integrated into OpenMP without changing the semantics of OpenMP programs. It is natural to associate logical HPF processors with processing nodes under this model, since data will be shared within a node unless it is thread-private. Techniques from HPF compilers may be applied to prefetch data in bulk transfers where permitted by the semantics of the code. The resulting language might allow privatization of distributed variables just as HPF does inside INDEPENDENT loops. In general, however, we expect that HPF directives will be used to explicitly map large data objects, which are likely to be shared rather than being private to each thread. HPF does not support sequence or storage association for explicitly mapped variables. This restriction must carry over to the integrated language so that the compiler may exploit the mapping information. Common blocks with distributed constituents must follow the rules of both HPF and OpenMP. That is, if they are named in a THREADPRIVATE directive, or in a PRIVATE clause, each private copy inherits the original distribution. Storage for the explicitly mapped constituents is dissociated from the storage for the common block itself. In HPF programs, mapped data objects may be implicitly remapped across procedure boundaries to match the prescriptive mapping directives for the formal arguments. On the other hand, if they have been declared DYNAMIC, objects can be explicitly remapped using a REDISTRIBUTE or REALIGN directive. In order to facilitate remapping within an OpenMP program, we must impose the following constraint: Any code which implicitly or explicitly remaps a data object must be encountered by all the threads that share that object. This ensures that if the object is remapped, the new mapping is visible to all the threads which have access to the data. A small sketch combining these directives is given at the end of this subsection.

Additional HPF Constructs. HPF provides the ON directive to specify the locus of computation of a block of statements, a loop iteration or a subroutine call. This might be used as an alternative to the OpenMP scheduling options to bind the execution of code in a work-sharing construct to one or more nodes, or
logical HPF processors. An ON clause enables the user to specify, for example, the node where a SINGLE region or a SECTION of a parallel sections construct is to be executed. Similarly, it may bind each iteration of a DO construct to the node where a specific data item is stored. Finally, in contrast to the INDEPENDENT directive of HPF, parallel loops in OpenMP may have inter-iteration data dependencies. Thus an additional work-sharing construct could be based upon the INDEPENDENT loop, which permits the prefetching of data and may eliminate expensive synchronizations.
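A minimal sketch of such a combination (ours; the array names are illustrative) could look as follows: the HPF directives place the data and the iterations, while OpenMP shares the loop among threads.

      REAL A(1000), B(1000)
!HPF$ DISTRIBUTE A(BLOCK)
!HPF$ ALIGN B WITH A
!$OMP PARALLEL DO SHARED(A, B)
      DO I = 1, 1000
!HPF$    ON HOME(A(I)) BEGIN      ! run iteration I on the node owning A(I)
            A(I) = 2.0 * B(I)
!HPF$    END ON
      END DO
!$OMP END PARALLEL DO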
4 An Example
Scientific and engineering applications sometimes use a set of grids, instead of a single grid, to model a complex problem domain. Such multiblock codes use tens, hundreds, or even thousands of such grids, with widely varying shapes and sizes.

Structure of the Application. The main data structure in such an application stores the values associated with each of the grids in the multiblock domain. It is realized by an array, MBLOCK, whose size is the number of grids NGRID, which is defined at run time. Each of its elements is a Fortran 90 derived type containing the data associated with a single grid.

      TYPE GRID
         INTEGER NX, NY
         REAL, POINTER :: U(:, :), V(:, :), F(:, :), R(:, :)
      END TYPE GRID

      TYPE (GRID), ALLOCATABLE :: MBLOCK(:)

The decoupling of a domain into separate grids creates internal grid boundaries. Values must be correctly transferred between these at each step of the iteration. The connectivity structure is also input at run time, since it is specific to a given computational domain. It may be represented by an array CONNECT (not shown here) whose elements are pairs of sections of two distinct grids. The structure of the computation is as follows:

      DO ITERS = 1, MAXITERS
         CALL UPDATE_BDRY (MBLOCK, CONNECT)
         CALL IT_SOLVER (MBLOCK, SUM)
         IF ( SUM .LT. EPS ) THEN finished
      END DO

The first of these routines transfers intermediate boundary results between grid sections which abut each other. We do not discuss it further. The second routine comprises the solution step for each grid. It has the following main loop:

      SUBROUTINE IT_SOLVER( MBLOCK, SUM )
        SUM = 0.0
        DO I = 1, NGRID
           CALL SOLVE_GRID( MBLOCK(I)%U, MBLOCK(I)%V, ... )
           CALL RESID_GRID( ISUM, MBLOCK(I)%R, MBLOCK(I)%V, ... )
           SUM = SUM + ISUM
        END DO
      END

Such a code generally exhibits two levels of parallelism. The solution step is independent for each individual grid in the multiblock application. Hence each iteration of this loop may be invoked in parallel. If the grid solver admits parallel execution, then major loops in the subroutine it calls may also be parallelized. Finally the routine also computes a residual; for each grid, it calls a procedure RESID_GRID to carry out a reduction operation. The sum of the results across all grids is produced and returned in SUM. This would require a parallel reduction operation if the iterations are performed in parallel.
Distributing the Grids. We again presume that an HPF processor maps onto a node of the underlying machine rather than onto an individual physical processor. We map each grid to a single node (which may "own" several grids) by distributing the array MBLOCK. Such a mapping permits both levels of parallelism to be exploited. It implies that the boundary update routine may need to transfer data between nodes, whereas the solver routine can run in parallel on the processors of a node, using the shared memory in the node to access data. A simple BLOCK distribution, as shown below, may not ensure load balance.

!HPF$ DISTRIBUTE MBLOCK( BLOCK )

The INDIRECT distribution of HPF may be used to provide a finer grain of control over the mapping.
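For example (a hedged sketch; MAP is a hypothetical run-time array naming the node that owns each grid):

!HPF$ DISTRIBUTE MBLOCK( INDIRECT(MAP) )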
4.1 Using an OpenMP Extrinsic Routine
This version calls an OpenMP extrinsic to perform the solution step on each grid as well as compute the local contributions to the residual. The boundary updates and the final residual summation are performed in the calling HPF program. An interface declaration is required for the OpenMP extrinsic routine:

      EXTRINSIC ('OPENMP', 'LOCAL') SUBROUTINE OPEN(MBLOCK, M, LSUM)
        INTEGER M(:)
        REAL LSUM(:)
        TYPE (GRID), ALLOCATABLE :: MBLOCK(:)
Only the local segment of MBLOCK will be passed to the local procedure; M is its size and LSUM is the residual value for the local grids on the node. The extrinsic subroutine is invoked on each node independently. The master thread creates the required number of threads locally. It will return when each of the threads has terminated. The calling HPF routine waits until all extrinsic procedures have terminated.
      SUBROUTINE OPEN( LOC_MBLOCK, NLOC, LOC_SUM )
        SUM = 0.0
        CALL GET_THREAD(NLOC, LOC_MBLOCK, N)
!$OMP   CALL OMP_SET_NUM_THREADS(N)
!$OMP PARALLEL DO SCHEDULE (DYNAMIC), DEFAULT (SHARED),
!$OMP& REDUCTION (+: LOC_SUM)
        DO I = 1, NLOC
           CALL SOLVE_GRID( LOC_MBLOCK(I)%U, ... )
           CALL RESID_GRID( ISUM, LOC_MBLOCK(I)%R, ... )
           LOC_SUM = LOC_SUM + ISUM
        END DO
        RETURN
      END
Combining HPF and OpenMP
The combined model is very similar to the above; however, it is simpler to write in the sense that there is no need to construct an extrinsic interface. The data declarations and distribution are identical to those of the HPF program. The IT_SOLVER routine can be directly written with 0penMP directives in a manner very similar to the code shown above: IT_SOLVER( MBLOCK, SUM ) !HPF$ D I S T R I B U T E MBLOCK( B L O C K ) SUM --- 0.0 C A L L GET_THREAD(NGRID, MBLOCK, N) !$OMP C A L L OMP_SET_NUM_THREADS(N) SUBROUTINE
!$OMP PARALLEL DO SCttEDULE ( D Y N A M I C ) , DEFAULT ( S H A R E D )
!$OMP R E D U C T I O N (+: SUM) DO I = 1, NGRID !$HPF ON ( H O M E (MBLOCK(I)), R E S I D E N T C A L L SOLVE_GRID( MBLOCK(I)%U, . . . ) CALL RESID_GRID( ISUM, MBLOCK(I)%R . . . . ) ~$HPF E N D ON SUM = SUM + ISUM END DO RETURN END
658 This version again deals with the full array MBLOCK, rather than with local segments. Thus, there may be a need to specify the locus of computation so that each iteration is performed on the processor which stores its data. This can be done using an HPF-style 0N-block along with the RESIDENT directive, as shown. The execution differs. This time, the initialization of SUM will be performed once only, on a single master thread which controls the entire machine. The number of threads, N, is computed for the entire system, and the work is distributed across all processors. This may enable an improved load balance. We have avoided a two-step computation of SUM. This approach is appropriate for a system with comparatively low latency, which allows a single master thread on the machine and a global scheduling policy, permitting a compiler/runtime system to do globM load balancing if appropriate.
5
5 Summary
OpenMP was recently proposed for adoption in the community by a group of major vendors. It is not yet clear what level of acceptance it will find, yet there is certainly a need for portable shared-memory programming. The functionality it provides differs strongly from that of HPF, which considers the needs of distributed memory systems. Since many current architectures have requirements for both data locality and efficient shared memory parallel programming, there are potential benefits in combining elements of these two paradigms. We have shown in this paper that the HPF extrinsic mechanism enables usage of each within a single program. This may facilitate the programming of workstation clusters with standard interconnection technology, for example. It is less clear if a large low-latency machine will benefit from a simple interface. For these, there appears to be added value in providing some level of integration. Given the orthogonality of many concepts in the two programming models, they are fundamentally highly compatible and a useful integrated model can be specified. In particular, a combination of HPF data locality directives with OpenMP parallelization constructs extends the scope of applicability of each of them. Since at least one major vendor has already provided data mapping options together with a shared memory parallel programming model [4], we expect that this avenue will be closely explored by the OpenMP consortium in the near future.
References

1. High Performance Fortran Forum: High Performance Fortran Language Specification. Version 2.0, January 1997
2. OpenMP Consortium: OpenMP Fortran Application Program Interface, Version 1.0, October 1997
3. B. Leasure (Ed.): Parallel Processing Model for High Level Programming Languages, Draft Proposed National Standard for Information Processing Systems, April 1994
4. Silicon Graphics, Inc.: MIPSpro Fortran 77 Programmer's Guide, 1996
Towards a Java Environment for SPMD Programming

Bryan Carpenter, Guansong Zhang, Geoffrey Fox, Xiaoming Li*, Xinying Li and Yuhong Wen

NPAC at Syracuse University, Syracuse, New York, NY 13244, USA
{dbc, zgs, gcf, lxm, xl, wen}@npac.syr.edu
Abstract. As a relatively straightforward object-oriented language, Java is a plausible basis for a scientific parallel programming language. We outline a conservative set of language extensions to support this kind of programming. The programming style advocated is Single Program Multiple Data (SPMD), with parallel arrays added as language primitives. Communications involving distributed arrays are handled through a standard library of collective operations. Because the underlying programming model is SPMD programming, direct calls to other communication packages are also possible from this language.
1 Introduction
Java boasts a direct simplicity reminiscent of Fortran, but also incorporates many of the important ideas of modern object-oriented programming. Of course it comes with an established track-record in the domains of Web and Internet programming. The idea that Java may enable new programming environments, combining attractive user interfaces with high performance computation, is gaining increasing attention amongst computational scientists [7, 8]. This article will focus specifically on the potential of Java as a language for scientific parallel programming. We envisage a framework called HPJava. This would be a general environment for parallel computation. Ultimately it should combine tools, class libraries, and language extensions to support various established paradigms for parallel computation, including shared memory programming, explicit message-passing, and array-parallel programming. This is a rather ambitious vision, and the current article only discusses some first steps towards a general framework. In particular we will make specific proposals for the sector of HPJava most directly related to its namesake: High Performance Fortran. For now we do not propose to import the full HPF programming model to Java. After several years of effort by various compiler groups, HPF compilers are still quite immature. It seems difficult to justify a comparable effort for Java

* Current address: Peking University
before success has been convincingly demonstrated in Fortran. In any case there are features of the HPF model that make it less attractive in the context of the integrated parallel programming environment we envisage. Although an HPF program can interoperate with modules written in other parallel programming styles through the HPF extrinsic procedure interface, that mechanism is quite awkward. Rather than follow the HPF model directly, we propose introducing some of the characteristic ideas of HPF--specifically its distributed array model and array intrinsic functions and libraries--into a basically SPMD programming model. Because the programming model is SPMD, direct calls to MPI [1] or other communication packages are allowed from the HPJava program. The language outlined here provides HPF-like distributed arrays as language primitives, and new distributed control constructs to facilitate access to the local elements of these arrays. In the SPMD mold, the model allows processors the freedom to independently execute complex procedures on local elements. All access to non-local array elements must go through library functions--typically collective communication operations. This puts an extra onus on the programmer; but making communication explicit encourages the programmer to write algorithms that exploit locality, and simplifies the task of the compiler writer. On the other hand, by providing distributed arrays as language primitives we are able to simplify error-prone tasks such as converting between local and global array subscripts and determining which processor holds a particular element. As in HPF, it is possible to write programs at a natural level of abstraction where the meaning is insensitive to the detailed mapping of elements. Lower-level styles of programming are also possible.
2 Distributed Arrays
HPJava adds class libraries and some additional syntax for dealing with distributed arrays. These arrays are viewed as coherent global entities, but their elements may be divided across a set of cooperating processes. The new objects represent true multidimensional arrays. They allow regular section subscripting, similar to Fortran 90 arrays. These properties all appear very attractive within the scientific parallel programming community, and have become established in HPF and related languages. But they imply a kind of array structurally very different to Java's built-in arrays. A design decision was made not to attempt integration of the new arrays with standard Java array types: instead they are a new kind of entity that coexists in the language with ordinary Java arrays. The type-signatures and constructors of the multidimensional array use double brackets to distinguish them from ordinary arrays.
661 In this example: Procs2 p = new procs2(3, 2) ; Range x = new BlockRange(lO0, Range y = new BlockRange(200, float
p.dim(O)) p.dim(1))
; ;
[ [ , ] ] a : new f l o a t [[x, y]] on p ;
a is created as a 100 • 200 array, block-distributed over the 6 processes in p. The ideas will be discussed in more detail in the following paragraphs, but the fragment is essentially equivalent to the H P F declarations !HPF$ PROCESSORS
P(3, 2)
REAL A(lO0, 200) !HPF$ DISTRIBUTE A(BLOCK,
BLOCK) ONTO P
Before examining the H P J a v a syntax for declaration of distributed arrays we will discuss the process groups over which their elements are scattered. A base class Group describes a general group of processes. It has subclasses P r o c s l , P r o c s 2 , . . . , representing one-dimensional process grids, two-dimensional process grids, and so on. In the example p represents a 3 by 2 grid of processes. At the time the constructor of p is executed the program should be running on six or more processes. Declaration of p in the J a v a program corresponds directly to declaration of the processor arrangement P in the H P F fragment. In H P J a v a additional objects describing individual dimensions of a process grid are accessed through the inquiry m e m b e r dim. Some or all of the dimensions of a multi-dimensional array can be declared as d i s t r i b u t e d ranges. In general a distributed range is represented by an object of class Range. A Range object defines a range of integer subscripts, and specifies how they are m a p p e d into a process grid dimension. For example, the class BlockRange is a subclass of Range describing a simple block-distributed range of subscripts. Like BLOCK distribution format in HPF, it m a p s blocks of contiguous subscripts to each element of its target process dimension s. The constructor of BlockRange usually takes two arguments: the extent of the range and a Dimension object defining the process dimension over which the new range is distributed. In general the constructor of the distributed array must be followed by an on clause, specifying the process group over which the array is distributed. Distributed ranges of the array must be distributed over distinct dimensions of this group. The on clause can be omitted in some circumstances--see Sect. 3. A distributed array can also have ordinary sequentiM dimensions, in which case the range slot in the constructor is replaced by an integer extent, and the corresponding slot in the type signature should contain an asterisk: float
[[,*]] b = new float
[Ix, I0]] on p ;
Other range subclasses include CyclicRange, which produces the equivalent of CYCLIC distribution format in H P F .
662 T h e asterisk is not m a n d a t o r y (array types decorated in this way are regarded as subtypes of the undecorated ones) but it allows some extra flexibility in subscripting sequential dimensions 2. With one or two extra constructors for groups and ranges, the framework described here captures all the distribution and alignment features of H P F . Because arrays like a are declared as a collective objects we can apply collective operations to them. For example a standard library called Adlib (one of a n u m b e r of existing communication libraries that m a y eventually be brought into the H P J a v a framework) provides m a n y of the array functions of Fortran 90. f l o a t [ [ , ] ] c = new f l o a t [[x, y]] on p ; Adlib.shift(a, c, -I, O, CYCL) ;
At the edges of the local segment of a the s h i f t operation causes the local values of a to be overwritten with values of c from a processor adjacent in the x dimension. Subscripting operations on distributed arrays are subject to certain restrictions. The H P J a v a model is a formalization of the distributed m e m o r y S P M D p r o g r a m m i n g style, and all m e m o r y accesses must refer to the m e m o r y of the local processor. In general therefore an array reference like a
[17,
23]
= 13
;
is illegal because it m a y imply access to an element held on a different processor. The language provides several distributed control constructs to alleviate the inconvenience of this restriction. These will be discussed in the next few sections.
3
The
on Construct
and
the
Active
Process
Group
The concept of process groups has an important role in the HPJava model. The class Group (process grid classes are special cases) has a member function called local. This returns a boolean value true if the local process is a member of the group, false otherwise. In

    if(p.local()) {
      ...
    }

the code inside the conditional is executed only if the local process is a member of p. HPJava provides a short way of writing this construct:

    on(p) {
      ...
    }

² Incidentally, because b has no range distributed over p.dim(1) it is understood to be replicated over this dimension of the grid.
The on construct provides some extra value. The language incorporates a formal idea of the active process group (APG). At any point of execution some process group is singled out as the APG. An on(p) construct specifically changes the value of the APG to p. On exit from the construct, the APG is restored to its value on entry. Elevating the APG to a part of the language allows some simplifications. For example, it provides a natural default for the on clause in array constructors. More importantly, formally defining the APG simplifies the statement of various rules about what operations are legal inside distributed control constructs like on. A set of formal rules defines how these constructs can be nested and what data accesses are legal inside them. These rules have been presented elsewhere.
4 Locations and the at Construct
As noted at the end of Sect. 2, a mechanism is needed to ensure that the array reference

    a [17, 23] = 13 ;

is legal, because the local process holds the element in question. In general determining whether an element is local may be a non-trivial task. Rather than use integer values directly as local subscripts in distributed array dimensions, the idea of a location is introduced. A location can be viewed as an abstract element, or "slot", of a distributed range. An individual location is described by an object of the class Location. Each Location element is mapped to a particular slice of a process grid. In general two locations are identical only if they come from the same position in the same range. A subscripting syntax is used to represent location n in range x:

    Location i = x [n]
Locations are used to parametrize a second distributed control construct called the at construct. This is similar to on, except that its body is executed only on processes that hold a specified location. For distributed array dimensions, locations rather than integers are used as subscripts. Access to element a [17, 23] can be safely written as:

    Location i = x [17], j = y [23] ;
    at(i)
      at(j)
        a [i, j] = 13 ;
Locations used as array subscripts must be elements of the corresponding ranges of the array.
5 Distributed Loops
The at mechanism of the previous section is often useful, but a more urgent requirement is a mechanism for parallel access to distributed array elements.
The last and most important distributed control construct in the language is called overall. It implements a distributed parallel loop. Conceptually it is quite similar to the FORALL construct of Fortran, except that the overall construct specifies exactly where its parallel iterations are to be performed. The argument of overall is a member of the special class Index. This class is a subclass of Location, so it is syntactically correct to use an index as an array subscript. Here is an example of a pair of nested overall loops:

    float [[,]] a = new float [[x, y]], b = new float [[x, y]] ;
    ...
    Index i, j ;
    overall(i = x | :)
      overall(j = y | :)
        a [i, j] = 2 * b [i, j] ;
The body of an overall construct executes, conceptually in parallel, for every location in the range of its index. An individual "iteration" executes on just those processors holding the location associated with the iteration. The net effect of the example above should be reasonably clear. It assigns twice the value of each element of b to the corresponding element of a. Because of the rules about where an individual iteration iterates, the body of an overall can usually only combine elements of arrays that have some simple alignment relation relative to one another. The idx member of Range can be used in parallel updates to yield expressions that depend on global index values. The ":" in the overall examples above can be replaced by a general Fortran-like subscript triplet of the form l : u : s, where l, u and s are respectively a lower bound, an upper bound, and a stride. The ": s" part can be omitted, in which case s defaults to 1; l or u can be omitted, in which case they default to 0 or N - 1 respectively, where N is the size of the range specified before the "|". Hence the degenerate isolated ":" selects the whole of the range. With the overall construct we can give some useful examples of parallel programs. Figure 1 gives a parallel implementation of red-black relaxation in the extended language. To support the important stencil-update paradigm, ghost regions are allowed on distributed arrays. Ghost regions are extensions of the locally held block of a distributed array, used to cache values of elements held on adjacent processors. In our case the width of these regions is specified in a special form of the BlockRange constructor. The ghost regions are explicitly brought up to date using the library function writeHalo. Its arguments are an array with suitable extensions and a vector defining in each dimension the width of the halo that must actually be updated. Note that the new range constructor and writeHalo function are library features, not new language extensions. One new piece of syntax is needed: the addition and subtraction operators are overloaded so that integer offsets can be added to or subtracted from locations, yielding new, shifted, locations. This kind of shifting does not imply access to off-processor data. It only works if the subscripted array has suitable ghost extensions. Figure 2 gives a parallel implementation of Cholesky decomposition in the extended language. The first dimension of a is sequential ("collapsed" in HPF
parlance). The second dimension is distributed (cyclically, to improve load balancing). This is a column-oriented decomposition. The example introduces Fortran-like array sections. These are distinguished from local subscripting operations by the use of double brackets. The subscripts are typically scalars (integers) or triplets (in the example, the upper bounds of various triplets default to N - 1, determined by the extent of the corresponding array range). The example also involves one new operation from the Adlib library. The function remap copies the elements of one distributed array or section to another of the same shape. The two arrays can have any, unrelated decompositions. In the current example remap is used to implement a broadcast. Because b has no range distributed over p, it implicitly has replicated mapping; remap accordingly copies identical values to all processors. We have covered most of the important language features we are implementing. Two additional features that are quite important in practice but have not been discussed are subranges and subgroups. A subrange is simply a range which is a regular section of some other range, created by syntax like x [0 : 49]. Subranges can be used to create distributed arrays with general HPF-like alignments. A subgroup is some slice of a process array, formed by restricting process coordinates in one or more dimensions to single values. Subgroups formally describe the state of the active process group inside at and overall constructs. For a more complete description of a slightly earlier version of the proposed language, see [3]. Note that if distributed array accesses are genuinely irregular, the necessary subscripting cannot usually be directly expressed in our language, because subscripts cannot be computed randomly in parallel loops without violating the fundamental SPMD restriction that all accesses be local. This is not regarded as a shortcoming: on the contrary it forces explicit use of an appropriate library package for handling irregular accesses (such as CHAOS [6]). Of course a suitable binding of such a package is needed in our language.

6 Discussion
We have described a conservative set of extensions to Java. In the context of an explicitly SPMD programming environment with a good communication library, we claim these extensions provide much of the concise expressiveness of HPF, without relying on very sophisticated compiler analysis. The object-oriented features of Java are exploited to give an elegant parameterization of the distributed arrays in the extended language. Because of the relatively low-level programming model, interfacing to other parallel-programming paradigms is more natural than in HPF. With suitable care, it is possible to make direct calls to, say, MPI from within the data parallel program (in [2] we suggest a concrete Java binding for MPI). It is necessary to address various questions: Why introduce language extensions and additional syntax--why not work entirely with class libraries for distributed arrays? In fact we consciously put
    Procs2 p = new Procs2(P, P) ;
    on(p) {
      Range x = new BlockRange(W, p.dim(0), 1) ;   // ghost width 1
      Range y = new BlockRange(W, p.dim(1), 1) ;   // ghost width 1
      float [[,]] u = new float [[x, y]] ;
      int [] widths = {1, 1} ;                     // widths updated by `writeHalo'

      // ... some code to initialise `u'

      for(int iter = 0 ; iter < NITER ; iter++)
        for(int parity = 0 ; parity < 2 ; parity++) {
          Adlib.writeHalo(u, widths) ;
          Index i, j ;
          overall(i = x | 1 : W - 2)
            overall(j = y | 1 + (x.idx(i) + parity) % 2 : W - 2 : 2)
              u [i, j] = 0.25 * (u [i - 1, j] + u [i + 1, j] +
                                 u [i, j - 1] + u [i, j + 1]) ;
        }
    }

Fig. 1. Red-black iteration using writeHalo.

    Procs1 p = new Procs1(P) ;
    on(p) {
      Range x = new CyclicRange(N, p.dim(0)) ;
      float [[*,]] a = new float [[N, x]] ;
      float [[*]]  b = new float [[N]] ;           // buffer

      // ... some code to initialise `a'

      Location l ;
      Index m ;
      for(int k = 0 ; k < N - 1 ; k++) {
        at(l = x [k]) {
          float d = Math.sqrt(a [k, l]) ;
          a [k, l] = d ;
          for(int s = k + 1 ; s < N ; s++)
            a [s, l] /= d ;
        }
        Adlib.remap(b [[k + 1 : ]], a [[k + 1 : , k]]) ;
        overall(m = x | k + 1 : )
          for(int i = x.idx(m) ; i < N ; i++)
            a [i, m] -= b [i] * b [x.idx(m)] ;
      }
      at(l = x [N - 1])
        a [N - 1, l] = Math.sqrt(a [N - 1, l]) ;
    }

Fig. 2. Cholesky decomposition.
as much of the framework into libraries as seemed practical, based on extensive experimentation (in Java and C++). All communication operations and collective operations are library functions. Although we introduce a basic syntactic framework for creating distributed arrays (even this framework could be eliminated if Java provided a feature like C++ templates), all variants of distribution format are captured in runtime libraries, in the Range hierarchy. Similarly all variants of processor arrangement are captured in the Group hierarchy. Unfortunately it seems to be difficult to obtain a pure library implementation of the basic data-parallel loops and local array subscripting operations that is flexible, convenient, and efficient. Our suggestion is that, while it would certainly be unrealistic to expect the syntax extensions outlined here to be incorporated in the standard Java language, relatively simple preprocessors can convert the extended language to ugly but efficient standard Java plus class-library calls, as required. Why does the language seem to have awkward restrictions on subscripting, compared with HPF? The answer is: because HPJava is supposed to be a thin layer on top of a straightforward distributed SPMD program with calls to class libraries. In general the restrictions make it much easier to implement efficiently than HPF. Why adopt the distributed memory model--why not do parallel programming with Java threads? Mainly because our target hardware is clusters of commodity processors, not expensive SMPs. The language extensions described were devised partly to provide a convenient interface to a distributed-array library developed in the PCRC project [5, 4]. Hence most of the run-time technology needed to implement the language is available "off-the-shelf". The existing library includes the run-time descriptor for distributed arrays and a comprehensive array communication library. The HPJava compiler itself is being implemented initially as a translator to ordinary Java, through a compiler construction framework also developed in the PCRC project [12]. A complementary approach to communication in a distributed array environment is the one-sided-communication model of Global Arrays (GA) [9]. For task-parallel problems this approach is often more convenient than the schedule-oriented communication of CHAOS (say). Again, the language model we advocate here appears quite compatible with the GA approach--there is no obvious reason why a binding to a version of GA could not be straightforwardly integrated with the distributed array extensions of the language described here. Finally we mention two language projects that have some similarities. Spar [11] is a Java-based language for array-parallel programming. There are some similarities in syntax, but semantically Spar is very different from our language. Spar expresses parallelism but not explicit data placement or communication--it is a higher level language. ZPL [10] is a new programming language for scientific computations. Like Spar, it is an array language. It has an idea of performing computations over a region, or set of indices. Within a compound statement
prefixed by a region specifier, aligned elements of arrays distributed over the same region can be accessed. This has similarities to our overall construct.
References

1. Bryan Carpenter, Yuh-Jye Chang, Geoffrey Fox, Donald Leskiw, and Xiaoming Li. Experiments with HPJava. Concurrency: Practice and Experience, 9(6):633, 1997.
2. Bryan Carpenter, Geoffrey Fox, Xinying Li, and Guansong Zhang. A draft Java binding for MPI. http://www.npac.syr.edu/projects/pcrc/doc.
3. Bryan Carpenter, Guansong Zhang, Geoffrey Fox, Xinying Li, and Yuhong Wen. Introduction to Java-Ad. http://www.npac.syr.edu/projects/pcrc/doc.
4. Bryan Carpenter, Guansong Zhang, and Yuhong Wen. NPAC PCRC runtime kernel definition. Technical Report CRPC-TR97726, Center for Research on Parallel Computation, 1997. Up-to-date version maintained at http://www.npac.syr.edu/projects/pcrc/doc.
5. Parallel Compiler Runtime Consortium. Common runtime support for high-performance parallel languages. In Supercomputing '93. IEEE Computer Society Press, 1993.
6. R. Das, M. Uysal, J. H. Salz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, September 1994.
7. Geoffrey C. Fox, editor. Java for Computational Science and Engineering--Simulation and Modelling, volume 9(6) of Concurrency: Practice and Experience, June 1997.
8. Geoffrey C. Fox, editor. Java for Computational Science and Engineering--Simulation and Modelling II, volume 9(11) of Concurrency: Practice and Experience, November 1997.
9. J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array: non-uniform-memory-access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996.
10. Lawrence Snyder. A ZPL programming guide. Technical report, University of Washington, May 1997. http://www.cs.washington.edu/research/projects/zpl/.
11. Kees van Reeuwijk, Arjan J. C. van Gemund, and Henk J. Sips. Spar: A programming language for semi-automatic compilation of parallel programs. Concurrency: Practice and Experience, 9(11):1193-1205, 1997.
12. Guansong Zhang, Bryan Carpenter, Geoffrey Fox, Xiaoming Li, Xinying Li, and Yuhong Wen. PCRC-based HPF compilation. In 10th International Workshop on Languages and Compilers for Parallel Computing, 1997. To appear in Lecture Notes in Computer Science.
Language Constructs and Run-Time System for Parallel Cellular Programming

Giandomenico Spezzano and Domenico Talia
ISI-CNR c/o DEIS, Università della Calabria, 87036 Rende (CS), Italy
{spezzano, talia}@si.deis.unical.it
Abstract. This paper describes the main features of CARPET, a high-level programming language based on cellular automata theory. The language has been designed to support the development of parallel high-performance software while abstracting from the parallel architecture on which programs run. A CARPET user can write cellular programs that describe the actions of a very large number of simple active agents interacting locally. The CARPET run-time system allows a user to observe, also in a graphical format, the global results that arise from their parallel execution.
1 Introduction
The lack of high-level languages, tools, and application-oriented environments often limits the design and implementation of parallel algorithms that are portable, efficient, and expressive. Restricted-computation structures represent one of the most important models of parallel processing [8]. The interest in these models is due to the possibility of restricting the form of computations so as to restrict communication volume, achieving high performance. Restricted-computation models offer a user a structured paradigm of parallel programming and improve the performance of parallel algorithms by reducing the overheads due to communication latency. Further, tools can be designed to estimate the performance of the various constructs of a high-level language on a specific parallel architecture. Cellular processing languages based on the cellular automata (CA) model [10] represent a significant example of restricted computation that is used to model parallel computation for a large number of applications in biology, physics, geophysics, chemistry, economics, artificial life, and engineering. A cellular automaton consists of a one-dimensional or multi-dimensional lattice of cells, each of which is connected to a finite neighborhood of cells that are nearby in the lattice. Each cell in the regular spatial lattice can take any of a finite number of discrete state values. Time is discrete, as well, and at each time step all the cells in the lattice are updated by means of a local rule, called the transition function, that determines the cell's next state based upon the states of its neighbors. That is, the state of a cell at a given time depends only on its own state and the states
of its nearby neighbors at the previous time step. Different neighborhoods can be defined for the cells. The most common neighborhoods in the two-dimensional case are the von Neumann neighborhood, consisting of the North, South, East, and West neighbors, and the Moore neighborhood, composed of eight neighbor cells. The global behavior of an automaton is defined by the evolution of the states of all cells as a result of multiple interactions. CA are intrinsically parallel, so they can be simulated on parallel computers by running the cell transition functions in parallel with high efficiency, as the communication flow between processors can be kept low. In fact, in our approach, a cellular algorithm is composed of all the transition functions of the cells that compose the lattice. Each transition function generally contains the same local rule, but it is also possible to define some cells with different transition functions (inhomogeneous cellular automata). According to this approach, we designed and implemented a high-level programming language, called CARPET (CellulAR Programming EnvironmenT) [9], that allows a user to design cellular algorithms. In particular, CARPET has been used for programming cellular algorithms in CAMEL (Cellular Automata environMent for systems ModeLing) [3]. A user can design cellular programs in CARPET describing the actions of many simple active agents (implemented by the cells) interacting locally; the CAMEL system then runs the cell transition functions in parallel, allowing a user to observe the global complex evolution that arises from all the local interactions. A number of cellular programming languages such as Cellang [4], CDL [6], and CARP [7] have been defined in the last decade. However, none of those contains all the features of CARPET, nor has a parallel run-time support for them been implemented to date.
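To make the update discipline concrete, the following plain C fragment (our own illustration, not CARPET or CAMEL code; the lattice size, types, and names are invented for the example) performs one synchronous step of a two-dimensional CA with a von Neumann neighborhood, using double buffering so that every cell reads old states and new states only become visible at the next step:

    #include <string.h>

    #define W 64
    #define H 64

    static short cur[H][W], nxt[H][W];

    /* One synchronous CA step: every cell reads `cur' and writes `nxt',
       so updates are deferred until the whole lattice has been computed. */
    static void ca_step(short (*rule)(short c, short n, short w, short s, short e))
    {
        for (int i = 0; i < H; i++)
            for (int j = 0; j < W; j++) {
                short n = cur[(i - 1 + H) % H][j];     /* North */
                short s = cur[(i + 1) % H][j];         /* South */
                short w = cur[i][(j - 1 + W) % W];     /* West  */
                short e = cur[i][(j + 1) % W];         /* East  */
                nxt[i][j] = rule(cur[i][j], n, w, s, e);
            }
        memcpy(cur, nxt, sizeof cur);                  /* new states become current */
    }

A parallel implementation partitions the lattice among processes and exchanges only border cells, which is exactly the structure CARPET and CAMEL automate.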
2 CARPET
The rationale of CARPET is to make parallel computers available to application-oriented users, hiding the implementation issues coming from their architectural complexity. A CARPET user can program complex problems that may be represented as discrete across a lattice. The parallelism inherent in its programming model is not apparent to the programmer. CARPET implements a cellular automaton as an SPMD program. CA are implemented as a number of processes, each one mapped on a distinct PE, that execute the same code on different data. According to this approach, a user must specify in CARPET only the transition function of a single cell of the system he wants to simulate. The language uses the control structures, the types, and the operators of the C language. A CARPET program is composed of a declaration part, which appears only once in the program and must precede any statement (except those of the C pre-processor), and of a program body. The program body has the usual C statements, without I/O instructions, and a set of statements to deal with the state of a cell and its neighborhood. Further, CARPET users may use C functions or procedures to improve the structure of programs.
The declaration section includes constructs to define the dimensions of the automaton (dimension), the radius of the neighborhood (radius), the type of the neighborhood (neighbor), and to specify the state of a cell (state) as a set of typed substates that can be chars, shorts, integers, floats, doubles, and arrays of these basic types. The dimension declaration defines the number of dimensions of the automaton in a discrete Cartesian space. The maximum number of dimensions allowed in the current implementation is 3. Radius defines the set of cells that can compose the neighborhood of a cell. In CARPET the state of a cell is composed of a record of typed substates, unlike classical cellular automata where the cell state is represented by a few bits. The typification of the substates extends the range of applications that can be coded in CARPET, simplifying the writing of programs and improving their readability. In the following example the state is composed of three substates:
    state (short direction, float mass, speed);
672
Fig. 1. The von Neumann neighborhood in a two-dimensional CA lattice.
changed in the program but it can only be modified, during the simulation, by the UI. An example of parameter declaration in C A R P E T is
changed in the program; it can only be modified, during the simulation, by the UI. An example of parameter declaration in CARPET is

    parameter (permeability 0.9);

Input and output of a CARPET program can be performed by files or by the edit function of the UI. A file can contain the values of one substate of all cells, so these values can be loaded at step 0 to initialize the automaton. The output of a CARPET program can automatically be saved in files at regular intervals to be post-processed by mathematical or visualization tools. CARPET offers a user the possibility to define non-deterministic rules using a random function. Finally, CARPET allows a user to define cells with different transition functions by means of the Getx, Gety, and Getz operations, which return the values of the X, Y, and Z coordinates of a cell in the automaton. Differently from other cellular languages, CARPET does not provide statements to configure the automata, to visualize the cell values, or to define the process-to-processor mapping. These features can be defined by means of the GUI of its runtime system. The CAMEL GUI allows a user to define the size of the cellular automata and the number of processors onto which an automaton must be executed, and to choose the colors to be assigned to the cell substates to support the graphical visualization of their values. By excluding constructs for configuration and visualization from the language, it is possible to execute the same CARPET compiled code using different configurations.
3 A Parallel Cellular Run Time System
Parallel computers are the best practical support for the effective implementation of high-performance CA [1]. CAMEL has been developed according to this basic idea. It is a parallel software system based on the cellular automata model that constitutes the parallel run-time system of CARPET. CAMEL has been implemented on a parallel computer composed of a mesh of Transputers connected to a host node [2]. CAMEL provides a GUI to configure a program, to monitor the parameters of a simulation, and to change them dynamically at run time. The parallel execution of cellular algorithms is implemented by the parallel execution of the transition function of each cell in the Single Program Multiple
Data (SPMD) way. A portable implementation of CAMEL for MIMD parallel computers based on the MPI communication library is being developed. The CAMEL system is composed of a set of Macrocell processes, each one running on a single processing element of the parallel machine, and of a Controller process running on a master processor. Each Macrocell process implements an automaton partition that includes several elementary cells, and it makes use of a communication system that handles the data exchange among cells. All the Macrocells of the lattice execute the local rule in parallel, which results in a global transformation of the whole lattice state. CAMEL uses a load balancing strategy for assigning lattice partitions to the processors of a parallel computer [2]. This load balancing is a domain decomposition strategy similar to the scattered decomposition technique.
4 A Simple Program
This section shows a simple program written in CARPET. This example should familiarize the reader with the CARPET approach. The program in figure 2 shows how the CARPET constructs can be used to implement the parity rule algorithm. The parity rule is a very simple example of "self-reproduction" in cellular automata; an initial pattern is replicated at some specific iteration that is a power of two. The cells can have 0 or 1 values only. A cell takes the sum of its neighbors and goes to 1 if the sum is odd, or to 0 if the sum is even. Let us call N the number of '1' cells among the four nearest neighbors of a given cell. The transition rule is the following: given a cell, if N is odd, the new value of the cell will be 0; if N is even the cell's value does not change.
    cadef
    {
      dimension 2;          /* bidimensional lattice */
      radius 1;
      state (short value);
      neighbor cross[4] ([0,-1] North, [-1,0] West,
                         [0,1] South, [1,0] East);
    }
    {
      int i;
      short N = 0;
      for (i = 0; i < 4; i++)
        N = cross_value[i] + N;
      if (N%2 == 1)
        update (cell_value, 0);   /* updating the state of a cell */
    }
Fig. 2. The parity rule algorithm written in CARPET.
5 Performance
Together with the minimization of elapsed time, scalability is a major goal in the design of parallel computing applications. Scalability shows the potential ability of parallel computers to speed up their performance by adding more processing elements. The scalability of CARPET programs has been measured by increasing the number of PEs used to solve the same problem, and by using the same number of processing elements to solve bigger problems [2]. The speedup obtained on 32 Transputers with an automaton composed of 224x70 cells, which simulated the Ontake mountain (Japan) landslide, was 25.0 (78.1% efficient). Table 1 shows the times (in seconds) and the speed-up obtained running the landslide simulation on CAMEL using 1, 2, 4, 8, 16 and 32 PEs. The second column shows the number of cells which are mapped on each PE. Similar performance results have been obtained in other applications developed by CARPET [3].

Table 1. Execution times and speed-up.

  Number of PEs   Number Cells/PE   Time for 2000 steps   Speed-up
        1              15680             10231.59             1
        2               7840              6392.48             1.6
        4               3920              3181.40             3.2
        8               1960              1447.96             7.0
       16                980               728.11            14.0
       32                490               407.84            25.0
6 Final Remarks and Future Work
As stated also by M. J. Flynn [5], the cellular automata model is a new mathematical way to represent problems that allows to effectively use parallel computers achieving scalable performance. In this paper, we described the CARPET programming language designed for programming cellular algorithms on parallel computers. CARPET has been used successfully to implement several real life applications such as landslide simulation, lavaflow models, freeway traffic simulation, image processing, and genetic algorithms. The experience during the design, implementation, and use of the CARPET language showed us that high-level languages are very useful in the development of cellular algorithms for solving complex problems in science and engineering. Currently, the CARPET language is used in the COLOMBO (Parallel COmputers improve clean up of soils by Modelling BiOremediation) project within the ESPRIT framework. The COLOMBO main objective is the use of CA models for the bioremediation of contaminated soils. This project is developing a
portable MPI-based implementation of CARPET and CAMEL on MIMD parallel computers such as the Meiko CS-2, the CRAY T3E, and workstation clusters. The new implementation will make CARPET programs portable to a large number of MIMD machines.

Acknowledgements. This research has been partially funded by the CEC ESPRIT project no. 24,907.
References 1. B. P. Brinch Hansen, Parallel Cellular Automata: a Model for Computational Science. Concurrency: Practice and Experience, 5:425-448, 1993. 2. M. Cannataro, S. Di Gregorio, R. Rongo, W. Spataro, G. Spezzano, and D. Talia, A Parallel Cellular Automata Environment on Multicomputers for Computational Science. Parallel Computing, 21:803-824, 1995. 3. S. Di Gregorio, R. Rongo, W. Spataro, G. Spezzano, and D. Talia., A Parallel Cellular Tool for Interactive Modeling and Simulation. IEEE Computational Science K: Engineering 3:33-43, 1996. 4. J. D. Eckart, Cellang 2.0: Reference Manual. ACM Sigplan Notices, 27:107-112, 1992. 5. M.J. Flynn, Parallel Processors Were the Future and May Yet Be. IEEE Computer 29:152, 1996. 6. C. Hochberger and R. Hoffmann, CDL - a Language for Cellular Processing. In: Proc. 2nd Intern. Conference on Massively Parallel Computing Systems, IEEE Computer Society Press, 1996. 7. G. :lunger, Cellular Automaton Tool User Manual. GMD, Sankt Augustin, Germany, 1994. 8. D.B. Skillicorn and D. Talia, Models and Languages for Parallel Computation. ACM Computing Survey, 30, 1998. 9. G.Spezzano and D. Talia, A High-Level Cellular Programming Model for Massively Parallel Processing. In: Proc. 2nd Int. Workshop on High-Level Programming Models and Supportive Environments (HIPS97), IEEE Computer Society, pages 55-63, 1997. 10. 3. von Neumann, Theory of Self Reproducing Automata. University of Illinois Press, 1966.
Process Migration and Fault Tolerance of BSPlib Programs Running on Networks of Workstations

Jonathan M.D. Hill¹, Stephen R. Donaldson¹, and Tim Lanfear²
¹ Oxford University Computing Laboratory, UK.
² British Aerospace Sowerby Research Centre, UK.
Abstract. This paper describes a system that enables parallel programs written using the BSPlib communications library to migrate processes among a network of workstations. Not only does the system provide fault tolerance of BSPlib jobs, but by utilising a load manager that maintains an approximation of the global load of the system, it is possible to continually schedule the migration of BSP processes onto the least loaded machines in a network. Results are provided for an industrial electromagnetics application that show that we can achieve similar throughput on a publicly available collection of workstations as on a dedicated NOW.
1 Introduction
The Bulk Synchronous Parallel (BSP) model [14,10] views a parallel machine as a set of processor-memory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP program consists of a sequence of supersteps. Each superstep involves all of the processors and consists of three phases: (1) processor-memory pairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processor’s memories; and (3) all processors barrier synchronise. The globally consistent state that is available after the barrier not only helps when reasoning about parallel programs, but also suggests a programming discipline in which computation (and communication) is balanced across all the processes. As balance is so important to BSP, profiling tools have concentrated upon exposing imbalances to the programmer so that they can be eliminated [5,7]. However, not all imbalances that arise during program execution are caused by the program. In an environment where processors are not dedicated resources, the BSP computation proceeds at the speed of the slowest processor. This would suggest that the synchronous nature of BSP is a disadvantage compared to the more lax synchronisation regime of message passing systems such as MPI. However, most programs written using collective communications, or scientific applications such as the NAS parallel benchmarks [1] are highly synchronous in nature, and are therefore limited to the performance of the slowest running process in either BSP or MPI. Therefore, if a network user logs onto a machine that is part of a BSP job, this may have an undue effect on the entire job. This paper describes a technique that ensures a p process BSP job continually adapts itself to run on the p least loaded processors in a network consisting of P machines (p ≤ P ). D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 80–91, 1998. c Springer-Verlag Berlin Heidelberg 1998
Dedicated parallel machines can impose a global scheduling policy upon their user community such that, for example, parallel jobs do not interfere with each other in a detrimental manner. The environment that we describe is one where it is not possible to impose some schedule on the user community. The components of the parallel computation are invariably guests on other peoples machines and should not impose any restrictions on them for hosting the computation. Further, precisely because of this arrangement, the availability of the nodes and the available resources at these nodes is quite erratic and unpredictable. We adopt the philosophy that in such a situation it is reasonable to expect the parallel job to look after itself. We briefly describe the steps involved in migrating a BSP job, that has been written using the BSPlib [6] communications library, among a set of machines and the strategy used in making check-pointing and migration decisions across all machines. The first technical challenge (Section 2) describes how we capture the state of a UNIX process and restart it in the same state on another machine of the same type and operating system. The simplicity of the superstep structure of BSP programs provides a convenient point at which local checkpoints capture the global state of the entire BSP computation. This therefore enables process migration and check-pointing to be achieved without any changes to the users program. Next we describe a strategy whereby all processes simultaneously decide that a different set of machines would provide a better service (Section 3). When the BSP job decides that processes should be migrated, all processes perform a global checkpoint, they are then terminated and restarted on the least loaded machines from that checkpoint. Section 4 describes a technique for determining the global load of a system, and Section 5 describes an experiment using an industrial electro-magnetic application on a network of workstations. This demonstrates how the scientist or engineer is allowed to concentrate on the application and not on maintaining or worrying about the choice of processors in the network. Section 6 describes some related work and Section 7 concludes the paper.
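For readers unfamiliar with BSPlib, the following small program (our own sketch, not taken from the paper; the header name and the cyclic neighbour exchange are illustrative only) shows the superstep structure that the checkpointing scheme described below exploits: local computation, one-sided communication, and a barrier ending the superstep:

    #include <stdio.h>
    #include "bsp.h"

    int main(void)
    {
        bsp_begin(bsp_nprocs());

        int left = -1, mine = bsp_pid();
        bsp_push_reg(&left, sizeof left);   /* make `left' remotely writable   */
        bsp_sync();                         /* registration takes effect       */

        /* one superstep: write our pid into the right-hand neighbour's `left' */
        bsp_put((bsp_pid() + 1) % bsp_nprocs(), &mine, &left, 0, sizeof mine);
        bsp_sync();                         /* barrier: globally consistent state */

        printf("process %d received %d\n", bsp_pid(), left);
        bsp_end();
        return 0;
    }

It is the globally consistent state at the second bsp_sync that makes the barrier a natural point for a checkpoint.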
2 Check-pointing and Restarting Single Processes
BSPlib provides a simple API for inter-processor communication in the context of the BSP model. This simple interface has been implemented on four classes of machine: (1) distributed memory machines where the implementation uses either proprietary message passing libraries or MPI; (2) Distributed memory machines where the implementation uses primitive one sided communication, for example the Cray SHMEM library of the T3E; (3) shared memory multi-processors where the implementation uses either proprietary concurrency primitives or System V semaphores; and (4) Networks of workstations where the implementation uses TCP/IP or UDP/IP. In this paper we concentrate upon check-pointing programs running on the network of workstations version of the library. Unlike other checkpointing schemes for message passing systems (See Section 6), by choosing to checkpoint at the barrier synchronisation that delimits supersteps, because there
is a globally consistent state upon exiting the barrier (where all communication is quiesced), the task of performing a global checkpoint reduces to the task of check-pointing all the processes at the local process level. All that is required to perform a local checkpoint is to save all program data that is active. Unfortunately, because data (i.e., the state) can be arbitrarily dispersed amongst the stack, heap and program text, capturing the state of a running program is not as simple as it would first appear. A relatively straightforward solution in a UNIX environment is to capture an image of the running process and create an executable which contains the state of the modified data section (including any allocated heap storage) and a copy of the stack. When the check-pointed executable is restarted the original context is restored and all BSPlib supporting IPC connections (pipes and sockets) are re-established before the computation is allowed to proceed. All this activity is transparent to the programmer as it is performed as part of the BSPlib primitives. Furthermore, by restricting the program to the semantics of BSPlib, no program changes are required. The process of taking a checkpoint involves making a copy of the stack on the heap, saving the current stack pointer and frame pointer registers, and saving any additional state information (for example, on the SPARC the register windows need to be flushed onto the stack before it is saved, and the subroutine return address needs to be saved as it is stored in a register; in contrast, on the Intel X86 architecture, all necessary information is already stored on the stack). The executable that captures this state information is built using the unexec() function which is distributed as part of Emacs [11]. The use of unexec() is similar to its use (or the use of undump) in Emacs, LaTeX (which build executables containing initialised data structures) and Condor which also performs checkpointing [4]. However, the state saving in Condor captures the point of execution and the stack height using the standard C functions setjmp() and longjmp() which only guarantee far jumping into activation records already on the stack and within the same process instance. Instead, we capture the additional required information based on the concrete semantics of the processor. To restart a process and restore its context, the restart routine adjusts the stack pointer to create enough space on the stack so that the saved stack can be copied to its original address and restores any saved registers.
3 Determining when to Migrate Processes
As mentioned above, our philosophy is that the executing job be sensitive to the environment in which it is executing and it is the job, and not an external scheduler, that makes appropriate scheduling decisions. For the job to make an informed decision, some global resource information needs to be maintained. The solution we have adopted is that there are daemons running on each machine in the network which maintain local approximations to the global load. The accuracy of these approximations is discussed in the next section. Here we assume that each machine contains information on the entire network which is no more than a few minutes old with a high probability.
When a BSP job requests a number of processes, the local daemon is queried for a suitable list of machines on which to execute the job (the daemon responds so that the job may be run on the least loaded machines). In order that the decisions are not too fickle, the five minute load averages are used. Also, it is a requirement that not too much network traffic be generated to maintain a reasonable global state. Since the five minute load averages are being used, it is not too important that entries in the load table become slightly out of date as the wildest swings in the load averages take a couple of minutes to register in the five minute load average figures. Let Gi be the approximation to the global load on machine i, then given P machines, the true global load is G = G1 · · ·GP ; where is used to merge the approximations from two machines. Given a BSP job running on p processors with machine names1 j in the set {i1 , . . . , ip }, we use the approximation G = Gi1 · · · Gip which is better than any of the individual approximations with a high probability. G is a sequence of machine names sorted in decreasing priority order (based on load averages, number of CPUs and their speeds). If the top set of p entries of G is not {i1 , . . . , ip } then an alternate and better assignment of processes to machines exists (call this predicate fn). In order not to cause processes to thrash between machines, a measure x of the load of the job (where 0 ≤ x ≤ 1) is added to the load averages of all machines not involved in the BSP computation before the predicate fn is applied. This anticipates the maximum increase in the load of a machine when a process is migrated to it. Any observed increase in load greater than x is therefore caused by additional external load. Our aim is to ensure that the only overhead in process migration is the time taken to write p instances of the check-pointed program to disk. Therefore, we require that the calculation that determines when to perform process migration does not unduly impact the computation or communication performance of BSPlib. We need to solve fn(G ) = fn(Gi1 · · · Gip ) either on a superstep by superstep basis or every N supersteps. However, the result can be obtained without first merging the global load approximations. This can be done by each processor independently computing its load approximations Gi and checking that it is amongst the top p after adding xj to the loads of machines not involved in the current BSP job; where 0 ≤ xj ≤ 1 is the contribution of the BSP process on machine j, to the load average on that machine. This calculation can be performed on entry to the superstep T seconds after the last checkpoint (i.e., this checking does not have to be performed in synchrony). The Boolean result from each of the processors is reduced with the or-operator to ensure that all processors agree to checkpoint during the same superstep. In the TCP/IP and UDP/IP implementations of BSPlib, this is piggy-backed onto a reduction that is used to globally optimise communication [3]. Therefore if a checkpoint is not necessary, there is no substantial impact on the performance of BSPlib.
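As a concrete illustration of this local test (a hypothetical sketch of the predicate fn, not the BSPlib implementation; the data layout and function name are invented), each process could scan its own copy of the load table as follows:

    #include <stdlib.h>

    struct entry {
        double load;   /* five minute load average (weighted by CPUs and speed) */
        int    in_job; /* non-zero if this machine already runs a BSP process   */
    };

    static int by_load(const void *a, const void *b)
    {
        const struct entry *x = a, *y = b;
        return (x->load > y->load) - (x->load < y->load);
    }

    /* Returns non-zero if a machine outside the job would now rank among the
     * p best, i.e. a better assignment exists.  `tab' is a scratch copy of the
     * local load table for all P machines; x is the job's own load contribution. */
    int better_placement_exists(struct entry *tab, int P, int p, double x)
    {
        for (int i = 0; i < P; i++)
            if (!tab[i].in_job)
                tab[i].load += x;          /* anticipate the cost of migrating */
        qsort(tab, P, sizeof tab[0], by_load);
        for (int i = 0; i < p; i++)
            if (!tab[i].in_job)
                return 1;
        return 0;
    }

The boolean results of this check are what the or-reduction combines so that all processes agree on whether to checkpoint in the same superstep.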
1 We distinguish between machines and processors, as each machine may contain a number of processors, each of which runs multiple processes.
Fig. 1. Markov chain considering only single direct updates
4 Determining the Global Load of a System
The load daemons use a protocol in which the locally held global load states are periodically sent to k randomly chosen daemons running on other machines. This update period is uniformly and randomly distributed with each machine being independent. When a load daemon receives an update, it responds by merging the locally held load table and sending the resultant table back to the sender. The sender then merges the entries of the returned table with the entries of the locally held table. For purposes of simplifying the analysis, the updates are assumed to happen at fixed intervals and in a lockstep fashion. We also assume that the processors do not respond with the merged table, but merely update their tables by merging in the update requests. The analysis that follows always provides an upper bound for the actual protocol used. If each processor sent out a message at the end of each interval to all the other processors, this would require P 2 messages to maintain the global state. A job requiring p ≤ P processes could contact P machines and arrive at an optimal choice of processors with considerably fewer messages provided that jobs did not start very often. However, with a large network of machines, the rate of jobs starting and the number of machines P would quickly lengthen the delay in scheduling a BSP job. By maintaining a global processor utilisation state at each of the machines, starting a job only involves contacting the local daemon when choosing a set of p processors and thus need not contribute to network traffic. Even once the ordering of processors has been selected, the problem of over assigning work to the least loaded machines can be avoided by having those machines reject the workload request based on knowledge built up locally. The algorithm then simply tries the machine with the next highest priority. The quality of the decision for the optimal set of machines depends on the ages of the entries in the distributed load averages tables. If each machine uniformly and randomly chooses a partner machine at the end of each interval and sends its load average value to that machine, then the mean age of each entry in a distributed load average table can be calculated by considering the discrete time Markov chain shown in Figure 1. In this case there would only be p messages at the end of each interval, but the age distribution {πi : i ∈ N}, and the mean age µ are given by:
    πi = (1/(p − 1)) · ((p − 2)/(p − 1))^i        (1)
Fig. 2. Markov chain when indirect updates are allowed
    µ = Σ_{i=1}^{∞} i πi = p − 2        (2)
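(For completeness, a short derivation of (2) that is ours rather than the paper's: writing q = (p − 2)/(p − 1), equation (1) says πi = (1 − q) q^i, so µ = (1 − q) Σ_{i≥1} i q^i = (1 − q) · q/(1 − q)^2 = q/(1 − q) = p − 2.)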
By sending out messages more frequently, or sending out k messages at the end of each interval, the mean age can be reduced to O(p/k), but this increases the traffic and the required number of virtual circuits to pk. By allowing each machine to exchange all of its load average information with k other machines at the end of each interval, a significant reduction in the mean age of the entries can be achieved with pk circuits. This scheme allows machines to choose between the most current load average figures, even if they were obtained indirectly from other machines. Figure 2 shows the corresponding Markov chain. In this stochastic process, transitions into state 0 can only arise out of direct updates, that is, a machine directly updating its entry in a remote table. The distribution, {πi : i ∈ N}, of the ages of the entries in the load average table is given by the recurrence:

    πi = k/(p − 1),                                                          if i = 0
    πi = πi−1 · P{no useful updates} + Σ_{j=i}^{∞} πj · P{min age is i},     otherwise        (3)

Figure 3 shows the mean table entry ages of the three strategies when k = 1, 3, 6 against P. As P is given on a log scale, it is clear from the figure that while the first two strategies give a mean age µ that is O(P), the third strategy (allowing indirect updates) gives a mean age µ that is O(log P). If we replace the discrete times of the Markov chains with update intervals, t, the distributions above give the mean age at the beginning of each of the intervals. The figure shows that in order to bound the mean age to, say, five minutes we must ensure that:

    t (µ + 1/2) ≤ 5 minutes,   or   t ≤ 10/(2µ + 1)        (4)

(the term µ + 1/2 is the expected age, in update periods, of an entry inspected at a uniformly random point within an interval). Therefore when p = 32, t should be less than or equal to 3 1/3 minutes. The line marked "experimental results" shows the actual bounds on the algorithm for t = 4 minutes for all values of P. The experimental results are better than the upper bounds of the analysis as the updates do not occur in lock-step fashion, and a shorter sequence of updates is therefore possible. Also, the actual protocol reuses the established circuit to send the merged tables back to the sender daemon;
this in effect allows the system to perform twice as many updates with fewer circuits.

[Figure 3 plot: mean table-entry age in update periods against the number of machines, with curves for direct updates (k = 1 and k = 3), indirect updates (k = 3 and k = 6), the experimental results, and the merged P\p entries for k = 3 and p = 4.]

Fig. 3. Mean ages achieved by each of the three strategies and the merged P\p entries.

As described in Section 3, by having the p processes involved in a BSP computation merge their load average tables before choosing where to migrate the processes, the current five minute load averages for the p processors executing the job are obtained, and the ages of the load average table entries for the other P − p machines have a distribution {π'i : i ∈ N} where

    π'i = Σ_{x=1}^{p} C(p, x) (πi)^x (1 − Σ_{t=0}^{i} πt)^{p−x}        (5)

Figure 3 includes the resulting mean age from this distribution for p = 4 against the total number of machines P, and compares it with the mean derived from the original distribution {πi : i ∈ N}. It is clear from the figure that as p increases, the mean age of the data from the merged tables decreases, until the mean age is zero when p = P.
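(A reading of (5), ours rather than the paper's: the merged table keeps, for each of the remaining machines, the youngest of the p copies of its entry, so π'i is the distribution of the minimum of p independent ages, π'i = (Σ_{t≥i} πt)^p − (Σ_{t>i} πt)^p, which expands into the binomial sum of (5).)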
5 Experimental Results for an Electro-magnetics Application
The code EMMA T:FE3D (part of the British Aerospace EMMA electromagnetic software suite), uses the finite element time domain method for sol-
Table 1. Execution time in seconds for the electro-magnetics simulation

  p   ypcat order   daemon order   daemon order +      daemon order +
                                   forced migration    selective migration
  2      3255            623             841                  658
  4      3855           1155            1916                 1092
  8      2968           1161              -                    -
ving Maxwell’s equations in three dimensions. The finite element approach offers several advantages over other full wave solution methods (e.g., finite difference time domain, method of moments). A volume of space around the target is filled with an unstructured mesh of tetrahedra which can conform accurately to the geometry of the object being analysed and, because of the unstructured nature of the mesh, many small elements can be introduced in regions where the solution has rapid spatial variation. A time marching algorithm using a Taylor-Galerkin method is used to advance the fields through time to simulate the propagation of a wave through the mesh. The CFD community have developed considerable knowledge and expertise in unstructured mesh generation and finite element solvers which have been exploited for solving electro-magnetic problems. Applications areas for electro-magnetic solvers in the aerospace industry include electro-magnetic scattering, analysis of electro-magnetic compatibility and hazards, antenna design, and modelling of effects of lightning strike. Table 4 shows the results from the electro-magnetics application running on various numbers of processors. The four columns of execution time show the following: (1) a job running on a random choice of machines. These may be highly loaded or may not have fast processors; (2) a job that is initiated on the p best machines (i.e., the fastest least loaded machines); (3) a job that is started on the best p machines, but is check-pointed every two minutes; and (4) a job that is started on the best p machines and checks every two minutes whether it is beneficial to check-point and migrate. The first experiment is an approximation to a job that encounters poor service during execution. As can be seen from the dramatic decrease in execution times in the second experiment it is always beneficial to start a job on the most powerful unloaded machines. In the situation where the chosen processors have the same power, and the job is long running, then the second experiment will degrade to the performance of the first. The third experiment quantifies the overhead in check-pointing. As ten check-points were performed at p = 4 the increase in execution time shows that a local checkpoint takes 19 seconds to write a seven megabyte image to disc. The fourth experiment shows that there is little overhead in checking whether a check-point is necessary. As it costs 19p seconds to perform a migration, it is only beneficial to migrate in situations where the processor usage is not too erratic, thus allowing the job to recoup the cost of the migration on the set of processors that were migrated to. If jobs are long running and compute bound then there is a lot of potential for regaining lost ground due to having to migrate from a loaded machine.
[Figure 4 plot: total and available Mflops of each workstation, machines ordered by decreasing power.]
Fig. 4. Moore’s law: a profile of the computational power of a collection of workstations The results shown in Table 4 are from an electro-magnetics experiment that simulates a field around a sphere. As can be seen from the figure, no parallel speedup was achieved when increasing the size of the NOW; this was due to the dominance of communication over computation in this small test case. Larger realistic test cases at BAe have shown linear speedup up-to sixteen processes as computation begins to dominate. This work is based on the assumption that the available cycles on the machines changes continuously over time. However, with a large resource acquired over time, the machines are unlikely to be homogeneous in available power. Figure 4 shows a graph of available computing power on each workstation in Oxford University Computing Laboratory against available power at a particular instance in time. We express available power by the formula (n − L)s; where n is the number of processors in a machine, L is the load average on that machine and s is the Mflops/s rating of a single processor. The graph demonstrates Moore’s Law in the purchase of workstations over time, i.e., to the right of the graph, there are a large number of low powered aging workstations, whereas toward the left of the graph, there are few high performance machines procured over the last six months. When choosing to schedule a p process parallel job, either a homogeneous set of unloaded (slow) machines can be used, or the best p. The policy we adopt is that although the unloaded machines ensure little interference, they only provide a fraction of the power of the best machines. However, running on the popular powerful machines, can also make a job susceptible to
low throughput, as the available cycles at a node can vary dramatically over time. For example, the figure shows that the fourth machine from the left has a peak performance of 50 Mflops/s, yet it is so loaded that there are no free cycles for parallel jobs. In summary, the erratic but powerful machines to the left of the graph are best suited to computationally intensive parallel jobs, yet they are the very machines that require process migration.
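The choice of the best p machines follows directly from the (n − L)s measure. The sketch below is purely illustrative: the workstation figures are invented and the function names are ours, not part of BSPlib or its load manager.

  def available_mflops(cpus, load_avg, mflops_per_cpu):
      # available power (n - L) * s, clamped at zero for overloaded machines
      return max(0.0, (cpus - load_avg) * mflops_per_cpu)

  workstations = {               # name: (cpus, load average, Mflops/s per cpu)
      "ws01": (1, 0.1, 50.0),
      "ws02": (2, 1.9, 45.0),
      "ws03": (1, 1.2, 50.0),    # powerful but so loaded that nothing is free
      "ws04": (1, 0.0, 12.0),    # old and slow, but idle
  }

  def best_p(p):
      ranked = sorted(workstations,
                      key=lambda w: available_mflops(*workstations[w]),
                      reverse=True)
      return ranked[:p]

  print(best_p(2))               # ['ws01', 'ws04']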
6 Related Work
The process migration work described in this paper, also provides fault tolerance if the mean time between failure is greater than the rate at which processes migrate/checkpoint. Existing work in this area has tended to concentrate on fault tolerance using redundant computation. For example Nibhanupudi and Szymanski [9] minimise the slowdown of a BSP job when external loads are applied to machines in a network by replicating computation on a number of machines. At the end of a superstep, the results of the fastest of the replicated jobs is used to form the global state of the system. Although their system allows the mean time between failure to be less than ours due to the replicated jobs, their prototype system requires user annotation of the data structures to be included in a checkpoint, and they assume that there won’t be a machine failure during communication. In contrast, the fault tolerance in our system is transparent to the user, and has no restrictions on when faults can occur. Also, our approach doesn’t suffer from the considerable overhead that would be incurred to implement a process replication scheme. As already noted, the only overhead we incur is the time taken to write the p checkpoints to disk. A similar approach to ours is that of Kaashoek et al. [8] where fault tolerance of Orca programs is provided on top of the Amoeba distributed operating system. Their approach is complicated by the fact that they have to determine locally when communication is quiescent so that a stable checkpoint can be taken. Other parallel check-pointing systems for MPI [12] and PVM [13] also suffer from this problem, as a checkpoint can only be performed if there is no communication in transit when each process performs a local checkpoint to capture the global state [2]. Fortunately, the check-pointing regime described here is far simpler than any of the approaches used in message passing systems as opportunities for a global checkpoint naturally arise out of the superstep structure of BSP programs. The process migration facilities provided for MPI [12] and PVM have usually been developed on top of the check-pointing and batch scheduling facilities provided by Condor [4] and LSF[15].
7 Conclusions
We have shown that it is possible to perform fault tolerance and process migration of BSP programs in a transparent way on a network of workstations. By paying careful attention to the design of a distributed load manager, it is possible to determine the global load of a system with minimal impact on network
traffic. This, in conjunction with the pro-active manner in which BSP jobs migrate between machines, enables a system that is unobtrusive to non-BSP users, whilst providing the best of the resource as a whole.
References 1. David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. Nas parallel benchmarks 2.0. Technical Report 95-020, NAS Applied Research Branch (RNR), December 1995. 80 2. R. Baldoni, J. M. H´elary, A. Mostefaoui, and M. Raynal. A communicationinduced checkpointing protocol that ensures the rollback-dependency trackability property. In Proc. of the 27th IEEE Symposium on Fault-Tolerant Computing Systems (FTCS), pages 68–77, Seattle, WA, June 1997. IEEE. 89 3. Stephen R. Donaldson, Jonathan M. D. Hill, and David B. Skillicorn. Predictable communication on unpredictable networks: Implementing BSP over TCP/IP. In EuroPar’98, LNCS, Southampton, UK, September 1998. Springer-Verlag. 83 4. D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide flock of condors : Load sharing among workstation clusters. Future Generations of Computer Systems, 12, 1996. 82, 89 5. Jonathan M. D. Hill, Stephen Jarvis, Constantinos Siniolakis, and Vasil P. Vasilev. Portable and architecture independent parallel performance tuning using a callgraph profiling tool. In 6th EuroMicro Workshop on Parallel and Distributed Processing (PDP’98), pages 286–292. IEEE Computer Society Press, January 1998. 80 6. Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSPlib: The BSP Programming Library. Parallel Computing, to appear 1998. see www.bsp-worldwide.org for more details. 81 7. Jonathan M.D. Hill, Stephen Jarvis, Constantinos Siniolakis, and Vasil P. Vasilev. Analysing an sql application with a bsplib call-graph profiling tool. In EuroPar’98, LNCS, Southampton, UK, September 1998. Springer-Verlag. 80 8. M. F. Kaashoek, R. Michiels, H. E. Bal, and A. S Tanenbaum. Transparent faulttolerance in parallel orca programs. In Proc. Symp. on Experiences with Distributed and Multiprocessor Systems III, pages 297–312, 1992. 89 9. Mohan V. Nibhanupudi and Boleslaw K. Szymanski. Adaptive parallelism in the bulk synchronous parallel model. In EuroPar’96, number 1124 in Lecture Notes in Computer Science, pages 311–318, Lyon, France, aug 1996. Springer-Verlag. 89 10. David Skillicorn, Jonathan M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249–274, Fall 1997. 80 11. Richard M. Stallman. Emacs: The extensible, customizable, self-documenting display editor. AI memo 519A, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), 1979. 82 12. G. Stellner. CoCheck: checkpointing and process migration for MPI. In IEEE, editor, Proceedings of IPPS ’96. The 10th International Parallel Processing Symposium: Honolulu, HI, USA, 15–19 April 1996, pages 526–531, 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA, 1996. IEEE Computer Society Press. 89 13. Kasidit Chanchio Xian-He Sun. Efficient process migration for parallel processing on non-dedicated networks of workstations. Technical Report TR-96-74, Institute for Computer Applications in Science and Engineering, December 1996. 89
14. Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990. 80 15. Jingwen Wang, Songnian Zhou, Khalid Ahmed, and Weihong Long. LSBATCH: A distributed load sharing batch system. Technical Report CSRI-286, Computer Systems Research Institute, University of Toronto, April 1993. 89
Task Parallel Skeletons for Irregularly Structured Problems
Petra Hofstedt*
Department of Computer Science, Dresden University of Technology
hofstedt@inf.tu-dresden.de
Abstract. The integration of a task parallel skeleton into a functional programming language is presented. Task parallel skeletons, like other algorithmic skeletons, represent general parallelization patterns. They are introduced into otherwise sequential languages to enable the development of parallel applications. Into functional programming languages, they are naturally integrated as higher-order functional forms. We show, by means of the branch-and-bound example, that the introduction of task parallel skeletons into a functional programming language is advantageous with regard to the comfort of programming, while achieving good computational performance at the same time.
1 Introduction
Most parallel programs were and are written in imperative languages. In m a n y of these languages, the programmer has to use low-level constructs to express parallelism, synchronization and communication. To support platform-independent development of parallel programs standards and systems have been invented, e.g. MPI and PVM. In functional languages, such supporting libraries have been added in rudimentary form only recently. Hence, the advantages of functional programs, such as their ability to state powerful algorithms in a short, abstract and precise way, cannot be combined with the ability to control the parallel execution of processes on parallel architectures. Our aim is to remedy that situation. A functional language has been extended by constructs for data and task parallel programming. We want to provide comfortable tools to exploit parallelism for the user, so that she is burdened as few as possible with communication, synchronization, load balancing, data and task distribution, reaching at the same time good performance by exploitation of parallelism. The extension of functional languages by algorithmic skeletons is a promising approach to introduce data parallelism as well as task parallelism into these languages. As demonstrated for imperative languages, e.g. by Cole [3], there are several approaches how to introduce skeletons into functional languages as higher-order * The work of this author was supported by the 'Graduiertenkolleg Werkzeuge zum effektiven Einsatz paralleler und verteilter Rechnersysteme' of the German Research Foundation (DFG) at the Dresden University of Technology.
677 parallel forms. However, most authors concentrated on data parallel skeletons, e.g. [1], [4], [5]. Hence, our aim has been to explore the promising concept of task parallel skeletons for functional languages by integrating them into a functional language. A major focus of our work is on reusability of methods implemented as skeletons. Algorithmic skeletons are integrated into otherwise sequential languages to express parallelization patterns. In our approach, currently skeletons are implemented in a lower-level imperative programming language, but presented as higher-order functions in the functional language. Implementation-details are hidden within the skeletons. In this way, it is possible to combine expressiveness and flexibility of the sequential functional language with the efficiency of parallel special purpose algorithms. Depending on the type of parallelism exploited, skeletons are distinguished in data and task parallel ones. Data parallel skeletons apply functions on multiple data at the same time. Task parallel skeletons express which elements of a computation may be executed in parallel. The implementation in the underlying system determines the number and the location of parallel processes that are generated to execute the task parallel skeleton. 2
The Branch-and-Bound Skeleton in a Functional Language
Branch-and-bound methods are systematic search techniques for solving discrete optimization problems. Starting with a set of variables with a finite set of discrete values (a domain) assigned to each of the variables, the aim is to assign a value of the corresponding domain to each variable in such a way that a given objective function reaches a minimum or a maximum value and several constraints are satisfied. First, mutually disjunct subproblems are generated from a given initial problenl by using an appropriate branching rule (branch). For each of the generated subproblems an estimation (bound) is computed. By means of this estimation, the subproblem to be branched next is chosen (select) and decomposed (branched). If the chosen problem cannot be branched into further subproblems, its solution (if existing) is an optimal solution. Subproblems with non-optimal or inadmissible variable assignments can be eliminated during the computation (elimination). The four rules b r a n c h , b o u n d , select and elimi n a t i o n are called basic rules. The principal difference between parallel and sequential branch-and-bound algorithms lies in the way of handling the generated knowledge. Subproblems generated from problems by decomposition and knowledge about local and global optima belong to this knowledge. While with sequential branch-and-bound one processor generates and uses the complete knowledge, the distribution of work causes a distribution of knowledge, and the interaction of the processors working together to solve the problem becomes necessary. Starting point for our implementation was the functional language DFS ('Datenparallele funktionale Sprache' [6]), which already contained data parallel skeletons for distributed arrays. DFS is an experimental programming lan-
678 guage to be used on parallel computers. The language is strict and evaluates DFS-programs in a call-by-value strategy accordingly. To give the user the possibility to exploit parallelism in a very comfortable way, we have extended the functional language DFS by task parallel skeletons. One of them was a branch-and-bound skeleton. The user provides the basic rules b r a n c h , b o u n d , select and e l i m i n a t i o n using the functional language. Then she can make a function call to the skeleton as follows: branch~bound branch bound select elimination problem.
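For readers unfamiliar with the interface, the following purely sequential Python sketch (our own illustration, not the DFS implementation, which is written in C and distributes the open subproblems as described below) shows how the four user-supplied rules cooperate.

  def branch_and_bound(branch, bound, select, eliminate, problem):
      best = None                       # best complete solution found so far
      open_set = [problem]
      while open_set:
          sub = select(open_set)        # select: pick the next subproblem
          open_set.remove(sub)
          children = branch(sub)        # branch: decompose into subproblems
          if not children:              # no further branching: a candidate solution
              if best is None or bound(sub) < bound(best):
                  best = sub
              continue
          for child in children:
              # elimination: drop children that cannot beat the current best
              if not eliminate(child, best):
                  open_set.append(child)
      return best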
A parallel abstract machine (PAM) represents the runtime environment for DFS. The PAM consists of a number of interconnected nodes communicating by messages. Each node consists of three units: the message administration unit, which handles the incoming and outgoing messages, the skeleton unit, which is responsible for skeleton processing, and the reduction unit, which performs the actual computation. Skeletons are the only source of parallelism in the programs. To implement the parallel branch-and-bound skeleton, several design decisions had to be made with the objective of good computation performance and high comfort. In the following, the implementation is characterized according to Trienekens' classification ([7]) of parallel branch-and-bound algorithms. Table 1. Classification by Trienekens knowledge sharing
knowledge use dividing the work synchronicity basic rules
global/local knowledge base complete/partial knowledge base update strategy access strategy reaction strategy units of work load balancing strategy synchronicity of each process branch, b o u n d , select, e l i m i n a t i o n
Each process uses a local partial knowledge base containing only a part of the complete generated knowledge. In this way, the bottleneck arising from the access of all processes to a shared knowledge base is avoided, but at the expense of the actuality of the knowledge base. A process stores newly generated knowledge at its local knowledge base only; if a local optimum has been computed, the value is broadcasted to all other processes (update strategy). When a process has finished a subtask, it accesses its local knowledge base, to store the results at the knowledge base and to get a new subtask to solve. If a process receives a message containing a local optimum, the process compares this optimum with its actual local optimum and the bounds of the subtasks still to be solved (access strategy). A process receiving a local optimum from another process first finishes its actual task and then reacts according to the received
679 message (reaction strategy). This may result in the execution of unnecessary work. But the extent of this work is small because of the high granularity of the distributed work. A unit of work consists of branching a problem into subproblems and computing the bounds of the newly generated subproblems. The load balancing strategy is simple and suited to the structure of the computation, because new subproblems are generated during computation. If a processor has no more work, it asks its neighbours one after the other for work. If a processor receives a request for work, it returns a unit of work - if one is available - to the asking processor. The processor sends that unit of work which is nearest to the root of the problem tree and has not been solved yet. The implemented distributed algorithm works asynchronously. The basic rules are provided by the user using the functional language.
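A minimal sketch of the access and reaction strategy on the local knowledge base might look as follows (Python, our own illustration; a minimisation problem and the dictionary layout are assumptions, not the implementation):

  def on_remote_optimum(local, remote_optimum):
      # update strategy: adopt the better optimum received from another process
      if remote_optimum < local["optimum"]:
          local["optimum"] = remote_optimum
      # prune queued subtasks whose bound can no longer beat the optimum
      local["open"] = [t for t in local["open"] if t["bound"] < local["optimum"]]

  kb = {"optimum": 40, "open": [{"bound": 25}, {"bound": 37}, {"bound": 42}]}
  on_remote_optimum(kb, 35)
  print(kb)    # {'optimum': 35, 'open': [{'bound': 25}]}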
3 Performance Evaluation
To evaluate the performance of task parallel skeletons, we implemented branchand-bound for the language DFS as a task parallel skeleton in C for a GigaClustar GCel1024 with 1024 transputers T805(30 MHz) (each with 4 MByte local memory) running the operating system Parix. Performance measurements for several machine scheduling problems - typical applications of the branch-and-bound method - were made to demonstrate the advantageous application of skeletons. In the following, three cases of a machine scheduling problem for 2 machines and 5 products have been considered. The number of possible orders of machine allocation is 5! = 120. This very small problem size is sufficient to demonstrate the consequences for the distribution of work and the computation performance, if the part of the problem tree, which must be computed, has a different extent. In case (a) the complete problem tree had to be generated. In case (b) only one branch of the tree had to be computed. Case (c) is a case where a larger part of the problem tree than in case (b) had to be computed. Each of the machine scheduling problems has been defined first in the standard functional way and second by use of the branch-and-bound skeleton. The measurements were made using different numbers of processors. First we counted branching steps, i.e. we measured the average overall number of decompositions of subproblems of all processors working together in the computation of the problem. These measurements showed the extend of the computed problem tree working sequentially and parallelly. It became obvious that the overall number of branching steps is increasing in the case of a small number of to be branched subproblems. The local partial knowledge bases and the asynchronous behaviour of the algorithm cause the execution of unnecessary work. If the whole problem tree had to be generated, we observed a decrease of the average overall number of branching steps with increasing number of processors. This behaviour is called acceleration anomaly ([2]). Acceleration anomalies occur if the search tree generated in the parallel case is smaller than the one generated in the sequential case. This can happen in the parallel case because of
680 branching several subproblems at the same time. Therefore it is possible to find an optimum earlier than in the sequential case. Acceleration anomalies cause a disproportional decrease of the average maximum number of branching steps per processor with increasing number of processors, a super speedup.
Table 2. Average overall number of reduction steps

        1 proc.     1 proc.    4 proc.    6 proc.    8 proc.    16 proc.
        functional  skeleton   skeleton   skeleton   skeleton   skeleton
(a)     17197       20727      1848.2     1792.9     1074.9     709.3
(b)     54          47         138.0      218.6      299.0      451.6
(c)     138         118        179.8      261.8      258.6      484.8

Table 3. Average maximum number of reduction steps per processor

        1 proc.     1 proc.    4 proc.    6 proc.    8 proc.    16 proc.
        functional  skeleton   skeleton   skeleton   skeleton   skeleton
(a)     17197       20727      896.0      710.2      390.5      113.1
(b)     54          47         43.2       44.7       47.0       46.4
(c)     138         118        65.5       66.1       60.9       51.7
To compare sequential functional programs with programs defined by means of skeletons, we counted reduction steps. The reduction steps include besides branching a problem, the computation of a bound of the optimal solution of a subproblem, the comparison of these bounds for selection and elimination of a subproblem from a set of to be solved subproblems, and the comparison of bounds to determine an optimum. Table 2 shows the average overall number of reduction steps of all processors participating in the computation. In Table 3 the average maximum numbers of reduction steps per processor are given. Table 2 and Table 3 clearly show the described effect of an acceleration anomaly. Because the set of reduction steps contains comparison steps of the operation 'selection of subproblems from a set of to be solved subproblems', the distribution of work causes a decrease of the number of comparison steps for this operation at each processor. Looking at both Table 2 and Table 3 it becomes apparent that the numbers of reduction steps in the sequential cases of (a), (b), and (c) of the computation of the problem first defined in the standard functional way and second using the skeleton differ. That is caused by different styles of programming in functional and imperative languages. In case (a) an obvious decrease of the average maximum number of reduction steps per processor (Table 3) caused by the distribution of the subproblems onto several processors is observable. At the same time the average overall number of reduction steps (Table 2) is also decreasing
681 as explained before. The distribution of work onto several processors yields a large increase of efficiency in cases when a large part of the problem tree must be computed. In case (b) the average maximum number of reduction steps per processor nearly does not change while the overall number of reduction steps is increasing, because, firstly, subproblems, which are to be branched, are distributed, and secondly, a larger part of the problem tree is computed. Because in case (b) the solution can be found in a short time, working parallelly as well as sequentially, the use of several processors produces overhead only. In case (c) the same phenomena as in case (b) are observable. Moreover, the average maximum number of reduction steps decreases to nearly 50% in case of parallel computation in comparison to the sequential computation.
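As a worked example of this acceleration anomaly, the case (a) figures of Table 3 give a "speedup" in reduction steps far above the processor count:

  # maximum per-processor work: 20727 steps on 1 processor (skeleton version)
  # versus 113.1 steps on 16 processors
  print(20727 / 113.1)    # about 183, i.e. a super speedup on 16 processors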
4 Conclusion
The concept, implementation, and application of task parallel skeletons in a functional language were presented. Task parallel skeletons appear to be a natural and elegant extension to functional programming languages. This has been shown using the language DFS and a parallel branch-and-bound skeleton as an example. Performance evaluations showed that the implemented skeleton yields better performance when finding solutions for a machine scheduling problem, especially if a large part of the problem tree has to be generated. Even when only a smaller part of the problem tree has to be computed, a distribution of work is advantageous. Acknowledgements. The author would like to thank Herbert Kuchen and Hermann Härtig for discussions, helpful suggestions and comments.
References 1, Botorog, G.H., Kuchen, H.: Efficient Parallel Programming with Algorithmic Skeletons. In: Boug, L. (Ed.): Proceedings of Euro-Par'96, Vol.1. LNCS 1123. 1996. 2. de Bruin, A., Kindvater, G.A.P., Trienekens, H.W.J.M.: Asynchronous Parallel Branch and Bound and Anomalies. In: Ferreira, A.: Parallel algorithms for irregularly structured problems. Irregular '95. LNCS 980. 1995. 3. Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press. 1989. 4. Darlington, J., Field, A.J., Harrison, P.G., Kelly, P.H.J., Sharp, D.W.N., Wu, Q., While, R.L.: Parallel Programming Using Skeleton Functions. In: Bode, A. (Ed.): Parallel Architectures and Languages Europe : 5th International PARLE Conference. LNCS 694. 1993. 5. Darlington, J., Guo, Y., To, H.W., Yang, J.: Functional Skeletons for Parallel Coordination. In: Haridi, S. (Ed.): Proceedings of Euro-Par'95. LNCS 966. 1995. 6. Park, S.-B.: lmplementierung einer datenparallelen funktionalen Programmiersprache auf einem Transputersystem. Diplomarbeit. RWTH Aachen 1995. 7. Trienekens, H.W.J.M.: Parallel Branch and Bound Algorithms. Dissertation. Universitgt Rotterdam 1990.
Synchronizing Communication Primitives for a Shared Memory Programming Model
Vladimir Vlassov and Lars-Erik Thorelli
Royal Institute of Technology, Electrum 204, S-164 40 Kista, Sweden
{vlad, le}@it.kth.se
Abstract. In this article we concentrate on the semantics of operations on synchronizing shared memory that provide useful primitives for shared memory access and update. The operations are part of a new shared-memory programming model called mEDA. The main goal of this research is to provide a flexible, effective and relatively simple model of inter-process communication and synchronization via synchronizing shared memory.
1 Introduction There is a belief [7], and we share it, that dataflow computing in combination with multithreading is a general answer on two fundamental issues in multiprocessing: memory latency and synchronization [3]. Synchronization of parallel processes [11] is a subject of considerable interest for various research group, both in industry and academia [8, 9, 12-14]. This article presents some aspects of a new shared-memory programming model that is based on dataflow principles and provides a unified approach for communication and synchronization of parallel processes via synchronizing shared memory. The model is called mEDA, Extended Dataflow Actor, where the prefix "m" is used to distinguish a new revision of the model from the EDA model presented by Wu [17, 15]. The model was inspired from different computation and memory models, such as the data flow model [6], the Actor model [1], 1-structures [3] and M-structures [5]. In this article we concentrate on the semantics of synchronizing shared memory operations introduced in the model. The model states that a shared-memory operation (store, fetch) holds a synchronization type which specifies special synchronization requirements for both the accessing process and the accessed cell. In this way, common patterns of parallel programs such as data dependency, shared protected data, barriers, stream communication, AND- and OR-parallelism, can be conveniently expressed. The notion of processes accessing shared "boxes" (cells) using such operations provides a natural way of communication and synchronization. Section2 informally presents the semantics of mEDA synchronizing m e m o r y operations. Section3 outlines synchronization and communication properties of the operations. Section4 shortly illustrates how mEDA allows parallel processes to communicate and be synchronized via shared memory. Our conclusions are given in Section5.
2 Synchronizing Memory Operations
Assume that a parallel application consists of a d y n a m i c set of processes (or threads) and a set of cells which are shared, uniquely identified and known to the processes of the application. The processes interact with one another in a onesided fashion storing data to, or fetching d a t a from, shared cells. The processes can also transfer values from one cell to another. A process m a y access cells only by special store/fetch/transfer operations defined in the model. Since each process has a notion of local state, on this level of abstraction, a shared cell can be regarded as a special kind of process. An m E D A operation sends to the shared m e m o r y an access request with the following attributes: address of requested cell, operation code, value to be stored (in the case of store operation), and identifier of the requesting process. The latter is used to direct a fetch response or a store acknowledgment when needed. A shared cell m a y contain stored d a t a (a value) and a F I F O queue that holds the postponed access-requests which can not be satisfied because of inappropriate state of the requested cell. We extend the set of values, which can be stored to, or fetched from, the shared cell, by an " e m p t y " value. We call a shared cell full if it contains a normal value. The cell is empty if it contains the e m p t y value. m E D A recognizes four synchronization types of store operations (denoted by X, S, I, and U-store), and four synchronization types of fetch operations (X, S, I, and U-fetch). In addition the model defines 16 transfer operations briefly described in Section2.3. The semantics of store and fetch operations is based on considerations given below. See also Fig. 1.
[Diagram: store and fetch operations of the four synchronization types X, S, I and U, arranged along a blocking / non-blocking axis.]
Fig. 1. Categorization of EDA operations
2.1 Store Operations
First, we state that a store operation to an e m p t y cell always succeeds. If the cell is full, the store request either can be postponed until the variable is emptied (buffered store), or be served somehow as soon as it arrives (non-buffered store). Next we assume that a buffered store can be either blocking (synchronous)
684 or non-blocking (asynchronous). A blocking store, denoted by X in Fig. 1, requires an acknowledgment that a value has been stored. The requesting process is suspended until the acknowledgment arrives. On the other hand, the process executing a non-blocking buffered store (S), continues its computation as soon as the request is sent to shared cell. A non-buffered store-request to a full cell can be either ignored (I), or served by updating the requested cell independently of its state (U). In both cases, a synchronizing acknowledgment is not needed since non-buffered requests are served without delay. Based on the above observations we arrive at four types of store operations: X-store ("eXclusive"), S-store ("Stream"), I-store ("Ignore"), and U-store ("Update"). 2.2
Fetch Operations
We distinguish two kinds of fetch operations categorized in Fig. 1: (i) extracting operations which empty a shared cell after its value has been copied to the accessing process, and (ii) copying operations which copy a value from the cell to the process and leave the cell intact. We also distinguish the cases that a fetch operation may fetch (extract or copy) either only a normal value or any value (normal or empty). The request to fetch a normal value is postponed on an empty cell until the latter is filled. In all cases, a requesting process is blocked until the requested value (any or only normal) is fetched from the cell. The operations requesting any value can be classified as non-blocking. Thus, mEDA includes four types of fetch operations: X-fetch, S-fetch, I-fetch, and U-fetch, where the first two are extracting and the first and third one are blocking. 2.3
Transfer Operations
A transfer operation combines a pair fetch/store in one operation. Consider the expression s t o r e ( s _ s , B, f e t c h ( s _ f , A)) that requires to fetch a value from cell A and store the value to cell B. Here s_f and s_s specify synchronization types of fetch and store to cell A and cell B, respectively. To implement the expression efficiently, we introduce special operations (XX-trans, SX-trans, etc.) which transfer data between shared cells. The operations allow increasing the efficiency of the model. All transfer operations are non-blocking and do not affect the requesting process: the value fetched from A, is directly stored to B. 3
Synchronization and Communication Properties of mEDA Operations
The mEDA operations on a shared cell issued by cooperating processes trigger each other. When a process empties a full cell or fills an empty cell, one or more postponed requests (if any) queued on the cell, are served. This may cause resuming of blocked processes waiting for the completion of X-store, X-fetch or I-fetch. Storing a normal value to an empty cell allows all first pending copy requests (I-fetch) and one following pending extract request (X-fetch), if any,
685 to be resumed. Extracting a normal value from a flfll cell (X-fetch, S-fetch) or storing the e m p t y value to the full cell (U-store) allows resuming and servicing one postponed store request, if any, queued on the cell. It is i m p o r t a n t to note that the shared m e m o r y m a y not service incoming requests until all postponed requests which can be resumed, are served.
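The following single-threaded Python sketch (ours, not the authors' implementation) pulls together the store and fetch semantics of Sections 2.1 and 2.2 with the resumption rules just described. Blocking requests are simply queued together with a callback that fires when they are eventually served.

  EMPTY = object()                          # the distinguished empty value

  class Cell:
      def __init__(self):
          self.value = EMPTY
          self.queue = []                   # postponed requests, FIFO

      def full(self):
          return self.value is not EMPTY

      def store(self, kind, value, done=lambda: None):
          if kind == "U":                   # update regardless of the cell state
              was_full, self.value = self.full(), value
              done()
              if was_full and value is EMPTY:
                  self._resume()            # emptying a full cell resumes a store
              return
          if kind == "I":                   # ignored if the cell is full
              if not self.full():
                  self.value = value
                  self._resume()
              done()
              return
          if not self.full():               # X and S always succeed on an empty cell
              self.value = value
              done()
              self._resume()
          else:                             # buffered: postponed until the cell empties
              self.queue.append(("store", value, done))

      def fetch(self, kind, got):
          extracting = kind in ("X", "S")   # X, S empty the cell; I, U only copy
          needs_normal = kind in ("X", "I") # X, I wait for a normal value
          if needs_normal and not self.full():
              self.queue.append(("fetch", kind, got))
              return
          got(self.value)
          if extracting:
              self.value = EMPTY
              self._resume()

      def _resume(self):                    # serve postponed requests after a change
          while self.queue:
              head = self.queue[0]
              if head[0] == "store" and not self.full():
                  self.queue.pop(0)
                  _, value, done = head
                  self.value = value
                  done()
              elif head[0] == "fetch" and self.full():
                  self.queue.pop(0)
                  _, kind, got = head
                  got(self.value)
                  if kind in ("X", "S"):
                      self.value = EMPTY
              else:
                  break                     # head request still cannot be served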
Fig. 2. State Diagram of a Shared Cell
Figure 2 illustrates a state diagram of a shared cell. Each state is marked by a triple index (C, q_s, q_f), where:
- C indicates which value, normal or empty, is stored in the cell: C = ∅ if the cell is empty and C = d if the cell is full;
- q_s ∈ {0, 1, 2, ...} is the number of store requests (X or S) queued on the cell;
- q_f ∈ {0, 1, 2, ...} is the number of fetch requests (X or I) queued on the cell.
Each transaction is marked by the incoming or resumed access request causing it. For instance,
- fetch(i ∨ u) means a fetch request with either I or U synchronization;
- store(*, ∅) means store the empty value ∅ with any synchronization;
- store(u, d) means store a normal value d with U synchronization.
Intuitively it is easy to understand that a queue of requests postponed on a shared cell, cannot contain a mixture of storing and fetching requests. If the queue is not empty, it contains either only store requests of X a n d / o r S types, queued on the empty cell, or only fetch requests (X a n d / o r I) queued on the full cell. This consideration is confirmed by the diagram in Fig. 2. While resuming postponed requests, the cell passes through the states depicted in a dash shape until a request (if any) that heads the queue, can be resumed.
4 Process Interaction via Synchronizing Shared Memory
The store and fetch operations introduced in mEDA can be used in different combinations in order to provide a wide range of various communication and synchronization patterns for parallel processes such as data dependency, locks, barriers, and OR-parallelism. For example, X-fetch in combination with any buffered store (X- or S-store) supports data-driven computation where a process accessing shared location by X-fetch is blocked until requested data are available. One can see that the pair X-fetch/X-store is similar to the pair of take and put operations on M-structures [5], and to the pair of synchronizing read and write introduced in the memory model presented by Dennis and Gao in [8]. Locks are easily implemented using X-fetch/X-store, and OR-parallelism is achieved with I-fetch/I-store, since an I-store to a full cell is discarded.
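For instance, the lock pattern can be played through on the Cell sketch from Section 3 (again purely illustrative):

  # assumes the Cell class from the sketch in Section 3 is in scope
  lock = Cell()
  lock.store("X", "token")                                  # initially unlocked

  lock.fetch("X", lambda v: print("P1 acquired", v))        # P1 takes the token
  lock.fetch("X", lambda v: print("P2 acquired", v))        # P2 is postponed
  lock.store("X", "token")                                  # P1 releases: P2 resumed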
5 Conclusions
We have presented the semantics of shared memory operations introduced in the mEDA model, and demonstrated how these operations allow implementing a wide variety of inter-process communication and synchronization patterns in a convenient way. Orthogonal mEDA operations provide a unified approach for communication and synchronization of parallel processes via synchronizing shared memory supporting different access and synchronization needs. The model was implemented on top of the PVM system [10]. Two software implementation of mEDA were earlier reported in [2] and [16]. In the last version [16], the synchronizing shared memory for PVM is implemented as distributed virtual shared variables addressed by task and variable identifiers. The values stored to and fetched from the shared memory are PVM messages. The mEDA library can be used in combination with PVM message passing. This allows a wide variety of communication and synchronization patterns to be programmed in a convenient way.
687 Finally, we believe t h a t the m e m o r y system of a multiproeessor supporting m E D A synchronizing m e m o r y operations, can be relatively efficiently and cheaply implemented. This issue m a y be of interest to computer architects.
References 1. Agha, G.: Concurrent Object-Oriented Programming. Communications of the ACM 33 (1990) 125-141 2. Ahmed, H., Thorelli, L.-E., Vlassov, V.: mEDA: A Parallel Programming Environment. Proc. of the 21st EUROMICRO Conference: Design of Hardware/Software Systems, Como, Italy. IEEE Computer Society Press. (1995) 253-260 3. Arvind and Thomas, R. E.: I-structures: An Efficient Data Structure for Functional Languages. TM-178. LCS/MIT (1980) 4. Arvind and Iannucci, A.: Two fundamental issues in multiprocessing. Parallel computing in Science and Engineering. Lecture Notes in Computer Science, Vol. 295. Springer-Verlag (1978) 61-88 5. Barth, P. S., Nikhil, R. S., Arvind.: M-Structures: Extending a Parallel, Non-Strict, Functional Language with States. TM-237. LCS/MIT (1991) 6. Dennis, J. B.: Evolution of "Static" dataflow architecture. In: Gaudiot, J.-L., Bic L. (eds.): Advanced Topics in Dataflow Computing. Prentice-Hall (1991) 35-91 7. Dennis, J. B.: Machines and Models for Parallel Computing. International Journal of Parallel Programming 22 (1994) 47-77 8. Dennis, J. B., Gao, G. R.: On memory models and cache management for sharedmemory multiprocessors. In: Proe. of the IEEE Symposium on Parallel and Distributed Processing, San Antonio (1995) 9. Fillo, M., Keckler, S.W., Dally, W. J., Carter, N. P., Chang, A., Gurevich, Ye., Lee, W. S.: The M-Machine Multicomputer. In: Proc. of the 28th IEEE/ACM Annual Int. Syrup. on Microarchitecture, Ann Arbor, MI (1995) 10. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press (1994) 11. Mellor-Crummey, J. M., Scott, M. L.: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. on Computer Systems 9 (1991) 21-65 12. Ramachandran, U., Lee, J.: Cache-Based Synchronization in Shared Memory Multiprocessors. Journal of Parallel and Distributed Computing 32 (1996) 11-27 13. Scott, S. L.: Synchronization and Communication in the T3E Multiprocessor. In: Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII) (1996) 26-36 14. Skillicorn, D. B., Hill, J. M. D., MeColl, W.F.: Questions and Answers about BSP. Tech. Rep. PRG-TR-15-96. Oxford University. Computing Laboratory (1996) 15. Thorelli, L.-E.: The EDA Multiprocessing Model. Tech. Rep. TRITA-IT-R 94:28. Dept. of Teleinformatics, Royal Institute of Technology, Stockholm, Sweden (1994). 16. Vlassov, V., Thorelli, A Synchronizing Shared Memory: Model and Programming Implementation. In: Recent Advances in PVM and MPI. Proc.of the 4th European PVM/MPI Users' Group Meeting. Lecture Notes on Computer Science, Vol. 964. Springer-Verlag (1995) 288-293 17. Wu, H.: Extension of Data-Flow Principles for Multiprocessing. Tech. Rep. TRITATCS-9004 (Ph D thesis). Royal Institute of Technology, Stockholm, Sweden (1990)
Symbolic Cost Analysis and Automatic Data Distribution for a Skeleton-Based Language
Julien Mallet
IRISA, Campus de Beaulieu, 35042 Rennes, France
mallet@irisa.fr
Abstract. We present a skeleton-based language which leads to portable and cost-predictable implementations on MIMD computers. The compilation process is described as a series of program transformations. We focus in this paper on the step concerning the distribution choice. The problem of automatic mapping of input vectors onto processors is addressed using symbolic cost evaluation. Source language restrictions are crucial since they permit the use of powerful techniques on polytope volume computations to evaluate costs precisely. The approach can be seen as a cross-fertilization between techniques developed within the FORTRAN parallelization and skeleton communities.
1 Introduction
A good parallel p r o g r a m m i n g model must be portable and cost predictable. General sequential languages such as F O R T R A N achieve portability but cost estimations are often very approximate. The approach described in this p a p e r is based on a restricted language which is portable and allows an accurate cost analysis. The language enforces a p r o g r a m m i n g discipline which ensures a predictable performance on the target parallel computer (there will be no "performance bugs"). Language restrictions are introduced through skeletons which encapsulate control and d a t a flow in the sense of [6], [7]. The skeletons act on vectors which can be nested. There are three classes defined as sets of restricted skeletons: computation skeletons (classical d a t a parallel skeletons), communication skeletons (data motion over vectors), and mask skeletons (conditional d a t a parallel computation). The target parallel computers are MIMD computers with shared or distributed memory. Our compilation process can be described as a series of program transformations starting from a source skeleton-based language to an SPMD-like skeleton-based language. There are three main transformations: in-place updating, making all communications explicit and distribution. Due to space concerns, we focus in this paper only on the step concerning the choice of the distribution. We tackle the problem of automatic m a p p i n g of input vectors onto processors using a fixed set of classical distributions (e.g. row block, block cyclic,...). The goal is to determine mechanically the best distribution for each input vector
689 through cost evaluation. The restrictions of the source language and the fixed set of standard data distributions guarantee that the parallel cost (computation + communication) can be computed accurately for all source programs. Moreover, these restrictions allow us to evaluate and compare accurate execution times in a symbolic form. So, the results do not depend on specific vector sizes nor on a specific number of processors. Even if our approach is rooted in the field of data-parallel and skeleton-based languages, one of its specificities is to reuse techniques developed for F O R T R A N parallelization (polytope volume computation) and functional languages (singlethreading analysis) in a unified framework. The article is structured as follows. Section 2 is an overview of the whole compilation process. Section 3 presents the source language and the target parallel language. In Section 4, we describe the symbolic cost analysis through an example: LU decomposition. We report some experiments done on a MIMD distributed memory computer in Section 5. We conclude by a review of related work. The interested reader will find additional details in an extended version of this paper ([12]). 2
Compilation Overview
The compilation process consists of a series of program transformations:
L1 → L2 → L3 → L4 → L5
Each transformation compiles a particular task by mapping skeleton programs from one intermediate language into another. The source language (/21) is composed of a collection of higher-order functions (skeletons) acting on vectors (see Section 3). It is primarily designed for a particular domain where high performance is a crucial issue: numerical algorithms./21 is best viewed as a parallel kernel language which is supposed to be embedded in a general sequential language (e.g. C). The first transformation (/21 ~ / 2 2 ) deals with in-place updating, a standard problem in functional programs with arrays. We rely on a type system in the spirit of [17] to ensure that vectors are manipulated in a single-threaded fashion. The user may have to insert explicit copies in order for his program to be welltyped. As a result, any vector in a/22 program can be implemented by a global variable. The second transformation (122 -+ /2a) makes all communications explicit. Intuitively, in order to execute an expression such as map (Ax.x+y) in parallel, y must be broadcast to every processor before applying the function. The transformation makes this kind of communication explicit. In/:3, all communications are expressed through skeletons. The transformation/23 -+ /24 concerns automatic data distribution. We restrict ourselves to a small set of standard distributions. A vector can be distributed cyclicly, by contiguous blocks or allocated to a single processor. For a
matrix (vector of vectors), this gives 9 possible distributions (cyclic cyclic, block cyclic, line cyclic, etc.). The transformation constructs a single vector from the input vectors according to a given data distribution. This means, in particular, that all vector accesses have to be changed according to the distributions. Once the distribution is fixed, some optimizations (such as copy elimination) become possible and are performed on the program. The transformation yields an SPMD-like skeleton program acting on a vector of processors. In order to choose the best distribution, we transform the/2a program according to all the possible distributions of its input parameters. The symbolic cost of each transformed program can then be evaluated and the smallest one chosen (see Section 4). For most numerical algorithms, the number of input vectors is small and this approach is practical. In other cases, we would have to rely on the programmer to prune the search space. The 124 -+/25 step is a straightforward translation of the SPMD skeleton program to an imperative program with calls to a standard communication library. We currently use C with the MPI library. Note that all the transformations are automatic. The user may have to interact only to insert copies (121 -~/22 step) or, in s when the symbolic formulation does not designate a single best distribution (see 4.4).
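The index arithmetic behind these standard distributions is simple. The sketch below (our illustration, one dimension only) shows the (processor, local index) maps that the transformation substitutes into vector accesses; a matrix distribution applies one such map per dimension.

  def block_owner(i, n, p):
      # contiguous blocks of size ceil(n / p)
      b = -(-n // p)
      return i // b, i % b                  # (processor, local index)

  def cyclic_owner(i, n, p):
      # element i goes to processor i mod p
      return i % p, i // p

  def single_owner(i, n, p):
      # whole vector allocated to processor 0
      return 0, i

  n, p = 10, 4
  print([block_owner(i, n, p) for i in range(n)])
  print([cyclic_owner(i, n, p) for i in range(n)])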
3 The Source and Target Languages
The source language L1 is basically a first-order functional language without general recursion, extended with a collection of higher-order functions (the skeletons). The main data structure is the vector, which can be nested to model multidimensional arrays. An L1 program is a main expression followed by definitions. A definition is either a function definition or the declaration of the (symbolic) size of an input vector. Combination of computations is done through function composition (.) and the predefined iterator iterfor; iterfor n f a acts like a loop applying its function argument f to a n times. Further, it makes the current loop index accessible to its function argument. The L1 program implementing LU decomposition is:

LU(M) where M :: Vect n (Vect n Float)
LU(a) = iterfor (n-1) (calc.fac.apivot.colrow) a
calc(i,a,row,piv) = (i,(map (map first) .rect i (n-1) (rect i (n-1) fcalc) .map zip3.zip3)(a,row,piv))
fcalc(a,row,piv) = (a-row*piv,row,piv)
fac(i,a,row,col,piv) = (i,a,row,(map (map /).map zip.zip)(col,piv))
apivot(i,a,row,col) = (i,a,row,col,map (brdcast (i-1)) row)
colrow(i,a) = (i,a,brdcast (i-1) a,map (brdcast (i-1)) a)
zip3(x,y,z) = map p2t (zip(x,zip(y,z)))
p2t(x,(y,z)) = (x,y,z)
691 For handling vectors, there are three classes of skeletons: computation, communication, and mask skeletons. The computation skeletons include the classical higher order functions map, f o l d and scan. The communication skeletons describe restricted d a t a motion in vectors. In the definition of c o l r o w in LU, brdcast (• copies the (i-1)th vector element to all the other elements. Another communication skeleton is g a t h e r i g which copies the ith column of the m a t r i x M to the ith row of M. There are seven communication skeletons which have been chosen because of their availability on parallel computers as hard-wired or optimized communication routines. The mask skeletons are dataparallel skeletons which apply their function argument only to a selected set of vector elements. In the definition of t a l c in LU, r e c v i ( n - l ) f c a l c applies f c a l c on vector elements whose index ranges from i to (n-l). Note t h a t our approach is not restricted to these sole skeletons: even if it requires some work and caution, more iterators, computation or m a s k skeletons could be added. In order to enable a precise symbolic cost analysis, additional syntactic restrictions are necessary. The scalar arguments of communication skeletons, the iterfor operator and mask skeletons must be linear expressions of iterfor indexes and variables of vector size. We rely on a type system with simple subtyping (not described here) to ensure that each variable appearing in such expression, are only iterfor and vector size variables. It is easy to check that these restrictions are verified in LU. T h e integer argument of • is a linear expression of an input vector size. The arguments of r e c t and b r d c a s t skeletons are made of linear expressions of iterfor variable • or vector size variable n. The target language 124 expresses SPMD (Single P r o g r a m Multiple Data) computations. A program acts on a single vector whose elements represent the d a t a spaces of the processors. Communications between processors are m a d e explicit by d a t a motion within the vector of processors in the spirit of the source parallel language of [8]. 124 introduces new versions of skeletons acting on the vector of processors. An 124 program is a composition of computations, with possibly a communication inserted between each computation. More precisely, a p r o g r a m FunP has the following form:
F . n P ::= (pimap F.n I pitereo= E F.nP) (. Corn)~ 1 (. FunP) ~ A c o m p u t a t i o n is either a parallel operation pimap Fun which applies Fun on each processor, or an iteration p i t e r f o r E FunP which applies E times the parallel c o m p u t a t i o n FunP to the vector of processors. T h e functions Fun are similar to functions of 121. The only differences are the skeleton arguments which m a y include modulo, integer division, m i n i m u m and m a x i m u m functions. Corn denotes the communication between processors. It describes d a t a motion in the same way as the communication skeletons of 121 but on the vector of processors. For example, the corresponding parallel communication skeleton of b r d c a s t is p b r d c a s t which broadcasts values from one processor of the vector of processors to each other.
4 Accurate Symbolic Cost Analysis
After transformation of the program according to different distributions, this step aims at automatically evaluating the complexity of each f 4 p r o g r a m obtained in order to choose the most efficient. Our approach to get an accurate symbolic cost is to reuse results on polytope volume c o m p u t a t i o n s ([16], [5]). It is possible because the restrictions of the source language and the fixed set of d a t a distributions guarantee that the abstracted cost of all transformed source programs can be translated into a polytope volume description. First, an abstraction function CA takes the program and yields its symbolic parallel cost. The cost expression is transformed so that it can be seen as the definition of a polytope volume. Then, classic methods to compute the volume of a polytope are applied to get a symbolic cost in polynomial form. Finally, a symbolic m a t h package (such as Maple) can be used to compare symbolic costs and find the smallest cost among the s programs corresponding to the different distribution choices.
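As a toy illustration of the last two steps, with SymPy standing in for the polytope-volume library and for Maple, a nested sum with symbolic bounds can be turned into a closed-form polynomial in the program parameter n:

  from sympy import symbols, summation, factor

  n, i, j = symbols("n i j", positive=True, integer=True)

  # number of (i, j) pairs with 1 <= i <= n-1 and i <= j <= n-1
  cost = summation(summation(1, (j, i, n - 1)), (i, 1, n - 1))
  print(factor(cost))                       # n*(n - 1)/2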
4.1
Cost Abstraction
and Cost Language s
The symbolic cost abstraction C A is a function which extracts cost information from parallel programs. The main rules are shown below. We use a non-standard notation for indexed sums: instead of }-:~%1, we note ~ i { i<_-. 1<, ~ j 9 Communication costs are expressed as polynomials whose constants depend on the target computer. For example, the cost of p b r d c a s t involves the p a r a m e t e r s abr and ~ r which denote respectively the time of one-word transfer between two processors and the message startup time on the parallel computer considered. One just has to set those constants to adapt the analysis for a specific parallel machine. CA [If1 . f2] = CA [rfl]] + cA [[f2]] CJ4 [piterfor e (Ai.f)]-- E l { ~<' } cA N CJ[ ~pimap (Ai.f)]
C.A ~[pbrdcast] CA [[rect el e2 f]]
= max P - ' CA If] i=0
where p is the number of processors f p is the number of processors = (crbr*b + ~br)*p where 1. b is the size of broadcast data = ~ i { :~<~ } CA ][f]]
The cost abstraction yields accurate costs expressed in a simple language EC. s is m a d e of sums of generalized sums (G-sums): ~ i { } ('-') and generalized maxs (G-maxs): max(= e (...) possibly nested. A G - m a x denotes the m a x i m u m expression one can get when the m a x variable ranges over the integer interval. Inequalities in G-sums involve only linear expressions and m a x i m u m and m i n i m u m function of them. There m a y be polynomial included in the b o d y of G-sums and G-maxs. The variables of polynomials are only denoting vector sizes or the number of processors. In order to illustrate the analysis, we consider here only the two best distribution choices for LU: row block and row cyclic. The ab-
693 stracted costs are given below; note that only the part of the costs which differs is detailed: t,<_~-~J \ip=0 ti_<;-~
\,p=0
9,.(o,,-,,.b)<j
3
1
j- ,_
1)~ /
n--l--i i<.i~(~_ "'/["-'--'P]~v l"
{k
+ C
We emphasize that writing the source program in s is crucial to get an accurate symbolic cost. First of all, without a severe limitation of the use of recursion no precise cost could be evaluated in general. Further, the restrictions imposed b y s ensure that every abstract parallel cost belongs to s The mask skeletons (which limit conditional application to index intervals), the communication and the computation skeletons all have a complexity which depends polynomially on the vector size. Their costs can be described by nested sums. Another i m p o r t a n t restriction is that expressions involving iteration indexes (mask skeletons and iterfor bounds) are linear. This restriction, in addition to the form of the standard distributions which keep the linearity of the vector accesses, ensure that the inequalities occurring in/:C are linear. 4.2
Transformation to Descriptions of Polytope Volume
Polytope volumes are only defined in terms of nested G-sums with linear inequalities containing neither max nor min ([5]). In order to apply methods to compute polytope volumes, we must remove the G-max, min, max occurring in the cost expressions. The transformation takes the form of a structural recursive scan of the cost expression. The rain and max appearing in inequalities are transformed into conjunctions of inequalities. For example, the expression max(0, i - iv * b) < j, in r becomes 0 < j A i - iv * b < j. G-max expressions are propagated through their subexpressions. For example, in is temporarily transformed into ~ j
1 p--1 f oKj^i-ip*b
,<_b-l^j_<m~x~-20(. . . . . p.b+b)
maxip=0 ('" ")" T h e
max value of a G-sum is a sum where the upper bounds are maximized and the lower ones minimized. For a sum inequality i _< e, we propagate the G - m a x inside e until it reaches a variable or a constant. We can finally use the facts p-1 9 -1 that maxip_ 0 ~v = p - 1 and m a ~ _ 0 k = k if k ~ ip. For an inequality e < i, the transformation propagates a G-train analogously. Since no polynomial in-s may contain u G-max index, the G-max of a polynomial is just this polynomial. In the case of distributed LU, the cost expressions are transformed into
I l
~ 1+ C
-
2
The technique described to remove G-max expressions may maximize the real cost. This is due to the duplication of G-max in G-sums. A more complicated
694 but accurate symbolic solution also exists. The idea is to evaluate (as described in the next section) the nested G-sums first. When the G-sum is reduced to a polynomial the problem amounts to computing symbolic maximum of polynomials over symbolic intervals. This can be done using symbolic differentiation, solution and comparison of polynomials. At this point, we have only used the simpler approximated solution because the approximation is minor for our examples (e.g. none for LU) and the accurate symbolic solution is not implemented as such in existing symbolic math packages. 4.3
Parametrized Polytope Volume C o m p u t a t i o n
A parametrized polytope is a set of points whose coordinates satisfy a system of linear inequalities with possibly unknown parameters. After the previous transformation, the cost expression denotes a polytope volume, that is the number of points in the associated polytope. The works of [16] and [5] describe algorithms for computing symbolically the volume of parametrized polytopes. The result is a polynomial whose variables are the system parameters. Further, in [5], Clauss presents an extension for systems of non-linear inequalities with floor ([J) and ceiling ([]) functions as appearing in Ccyc' 2 Thus, the algorithm of [5] allows one to transform each parallel cost expression into a polynomial. This method applied to LU yields:
-- ~
4.4
+n 2
+n
-3p
p-.2
-{- p2-ap+26 "[-C~- Cc%c
S y m b o l i c Cost C o m p a r i s o n
The last step is to compare the symbolic costs of different distribution choices. It amounts to computing the symbolic intervals where the difference of cost (polynomial) is positive or negative. Symbolic math packages such as Maple can be used for solving this problem. In the case of LU, Maple produces the following condition: The programmer may have to indicate if the relations given by Maple are satisfied or not. In our example, he must indicate if n _> p. Another (automatic) solution is to use the relations as run-time tests which choose between several versions of the program. In our example, the difference in the two costs can be explained by the fact that the cyclic distribution provides a much better load balancing than the block distribution whereas communications are identical.
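The comparison step can be mimicked with any symbolic package. In the sketch below the two cost polynomials are made-up stand-ins (not the real block and cyclic costs of LU) and SymPy replaces Maple:

  from sympy import Eq, simplify, solve, symbols

  n, p = symbols("n p", positive=True)

  cost_block  = n**2 / 2 + n                # hypothetical block-distribution cost
  cost_cyclic = n**2 / (2 * p) + 5 * n      # hypothetical cyclic-distribution cost

  diff = simplify(cost_block - cost_cyclic)
  print(diff)                               # positive exactly when cyclic is cheaper
  print(solve(Eq(diff.subs(p, 4), 0), n))   # break-even problem size for p = 4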
5 Experiments
We have performed experiments on an Intel Paragon XP/S with a handful of standard linear algebra programs (LU, Cholesky factorization, Householder, Jacobi elimination, ...). Our implementation is not completed and some compilation steps, such as the destructive update step and part of the symbolic cost computation, were done manually. Below, a table gathers the execution times obtained for LU decomposition. They are representative of the results we got for the other few programs. For all programs, the distribution chosen by the cost analysis proved to be the best one in practice.

Processors  Skel. cyclic  Skel. block  Seq. C  HPF cyclic  ScaLAPACK cyclic
1           14.77         15.07        13.61   15.36       3.78
4            5.25          6.75        x        5.41       1.84
16           2.97          5.33        x        3.06       1.50
32           2.57          5.58        x        2.67       1.41

We compared the sequential execution of skeleton programs with standard (and portable) C versions and our parallel implementation with High Performance Fortran (a manual distribution approach). No significant sequential or parallel runtime penalty seems to result from programming using skeletons, at least for such regular algorithms.

We compared our code with the parallel implementation of NESL, a skeleton-based language [1]. The work on the implementation of NESL has mostly been directed towards SIMD machines. On the Paragon, the NESL compiler distributes vectors uniformly on processors and communications are not optimized. Not surprisingly, the parallel code is very inefficient (at least fifty times slower than our code).

We also compared our implementation with ScaLAPACK, an optimized library of linear algebra programs designed for distributed memory MIMD parallel computers [4]. In ScaLAPACK, the user may explicitly indicate the data distribution. So, we indicated the best distribution found by the cost analysis in each ScaLAPACK program considered. If our code on 1 processor is much slower than its ScaLAPACK equivalent (between 3 and 5 times slower), the difference decreases as the number of processors increases (typically, 1.8 times slower on 32 processors). Much of this difference comes from the machine specific routines used by ScaLAPACK for performing matrix operations (the BLAS library). This suggests a possible interesting extension of our source language. The idea would be to introduce new skeletons corresponding to the BLAS operations in order to benefit from these machine specific routines.

These preliminary results are promising but more experiments are necessary to assess both the expressiveness of the language and the efficiency of the compilation. We believe that these experiments may also indicate useful linguistic extensions (e.g. new skeletons) and new optimizations.
6 Related Work and Conclusion
The community interested in the automatic parallelization of FORTRAN has studied automatic data distribution through parallel cost estimation ([10], [3]).
If the complete FORTRAN language (unrestricted conditionals, indexing with runtime values, ...) is to be taken into account, communication and computation costs cannot be accurately estimated. In practice, the approximated cost may be far from the real execution time, leading to a bad distribution choice. [16], [5] focus on a subset of FORTRAN: loop bounds and array indexes are linear expressions of the loop variables. This restriction allows them to compute a precise symbolic computation cost through their computations of polytope volume. Unfortunately, using this approach to estimate communication costs is not realistic. Indeed, the cost would be expressed in terms of point-to-point communications without taking into account hard-wired communication primitives. All these works estimate real costs too roughly to ensure that a good distribution is chosen.

The skeleton community has studied the transformation of restricted computation patterns into lower-level parallel primitives. [7] defines a restricted set of skeletons which are transformed using cost estimation. Only cost-reducing transformations are considered. It is, however, well known that often intermediary cost-increasing transformations are necessary to derive a globally cost-optimal algorithm. [15], [11] and [14] define cost analyses for skeleton-based languages. Their skeletons are more general than ours, leading to approximate parallel costs (communication and/or computation). Furthermore, the costs are not symbolic (the size of input matrices and the number of processors are supposed to be known). [9] defines a precise communication cost for scan and fold skeletons on several parallel topologies (hypercube, mesh, ...) which allows them to apply optimizations of communications through cost-reducing transformations. There are also a few real parallel implementations of skeleton-based languages. [2] uses cost estimations based on profiling which does not ensure a good parallel performance for different sizes of inputs. [13] uses a finer cost estimation but implementation decisions are taken locally and no arbitration of tradeoffs is possible.

We have presented in this paper the compilation of a skeleton-based language for MIMD computers. Working by program transformations in a unified framework simplifies the correctness proof of the implementation. One can show independently for each step that the transformation preserves the semantics and that the transformed program respects the restrictions enforced by the target language. The overall approach can be seen as promoting a programming discipline whose benefit is to allow precise analyses and a predictable parallel implementation. The source language restrictions are central to the approach as well as the techniques to evaluate the volume of polytopes. We regard this work as a rare instance of cross-fertilization between techniques developed within the FORTRAN parallelization and skeleton communities.

A possible research direction is to study dynamic redistributions chosen at compile-time. Some parallel algorithms are much more efficient in the context of dynamic data redistribution. A completely automatic and precise approach to this problem would be possible in our framework. However, this would lead to a search space of exponential size. A possible solution is to have the user select the redistribution places.
Acknowledgements: Thanks to Pascal Fradet, Daniel Le Métayer, Mario Südholt for commenting on an earlier version of this paper.
References 1. G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. 2agha. Implementation of a portable nested data-parallel language. In 4th ACM Syrup. on Princ. and Practice of Parallel Prog., pages 102-112. 1993. 2. T. Bratvold. A Skeleton-Based Parallelising Compiler for ML. In 5th Int. Workshop on the Imp. of Fun. Lang., pages 23-33, 1993. 3. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. Teng. Automatic array alignment in data-parallel program. In 20th ACM Syrup. on Princ. of Prog. Lang., pages 1628, 1993. 4. J. Choi and J. J. Dongarra. Scalable linear algebra software libraries for distributed memory concurrent computers. In Proc. of the 5th [EEE Workshop on Future Trends of Distributed Computing Systems, pages 170-177, 1995. 5. P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In ACM Int. Conf. on Supercomputing, 1996. 6. M. Cole. A skeletal approach to the exploitation of parallelism. In CONPAR'88, pages 667-675. Cambridge University Press, 1988. 7. J. Daxlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, D. W. N. Sharp, Q. Wu, and R. L. While. Parallel programming using skeleton functions. In PARLE '93, pages 146-160. LNCS 694, 1993. 8. J. Darlington, Y. K Guo, H. W. To, and Y. Jing. Skeletons for structured parallel composition. In 5th A CM Syrup. on Princ. and Practice of Parallel Prog., pages 19-28~ 1995. 9. S. Gorlatch and C. Lengauer. (De)Composition rules for parallel scan and reduction. In 3rd IEEE Int. Conf. on Massively Par. Prog. Models, 1998. 10. M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on mul~icomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, 1992. 11. C. B. Jay, M. I. Cole, M. Sekanina, and P. Steekler. A monadic calculus for parallel costing of a functional language of arrays. In Euro-Par'97 Parallel Processing, pages 650-661. LNCS 1300, 1997. 12. J. Mallet. Compilation for MIMD computers of a skeleton-based language through symbolic cost analysis and automatic data distribution. Technical Report 1190, IRISA, May 1998. 13. S. Pelagatti. A Methodology for the Development and the Support of Massively Parallel Programs. PhD thesis, Pise University, 1993. 14. R. Rangaswami. A Cost Analysis for a Higher-order Parallel Programming Model. PhD thesis, Edinburgh University, 1996. 15. D. B. Skillicorn and W. Cal. A cost calculus for parallel functional programming. Technical report, Queen's University, 1993. 16. N. Tawbi. Estimation of nested loops execution time by integer arithmetic in convex polyhedra. In Int. Symp. on Par. Proc., pages 217-223, 1994. 17. P. Wadler. Linear types can change the world! In P~vgramming Concepts and Methods, pages 561-581. North Holland, 1990.
Optimising Data-Parallel Programs Using the BSP Cost Model
D.B. Skillicorn (1), M. Danelutto, S. Pelagatti and A. Zavanella (2)
(1) Department of Computing and Information Science, Queen's University, Kingston, Canada
(2) Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy
Abstract. We describe the use of the BSP cost model to optimise programs, based on skeletons or data-parallel operations, in which program components may have multiple implementations. BSP's view of communication transforms the problem of finding the best implementation choice for each component into a one-dimensional minimisation problem. A shortest-path algorithm that finds optimal implementations in time linear in the number of operations of the program is given.
1 Problem Setting
Many parallel programming models gain expressiveness by raising the level of abstraction. Important examples are skeletons, and data-parallel languages such as HPF. Programs in these models are compositions of moderately-large building blocks, each of which hides significant parallel computation internally. There are typically multiple implementations for each of these building blocks, and it is straightforward to order these implementations by execution cost. What makes the problem difficult is that different implementations require different arrangements of their inputs and outputs. Communication steps must typically be interspersed to rearrange the data between steps. Choosing the best global implementation is difficult because the cost of a program is the sum of two different terms, an execution cost (instructions), and a communication cost (words transmitted). In most parallel programming models, it is not possible to convert the communication cost into comparable units to the execution cost because the message transit time depends on which other messages are simultaneously being transmitted, and this is almost impossible to determine in practice. The chief advantage of the BSP cost model, in this context, is that it accounts for communication (accurately) in the same units as computation. The problem of finding a globally-optimal set of implementation choices becomes a one-dimensional minimisation problem, in fact with some small adaptations, a shortest path problem. The complexity of the shortest path problem with positive weights is quadratic in the number of nodes (Dijkstra's algorithm) or O((e + n)log n), where e is the
number of edges, for the heap algorithm. We show that the special structure of this particular application reduces the problem to shortest path in a layered graph, with complexity linear in the number of program steps. BSP's cost model assumes that the bottleneck in communication performance is at the processors, rather than in the network itself. Thus the cost of communication depends on the fan-in and fan-out of data at each processor, and a single architectural parameter, g, that measures effective inverse bandwidth (in units of time per word transferred). Such a model is extremely accurate for today's parallel computers. Other programming models can use the technique described here to the extent that the BSP cost model reflects their performance. In practice, this requires being able to decide which communication actions will occur in the same time frame. This will almost certainly be possible for P3L, HPF, GoldFish, and other data-parallel models. In Section 2 we show how the BSP cost model can be used to model the costs of skeletons or data-parallel operations. In Section 3, we present an optimisation algorithm with linear complexity. In Section 4, we illustrate its use on a nontrivial application written in the style of P3L. In Section 5, we review some related work.
2 Cost Modelling
Consider the program in Figure 1, in which program operations, BSP supersteps [7], are shown right to left in a functional style. Step A has two potential implementations, and we want to account for the total costs of the two possible versions of the program.
Fig. 1. Making a Local Optimisation. Rectangles represent computation, ovals communication, and bars synchronisation. Each operation is labelled with its total cost parameter. Both implementations may contain multiple supersteps although they are drawn as containing one. The alternate implementation of step A requires extra communication (and hence an extra barrier) to place data correctly. In general, it is possible to overlap this with the communication in the previous step.

The communication stage of the operation preceding step A arranges the data in a way suitable for the first implementation of step A.
Using the second implementation for step A requires inserting extra data manipulation code to change this arrangement. This extra code can be merged with the communication at the end of the preceding superstep, saving one barrier synchronisation, and potentially reducing the total communication load. For example, identities involving collective communication operations may allow some of the data movements to 'cancel' each other. We can describe the cost of these two possible implementations of Step A straightforwardly in the BSP cost framework. The cost of the original implementation of Step A is w1 + h1 g + s1 l and the cost of the new implementation is w2 + h2 g + s2 l plus a further rearrangement cost h3 g + l. To determine the extent to which the h3 g cost can be overlapped with the communication phase of the previous step we must break both costs down into fan-out costs and fan-in costs. So suppose that h4 is the communication cost of the preceding step and

h3 = max(h3^out, h3^in)
h4 = max(h4^out, h4^in)
since fan-ins and fan-outs are costed additively. The cost of the combined, overlapped communication is
max(h3^out + h4^out, h3^in + h4^in) g
The new implementation is cheaper than the old, therefore, if
w2 + h2 g + s2 l + max(h3^out + h4^out, h3^in + h4^in) g  <  w1 + h1 g + s1 l + h4 g
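The inequality above can be checked mechanically. The following Python fragment is ours, not the paper's; it encodes one reading of the comparison, with all parameter values made up for illustration.

    def step_cost(w, h, s, g, l):
        """BSP cost of one superstep: computation + h-relation + barriers."""
        return w + h * g + s * l

    def prefer_alternate(w1, h1, s1, w2, h2, s2,
                         h3_out, h3_in, h4_out, h4_in, g, l):
        """Is the alternate implementation of step A cheaper once its extra data
        rearrangement is overlapped with the preceding communication phase?"""
        h4 = max(h4_out, h4_in)
        overlapped = max(h3_out + h4_out, h3_in + h4_in)   # merged communication
        original  = step_cost(w1, h1, s1, g, l) + h4 * g
        alternate = step_cost(w2, h2, s2, g, l) + overlapped * g
        return alternate < original

    # Example with made-up numbers (flops / words / barriers):
    print(prefer_alternate(w1=10_000, h1=4_000, s1=1,
                           w2=6_000,  h2=2_500, s2=1,
                           h3_out=500, h3_in=500, h4_out=1_000, h4_in=800,
                           g=4.0, l=2_000))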
3 Optimisation
The analysis in the previous section shows how a program is transformed into a directed acyclic graph in which the nodes are labelled with the cost of computation (and synchronisation) and the edges with the cost of communication. Alternate implementation corresponds to choosing one path between two points rather than another. In what follows, we simply assume that different implementations for each program step are known, without concerning ourselves with how these are discovered. Given a set of alternate implementations for each program step, the communication costs of connecting a given implementation of one step with a given implementation of the succeeding step can be computed as outlined above. If there are a alternate implementations for each program step, then there are a^2 possible communication patterns connecting them. If the program consists of n steps, then the total cost of setting up the graph is (n - 1)a^2. We want to find the shortest path through this layered graph. Consider the paths from any implementation block at the end of the program. There are (at most) a paths leading back to implementations of the previous step. The cost of a path is the sum of the costs of the blocks it passes through and the edges it contains. After two communication steps, there are a^2 possible paths, but only (at most) a of them can continue, because we can discard all but the cheapest
of any paths that meet at a block. Thus there are only a possible paths to be extended beyond any stage. Eventually, (at most) a paths reach the start of the program, and the cheapest overall can be selected. The cost of this search is (n - 1)a^2, and it must be repeated for each possible starting point, so the overall cost of the optimisation algorithm is (n - 1)a^3. This is linear in the length of the program being optimised. The value of a is typically small, perhaps 3 or 4, so that the cubic term in a is unlikely to be significant in practice.
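The search just described is a standard dynamic program over a layered graph. The sketch below is our Python, not the authors' implementation; it handles all starting implementations in a single pass and returns the cheapest total cost together with one optimal choice of implementation per step. Node and edge costs in the example are invented.

    def optimise(node_costs, edge_costs):
        """Choose one implementation per program step minimising the total cost.

        node_costs[k][i]    : cost of implementation i of step k
        edge_costs[k][i][j] : cost of connecting impl. i of step k to impl. j of step k+1
        Runs in O(n * a^2) for n steps with at most a implementations each.
        """
        n = len(node_costs)
        best = list(node_costs[0])     # best cost of a path ending at each impl. of step 0
        back = []                      # back[k][j]: chosen predecessor at the previous step
        for k in range(1, n):
            layer_best, layer_back = [], []
            for j, cj in enumerate(node_costs[k]):
                cand = [best[i] + edge_costs[k - 1][i][j] + cj
                        for i in range(len(node_costs[k - 1]))]
                i_min = min(range(len(cand)), key=cand.__getitem__)
                layer_best.append(cand[i_min])
                layer_back.append(i_min)
            best = layer_best
            back.append(layer_back)
        j = min(range(len(best)), key=best.__getitem__)
        path = [j]
        for layer_back in reversed(back):
            j = layer_back[j]
            path.append(j)
        path.reverse()
        return min(best), path

    # Two steps, two implementations each (made-up costs):
    total, choices = optimise(node_costs=[[5, 7], [4, 2]],
                              edge_costs=[[[1, 6], [0, 1]]])
    print(total, choices)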
4 An Example
We illustrate the approach with a non-trivial program, Conway's game of Life, expressed in the intermediate code used by the P3L compiler [5,1]. The operations available to the P3L compiler are:

distribute :: Distr_pattern -> SeqArray α -> ParArray (SeqArray α)
gstencil   :: ParArray α -> Distr_pattern -> Stencil -> Distr_pattern -> ParArray α
lstencil   :: ParArray α -> Stencil -> Distr_pattern -> ParArray (SeqArray α)
map        :: ParArray α -> (α -> β) -> ParArray β
reduce     :: ParArray α -> (α -> α -> α) -> α
reduceall  :: ParArray α -> (α -> α -> α) -> ParArray α
gather     :: ParArray α -> Distr_pattern -> SeqArray α

where distribute distributes an array to a set of worker processes according to a distribution pattern (Distr_pattern), gstencil fetches data from neighbour workers according to a given stencil, lstencil arranges the data local to a worker to have correct haloes in a stencil computation, map and reduce are the usual functional operations, reduceall is a reduce in which all the workers get the result, and gather collects distributed data in an array to a single process. Possible intermediate P3L code for the game of life is:

1. d1 = (block_block p)
2. W  = distribute d1 World
3. while NOT(C2)
4.    W1 = gstencil W d1 [(1,1),(1,1)] d1
5.    W2 = lstencil W1 [(1,1),(1,1)] d1
6.    (W3,C) = map (map update) W2
7.    C1 = map (reduce OR C)
8.    C2 = reduceall OR C
9. X = gather d1 W3
where d1 is the distribution adopted for the World matrix (block,block over a rectangle of p processors), and W is the ParArray of the distributed slices of the initial World matrix. Line 4 gathers the neighbours from non-local processors. Line 5 creates the required copies. Line 6 performs all the updates in parallel. Array C contains the boolean values saying if the corresponding element has changed. Line 9 gathers the global results. We concentrate on the computation inside the loop. It may be divided into three phases: Phase A gathers the needed values at each processor (Line 4). Phase B computes all of the updates in parallel (Lines 5-7). Phase C checks for termination: nothing has changed in the last iteration (Line 8). There are several possible implementations for these steps:
Operation  Computation Cost        Comment
A1         l                       direct stencil implementation
A2         2l                      gather to a single process then distribute
A3         3l                      gather to a single process then broadcast
B          (9 + te)N^2 + l         local update and reduce
C1         (p - 1)tOR              total exchange of results, then local reductions
C2         (log p - 1)(l + tOR)    tree reduction

where te is the cost of a local update, tOR is the time taken to compute a logical OR, and N is the length of the side of the world region held by each processor. In A1, all workers in parallel exchange the boundary elements (4(N + 1)g), requiring a single barrier (costing l). In A2, all the results are gathered on a single processor and redistributed giving each worker its halo; this requires much more data exchange (N^2 pg + 4(N + 1)g) and two synchronisations (2l). In A3, data are collected as in A2 and then broadcast to all the workers. This requires three synchronisations (3l), as optimal BSP broadcast is done in two steps [7]. The cost of the communication between different implementations is:

Pair      Communication Cost     Comment
(A1, B)   4(N + 1)g              send the halo of an N x N block
(A2, B)   N^2 pg + 4(N + 1)g     gather and distribute the halo
(A3, B)   3N^2 pg                gather and broadcast, no haloes
(B, C1)   (p - 1)g               total exchange
(B, C2)   2g
(C1, *)   0                      data already placed as needed
(C2, *)   2(log p - 1)g

where p is the number of target processors, and N the length of side of the block of World on each processor. The optimal solution depends on the values of g, l and s of the target parallel computer. In general, for this example, A1 is better than A2 and A3, but C1 is better than C2 only for small numbers of processors or networks with large values of g. For instance, if we consider a CRAY T3D (with values of g ≈ 0.35 µs/word, l ≈ 25 µs and s ≈ 12 Mflops as in [7]), C1 is faster than C2 when p ≤ 128, so the optimal implementation is A1-B-C1. For larger numbers of processors, A1-B-C2 is the best implementation. For architectures with slower barriers (Convex Exemplar, IBM SP2 or Parsytec GC) the best solution is always A1-B-C1.
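Only the choice between C1 and C2 depends on the machine parameters here. The sketch below is our own Python, not part of the paper; it adds the connection cost from the second table to the computation cost from the first for each termination variant. The value of tOR and the exact barrier terms are assumptions, so the output only illustrates the trend, not the published measurements.

    from math import log2

    def c1_cost(p, g, l, t_or):
        # (B,C1) total exchange + local reductions; one barrier assumed
        return (p - 1) * g + (p - 1) * t_or + l

    def c2_cost(p, g, l, t_or):
        # (B,C2) and (C2,*) data movement + tree reduction of log p - 1 stages
        stages = int(log2(p)) - 1
        return 2 * g + 2 * stages * g + stages * (l + t_or)

    g, l, t_or = 0.35, 25.0, 0.5        # T3D-like figures, microseconds (assumed)
    for p in (16, 64, 128, 256, 512):
        print(p, round(c1_cost(p, g, l, t_or), 1), round(c2_cost(p, g, l, t_or), 1))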
5 Related Work
The optimisation problem described here has, of course, been solved pragmatically by a large number of systems. Most of these use local heuristics and a small number of program and architecture parameters, and give equivocal results [4, 3, 2]. Unpublished work on P3L compiler performance has shown that effective optimisation can be achieved, but it requires detailed analysis of the
target architecture and makes compilation time-consuming. The technique presented here is expected to replace the present optimisation algorithm. A number of global optimisation approaches have also been tried. To [8] gives an algorithm with complexity n2a to choose among block and cyclic distributions of data for a sequence of data-parallel collective operations. Rauber and Rünger [6] optimise data-parallel programs that solve numerical problems. Different implementations are homogeneous families of algorithms with tunable algorithmic parameters (the number of iterations, the number of stages, the size of systems), and the cost model is a simplification of LogP.
6 Conclusions
The contribution of this paper is twofold: a demonstration of the usefulness of the perspective offered by the BSP cost model in simplifying a complex problem to the point where its crucial features can be seen; and then the application of a shortest-path algorithm for finding optimal implementations of skeleton and data-parallel programs. The known accuracy of the BSP cost model in practice reassures us that the simplified problem dealt with here has not lost any essential properties. Reducing the problem from a global problem with many variables to a one-dimensional problem that can be optimised sequentially makes a straightforward optimisation algorithm, with linear complexity, possible.

Acknowledgement. We gratefully acknowledge the input of Barry Jay and Fabrizio Petrini in improving the presentation of this paper.
References 1. B. Bacci, B. Cantalupo, M. Danelutto, S. Orlando, D. Pasetto~ S. Pelagatti, and M. Vanneschi. An environment for structured parallel programming. In L. Grandinetti, M. Kowalick, and M. Vaitersic, editors, Advances in High Performance Computing, pages 219-234. Kluwer, Dordredlt, The Netherlands, 1997. 2. Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, and M.-Y. Wu. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Journal of Parallel and Distributed Computing, 21(1):15-26, April 1994. 3. S. Ciarpaglini, M. Danelutto, L. Folchi, C. Manconi, and S. Pelagatti. ANACklCTO: a template-based p31 compiler. In Proceedings of the Seventh Parallel Computing Workshop (PCW '97), Australian National University, Canberra, 1997. 4. M. Danelutto, F. Pasqualetti, and S. Pelagatti. Skeletons for data parallelism in p31. In C. Lengauer, M. Griebl, and S. Gorlatdl, editors, Proc. o] EURO-PAR '97, Passau, Germany, volume 1300 of LNCS, pages 619-628. Springer-Verlag, August 1997. 5. S. Pelagatti. Structured development of parallel programs. TaylorSzFrancis, London, 1997. 6. T. Rauber and G. Rfinger. Deriving structured parallel implementations for numerical methods. The Euromicro Journal, (41):589 608, 1996. 7. D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249-274, 1997. 8. H.W. To. Optimising the Parallel Behaviour of Combinations of Program Components. PhD thesis, Imperial College, 1995.
A Parallel Multigrid Skeleton Using BSP
Femi O. Osoba and Fethi A. Rabhi
Department of Computer Science, University of Hull, Hull HU6 7RX
{B.O.Osoba, F.A.Rabhi}@dcs.hull.ac.uk
Abstract. Skeletons offer the opportunity to improve parallel software development by providing a template-based approach to program design. However, due to the large number of architectural models available and the lack of adequate performance prediction models, such templates have to be optimised for each architecture separately. This paper proposes a programming environment based on the Bulk Synchronous Parallel (BSP) model for multigrid methods, where key implementation decisions are made according to a cost model.
1 Introduction
Programming environments and CASE tools are increasingly becoming popular for developing parallel software. Skeleton-based programming environments offer known advantages of intellectual abstraction and development of structured programs [1,9]. Most skeleton-based systems are implemented at a low level using functional languages [3,12] or imperative languages with parallel constructs, e.g. PVM/C, data-parallel languages etc. [1,9]. However such models suffer from having no simple cost calculus, thereby hindering the functionality of the skeleton. To generate the required target code for different architectures, such skeleton-based systems have to be optimised for each architecture. This paper describes a skeleton-based programming environment that is implemented at the low level with a programming model that possesses a cost calculus, namely the Bulk Synchronous Parallelism (BSP) model. The next section presents our case study skeleton and Section 3 presents an overview of the proposed programming environment. The next sections present an overview of the user interface, the specification analysis phase and the code generation process. Finally the last section presents some conclusions and future directions of our work.
2 A Case Study: Multigrid Methods
Most scientific computing problems represent a physical system by a mathematical model equation. Presently we are considering a model differential equation of the form ∇·K∇u = f in the 2D space. To make it suitable for a computer solution, the continuous physical domain of the system being modelled is discretised and this leads to a set of linear equations of the form Ax = b with
n unknowns, which needs to be solved for. A particular class of methods for solving such equations is known as multigrid (MG) methods. Such methods have been selected as a case study on the basis of their usefulness in an important application area (CFD) and their similarity with SIT algorithms which were the focus of earlier work [9,12]. The principle behind a multigrid method is to use a relaxation method (e.g. Gauss-Seidel or Jacobi) on a sequence of m discretisation grids G^1 ... G^m (G^1 represents the finest grid and G^m the coarsest). By employing several levels of discretisation, the multigrid method is able to accelerate the convergence rate of the relaxation method on the finest grid. One iteration of the MG method from finest to coarsest grid and back is called a cycle. During a cycle, the different stages in an MG algorithm are relaxation, which involves the use of an iterative method on the current grid level; restriction, which involves the transfer of the residual from one grid to the next coarser grid; coarsest grid solution, which provides an exact solution on the coarsest grid; and interpolation, which is concerned with the transfer of the solution obtained from a coarser grid to the next finer grid. Additional details on multigrid methods can be found in [2].
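For readers unfamiliar with the method, the following small sequential Python/NumPy sketch (ours, not code produced by the environment) makes the relaxation, restriction, coarse solve and interpolation stages of one V-cycle concrete for a 1-D model problem; the grid size, smoother, weighting and cycle count are arbitrary illustrative choices.

    import numpy as np

    def jacobi(u, f, h, sweeps=3, omega=2/3):
        """Weighted-Jacobi relaxation for -u'' = f on a 1-D grid with spacing h."""
        for _ in range(sweeps):
            u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (u[:-2] + u[2:] + h*h*f[1:-1])
        return u

    def residual(u, f, h):
        r = np.zeros_like(u)
        r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
        return r

    def v_cycle(u, f, h):
        """One V-cycle: relax, restrict the residual, recurse, interpolate, relax."""
        u = jacobi(u, f, h)
        if u.size <= 3:                       # coarsest grid: a few extra sweeps
            return jacobi(u, f, h, sweeps=50)
        r = residual(u, f, h)
        rc = r[::2].copy()                    # full-weighting restriction
        rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
        ec = v_cycle(np.zeros_like(rc), rc, 2*h)
        e = np.zeros_like(u)                  # linear interpolation of the correction
        e[::2] = ec
        e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
        return jacobi(u + e, f, h)

    n = 2**7
    x = np.linspace(0.0, 1.0, n + 1)
    f = np.pi**2 * np.sin(np.pi * x)          # exact solution sin(pi x)
    u = np.zeros_like(x)
    for cycle in range(10):
        u = v_cycle(u, f, 1.0 / n)
    print("max error:", np.abs(u - np.sin(np.pi * x)).max())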
3 Overview of a Skeleton-Based Programming Environment
The overall structure of the system is very similar to other skeleton-based programming environments [9] as figure 1 illustrates. The primary aim is separating the role of the user and the system developer. The user decides on how the multigrid method should be used and what parameters are to determine the optimal performance irrespective of the underlying architecture, while the system developer addresses issues concerned with generating efficient code and determining suitable or recommended parameters for different architectures based on performance prediction and cost modelling. The main components of the system as shown in figure 1 are described in turn in the next sections.
3.1 User Interface
The role of the User Interface is to allow the user to enter and edit the parameters of the multigrid algorithm. The User Interface also displays the results of the computation which consist of a visual representation of the solution. Figure 2 shows three sets of parameters: multigrid parameters, physical parameters and
implementation parameters.

Multigrid parameters. The multigrid parameters include the number of grids, the relaxation method, the stencil (e.g. 4 or 8 points) and the cycling strategy (V or W cycles). Another parameter is the number of relaxation iterations performed at various grid levels. Some parameters (e.g. the relaxation method) are expected to be changed after the Specification Analyser has been invoked. When selecting values, the user primarily attempts to reduce the overall execution time by minimizing the number of cycles required to converge to the solution.
Fig. 1. Overall description of the system.

Fig. 2. User Interface parameters: multigrid parameters (number of grids, number of points, relaxation method, stencil, number of iterations, cycling strategy), physical parameters (initial data, boundary data, number of cycles), and implementation parameters (BSP architecture (p, g, l), partitioning strategy, tiles shape).
Physical parameters. These parameters are dependent on the physical attributes of the problem, e.g. the constant K describes the anisotropy in the problem.

Implementation parameters. These parameters represent the characteristics of the parallel implementation. First, the BSP architecture is specified as a triplet (p, g, l) [6]. Another parameter represents the partitioning strategy, which is generally based on grid partitioning techniques. Finally, the tiles shape parameter refers to the shape of sub-partitions (e.g. square or rectangular tiles).
3.2 Specification Analyser
The role of the Specification Analyser is to predict the optimal value of some of the parameters given a minimal subset of values provided by the user. These predictions are made based on estimates on the computation and communication requirements for parallel execution. We base our domain partitioning theoretical cost model on a simple analytic model presented in [11]. Because BSP removes the notion of network topology [6], the goal is to build solutions that are optimal
with respect to total computation, total communication, and total number of supersteps [6]. For our system this is represented by a parameter set. Each set represents one variation of the algorithm. A parameter set consists of Number of Cycles (Cy), Relaxation Method (RM), Stencil type (St), and Tile shape (Ts). Designing a particular program then becomes a matter of choosing a parameter set that is optimal for the range of machine sizes envisaged for the application. Our BSP theoretical cost model for a standard MG algorithm for a regular grid is of the form:
Cy [ C·(n^2/p) + g·( n·(1/tr + 1/tc)·M ) ] + l·S      (1)
where C is the total number of computation steps, M is the total amount of communication, p is the number of processors (and also the number of tiles), tr is the number of tiles in a row, tc the number of tiles in a column, and S is the total number of supersteps. C is a function of (RM, St, n^2, p), M is a function of (Ts, RM, St) and S is a function of (RM, Ts). So for a given number of cycles (Cy), the analyser produces a parameter set (Cy, RM, St, Ts) which results in the lowest execution time for the particular architecture.
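A minimal sketch of how the analyser could use (1): enumerate candidate parameter sets, evaluate the model, and keep the cheapest. All tables and numbers below are invented placeholders; only the shape of formula (1) is taken from the text.

    from itertools import product

    def mg_cost(Cy, C, M, S, n, p, tr, tc, g, l):
        """Cost model (1): Cy * [ C*n^2/p + g * n*(1/tr + 1/tc) * M ] + l * S."""
        return Cy * (C * n * n / p + g * n * (1.0 / tr + 1.0 / tc) * M) + l * S

    # Hypothetical tables mapping a parameter set to (C, M, S, Cy) -- invented.
    candidates = {
        # (RM, St, Ts)                      C    M   S   Cy
        ("GaussSeidel", 5, "square"):      (12,  4,  3,  8),
        ("Jacobi",      5, "square"):      (10,  4,  3, 14),
        ("GaussSeidel", 9, "rectangular"): (16,  6,  4,  6),
    }

    def tiles(p, shape):
        return (int(p ** 0.5), int(p ** 0.5)) if shape == "square" else (p, 1)

    n, p, g, l = 1024, 16, 1.2, 800.0
    best = min(candidates.items(),
               key=lambda kv: mg_cost(kv[1][3], kv[1][0], kv[1][1], kv[1][2], n, p,
                                      *tiles(p, kv[0][2]), g, l))
    print("chosen parameter set:", best[0])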
3.3 The Code Generator
The role of the Code Generator is to produce BSP code according to the current values of the parameters. The Code Generator uses a set of reusable library modules which form the MG Module Library and the BSP library (BSPlib) [4] for handling communication and synchronisation. The MG modules are coded in C and implement each of the phases identified in Section 2. For further details on the multigrid programming environment, the reader is referred to the Ph.D thesis in [8].
4 Conclusion and Future Work
This paper described a skeleton-based programming environment for multigrid methods which generates BSP code. The user is offered an interface that abstracts from the underlying hardware thus ensures portability and intellectual abstraction. By combining the skeletons idea with BSP, accurate analysis and predictions of the performance of the code on architectures can be made. Since the system is aware of the global properties of the underlying BSP architecture, the system can decide which aspects of the MG algorithm may need to be changed to provide optimal performance and then performs the code generation. Future work will refine the performance prediction model according to practical results, develop a systematic framework for automatic code generation and increase the parameter set to cater for more realistic problems. Finally, we will also investigate a suitable notation for specifying and modifying parameters and evaluate the proposed system with real users who work with such methods.
5 Acknowledgements
We would like to thank members of the BSP team at Oxford University Computing Laboratory, U.K., for their continuous support on the use of the BSP library, and also for making the SP-2 available to us.
References 1. G.H. Botorog and H. Kuchen, Algorithmic skeletons for adaptive multigrid methods, In Proceedings of Irregular '95, LNCS 980, Springer Verlag , 1995, pp. 27-41. 2. J.H. Bramble, Multigrid methods, Pitman Research Notes in Mathematics, Series 294, Longman Scientific & Technical, 1993. 3. J. Darlington, A. Field, P. Harrison,P. Kelly, D. Sharp, Q. Wu, R. While, Parallel Programming using Skeleton Functions, In Proceedings o] PARLE '93, LNCS, Munich, Springer Verlag, June 1993. 4. M. W. Goudreau, J.M.D. Hill, K. Lang, W.F. McColl, S. B. Rao, D.C. Stefanescu, T. Suel, and T. Tsantilas, A proposM for the BSP worldwide standard library, July 1996, available on WWW h t t p ://wuw. bsp-worldwide, org/. 5. O.A. McBryan, P.O. Frederickson, J. Linden, A. Schuller, K. Solchenbach, K. Stuben, C.A. Thole, and U. Trottenberg, Multigrid methods on parallel computers - a survey of recent developments, IMPACT Comput. Sci. Eng., vol, 3, pp. 1-75, 1991. 6. W.F. McColl, Bulk synchronous parallel computing, In Abstract Machine Models /or Highly Parallel Computers, J.R. Davy and P.M. Dew (eds), Oxford University Press, 1995, pp. 41-63. 7. M. Nibhanupudi, C. Norton, and B. Szymanski, Plasma Simulation On Networks of Workstations using the Bulk Synchronous Parallel mode, in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Athens, GA, November 1995. 8. B.O. Osoba, Design o] a Parallel Multigrid Skeleton-Based System using BSP, Ph.D thesis in preparation in the Department of Computer Science, University of Hull, 1998. 9. P.J. Parsons and F.A. Rabhi, Generating parallel programs from paradigm-based specifications, to appear in the Journal of Systems Architectures, 1998. 10. F.A. Rabhi, A Parallel Programming Methodology Based on Paradigms, In Transpurer and Occam Developments, P. Nixon (Ed.), lOS Press, 1995, pp. 239-252. 11. P. Ramanathan and S. Chalasani, Parallel multigrid algorithms on CM-5. In IEE Proc. Computers and Digital Techniques, vol. 142, no 3, May 1995. 12. J. Schwarz and F.A. Rabhi, A skeleton-based implementation of iterative transformation algorithms using functional languages, In Abstract Machine Models ]or Parallel and Distributed Computing, M. Kara et al. (eds), IOS Press, 1996,
Flattening Trees
Gabriele Keller (1) and Manuel M. T. Chakravarty (2)
(1) Fachbereich Informatik, Technische Universität Berlin, Germany, keller@cs.tu-berlin.de
(2) Inst. of Inform. Sciences and Electronics, University of Tsukuba, Japan, chak@is.tsukuba.ac.jp
Abstract. Nested data-parallelism can be efficiently implemented by mapping it to flat parallelism using Blelloch & Sabot's flattening transformation. So far, the only dynamic data structure supported by flattening is vectors. We extend it with support for user-defined recursive types, which allow parallel tree structures to be defined. Thus, important parallel algorithms can be implemented more clearly and efficiently.
1 Introduction
The flattening transformation of Blelloch & Sabot [6, 4] implements nested data-parallelism by mapping it to flat data-parallelism. Compared to flat parallelism, nested parallelism allows algorithms to be expressed on a higher level of abstraction while providing a language-based performance model [5]; in particular, algorithms operating on irregular data structures and divide-and-conquer algorithms benefit from nested parallelism. Efficient code can be generated for a wide range of parallel machines [2, 11, 12, 10]. However, flattening supports only vectors (i.e., homogeneous, ordered sequences) as dynamic data structures. Thus, important parallel algorithms, like hierarchical n-body codes and other adaptive algorithms based on tree structures, are awkward to program: the tree structure has to be mapped to a vector structure, implying explicit index calculations, to keep track of the parent-child relation, and leading to a suboptimal data distribution on distributed-memory machines.

In this paper, we propose user-defined recursive types to tackle the mentioned problems. We extend flattening such that it maps the new structures to efficient, flat data-parallel code. Our extension fits easily into existing formalizations and implementations of flattening; in particular, the optimization techniques of previous work [11, 12, 10, 7, 9] remain applicable. This paper makes the following three main contributions: (1) It demonstrates the usefulness of recursive types for nested data-parallel languages (Section 2), (2) it formally specifies our extension of flattening including user-defined recursive types (Section 3), and (3) it provides experimental results gathered with the resulting code on a Cray T3E (Section 4). Regarding point (2), as a side-effect of our extension, we contribute to a rigorous specification of flattening by formalizing the instantiation of polymorphic primitives. Thereby, we also introduce a new kind
of primitives, so-called chunkwise operations, for more efficient data redistribution on distributed-memory machines. We use the functional language NESL [5] throughout this paper, but the discussed techniques also work for imperative languages [1]. Many details and discussions have been omitted in this paper due to shortage of space; this material can be found in [8]. Section 2 discusses the benefit of recursive types for tree-based algorithms in a purely vector-oriented language. Section 3 formalizes our extended flattening transformation. Section 4 presents benchmarks. Finally, Section 5 concludes.
2 The Problem: Encoding Trees by Vectors
NESL [5] is a strict functional language featuring nested vectors as its central data structure. In addition to built-in parallel operations on vectors, the apply-to-each construct is used to express parallelism. In its general form {e : x1 in e1, ..., xn in en | f} we call e the body, the xi in ei the generators, and f (which is optional) the filter. The body e is evaluated for each element of the vectors ei and the result is included in the result vector if f evaluates to T (true); the vectors ei are required to be of equal length and are processed in lock-step. For example, {x + y : x in [1, -2, 3], y in [4, 5, 6] | x > 0} evaluates to [5, 9]. Nested parallelism occurs where parallel operations appear in the body of an apply-to-each, e.g., {plus_scan (x) : x in xs}, which for xs = [[1, 2, 3], [4], [5, 6]] yields [[0, 1, 3], [0], [0, 5]]. The built-in plus_scan is a prescan using addition [4].

The implementation of a tree-based algorithm in such a language implies representing trees by nested vectors, obscuring the code with explicit index calculations, to keep track of the parent-child relation. Moreover, in an implementation based on flattening, these nested vectors are represented by a set of flat vectors in the target code. All data elements of the tree are mapped to a single vector and are uniformly distributed to achieve load balancing. This, however, leads to superfluous redistributions as those algorithms usually traverse the trees breadth-first, i.e., level by level; all nodes on one level are processed in parallel.

We illustrate these problems at the example of Barnes & Hut's hierarchical n-body code [3]. It minimizes the number of force calculations by grouping particles hierarchically into cells according to their spatial position. The hierarchy is represented by a tree. This allows approximating the accelerations induced by a group of particles on distant particles by using the centroid of that group's cell. The algorithm has two phases: (1) The tree is constructed from a particle set, and (2) the acceleration for each particle is computed in a down-sweep over the tree. The following NESL function outlines the tree construction:

function bhTree (ps) : [MassPnt] -> [Cell] =
  if #ps == 1 then [Cell (ps[0], [])]        -- cell has only one particle
  else let <split particles ps into spatial groups pgs>
           subtrees = {bhTree (pg): pg in pgs};
           children = {subtree[0] : subtree in subtrees};              (*)
           cd       = centroid ({mp : Cell (mp, is) in children});
           sizes    = {#subtree : subtree in subtrees}                 (*)
       in [Cell (cd, sizes)] ++ flatten (subtrees)  -- whole tree as flat vector (*)

Fig. 1. Example of a Barnes-Hut tree and its representation as a vector.
We represent the tree as a vector of cells. Each cell of the form Cell (mp, szs) corresponds to a node of the tree and is a pair of a mass point mp and the sizes of the subtrees szs (tuples can be named in NESL). Given such a vectorized tree, the acceleration of a set of mass points mps is computed by
(.)
(*)
It computes the acceleration for the mass points in farMps directly (using the function accel) and recurses into the tree for those in closeMps. T h e function drop omits the first element of a vector and partition forms a nested vector according to the lengths passed in the second argument (it is the same as 7~ in the next section). The type Vec represents 'vectors' in the sense of physics. The lines marked by (*) in the functions are (partially) artifacts of maintaining the tree as a vector. Figure 1 depicts the grouping of an exemplary particle distribution and the corresponding tree. The tree is both built and traversed level by level, i.e., all nodes in one level of the tree are processed in a parallel step. Let us consider the d a t a layout (over two processors) for the example tree in Figure 1. To ensure proper load balancing, all cells of the already constructed subtrees have to be redistributed in each recursive step of b h T r e e . Similarly, a c c e l s , while descending the tree, has to redistribute those cells t h a t correspond to one level of the tree. We will quantify these costs experimentally in Section 4. It should be clear that it is more suitable to store the nodes of the tree in a distinct vector for each level of the tree, and then, to chain these vectors to represent the whole tree. At the source level, such a structure corresponds to regarding each node as being composed from some node d a t a plus a vector of tree nodes that have exactly the same structure; in other words, we need a recursive type. For our example, we can use d a t a t y p e Node (MassPnt, [Node]). Then, we simplify the function b h t r e e as follows: We omit computing e h i l d r e t t and
712 sizes, we compute cdby centroid ({mp : Node (rap, chs) in trees}), and change the body of the let into Node (cd, subtrees). In accels, computing subtrees becomes superfluous, and closeAcss is computed by accels (tree, closeMps) : tree in subtrees} where Node (cd, snbtrees) = tree. Generally, we extend NESL with named tuples that refer to themselves, but only in the element type of a vector (to avoid infinite types), as we can terminate recursion only by empty vectors--there are no arbitrary sum types. Handling sum types efficiently in flattening seems much harder than recursive types. 3
Flattening
Trees
by Program
Transformation
After the source language extension, we proceed to the implementation of userdefined recursive types by flattening. The flattening transformation serves three aims: First, it replaces all nested parallel expressions by flat parallel expressions that contain the same degree of parallelism, second, it changes the representation of data structures such that vectors contain only elements of base type, and third, it replaces all polymorphic vector primitives with monomorphic instances. In this section, we begin by introducing the flat data-parallel kernel language FKL, which is the target language of the first part of the flattening transformation. We continue with a discussion of the target representation of data structures, and then, describe an instantiation procedure, which implements the change of the representation of data structures as well as the generation of all necessary monomorphic instances of the primitives. Due to changing the representation of data structures, the instantiation of the polymorphic primitives becomes technically challenging--especially so, in the presence of recursive types. The representation of recursive types together with the instantiation procedure are the central technical contributions of this paper. Although the instantiation of the polymorphic primitives--excluding recursive types--is already implicit in previous work, it was never described in detail (for example, Blelloch [4] merely gives some examples and leaves the non-trivial induction of a general algorithm to the inclined reader); in particular, we provide the first formal specification. Our treatment also leads to more efficient code on distributed-memory machines than previous approaches (which concentrated on vector and shared-memory machines). A central idea of our method is the fact that only the primitive vector operations actually access or manipulate elements of nested vectors. Therefore, we can regard nested vectors as an abstract data type with some freedom in the concrete implementation. We will not discuss the first step of the flattening, as it is already detailed in previous work [4, 11, 10] and not affected by the addition of recursive t y p e s - however, a complete specification of the flattening transformation can be found in an unabridged version of this paper [8]. 3.1
T h e Flat K e r n e l L a n g u a g e
A kernel language (FKL) program consists of a list of declarations produced by the rule D - + V ( V 1 , . . . , V~) = Ewith variables V and expressions produced by
713 E -+ C [ V I V (El . . . . , E~)I let V = Ea in E21 if E1 t h e n }~2 else Ezl [E~,..., E~] where C are constants. We assume programs are typed, with types from T -+ Int I Bool I V I ( T ~ , . . . , T~) ] [T] I t~V.T
For brevity, we only have Int and Bool as primitive types. In a recursive type # x . T , all occurrences of x in T must be within element types of vectors (to get a finite type). For example, the type Node of Section 2 is represented as p x . ( M a s s P n t , Ix]), where MassPnt abbreviates the tuple of a mass point. T h e second c o m p o n e n t of FKL are its primitive operations. A m o n g the most i m p o r t a n t are the usual arithmetic and logic operations as well as construction of tuples r " : c~ x . . . x ~,, -+ ( c ~ , . . . , c~) and the corresponding i : ( c ~ , . . . , c~) --+ cq (the function type is given to the right of projections ~r~ the colon). T h e vector operations include operations length # : [c~] --+ Int, concatenation -H- : [c~] x [~] --+ [c~], and indexing i n d : [c~] x Int --+ o~. Moreover, we have distribution dist : c~ x Int -+ [c~], where dist (a, n) yields [ a , . . . , a] with length n, p e r m u t a t i o n p e r m : [cu] x [Int] --+ [c~], where perm (as, is) permutes xs according to the index vector is, packing p a c k : [c~] x [Bool] --+ [c~], where pack (xs,fs) removes all elements from xs t h a t correspond to a false value in fs. Furthermore, there are families of reduction | : [r] --+ r and prescan | : [r] --+ Iv] functions for associative binary primitives | operating on d a t a of basic type r. Finally, FKL contains three functions t h a t form the basis for handling nested vectors: U : [[c~]]--+ [c~] correspond to + + - r e d u c e and removes one level of nesting, e.g., 5c ([[1, 2, 31, ~, [4, 5]]) = [1, 2, 3, 4, 5]; $ : [[a]] --+ [Int] corresponds to { # x s : xs +-- xss} and returns the toplevel nesting structure, e.g., 8 ([[1, 2, 3], ~, [4, 5]]) = [3, 0, 2]; and 7~: [a] x [Int] --+ [Jail reconstructs a nested vector from the results of the previous two function, e.g., 7) ([1,2, 3, 4, 5], [3, 0, 2]) = [[1, 2, 3], ~, [4,511. For each nested vector xs, we have 7) (U (~s), $ (xs)) = xs. We write the application of primitives like ~ infix. Some of the primitives are p o l y m o r p h i c (those with c~ in the type) and, as said, we discuss their instantiation later, but we assume that in an I?KL p r o g r a m all p o l y m o r p h i s m in user-defined functions is already removed (by code duplication and type speciMization). To compensate the lack of general nested parallelism (i.e., no apply-to-each construct), FKL supports all primitive functions p : T1 x 9 9 9x T,~ -+ T in a vectorized form pt : [T~] x . . . x [T~] --+ [T], which applies p in parallel to all the elements of its argument vectors (which must be of the same length). For example, we have [1,2, 3] +tz, t [4, 5, 6] = [5, 7, 9]. In general, a primitive and its vectorized form relate through p t ( x s l , . . . , x ' s ~ ) = { p ( x l , . . . , x ~ ) : xl +- x s l , . . . , x , +-- xs~}. Note, the part of the transformation generating FIlL guarantees t h a t nothing like ( p t ) t is needed. T h e next subsection introduces two additional sets of primitives: One handles recursive types and the other, although not strictly necessary, handles some operations on nested vectors more efficiently. We delay the discussion of these primitives as it benefits from knowing the target d a t a representation. We choose FKL as the target language as there are optimizing code-generation techniques mapping it on different parallel architectures [2, 7, 9].
714 3.2
Concrete Data Representation
Before presenting the instantiation procedure for polymorphic primitives, we discuss an efficient target representation of nested vectors, vectors of tuples, and vectors of recursive types. N e s t e d v e c t o r s . We can represent nested vectors of basic type using only fiat vectors by separating the data elements from the nesting structure. For example, [[1, 2, 3], D, [4, 5]] is represented by a pair consisting of the data vector [1, 2, 3, 4, 5] and the segment descriptor [3, 0, 2] containing the lengths of the subvectors. The primitives 5r and S extract these two components, whereas 7) combines them into an abstract compound structure representing a nested vector. In general, this representation requires a single data vector and one segment descriptor per nesting level of the represented vector. Instances of polymorphic primitives operating on these nested vectors can be realised by combining primitives on vectors of basic type with the functions jr, S, and 7), as we will see below. This representation allows an optimized treatment of costly reordering primitives, such as permutation. Consider the expression perm t (as, is), where as is of type [[int]]. Both the data vector and the segment descriptor of as have to be permuted. We have perm' (as, is) = 7) (perm (Jr (as), is'), perm (S (as), is)), where perm operates on vectors of type lnt and is ~ is a new permutation vector computed from is and S (as). So, for example, perm' ([[3, 4, 5], [1, 3]], [1, 0]) = 7) (perm ([3, 4, 5, 1, 3], [2, 3, 4, 0, 1]), perm ([3, 2], [1, 0])). This scheme is expensive, because (a) a new index vector is' is computed for each level of nesting, and moreover, (b) the data vector is permuted elementwise, whereas the original expression allows to deduce that several continuous blocks of elements (the subvectors) are permuted, i.e., we lose information about the structure of communication. We can prevent this behaviour by employing an additional set of primitive functions, the so-called chunkwise operations distC, permC, arid indC. The operation distC gets a vector together with a natural number and yields a vector that repeats the input vector as often as specified by the second argument. The operations permC and indC both get three arguments: (1) a flat data vector, (2) a segment descriptor, and (3) an index vector or index position (depending on the function). They permute and index blocks of the data vector chunkwise, where the chunk size is specified by the segment descriptor. Their semantics is defined as permC (xs, s, is) = perm' (7) (xs, s), is) and indC (xs, s, i) = i n d ' (7) (zs, s), i), respectively, where perm' and ind' operate on vectors of type [[s]] when xs is of type Is] and s is a basic type. Their implementation using blockwise communication is straightforward. V e c t o r s o f t u p l e s . Vectors of tuples are represented by tuples of vectors. Ac-
cordingly, applications of vector primitives operating on such vectors are pulled inside the tuple, as proposed by [4]. R e c u r s i v e T y p e s . The most complicated case is the representation of vectors of recursive types by vectors of basic type in such a way that the nodes of each
715 level of the represented tree structure are stored in a separate vector (as discussed in Section 2). In Subsection 3.1 we required that, in a recursive type #x. T, each occurrence of x in T has the form [x]. Let us consider the possible contexts of these occurrences. If the outermost type constructor of T is a primitive type (e.g., Int), there is no recursion and we represent T as usual. But, if the outermost type constructor of T is a vector, we have [[x]], i.e., a nested vector of recursive type. Similar to nested vectors of basic type, we regard them as an abstract structure manipulated by the three functions Y, $, and P . Next, if the outermost type constructor of T is a tuple, we have to represent it as a tuple of vectors like in the non-recursive case, while treating the components of the resulting tuple in the same way as T itself. Finally, if the outermost type constructor of T is a recursive type of the form px'. T ~, no special treatment is necessary, apart from handling T ~ in the same way as T. Overall, any occurrence of a recursive type variable x (except in the root of a tree) is represented by [[x]], due to the propagation of vectors inside tuples and the requirement t h a t x occurs only as Ix]. Thus, these occurrences are manipulated by S , 3, and P . These functions allow us to represent a tree structure of potentially unbounded depth using only flat vectors by separating d a t a from nesting structure. Nevertheless, a problem remains: For nested vectors, the nesting depth of the data is statically known (due to strong typing), but not so for the depth of a recursive type. Hence, we get a sequence of segment descriptors of statically unknown length. Each of these is accompanied by the d a t a vector for one level of the tree. A value of recursive type #x. T is always terminated by e m p t y vectors of a recursive occurrence ac in T, but due to the propagation of vectors inside tuples, we need a special value to identify the recursion termination. Hence, we wrap the representation of vectors of recursive type into a maybe type (c~}, which is either @} (where x is of type c~) or (}. To process values of type (a}, we add two primitives (.?) : (a} --+ Bool and (.$) : (a} -+ a. Operation (.?) yields true if the argument has the form {x}, in this case, ('1") returns x (it is undefined, otherwise). The tree of Figure 1 is represented by (co, v) with v = (([c,, c2, c3, p,])),
3.3
Instantiation of Polymorphic Functions
Now, we are in a position to define the instantiation of the polymorphic uses of primitives. We denote the type of an instance of a primitive by annotating the type substituted for the type variable (α in the signatures from Subsection 3.1) as a subscript. For example, ++_[Int] has type [[Int]] x [[Int]] → [[Int]], and we have to generate instances for all occurrences apart from ++_Int and ++_Bool. The generic primitives, such as the tuple projections and the functions F, S, and P, are not instantiated, as their code is independent of the type instance.
The transformation algorithm. In the presence of recursive types, we cannot transform the program by replacing uses of polymorphic primitives with expressions that merely use primitives on basic type; instead, we add new declarations to a program, until all uses of primitives are either directly supported or covered
by a previously added declaration. The addition of a declaration may introduce primitives occurring at new instances, which in turn triggers the addition of declarations for these instances. We discuss the termination of this process later. Each equation says that if a primitive occurs at a type matching the left-hand side, add a declaration that is an instance of the right-hand side for that type. We start with the distribution dist, together with its chunkwise variant:
dist_(T1,...,Tn)(x, n) = (dist_T1(pi1(x), n), ..., dist_Tn(pin(x), n))
dist_μx.T(x, n)        = distC_μx.T(x, n)
dist_[T](x, n)         = P(distC_T(x, n), dist_Int(#x, n))
distC_μx.T(xs, n)      = if xs? then distC_T{μx.T/x}(xs$, n) else ⟨⟩
distC_[T](xs, n)       = P(distC_T(F(xs), n), distC_Int(S(xs), n))
The rules for tuples propagate the function inside. Vectors are distributed using the chunkwise version, of which the second rule demonstrates the handling of the maybe type ⟨α⟩. We omit the rules for tuples in the following, as they have the same structure as above. The next set of rules covers chunkwise concatenation:
xs ++_μx.T ys = if xs? then (if ys? then xs$ ++_T{μx.T/x} ys$ else xs) else ys
xs ++_[T] ys  = P(F(xs) ++_T F(ys), S(xs) ++_Int S(ys))
We continue with plain and chunkwise indexing. Note that indexing an empty vector of recursive type leads to an error as the index is out of bounds.
ind_μx.T(xs, i)     = if xs? then ind_T{μx.T/x}(xs$, i) else error
ind_[T](xs, i)      = indC_T(F(xs), S(xs), i)
indC_μx.T(xs, s, i) = if xs? then indC_T{μx.T/x}(xs$, s, i) else ⟨⟩
indC_[T](xs, s, i) = P(indC_T(F(xs), (+_Int)-reduce'(P(S(xs), s)), i), indC_Int(S(xs), s, i))
For indC_[T], the segment descriptor passed to the recursive call is computed by summing over the segments of the current level. For space reasons, we omit the rules for perm, pack, and combine. They are similar to ind, but, e.g., permuting an empty vector does not raise an error, but returns the empty vector. Finally, considering vectorized primitives, the rules for types (T1,...,Tn) and μx.T are as above; for [T], the vectorized version is generated by vectorizing the above rules; this vectorization is essentially the same as the first step of the flattening transformation that was mentioned in the beginning of the present section. The details of this and of handling perm, pack, and combine can be found in [8].
Termination. Considering the termination of the above process, we see that for each primitive p, the rules for the instances p_(T1,...,Tn) and p_[T'] decrease the structural depth of the type subscript. Merely the rules for p_μx.T may be problematic, as they recursively unfold the type in their right-hand side. However, in
the definition of the kernel languages, we required that type variables bound by μ occur only as the element type of a vector. So, for all occurrences where the type unfolding substitutes the recursive type, we get an expression that requires p_[μx.T]. If p is a chunk operation or ++, the right-hand side of an instance of the rule for p_[T'] with T' = μx.T requires p_μx.T, which is exactly the instance we started with and which, therefore, is already defined. If p is neither a chunk operation nor ++, it requires the corresponding chunkwise operation, and we have already discussed the termination for chunkwise operations.

Fig. 2. Absolute time and relative speed-up for Barnes-Hut on a Cray T3E (runtime in seconds and speedup, absolute for 20,000 particles and relative for 9,000 particles, plotted against the number of processors).
4
Experimental Results
We measured the accumulated runtime of the expressions in the let-binding of bhtree (Section 2) for the original NESL program using 15,000 particles and CMU-NESL [2] (on a single-processor, 200MHz Pentium/Linux machine), to quantify the overhead of mapping the tree structure to a vector. From the overall runtime of 1.75 seconds, 0.5 seconds (29%) are spent on the operations introduced by the mapping. We gathered this data on a sequential machine, due to inaccurate profiling information from CMU-NESL on the Cray T3E. The inefficiencies would even be worse in a parallel run, since these operations include algorithmically unnecessary reordering operations, which cause communication. To measure our proposed techniques for implementing recursive types, we timed an implementation of Barnes-Hut generated using the presented rules.¹ Figure 2 shows the timing of a single simulation step on a Cray T3E for 9,000 and 20,000 particles as well as the relative speedup and the absolute speedup (we only had access to 20 processors). The relative speedup is already quite close to the theoretical optimal speedup, but the absolute speedup shows that there is still room for improvement and we already know some possible optimizations. We did not compare CMU-NESL and our implementation on the Cray T3E directly, since our code generation techniques for the flat language already outperform CMU-NESL by more than an order of magnitude [9]. The code was 'hand-generated'; we are currently implementing a compiler.
5
Conclusion
After its introduction by Blelloch & Sabot, flattening has received considerable attention, but to the best of our knowledge, we are the first to extend flattening to tree structures. There are other approaches to implementing nested data-parallelism, and of course, there is a wealth of literature on trees in parallel computing, but we do not have the space to discuss this work here; some of it is discussed in [8]. Our extension is easily integrated into existing systems and the gathered benchmarks support the efficiency of the generated code.
Acknowledgements. We are indebted to Martin Simons and Wolf Pfannenstiel for fruitful discussions and helpful comments on earlier versions of this paper. Furthermore, we are grateful to the anonymous referees of EuroPar'98 for their valuable comments and the members of the CACA seminar (University of Tokyo) for lively discussions. The first author receives a PhD scholarship from the DFG (German Research Council). She thanks the SCORE group of the University of Tsukuba, especially Tetsuo Ida, for the hospitality while being a guest at SCORE.
References
1. P. K. T. Au, M. M. T. Chakravarty, J. Darlington, Yike Guo, Stefan Jähnichen, G. Keller, M. Köhler, W. Pfannenstiel, and M. Simons. Enlarging the scope of vector-based computations: Extending Fortran 90 with nested data parallelism. In W. K. Giloi, editor, Proc. of the Intl. Conf. on Advances in Parallel and Distributed Computing. IEEE Computer Society Press, 1997.
2. G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. In 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 1993.
3. J. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324, December 1986.
4. G. E. Blelloch. Vector Models for Data-Parallel Computing. The MIT Press, 1990.
5. G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85-97, 1996.
6. G. E. Blelloch and G. W. Sabot. Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing, 8:119-134, 1990.
7. S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. on Prog. Lang. and Systems, 15(3), 1993.
8. Gabriele Keller and Manuel M. T. Chakravarty. Flattening trees - unabridged. Forschungsbericht 98-6, Technical University of Berlin, 1998. http://cs.tu-berlin.de/cs/ifb/TechnBerichteListe.html.
9. G. Keller. Transformation-based Implementation of Nested Parallelism for Parallel Computers with Distributed Memory. PhD thesis, Technische Universität Berlin, Fachbereich Informatik, 1998. Forthcoming.
10. G. Keller and M. Simons. A calculational approach to flattening nested data parallelism in functional languages. In J. Jaffar, editor, The 1996 Asian Computing Science Conference, LNCS. Springer Verlag, 1996.
11. J. Prins and D. Palmer. Transforming high-level data-parallel programs into vector operations. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 119-128, San Diego, CA, May 19-22, 1993. ACM.
12. D. Palmer, J. Prins, and S. Westfold. Work-efficient nested data-parallelism. In
Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Processing (Frontiers 95). IEEE, 1995.
Dynamic Type Information in Process Types
Franz Puntigam
Technische Universität Wien, Institut für Computersprachen, Argentinierstr. 8, A-1040 Vienna, Austria. [email protected]
Abstract. Static checking of process types ensures that each object accepts all messages received from concurrent clients, although the set of acceptable messages can depend on the object's state. However, conventional approaches of using dynamic type information (e.g., checked type casts) are not applicable in the current process type model, and the typing of self-references is too restrictive. In this paper a refinement of the model is proposed. It solves these problems so that it is easy to handle, for example, heterogeneous collections. 1
Introduction
The process type model was proposed as a statically checkable type model for concurrent and distributed systems based on active objects [10, 11]. A process type specifies not only a set of acceptable messages, but also constraints on the sequence of these messages. Type safety can be checked statically by ensuring that each object reference is associated with an appropriate type mark which specifies all message sequences accepted by the object via this reference. The object accepts messages sent through different references in arbitrary interleaving. A type mark is a limited resource that represents a "claim" to send messages. It can be used up, split into several, more restricted type marks or forwarded from one user to another, but it must not be exceeded by any user. Support of using dynamic type information shall be added. But, checked type casts break type safety: An object's dynamic type is not a "claim" to send messages. This problem is solved by checking against dynamic type marks. Each object has better knowledge about its own state than about the other objects' states. The additional knowledge can be used in changing the self-references' type marks in a very flexible and still type-safe way. The rest of the paper is structured as follows: An improved version of the typed process calculus presented in [11] is introduced in Sect. 2. Support of using dynamic type information is added in Sect. 3, a more flexible typing of self-references in Sect. 4. Static type checking is dealt with in Sect. 5.
2
A Typed Calculus of Active Objects
The proposed calculus describes systems composed of active objects that communicate through asynchronous message passing. An object has its own thread
721 of execution, a behavior, an identifier, and an unlimited buffer of received messages. According to its behavior, an object accepts messages from its buffer, sends messages to other objects, and creates new objects. All messages are received and accepted in the same logical order as they were sent. We use following notation. Constants (denoted by x , y , z , . . . ) are used as message selectors and in types. Identifiers (u, v, w , . . . ) in the infinite set V are used as object and procedure identifiers as well as formal parameters. For each symbol e, g is an abbreviation of e l , . . . , en; e.g., $ is a sequence of constants. {g} denotes the smallest set containing g, and Igl the length of the sequence. For each s y m b o l . , ~.g stands for e l ' g , . . . , e,~.g, and ~.~ for e~.g~,..., e~.g~ (1~1 = I~1). Processes (p, q , . . . ) specify object behavior. Their syntax is: p
::~
S
::~---
~Tt
::~
Og ::~-~7-
::~
a
::~---
~o ::= # ::= t l o t
I pt
A selector {~} specifies a possibly empty, finite set of guarded processes (r, s , . . . ). Each si is of the form x(ft);p, where x is a message selector, g a list of formal parameters, and p a process to be executed if si is selected. An si is selectable if the first received message in the object's buffer is of the form x(&, 5), where (~, ~ are arguments to be substituted for the p a r a m e t e r s fi (lSz, ~l = Ifil); 5~ is a list of types and 5 a list of object and procedure identifiers. A process u . m ; p sends a message rn to the object with identifier u, and then behaves as p. A procedure definition u : = ((~)p:~); q introduces an identifier u of a procedure of type ~ and then behaves as q; the procedure takes fi as p a r a m e t e r s and specifies the behavior p. A process v$u(&, fi); p creates a new object with identifier v and then behaves as p; the new object behaves as u(&, ~). A call u(&, fi) behaves as specified by the procedure with identifier u. A conditional expression u=v ? p I q behaves as p if the identifiers u and v are equal, otherwise as q. There are two kinds of types, object types (Tr, t), c~, r , . . . ) and procedure types (~, ~b,... ). A type is denoted by a,/3, 3',. 99 if its kind does not matter. Furthermore, there are meta-types (p, v , . . . ) : "ot" is the type of all object types, "pt" the type of all procedure types, and %" the type of all types of any kind. T h e activating set [2] of an object type {g}[2] is a multi-set of constants, the behavior descriptor {fi} a finite collection of message descriptors (a, b, c, . . . ). A message descriptor x(~:fi, &)[~]~,[2] describes a message with selector x, type parameters fi of recta-types/~, and parameters of types c~. The fi can occur in c~. The message descriptor is active if the activating set [2] contains all constants in the multi-set [~] (in-set). When a corresponding message is sent (or accepted), the type is updated by removing the constants in the imset from the activating set and adding those in the multi-set [2] (out-set). Using type updating repeatedly
722 for all active message descriptors, the type specifies a set of acceptable message sequences. An expression .u{a}[2] is a recursive version of an object type. A type combination ~r + r specifies a set of acceptable message sequences that includes all arbitrary interleavings of acceptable message sequences specified by c~ and T. An identifier u can be used as type parameter. A procedure type (~:/5, 6~)7 specifies that a procedure of this type takes type parameters ~ of meta-types /5 and parameters of types &. Objects behaving according to the procedure accept messages as specified by T. For example, a procedure B := ((u){put(v_); {get(w); w.back(v); B(u)}}:~B) specifies the behavior of a buffer accepting "put" and "get" in alternation and sending "back" to the argument of "get" after receiving "get". The type of B is given by ~U =d~f (_u:t){put{u)[e]~,[t], get({back(u}[once]>[]}[once])[f]~,[e]}[e]. The type of a procedure specifying the behavior of an infinite buffer that accepts as many "get" messages as there are elements in the buffer can be given by (flBI = d e f (u__:t){put(u)~t,[f], get({back(u)[once]~,~}[once])[f]~,[]}~. Each underlined occurrence of an identifier binds this and all following occurrences of the identifier. An occurrence is free if it is not bound. Free(e) denotes the set of all identifiers occurring fl'ee in e. Two processes are regarded as equal if they can be made identical by renaming bound identifiers (a-conversion) and repeatedly applying the equations {~, s} = {~, s, s} and {~, s, ~} = {~, a, s} to selectors. Two types are regarded as equal if they can be made identical by renaming bound identifiers (a-conversion) and applying these equations:
: {a, ~, c} ~+r=r+c~
[2, y, ~] : [2, 9, y]
@+~)+r
9~_a = [,~_a/q~ {a}[2, 9] : {a}[2] (~ ~ Elf{a}) {~(~:~, ~}[~]~[9, 9'], a}[~] = {.<_~:,~,,~>[@>[.q,a}[~ (~' ~ {2}uEft{a}) {a}[2] + {~}[~] = {a, ~}[2,~] (2 6 Eft{5}; fl ~ Eff{~}) where Ey'f{~<_q:#,, '~,>[.h]~[~d,..., ~(_a{:,~{, ,~d[}d~>[.~d} : {:h} u . . . u { } d . Subtyping on meta-types is defined by ot < t and pt < t. Subtyping for types depends on an environment H containing typing assumptions of the forms u < a and ~ < u. The subtyping relation < on types and the redundancy elimination relation ~. on object types are the reflexive, transitive closures of the rules:
nu{u
/]rl-- {a,6-}[2]-{a}[2]
s
u}~-o,_< ~
(VcE {6}.3a E {a,}. c : ~<_~:,a,,~)[9, 9']~>[~, z'] A Z' ~ E f t { a } A : x @ , > , ~,)[@>[< Z"] A ,a < ~ ,,, ~r ~- ,~ < ~,)
(a < fl and c~ -- T are simplified notations for 0 l- a < fl and 0 l- c~ -- r.) The relation ~ is used for removing redundant message descriptors from behavior descriptors. Using these definitions we can show, for example, WB~ < ~oB.
723 A type {5}[~] is d e t e r m i n i s t i c if all message descriptors in {5} have pairwise different message selectors. For each deterministic type cr and message m there is only one way of updating ~ according to m. A s y s t e m configuration C contains expressions of the forms u ~+ (p, (n} and u ~-~/~}p:~. The first expression specifies an object with identifier u, behavior p and sequence of received messages ~ . The second expression specifies a procedure of identifier u. C contains at most one expression u ~-+ e for each u ~ %7. C[ul e l , . . . , u,~ ~-+ e,~] is a system configuration with expressions added to C, where Ul, . . . , u,~ are pairwise different and do not occur to the left of ~-+ in C. For each u E %/the relation -Y-> on system configurations (defined by the rules given below) specifies all possible execution steps of object u. ([~/,5]/is the simultaneous substitution of the expressions ~ for all free occurrences of ~ in f,)
c[~ ~ <{x/~);> a}, x(~, ~), <>] C[u F-+iv,m; p, rh>, v ~+ /q, rh')] C[u ~
(iv, <}, v ~-+
(v_::(i~>P:F);q, <)] "> c[u~+
C[u ~+ (~$w(a, ~>;p, <>] C[u ~
<~<~,~>, ,~>, ~ ~ i~>v:~] "> C[u~ i([a, ~/~];), <>, v ~/~>v:~] c[u ~ (v:~ ? p I q, ~>] --% c[u iv, ~>] C[u ~-~
→ denotes the closure of these relations over all u ∈ V, and →* the reflexive, transitive closure of →. →* defines the operational semantics of object systems.
3
Dynamic Type Information
It seems to be easy to extend the calculus with a dynamic type checking concept: processes of the form α ≤ v ? p | q check a type α against a type mark; an execution step is, for example,
C[u ↦ ⟨σ ≤ v ? p | q, m̄⟩] -u-> C[u ↦ ⟨q, m̄⟩] (if σ is not a subtype of v's type mark)
For example, assume that a reference w is of type mark St[f], where St =def {put⟨u:t, u⟩[e]▷[f], get⟨R[one]⟩[f]▷[e]} and R =def {back⟨v:t, v⟩[one]▷[]}. The process w.get⟨w'⟩; {back⟨u, v⟩; u ≤ St[e] ? v.put⟨St[f], v⟩; {} | {}} (w' denotes the object's self-reference) is type-safe. First, "get" is sent to w; then "back" is accepted. If the type mark of v is a subtype of St[e], v is put into v. The type used as argument of "put" is St[f] because the type is updated when sending "put". Type information is lost when types are updated. St[e] is a supertype of v's type mark u; but, put⟨u, v⟩ cannot be sent to v: So far it is not possible to specify that v's type mark is equal to u, except that u is updated. To solve the problem,
processes of the form α ≤ v ? u'; p | q are introduced: when the comparison succeeds, u' in p is replaced with a dynamically determined object type which specifies the difference between α and v's type mark. This version of the above process keeps type information: w.get⟨w'⟩; {back⟨u, v⟩; u ≤ St[e] ? u'; v.put⟨St[f] + u', v⟩; {} | {}}. The type St[f] + u' equals u after removing e and adding f to the activating set.
4
Typing Self-References
Without special treatment, static typing of self-references is very restrictive: If an object behaves according to a type τ, the type mark σ of a self-reference is a subtype of τ. Both σ and τ specify all messages that may return results of possibly needed services. In general, it is impossible to determine statically which services are needed. Hence, σ and τ usually specify much larger sets of message sequences than needed. This problem can be solved because each object has much more knowledge about its own state than other objects: Before a message with a self-reference as argument is sent to a server, this reference's type mark is selected so that it supports the messages returned by the server. At the same time the object's type is extended correspondingly. The object's type and self-reference's type mark are determined dynamically as needed in the computation. We extend the calculus: The distinguished name "self" can be used as an argument in procedures; "self" is replaced with a self-reference when the procedure is called. The fifth rule in the definition of -u-> has to be replaced with:
The type mark of an occurrence of "self" as argument is equal to the type of the corresponding formal parameter. (Elements of V ∪ {self} are denoted by o, ...) An example demonstrates the use of self and dynamic type comparisons:
w := (((u, u', v, w') u ≤ St[f] ? v.get⟨self⟩; {back⟨u, v⟩; w(u, u', v, w')} | w'(u, v))
       : (u:t, u':ot, u, (u:t, u)u')u');
...
The procedure w takes as parameters a type u, an object type u', an identifier v of type u and a procedure identifier w' of type (u:t, u)u'. If v is a non-empty store, an element is got from the store and w is called recursively with the element and its type as arguments. Otherwise w' is called with u and v as arguments. The type mark of self is R[one], as specified by St. The corresponding message "back" is accepted if the expression containing self is executed.
5
Static Type Checking
An important result is that the process type model with the proposed extensions supports static type checking. Type checking rules are given in Appendix A.
Theorem 1. Let C = {u1 ↦
6
Related Work
Much work on types for concurrent languages and models was done. The majority of this work is based on Milner's π-calculus [3, 4] and similar calculi. Especially, the problem of inferring most general types was considered by Gay [2] and Vasconcelos and Honda [14]. Nierstrasz [5], Pierce and Sangiorgi [8], Vasconcelos [13], Colaco, Pantel and Sallé [1] and Ravara and Vasconcelos [12] deal with subtyping. But their type models differ in an important aspect from the process type model: They cannot represent constraints on message sequences and ensure statically that all sent messages are acceptable; the underlying calculus does not keep the message order. The proposals of Nielson and Nielson [6] can deal with constraints on message sequences. As in the process type model, a type checker updates type information while walking through an expression. However, their type model cannot ensure that all sent messages are understood, and dealing with dynamic type information is not considered. The process type model can ensure statically that all sent messages are understood, although the set of acceptable messages can change. Therefore, this model is a promising approach to strong, static types for concurrent and distributed applications based on active objects. However, it turned out that the restrictions caused by static typing are an important difficulty in the practical applicability of the process type model. So far it was not clear how to circumvent this difficulty without breaking type safety. New techniques for dealing with dynamic type information (as presented in this paper) had to be found. The approach of Najm and Nimour [7] has a similar goal as process types. However, this approach is more restrictive and cannot deal with dynamic type information either.
The work presented in this paper refines previous work on process types [9-11]. The major improvement is the addition of dynamic type comparisons and a special treatment of self-references. Some changes to earlier versions of the process type model were necessary. For example, the two sets of type checking rules presented in [11] had to be modified and combined into a single set so that it was possible to introduce "self". Dynamic type comparisons exist in nearly all recent object-oriented programming languages. However, the usual approaches cannot be combined with process types because these approaches do not distinguish between object types and type marks and cannot deal with type updates. Conventional type models do not have problems with the flexibility of typing self-references.
7
Conclusions
The process type model is a promising basis for strongly typed concurrent programming languages. The use of dynamic type information can be supported in several ways so that the process type model becomes more flexible and useful. An important property is kept: A type checker can ensure statically that clients are coordinated so that all received messages can be accepted in the order they were sent, even if the acceptability of messages depends on the server's state.
References 1. J.-L. Colaco, M. Pantel, and P. Sall'e. A set-constraint-based analysis of actors. In Proceedings FMOODS '97, Canterbury, United Kingdom, July 1997. Chapman & Hall. 2. Simon J. Gay. A sort inference algorithm for the polyadic rr-caleulus. In Conference Record of the 20th Symposium on Principles of Programming Languages, January 1993. 3. Robin Milner. The polyadic w-calculus: A tutorial. Technical Report ECS-LFCS91-180, Dept. of Comp. Sci., Edinburgh University, 1991. 4. R. Milner, J. Parrow, and D. Walker. A calculus of mobile processes (parts I and II). Information and Computation, 100:1-77, 1992. 5. Oscar Nierstrasz. Regular types for active objects. ACM SIGPLAN Notices, 28(10):1-15, October 1993. Proceedings OOPSLA'93. 6. Flemming Nielson and Hanne Riis Nielson. From CML to process algebras. In Proceedings CONCUR '93, number 715 in Lecture Notes in Computer Science, pages 493-508. Springer-Verlag, 1993. 7. E. Najm and A. Nimour. A calculus of object bindings. In Proceedings FMOODS '97~ Canterbury, United Kingdom, July 1997. 8. Benjamin Pierce and Davide Sangiorgi. Typing and subtyping for mobile processes. In Proceedings LICS'93, 1993. 9. Franz Puntigam. Flexible types for a eoncmTent model. In Proceedings of the Workshop on Object-Oriented Programming and Models of Concurrency, Torino, June 1995.
727 10. Franz Puntigam. Types for active objects based on trace semantics. In Elie N a j m et al., editor, Proceedings FMOODS '96, Paris, France, March 1996. I F I P W G 6.1, Chapman & Hall. 11. Franz Puntigam. Coordination requirements expressed in types for active objects. In Mehmet Aksit and Satoshi Matsuoka, editors, Proceedings EUOOP '97, number 1241 in Lecture Notes in Computer Science, Jyvs Finland, June 1997. Springer- Verlag. 12. Antdnio Ravara and Vasco T. Vasconcelos. Behavioural types for a calculus of concurrent objects. In Proceedings Euro-Par '97, Lecture Notes in Computer Science. Springer-Verlag, 1997. 13. Vasco T. Vasconcelos. Typed concurrent objects. In Proceedings ECOOP'94, number 821 in Lecture Notes in Computer Science, pages 100-117. Springer-Verlag, 1994. 14. Vasco T. Vasconcelos and Kohei Honda. Principal typing schemes in a polyadic pi-calculus. In Proceedings CONCUR'93, July 1993.
A
Type Checking Rules
/7 l- {fi}[~)] < 7 V l < i < n 9
~)i:6zi]I- (p{:{a}[S{]) / I , j ~ }- ({~1(~1, ~ l ) ; P l , . . . ,Xn(~__~.,V__n);pn}:T) H ~- er _< {a}[~] -4- ~ n , F [ u : { a } [ 0 ] + ~] t- (Er:~,&([~/~]~/),p:T) O:6:] ~- (r 0 /7,1"[u:r ~- (q:T) Lr, F ~- ((u := ((~, ~)p:,~_r q):~)
/7,0[~:zS,~z:~,u:r163
(t) (2)
SEL
SEND
(3)
DEF NEW CALL
n, rb:~, ~,:r ~ ( v )
n, rb:~, ~:r ~ (q:~)
//, P[u:/~] /- ((u
HU{ct<_u},P[u:t~]F-(p:r) /7, P[u:l~]b-(a:t,,q:T} (uf~Free(a)) /7, P[u:/~] b- ((a<_u ?p[q):v) H U {u < a}, P[u:ot, v:ot] b (p:r) //, r[u:ofl] f- (a:ot, q:T) (U ~ Free(a)) //, Flu:or] ~- <(u
SUB2 SPLIT
n ~ ~ _< ~ + ~- /L rb:d ~ (~:~)
osJ
/7, F ~- ff:~, p:a + r)
SELS
H,/7 ~- (self:a, e:g,p:r}
n, Fb:r e
PROC
(~:~, ~:0)
n , r b : . ] ~ (~:,, ~:0) V l < i < n 9 F[u:ot, ill:/],] l- (c}i:t, {}:{}~) H, F I- (~:0) ot < v n , - ~ I-- (*~_{Xl(~I :/]1, (~I}[01]I>[Zl],... ,20,,(~_n:~ln,~n}[~]n]D[Z.n]}[~]]:%,C:~ H , F ~- (~r:ot, r:ot,e:O) ot < p //, F ~- ((0" + T):p, e:g) i7, P[u:pt, flit]] I- (6~:t, r:ot, {}:{}[]) H, F b (~:~) n , r ~ ((,~_(~:~, ~>~):., ~:~)
(1) Act{a}[O] : {x~(~:p,,c},)[]~[h]
.....
~(~:#~,c}~}[]~[~]}
pt _< u
TPAR OB']T1 OBJT2 PROCT
where
Act{}[Y:] : {] Act {x(~:t], &)[2]~[0'], ~}[2, ~] : {x(u-:t], &)[]~[0', 2], ~} (Act{~}[9, ~] = {~}) Act{x(fi:[~, c})[~?]v,[~)'],fi}[5] : Act{a}[2] (V2' 9 [2] # [2, ~']) and {fi}[t)] is deterministic. (2) a : x(u-:/~, 7)[k]~'[0] (3) r = (u_':t~,~}~
and
F = {~:~,~:qS,~b~:}]
Generation of Distributed Parallel Java Programs
Pascale Launay and Jean-Louis Pazat
IRISA, Campus de Beaulieu, F-35042 Rennes cedex
[email protected], [email protected]
Abstract. The aim of the Do! project is to ease the standard task of programming distributed applications using Java. This paper gives an overview of the parallel and distributed frameworks and describes the mechanisms developed to distribute programs with Do!.
1
Introduction
Many applications have to cope with parallelism and distribution. The main targets of these applications are networks and clusters of workstations (Nows and cows) which are cheaper than supercomputers. As a consequence of the widening of parallel programming application domains, programming tools have to cope both with task and data parallelism for distributed program generation. The aim of the Do/project is to ease the task of programming distributed applications using object-oriented languages (namely Java). The Do/programruing model is not distributed, but is explicitly parallel. It relies on structured parallelism and shared objects. This programming model is embedded in a framework described in section 2, without any extension to the Java language. The distributed programs use a distributed framework described in section 3. Program distribution is expressed through the distribution of collections of tasks and data; the code generation is described in section 4. 2
Parallel
Framework
In this section, we give an overview of the parallel fl'amework described in [8]. The aim of this framework is to separate computations from control and synchronizations between parallel tasks allowing the programmer to concentrate on the definition of tasks. This framework provides a parallel programming model without any extension to the Java language. It is based on active objects (tasks) and structured parallelism (through the class PAn) that allows the execution of tasks grouped in a collection in parallel. The notion of active objects is introduced through task objects. A task is an object which type extends the TASK class, that represents a model of task, and is part of the framework class library. The task default behavior is inherited from the behavior described in the TASK class through its run method. This behavior
730 can be re-defined to implement a specific task behavior. A task can be activated synchronously or asynchronously. We have extended the operators design pattern [6] designed to express regular operations over COLLECTIONs through OPEP~ATORs, in order to write dataparallel SPMD programs: a COLLECTION manages the storage and the accesses to elements; an OPERATOR represents an autonomous agents processing elements. Including the concept of active objects, we offer a parallel p r o g r a m m i n g model, integrating task parallelism: active and passive objects are stored by collections; a parallel program consists in a processing over task collections, task p a r a m e t e r s being grouped in d a t a collections. The PAd class implements the parallel activation and synchronization of tasks, providing us with structured parallelism. Nested parallelism can be expressed, the PAd class being a task. Figure 1 shows an example of a simple parallel program, using collections of type ARRAY; the class MY_TASK represents the p r o g r a m specific tasks; it extends TASK, and takes an object of type PAI~AM as parameter. i m p o r t DO.SHARED.*;
/* the task definition */
public class MY_TASK extends TASK {
  public void run(Object param) {
    MY_DATA data = (MY_DATA) param;
    data.select(criterion);
    data.print(out);
    data.add(value);
  }
}
/* the task parameter */
public class MY_DATA {
  public void add(...) { ... }
  public void remove(...) { ... }
  public void print(...) { ... }
  public void select(...) { ... }
}
public class SIMPLE_PARALLEL {
  public static void main(String argv[]) {
/* task and data array initializations */ ARRAY tasks -= new ARRAY(N); ARRAY data ----new ARRAY(N); for (int i=0; i
Distributed
Framework
The distributed framework [9] is used as a target of our preprocessor to distribute parallel programs expressed with our parallel framework. In the parallel framework, we use collections to m a n a g e the storage and accesses to active and passive objects. The distributed framework is based on distributed collections: a distributed collection is a collection t h a t manages elements m a p p e d on distinct processors, the location of the elements being masked to the user: when a client retrieves a remote element through a distributed collection, it gets a remote reference to the element, that can be invoked transparently. Distributed tasks are activated in parallel by remote asynchronous invocations. A distributed collection is
composed of fragments m a p p e d on distinct processors, each fragment m a n a g i n g the local subset of the collection elements. A local fragment is a non distributed collection; the distributed collection processes an access to a distributed element by remote invocation to the fragment owning this element (local to this element). To access an element of a distributed collection, a client identifies this element with a global identifier (relative to the whole set of elements). The distributed collection has to identify the fragment owning the element and transform the global identifier into a local identifier relevant to the local fragment. The task of converting a global identifier into the corresponding local identifier and the owner identifier devolves on a LAYOUT_MANAGER object. Different distribution policies are provided by different types of layout_managers. The program distribution is guided by the user choice of a specific l a y o u L m a n a g e r implementation. Nevertheless the distributed framework does not manage the remote creations and accesses. They are handled by a specific runtime, and require to transform objects of the program (section 4). 4
Distributed
Code
Generation
We have developed a preprocessor to transform parallel programs expressed with our parallel framework into distributed programs, using our distributed framework. The parallel and distributed frameworks have the same interface, so the framework replacement is obtained by changing the imports in the program. Despite this, we have to transform some objects of the program: the objects stored in the distributed collections are distributed upon processors and need to have access to other objects. So, we have to:
- create (map) objects of the program on processors: when an object is located on a processor, its attributes are managed by the local memory and its methods run on this processor;
- have a mechanism allowing objects to be accessed remotely without any change from the caller point of view.
We have extended the object creation semantics to take into account remote creations of objects. Servers running on each host as separate threads are responsible for remote object creations. This extension is implemented using the standard Java reflection mechanism. To allow the transparent accesses to remote objects, the Do! preprocessor transforms a class into two classes:
- the implementation class contains the source methods implementations. An implementation object is not replicated and is located where the source object has been created (mapped). It is shared between all proxy objects.
- the proxy class has the same name and interface as the source class, but the method bodies consist in remote invocations of the corresponding methods in the implementation class. The proxy object handles a remote reference on the implementation object; it catches the invocations to the source object and redirects them to the right host. The proxy object state is never modified, so it can be replicated on each processor getting a reference on the source object.
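The split can be pictured with the following Java sketch (our own illustration with invented class names; in Do! the forwarding would go through a remote invocation mechanism such as Java RMI rather than a direct reference):

interface Counter {                       // interface of the source class
    void add(int v);
    int get();
}

final class CounterImpl implements Counter {   // implementation class: one copy,
    private int value;                         // located where the object was created
    public void add(int v) { value += v; }
    public int get() { return value; }
}

final class CounterProxy implements Counter {  // proxy class: freely replicable,
    private final CounterImpl remote;          // never modified after creation
    CounterProxy(CounterImpl remote) { this.remote = remote; }
    // In Do! these bodies would be remote invocations; here plain calls stand in for them.
    public void add(int v) { remote.add(v); }
    public int get() { return remote.get(); }
}

public class ProxyDemo {
    public static void main(String[] args) {
        Counter c = new CounterProxy(new CounterImpl());  // the client sees only the interface
        c.add(3);
        c.add(4);
        System.out.println(c.get());                      // prints 7
    }
}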
5
Related Work
Many research projects have appeared around the Java language, aiming at filling the gaps of parallelism and distribution in Java. Some of them are presented in [1]. Some projects are based on parallel extensions to the Java language: tools [2, 4] produce Java parallel (multi-threaded) programs, relying on a standard Java runtime system using thread libraries and synchronization primitives; they do not generate distributed programs; others [7, 10] are based on distributed objects and remote method invocations. Other projects use the Java language without any extension: some environments [5] rely on a data-parallel programming model and an SPMD execution model; as in the Do! project, parallelism may be introduced through the notion of active objects [3].
6
Conclusion
In this paper, we have presented an overview of the Do! project, that aims at automatic generation of distributed programs from parallel programs using the Java language. We use the Java RMI as run-time for remote accesses; a foreseen extension of this work is to use CORBA for object communications. Ongoing work consists in extending the parallel and distributed frameworks to handle dynamic creation of tasks, through dynamic collections (e.g. lists) and distributed scheduling.
References 1. ACM 1997 Workshop on Java for Science and Engineering Computation. Concurrency: Practice and Experience, 9(6):413-674, June 1997. 2. A. J. C. Bik and D. B. Gannon. Exploiting implicit parallelism in Java. Concurrency, Practice and Experience, 9(6):579-619, 1997. 3. D. Caromel. Towards a method of object-oriented concurrent programming. Communications of the ACM, 36(9):90-102, September 1993. 4. Y. Ichisugi and Y. Roudier. Integrating data-parallel and reactive constructs into Java. In OBPDC'97, France, October 1997. 5. V. Ivannikov, S. Gaissaryan, M. Domrachev, V. Etch, and N. Shtaltovnaya. DPJ: Java class library for development of data-parallel programs. Institute for System Programming, Russian Academy of Sciences, 1997. 6. J.-M. J~z~quel and J.-L. Pacherie. Parallel operators. In P. Cointe, editor, ECOOP'96, number 1098 in LNCS, Springer Verlag, pages 384-405, July 1996. 7. L. V. Kal6, M. Bhandarkar, and T. Wilmarth. Design and implementation of Parallel Java with a global object space. In Conference on Parallel and Distributed Processing Technology and Applications, Las Vegas, Nevada, July 1997. 8. P. Launay and J.-L. Pazat. A framework for parallel programming in Java. In HPCN'98, LNCS, Springer Verlag, Amsterdam, April 1998. To appear. 9. P. Launay and J.-L. Pazat. Generation of distributed parallel Java programs. Technical Report 1171, Irisa, February 1998. 10. M. Philippsen and M. Zenger. JavaParty - transparent remote objects in Java. In PPoPP, June 1997.
A n Algebraic Semantics for an A b s t r a c t Language with Intra-Object-Concurrency* Thomas Gehrke Institut ffir Informatik, Universits Hildesheim Postfach 101363, D-31113 Hildesheim, Germany [email protected] Object-oriented specification and implementation of reactive and distributed systems are of increasing importance in computer science. Therefore, languages like Java [1] and Object REX)( [3, 7] have been introduced which integrate objectorientation and concurrency. An important area in programming language research is the definition of semantics, which can be used for verification issues and as a basis for language implementations. In recent years several semantics for concurrent object-oriented languages have been proposed which are based on the concepts of process algebras; see, for example, [2, 6, 8, 9]. In most of these semantics, the notions of processes and objects are identified: Objects are represented by sequential processes which interact via communication actions. Therefore, in each object only one method can be active at a given time. Furthermore, language implementations based on these semantics tend to be inefficient, because the simulation of message passing among local objects by communication actions is more expensive than procedure calls in sequential languages. In contrast to this identification of objects with processes, languages like Java and Object R E X X allow for intra-object-concurrency by distinguishing passive objects and active processes (in Java, processes are objects of a special class). Objects are data structures containing methods for execution, while system activity is done by processes. Therefore, different processes may perform methods of the same object at the same time. To avoid inconsistencies of data, the languages contain constructs to control intra-object-concurrency. In this paper, we introduce an abstract language based on Object R E X X for the description of the behaviour of concurrent object-based systems, i.e. we ignore data and inheritance. The operational semantics of this language is defined by a translation of systems into terms of a process calculus developed previously. This calculus makes use of process creation and sequential composition instead of the more common action prefixing and parallel composition. Due to the translation of method invocation into process calls similar to procedure calls in sequential languages, the defined semantics is appropiate as a basis for implementation.
The language O: An O-system consists of a set of objects and an additional sequence of statements, which describes the initial system behaviour. Each object is identified by a unique object identifier O and must contain at least one
* This work was supported by the DFG grant EREAS (Entwurf reaktiver Systeme).
object O1
  method m1
    a; b
  method m2
    a; reply; b
  method m3
    a [] reply; b
object O2
  method m1 guarded
    c; reply; d
  method m2 guarded
    e; guard off; reply; f
object O3
  method m1 guarded
    a; reply; b
object O4
  method m1 guarded
    c; d
Fig. 1. O examples.
method. The special object identifier self can be used for the access of methods of the calling object (self is not allowed in the initial statement sequence). Methods are identified by method identifiers m and must contain at least one statement. Statements include single instructions and nondeterministic choices between sequences: s1 [] s2 performs either s1 or s2. We distinguish five kinds of instructions: atomic actions a representing the visible actions of systems (e.g. access to common resources), method calls O.m and the three control instructions reply, guard on and guard off. ; denotes sequential composition; for syntactical convenience we assume that ; has a higher priority than [] (e.g., a [] b; c is a [] (b; c)). System runs are represented by sequences a1...aN of atomic actions, called traces. During the execution of a called method, the caller is blocked until the called method has terminated. For example, the execution of the sequence O1.m1; c in combination with the object O1 in Figure 1 generates the trace a b c. In some cases it is desirable to continue the execution of the caller before the called method has finished (e.g., the execution of the remaining instructions of the called method does not influence the result delivered to the caller). The instruction reply allows the caller to continue its execution concurrently to the remaining instructions of the called method. Therefore, the execution of O1.m2; c leads to the traces a b c and a c b. Furthermore, the execution of O1.m3; c leads to the traces a c, b c and c b. Methods of different objects can always be executed in parallel. Concurrency within objects can be controlled by the notion of guardedness. If a method m of an object O is guarded, no other method of O is allowed to become active when m is taking place. If m is unguarded, other methods of O can be performed in parallel with m. The option guarded declares a method to be guarded. The instructions guard on and guard off allow to change the guardedness of a method during its execution. Consider the object O2 in Figure 1 in which both methods are declared as guarded. Although O2.m1 contains a reply instruction, the execution of the sequence O2.m1; O2.m2 can only generate the trace c d e f, because the guardedness of the methods prevents their concurrent execution. On the other hand, the sequence O2.m2; O2.m1 leads to the traces e f c d and e c d f
---~
Ti
,p~w,,(t) 2+ ,w,~.(t)
t2+t
T2
u2+u
(t;u) 2+ (t;u)
T3
t2+t ( a : t) 2+ ( a : t)
T4
t - % t' -
-
a -% 1
t - - ~ t' (b : t) - ~
R1
a# b (b : t')
n _2+ O(n)
R2
t ~-~ t' R4
t A~ t' u--% U'R8 t; u ~ t';u'
u --% u'
t + u - ~ t' t A~ t'
spawn(t) -% spawn(t')
R5
t' -% t"
t + u -% u'
R3
t -% t' R6
t; u -% t'; u
R7
u _ e ~ u' { ~ , a ' } = { a , a ? } R 9 t; u ~ t";u'
Fig. 2. Transition rules.
(the trace e c f d is prevented by the guardedness of O2.m1). Guardedness does not play a role in the initial statement sequence. As a third example, consider the sequence O3.m1; O4.m1. This sequence leads to the possible traces a b c d, a c b d and a c d b, because the called methods belong to different objects.
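As a rough Java analogy (our own, not part of O or of its later translation), guardedness can be pictured as a per-object lock: a guarded method holds the lock for its whole body, so no other method of the same object can perform an action while it runs, whereas an unguarded method only takes the lock around each individual action.

import java.util.concurrent.locks.ReentrantLock;

class GuardedObject {
    private final ReentrantLock mutex = new ReentrantLock();  // one lock per object

    void guardedMethod() {              // like "method m guarded": lock held for the whole body
        mutex.lock();
        try {
            System.out.println("c");
            System.out.println("d");
        } finally {
            mutex.unlock();
        }
    }

    void unguardedMethod() {            // each atomic action is bracketed by lock/unlock
        mutex.lock();
        try {
            System.out.println("a");
        } finally {
            mutex.unlock();
        }
    }
}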
The process calculus P: To model the behaviour of O-systems, we introduce a process calculus P, which is a slightly modified version of a calculus studied in [5]. We assume a countable set C of action names, ranged over by a, b, c. A name a can be used either for input, denoted a?, or for output, denoted with only the name a itself. We sometimes use a† as a "meta-notation" denoting either a? or a. The set of actions is denoted A = {a† | a ∈ C} ∪ {τ}, ranged over by α, β. τ is a special action to indicate internal behaviour of a process. L = A ∪ {δ}, ranged over by ω, is the set of transition labels, where δ signals successful termination. N, ranged over by n, n', is a set of names of processes. The calculus P, ranged over by t, u, v, is defined through the following grammar:
t ::= 0 | 1 | α | n | t; t | spawn(t) | t + t | (a : t)
0 denotes the inactive process, 1 denotes a successfully terminated term. A process name n is interpreted by a function Θ : N → P, called process environment, where n denotes a process call of Θ(n). t; u denotes the sequential composition of t and u, i.e. u can perform actions when t has terminated. spawn(t) creates a new process which performs t concurrently to the spawning process: spawn(t); u represents the concurrent execution of t and u. The choice operator t + u performs either t or u. (a : t) restricts the execution of t to the actions in A \ {a, a?}. The semantics of P is given by the transition rules in Figure 2.
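In Java terms (our own analogy, not from the paper), spawn corresponds to starting a new thread and then continuing with the rest of the sequential composition, so the spawned term and the continuation may interleave:

public class SpawnDemo {
    static void t() { System.out.println("a"); }
    static void u() { System.out.println("b"); }

    public static void main(String[] args) throws InterruptedException {
        Thread spawned = new Thread(SpawnDemo::t);  // spawn(t)
        spawned.start();
        u();                                        // ; u  -- may interleave with t
        spawned.join();
    }
}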
Translation: An object system is translated into a process environment Θ and a term representing the initial statement sequence. The set N of process names is defined as N = {O_m | O object, m method of O} ∪ {O_mutex | O object}. For each method m of an object O, we include a process name O_m into the
O1_m1 ↦ a; b
O1_m2 ↦ a; spawn(b)
O1_m3 ↦ τ; a + τ; spawn(b)
O2_m1 ↦ o2_l; c; spawn(d; o2_u)
O2_m2 ↦ o2_l; e; o2_u; spawn(o2_l; f; o2_u)
O2_mutex ↦ o2_l?; o2_u?; O2_mutex
O3_m1 ↦ o3_l; a; spawn(b; o3_u)
O3_mutex ↦ o3_l?; o3_u?; O3_mutex
O4_m1 ↦ o4_l; c; d; o4_u
O4_mutex ↦ o4_l?; o4_u?; O4_mutex
Fig. 3. Process semantics of the objects in Figure 1.
set N; the process term Θ(O_m) models the behaviour of the body of m. For example, the method m1 of object O1 in Figure 1 is translated into O1_m1 ↦ a; b (see Figure 3). The additional O_mutex processes are used for synchronizing methods. Method calls are translated into process calls of the corresponding process definitions. Therefore, the statement sequence O1.m1; c is translated into O1_m1; c, which is able to perform the following transitions: O1_m1; c -τ-> a; b; c -a-> 1; b; c -b-> 1; 1; c -c-> 1; 1; 1. Note that the sequence of the transition labels without τ corresponds to the trace for O1.m1; c. The translation of the reply-instruction is realized by enclosing the remaining instructions of the method in a spawn-operator. For example, the method m2 of O1 is translated into the process O1_m2 ↦ a; spawn(b). The sequence O1.m2; c is translated into O1_m2; c, which leads to the following transition system:
:~ a;spawn(b);r a
1;spawn(1);c
1;spawn(b);c
~b
1;spawn(i);1
1;spawn(b);1
Choice operators s1 [] s2 are translated into terms τ; t1 + τ; t2 where t1, t2 are the translations of s1 and s2. The initial τ-actions simulate the internal nondeterminism of the []-operator. If an instruction sequence contains subterms of the form (s1 [] s2); s3, we have to distribute sequential composition over choice, i.e. (s1 [] s2); s3 has to be transformed into s1; s3 [] s2; s3 before the translation into the process calculus can be applied. This transformation is necessary for correct translation of the reply-instruction. For example, the sequence (a; b [] reply; c); d is transformed into a; b; d [] reply; c; d, and then translated into τ; a; b; d + τ; spawn(c; d). Guardedness is implemented by mutual exclusion with semaphores. In the absence of data, we have to simulate semaphores by communication. For each
object O using guardedness, we introduce a special semaphore process O_mutex ↦ o_l?; o_u?; O_mutex. Communication on channel o_l means locking the semaphore, communication on o_u the corresponding release (unlock). To ensure that only one guarded method of an object can be active at the same time, methods of objects using guardedness have to interact with the corresponding semaphore process. For example, consider the translation of object O2 in Figure 3. In order to integrate unguarded actions into the synchronization mechanism, we have to enclose every unguarded action by lock and unlock actions. Therefore, in O2_m2 the instruction f is translated into o2_l; f; o2_u. Without the synchronization actions, unguarded actions could be performed although a guarded method has locked the semaphore. In objects without guardedness, the insertion of lock and unlock actions is not necessary and therefore omitted (see Figure 3). To enforce communication over the l and u channels, we have to restrict these actions to the translation of the initial instruction sequence. Furthermore, the semaphore process has to be spawned initially. Therefore, the initial statement sequence O2.m1; O2.m2 is translated into ({o2_l, o2_u} : spawn(O2_mutex); O2_m1; O2_m2). The semaphore process is spawned initially and runs concurrently to the method calls. o2_l and o2_u are restricted, therefore they can only be performed in communications between O2_m1, O2_m2 and the semaphore process. It is easy to see that the only possible trace is c d e f (with τ-actions omitted). In the full paper [4], the translation of O-systems is defined via translation functions. Furthermore, weak bisimulation is used as an equivalence relation on object systems.
References 1. K. Arnold and J. Goshng. The Java Programming Language. Addison-Wesley, 1996. 2. P. Di Blasio and K. Fisher. A Calculus for Concurrent Objects, In Proceedings o] CONCUR '96, LNCS 1119. Springer, 1996, 3. T. Ender. Object-Oriented Programming with REX)(. John Wiley and Sons, Inc., 1997. 4. T. Gehrke. An Algebraic Semantics for an Abstract Language with Intra-ObjectConcurrency. Technical Report HIB 7/98, Institut fiir Informatik, Universit/it Hildesheim, May 1998. 5. T. Gehrke and A. Rensink. Process Creation and Full Sequential Composition in a Name-Passing Calculus. In Proceedings of EXPRESS '97, vol. 7 of Electronic Notes in Theoretical Computer Science. Elsevier, 1997. 6. K. Honda and M. Tokoro. An Object Calculus for Asynchronous Communication. In Proceedings of ECOOP '91, LNCS 512. Springer, 1991. 7. C. Michel. Getting Started with Object REXX. In Proceedings of the SHARE Technical Conference, March 1996. 8. E. Najm and J.-B. Stefani. Object-Based Concurrency: A Process Calculus Analysis. In Proceedings of TAPSOFT '91 (vol. 1), LNCS 493. Springer, 1991. 9. D. Walker. Objects in the 7r-Calculus. Information and Computation, 116:253-271, 1995.
An Object-Oriented Framework for Managing the Quality of Service of Distributed Applications
Stéphane Lorcy, Noel Plouzeau
Irisa* Campus de Beaulieu - 35042 Rennes Cedex - France
Email: [email protected]
A b s t r a c t . In this paper we present a framework aiming at easing the design and implementation of distributed object-oriented applications, especially interactive ones, where quality of service has to be monitored and controlled. Our framework relies on a new computation model based on a contract concept. Contracts are used to specify execution requirements and to monitor the remote execution of methods; they enable a good separation of concern between the application domain issues and the distribution domain ones, while promoting structured interactions between objects in these two domains, especially regarding quality of service.
1
Introduction
It is now a well-known fact that the design, implementation and testing of distributed applications pose many challenges to the computer engineer. A potential and partial cure of these problems may lie in the use of object-oriented techniques. But parallel and distributed applications have specific problems requiring the development of specific object-oriented techniques. Handling distributed objects (ie objects executing in execution spaces disseminated over a network) is a serious concern to the application designer. A lot of effort has been devoted to techniques for using and managing distributed objects. In this paper we focus on the quality of service management issue in object-oriented distributed applications. Designers of distributed computations build on a huge amount of experience both on architectural side and on the algorithmic side. Concurrency control, efficiency, fault-tolerance, best use of the resources are some of the main qualities that distributed software designer strive to obtain [1,9]. However, once obtained in some particular implementation , these qualities cannot be transposed and adapted easily to other application settings. Cross-breeding object-oriented techniques with distributed computations ones is a fairly hot topic and many interesting works have been published [8]. Let us mention a few of major approaches. * This work has been supported in part by the council of R~gion Bretagne
739 1. One of the most popular approach relies on hiding distribution. To the application program developer, all objects are local (except maybe at initialization time). The bad side is the lack of control and the fact that some important issues are hidden although they are too important to be ignored. 2. Another popular approach uses a different strategy and gives tools to the application designer to build a custom distributed object framework. The possibilities that this scheme gives are numerous and it overpowers the simple yet limited scheme of proxies. The drawback here is that using and assembling parts requires a great deal of knowledge on distributed application architectures. 3. A third scheme goes farther on the reification path. Even higher order entities such as protocols can be reified[4]. Defining a protocol between objects is done by stacking or composing existing classes of protocols. On the down side, an application needs specific protocols which are difficult to identify by the application developer. In this paper we present another approach to the cross-breeding of objectorientedness and distributed computation domains. The main concept we use is the contract model. Most features derive from the semantics of this contract notion. The main aim of the contract notion is to capture the interaction properties between two or more objects. A method invocation is a common case of object interaction. By using an adequate contract, a customer object may invoke a method on a provider object and still get vital information on the m e t h o d execution, such as delays, failures, etc. This paper does not aim at providing an extensive definition of the contract model we designed. We rather aim at presenting the overall objectives, structure and logic of a distributed framework we built for managing quality of service issues in distributed applications. Section 2 contains a general exposition of out contract model. Section 3 gives a very brief overall description of our framework for distributed objects. 2
2 The Contract Concept
A contract defines the ways and means of interaction between two or more objects. Several contract models have already been defined to enhance the clarity of design and the reliability of object-oriented applications. A common form of contract uses method call preconditions, postconditions, loop invariants, class invariants, etc. For instance, the Eiffel language [7] includes constructs to state explicitly this kind of contract within the code. Other forms of contracts may include the specification of valid interaction protocols between objects [5]. Also, real-time software designs include timing constraints on object interaction. Our work on contracts and distribution aims at extending the contract notion to deal properly with many distribution-related problems, such as quality of service management (e.g. varying network latency, network failures, site failures, ...) or load balancing (e.g. performance evaluation of sites, selection of a site for method evaluation).
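For readers more familiar with Java than Eiffel, the same kind of method-level contract can be approximated with explicit assertions; the sketch below is only an illustration of the idea and is not part of the framework described in this paper.

```java
/** Illustration only: an Eiffel-style method contract approximated in Java. */
final class Account {
    private long balance;            // class invariant: balance >= 0

    void withdraw(long amount) {
        // precondition: a positive amount not exceeding the current balance
        assert amount > 0 && amount <= balance : "precondition violated";
        long oldBalance = balance;
        balance -= amount;
        // postcondition and invariant checked after the state change
        assert balance == oldBalance - amount && balance >= 0 : "postcondition violated";
    }
}
```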
Our model relies on four important types of entities: the contract, the contractable, the customer and the command. Briefly stated, a contract is set up by a negotiation between the customer and the contractable. The contract specifies how a command will be executed, at a customer's request. The most common form of request is method invocation, i.e. an object (customer) wants another one (provider) to execute some method. Typical contracts may include specifications of preconditions, postconditions, execution deadlines, and alternate behaviors to follow in case of contract violation. From the language point of view, a contract is an object instantiated from some class which inherits from the abstract Contract class. Instantiation of contract objects is performed by contractable objects only, in a way similar to the abstract factory design pattern [3]. As is customary with object-oriented architectures, the exact concrete class of a contract object is hidden from the customer object, allowing sophisticated means of contract implementation selection and tuning within the contract management software. Contract negotiation is the first phase within a contract's life. If some given customer object aims at requesting a command execution under specific conditions (e.g. a maximum duration of execution), the customer object must ask a known contractable object to provide a contract stating these execution conditions (which command, which delays, etc). The contractable object is fully responsible for finding out whether the conditions are reasonable or not. This implies that some estimation of the feasibility has to be performed, e.g. a network load evaluation. By returning a valid contract a contractable object gives hints to the customer object about the current quality of service of command execution. A contract is used by a customer object when it requests execution of a command from its contractable partner. The contractable object is responsible for finding out means to execute the command under the specifications of the contract. The contractable object is also responsible for monitoring the execution, based on the contract features.
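To make the roles of these four entities concrete, here is a minimal Java sketch; apart from the abstract Contract class named in the text, every identifier below (Command, Contractable, ContractClauses, ContractViolation and their members) is an illustrative assumption rather than the framework's actual API.

```java
/** Illustrative sketch: apart from "Contract", all names are assumptions. */
public abstract class Contract {
    /** True while the contractable still considers the agreed clauses feasible. */
    public abstract boolean isValid();
}

/** The unit of work requested by the customer, typically a method invocation. */
interface Command {
    Object execute() throws Exception;
}

/** Raised by the monitoring code when the agreed clauses are broken. */
class ContractViolation extends Exception {
    ContractViolation(String reason) { super(reason); }
}

/** Requested execution conditions, e.g. a maximum duration of execution. */
class ContractClauses {
    final long maxDurationMillis;
    ContractClauses(long maxDurationMillis) { this.maxDurationMillis = maxDurationMillis; }
}

/** The contractable acts as an abstract factory for contracts [3]. */
interface Contractable {
    /** Negotiation phase: estimates feasibility (e.g. network load) and may refuse. */
    Contract negotiate(ContractClauses clauses);

    /** Executes a command under a negotiated contract and monitors the execution. */
    Object request(Contract contract, Command command) throws ContractViolation;
}
```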
3 Contracts Between Distributed Objects
Our contract model states that a contract negotiation starts at the initiative of a customer object. In a distributed execution environment, work has to be done to find out whether a contract proposed by a customer object is feasible, which site or sites can deal with requests under the contract, and what adjustments have to be made. This involves specific trading algorithms. Our framework includes several subclasses of the Contract class, to provide the application developer with distributed services such as execution of a command in a group of objects, etc. The simplest contract is a maximum delay of execution contract. To get this kind of contract, the designer of the scene management subsystem simply needs to subclass a basic contract class of our framework, named TimeoutContract. Another useful contract is the Delta T Reachable Set Contract (DTRSContract for short). A DTRSContract contains a list of remote sites which have to be reached in less than T units of time. DTRSContract objects are created
and configured by the kernel of our framework, which monitors the communication between a set of sites using the fail-awareness datagram scheme of Fetzer and Cristian [2]. This scheme uses physical clocks on each site to compute the message transmission delays and take decisions about the reachability of sites. The Delta Connected Group Contract includes requirements for the connectivity of members of groups: all members of a group must follow the requirements of the same group contract. We designed a specific group management algorithm to compute a partition of the set of sites [6].
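Continuing the hedged sketch above, the subclassing route mentioned for a maximum-delay contract might look as follows; only the name TimeoutContract is taken from the text, all other members are assumptions.

```java
/** Continuation of the earlier sketch: only the name TimeoutContract comes from
 *  the text; all members and the SceneUpdateContract subclass are assumptions. */
abstract class TimeoutContract extends Contract {
    protected final long timeoutMillis;               // maximum delay of execution
    protected TimeoutContract(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Used by the contractable's monitoring code. */
    boolean violatedAfter(long elapsedMillis) { return elapsedMillis > timeoutMillis; }
}

/** What the scene-management designer would write: a plain subclass. */
class SceneUpdateContract extends TimeoutContract {
    private boolean valid = true;
    SceneUpdateContract(long timeoutMillis) { super(timeoutMillis); }

    public boolean isValid() { return valid; }
    void invalidate() { valid = false; }              // e.g. after a new load estimation
}
```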
4 Conclusion
Like in the Bast system [4], we use high level building blocks to help the distributed application designer. However, our contract model has more expressive power to selectively hide or expose features of the underlying distribution mechanisms. Thanks to this, distributed applications can be constructed with limited expertise in distribution issues, without sacrificing the means of control. This improves the separation of concerns: some part of the application design does not care about distribution issues, some other part includes application-specific distributed mechanisms and exposes them via the contract user & contract provider metaphor.
References
1. R. Van Renesse, K. P. Birman, and S. Maffeis. Horus: a flexible group communication system. Communications of the ACM, 39(4), April 1996.
2. C. Fetzer and F. Cristian. Fail-awareness in timed asynchronous systems. In 15th ACM Symposium on Principles of Distributed Computing, Philadelphia, May 1996.
3. E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
4. B. Garbinato, P. Felber, and R. Guerraoui. Composing reliable protocols using the strategy design patterns. In Usenix International Conference on Object-Oriented Technologies (COOTS'97), 1997.
5. R. Helm, I. M. Holland, and D. Gangopadhay. Contracts: Specifying behavioral compositions in object-oriented systems. In ECOOP/OOPSLA, pages 169-179, October 1990.
6. S. Lorcy and N. Plouzeau. A distributed algorithm for managing group membership with multiple groups. In Proc. of the PDPTA'98 conference, 1998.
7. B. Meyer. Applying "design by contract". Computer (IEEE), 25(10):40-51, October 1992.
8. D. C. Schmidt. The adaptive communication environment: Object-oriented network programming components for developing client/server applications. 12th Sun Users Group Conference, June 1994.
9. P. Verissimo, L. Rodriguez, F. Cosquer, H. Fonseca, and J. Frazao. An overview of the Navtech system. Technical report, Department of Informatics, Faculty of Sciences of the University of Lisboa, Portugal, 1995.
A Data Parallel Java Client-Server Architecture for Data Field Computations over Zⁿ
Jean-Louis Giavitto, Dominique De Vito, Jean-Paul Sansonnet
LRI u.r.a. 410 du CNRS, Bât. 490 - Université de Paris-Sud, F-91405 Orsay Cedex, France. {giavitto|devito}@lri.fr
Abstract. We describe FieldBroker, a software architecture dedicated to data parallel computations on fields over Zⁿ. Fields are a natural extension of the parallel array data structure. From the application point of view, field operations are processed by a field server, leading to a client/server architecture. Requests are translated successively into three languages corresponding to a tower of three virtual machines processing respectively mappings on Zⁿ, sets of arrays, and flat vectors in core memory. The server is itself designed as a master/multithreaded-slaves program. The aim of FieldBroker is to mutually incorporate approaches found in distributed computing, functional programming and the data parallel paradigm. It provides a testbed for experiments with language constructs, evaluation mechanisms, on-the-fly optimizations, load-balancing strategies and data field implementations.
1 Introduction
Collections, Data Fields and Data Parallelism. The data parallel paradigm relies on the concept of collection: an aggregate of data handled as a whole [6]. A data field is a theoretically well-founded abstract view of a collection as a function from a finite index set to a value domain. Higher order functions or intensional operations on these mappings correspond to data parallel operations: point-wise applied operation (map), reduction (fold), etc. Data fields make it possible to represent irregular data by using a suitable index set. Another attractive advantage of the data field approach, in addition to its generality and abstraction, is that many ambiguities and semantical problems of "imperative" data parallelism can be avoided in the declarative framework of data fields.
A Distributed Paradigm for Data Parallelism. Data parallelism was motivated by the increasing need for computing power in scientific applications. Thus, the main target of data parallel languages has been supercomputers and the privileged linguistic framework was Fortran (cf. HPF). Several factors urge to reconsider this traditional framework:
- Advances in network protocols and bandwidths have made practical the development of high performance applications whose processing is distributed over several supercomputers (metacomputing).
- The widening of parallel programming application domains (e.g. data mining, virtual reality, generalization of numerical simulations) urges the use of cheaper computing resources, like NOWs (networks of workstations).
- Developments in parallel compilation and run-time environments have made possible the integration of data parallelism and control parallelism, e.g. to hide the communication latency with the multithreaded execution of independent computations.
- New algorithms exhibit more and more a dynamic behavior and perform on irregular data. Consequently, new applications depend more and more on the facilities provided by a run-time (dynamic management of resources, etc.).
- Challenging applications consist of multiple heterogeneous modules interacting with each other to solve an overall design problem. New software architectures are needed to support the development of such applications.
All these points require the development of portable, robust, high-performance, dynamically adaptable, architecture-neutral applications on multiple platforms in heterogeneous, distributed networks. Many of these attributes can be cited as descriptive characteristics of distributed applications. So, it is not surprising that distributed computing concepts and tools, which precisely face this kind of problem, become an attractive framework for supporting data parallel applications. In this perspective, we propose FieldBroker, a client/server architecture dedicated to data parallel computations on data fields over Zⁿ. Data field operations in an application are requests processed by the FieldBroker server. FieldBroker has been developed to provide an underlying virtual machine for the 81/2 language [5] and to compute recursive definitions of group based fields [2]. However, FieldBroker also aims to investigate the viability of client/server computing for data parallel numerical and scientific applications, and the extent to which this paradigm can integrate efficiently a functional approach to the data parallel programming model. This combination naturally leads to an environment for dynamic computation and collaborative computing. This environment provides and facilitates interaction and collaboration between users, processes and resources. It also provides a testbed for experiments with language constructs, evaluation mechanisms, on-the-fly optimizations, load-balancing strategies and data field implementations.
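Before turning to the architecture, here is a hedged Java sketch of the data field abstraction recalled above: a field is modelled as a finite mapping from index points of Z^n to values, with map and fold as the data parallel operations. The types are ours and do not reproduce FieldBroker's client interface.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.UnaryOperator;

/** An index point of Z^n, usable as a map key. */
final class Index {
    private final int[] coords;
    Index(int... coords) { this.coords = coords.clone(); }
    @Override public boolean equals(Object o) {
        return o instanceof Index && Arrays.equals(coords, ((Index) o).coords);
    }
    @Override public int hashCode() { return Arrays.hashCode(coords); }
}

/** Hedged sketch (our names, not FieldBroker's actual API): a data field is a
 *  mapping from a finite index set in Z^n to a value domain. */
final class DataField<V> {
    private final Map<Index, V> elements;
    DataField(Map<Index, V> elements) { this.elements = new HashMap<>(elements); }

    /** Point-wise applied operation: the data parallel "map". */
    DataField<V> map(UnaryOperator<V> op) {
        Map<Index, V> out = new HashMap<>();
        elements.forEach((i, v) -> out.put(i, op.apply(v)));
        return new DataField<>(out);
    }

    /** Reduction over all elements: the data parallel "fold". */
    V fold(V identity, BinaryOperator<V> op) {
        V acc = identity;
        for (V v : elements.values()) acc = op.apply(acc, v);
        return acc;
    }
}
```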
2 A Distributed Software Architecture for Scientific Computation
The software architecture of the data field server is illustrated by Fig. 1 (right). Three layers are distinguished. They correspond to three virtual machines:
- The server handles requests on functions over Zⁿ. It is responsible for parallelization and synchronization between requests from one client and between different clients.
- The master handles operations between sets of arrays. This layer is responsible for various high-level optimizations on data field expressions. It also decides the load balancing strategy and synchronizes the computations of the slaves.
- The slaves implement sequential computations over contiguous data in memory (vectors). They are driven by the master requests. Master requests are of two kinds: computations to perform on the slave's data or communications (send data to other slaves; receives are implicit). Computations and communications are multithreaded in order to hide communication latency.
Fig. 1. Left: Relationships between the field algebras L0, L1 and L2. Right: A client/server-master/multithreaded-slaves architecture for the data parallel evaluation of data field requests. The software architecture described on the right implements the field algebras sketched on the left. The functions T_{i+1} are phases of the evaluation. The functions [[.]]_i are the semantic functions that map an expression to the denoted element of Zⁿ → Value. They are defined such that the diagram commutes, that is [[e_i]]_i = [[T_{i+1}(e_i)]]_{i+1} for i ∈ {0, 1} and e_i ∈ L_i. This property ensures the soundness of the evaluation process.
Our software architecture corresponds to a three-level language tower. Each language specifies the communications between two levels of the architecture and describes a data structure and the corresponding operations. Three languages are used, going from the more abstract L0 (client view of a field), to L1, to the more concrete L2 (in-core memory view of a field). The server-master and the slave programs are implemented in Java. The rationale of this design decision is to support portability and dynamic extensibility. The expected benefits of this software architecture are the following:
- Accessibility and client independence: requests for the data field computation are issued by a client through an API. However, because the server is a Java program, Java applets can easily be used to communicate with the server. So, an interactive access could be provided through a web client at no further cost. In this case, the server appears as a data field desk calculator.
- Autonomous services: the server lifetime is not linked to the client lifetime. Thus, implementing persistence, sharing and checkpointing will be much easier with this architecture than with a monolithic SPMD program.
- Multi-client interactions: this architecture enables application composition by pipelining, data sharing, etc.
Figure 1 illustrates that L0 terms are successively translated into L1 and L2 terms, and that L2 terms are dispatched to the slaves to achieve the final data parallel processing. More details about these languages can be found in [1].
3 Conclusion
The aim of our first ongoing implementation is to evaluate the functionalities provided by such an architecture. At this stage, we have not paid attention to its performance, which is certainly disappointing. A promising way to tackle this drawback consists in the use of just-in-time Java compilers that are able to translate Java bytecode into executable machine-dependent code. FieldBroker integrates concepts and techniques that have been developed separately. For example, relationships between the definition of functions and data fields are investigated in [4]. A proposal for an implementation is described in [3] but focuses mainly on the management of the definition domain of data fields. One specific feature of FieldBroker is the use of heterogeneous representations, i.e. extensional and intensional data fields, to simplify field expressions. Clearly, the algebraic framework is the right one to reason about the mixing of multiple representations.
References
1. J.-L. Giavitto and D. De Vito. Data field computations on a data parallel Java client-server distributed architecture. Technical Report 1167, Laboratoire de Recherche en Informatique, Apr. 1998. 9 pages.
2. J.-L. Giavitto, O. Michel, and J.-P. Sansonnet. Group based fields. Parallel Symbolic Languages and Systems (International Workshop PSLS'95), vol. 1068 of LNCS, pages 209-215, Beaune (France), 2-4 October 1995. Springer-Verlag.
3. J. Halén, P. Hammarlund, and B. Lisper. An experimental implementation of a highly abstract model of data parallel programming. Technical Report TRITA-IT 9702, Royal Institute of Technology, Sweden, March 1997.
4. B. Lisper. On the relation between functional and data-parallel programming languages. In Proc. of the 6th Int. Conf. on Functional Languages and Computer Architectures. ACM, ACM Press, June 1993.
5. O. Michel. Introducing dynamicity in the data-parallel language 81/2. EuroPar'96 Parallel Processing, vol. 1123 of LNCS, pages 678-686. Springer-Verlag, Aug. 1996.
6. J. M. Sipelstein and G. Blelloch. Collection-oriented languages. Proceedings of the IEEE, 79(4):504-523, Apr. 1991.
Workshop 7+20: Numerical and Symbolic Algorithms
Maurice Clint and Wolfgang Kreuchlin, Co-chairmen
Numerical Algorithms
The workshop on numerical algorithms is organised in three sessions under the headings Parallel Linear Algebra, Parallel Decomposition Methods and Parallel Numerical Software. The use of numerical linear algebra in the development of scientific and engineering applications software is ubiquitous. With the rapid increase in the size of the problems which can now be solved there is a concentration on the use of iterative methods for the solution of linear systems of equations and eigenproblems. Such methods are particularly amenable to efficient implementation on many types of parallel computer. Efficient methods for the partial eigensolution of large systems include Davidson's method and the Krylov-space based method of Lanczos. An important method for the iterative solution of linear systems is another Krylov-space based method, GMRES. All of these methods incorporate an orthogonalization phase. Frayssé, Giraud and Kharraz-Aroussi compare the effectiveness, on a distributed memory machine, of four variants of the Gram-Schmidt orthogonalization procedure when used as components of the GMRES method. They conclude that, in a parallel environment, it is best to use ICGS (Iterative Classical Gram-Schmidt) rather than the more widely recommended modified versions of the procedure, since it offers the best trade-off between numerical robustness and parallel efficiency. Borges and Oliveira present a new parallel algorithm (RDME), based on Davidson's method, for computing a group of the smallest eigenvalues of a symmetric matrix. They demonstrate the effectiveness of their algorithm by comparing its performance on a Paragon with that of a parallel Lanczos-type method taken from PARPACK. Adam, Arbenz and Geus compare the performances on an HP Exemplar machine of a number of methods, including Davidson-based and Lanczos-based procedures, for computing partial eigensolutions of a large sparse symmetric matrix arising from an application of Maxwell's equations. They observe that parallel efficiencies are as might be expected from an analysis of the algorithms and that subspace iteration is not competitive with the other methods they investigate. Many powerful algorithms for the solution of problems in numerical mathematics are based on the decomposition of linear operators into simpler forms. It is thus imperative that the potential for efficient and robust parallel implementation of such methods be studied. In addition, the use of decomposition may allow
the solution of a problem to be constructed from an appropriate combination of the simultaneously computed solutions of a number of subproblems. Osawa and Yamada present a decomposition-based waveform relaxation method, well suited to parallel implementation, for the solution of second order differential equations. The usefulness of their approach is supported by the results of experiments conducted on a KSR1 machine for both linear and non-linear equations. Incomplete LU factorization is an important operation in preconditioned iterative methods for solving sets of linear equations. This phase of the overall process is, however, the least tractable as far as parallel implementation is concerned. Nodera and Tsuma present a generalization of Bastian and Horton's parallel preconditioner and demonstrate its effectiveness on a distributed memory machine when used with the solvers BiCGStab and GMRES. Kiper addresses the problem of computing the inverse of a dense triangular matrix and proposes an approach, based on Gaussian elimination, in which arithmetic parallelism is exploited by means of diagonal and triangular sweeps. A PRAM-based theoretical analysis is given and it is shown that, if sufficient processors are available, the method outperforms the standard algorithm of Sameh and Brent. Maslennikov, Kaniewski and Wyrzkowski present a Givens rotation-based QR decomposition method, designed for execution on a linear processor array, which permits the correction of errors arising from transient hardware faults: this is achieved at the expense of a modest increase in arithmetic complexity. Developers of applications software intended for execution on high performance computers increasingly rely on the availability of libraries of efficient parallel implementations of algorithms from which their systems may be built. The provision of such libraries is the main goal of the ScaLAPACK project in the USA and the Parallel NAG Library project in the UK. Krommer reports on work being undertaken within the PINEAPL project, an aim of which is to extend the NAG Parallel Library to meet the needs of developers of parallel industrial applications. He reports favourably on the efficiency of execution and the scalability of a number of core sparse linear algebra routines which have been implemented on a Fujitsu AP3000 distributed memory machine. However, he sounds a note of caution by indicating that, in some cases, parallel performance may be poor. De Bono, di Serafino and Ducloux integrate a parallel 2D FFT routine from the PINEAPL Library into an industrial application concerned with the design of high performance lasers: most of the computational load in this application arises from the execution of this routine. The performance of the parallel software on an IBM SP2 is evaluated. Wagner presents an efficient parallel implementation, for execution on an IBM SP2, of an algorithm for computing the Hermite normal form of a polynomial matrix with its associated unimodular transformation matrix. The latter is used in the context of convolutional decoders to find the decoder for a given encoder. The implementation is expressed in C/C++ and uses the IBM message passing library, MPL.
Symbolic Algorithms
Symbolic computation is concerned with doing exact mathematics by computer, using the laws of Symbolic Logic or the laws of Symbolic Algebra. Consequently we are dealing with a wide spectrum, including theorem proving, logic programming, functional programming, and Computer Algebra. Symbolic computation is usually extremely resource demanding, because computation proceeds on a highly abstract level. Systems typically have to engage in some form of list processing with automated garbage collection, and they have little or no compile-time knowledge about the ensuing workloads. After the demise of special purpose list-processing hardware, however, there is little hardware support for symbolic computation except for the use of many processors in parallel. Therefore, investigating parallelism is one of the major hopes in symbolic computation to achieve new levels of performance in order to tackle yet larger and more significant applications. At the same time the system issues in parallelisation are among the most demanding one can face: highly dynamic data structures, highly data-dependent algorithms, and extremely irregular workloads. This is most obvious in logic programming, where the data form a program and parallelisation must follow the structure of the data. Hence workloads and other dynamic behaviour are impossible to predict even in theory. Especially in the fields of Theorem Proving and Computer Algebra, there is also still the problem of finding good parallel algorithms. Frequently, naive parallelisations are possible, which however ignore sequential optimisations and perform worse than standard approaches. Parallel algorithms which outperform highly tuned sequential ones on realistic problems are still sorely needed. The field has seen very good progress in the past years. It is a great advantage that shared memory machines and standard programming environments have taken an upswing recently. This greatly favors parallel symbolic computation applications, which usually have complex code with frequent synchronization needs. We may therefore hope to see more real speedups on real problems, which still is the great overall challenge to parallel symbolic computation. Costa and Bianchini optimise the Andorra parallel logic programming system. They analyse the caching behavior of the existing system using a simulator and develop 5 techniques for improving the system. As a result, the speedups improve significantly. Although itself based on simulation, this work is important for closing the gap between theoretical parallelisations and real architectures, for which caching behaviour is critical in achieving speedup. Németh and Kacsuk's paper is concerned with the improvement of the LOGFLOW parallel logic programming system. They analyse the closed binding method and propose a new hybrid binding scheme for the logic variables. Ogata, Hirata, Ioroi and Futatsugi present an extension of earlier work on their parallel term-rewriting machine PTRAM. PTRAM was designed for shared memory multiprocessors; PTRAM/MPI in addition supports message passing computation and was implemented on a Cray T3E. Scott, Fisher and Keane investigate the parallelisation of the tableau method for theorem proving in temporal logic. Temporal logic permits reasoning about time-varying properties. Proof methods are PSPACE-
complete and hence provers are time and resource intensive. They present two shared memory parallel systems, one of which achieves good speedups over the sequential system.
A Parallel-System Design Toolset for Vision and Image Processing M. Fleury, N. Sarvan, A. C. Downton, and A. F. Clark Dept. of Electronic Systems Engineering, University of Essex, Wivenhoe Park, Colchester, CO4 4SQ, U.K tel: +44 - 1206 - 872795 fax: +44 - 1206 - 872900 [email protected]
Abstract. This paper analyses the requirements of a toolset intended for integrated parallel system design. Real-time, data-dominated systems are targeted, particularly those in vision and image-processing. A constrained abstract model is presented, based around software pipelines, each stage of which can contain a parallel component. The toolset is novel in seeking to enable generalized parallelism in which the needs of the system as a whole are given priority. The paper describes the tool infrastructure, and reports on current progress on the performance prediction toolset element, including simulation visualizer and analytic support.
1 Introduction
This paper describes a parallel-system design toolset which will guide the engineer ‘in-the-field’ towards the construction of Pipeline Processor Farm (PPF) applications. PPF is a stylized design and development methodology aimed at continuous-flow, embedded systems. More precisely, the target systems are expected to be driven by soft real-time constraints such as pipeline throughput and traversal latency. Recently, a number of medium-scale, data-dominated, vision and image-processing applications [1,2,3,4,5,6] have been parallelized along PPF lines. We tentatively define medium-sized systems to contain several algorithmic components, with approximately 3,000 to 50,000 lines of code. Examples of applicable irregular, continuous-flow systems can be found in vision, radar [7], speech processing [8], and data compression [9]. PPF promotes the notion of a software pipeline, which can subsequently be transferred onto available hardware according to client need, analogously to the way relational database systems are mapped to differing parallel hardware [10]. It has been observed [11] that many parallel algorithms merely form a sub-system of a vision-processing system. A way of combining algorithmic components in a coherent whole is required, though any subsequent changes should be self-contained. A pipeline may be the natural architecture for vision processing [12]; indeed, it is the architecture used by the human mind, the result of engineering through natural selection. In PPF, a pipeline is a convenient organising entity.
Each stage of the pipeline can independently cater for the differing requirements of a system’s algorithmic components, either through centralized processing, data farming, or some other form of algorithmic parallelism. By adjusting the system parameters differing performance goals can be achieved. For example, identification of handwritten postcodes falls into three stages: preprocessing of a digitised postcode image; classification of the characters within the postcode; and matching the postcode against a dictionary of postal addresses. By timing the algorithmic components of a system in a sequential setting, the ratio of processing power required at each stage of a pipeline can be initially assessed, though the extent of testing will determine the veracity of the results. For full generality (rather than heuristics), either second-order statistics or even identification of the statistical distribution formed by the processing times is necessary. The distribution can be estimated by fitting a distribution to the timings’ histogram, for example by a chi-squared test. In the postcode example, the first stage might be achieved by farming complete postcode images to a set of worker processors, the second stage by the same or by decomposing processing to individual postcode characters, and the third stage by means of a trie search on each processor. A pipeline enables (say) easy transition to an alternative search engine, such as a syntactic neural net. The first two stages may be approximated by parameterized truncated Gaussian (normal) distributions while processing of the final stage is bi-modal, UK postcodes being either six or seven characters in length. It emerges that a reduction in latency is achievable if the second-stage processing is on a character basis and the postcode characters can be re-assembled into postcodes without significantly impeding the dictionary search [4]. As timings on sequential code give static requirements, the system designer, aiming for a reduction in hardware costs, will also want to know dynamic effects such as: the mean, variance and maximum work flow, i.e. the throughput and traversal latency metrics of a particular configuration of a parallel pipeline in the long run; the likely steady-state per-stage scheduling regime behaviour; what memory requirements will be needed if there are temporary holdups in the workflow requiring buffering; the effect of varying compute and communicate parameters; and will a particular processor employed at any one stage spend a significant time idling while a less-powerful processor would suit the granularity of the tasks more closely. The PPF performance tools progress the designer to a resolution of these issues.
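As a hedged illustration of this initial assessment step (our code, not part of the PPF toolset), the sketch below derives per-stage means and variances from sequential timing samples and turns the means into the static ratio of processing power required at each pipeline stage.

```java
/** Hedged illustration of the initial assessment step: given per-stage timing
 *  samples from sequential runs, derive means, variances and the static ratio
 *  of processing power each pipeline stage would need. Not PPF toolset code. */
final class StageTimings {
    static double mean(double[] samples) {
        double s = 0; for (double x : samples) s += x;
        return s / samples.length;
    }
    static double variance(double[] samples) {         // second-order statistic
        double m = mean(samples), s = 0;
        for (double x : samples) s += (x - m) * (x - m);
        return s / (samples.length - 1);
    }
    /** Fraction of the total mean work done by each stage. */
    static double[] powerRatio(double[][] perStageSamples) {
        double[] means = new double[perStageSamples.length];
        double total = 0;
        for (int i = 0; i < means.length; i++) { means[i] = mean(perStageSamples[i]); total += means[i]; }
        for (int i = 0; i < means.length; i++) means[i] /= total;
        return means;
    }
}
```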
2 Toolset Overview
Previous work on the PPF toolset has entailed identifying semi-manual methods of partitioning existing sequential code, i.e. identifying likely partition points, and transferring the partitioned code to pre-written high-level software templates which will enact individual processing stages. The ability to routinely parallelize algorithms in a generic manner enables algorithms to be prototyped on intermediate, general-purpose parallel hardware. Introducing an intermediate stage
has several advantages: a central parallel version of the system is available for transfer to a variety of client machines according to individual processing needs; the correct working of the parallelisation can be checked; and if instrumented high-level software templates have been used in the construction of the pipeline, an iterative design cycle is set up in which predicted performance is compared to actual through the medium of an event trace. The PPF design cycle is shown in Fig. 1. The performance predictor works by simulation and analytic means. Because one wishes to compare predicted with actual performance, the simulation should set up a visual impression of activity within a pipeline segment which the designer can compare with recorded activity on the prototyping machine, using a similar display format. The display format of the simulator is deceptively simple as: the PPF methodology restricts the degrees of freedom in the parallel system development path; and extraneous detail is avoided so that the mapping between prototype design and target system(s) is not prematurely fixed. For generality, a machine-neutral description [13] is required, which will reflect a designer’s linguistically-based thought processes in developing a design. For example, it is important to present the concepts of ‘pipeline’ and ‘data-farm’, which do not already exist in other common visualizers, such as ParaGraph. Analytic results are used for synchronous pipelines (Section 4) and to check the simulated results for some mathematically tractable distributions. Synchronous and asynchronous pipeline segment results are then combined. Single-stage scheduling regimes using either fixed or variable task sizes are also open to simulation and analytic prediction.
Fig. 1. The PPF Design Cycle
3 PPF Outlook
In the vision and image-processing fields, the choice of parallel hardware has narrowed, since SIMD-style machines are not generally available — not because
of any particular difficulty with the design but because a critical mass in the marketplace has not been achieved.1 Similarly, topology was a source of considerable academic debate when store-and-forward communication was predominant amongst first generation message-passing machines. With the advent of generalized communication methods such as the routing switch, the ‘fat’ tree, wormhole routing, and virtual channels the situation appears less critical. Communication latency is also apparently becoming less of an issue. The Bulk Synchronous Parallelism (BSP) programming model, which has been widely ported to recent machines, adopts a linear model of piece-wise execution time, characterised by a single network permeability constant. Message-latency variance can be reduced by a variety of techniques, such as the message aggregation used on the IBM SP2 SMP. Routing congestion can be palliated either by a randomized routing step, or by a changing pattern of communication generated through a Latin square. It appears that a single metric, such as the mean latency, may suitably characterize PPF communication performance. The advantage for the system designer, were this to be the case, is that the behaviour of the algorithm becomes central, with communication characteristics remaining stable and decoupled from the algorithm.
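For reference, the BSP cost model alluded to here charges each superstep a linear cost of the form w + h·g + l, where w is the maximum local work, h the maximum number of words communicated by any processor, g the network permeability and l the barrier synchronisation cost. The following small sketch (ours, not PPF code) simply evaluates that model.

```java
/** Standard BSP superstep cost, w + h*g + l (a hedged illustration, not PPF code):
 *  w = maximum local computation, h = maximum words communicated by any processor,
 *  g = network permeability (time per word), l = barrier synchronisation cost. */
final class BspCost {
    static double superstep(double w, double h, double g, double l) {
        return w + h * g + l;
    }
    /** A whole program is modelled as the sum of its supersteps. */
    static double program(double[][] steps, double g, double l) {
        double total = 0;
        for (double[] s : steps) total += superstep(s[0], s[1], g, l);   // s = {w, h}
        return total;
    }
}
```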
4 PPF Taxonomy
The postcode example given in Section 1 may give the impression that PPF pipelines are invariably asynchronous, though the need to collect a complete postcode before commencing the dictionary search is a form of synchronous constraint. Throughput and traversal latency depend on whether there is a synchronous constraint on a pipeline stage, which, in PPF, can be: an algorithmic constraint, i.e. completion of a given set of jobs; one or more feedback loops; a folded-back pipeline. An ordering constraint is a feedback loop that can be confined to one stage, but generally feedback constraints extend over at least one stage. A folded-back pipeline is a way of avoiding synchronisation overhead by using the same pipeline stage for more than one step in the processing, but ineluctably this strategy will only partially succeed because partitioning is imperfect and because even a perfect partition may have variance due to data dependency. PPF specifies the elementary types of software pipeline (Fig. 2) as: a linear pipeline; a dual pipeline; a pipeline with feedback; a folded pipeline; and combinations of any of the four previous pipeline types. Input to the complete pipeline is separately specified as either being instantaneously available or modelled by an input stream distribution, typically exponential. A pipeline is split into asynchronous and synchronous segments. Examples of splittings are given in Fig. 3. A form of pipeline process algebra is possible. A segment is a set of adjacent stages. A segment need not be a proper subset of the whole pipeline. A synchronous segment may contain any finite number of complete asynchronous segments and optionally one additional part of an asynchronous segment. An asynchronous segment is the largest set of adjacent
1 MasPar Inc. and their word-oriented machines and Cambridge Memory Systems, who support the DAP bit-serial machine, remain in active trading.
stages, other than the trivial case, that can be formed at any one site in which no synchronous constraints apply. An asynchronous segment is capable of being simulated and any single stage may also be simulated. Synchronous segments must contain all synchronous constraints to ensure that the segment is self-contained, but cannot include any constraints which are not needed to make the segment self-contained. Notice that a synchronous segment may contain another synchronous segment as a subset. For the purposes of the predictor tool, a feedback loop can be replaced by a nominal pipeline stage which models the delay distribution experienced by the succeeding stage (Fig. 4). A nominal stage, which can be synchronous or asynchronous, will not correspond to a stage in the implemented software pipeline. A folded-back pipeline can also be unwrapped again to form a linear pipeline by means of first replacing the folding and then providing a nominal stage. By alternating simulation and analytic predictions the performance of a complete pipeline can be built up in a piecewise fashion. In the pipeline of Fig. 3(C), after first replacing the feedback loop by a nominal stage, the performance metrics for the first segment are found. The throughput characteristics act as inputs to the first asynchronous segment proceeding in the direction of data flow, which is simulated. The output metrics of the first asynchronous segment are available for subsequent segments. The maximum latency is additive between pipeline segments. Notice that the ubiquitous central limit theorem, in its various forms, suggests that the result of concatenating random variables, the service times, across a number of stages is a Gaussian distribution, which is determined solely by mean and variance.

Fig. 2. Elementary Pipeline Types: A) Linear B) Simple feedback C) Folded D) Dual
5 Implementation of the Simulator
Fig. 5 shows a sample screen from the simulator, which is implemented for portability through the Java 1.1 AWT. The number of pipeline stages and jobs to be
completed as well as the job input type and arrival rate are entered initially. Within any one stage, jobs can be grouped into tasks. The user enters: the per-stage statistical distributions and parameters as estimated from test runs; the number of processors; and their performance characteristics. The interconnect parameters are also adjustable. If algorithmic parallelism is used then each worker may output work according to a different distribution. The scrollable pipeline display animates a discrete-event simulation running as a background thread. The display includes running indication of minimum, maximum and mean pipeline traversal latency. Throughput details, including a measure of distribution spread, are also included. Simulation time, which is adjustable, is shown in counter or clock format. The display can be zoomed into to show activity on individual pipeline stages. The state of the pipeline can be displayed, e.g. the total idle time to date of any worker process is to be available by clicking on that worker.
(Key to Fig. 3: A = asynchronous stage; Sa = synchronous stage with algorithmic constraint; Sf = synchronous stage with feedback constraint; Saf = synchronous stage with algorithmic and feedback constraints.)
Fig. 3. Pipeline splittings: A) Disjoint segments B) Singly Nested segments C) Multiply Nested Segments
The simulation moves forward in time according to the global minimum remaining service time for any task. A preliminary, two-stage, version of the simulator used local minima on a stage-by-stage basis, which correctly calculated the output statistics but did not preserve correct event ordering on the screen. To keep a record of job latency a job is tagged with its time to date in the pipeline along with any remaining time at an individual stage. The simulation program also accounts for differing numbers of jobs per task at each stage of the pipeline. Interstage buffer queues are represented internally by calls to the vector class library, which enables dynamic data structures. At each time update point, the latency times of all jobs in a buffer are incremented. To give an impression of parallelism on a sequential machine using a slower semi-compiled language, communication is animated by changes in the colour and size of the links resulting in a persisting display.
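A hedged sketch of the time-advance rule just described (class and field names are ours, not the simulator's): the clock jumps by the global minimum remaining service time, and every task still in the pipeline has its latency incremented by the same amount.

```java
import java.util.ArrayList;
import java.util.List;

/** Hedged sketch of the simulator's advance rule (names are ours, not the tool's):
 *  move the clock forward by the global minimum remaining service time. */
final class PipelineSim {
    static final class Task { double remaining; double latencySoFar; }

    final List<List<Task>> stages = new ArrayList<>();   // active tasks, one list per stage
    double clock = 0.0;

    /** One discrete-event step. Returns false when no task remains anywhere. */
    boolean advance() {
        double dt = Double.POSITIVE_INFINITY;
        for (List<Task> stage : stages)
            for (Task t : stage) dt = Math.min(dt, t.remaining);
        if (dt == Double.POSITIVE_INFINITY) return false;

        clock += dt;
        for (List<Task> stage : stages)
            for (Task t : stage) { t.remaining -= dt; t.latencySoFar += dt; }
        // Tasks whose remaining time has reached zero would now be forwarded to
        // the next stage's buffer (omitted in this sketch).
        return true;
    }
}
```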
Fig. 4. A) Two simple feedback loops B) Replacement by nominal stages C) Folded-back pipeline D) After unwrapping E) After substituting a nominal stage.
6 Analytic Prediction
PPF is intended to develop systems that can guarantee or adapt to specifications. A principal problem is to determine maximum latency given data-dependent behaviour. The maximum latency distribution is the output distribution for algorithmically synchronous pipeline stages, i.e. latency and throughput are
bound together. Distribution spread occurs even on regular problems within synchronous pipeline segments as there will still be system noise [14]. The obvious analytic technique, queueing theory, is generally confined to exponential arrival distributions. A further serious detraction is that the distributions of waiting times are not easily found, though we have made some progress with delay-cycle analysis which will estimate variances as well as means. The waiting time distribution is needed to find maximum latency due to buffering in asynchronous pipeline segments. However, order statistics have been shown in simulation and in timing experiments [15] to give suitable statistics for maximal behaviour.
Fig. 5. Sample Screen Display for the PPF Simulator

Because of the changed nature of parallel hardware, as envisaged in Section 3, it becomes possible to use order statistics whereas previously linear programming or queueing theory or both appeared necessary [16]. This paper considers just one way of using order statistics. Taylor expansions of all order statistics [17] are available from X_{i:p} = F^{-1}(U_{i:p}) = G(U_{i:p}), when the E[U_{i:p}] are i/(p + 1) = p_i, resulting in:

X_{i:p} = G(p_i) + G'(p_i)(U_{i:p} − p_i) + (1/2) G''(p_i)(U_{i:p} − p_i)² + · · · ,   (1)

with G'(p_i) = dG(u)/du |_{u=p_i}, u = F(x). For example, with F a logistic distribution, p = 20, after taking the expectation of (1), the results are plotted in Fig. 6 for odd values of p. The advantage of this procedure is twofold: the amount
of work that needs to be reserved at each scheduling round in order to minimize the final idling time, assuming a monotonically-reducing task-duration scheduling regime, can be found by observing the shape of the cdf of {E[X_{i:p}]}; and likewise the performance degradation from any ordering constraint on task output is estimated by the expectation over the cdf of the order-statistic expectations.
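As a hedged numerical illustration (our code, not the toolset's), keeping only the first-order term of (1) gives E[X_{i:p}] ≈ G(i/(p+1)); for a logistic distribution with location mu and scale s the quantile function is G(u) = mu + s·ln(u/(1−u)), which is the quantity plotted, to this approximation, in Fig. 6 for p = 20.

```java
/** Hedged illustration (our code): first-order approximation of the mean order
 *  statistics from (1), E[X_{i:p}] ~= G(i/(p+1)), for a logistic distribution
 *  with location mu and scale s, whose quantile function is
 *  G(u) = mu + s * ln(u / (1 - u)). */
final class OrderStats {
    static double logisticQuantile(double u, double mu, double s) {
        return mu + s * Math.log(u / (1.0 - u));
    }
    static double[] approxMeans(int p, double mu, double s) {
        double[] e = new double[p];
        for (int i = 1; i <= p; i++) e[i - 1] = logisticQuantile(i / (p + 1.0), mu, s);
        return e;
    }
    public static void main(String[] args) {
        double[] e = approxMeans(20, 0.0, 1.0);     // p = 20, standard logistic
        for (int i = 0; i < e.length; i++)
            System.out.printf("E[X_%d:20] ~= %.3f%n", i + 1, e[i]);
    }
}
```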
Fig. 6. Approximate Mean Order Statistics for the Logistic Distribution
7 Conclusion
A coherent scheme, PPF, for developing parallel systems has been introduced. The system design element is supported by a performance prediction tool which gives an abstract graphical representation of pipelines formed from independent parallel components. The complete-pipeline performance is built up by combining simulated and analytic predictions for, respectively, asynchronous and synchronous pipeline segments. Input to the predictor is found by test runs of software developed in a sequential environment, which implies that the first stages of development can occur without purchase of parallel hardware. In fact, the path to full implementation is further eased because an intermediate step is introduced whereby the system is prototyped on general-purpose parallel kit, the results being fed back for comparison with predictions. Order statistics are used to form estimates of maximal pipeline segment output characteristics. A discrete-event simulator, written in Java, has been animated to visualize pipeline activity. The event traces from typical runs can be animated for graphical comparison.
Acknowledgement This work is being carried out under EPSRC research contract GR/K40277 ‘Parallel software tools for embedded signal-processing applications’ as part of the EPSRC Portable Software Tools for Parallel Architectures directed programme.
References
1. A. C. Downton. Speed-up trend analysis for H.261 and model-based image coding algorithms using a parallel-pipeline model. Signal Processing: Image Communications, 7:489–502, 1995.
2. H. P. Sava, M. Fleury, A. C. Downton, and A. F. Clark. A case study in pipeline processor farming: Parallelising the H.263 encoder. In UK Parallel ’96, pages 196–205. Springer, London, 1996.
3. A. Çuhadar, D. Sampson, and A. Downton. A scalable parallel approach to vector quantization. Real-Time Imaging, 2:241–247, 1996.
4. A. Çuhadar, A. C. Downton, and M. Fleury. A structured parallel design for embedded vision systems: A case study. Microproc. and Microsys., 21:131–141, 1997.
5. M. Fleury, A. C. Downton, and A. F. Clark. Pipelined parallelization of face recognition. Machine Vision and Applications, 1998. Submitted for publication.
6. M. Fleury, A. C. Downton, and A. F. Clark. Co-design by parallel prototyping: Optical-flow detection case study. In High Performance Architectures for Real-Time Image Processing, pages 8/1–8/13, 1998. IEE Colloquium Ref. No. 1998/197.
7. M. N. Edward. Radar signal processing on a fault tolerant transputer array. In T. S. Durrani, W. A. Sandham, J. J. Soraghan, and S. M. Forbes, editors, Applications of Transputers 3. IOS, Amsterdam, 1991.
8. S. Glinski and D. Roe. Spoken language recognition on a DSP array processor. IEEE Transactions on Parallel and Distributed Systems, 5(7):697–703, 1994.
9. A-M. Cheng. High speed video compression testbed. IEEE Transactions on Consumer Electronics, 40(3):538–548, 1994.
10. J. Spiers. Database management systems for parallel computers. Technical report, Oracle Corporation, 1990.
11. S-Y. Lee and J. K. Aggarwal. A system design/scheduling strategy for parallel image processing. IEEE Trans. on PAMI, 12(2):194–204, February 1990.
12. D. P. Agrawal and R. Jain. A pipeline pseudoparallel system architecture for real-time dynamic scene analysis. IEEE Trans. on Comp., 31(10):952–962, 1982.
13. K. M. Nichols and J. T. Edmark. Modeling multicomputer systems with PARET. IEEE Computer, pages 39–47, May 1988.
14. M. Dubois and F. A. Briggs. Performance of synchronized iterative processes in multiprocessor systems. IEEE Trans. on SW. Eng., 8(4):419–431, July 1988.
15. M. Fleury, A. C. Downton, and A. F. Clark. Modelling pipelines for embedded parallel processor system design. Electronic Letters, 33(22):1852–1853, 1997.
16. M. Fleury and A. F. Clark. Performance prediction for parallel reconfigurable low-level image processing. In International Conference on Pattern Recognition, volume 3, pages 349–351. IEEE, 1994.
17. F. N. David and N. L. Johnson. Statistical treatment of censored data part I: fundamental formulæ. Biometrika, 41:228–240, 1954.
On the Influence of the Orthogonalization Scheme on the Parallel Performance of GMRES
Valérie Frayssé 1, Luc Giraud 1 and Hatim Kharraz-Aroussi 2
1 CERFACS, 42 av. Gaspard Coriolis, 31057 Toulouse Cedex 1, France. fraysse, giraud@cerfacs.fr
2 ENSIAS, BP 713, Agdal, Rabat, Morocco.
Abstract. In Krylov-based iterative methods, the computation of an orthonormal basis of the Krylov space is a key issue in the algorithms because the many scalar products are often a bottleneck in parallel distributed environments. Using GMRES, we present a comparison of four variants of the Gram-Schmidt process on distributed memory machines. Our experiments are carried out on an application in astrophysics and on a convection-diffusion example. We show that the iterative classical Gram-Schmidt method overcomes its three competitors in speed and in parallel scalability while keeping robust numerical properties.
1 Introduction
Krylov-based iterative methods for solving linear systems are attractive because they can be rather easily integrated in a parallel distributed environment. This is mainly because they are free from matrix manipulations apart from matrix-vector products, which can often be parallelized. The difficulty is then to find an efficient preconditioner which is good at reducing the number of iterations without degrading too much the parallel performance. We consider the solution of a large sparse linear system Ax = b where A is a nonsingular n x n matrix, and b and x are two vectors of length n. Given a starting vector x_0, the GMRES method [10] consists in building an orthonormal basis V_m for the Krylov space K_m(A, r_0) = span{r_0, A r_0, ..., A^{m-1} r_0}, where r_0 = b - A x_0 is the initial residual.
The integer m is called the projection size; GMRES produces a solution whose restriction to this Krylov space has a minimal 2-norm residual. When one wants to limit the amount of storage for V_m, m is kept fixed and the method is restarted with a better initial guess: m is then called the restart parameter and the restarted GMRES method is denoted by GMRES(m). The construction of the basis V_m is an important step of GMRES. Several variants of the Gram-Schmidt (GS) process are available; they have different efficiencies and different numerical properties in finite precision arithmetic [1]. Classical GS (CGS) is the most efficient but is numerically unstable; modified GS (MGS) improves on CGS but can
still suffer from a loss of orthogonality. The iterative MGS (IMGS) method and the iterative CGS (ICGS) have been designed to reach the numerical quality of the Householder (or Givens) factorization with a reasonable computational cost. From the point of view of computational efficiency, CGS and ICGS are the best because the scalar products can be gathered and implemented in a matrix-vector product form. However, the iterative procedures ICGS and IMGS may require more than twice as much work as their counterparts CGS and MGS, when reorthogonalization is needed. Our implementations are the ones described in [1]. From a numerical point of view, it has been proved that MGS, IMGS and ICGS are able to produce a good enough computed Krylov basis to ensure the convergence of GMRES [4, 6]. In this paper, we wish to compare the efficiency of GMRES(m) with these four orthogonalization processes in a parallel distributed environment. It is well known that the many scalar products arising in the orthogonalization phase are the bottleneck of Krylov-based methods. Our tests are based on a restarted GMRES code developed at CERFACS which implements the four GS variants and uses reverse communication for the preconditioner, the matrix-vector products and the dot products [5]. The numerical experiments have been performed on the 128-node CRAY T3D available at CERFACS, using the MPI message passing library. We present results concerning an application in astrophysics developed in collaboration with M. Rieutord (Observatoire Midi-Pyrénées, Toulouse) and L. Valdettaro (Politecnico di Milano). In order to explain these results in depth, we study a model problem deriving from a convection-diffusion equation, on which we can test the scalability of GMRES.
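For reference, here is a textbook sketch of the classical and modified Gram-Schmidt steps for orthogonalizing a new Krylov vector w against the m columns already stored in V (this is not the CERFACS code, which follows [1]): in CGS all inner products use the original w and can therefore be fused into a single distributed matrix-vector product, whereas MGS updates w between inner products and serializes them. The iterative variants ICGS and IMGS simply repeat the corresponding step when a reorthogonalization criterion is triggered.

```java
/** Textbook sketch of one orthogonalization step of w against the columns of V
 *  (not the CERFACS implementation). The array h receives the GS coefficients. */
final class GramSchmidt {
    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
    /** Classical GS: all inner products use the original w, so they can be
     *  gathered into a single (distributed) matrix-vector product. */
    static void cgs(double[][] V, int m, double[] w, double[] h) {
        for (int j = 0; j < m; j++) h[j] = dot(V[j], w);
        for (int j = 0; j < m; j++)
            for (int i = 0; i < w.length; i++) w[i] -= h[j] * V[j][i];
    }
    /** Modified GS: w is updated after each projection, which is more robust
     *  but forces m dependent (hence sequentialized) inner products. */
    static void mgs(double[][] V, int m, double[] w, double[] h) {
        for (int j = 0; j < m; j++) {
            h[j] = dot(V[j], w);
            for (int i = 0; i < w.length; i++) w[i] -= h[j] * V[j][i];
        }
    }
}
```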
2 An application in astrophysics
2.1 Description of the application
The application we are interested in belongs to the class of flow stability problems and arises in astrophysics, when modelling the internal structure of stars and planets, to study for instance the electromagnetic field in the earth kernel [8, 9]. The study of the stability of a solution of a nonlinear problem begins by solving the eigenvalue problem verified by small perturbations of the original solution. Therefore, the inertial modes (or eigenmodes) for an incompressible viscous fluid located between two concentric spheres are obtained by solving the linearized Navier-Stokes and continuity equations
E Δ(∇ × u) − ∇ × (e_z × u) = λ ∇ × u,
div u = 0,
where u is the scaled velocity of the fluctuations, and (e_z × u) is the non-dimensional Coriolis force. The Ekman number E is equivalent to the inverse of a scaled Reynolds number and is intended to be very small. When using a spherical geometry and after projecting onto the space of Chebyshev polynomials,
one obtains a generalized eigenproblem Ax = λBx where A and B are two non-Hermitian matrices, and B may be singular. The smaller E, the finer must be the discretization and thus the larger A and B must be. All the eigenvalues have an imaginary part between −1 and 1. The eigenvalues of interest are those closest to the imaginary axis. The eigenproblem is solved by the Arnoldi method with a Chebyshev acceleration [2]. The strategy for selecting the interesting eigenvalues is based on a shift-and-invert technique, with several imaginary shifts σ in the interval [−i, i]. We then solve for μ the eigenproblem (A − σB)^{-1} B x = μx
with μ = 1/(λ − σ) being the largest eigenvalue of C = (A − σB)^{-1} B. The Arnoldi method, like any Krylov-based iterative method, requires only the application of C to a vector. Clearly in this case, this operation involves the solution of linear systems such as (A − σB)x = y. Even if A and B are sparse, the use of direct methods for solving this system imposes severe limitations on the size of A and B, and consequently limits the range of possible values of the Ekman number E. We have therefore investigated the use of GMRES(m) in a parallel distributed memory environment. After detailing some choices about the data distribution and the preconditioning, we give some results on the performance of GMRES(m) on this example, where, for the sake of simplicity, we have assumed the shift σ to be zero. In the rest of this paper, the acronym GMRES will always refer to the restarted method.
The data distribution
The matrix A is block tridiagonal. The size of the diagonal blocks is usually twice the size of the off-diagonal blocks (also called coupling blocks). For example, the test matrix A5850 used in our experiments is 5850 × 5850, with 45 diagonal blocks of size 130 × 130 and 88 off-diagonal blocks of size 61 × 61. Only the blocks are stored. They are dense enough so that there is no substantial gain to hope for from a sparse storage. Note that the matrix B is block diagonal, with a block size compatible with the storage of A; therefore the matrices A − σB can be stored identically to A. The matrix is distributed over the processors so that each processor owns complete blocks only. The advantage of this distribution is that it allows an easy implementation of the preconditioner, but the drawback is that it may induce some load imbalance when the number of blocks is not a multiple of the number of processors. In the case of two processors, for instance, processor 1 will receive the first 23 diagonal blocks of A5850 and processor 2 the remaining 22 blocks. In other words, processor 1 (resp. processor 2) will treat a subproblem of size 130 × 23 = 2990 (resp. 130 × 22 = 2860). But out of 32 processors, 13 will have two blocks whereas 19 processors will have half of that load, that is, one block only. Let n(p) be the size of the subproblem treated by the first processor when the
matrix is distributed over p processors. If the data distribution were perfectly balanced, n(p) would be proportional to 1/p. Figure 1 shows the evolution of the ratio n(2)/n(p), for which the optimal distribution (dotted line) would give n(2)/n(p) = p/2. We see that our distribution is reasonably balanced whenever p ≤ 16. When p ≥ 32, the first processor is penalized by a larger amount of data than expected from the increase in processors. The opposite phenomenon arises for the last processor, which sees its load decrease faster than the increase in processors.
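The imbalance discussed above can be reproduced in a few lines. The block count (45) and block size (130) are those of A5850 given earlier; the dealing rule (consecutive blocks, the first processors receiving the extra ones) is our assumption.

```python
def first_processor_size(p, n_blocks=45, block_size=130):
    """Size n(p) of the subproblem owned by the first processor when
    n_blocks diagonal blocks are dealt out as evenly as possible
    (assumed rule: the first n_blocks % p processors get one extra block)."""
    blocks_first = n_blocks // p + (1 if n_blocks % p else 0)
    return blocks_first * block_size

for p in (2, 4, 8, 16, 32):
    ratio = first_processor_size(2) / first_processor_size(p)
    print(p, first_processor_size(p), round(ratio, 2), "optimal:", p / 2)
```

For p = 2 this reproduces the 23-block, 2990-unknown subproblem quoted above, and for p = 32 it gives the two-block first processor responsible for the departure from the dotted line in Figure 1.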
Fig. 1. Load of the first processor (ratio n(2)/n(p) versus the number of processors, Astro-5850)
2.3 Preconditioning
The restarted GMRES method applied to this problem does not converge, even with large restarts. We have chosen one of the simplest preconditioners: the block Jacobi preconditioner. The preconditioner consists of the diagonal blocks of A and is well adapted to the data distribution. Each processor computes once, at the beginning of the computation, a dense LU factorization of its blocks. Applying the preconditioner consists in performing two triangular solves per block, locally on each processor. There is no communication involved when building or applying the preconditioner.
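A minimal sketch of such a block Jacobi preconditioner, using SciPy's dense LU as a stand-in for the kernels actually used on the T3D; the class and its interface are our own illustration.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

class BlockJacobi:
    """Block Jacobi preconditioner: factor each local diagonal block of A
    once, then apply it by two triangular solves per block (lu_solve
    performs the forward and backward substitutions)."""
    def __init__(self, diag_blocks):
        self.factors = [lu_factor(B) for B in diag_blocks]   # done once
        self.sizes = [B.shape[0] for B in diag_blocks]

    def apply(self, r):
        z, start = np.empty_like(r), 0
        for fac, m in zip(self.factors, self.sizes):
            z[start:start + m] = lu_solve(fac, r[start:start + m])
            start += m
        return z
```

Because every processor only factors and solves with the blocks it owns, neither the setup nor the application involves communication, as stated above.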
2.4 Performance results
We now describe the results obtained for the matrix of size 5850. Since the number of processors on the CRAY T3D has to be a power of 2, we show results for 2, 4, 8, 16 and 32 processors (which is the maximum allowed for 45 blocks). The sequential code could not be run due to memory limitations; consequently, our reference for speed-ups will be the time obtained on 2 processors.
From a numerical point of view, the matrix A is very ill-conditioned (with a condition number larger than 10¹³). GMRES with a CGS reorthogonalization does not converge on this example. In our tests, we stop the iterations when the backward error on the preconditioned system becomes lower than 10⁻⁷. The original system then has a normwise backward error ‖Ax̃ − b‖₂ / (‖A‖₂ ‖x̃‖₂ + ‖b‖₂) smaller than 10⁻¹⁰ (here x̃ denotes the computed solution). We show the results obtained with a restart of 100, since we obtained similar behaviours for other values. Because the matrix is ill-conditioned, the number of iterations may vary significantly with the orthogonalization strategy and the number of processors (see Figure 2). However, these variations are not monotonic and one cannot predict the pre-eminence of one method over the other. The variation with the number of processors is due to the different orderings of the floating-point operations in the parallel matrix-vector and dot products. Because of this sensitivity, we have chosen to evaluate the performance of the code mainly on one iteration rather than on the complete solution: the times measured over the solution are then scaled by the number of iterations required for this solution. However, to be fair, we first look at the total computational time needed to obtain the solution (see Figure 3). Regardless of the number of iterations, the fastest method is GMRES-ICGS and the slowest is GMRES-IMGS. The success of GMRES-ICGS is mainly due to the better scalability properties of the ICGS orthogonalization scheme, as we will see in the next section.
Fig. 2. Number of iterations required for convergence (Astro-5850, versus the number of processors)

Fig. 3. Time for convergence (Astro-5850, versus the number of processors)
Speed-up for the preconditioner. The speed-up for the preconditioner is taken as the ratio of the preconditioning time on 2 processors to the preconditioning time on p processors
and is plotted in Figure 4. We cannot compare with the sequential time because the problem is too large to fit on one processor. The speed-up departs from its optimal value (dotted line) when p ≥ 16. This only reflects the imbalance of the processor loads (see Figure 1) and, because there is no communication overhead, is the best one can expect from this data distribution.
Speed-up for the matrix-vector product. The matrix-vector product in parallel involves a local part corresponding to the contribution of the diagonal blocks and a communication part with the left and right neighbours to take into account the contribution of the coupling blocks. It is implemented so as to overlap communication and computation as much as possible. The speed-up, computed as for the preconditioning, is shown in Figure 5. Here again, it reflects rather faithfully the distribution of the load on the processors (see Figure 1), which proves that the overhead due to communications does not penalize the computation.
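For illustration, a serial NumPy sketch of the block-tridiagonal product. Equal-sized square blocks are assumed for brevity (in the application the coupling blocks are smaller than the diagonal blocks), and the comments indicate which terms require the neighbours in the parallel version.

```python
import numpy as np

def block_tridiag_matvec(D, U, L, x):
    """y = A x for a block-tridiagonal A.
    D[i]       : diagonal block of block-row i
    U[i], L[i] : blocks in positions (i, i+1) and (i+1, i)
    x          : list of block vectors.
    In the parallel code each processor owns a contiguous range of block
    rows; only the U/L terms that straddle an ownership boundary need the
    neighbour's x-block, and that exchange can be overlapped with the
    purely local D[i] @ x[i] products."""
    N = len(D)
    y = [D[i] @ x[i] for i in range(N)]        # local diagonal contributions
    for i in range(N - 1):                     # coupling contributions
        y[i] += U[i] @ x[i + 1]
        y[i + 1] += L[i] @ x[i]
    return y
```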
Fig. 4. Preconditioning speed-up (Astro-5850, ICGS)

Fig. 5. Matrix-vector product speed-up (Astro-5850, ICGS)
Influence of dot products. The dot products are the well-known bottleneck of Krylov methods in a distributed environment. Figure 6 shows the time spent, per iteration, in the dot products and Figure 7 gives the percentage of the solution time spent in the dot products. Both figures indicate clearly that GMRES with ICGS is the best method for avoiding the degradation of performance generally induced by the scalar products. We even see in Figure 6 that, when p ≥ 32, the time spent per iteration in the dot products starts increasing for IMGS and MGS whereas it continues to decrease with ICGS: this happens when the communication time overtakes the computation time in the dot products for IMGS and MGS. By gathering the scalar products, ICGS gives the processors more computational work per communication. Finally, Figure 8 gives the speed-up, for an average iteration, of the complete solution. When comparing with the best possible curve for the given processor
Fig. 6. Time spent in dot products per iteration (Astro-5850)

Fig. 7. Percentage of the solution time spent in dot products (Astro-5850)
load shown in Figure 1, we see that GMRES-MGS and GMRES-IMGS give poor performance whereas GMRES-ICGS is much less affected by the scalar products. We now present a test problem for which it is easier to vary the size, in order to test the parallel scalability properties of GMRES.
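To make the communication pattern behind these observations concrete, here is a hypothetical mpi4py sketch contrasting the one-reduction-per-vector pattern of MGS with the single gathered reduction of CGS/ICGS. This is our own illustration, not the CERFACS code.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def mgs_style(V_loc, w_loc):
    """k separate global reductions: one per basis vector."""
    h = np.empty(V_loc.shape[1])
    for j in range(V_loc.shape[1]):
        h[j] = comm.allreduce(float(V_loc[:, j] @ w_loc), op=MPI.SUM)
        w_loc = w_loc - h[j] * V_loc[:, j]
    return h, w_loc

def cgs_style(V_loc, w_loc):
    """One gathered reduction for all scalar products at once."""
    h_loc = V_loc.T @ w_loc
    h = np.empty_like(h_loc)
    comm.Allreduce(h_loc, h, op=MPI.SUM)
    return h, w_loc - V_loc @ h
```

With k basis vectors, the MGS variant pays k reduction latencies per orthogonalization, the CGS/ICGS variant only one (or two when a second pass is needed), which is exactly what Figures 6 and 7 measure.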
Fig. 8. Speed-up for an average iteration (Astro-5850)
3 Scalability on a model problem
We intend to investigate the parallel scalability of the G M R E S code and the influence of the orthogonalization schemes. For this purpose we consider the solution, via two classic domain decomposition techniques, of an elliptic equation
∂²u/∂x² + ∂²u/∂y² + a ∂u/∂x + b ∂u/∂y = f    (1)
on the unit square Ω = (0, 1)² with Dirichlet boundary conditions. Assume that the original domain Ω is triangulated by a set of non-overlapping coarse elements defining the N sub-domains Ωᵢ. We first consider an additive Schwarz preconditioner that can be briefly described as follows. Each substructure Ωᵢ is extended to a larger substructure Ω̃ᵢ, within a distance δ from Ωᵢ, where δ refers to the amount of overlap. Let Ãᵢ denote the discretization of the differential operator on the sub-domain Ω̃ᵢ. Let Rᵢᵀ denote the extension operator which extends by zero a function on Ω̃ᵢ onto Ω, and Rᵢ the corresponding pointwise restriction operator. With these notations the additive Schwarz preconditioner, M_AD, can be compactly described as
M_AD u = Σᵢ Rᵢᵀ Ãᵢ⁻¹ Rᵢ u. We also consider the class of domain decomposition techniques that use non-overlapping sub-domains. The basic idea is to reduce the differential operator on the whole domain to an operator on the interfaces between the sub-domains. Let I denote the union of the interior nodes in the sub-domains, and let B denote the interface nodes separating the sub-domains. Then, grouping the unknowns corresponding to I in the vector u_I and the unknowns corresponding to B in the vector u_B, we obtain the following reordering of the problem:
Au = [ A_II  A_IB ; A_BI  A_BB ] [ u_I ; u_B ] = [ f_I ; f_B ].    (2)
For standard discretizations, A_II is a block diagonal matrix where each diagonal block, A_i, corresponds to the discretization of (1) on the sub-domain Ω_i. Eliminating u_I in the second block row of (2) leads to the following reduced equation for u_B:
S u_B = g_B = f_B − A_BI A_II⁻¹ f_I,    (3)
where
S = A_BB − A_BI A_II⁻¹ A_IB. S is referred to as the Schur complement matrix (or also the capacitance matrix). In our experiments, Equation (1) is discretized using finite elements. The Schur complement matrix can then be written as
S = Σ_{i=1}^{N} S^(i),    (4)
where S^(i) is the contribution from the i-th sub-domain, with S^(i) = A_BB^(i) − A_BI^(i) (A_II^(i))⁻¹ A_IB^(i), and the A^(i) denote submatrices of A_i. Without more sophisticated preconditioners, these two domain decomposition methods are not numerically scalable [12]; that is, the number of iterations required grows significantly with the number of sub-domains. However, in this section we are only interested in the study of the influence of the orthogonalization schemes on the scalability of the GMRES iterations from a computer science
point of view. This is the reason why we selected those two domain decomposition methods that exhibit a large amount of parallelism, and we intend to see how their scalability is affected by the GMRES solver. For a detailed overview of domain decomposition techniques, we refer to [12]. For both parallel domain decomposition implementations, we allocate one sub-domain to one processor of the target distributed computer. To reduce the time per iteration as much as possible we use an efficient sparse direct solver from the Harwell library [7] for the solution of the local Dirichlet problems arising in the Schur and Schwarz methods. To study the scalability of the code, we keep the number of nodes per sub-domain constant when we increase the number of processors. In the experiments reported in this paper, each sub-domain contains 64 × 64 nodes. For the additive Schwarz preconditioner, we selected an overlap of one element (i.e. δ = 1) between the sub-domains, which requires only one communication after the local solution with Ãᵢ⁻¹, while one more communication would be necessary before the solution for a larger overlap. Figure 9 displays the elapsed times for both the Schur and the Schwarz approaches observed on a 128-node Cray T3D. If the parallel code were perfectly scalable, the elapsed time would remain constant when the number of processors is increased.
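As a concrete illustration of (3)-(4), a small dense NumPy sketch of one sub-domain's contribution to the Schur complement; the sizes are hypothetical and the dense solve stands in for the Harwell sparse direct solver used in the actual code.

```python
import numpy as np

def subdomain_schur(A_II, A_IB, A_BI, A_BB):
    """Local contribution S^(i) = A_BB^(i) - A_BI^(i) (A_II^(i))^{-1} A_IB^(i)
    of one sub-domain to the Schur complement (dense illustration)."""
    return A_BB - A_BI @ np.linalg.solve(A_II, A_IB)

# The global Schur complement (4) is the sum of the sub-domain contributions;
# each term can be formed, or applied to a vector, independently on the
# processor owning that sub-domain:
#   S = sum(subdomain_schur(*blocks_i) for blocks_i in local_subdomains)
```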
Fig. 9. Elapsed time (left: Schur complement method; right: Schwarz method; versus the number of processors)
We also define the scaled speed-up by
SU_p = p · T_4 / T_p,
where T_s is the elapsed time to perform a complete restart step of GMRES(50) on s processors. For both the Schur and the Schwarz approaches we report in Figure 11 the scaled speed-ups associated with the elapsed times displayed in Figure 9.
Fig. 10. Elapsed time (left: Schur complement method; right: Schwarz method; +: MGS, *: IMGS, o: CGS, x: ICGS)

Fig. 11. Scaled speed-ups (left: Schur complement method; right: Schwarz method)

Fig. 12. Scaled speed-ups (left: Schur complement method; right: Schwarz method; +: MGS, *: IMGS, o: CGS, x: ICGS)
As can be seen in Figure 9, the solution of the Schur complement system does not require iterative orthogonalization; the curves associated with CGS/ICGS and MGS/IMGS perfectly overlap each other. In such a situation, the CGS/ICGS orthogonalizations are more attractive than MGS/IMGS. They are faster and exhibit a better parallel scalability, as can be seen in the left picture of Figure 11. On 128 nodes the scaled speed-up is equal to 114 for CGS/ICGS and only 95 for MGS/IMGS. When iterative orthogonalization is required, as for the Schwarz method on our example, CGS is the fastest and the most scalable but may lead to a loss of orthogonality in the Krylov basis, resulting in poor convergence or even a loss of convergence. In that case ICGS offers the best trade-off between numerical robustness and parallel efficiency. Among the numerically reliable orthogonalization schemes, ICGS gives rise to the fastest and the most scalable iterations.
4 Conclusion
The choice of the orthogonalization scheme is crucial to obtain good performance from Krylov-based iterative methods in a parallel distributed environment. From a numerical point of view, CGS should be discarded, together with MGS in some eigenproblem computations [3]. Recent work has proved that for linear systems, MGS, IMGS and ICGS ensure enough orthogonality of the computed basis so that the method converges. Finally, in a parallel distributed environment, ICGS is the orthogonalization method of choice because, by gathering the dot products, it significantly reduces the overhead due to communication.
References
1. Å. Björck. Numerics of Gram-Schmidt orthogonalization. Linear Algebra Appl., 197-198:297-316, 1994.
2. T. Braconnier, V. Frayssé, and J.-C. Rioual. ARNCHEB users' guide: Solution of large non symmetric or non hermitian eigenvalue problems by the Arnoldi-Tchebycheff method. Tech. Rep. TR/PA/97/50, CERFACS, 1997.
3. F. Chaitin-Chatelin and V. Frayssé. Lectures on Finite Precision Computations. SIAM, Philadelphia, 1996.
4. J. Drkošová, A. Greenbaum, Z. Strakoš, and M. Rozložník. Numerical stability of GMRES. BIT, 35, 1995.
5. V. Frayssé, L. Giraud, and S. Gratton. A set of GMRES routines for real and complex arithmetics. Technical Report TR/PA/97/49, CERFACS, 1997.
6. A. Greenbaum, Z. Strakoš, and M. Rozložník. Numerical behaviour of the modified Gram-Schmidt GMRES implementation. BIT, 37:707-719, 1997.
7. HSL. Harwell Subroutine Library. A Catalogue of Subroutines (Release 12). AEA Technology, Harwell Laboratory, Oxfordshire, England, 1995.
8. M. Rieutord. Inertial modes in the liquid core of the earth. Phys. Earth Plan. Int., 90:41-46, 1995.
9. M. Rieutord and L. Valdettaro. Inertial waves in a rotating spherical shell. J. Fluid Mech., 341:77-99, 1997.
10. Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856-869, 1986.
11. J. N. Shadid and R. S. Tuminaro. A comparison of preconditioned nonsymmetric Krylov methods on a large-scale MIMD machine. SIAM J. Sci. Comp., 14(2):440-459, 1994.
12. B. F. Smith, P. Bjørstad, and W. Gropp. Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, 1st edition, 1996.
A Parallel Solver for Extreme Eigenpairs¹
Leonardo Borges and Suely Oliveira²
Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA
¹ This research is supported by NSF grant ASC 9528912 and a Texas A&M University Interdisciplinary Research Initiative Award.
² Department of Computer Science, Texas A&M University, College Station, TX 77843. email: [email protected].
Abstract. In this paper a parallel algorithm for finding a group of extreme eigenvalues is presented. The algorithm is based on the well-known Davidson method for finding one eigenvalue of a matrix. Here we incorporate knowledge about the structure of the subspace through the use of an arrowhead solver, which allows more parallelization in both the original Davidson method and our new version. In our numerical results various preconditioners (diagonal, multigrid and ADI) are compared. The performance results presented are for the Paragon, but our implementation is portable to machines which provide MPI and BLAS.
1 Introduction
A large number of scientific applications rely on the computation of a few eigenvalues for a given matrix A. Typically they require the lowest or highest eigenvalues. Our algorithm (DSE) is based on the Davidson algorithm, but calculates various eigenvalues through implicit shifting. DSE was first presented in [10] under the name RDME to express its ability to identify eigenvalues with multiplicity bigger than one. The choice of preconditioner is an important issue in eliminating convergence to the wrong eigenvalue [14]. In the next section, we describe the Davidson algorithm and our version for computing several eigenvalues. In [9] Oliveira presented convergence rates for Davidson-type algorithms depending on the type of preconditioner. These results are summarized here in Section 3. Section 4 addresses parallelization strategies, discussing the data distribution in a MIMD architecture and a fast solver for the projected subspace eigenproblem. In Section 5 we present numerical and performance results for the parallel implementation on the Paragon. Further results about the parallel algorithm and other numerical results are presented in [2].
2 The Davidson Algorithm
Two of the most popular iterative methods for large symmetric eigenvalue problems are the Lanczos and Davidson algorithms. Both methods solve the eigenvalue problem Au = λu by constructing an orthonormal basis V_k = [v_1, ..., v_k] at each k-th iteration step, and then finding an approximation for the eigenvector u of A by using a vector u_k from the subspace spanned by V_k. Specifically, the original problem is projected onto the subspace, which reduces the problem to a smaller eigenproblem S_k y = λ y, where S_k = V_kᵀ A V_k. The eigenpair (λ_k, y_k) can then be obtained by applying an efficient procedure for small matrices. To complete the iteration, the eigenvector y_k is mapped back as u_k = V_k y_k, which is an approximation to the eigenvector u of the original problem. The difference between the two algorithms consists in the way the basis V_k is built. The attractiveness of the Lanczos algorithm results from the fact that each projected matrix S_k is tridiagonal. Unfortunately, sometimes this method may require a large number of iterations. The Davidson algorithm defines a dense matrix S_k on the subspace, but since we can incorporate a preconditioner in this algorithm, the number of iterations can be much lower than for Lanczos. In Davidson-type algorithms, a preconditioner M_{λ_k} is applied to the current residual, r_k = A u_k − λ_k u_k, and the preconditioned residual t_k = M_{λ_k} r_k is orthonormalized against the previous columns of V_k = [v_1, v_2, ..., v_k]. Although in the original formulation M_λ is the diagonal matrix (diag(A) − λI)⁻¹ [6], the Generalized Davidson (GD) algorithm allows the incorporation of different operators for M_λ. The DSE algorithm can be summarized as follows.
Algorithm 1 - Restarted Davidson for Several Eigenvalues. Given a matrix A, a normalized vector v_1, the number of eigenpairs p, a restart index q, and the minimal dimension m for the projected matrix S (m > p), compute approximations λ and u for the p smallest eigenpairs of A.
1. Set V_1 ← [v_1] (initial guess)
2. For j = 1, ..., p (approximation for the j-th eigenpair)
   While k = 1 or ‖r_{k−1}‖ > ε do
   (a) Project S_k = V_kᵀ A V_k.
   (b) If (m + q) ≤ dim S_k (restart S_k): reduce S_k ← (Λ_k)_{m×m} to its m smallest eigenvalues, and update V_k for the new basis.
   (c) Compute the j-th smallest eigenpair (λ_k, y_k) of S_k.
   (d) Compute the Ritz vector u_k ← V_k y_k.
   (e) Check convergence for r_k ← A u_k − λ_k u_k.
   (f) Apply the preconditioner: t_k ← M r_k.
   (g) Expand the basis V_{k+1} ← [V_k, t_k] using modified Gram-Schmidt (MGS).
   End while
3. End For.
The core ideas of DSE (Algorithm 1) are based on the projection of A onto the subspace spanned by the columns of V_k. The iteration number k is not necessarily equal to dim S_k, since we have incorporated implicit restarts. The matrix S_k is obtained by adding one more column and row V_kᵀ A v_k to the matrix S_{k−1} (step 2.a). Other important aspects of the DSE algorithm are: (1) the eigenvalue solver for the subspace matrix S_k (step 2.c); (2) the use of an auxiliary matrix W_k = [w_1, ..., w_k] to provide a residual calculation r_k = A u_k − λ_k u_k = W_k y_k − λ_k u_k with less computational work (step 2.e); (3) the choice of a preconditioner M (step 2.f); and (4) the use of modified Gram-Schmidt orthonormalization (step 2.g), which preserves numerical stability when updating the orthonormal basis V_{k+1}. At each iteration, the algorithm expands the matrix S either until all of the first p eigenvalues have converged, or until S reaches a maximum dimension m + q; in the latter case, restarting is applied by using the orthonormal decomposition S_k = Y_k Λ_k Y_kᵀ of S_k. This corresponds to step 2.b in the algorithm. Because of our choice for m, note that in step 2.c dim S will always be bigger than or equal to j.
3 Convergence Rate
A proof of convergence (but without a rate estimate) for the Davidson algorithm is given in Crouzeix, Philippe and Sadkane [5]. A bound on the convergence rate was first presented in [10]. The complete proof is shown in Oliveira [9]. Let A be the given matrix whose eigenvalues and eigenvectors are wanted. The preconditioner M is given for one step, and Davidson's algorithm is used with u_k being the current computed approximate eigenvector. The current eigenvalue estimate is the Rayleigh quotient λ_k = ρ_A(u_k) = (u_kᵀ A u_k)/(u_kᵀ u_k). Let the exact eigenvector with the smallest eigenvalue of A be u, and Au = λu.
(If λ is a repeated eigenvalue of A, then we can let u be the normalized projection of u_k onto this eigenspace.)
Theorem 1. Let P be the orthogonal projection onto ker(A − λI)^⊥. Suppose that A and M are symmetric positive definite. If
‖P − PMP(A − λI)‖₂ ≤ σ < 1,
then for almost any starting value x_1, the eigenvalue estimates λ_k converge to λ ultimately geometrically with convergence factor bounded by σ, and the angle between the computed eigenvector and the exact eigenspace goes to zero ultimately geometrically with convergence factor bounded by σ.
A geometric convergence rate can be found for DSE (which obtains eigenvalues beyond the smallest (or largest) eigenvalue) by modifying Theorem 1. In the following theorem assume that
σ′ = ‖P′ − P′MP′(A − λ_p I)P′‖₂,
where P′ is the orthogonal projection onto the orthogonal complement of the span of the first p − 1 eigenvectors. Then we can show, in a similar way to Theorem 1, that the convergence factor for the new algorithm is bounded by (σ′)². To prove Theorem 2 we use the fact that P′s_k = s_k, as s_k is orthogonal to the bottom p eigenvectors, and that although (A − λ_p I) is no longer positive semi-definite, P′(A − λ_p I)P′ is.
Theorem 2. Suppose that A and M are symmetric positive definite and that the first p − 1 eigenvectors have been found exactly. Let P′ be the orthogonal projection onto the orthogonal complement of the span of the first p − 1 eigenvectors of A. If
‖P′ − P′MP′(A − λ_p I)P′‖₂ ≤ σ′ < 1,
then for almost any starting value x_1, the eigenvalue estimates λ_k obtained by our modified Davidson algorithm for several eigenvalues converge to λ_p ultimately geometrically with convergence factor bounded by (σ′)², and the angle between the exact and computed eigenvector goes to zero ultimately geometrically with convergence factor bounded by σ′.
4 Parallel Implementation
Previous implementations of the Davidson algorithm solve the eigenvalue problem in the subspace S by using algorithms for dense matrices: early works [3, 4, 17] adopt EISPACK [12] routines, and later implementations [13, 15] use LAPACK [1] or reductions to tridiagonal form. Partial parallelization is obtained through the matrix-vector operations and sparse format storage for the matrix A [13, 15]. Here we exploit the relationship between two successive matrices S_k, which allows us to represent S_k through an arrowhead matrix. The arrowhead structure is extremely sparse and the associated eigenvalue problem can be solved by a highly parallelizable method.
4.1 Data Distribution
Data partitioning significantly affects the performance of a parallel system by determining the actual degree of concurrency of the processors. Matrices are partitioned among distinct processors so that the program exploits the best possible data parallelism: the final distribution is well balanced, and most of the computational work can be performed without communication. These two conditions make the parallel program well suited for distributed memory architectures. Both computational workload and storage requirements are the same for all processors. Communication overhead is kept as low as possible. The matrix A is split into row blocks A^i, i = 1, ..., N, each one containing about ⌈n/N⌉ rows of A. Thus processor i, i = 1, ..., N, stores A^i, the i-th row block of A. The matrices V_k and W_k are stored in the same fashion. This data distribution allows us to perform many of the matrix-vector computations in place. The orthonormalization strategy is also an important aspect in parallel environments. Recall that the modified Gram-Schmidt (MGS) algorithm is applied to the extended matrix [V_k, t_k], where the current basis V_k has been previously orthonormalized. This observation reduces the computational work by eliminating the outer loop from the two nested loops of the full MGS algorithm.
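A minimal sketch of that reduced orthonormalization step (only the inner MGS loop), assuming V already has orthonormal columns; in the parallel code each processor would apply it to its row block, with global reductions for the inner products. The function name is ours.

```python
import numpy as np

def orthonormalize_new_vector(V, t):
    """Modified Gram-Schmidt of a single new vector t against the already
    orthonormal columns of V: only the inner loop of full MGS is needed."""
    for j in range(V.shape[1]):
        t = t - (V[:, j] @ t) * V[:, j]
    return t / np.linalg.norm(t)
```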
4.2 The Arrowhead Relationship Between Matrices S_k
As pointed out in [2, 10], the relationship between S_k and S_{k−1} can be used to show that S_k is explicitly similar to an arrowhead matrix Ŝ_k of the form
Ŝ_k = [ Λ_{k−1}   a_k ; a_kᵀ   s_kk ],    (1)
where a_k = Y_{k−1}ᵀ V_{k−1}ᵀ w_k, s_kk = v_kᵀ w_k, and the diagonal matrix Λ_{k−1} corresponds to the orthonormal decomposition S_{k−1} = Y_{k−1} Λ_{k−1} Y_{k−1}ᵀ. In practice, the matrix S_k does not need to be stored: only a vector for Λ_k and a matrix for Y_k are required from one iteration to the next. Thus, given the eigenvalues Λ_{k−1} and eigenvectors Y_{k−1} of S_{k−1}, the matrix Ŝ_k can be used to find the eigenvalues Λ_k of S_k. Arrowhead eigensolvers [8, 11] are highly parallelizable and typically perform O(k²) operations, instead of the usual O(k³) effort of algorithms for dense matrices S.
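For illustration, a small NumPy sketch that assembles Ŝ_k from the previously computed eigenvalues and the new column, with a dense symmetric eigensolver standing in for the O(k²) arrowhead solvers of [8, 11]; the function name is ours.

```python
import numpy as np

def arrowhead_from_previous(lam_prev, a, s_kk):
    """Assemble the arrowhead matrix of (1): diag(lam_prev) bordered by the
    'arrow' column a and the corner entry s_kk, and return its eigenpairs."""
    k = lam_prev.size + 1
    S = np.zeros((k, k))
    S[np.diag_indices(k - 1)] = lam_prev   # Lambda_{k-1} on the diagonal
    S[:-1, -1] = a                         # arrow column a_k
    S[-1, :-1] = a                         # arrow row a_k^T
    S[-1, -1] = s_kk                       # corner entry s_kk
    return np.linalg.eigh(S)               # eigenpairs of S_k (similar matrix)
```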
5 Numerical Results
In our numerical results we employ three kinds of preconditioners: a diagonal preconditioner (as in the original Davidson), multigrid and ADI. A preconditioner can be expressed as the matrix M which solves Ax = b by applying an iterative method to MAx = Mb instead. In the case of a diagonal preconditioner this corresponds to scaling the system and then solving. Multigrid and ADI preconditioners are more complex and for those we refer the reader to [16, 18, 19]. In our implementation level 1, 2 and 3 BLAS and the Message Passing Interface (MPI) library were used for easy portability. The computational results in this section were obtained with a finite difference approximation for
−Δu + gu = f    (2)
on a unit square domain. Here g is null inside a 0.2 × 0.2 square at the center of the domain and g = 100 on the remaining domain. To compare the performance delivered by the distinct preconditioners we observe the total timing and the number of iterations required by the sequential DSE for finding the ten smallest eigenpairs (p = 10), assuming convergence for residual norms less than or equal to 10⁻⁷. The restart indices were q = 10 and m = 15. This corresponds to applying restarting every time the projected matrix S_k reaches order 25, reducing its order to 15. Table 1 presents the actual running times on a single processor of the Intel Paragon for three grid sizes: 31×31, 63×63, and 127×127 (matrices of orders 961, 3969 and 16129, respectively). It reflects the trade-off between preconditioning strategies: although the diagonal preconditioner (DIAG) is the easiest and fastest to compute, it requires an increasing number of iterations for larger matrices. Multigrid preconditioners (MG) are more expensive than DIAG, but they turn out to be more effective for larger matrices. Finally, the ADI method aggregates the advantages of the previous preconditioners in the sense that it is more effective and less expensive than
MG. More details about the preconditioners used here can be found in [2] and its references.

Table 1. Sequential times and number of iterations for the three preconditioners on matrices of order 961, 3969 and 16129.
The overall behavior of the DSE algorithm (with a multigrid preconditioner) is shown in Figure 1 for matrix sizes 3969 and 16129, as a function of the number of processors. Note that the estimated optimal number of processors is not far from the actual optimum. The model for our estimates is presented in [2].
Fig. 1. Actual and estimated times for equation (2) on the Paragon, for two different matrix sizes.
To conclude, we compare the performance of the parallel DSE with PARPACK [7], a parallel implementation of ARPACK.¹ Figure 2 presents the total running times for both algorithms for the problem described above. For these runs, DSE used our parallel implementation of ADI as its preconditioner.
¹ ARPACK implements an Implicitly Restarted Arnoldi Method (IRAM), which in the symmetric case corresponds to the Implicitly Restarted Lanczos algorithm. We used the regular mode when running PARPACK.
The problem was solved using 4, 8, and 16 processors to obtain relative residuals ‖Au − λu‖/‖u‖ of order less than 10⁻⁵. We show our theoretical analysis of the parallel algorithm in [2]. Other numerical results for the sequential DSE algorithm, including examples showing the behavior of the algorithm for eigenvalues with multiplicity greater than one, were presented in [10].
Fig. 2. Running times for DSE (with the ADI preconditioner) and PARPACK for three distinct problem sizes, using 4, 8, and 16 processors on the Paragon.
References
1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.
2. L. Borges and S. Oliveira. A parallel Davidson-type algorithm for several eigenvalues. Journal of Computational Physics. To appear.
3. G. Cisneros, M. Berrondo, and C. F. Bunge. DVDSON: A subroutine to evaluate selected sets of eigenvalues and eigenvectors of large symmetric matrices. Compu. Chem., 10:281-291, 1986.
4. G. Cisneros and C. F. Bunge. An improved computer program for eigenvectors and eigenvalues of large configuration interaction matrices using the algorithm of Davidson. Compu. Chem., 8:157-160, 1984.
5. M. Crouzeix, B. Philippe, and M. Sadkane. The Davidson method. SIAM J. Sci. Comput., 15(1):62-76, 1994.
6. E. R. Davidson. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comp. Phys., 17:87-94, 1975.
7. K. Maschhoff and D. Sorensen. A portable implementation of ARPACK for distributed memory parallel architectures. In Proceedings of the Copper Mountain Conference on Iterative Methods, April 9-13, 1996.
8. D. P. O'Leary and G. W. Stewart. Computing the eigenvalues and eigenvectors of symmetric arrowhead matrices. J. Comp. Phys., 90:497-505, 1990.
9. S. Oliveira. On the convergence rate of a preconditioned algorithm for eigenvalue problems. Submitted.
10. S. Oliveira. A convergence proof of an iterative subspace method for eigenvalues problem. In F. Cucker and M. Shub, editors, Foundations of Computational Mathematics Selected Papers, pages 316-325. Springer, January 1997.
11. S. Oliveira. A new parallel chasing algorithm for transforming arrowhead matrices to tridiagonal form. Mathematics of Computation, 67(221):221-235, January 1998.
12. B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler. Matrix eigensystem routines: EISPACK guide. Number 6 in Lecture Notes Comput. Sci. Springer-Verlag, Berlin, Heidelberg, New York, second edition, 1976.
13. A. Stathopoulos and C. F. Fischer. A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comp. Phys. Comm., 79:268-290, 1994.
14. A. Stathopoulos, Y. Saad, and C. F. Fischer. Robust preconditioning of large, sparse, symmetric eigenvalue problems. J. Comp. and Appl. Mathematics, 64:197-215, 1995.
15. V. M. Umar and C. F. Fischer. Multitasking the Davidson algorithm for the large, sparse eigenvalue problem. Int. J. Supercomput. Appl., 3:28-53, 1989.
16. R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, 1963.
17. J. Weber, R. Lacroix, and G. Wanner. The eigenvalue problem in configuration interaction calculations: A computer program based on a new derivation of the algorithm of Davidson. Compu. Chem., 4:55-60, 1980.
18. J. R. Westlake. A Handbook of Numerical Matrix Inversion and Solution of Linear Equations. Wiley, 1968.
19. D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, 1971.
Parallel Solvers for Large Eigenvalue Problems Originating from Maxwell's Equations
Peter Arbenz¹ and Roman Geus¹
¹ Swiss Federal Institute of Technology (ETH), Institute of Scientific Computing, CH-8092 Zurich, {arbenz, geus}@inf.ethz.ch
Abstract. We present experiments with two new solvers for large sparse symmetric matrix eigenvalue problems: (1) the implicitly restarted Lanczos algorithm and (2) the Jacobi-Davidson algorithm. The eigenvalue problems originate from the computation of a few of the lowest frequencies of standing electromagnetic waves in cavities that have been discretized by the finite element method. The experiments have been conducted on up to 12 processors of an HP Exemplar X-Class multiprocessor computer.
1 Introduction
Most particle accelerators use standing waves in cavities to produce the high voltage RF (radio frequency) fields required for the acceleration of the particles. The mathematical model for these high frequency electromagnetic fields is the eigenvalue problem obtained from Maxwell's equations in a bounded volume. Usually, the eigenfield corresponding to the fundamental mode of the cavity is used as the accelerating field. Due to higher harmonic components contained in the RF power fed into the cavity, and through interactions between the accelerated particles and the electromagnetic field, an excitation of higher order modes can occur. The RF engineer designing such an accelerating cavity therefore needs a tool to compute the fundamental and about ten to twenty of the following eigenfrequencies, together with the corresponding electromagnetic eigenfields. After separation of the variables depending on space and time, and after elimination of the magnetic field terms, the variational form of the eigenvalue problem for the electric field intensity is given by [1]: Find (λ, u) ∈ ℝ × W₀ such that u ≠ 0 and
(curl u, curl v) = λ (u, v),   ∀v ∈ W₀.    (1)
Let L²(Ω) be the Hilbert space of square-integrable functions over the 3-dimensional domain Ω (the cavity) with inner product (u, v) = ∫_Ω u(x)·v(x) dx and norm ‖u‖_{0,Ω} := (u, u)^{1/2}. In (1), W₀ := {v ∈ W | div v = 0} with
W := {v ∈ L²(Ω)³ | curl v ∈ L²(Ω)³, div v ∈ L²(Ω), n × v = 0 on ∂Ω}.
The difficulty with (1) stems from the condition div v = 0, as it is hard to find divergence-free finite elements. Therefore, ways have been sought to get around the condition div v = 0. In this process care has to be taken in order not to introduce so-called spurious modes, i.e. eigenmodes that have no physical meaning [8]. We considered two approaches free of spurious modes, a penalty method and an approach based on Lagrange multipliers. In the penalty method approach (1) is replaced by [9, 11]: For fixed s > 0, find (λ, u) ∈ ℝ × W such that u ≠ 0 and
(curl u, curl v) + s (div u, div v) = λ (u, v),   ∀v ∈ W.    (2)
Here, s is a positive, usually small parameter. The eigenmodes u(x) of (2) corresponding to eigenvalues λ < μ₁ s are eigenmodes of (1); μ₁ is the smallest eigenvalue of the negative Laplace operator −Δ on Ω. When discretized by ordinary nodal-based finite elements, (2) leads to matrix eigenvalue problems of the form
A x = λ M x.    (3)
For positive s, both A and M are symmetric positive definite sparse n-by-n matrices. In the mixed formulation the divergence-free condition is enforced by means of Lagrange multipliers [9]: Find (λ, u, p) ∈ ℝ × H₀(curl; Ω) × H₀¹(Ω) such that u ≠ 0 and
(a) (curl u, curl Ψ) + (grad p, Ψ) = λ (u, Ψ),   ∀Ψ ∈ H₀(curl; Ω),
(b) (u, grad q) = 0,   ∀q ∈ H₀¹(Ω).    (4)
Here, H₀(curl; Ω) := {v ∈ L²(Ω)³ | curl v ∈ L²(Ω)³, n × v = 0 on ∂Ω}. The finite element discretization of (4) yields a matrix eigenvalue problem of the form
[ A   C ; Cᵀ  0 ] [ x ; y ] = λ [ M  0 ; 0  0 ] [ x ; y ],    (5)
where A and M are n-by-n and C is n-by-m. M is positive definite, A is only positive semidefinite. It turns out that (5) becomes most convenient to handle if the finite elements for the vector fields are chosen to be edge elements, as proposed by Nédélec [12, 8], and the Lagrange multipliers are represented by nodal-based finite elements of the same degree. Then, the columns of M⁻¹C form a basis for the nullspace of A and, in principle, it suffices to compute the eigenvalues of an eigenproblem formally equal to (3) but with A and M from (5). To get rid of the high-dimensional eigenspace associated with the eigenvalue zero, Bespalov [4] proposed to replace (5) by
.~ = A + C H C T,
(6)
where H is a positive definite matrix chosen such that the zero eigenvMues are shifted to the right of the desired eigenvalues and do not disturb the computations. In this note we consider solvers for Ax = AMx with A and M from the penalty approach (3) as well as from the mixed element approach (6).
2 Algorithms
In this section we briefly survey the numerical methods that we will apply to the model problem, which is a cavity of the form of a rectangular box, Ω = (0, a) × (0, b) × (0, c). We investigate two algorithms for computing a few of the smallest (positive) eigenvalues of the matrix eigenvalue problems originating from both the penalty method and the mixed method. For computing a few, say p, eigenvalues of a sparse matrix eigenvalue problem
A x = λ M x,   A = Aᵀ,   M = Mᵀ > 0,    (7)
closest to a number τ, it is advisable to make a so-called shift-and-invert approach and apply a spectral transformation with a shift σ close to τ, and to solve [7]
(A − σM)⁻¹ M x = μ x,   μ = 1/(λ − σ),    (8)
instead of solving (7). Notice that (A − σM)⁻¹M is M-symmetric, i.e., it is symmetric with respect to the inner product xᵀMy. The spectral transformation leaves the eigenvectors unchanged. The eigenvalues of (7) close to the shift σ become the largest in absolute value of (8). They are relatively well separated, which improves the speed of convergence. The cost of the improved convergence rate is the need to solve (at least approximately) systems of equations with the matrix A − σM. In all algorithms the shift σ was chosen to be 48 < λ₁ ≈ 48.4773.
1. The Implicitly Restarted Lanczos algorithm (IRL). Because of the large memory consumption of the Lanczos algorithm it is often impossible to proceed until convergence. It is then necessary to restart the iterative process in some way with as little loss of information as possible. An elegant way to restart has been proposed by Sorensen for the Arnoldi algorithm [16]; see [5] for the symmetric Lanczos case. Software is publicly available in ARPACK [10]. The algorithm is based on the spectral transformation Lanczos algorithm. The iteration process is executed until j = p + k, where k is some positive integer, often k = p. Complete reorthogonalization is done for stability reasons. This is possible since by assumption p + k is not big. In the restarting phase a clever application of the QR algorithm [14] reduces the dimension of the search space to p in such a way that the p new orthogonal basis vectors still form a Krylov sequence [16, 5]. This allows the Lanczos phase to be resumed smoothly. Here, we solved the symmetric indefinite system of equations (A − σM)x = y iteratively by SYMMLQ [13, 3]. The accuracy of the solution of the linear system has to be at least as high as the desired accuracy in the eigenvalue calculation in order that the coefficients of the Lanczos three-term recurrence are reasonably accurate [10]. We experimented with various preconditioners. None of them was satisfactory. We obtained the best results with diagonal preconditioning. In our implementation we chose p = k = 15. Besides the storage for the matrices A and M, IRL requires space for two n × (p + k) arrays.
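A hedged SciPy sketch of the shift-and-invert operator x ↦ (A − σM)⁻¹Mx with an inner iterative solve may make the nesting explicit. MINRES is used here as a stand-in for SYMMLQ (SciPy does not ship SYMMLQ), and taking absolute values of the diagonal is our choice to keep the Jacobi preconditioner positive definite; none of this is the authors' code.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_invert_operator(A, M, sigma):
    """Return a LinearOperator applying x -> (A - sigma*M)^{-1} M x via an
    inner iterative solve of the symmetric indefinite shifted system."""
    K = (A - sigma * M).tocsc()
    Dinv = sp.diags(1.0 / np.abs(K.diagonal()))   # Jacobi-type preconditioner

    def matvec(x):
        y, info = spla.minres(K, M @ x, M=Dinv)   # stand-in for SYMMLQ
        return y

    n = A.shape[0]
    return spla.LinearOperator((n, n), matvec=matvec)
```

The resulting operator is what the (restarted) Lanczos process is applied to; every outer Lanczos step triggers one inner solve, which is where the inner iteration counts reported below come from.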
2. The Jacobi-Davidson algorithm (JDQR). In the Jacobi-Davidson algorithm the eigenpairs of Ax = λMx are computed one by one. As with the implicitly restarted Lanczos procedure, the search space V_j is expanded up to a certain dimension, say j = j_max. The basis of this space is constructed to be M-orthogonal, but there is no three-term recurrence relation. In the expansion phase the search space is extended by a vector v orthogonal to the current eigenpair approximation (λ̃, ũ) by means of the equation [15]
(I − ũũᵀM)(A − λ̃M)v = −(I − ũũᵀM) r,   (I − ũũᵀM)v = v.    (9)
(If eigenvectors have already been computed, this process is executed in the space M-orthogonal to them.) The solution of (9) is only needed approximately. Therefore, it can be solved iteratively. In [6] the authors propose to use a preconditioner of the form
(I − ũũᵀM) K (I − ũũᵀM),    (10)
where K is a good and easily invertible approximation of A − λ̃M. They also give a generalized inverse of the matrix in (10). We solved the systems by the conjugate gradient squared (CGS) method [3], with K equal to the diagonal of A − λ̃M. The next approximate eigenvalue λ̃ and corresponding Ritz vector ũ are obtained by a Ritz step for the subspace V_{j+1}. If the search space has dimension j_max it is shrunk to dimension j_min by selecting only those j_min Ritz vectors corresponding to the Ritz values closest to the target value τ = 48. In our experiments we set j_min = p = 15 and j_max = 2p, as suggested in [10]. Thus, besides the storage for the matrices A and M, memory space is needed for an n × j_max array and for three n × p arrays.
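A minimal sketch of the projected correction equation (9) solved by a few CGS steps, assuming ũ is M-normalized (ũᵀMũ = 1); the preconditioner (10) is omitted for brevity and the function names are ours.

```python
import scipy.sparse.linalg as spla

def jd_correction(A, M, u, lam, r, maxiter=50):
    """Approximately solve the Jacobi-Davidson correction equation
    (I - u u^T M)(A - lam M)(I - u u^T M) v = -(I - u u^T M) r."""
    Mu = M @ u

    def proj(x):                      # x -> (I - u u^T M) x
        return x - u * (Mu @ x)

    def op(x):
        return proj((A - lam * M) @ proj(x))

    n = A.shape[0]
    Aop = spla.LinearOperator((n, n), matvec=op)
    v, info = spla.cgs(Aop, -proj(r), maxiter=maxiter)  # a few CGS steps suffice
    return proj(v)                    # keep the iterate in the projected space
```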
Numerical
Experiments
In this section we compare IRL and J D Q R for computing the 15 smallest eigenvalues of (1) with /2 = (0, a) x (0, b) x (0, c) where a -= 0.8421, b = 0.5344, c = 0.2187. We apply penalty as well as mixed methods to this model problem whose eigenvalues can be computed analytically [1]. The computational results have been obtained with the HP Exemplar X-class system at the E T H Zurich. 1. S e q u e n t i a l r e s u l t s . In Figs. 1 and 2 execution times for computing the 15 lowest eigenvalues and associated eigenvectors vs. the accuracy of the computed solutions relative to the analytic solution are plotted for different mesh sizes. The largest problems sizes were n = 65t125 and n = 45t226 for the linear and quadratic edge elements, and n = 28~278 and n = 118~539 for the linear and quadratic node elements. For all experiments we used a tolerance of 10 - s in the stopping criterion of the eigensolvers. Loosely speaking, this means that the eigenvalues are computed to at least 8 significant digits. The execution times comprise the solution time of the eigensolver but not the time for building the matrices. For each finite element type (linear/quadratic, node element/edge element) the performance of the two algorithms is shown. Fig. 1 shows the results
775
I0 s
performance compansion of A R P A C K and JDQR with nodal elements i ........ , ........ i ........ i , 9 ",',, ,,u
........
I
I .................~4 ~ ................-* 0 ................,~
l O~
........
1in node w/arpack hn node w/Jdqr quad node w / a r p a c k quad node w / j d q r
10 ~
.,=
10 z
10 t
"\1%.
~
10 o
1(] -1
........
i
........
i
10-5
10 ~
........
10 4
i
........
i
10 -~ rel accuracy of k I
........
,
10 2
.......
10-~
10~
Fig. 1. Comparison of IRL and JDQR with node elements: Accuracy of the computed )~1 relative to the analytic solution vs. computation time performance compadsion of A R P A C K and JDQR with edge elements 10s
104
9
. . . . . . .
i
.
.
.
.
.
.
\'%~>.~>.
.
.
i
.
t %
.
.
.
.
.
.
.
i
,
9
. . . . . .
F
........
I Jx
4~
lin edge w / a r p a c k Un edge w / [ d q r
I *
-~
quad edge w / a r p a c k quad edge w / j d q r
~
10 ~
,~10 2
101
10~
10-' 10 s
........
t 10-*
........
, ........ i 10 =3 I 0 -2 rel accuracy of k I
........
, 10 -I
,
, , ,,,, 10 0
Fig. 2. Comparison of IRL and JDQR with edge elements: Accuracy of the computed A1 relative to the analytic solution vs. computation time
that we obtained with linear and quadratic node elements for λ₁. The higher eigenvalues behave similarly but are of course less accurate. Fig. 2 shows the results for the linear and quadratic edge elements. For edge elements the convergence is improved by introducing the matrix C as given in (6). Experiments to that end are reported in [2]. H in (6) was heuristically chosen to be αI with α = 100/h, where h is the mesh width. The comparison of linear with quadratic element types reveals immediately the inferiority of the former. They give much lower accuracy for a given computation time. The node elements in turn are to be preferred to the edge elements, at least in this simplified model problem. For a given computational effort, the eigenvalues obtained with the node elements are about an order of magnitude more accurate than those obtained with the edge elements. The situation is not so evident with the linear elements. With the quadratic elements and linear node elements JDQR is consistently faster than IRL by about 10 to 30% for the large problem sizes. For linear edge elements JDQR still is 10 to 20% ahead of IRL for most of the larger problem sizes. For the smaller problems the situation is not so clear. We now discuss the behavior of the two algorithms in the case of the quadratic edge elements, where the problem size grows from n = 1408 to 45226. (In the latter case A and M have each about 1'730'000 nonzero elements.) With both algorithms the number of outer iteration steps varies only little. It is about 65 with IRL and about 210 with JDQR. So, the number of restarts does not depend on the problem size. Each outer iteration step requires the solution of one system of equations. We used the solver SYMMLQ with IRL and CGS with JDQR. One (inner) iteration step of CGS counts for about two iteration steps of SYMMLQ. We applied diagonal preconditioning in all cases. The superiority of JDQR over IRL for large problem sizes can be explained by the number of inner iteration steps. The average number of inner iteration steps per outer iteration step grows from 64 to 91 with JDQR but from 423 to 1140 with IRL. In ARPACK each system of equations is solved to high accuracy. In JDQR the accuracy requirement is not so stringent and actually varies from (outer) step to step (cf. Section 2). This explains the higher iteration numbers for IRL. The projections in (9) improve the condition number of the system matrix. Further, the shift σ is updated in JDQR while it stays constant with ARPACK. These may be the reasons for the less pronounced increase of the number of inner iteration steps with JDQR. Notice that the projections made in each iteration step account for less than 10% of the execution time of JDQR. Similar observations can be made with the other element types.
2. Parallel results. The parallel experiments were carried out on the 32-processor HP Exemplar X-class system at ETH Zurich. This shared-memory CC-NUMA machine consists of two hypernodes with 16 HP PA-8000 processors and 4 GBytes of memory each. The processors and memory banks within a hypernode are connected through a crossbar switch with a bandwidth of 960 MBytes/s per port. The hypernodes themselves are connected by a slower network with a ring
Fig. 3. Execution times (a) for the large eigenproblem (n = 22323 for node, n = 5798 for edge elements) and (b) for the small eigenproblem (n = 4981 and n = 1408, respectively)
Fig. 4. Speed-ups (a) for the large eigenproblem and (b) for the small eigenproblem

topology. The HP PA-8000 processors have a clock rating of 180 MHz and a peak performance of 720 MFLOPS. We carried out the parallel experiments for IRL (ARPACK) and JDQR using both quadratic node and quadratic edge elements and two different problem sizes. Linear elements are not considered because of the inferior results in the sequential case. For both edge and node elements we chose two problem sizes: the larger problems require about 1700 seconds on one processor, whereas the smaller problems require only about 220 seconds. For our experiments we were allowed to use up to 12 processors. With ARPACK about 95% of the sequential computation time is spent in the inner loops forming sparse matrix-vector products with the matrices A, M and C. JDQR spends about 90% for this task. We therefore parallelized the sparse matrix-vector product using the so-called HP Exemplar advanced shared-memory programming technique, which is directive-based.
The matrix C and the strictly lower triangles of A and M are stored in the compressed sparse row format [3]. The diagonals of A and M are stored separately. We parallelized the outermost loops with the loop_parallel directives. The necessary privatization of variables was done manually by means of the loop_private directive. For the product with C and the lower triangular parts of A and M no special considerations were necessary. For the product with Cᵀ and the upper triangular parts of A and M the processors store their "local" results in distinct vectors, which are accumulated into the global result vector afterwards. To get a well-balanced work load we distributed the triangular matrices in block-cyclic fashion over the processors. In Figure 3 the execution times for both problem sizes are plotted. In Figure 4 the corresponding speed-ups are found. The best speed-ups are reached with 10 processors for the large problems. ARPACK's IRL gives speed-ups of 5.8 for node elements and of 3.7 for edge elements. With JDQR we get 3.8 and 3.0, respectively. These numbers are lower for the small problem size. The reason for the better speed-ups with ARPACK is that the iterated classical Gram-Schmidt orthogonalization, which consumes between 5 and 10% of the sequential execution time in JDQR, doesn't scale well on the HP Exemplar. The parallelized BLAS routine DGEMV in the HP MLIB library shows no speed-up even though the matrices have more than 100'000 nonzero elements! We are currently resolving this issue with HP. Using a reasonably parallelizing DGEMV routine we expect JDQR to scale as well as ARPACK. Fig. 4 further shows that in our experiments node elements scale better than edge elements. Since we chose the problem sizes such that both edge and node elements require about the same computation time on one processor, and the linear systems arising from node elements are better conditioned, the matrices originating from edge elements are smaller. Matrices A and M stemming from edge elements together have about 7 times fewer non-zero elements than in the nodal case. So, the relative parallelization overhead is bigger for edge elements. For IRL the situation is improved because instead of A we store the shifted matrix A − σM, which has about twice as many non-zero elements. Our parallel experiments show the limitations of directive-based shared-memory programming. Before and after each parallel section (e.g. parallel loops) of the code, the compiler inserts global synchronization operations to ensure correct execution. However, in most cases, these global synchronization operations are unnecessary. That is why our implementation doesn't scale well. In order to remove unnecessary synchronization points, a different programming paradigm, such as message passing or low-level shared-memory programming, has to be employed. Our parallel results depend very much on the computer architecture, the parallel libraries and the programming paradigm we used. They can therefore not easily be generalized. Nevertheless, our experiments prove that reasonable parallel performance can be obtained on small processor numbers with only modest programming effort using directive-based shared-memory programming on the HP Exemplar.
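A plain Python sketch of the symmetric product from the stored strictly lower triangle plus diagonal makes explicit why the transposed contributions force the per-processor partial vectors mentioned above. The CSR arrays and loop structure are our illustration, not the HP directive-parallel code.

```python
import numpy as np

def symm_matvec_lower(indptr, indices, data, diag, x):
    """y = A x when only the strictly lower triangle of the symmetric A is
    stored in CSR form (indptr/indices/data) and the diagonal separately.
    Each stored entry A[i, j] (i > j) contributes to y[i] (the 'lower'
    product) and to y[j] (the 'upper', transposed product); the second
    contribution is the one accumulated into per-processor partial
    vectors in the parallel version."""
    y = diag * x
    for i in range(x.size):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            a = data[k]
            y[i] += a * x[j]
            y[j] += a * x[i]
    return y
```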
References
1. St. Adam, P. Arbenz, and R. Geus, Eigenvalue solvers for electromagnetic fields in cavities, Tech. Report 275, ETH Zürich, Computer Science Department, October 1997. (Available at URL http://www.inf.ethz.ch/publications/tr.html)
2. P. Arbenz and R. Geus, Eigenvalue solvers for electromagnetic fields in cavities, FORTWIHR International Conference on High Performance Scientific and Engineering Computing (F. Durst et al., ed.), Springer-Verlag, 1998. (Lecture Notes in Computational Science and Engineering)
3. R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, Ch. Romine, and H. van der Vorst, Templates for the solution of linear systems: Building blocks for iterative methods, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994. (Available from Netlib at URL http://www.netlib.org/templates/index.html)
4. A. N. Bespalov, Finite element method for the eigenmode problem of a RF cavity resonator, Soviet Journal of Numerical Analysis and Mathematical Modelling 3 (1988), 163-178.
5. D. Calvetti, L. Reichel, and D. C. Sorensen, An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis 2 (1994), 1-21.
6. D. R. Fokkema, G. L. G. Sleijpen, and H. A. van der Vorst, Jacobi-Davidson style QR and QZ algorithms for the partial reduction of matrix pencils, Preprint 941, revised version, Utrecht University, Department of Mathematics, Utrecht, The Netherlands, January 1997.
7. R. Grimes, J. G. Lewis, and H. Simon, A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems, SIAM J. Matrix Anal. Appl. 15 (1994), 228-272.
8. J. Jin, The finite element method in electromagnetics, Wiley, New York, 1993.
9. F. Kikuchi, Mixed and penalty formulations for finite element analysis of an eigenvalue problem in electromagnetism, Computer Methods in Applied Mechanics and Engineering 64 (1987), 509-521.
10. R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK users' guide: Solution of large scale eigenvalue problems by implicitly restarted Arnoldi methods, Department of Mathematical Sciences, Rice University, Houston TX, October 1997. (Available at URL http://www.caam.rice.edu/software/ARPACK/index.html)
11. R. Leis, Zur Theorie elektromagnetischer Schwingungen in anisotropen Medien, Mathematische Zeitschrift 106 (1968), 213-224.
12. J. C. Nédélec, Mixed finite elements in R³, Numerische Mathematik 35 (1980), 315-341.
13. C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal. 12 (1975), 617-629.
14. B. N. Parlett, The symmetric eigenvalue problem, Prentice Hall, Englewood Cliffs, NJ, 1980.
15. G. L. G. Sleijpen, A. G. L. Booten, D. R. Fokkema, and H. A. van der Vorst, Jacobi-Davidson type methods for generalized eigenproblems and polynomial eigenproblems, BIT 36 (1996), 595-633.
16. D. Sorensen, Implicit application of polynomial filters in a k-step Arnoldi method, SIAM J. Matrix Anal. Appl. 13 (1992), 357-385.
Waveform R e l a x a t i o n for Second Order Differential E q u a t i o n y" = if(x, y) Kazufumi Ozawa and Susumu Y a m a d a 1 Graduate School of Information Sdence, Tohoku University, Kawauchi Sendai 980-8576, JAPAN
A b s t r a c t . Waveform relaxation (WR) methods for second order equations y" = f(t, y), y(to) = yo, y'(to) = y~ are studied. For linear case, the method converges superlinearly for any splittings of the coefficient matrix. For nonlinear case, the method converges quadratically only for waveform Newton method. It is shown, however, that the method with approximate Jacobian matrix converges superlinearly. The accuracy, execution times and speedup ratios of the WR methods on a parallel computer are discussed.
1
Introduction
In this paper we propose a waveform relaxation (WR) method for solving y" = f ( x , y) on a parallel computer. The basic idea of this method is to solve a sequence of differential equations, which converges to the exact solution, with a starting solution y[~ In this iteration, we can solve each component(block) of the equation in parallel.
2
Linear Differential Equation
Consider first the linear equation of the form
J ' ( x ) = Qy(x) +g(~),
y(~0) = y0,
y'(~0) = y~,
y e R ~,
(1)
where we assume - Q is a positive definite matrix. The Picard iteration for solving (1) is given by
y[~"t-1]H(X) = Qy[~'](x) .-.~g(x),
y[U](xo) : yo, y[t~]'(x0) -- y0.
(2)
Here we consider the rate of convergence of (2) and propose an acceleration of the iteration. The solutions of (1) and (2) are given by {Qy(r) + g(r)}dTds,
y ( x ) = Yo + (x - xo)Y~o + 0
(3)
0
y[,+l] (x) = Yo + (x - xo)Jo +
{Qy['] (r) + g ( r ) } d r d s , O
O
(4)
781 respectively. By subtracting (3) from (4) we can obtain e ["+1] (x) =
[f o
s[~] (r)drds,
(5)
0 aaJo
where e ["] (x) = y["]( x ) - y ( x ) . Taking the norm of the left-hand side and assuming Ile[~ <_ h'(x - x0), we have by induction
IId"J(x)ll _< K c~(~(2~- +x ~1)!
'
(6)
where we set cl = I]QII-The inequality shows that Picard iteration (2) converges on any finite interval x E Ix0, T] and the rate of convergence is superlinear. It is, however, expected that the method converges slowly if the system is stiff or the length of the interval is large. Here we propose an acceleration of the Picard iteration using the splitting Q = N - M. The iterative method to be considered is
y[~'+l]"(x) + My[~'+l](x) = Ny[~'](x) + g(x),
(7)
where we assume that matrix M is symmetric and positive definite. The error of the iterate y["](x) satisfies the differential equation
~E~+ll"(~) + M~E~+ll(~) = Nj~1(~).
(s)
In order to obtain the explicit expression for g["](x), here we define the square root and the trigonometric functions of matrices. Let ),k (k = 1 , . . . , m) be the eigenvalues of M, and P is a unitary matrix that diagonalizes M, i.e. M = Pdiag (),1,---,),,~) p - i , then using P the square root of M can be defined by =
For any square matrix H and scalar x, the trigonometric functions can be defined by cos(Hx) =
(--1)kH2kx 2k,
k=0
~o (_l)kH2k+l sin(Hx) = E (2k + 1)! x2k+l
(~k)!
k=0
Using the functions defined above we have
e['+l](x) = (x/M) - 1 J [ [ (c~
= (v~)-'
-cos(V~x)sin(vFM~-))Nc['](r)d~
i xsin(
v~(~
-
)
- ~) N~E'1(~)d~
:~i~:cos(v/--M(x-r))Ne['](s)dsdr,
(9)
782 where the condition J~+ll(x)]~=x o = a~[~+ll(x)l~=xo = 0 is used. To bound e[~l(x) if we use the Euclidean norm then we have
II cos(v~x)LI z
LIPll diag(cos(v~7~),...,cos(V~X))
liP-111 < 1, (10)
since IIPII = lIP-ill = 1. Assuming Ilei~ < tf(x - xo) as before and letting c2 = IINII, we fiud by induction the inequality
-
(~. + 1)!
(11)
The result shows that although the rate of convergence of splitting m e t h o d (7) remains superlinear, we can accelerate the convergence by making the value of
IINII small. 3
Nonlinear
Differential
Equation
Next we discuss the rate of convergence of the waveform relaxation for the second order nonlinear equation
y'(x) = f(x,y),
y(xo) = Yo,
y'(xo) = y~,
y e R "~.
(12)
For the first order nonlinear equation y' = f(x, y), the method called the waveform Newton method is proposed[5]. The method is given by y [ ' + l ] ' ( x ) - J.y['+l](x) = f(y['] (x)) - J.y['] (x),
(13)
where J . is the Jacobian matrix of f(y[']). It is shown by Burrage[1] that the waveform Newton method converges quadratically on all finite intervals x E Ix0, T]. In this section we first consider the convergence for the case that J~ is replaced with an approximation in order to enhance the efficiency of the parallel computation. Instead of (13) let us consider the iteration y [ ' + l ] ' ( x ) - ].y['+l](x) = f ( y [ ' ] ( x ) ) - ].y[~](x),
(14)
where ]~ is an approximation to J~. Let c['] = yM - y then we have
Of ~[,]
f(y[']) = f(y + ~[']) = f(y) + -~y
+
O(lleMII2).
(15)
Substituting (15) into (14) and ignoring the term O(llc[~]ll~), we have g[~+111 =
s
+
_
(16)
and taking the inner product we have
(17)
783 where the norm 11. [I is defined by Ilull ~ ==< u , u >, and #1 and #2 are the subordinate matrix norms of ]~ and J . - ] . , respectively. The left-hand side of (17) can be written as
< d~e[~+1]'e['+1]>= ]-d---]JeW+till2 = dx 2 Ile[~+l]]ld-~lle['+1]]]' so that, under the assumptionthat lls[v+1]]l5sO, we have d d-7116~+1~1} _< mlld~+1111 + #211e{~111.
Assuming ]]e[~
(18)
< K ( x - Xo) as before and using ]]g[~
IleC"+ll(~)ll _< ~=e "~(~-~o)
I = O, we have
F e-"~(~-~~
I ds
JXO
Ile>](8)ll ds,
#2e #l (x-'~
(1.9)
0
which leads to ILer~J(x)ll < K (/~2e**~(~-~~ (~ + 1)! - x~
'
(20)
showing that the rate of convergence of iteration (14) is not quadratic but superlinear. By the way, the second order nonlinear equation y"(x) = f(x,y),
y(xo) = yo,
y'(xo) -- Jo,
y C Rm
(21)
can be rewritten as the first order system z'(~) = (W(x), f ( ~ , y ) ) T - - Y ( z ( x ) ) ,
(22)
where z(x) = (y(x), y'(x)) T. The waveform Newton method for (22) is given by
z[-+l]'(~) - j.z[-+11(,) = F(zH(~)) - j S < ( . ) .
(2a)
In this case the Jacobian matrix J , is given by _
Oy'
=
,
(24)
where I is an identity matrix. This result shows that for second order equation (12), if we define the waveform Newton method analogously by d2 y[~+l](x) - 0 f y [ ' + z ] ( x ] = f ( y [ ' ] ( x ) ) - Cgfy['](x), dx ~ Oy " Oy
(25)
then the method also converges quadratically. Moreover, from the same discussion as for the first order equation, we can conclude that if an approximate Jacobian J~ is applied, then the rate of convergence of the iteration d2 dx2Y[~+1](x)z~y[~'+1](x) = f(yM(x))- Y.y["](x) is superlinear.
(26)
784 4
Numerical
4.1
Experiments
Linear-Equations
In order to examine the efficiency of the waveform relaxations, we solve first the large linear system of second order differential equations. Consider the wave equation
02 u _ 02u 0t ~ cOx~, 0 < x < u(x, 0) = sin(r~x),
1,
(27)
u(0, t) = u(1, t) = 0,
where the exact solution is u(x, t) = cos(Trt) sin(r~m). The semi-discretization by 3-point spatial difference yields the system of linear equations
J ' ( t ) = Qy(t),
y(t) = (,,~,...,~*m) ~,
(28)
where -2 1
Q-(Ax)
2
"'"
1 ".
" "'" 1 -2 1
E R mxrn
'
A~x --
1
re+l
.
In the splitting method given here we take M as the block diagonal matrix given by M = diag(M1, M 2 , . . . , M , ) , and each block Mt is the tridiagonal matrix given by -1 1
M l - - ~(Ax)
2 -1 ".. '.. " . -1
E R axd, l = 1 , 2 , . . . , # ,
#=re~d,
21
where we have assumed that m is divisible by d and, as a result, # = m / d is an integer. In the implementation, the system is decoupled into # subsystems and each of the subsystems is integrated concurrently on different processors, if the number of processors available are greater than that of the number of the subsystems. If this is not the case each processor integrates several subsystems sequentially. In any way our code is designed so as to have a uniform work load across the processors. The basic method to integrate the system is the 2-stage Runge-Kutta-NystrSm method called the indirect collocation, which has an excellent stability property[3]. In our experiment we set re = 256 and integrate the system from x = 0 to 1 with the stepsize h = 0.1 under the stopping criteria ]1 _< 10-7. The result on the parallel computer KSR1, which is a parallel computer with shared address space and distributed physical memory(see e.g. [6]), is shown in Table 1.
ilyt.l_y[U-l]
785 T a b l e 1. Result for the linear problem on KSR1
CPU time(sec) d iterations serial parallel 1 2 4 8 16 32 64 128
11579 6012 3052 1462 688 427 403 403
566.02 61.92 586.73 53.79 362.94 31.44 240.81 19.56 181.74 13.94 201.56 27.11 369.86 97.27 791.10 407.41
Sp
S~
E
E'
P p =
9.14 2.58 0.57 0.16!16 10.91 2.97 0.68 0.19 16 11.545.080.720.3216 12.31 8.17 0.76 0.51 16 13.04 11.5 0.81 0.72 16 7.43 5.89 0.93 0.73 8 3.80 1.64 0.95 0.41 4 1.94 0.39 0.97 0.20 2
m/d
256 128 64 32 16 8 4 2
12561
1 1159"761 I - [1.001 - I Ill 1 ] CPU time(serial) , CPU time(seriM, d = 256) Sp = CPU time(parMlel) ' S~ = CPU time(parMlel) E = -Sp P'
4.2
E ' = -P-' S~ P = n u m b e r of processors
NonlineawEquations
N e x t we consider t h e n o n l i n e a r wave e q u a t i o n given by v~' = e x p ( v i - l ( x ) ) - 2 e x p ( v i ( x ) ) + e x p ( v i + l ( x ) ) ,
i = 0, 4 - 1 , + 2 , . . . , .
(29)
T h i s e q u a t i o n describes the b e h a v i o u r of t h e well-known T o d a l a t t i c e a n d h a s t h e s o l i t o n s o l u t i o n given by
vi(x)=log(l+sinh2rsech2(ir-w(x+q))), w = sinh
i = 0,+1,•
r,
where q is an a r b i t r a r y c o n s t a n t . T h e e q u a t i o n to be solved n u m e r i c a l l y is n o t (29) b u t its m - d i m e n s i o n a l a p p r o x i m a t i o n given by {~{i = e x p ( y i _ l ) - 2 e x p ( y i ) + e x p ( y i + l ) , = 1 -- 2 e x p ( y l ) q- exp(y2),
y" = e x p ( y ~ _ l ) initial c o n d i t i o n s :
i= 2,...,m-
1 (30)
- 2 exp(y~) + 1 y~(0) = v~(0), V~(0) = v~(0),
where we have to choose a sufficiently large m to lessen t h e effect d u e t o t h e finiteness of t h e b o u n d a r y . As the p a r a m e t e r s we t a k e q = 40, r = 0.5 a n d w = s i n h 0 . 5 . T h e a p p r o x i m a t e J a c o b i a n ]~ used here is t h e block d i a g o n a l m a t r i x given b y
{J~,(i,j), d(lJ'(i'J)
=
0,
1) + 1 _< i,j otherwise,
< dl, l =
1,...,m/d
786 where we have a s s u m e d m is divisible by d as before. In our n u m e r i c a l e x p e r i m e n t we use the K S R 1 p a r a l l e l c o m p u t e r , a n d integ r a t e the 1 0 2 4 - d i m e n s i o n a l s y s t e m f r o m x = 0 to 5 w i t h stepsize h = 0.05 u n d e r the s t o p p i n g c r i t e r i a [ly[~] - y [ ' - l ] l [ _< 10 - l ~ T h e result is shown in T a b l e 2. In our a l g o r i t h m s , if the value of d b e c o m e s s m a l l t h e n the degree of p a r a l T a b l e 2. Results for the nonlinear problem on KSR1
CPU time(sec) d iterations serial parallel 1 17 53.98 6.10 2 15 61.44 6.43 4 14 58.14 6.24 8 14 60.19 6.37 16 13 54.33 5.87 32 12 50.81 5.71 64 6 25.74 3.05 128 6 26.79 4.78 256 4 20.33 6.43 512 4 21.03 11.84
Sp 8.85 9.55 9.32 9.46 9.25 8.91 8.43 5.60 3.16 1.78
, Sp 3.49 3.31 3.41 3.34 3.63 3.73 6.98 4.46 3.31 1.71
E 0.55 0.60 0.58 0.59 0.58 0.56 0.53 0.70 0.79 0.89
E' 0.22 0.21 0.21 0.21 0.23 0.23 0.44 0.56 0.83 0.86
m/d P p = 16 1024 16 512 16 256 16 128 16 64 16 32 16 16 8 8 4 4 2 2
[10241
4 [ 21.30 I - [ - - [1.00[ - I I1[ 1 [ CPU time(serial) CPU time(serial, d = 1024) Sp = CPU time(parallel)' S'~ = CPU time(parallel) P =number of processors E = Sp E ' - S'v --fi '
-
--fi
lelism b e c o m e s large, since the n u m b e r of the s u b s y s t e m s # ( = re~d) b e c o m e s large and, moreover, the cost of t h e L U d e c o m p o s i t i o n t h a t t a k e s p l a c e in t h e inner i t e r a t i o n in each of the R u n g e - K u t t a - N y s t r S m schemes, which is p r o p o r t i o n a l to d 3, b e c o m e s small. However, as is shown in the tables, it is in g e n e r a l true t h a t t h e s m a l l e r the value of d, the larger the n u m b e r of the W R i t e r a tions, which m e a n s the increase of the o v e r h e a d s such as s y n c h r o n i s a t i o n a n d c o m m u n i c a t i o n s . Therefore, s m a l l d i m p l e m e n t a t i o n is n o t n e c e s s a r y a d v a n t a geous. I n our e x p e r i m e n t s we can achieve the b e s t s p e e d u p s w h e n t h e n u m b e r of s u b s y s t e m s is equal to t h a t of processors.
References 1. Burrage, K., Parallel and Sequential Methods for Ordinary Differential Equations, Oxford University Press, Oxford, 1995. 2. Hairer, E., Ncrsett S. P., and Wanner, G., Solving Ordinary Differential Equations I (Nonstiff Problems, 2nd Ed.), Springer-Verlag, Berlin, 1993.
787 3. van der Houwen, P. J., Sommeijer, B.P. Nguyen Huu Cong, Stability of collocationbased Runge-Kutta-NystrSm methods, BIT 31(1991), 469-481. 4. Miekkala, U. and Nevanlinna, O., Convergence of dynamic iteration methods for initial value problems, SIAM J. Sci. Comput. 8(1987), pp.459-482. 5. White, J. and Sangiovanni-vincentelli, A.L., Relaxation Techniques for the Simulation of VLSI Circuits, K|uwer Academic Publishers, Boston, 1987. 6. Papadimitriou P., The tqSR1 - - A Numerical analyst's perspective, Numerical Analysis Report No.242, University of Mancheter/UMIST.
T h e Parallelization of the I n c o m p l e t e L U Factorization on A P 1 0 0 0 Takashi N O D E R A and Naoto TSUNO Department of Mathematics Keio University 3-14-1 Hiyoshi Kohoku Yokohama 223, Japan
A b s t r a c t . Using a finite difference method to discretize a two dimensional elliptic boundary value problem, we obtain systems of linear equations Ax = b, where the coefficient matrix A is a large, sparse, and nonsingular. These systems are often solved by preconditioned iterative methods. This paper presents a data distribution and a communication scheme for the parallelization of the preconditioner based on the incomplete LU factorization. At last, parallel performance tests of the preconditioner, using BiCGStab(g) and GMRES(m) method, are carried out on a distributed memory parallel machine AP1000. The mlmerical results show that the preconditioner based on the incomplete LU factorization can be used even for MIMD parallel machines.
1
Introduction
We are concerned with the solution of linear systems of equations Ax = b
(1)
which arises from the discretization of partial differential equations. For example, a model problem is the two dimensional elliptic partial differential equation:
-u,~: - uyy + ,r(x, y)ux + ",/(x, y)uy = f ( x , y)
(2)
which defined on the unit square Y2 with Dirichlet boundary conditions u(x, y) = y) and 3'(x, y) to be a bounded and sufficiently smooth function taking on strictly positive values. We shall use a finite difference approximation to discretize the equation (2) on a uniform grid points ( M E S H x M E S H ) and allow a five point finite difference molecule A. The idea of incomplete LU faetorization [1,4, 7] is to use A = L U - R, where
g(x, y) on c~Y2. We require r
N
A =
A.~~.3 A ~,3 < AE. G3 AS.. %3
;
~
LW. L.%~.3 0 ~3 LS.. "G3
'
8 =
0 Ui~ UE. ~,3 0
(3)
where L and /] are lower and upper triangular matrices, respectively. The incomplete LU factorization preconditioner which is efficient and effective for the
789 ~2
~2
PU(N) .....
L .........
t .........
'
:
E
L ....
............. t ~ . ( ~ ) .............. ..... b......... i---f i
I ..... i .........
1
'
....
r .........
PU(1)
PU(2)
N
Ui,j_l:
.............. ~ _ ! ~ ) ..............
~ ....
1
PU(1)
Fig. 1. The data dependency among neighborhoods,
pu(1)
Fig. 2. The data distribution for N processors,
Fig. 3. The cyclic data distribution for N processors.
single CPU machine, may not be appropriate for the distributed memory parMlel machine just like AP1000. In this paper, we describe some recent work on the efficient implementation of preconditioning for a MIMD parallel machine. In section 2, we discuss the parallelization of our algorithm, and then we describe the implementation of incomplete LU factorization on the MIMD parallel machine. In section 3, we present some numerical experiments in which BiCGStab(~) algorithm and GMRES(m) algorithm are used to solve a two dimensional self-adjoint elliptic boundary value problem.
2
Parallelization
In section 1, as we presented the incomplete LU factorization, it involves in two parts. One is the decomposition of A into L and U. The another is the inversion of L and D by forward and backward substitution. First of all, we consider the decomposition process. The data dependency of this factorization is given in Fig. 1. Namely, the entry of L~j can be calculated as soon as Lc_I,j and Lied_ 1 have been calculated on the grid points. The LCl,j only depends on L~j_ 1 on the west boundary and LCi,1only depends on Lc' 1,1 on the south boundary. Next, we consider the inversion process. The calculation of wi,j = ~,-iv and wi,j = [~r-1v is given by forward and backward substitution as follows9 W i , j = ( V i , j -- L i,j S W i , j - 1 -- L . ~,3 V~.Wi_l,j)/LCj __ N q j j ?2)i,j ~ ~)i,j Ui,j i,j+l Ugwi+l,j -
-
(4) (5)
We now consider brief introduction to the method of parallelization by Bastian and Horton [4]9 They proposed the algorithm of dividing the domain f2, as shown in Figure 2, when we do the parallel processing by N processors such as PU(1), . . . , PU(N). Namely, PU(1) starts with its portion of the first low of grid points. After end, a signal is sent to PU(2), and then it starts to process its
790
Fig. 4. The efficiency of parallel computation in the grid points on Y2. section of the first row. At the time, PU(1) has started with the calculation of its component of the second row. All processors are running when P U ( N ) begins with the calculation of its component of the first row.
2.1
Generalization of the algorithm by Bastian and H o r t o n
As we can show in Fig. 3, we divide the domain ~2 by N processors cyclically. Here, we denote CYCLIC which denotes the number of cyclic division of the domain. The parallelization of Fig. 2 is equal to C Y C L I C = I . If it divides as shown in the Fig. 3, the computation of the k-th processor PU(k) can be begun more quickly than the case of Bastian and Horton. Therefore, in this case, the efficiency of parallel processing will be higher than the ordinary one. In Fig. 4, we show the efficiency of parallel computation in the square grid points for CYCLIC= 1, 2, respectively. In these figures, the efficiency of parallel processing is shown a deeper color of the grids is low. The efficiency E of the parallel processing by this division is calculated as following equation 6. The detailed derivation of this equation is given in the forthcoming our paper [10, 11]. CYCLIC • MESH E = CYCLIC • MESH + N - 1
(6)
On the other hand, for distributed memory parallel machine, the number of CYCLIC is increased, the communication between processors is also increased. Moreover, concealment of the communication time by calculation time also becomes difficult. Therefore, we will consider the trade-off between the efficiency of parallel processing and the overhead of communication. 3
Numerical
Experiences
In this section we demonstrate the effectiveness of our proposed implementation of ILU factorization on a distribute memory machine Fujitsu AP1000 for some
791 T a b l e 1. AP1000 specification Architecture IDistributed Memory, MIMD Number of processors 64 Inter processor networks Broadcast network(50MB/s) Two-dimensional torus network (25MB/s/port) Synclronization network
T a b l e 2. The computational time (sec.) for the multiplication of matrix A(LU) -1 and vector v on example 1. Mesh Size CYCLIC ( L ( f ) - l v Av Total 1 2.52 x 10-212.88 x 10-~12.81 x 10 -~ 128x128 2 5.86 x 10 -2 6.08 x 10 -a 6.47 x 10 -2 5.44 x 10 -2 9.18 x 10 -~ 6.36 x 10 -~ 1 1.41 x 10 - I 1.41 x 10 -2 1.55 x 10 -1 256x256 2 2.12 x 10 -1 2.68 x 10 -2 2.38 x 10 -1 4 1.56 x 10 -1 4.16 x 10 -2 1.98 x 10 -~ 1 3.31 x 10 - 1 5.05 X 10 - 2 3.81 x 10 - I 2 512x512 4 5.05 X 10 -1 6.70 • 10 -2 5.72 X 10 -1 S.26 X 10 -1 1.21 • 1 0 - 1 9.47 • 10 -1 8
large sparse m a t r i c e s problems. T h e Specification of AP1000 is given in T a b l e 1. Each cell of AP1000 employs R I S C - t y p e S P A R C or S u p e r S P A R C processor chip. As T a b l e 2 shows, the c o m p u t a t i o n a l t i m e for A ( L U ) - l v for various C Y C L I C , which o b t a i n e d by the following e x a m p l e 1. F r o m these results, it is considered s u i t a b l e by the AP1000 to decrease d a t a v o l u m e of the c o m m u n i c a t i o n b e t w e e n processors, u s i n g C Y C L I C = 1. So, we use 1 as the n u m b e r of C Y C L I C i n the following n u m e r i c a l experiments. [ E x a m p l e 1.] T h i s e x a m p l e involves discretizations of b o u n d a r y value p r o b l e m for the p a r t i a l differential e q u a t i o n ( J o u b e r t [5]).
-u~:~: - Uyy + o'ux = f ( x , y)
where the r i g h t - h a n d side is taken such t h a t the true s o l u t i o n is u ( x , y) = 1 + x y o n t2. We discretize above e q u a t i o n 7 o n a 256 x 256 mesh of e q u a l l y spaced discretization points. By varying the c o n s t a n t c~, the a m o u n t of n o n s y m m e t r i c i t y of the m a t r i x m a y be changed. We utilize the i n i t i a l a p p r o x i m a t i o n vector of x0 = 0 a n d use simple s t o p p i n g criterion Ilrkll/llroll < 10 -12. We also used d o u b l e precision real a r i t h m e t i c to p e r f o r m the r u n s presented here. I n T a b l e 3, we show the c o m p u t a t i o n a l t i m e s required to reduce the residual n o r m by a factor of 10 -12 for an average over 3 trials of p r e c o n d i t i o n e d or u n -
792 T a b l e 3. T h e c o m p u t a t i o n a l t i m e (sec.) for e x a m p l e 1.
ah
Algorithm 0 2 -~ 2 -~ 2 -1 2U GMI~ES'~-* * 63.25 31.74 33.51 GMRES(5)+ILU * 48.21 25.74 27.29 22.91 GMRES(10) * 117.04 50.52 47.75 50.41 GMRES(10)+ILU 296.18 36.25 39.58 40.95 37.02 GMRES(20) * 109.68 89.45 88.79 94.16 GMRES(20)+ILU 197.47 60.54 76.01 70.29 54.82 BiCGStab~-~-30.11 23.83 20.341 20.66~ 23.18 BiCGStab(1)+ILU 3O.34 21.28 17.82 17.25 15.09 BiCGStab(2) i35.16 20.78 24.14 24.67 26.38 BiCGStab(2)+ILU] 31.86 22.87 19.67 21.27 16.471 BiCGStab(4) 47.28 33.4C 30.93 32.86 39.23 BiCGStab(4)+ILU I 34.38 i 24.56 22.47 23.171 18.99 9 It does not converge after 3000 iterations.
21 32.15 18.10 50.89 25.45 92.41 36.51 23.04 10.91 29.65 10.37 39.51 12.0(]
2~ 2~ 32.98 32.82 12.24 9.23 50.75 47.82 16.08 11.94 92.00 90.39 29.32 15.55 48.42 101.04 8.46 6,73 30.07 25.64 7.47 i 6.51 36.47 36.75 8.49 7.79
24 2~ 36.68 43.91 7.07 6.13 44.42 43.54 8.31 6.63 84.65 79,01 9.83 5.89 4.86 3.13 24.69 24.90 5.23 3.63 32.33 30.96 5.69 4.30
preconditioned BiCGStab(f) and GMRES(m) algorithm. For the preconditioner of incomplete LU factorization, in most cases both methods worked quite well.
References 1. Meijerink, J. A. a n d v a n d e r Vorst, H. A.: A n i t e r a t i v e solution m e t h o d for linear s y s t e m s of which t h e coefficient m a t r i x is s y m m e t r i c M - m a t r i x , Math. Comp., Vol. 31, pp. 148-162 (1977). 2. G u s t a f s s o n , I.: A class of first order factorizations, B I T Vol. 18, pp. 142-156 (1978). 3. Saad, Y. a n d Schultz, M. H.: G M R E S : A generalized m i n i m a l residual a l g o r i t h m for solving n o n s y m m e t r i c linear systems, S I A M J. Sci. Stat. Comput., Vol. 7, No. 3, pp. 856-869 (1986). 4. B a s t i a n , P. a n d H o r t o n , G.: P a r a l l e l i z a t i o n of r o b u s t m u l t i g r i d m e t h o d s : ILU factorization a n d f r e q u e n c y d e c o m p o s i t i o n m e t h o d , S I A M J. Sci. Stat. Comput., Vol. 12, No. 6, pp. 1457-1470 (1991). 5. J o u b e r t , W.: Lanczos m e t h o d s for t h e solution of n o n s y m m e t r i c s y s t e m s of l i n e a r equations, S I A M J. Matrix Anal. Appl., Vol. 13, No. 3, pp. 926-943 (1992). 6. Sleijpen, G. L. G. a n d Fokkema, D. R.: B i C G S T A B ( g ) for linear e q u a t i o n s involving u n s y m m e t r i c m a t r i c e s w i t h complex s p e c t r u m , E T N A , Vol. 1, pp. 11-32 (1993). 7. B r u a s e t , A. M.: A survey of p r e c o n d i t i o n e d i t e r a t i v e m e t h o d s , P i t m a n R e s e a r c h N o t e s in M a t h . Series 328, L o n g m a n Scientific & Technical (1995). 8. Nodera, T. a n d Noguchi, Y.: Effectiveness of B i C G S t a b ( g ) m e t h o d o n A P 1 0 0 0 , Trans. of I P S J (in J a p a n e s e ) , Vol. 38, No. 11, pp. 2089-2101 (1997). 9. Nodera, T. a n d Noguchi, Y.: A n o t e on B i C G S t a b ( 0 M e t h o d on A P 1 0 0 0 , I M A C S Lecture Notes on Computer Science, to a p p e a r (1998). 10. T s u n o , N.: T h e a u t o m a t i c r e s t a r t e d G M R E S m e t h o d a n d t h e p a r a l l e l i z a t i o n of t h e i n c o m p l e t e L U factorization, M a s t e r T h e s i s o n G r a d u a t e School of Science a n d Technology, Keio University (1998). 11. T s u n o , N. a n d Nodera, T.: T h e p a r a l l e l i z a t i o n a n d p e r f o r m a n c e of t h e i n c o m p l e t e L U f a c t o r i z a t i o n on AP1000, Trans. of I P S J (in J a p a n e s e ) , s u b m i t t e d .
An Efficient Parallel Triangular Inversion by Gauss Elimination with Sweeping Ay~e Kiper Department of Computer Engineering, Middle East Technical University, 06531 Ankara Turkey ayse@rorqual, cc. metu. edu. tr
A b s t r a c t . A parallel computation model to invert a lower triangular matrix using Gauss elimination with sweeping technique is presented. Performance characteristics that we obtain are O(n) time and O(n2) processors leading to an efficiency of O(1/n). A comparative performance study with the available fastest parallel matrix inversion algorithms is given. We believe that the method presented here is superior over the existing methods in efficiency measure and in processor complexity.
1
Introduction
Matrix operations have naturally been the focus of researches in numerical parallel computing. One fundamental operation is the inversion of matrices and that is mostly accompanied with the solution of linear system of equations in the literature of parallel algorithms. But it has equally well significance individually and in m a t r i x eigenproblems. Various parallel matrix inversion methods have been known and some gave special attention to triangular types [1,2,3,51. The time complexity of these algorithms is O(log 2 n) using O(n 3) processors on the parallel r a n d o m accesss machine (PRAM). No faster algorithm is known so far although the best lower bound proved for parallel triangular m a t r i x inversion is Y2(logn). In this paper we describe a parallel c o m p u t a t i o n model to invert a dense triangular matrix using Gauss elimination (although it is known [2] t h a t Gauss elimination is not the fastest process) with sweeping strategy which allows m a x i m u m possible parallelisation. We believe that the considerable reduction in the processors used results in an algorithm that is more efficient than the available methods.
2
Parallel Gauss Elimination w i t h S w e e p i n g
A lower triangular m a t r i x L of size n and its inverse L - 1 = B satisfy the equation LB = I.
(1)
794 The parallel algorithm that we will introduce is based on the application of the standard Gauss elimination on the augmented matrix
[LiI].
(2)
Elements of the inverse matrix B are determined by starting from the main diagonal and progressing in the subdiagonals of decreasing size. The Gauss elimination by sweeping (hereafter referred to as GES) has two main type of sweep stages: (i) Diagonal Sweeps (DS) : evaluates all elements of each subdiagonal in parallel in one division step. (ii) Triangular Sweeps (TS) : updates all elements of each subtriangular part in parallel in two steps (one multiplication and one subtraction). We can describe GES as follows:
Stage 1. Initial Diagonal Sweep (IDS) : the main diagonal elements of B are all evaluated in terms of lij (are the elements of the matrix L ) in parallel by 1 a i i = /~/,
i = 1,2,...,n.
(3)
Stage 2. Initial Triangular Sweep (ITS) : updates the elements of the lower triangular part below the main diagonal in columnwise fashion Olij :
-lij ljj
'
i = 2,3,...,n;
j = 1,2,...,n-
1
(4)
in one parallel step being different than the other regular triangular sweep steps. The IDS and ITS lead to the updated matix 0~ii
i 1 a21 c~22 0
~IS
:
Lanl
(5)
ann
Then the regular DS and TS are applied alternately (denoting their each application by a prime over the elements of the current matrix) until B is obtained
Oll
as
! (1"21 OL22
B =
a~a
(~'-z) k~
1
0
O~32
' aaa
.
,,
.
(6)
, '
O~n,n- 2 0ln,n- I O~nn
If k denotes the application number of the sweeps in DSk and TSk , the stage computations can be generalised as:
795
Stage (2k + 1). DSk : (k + 1)th subdiagonal elements of B are computed using (k-i)
~(k)
ak+i'--------~i
~k..bi,i -- lk+i,k+i
'
i = 1, 2 , . . . , n -- k
(7)
leading to the intermediate matrix 0/11 0/21
0/22 0/32 9
BDSk
---
(s)
k+l 1 0/(k) (k-i)
Olk+2,l
(k-i)
(k-l) .O[n,1
9 Ozn,n-k-1
~,,(k) ~n,n-k
9 9 0/n~n-10G~n
Stage (2k + 2). TSk : The constants are evaluated in the first step by Ck(k) +p,i=~i+l,i,k+p,i+l, _ (k) ,
p=i+l,i+2,...,n-k;
i=l,2,...,n-(k+l)
(9) After the constant evaluations are completed, new updated values of elements below the kth subdiagonal are computed columnwise using (k-l) (~(k) k) p=i+l,i+2,...,n-k; i=l,2,...,n-(k+l)
(10) in parallel giving matrix 0/11
o ~TSk
~"
(11)
~k+l,l
0/(?& 0/('k) rG1
9
0/(k)" 0/(k) n,n--k-1 n,n-k
! 9 O~n,n_ 10Znn
GES algorithm totally requires (n - 2) diagonal, (n - 2) triangular sweeps (k = 1, 2, ..., n - 2 , in (7) and (10)) in addition to initial diagonal, initial triangular and final diagonal sweeps (FDS) (k = n - 1, in (7)) .
3
P e r f o r m a n c e Characteristics
and D i s c u s s i o n
The total number of parallel arithmetic operation steps Tp is obtained as T~ = 3(,~- 1)
(12)
796 with the maximun number of processors used
Pmax --
n(n - 1) 2
(13)
The speedup Sp = T~/Tp with respect to the the best sequential algorithm [4] for which Ts = n 2 time is obtained as
s~ -
3(~- i)
04)
and the associated efficiency Ep = Sp/Praax is 2n
E~ - 3(~ - i)~
(15)
All efficient parallel algorithms [1,2,3] of triangular matrix inversion run in O(log 2 n) time using O(n a) processors and are designed for PRAM system. So we find it appropriate to compare the theoretical performance of GES algorithm with the complexity bounds stated by Sameh and Brent [3] which are the known model reference expressions. We select the same computational model PRAM in the discussions to keep the common comparison base. The speedup and efficiency expressions (denoted by a superscript , to differentiate from those of GES algorithm) for the algorithm of Sameh and Brent [3] over the best sequential time are 2n 2 S; : (log 2 n + 31ogn + 6 ) (16) and
256 E ; = (l~ 2 n + 31ogn + 6)(21n + 60)"
(17)
The best comparative study can be done by observing the variation of speedup R s = Sp/S~ and efficiency RE = Ep/Ep ratios with respect to the problem size. These ratios expilicitely are
log2n + 31ogn + 6 6(n 1)~
(18)
~(21~ + 60)(log ~ . + 3 log,~ + 6) 3 8 4 ( n - 1) 2
(19)
Rs =
-
and
R. =
The efficiency ratio curve (Figure 1) demonstrates a superior behaviour of GES algorithm compared to that of Sameh and Brent for n > 8. Even for the worst case (occurs when n ~- 8 ) efficiency of GES exceeds the value of that for Sameh and Brent more than twice. This is the natural result of considerable smaller number of processor needed in GES. As the speedup ratio concerned the GES algrithm is slower than that given by Sameh and Brent and this property is dominant particularly for small size problems.
797 6 5 0 m
0 @
4 3
0
2 1 0 2
4
8
16 log
32 n
64
128
256
Fig. 1. Efficiency Ratio
The implementation studies of GES on different parallel architectures is open for future work. The structure of the algorithm suggests an easy application on array processors with low (modest) communication cost that meets the P R A M complexity bounds of the problem with a small co~lstant.
References 1. L. Csansky, On the parallel complexity of some computational problems, Ph.D. Dissertation, Univ. of California at Berkeley, 1974. 2. L. Csansky, Fast parallel matrix inversion algorithms, S I A M J. Comput., 5 (1976) 618-623. 3. A. H. Sameh and R. P. Brent, Solving triangular systems on parallel computer, S I A M J. Numer. Anal., 14 (1977) 1101-1113. 4. U. Schendel, Introduction to Numerical Methods for Parallel Computers (Ellis Harwood, 1984). 5. A. Schikarski and D. Wagner, Efficient parallel matrix inversion on interconnection networks, J. Parallel and Distributed Computing, 34 (1996) 196-201.
Fault Tolerant QR-Decomposition Algorithm and Its Parallel Implementation Oleg Maslennikow 1, Juri Kaniewski 1,Roman Wyrzykowski 2 1 Dept. of Electronics, Technical University of Koszalin, Partyzantow 17, 75-411 Koszalin, Poland 2 Dept. of Math. & Comp. Sci, Czestochowa Technical University, Dabrowskiego 73, 42-200 Czestochowa, Poland
A b s t r a c t . A fault tolerant algorithm based on Givens rotations and a modified weighted checksum method is proposed for the QR-decomposition of matrices. The algorithm enables us to correct a single error in each row or column of an input M x N matrix A occurred at any among N steps of the algorithm. This effect is obtained at the cost of 2.5N 2 +O(N) multiply-add operations (M = N). A parallel version of the proposed algorithm is designed, dedicated for a fixed-size linear processor array with fully local communications and low I/O requirements.
1
Introduction
The high complexity of most of m a t r i x problems [1] implies the necessity of solving them on high performance computers and, in particular, on VLSI processor arrays [3]. Application areas of these computers d e m a n d a large degree of reliability of results, while a single failure m a y render computations useless. Hence fault tolerance should be provided on hardware o r / a n d software levels [4]. The algorithm-based fault tolerance (ABFT) methods [5 9] are very suitable for such systems. In this case, input d a t a are encoded using error detecting o r / a n d correcting codes. An original algorithm is modified to operate on encoded d a t a and produce encoded outputs, from which useful information can be recovered easily. The modified algorithm will take more time in comparison with the original one. This time overhead should not be excessive. An A B F T m e t h o d called the weighted checksum (WCS) one, especially tailored for m a t r i x algorithms and processor arrays, was proposed in Ref. [5]. However, the original WCS m e t h o d is little suitable for such algorithms as Gaussian elimination, Choleski, and Faddeev algorithms, etc., since a single transient fault in a module m a y cause multiple output errors, which can not be located. In Refs. [8, 9], we proposed improved A B F T versions of these algorithms. For such i m p o r t a n t matrix problems as least squares problems, singular value and eigenvalue decompositions, more complicated algorithms based on the QRdecomposition should be applied [2]. In this paper, we design a fault tolerant version of the QR-decomposition based on Givens rotations and a modified WCS method. The derived algorithm enables correcting a single error in each row or
799 column of an input M x N matrix A occurred at any among N steps of the algorithm. This effect is obtained at the cost of 2.5N 2 + O(N) multiply-add operations. A parallel version of the algorithm is designed, dedicated for a fixedsize linear processor array with local communications and low I / O requirements.
2
Fault Model and Weighted Checksum Method
Module-levd faults are assumed [6] in algorithm-based fault tolerance. A module is allowed to produce arbitrary logical errors under physical failure mechanism. This assumption is quite general since it does not assume any technology-dependent fault model. Without loss of generality, a single module error is assumed in this paper. Communication links are supposed to be fault-free. In the W C S method [6], redundancy is encoded at the matrix level by augmenting the original matrix with weighted checksums. Since the checksum property is preserved for various matrix operations, these checksums are able to detect and correct errors in the resultant matrix. The complexity of correction process is much smaller than that of the original computation. For example, a W C S encoded data vector a with the Hamming distance equal to three (which can correct a single error) is expressed as
a2...aN
aT=[al
P C S = p T [a,
a2
...
aN],
PUS
QCS]
Q C S = q T [al
(1)
a2
...
aN]
(2)
Possible choices for encoder vectors p, q are, for example, [10] : p
=[2 0
21
...
q
=[1
2
...
X]
(3)
For the floating-point implementation, numerical properties of single-error correction codes based on different encoder vectors were considered in Refs. [7, 11]. Based on an encoder vector, a matrix A can be encoded as either a row encoded matrix An, a column encoded matrix A c , or a full encoded matrix A~qc [11]. For example, for the matrix multiplication A B = D, the column encoded matrix A c is exploited [6]. Then choosing the linear weighted vector (3), the equation A c * B = D c is computed. To have the possibility of verifying computations and correcting a single error, syndromes $1 and $2 for the j - t h column of D-matrix should be calculated, where M
<
= i=1
3
M
- pcsj,
&
=
i,
- QCSj
(4)
/=1
Design of the A B F T QR-Decomposition Algorithm
The complexity of the Givens algorithm [2] is determined by 4N3/3 multiplications and 2N3/3 additions, for a real N • N matrix A. Based on equivalent matrix transformations, this algorithm preserves the Euclidean norm for columns of A during computations. This property is important for error detection and enable us to save computations.
800 In the course of the Givens algorithm, an M x N input m a t r i x A = A 1 ---{aij} is recursively modified in K steps to obtain the upper triangular m a t r i x R = A K+I, where K = M - l f o r M _< N, and K = N for M > N . The i-th step consists in eliminating e l e m e n t s a}i in the i-th column of A i by multiplications on rotation matrices Pji, j = i+ 1 , . . . , M , which correspond to rotation coefficients
cjl = aii j - 1 /v/(aiij-1 )2-t-(a}i)2,
sji = a jii l ~ / ( a ij--1 i ) 2 _t_(a}i)2
Each step of the algorithm includes two phases. The first phase consists in recomputing M - i times the first element aii of the pivot (i.e. i-th) row, and computing the rotation coefficients. The second phase includes c o m p u t a t i o n of ai+l jk , and resulting elements rik in the i-th row of R. This phase includes also recomputing M - i times the rest of elements in the pivot row. J is wrongly Consequently, if during the i-th step, i = 1 , . . . , K , an element aii calculated, then errors firstly appear in coefficients ejl and 8ji , j = i + 1 , . . . , M , and then in all the elements of i I + 1 . Moreover, if at the i-th step, any coefficient eji or sji is wrongly calculated, then errors firstly appear in all the elements of the pivot row, and then in all the elements of A i+1. All these errors can not be located and corrected by the original W C S method. To remove these drawbacks, the following lemmas are proved. It is assumed that a single transient error m a y appear at each row or column of A i at any step of the algorithm. ^i+1 (i < j, i < k) is wrongly calcuL e m m a 1. If at the i-th step, an element ajk lated, then errors will not appear among other elements of A i + 1 . However, if a}k is erroneous, then error appears while computing either the element ttjk _j+l in the pivot row at the j - t h step of the algorithm, for j < k, or values ofa~k , ejk and sjk (j = i + 1 , . . . , M ) at the k-th step, for j > k. Hence in these cases, we should check and possibly correct elements of the i-th and j - t h rows of A i, each time after their recomputing. L e m m a 2. Let an element a~k of the pivot row (j = i + l , . . . , M , " k = 1 , . . . , N ) ~i+1 or an element ~ja of a non-pivot row (k = i + 1 , . . . , N ) was wrongly calculated when executing phase 2 of the i-th step of the Givens algorithm. Then it is possible to correct its value while executing this phase, using the W C S method for the row encoded matrix A n , where AR=[A
pCS~+I
Ap
Aq] = [A P C S Q C S ]
_i+1 _i+1 _i+1 i : uj,i+l "JI-tLj,i+2-]-'''(~j,N = (--sjiai,i+l
+ (-ss a ,N 9
"
=
"
.. i -1- ej~aj,i+l)
+ a ,N) + i
(5)
..
i
at- . . .
+ a ,N) i
(6)
So before executing phase 2 of the i-th step we should be certain t h a t cji, J were calculated correctly at phase 1 of this step (we assume t h a t the sji and aii remaining elements were checked and corrected at the previous step). For this aim, the following properties of the Givens algorithm m a y be used: -
= 1
(7)
preserving the Euclidean norm for column of A during c o m p u t a t i o n ~/
i 2 i 2 aM M (a~,i)2 -t- (ai+l,i) -]-... + (aM,i) = ~,~ (where aii = rii )
(8)
801 The triple time redundancy (TTR) method [4] m a y also be used for its modest time overhead. In this case, values of cji, sji o r / a n d a~i are calculated three times. Hence, for cji and sji, the procedure of error detection and correction consists in computing the left part of expression (7), and its recomputing if equality (7) is not fulfilled, taking into account a given tolerance r [6]. For elements a~i , i = 1 . . . . , K, this procedure consists in computing the both parts of expression (8), and their recomputing if they are not equal. The correctness of this procedure is based on the assumption that only one transient error m a y appear at each row or column of A i at any step. Moreover, instead of using the correction procedure based on formulae (4), we recompute all elements in the row with an erroneous element detected. The resulting A B F T Givens algorithm is as follows: 1. The original m a t r i x A = {aja} is represented as the row encoded m a t r i x A n = {alk} with a~k = ajk, aj,N+ * , =PCS),forj=l, . .., M.; k .= l ., ,N. 2. For i = 1, 2 , . . . , K , stages 3-9 are repeated.
i./12 ' a r e calculated, j = i + 1 , . . . , M. 3. The values o f a~i = ~/ (a{i-1)2 -}-~(a3~/ 4. The n o r m { ai [ for the i-th column of A is calculated. This stage needs approximately M - i multiply-add operations. The value of I ai I is compared with the value of aiiM. If aiiM 7k ] al [, then stages 3,4 are repeated. 5. The coefficients cji and sji are computed, and correctness of equation (7) is checked, for j = i + 1 , . . . , M. In case of non-equality, stage 5 is repeated. 6. For j = i + 1 , . . . , M , stages 7-10 are repeated. J of the i-th row of A i+* are computed, k = 1 , . . . , N + 1. 7. The elements aik 8. The value of PCS~ is calculated according to (6). This stage needs approximately N - i additions. The obtained value is compared with t h a t of ai,N+ j 19 In case of the negative answer, stages 7,8 are performed again. 9. The elements a jk i+l in the j - t h row of A i+1 a r e calculated, k = i+1, "'" , N + I . 10. The value of PCS} +t is computed according to expression (6). This stage also needs approximately N - i additions. The computed value is compared with that of ai+l j,N+I" In case of the negative ease, stages 9,10 are repeated. The procedures of error detection and correction increase the complexity of the Givens algorithm on N2/2 + O(N) multiply-add operations and N 2 + O(N) additions, for M = N. Due to increased sizes of the input matrix, the additional overhead of the proposed algorithm is 2N 2 + O(N) multiplications and N u + O(N) additions (M = N). As a result, the complexity of the whole algorithm is increased approximately on 2.5N 2 + O(N) multiply-add operations and 2N 2 + O(N) additions. At this cost, the proposed algorithm enables us to correct one single transient error oceured in each row or column of A at any among K steps of computations. Consequently, for M = N, it is possible to correct up to N u single errors when solving the whole problem.
4
Parallel Implementation
The dependence graph G1 of the proposed algorithm is shown in Fig.1 for M = 4, N = 3. Nodes of G1 are located in vertices of the integer lattice Q = { K =
802
d 1
l
d4 Al~"
9
F'~
d 2
1"23
a~l o ~ , , a4T ~" 11
a42aX,~rl2
a~,
13
-~"
Fig, 1. Graph of the algorithm
( i , j , k ) : 1 < i < K; i + 1 < j < M; i < k < N } . There are two kind of nodes in G I . Note that non-locM arcs marked with broken lines, and given by vectors d5 -= (0, i + 1 - M, 0), d6 = (0, 0, i - N) are result of introducing A B F T properties into the original algorithm. To run the algorithm in parallel on a processor array with local links, all the vectors d5 are excluded using the T T R technique for computing values of aii, j eji and sji. This needs to execute additionally 6N 2 + O(N) multiplications and additions. Then all the non-local vectors d6 are eliminated by projecting G1 along k-axis. As a result, a 2-D graph G2 is derived (see Fig.2). To run G2 on a linear array with a fixed number n of processors, (32 should be decomposed into a set of s =IN~n[ subgraphs with the "same" topology and without bidirectional data dependencies. Such a decomposition is done by cutting (32 by a set of straight lines parallel to j-axis. These subgraphs are then m a p p e d into an array with n processors by projecting each subgraph onto iaxis [12]. The resulting architecture, which is provided with an external RAM module, features a simple scheme of local communications and a small n u m b e r of I / O channels. The proposed A B F T Givens algorithm is executed on this array in s T=Z [(N+3-n(i-1)) + (M - n ( i - 1 ) )] i=1
time steps. For M = N, we have
T= N3/n -
( N - n)N2/2 + (2N-n)N2/(6n)
The processor utilization is En = W / ( T * n), where W is the c o m p u t a t i o n a l complexity of the proposed algorithm. Using these formulae, for example, in case of s = 10, we obtain E~ = 0.86 Note that with increasing in parameter s the value of En also increases.
803
nI
al
a2
a3
a4
a5
a6
a7
A
Fig. 2. Fixed-sized linear array
References 1. G.H. Golub, C.F. Van Loan, Matrix computations, John Hopkins Univ. Press, Baltimore, Maryland, 1996. 2. A. Bj6rck, Numerical methods for least squares problems, SIAM, Philadelphia, 1996. 3. S.Y. Kung, VLSI array processors, Prentice-Hall, Englewood Cliffs, N.J., 1988. 4. S.E. Butner, Triple time redundancy, fault-masking in byte-sliced systems. Teeh.Rep. CSL TR-211, Dept. of Elec.Eng., Stanford Univ., CA, 1981. 5. K.H. Huang, J.A. Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., C-35 (1984) 518-528. 6. V.S. Nair, J.A. Abraham, Real-number codes for fault-tolerant matrix operations on processor arrays. IEEE Trans. Comp., C-39 (1990) 426-435. 7. J.-Y. Han, D.C. Krishna, Linear arithmetic code and its application in fault tolerant systolic arrays, in Proc. IEEE Southeastcon, 1989, 1015-1020. 8. J.S. Kaniewski, O.V. Maslennikow, R. Wyrzykowski, Algorithm-based fault tolerant matrix triangularization on VLSI processor arrays, in Proc. Int. Workshop "Parallel Numerics'95", Sorrento, Italy, 1995, 281-295. 9. J.S. Kaniewski, O.V. Maslennikow, R. Wyrzykowski, Algorithm-based fault tolerant solution of linear systems on processor arrays, in Proc. 7-th Int. Workshop "PARCELLA '96", Berlin, Germany, 1996, 165 - 173. 10. D.E. Schimmel, F.T. Luk, A practical real-time SVD machine with multi-level fault tolerance. Proe. SPIE, 698 (1986) 142-148. 11. F.T. Luk, H. Park, An analysis of algorithm-based fault tolerance techniques. Proc. SPIE, 696 (1986) 222-227. 12. R. Wyrzykowski, a. Kanevski, O. Maslennikov, Mapping recursive algorithms into processor arrays, in Proc. Int. Workshop Parallel Numerics'94, M. Vajtersic and P. Zinterhof eds., Bratislava, 1994, 169-191.
Parallel Sparse Matrix Computations Using the PINEAPL Library: A Performance Study Arnold R. K r o m m e r The Numerical Algorithms Group Ltd, Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, UK arnoldk@nag, co. uk
Abstract. The Numerical Algorithms Group Ltd is currently participating in the European HPCN Fourth Framework project on Parallel [ndustrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the existing NAG Parallel Library for dealing with computationally intensive industrial applications by appropriately extending the range of library routines. Additionally, several industrial applications are being ported onto parallel computers within the PINEAPL project by replacing sequential code sections with calls to appropriate parallel library routines. A substantial part of the library material being developed is concerned with the solution of PDE problems using parallel sparse linear algebra modules. This talk provides a number of performance results which demonstrate the efficiency and scalability of core computational routines - in particular, the iterative solver, the preconditioner and the matrixvector multiplication routines. Most of the software described in this talk has been incorporated into the recently launched Release 1 of the PINEAPL Library.
1
Introduction
The NAG Parallel Library enables users to take advantage of the increased computing power and m e m o r y capacity offered by multiple processors. It provides parallel subroutines in some of the areas covered by traditional numerical libraries (in particular, the NAG Fortran 77, Fortran 90 and C libraries), such as dense and sparse linear algebra, optimization, quadrature and r a n d o m number generation. Additionally, the NAG Parallel Library supplies support routines for d a t a distribution, i n p u t / o u t p u t and process m a n a g e m e n t purposes. These support routines shield users from having to deal explicitly with the message-passing system - which m a y be M P I or PVM - on which the library is based. Targeted primarily at distributed m e m o r y computers and networks of workstations, the NAG Parallel Library also performs well on shared m e m o r y computers whenever efficient implementations of MPI or PVM are available. The fundamental design principles of the existing NAG Parallel Library are described in [7], and information about contents and availability can be found in the library Web page at http ://www. nag. co. uk/numeric/FD, html.
805 NAG is currently participating in the HPCN Fourth Framework project on Parallel Industrial N u m E r i c a l Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the NAG Parallel Library for dealing with a wide range of computationally intensive industrial applications by appropriately extending the range of library routines. In order to demonstrate the power and the efficiency of the resulting Parallel Library, several application codes from the industrial partners in the PINEAPL project are being ported onto parallel and distributed computer systems by replacing sequential code sections with calls to appropriate parallel library routines. Further details on the project and its progress can be found in [1] and in the project Web page at http ://www. nag. co. uk/proj ect s/PINEAPL, html. This paper focuses on the parallel sparse linear algebra modules being developed by NAG. 1 These modules are used within the PINEAPL project in connection with the solution of PDE problems based on finite difference and finite element discretization techniques. Most of the software described in this paper has been incorporated into the recently launched Release 1 of the PINEAPL Library. The scope of support for parallel PDE solvers provided by the parallel sparse linear algebra modules is outlined in Section 2. Section 3 provides a study of the performance of core computational routines - in particular, the iterative solver, the preconditioner and the matrix-vector multiplication routines. Section 4 summarizes the results and gives an outlook on future work. 2
Scope
of Library
Routines
A number of routines have been and are being developed and incorporated into the PINEAPL and the NAG Parallel Libraries in order to assist users in performing some of the computational steps (as described in [7]) required to solve PDE problems on parallel computers. The extent of support provided in the PINEAPL Library is - among other things - determined by whether or not there is demand for parallelization of a particular step in at least one of the PINEAPL industrial applications. The routines in the PINEAPL Library belong to one of the following classes: M e s h P a r t i t i o n i n g R o u t i n e s which decompose a given computational mesh into a number of sub-meshes in such a way that certain objective functions (measuring, for instance, the number of edge cuts) are optimized. They are based on heuristic, multi-level algorithms similar to those described in [4] and [2]. S p a r s e L i n e a r A l g e b r a R o u t i n e s which are required for solving systems of linear equations and eigenvalue problems resulting from discretizing PDE problems. These routines can be classified as follows: 1 Other library material being developed by PINEAPL partners includes optimization, Fast Fourier Transform, and Fast Poisson Solver routines.
806 I t e r a t i v e S c h e m e s which are the preferred method for solving large-scale linear problems. PINEAPL Library routines are based on Krylov subspace methods, including the CG method for symmetric positive-definite problems, the SYMMLQ method for symmetric indefinite problems, as well as the RGMRES, C G S , BICGSTAB(~) and ( T F ) Q M R methods for unsymmetric problems. P r e c o n d l t i o n e r s which are used to accelerate the convergence of the basic iterative schemes. PINEAPL Library routines employ a range of preconditioners suitable for parallel execution. These include domain decomposition-based, specifically additive and multiplicative Schwarz preconditioners, z Additionally, preconditioners based on classical matrix splittings, specifically Jacobi and Gauss-Seidel splittings, are provided, and multicolor orderings of the unknowns are utilized in the latter case to achieve a satisfactory degree of parallelism [3]. Furthermore, the inclusion of parallel incomplete factorization preconditioners based on the approach taken in [5] is planned. B a s i c L i n e a r A l g e b r a R o u t i n e s which, among other things, calculate sparse matrix-vector products. B l a c k - B o x R o u t i n e s which provide easy-to-use interfaces at the price of reduced flexibility. P r e p r o c e s s i n g R o u t i n e s which perform matrix transformations and generate auxiliary information required to perform the foregoing parallel sparse matrix operations efficiently. I n - P l a c e G e n e r a t i o n R o u t i n e s which generate the non-zero entries of sparse matrices concurrently: Each processor computes those entries which - according to the given data distribution - have to be stored on it. In-place generation routines are also provided for vectors which are aligned with sparse matrices. D i s t r i b u t i o n / A s s e m b l y R o u t i n e s which (re)distribute the non-zero entries of sparse matrices or the elements of vectors, stored on a given processor, to other processors according to given distribution schemes. They also assemble distributed parts of dense vectors on a given processor, etc. Additionally, all routines are available for both real and complex data. 3
Performance
of Library
Routines
For performance evaluation purposes, a simple PDE solver module, which uses a number of sparse matrix routines available in Release 1 of the P I N E A P L Library, has been developed. See Krommer [6] for a description of the solver module and the numerical problem it solves. Performance data generated by this module is used in this section to demonstrate the efficiency and scalability of core computational routines, such as the iterative solver, the matrix-vector multiplication and the preconditioner routines. 2 The subsystems of equations arising in these preconditioners are solved approximately on each processor, usually based on incomplete LU or Cholesky factorizations.
807
3.1
Test Settings
All experiments were carried out for a small-size problem configuration, n1 = n2 = n3 = 30, resulting in a system of n = 27 000 equations, and for a medium-size problem configuration, n1 = n2 = n3 = 60, resulting in a system of n = 216 000 equations. Most equations, except those corresponding to boundary grid points, contained seven non-zero coefficients. The low computation cost of the small problem helps to reveal potential inefficiencies - in terms of their communication requirements - of the library routines tested, whereas a comparison of the results for the two different problem sizes makes it possible to assess the scalability of the routines tested for increasing problem size. All performance tests were carried out on a Fujitsu AP3000 at the Fujitsu European Centre for Information Technology (FECIT). The Fujitsu AP3000 is a distributed-memory parallel system consisting of UltraSPARC processor nodes and a proprietary two-dimensional torus interconnection network. Each of the twelve computing nodes at FECIT used in the experiments was equipped with a 167 MHz UltraSPARC II processor and 512 MB of memory.

3.2 Performance of Unpreconditioned Iterative Solvers
Figure 1 shows the speed-ups of unpreconditioned iterative solvers for the two test problems. The iterative schemes tested - BICGSTAB(1), BICGSTAB(2) and CGS - have been found previously to be the most effective in the unsymmetric solver suite. BICGSTAB(2) and CGS can be seen to scale (roughly) equally well, whereas BICGSTAB(1) scales considerably worse (in particular, for the smaller problem). The results are generally better for the larger problem than for the smaller, especially when the maximum number of processors is used. In this case, the improvement in speed-up is between 10 and 20%, depending on the specific iterative scheme.
3.3 Performance of Matrix-Vector Multiplication
When solving linear systems using iterative methods, a considerable fraction of time is spent performing matrix-vector multiplications. Apart from its use in iterative solvers, matrix-vector multiplication is also a crucial computational kernel in time-stepping methods for PDEs. It is therefore interesting to study the scalability of this algorithmic component separately. Figure 2 shows the speed-ups of the matrix-vector multiplication routine in the PINEAPL Library. The results demonstrate a degree of scalability of this routine in terms of both increasing number of processors and increasing problem size. In particular, near-linear speed-up is attained for all numbers of processors for the larger problem.
3.4 Performance of Multicolor SSOR Preconditioners
Multicolor SSOR preconditioners are very similar to matrix-vector multiplication in their computational structure and their communication requirements.
Fig. 1. Speed-Up of Unpreconditioned Iterative Solvers
Fig. 2. Speed-Up of Matrix-Vector Multiplication
In particular, the two operations are equal in their computation-to-communication ratios. However, the number of messages transferred in multicolor SSOR preconditioners is larger than in matrix-vector multiplication (i.e. the messages are shorter). As a result, one would expect multicolor SSOR preconditioners to exhibit similar, but slightly worse, scalability than matrix-vector multiplication. Figure 3 confirms this expectation for configurations with up to eight processors. Between eight and twelve processors, however, the speed-up curves flatten out significantly. The precise reasons for this uncharacteristic behavior will have to be investigated in future work.
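A standard latency/bandwidth cost model (an assumption added here for illustration; it is not stated in the paper) makes this expectation explicit:
\[ T_{\mathrm{comm}} \approx n_{\mathrm{msg}}\,\alpha + V\,\beta , \]
where \(\alpha\) is the per-message start-up latency, \(\beta\) the per-byte transfer time, \(n_{\mathrm{msg}}\) the number of messages and \(V\) the total volume sent. The multicolor SSOR sweep moves roughly the same volume \(V\) as the matrix-vector product, but splits it into more (hence shorter) messages, so the latency term grows while the bandwidth term is unchanged.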
Fig. 3. Speed-Up of Multicolor SSOR Preconditioners
3.5 Performance of Additive Schwarz Preconditioners
The multicolor orderings used in the PINEAPL SSOR preconditioners (see Section 3.4) may vary depending on the number of processors and/or the data distribution used. However, experiments have shown that the quality of multicolor SSOR preconditioners is fairly independent of the particular multicolor orderings they are based on: the number of iterations required to achieve convergence does not vary significantly when different multicolor SSOR preconditioners are used in an iterative solver. With additive Schwarz preconditioners, the situation is very different: the quality of additive Schwarz preconditioners usually decreases when the number of processors - and hence the number of subdomains - is increased, i.e. the
number of iterations required to achieve convergence with a corresponding preconditioned iterative solver increases. Therefore, the performance of additive Schwarz preconditioners cannot be evaluated separately, but only in conjunction with their application in iterative solvers. Figure 4 shows the speed-ups of the BICGSTAB(1) iterative solver in conjunction with several additive Schwarz preconditioners.3 The additive Schwarz preconditioners tested differ in the size of the overlap regions between subdomains. In particular, the choice "overlap=0" corresponds to a traditional block Jacobi preconditioner. The scalability results are clearly worse than those shown in the previous sections. In particular, the transition from one to two processors actually results in a performance decrease for the block Jacobi preconditioner. Generally speaking, the overlapping additive Schwarz preconditioners perform significantly better than the block Jacobi preconditioner for small numbers of processors; for larger numbers of processors, the performance gain from using overlap regions decreases or - for the smaller problem - is completely lost. In Krommer [6] the hardware performance4 speed-ups corresponding to the temporal performance5 speed-ups represented in Figure 4 are investigated. It turns out that the hardware performance scales significantly better than the temporal performance - indeed, the results for the hardware performance are quite comparable to the results obtained for the unpreconditioned solvers in Figure 1. This fact demonstrates that (i) the implementation of the additive Schwarz preconditioners is efficient in terms of communication requirements and that (ii) the main reason for the relatively poor temporal scalability lies in the unsatisfactory quality of additive Schwarz preconditioners.
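To make the comparison concrete (this formulation follows from the definitions given in footnotes 4 and 5 below, rather than being stated explicitly in the paper), write \(T_p\) for the run-time and \(F_p\) for the number of floating-point operations performed on \(p\) processors. Then
\[ S_p^{\mathrm{temp}} = \frac{T_1}{T_p}, \qquad S_p^{\mathrm{hw}} = \frac{F_p/T_p}{F_1/T_1} = \frac{F_p}{F_1}\, S_p^{\mathrm{temp}} . \]
Since the preconditioner quality degrades as \(p\) grows, the solver needs more iterations, so \(F_p > F_1\) and the hardware-performance speed-up exceeds the temporal one, which is the gap observed here.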
4 Conclusion and Outlook
The performance study in Section 3 has established the efficiency and scalability of a number of core computational routines in the sparse linear algebra modules of Release 1 of the PINEAPL Library. The only main deficiency detected is the poor quality of additive Schwarz preconditioners for two or more processors when compared to the additive Schwarz preconditioner for a single processor (= incomplete LU factorization). It is planned to overcome this problem by providing parallel incomplete factorization preconditioners in Release 2 of the PINEAPL Library.
3 The subsystems associated with different domains were approximately solved using local modified ILU(0) factorizations.
4 The hardware performance of a program is defined as the ratio between the number of floating-point operations carried out and the run-time of the program.
5 The temporal performance of a program is defined as the reciprocal of the run-time of the program.
Fig. 4. Speed-Up of BICGSTAB(1) with Additive Schwarz Preconditioners
References
1. M. Derakhshan, S. Hammarling, A. Krommer, PINEAPL: A European Project on Parallel Industrial Numerical Applications and Portable Libraries, in "Recent Advances in Parallel Virtual Machine and Message Passing Interface" (M. Bubak et al., Eds.), Vol. 1332 of LNCS, Springer-Verlag, Berlin, 1997, pp. 337-342.
2. A. Gupta, Fast and Effective Algorithms for Graph Partitioning and Sparse Matrix Ordering, IBM Research Report RC 20496, IBM T. J. Watson Research Center, Yorktown Heights, New York, 1996.
3. M. T. Jones, P. E. Plassman, Scalable Iterative Solution of Sparse Linear Systems, Parallel Computing 20 (1994), pp. 753-773.
4. G. Karypis, V. Kumar, Multilevel k-Way Partitioning Scheme for Irregular Graphs, Technical Report TR-95-064, Department of Computer Science, University of Minnesota, 1995.
5. G. Karypis, V. Kumar, Parallel Threshold-Based ILU Factorization, Technical Report TR-96-061, Department of Computer Science, University of Minnesota, 1996.
6. A. Krommer, Parallel Sparse Matrix Computations Using the PINEAPL Library: A Performance Study, Technical Report TR1/98, The Numerical Algorithms Group Ltd, Oxford, UK, 1998.
7. A. Krommer, M. Derakhshan, S. Hammarling, Solving PDE Problems on Parallel and Distributed Computer Systems Using the NAG Parallel Library, in "High-Performance Computing and Networking" (B. Hertzberger, P. Sloot, Eds.), Vol. 1225 of LNCS, Springer-Verlag, Berlin, 1997, pp. 440-451.
Using a General-Purpose Numerical Library to Parallelize an Industrial Application: Design of High-Performance Lasers*
Ida de Bono 1, Daniela di Serafino 1,2 and Eric Ducloux 3
1 Center for Research on Parallel Computing and Supercomputers, CPS-CNR, Naples, Italy
2 The Second University of Naples, Caserta, Italy
3 Thomson-CSF/LCR, Orsay Cedex, France
Abstract. We describe the development of a parallel version of an industrial application code, used to simulate the propagation of optical beams inside diode-pumped rod lasers. The parallel version makes an intensive use of a 2D FFT routine from a general-purpose parallel numerical library, the PINEAPL Library, developed within an ESPRIT project. We discuss some issues related to the integration of the library routine into the application code and report first performance results.
1 Introduction
The development of high performance lasers is presently a very active field. An industrial challenge is the production of 100W-1kW diode-pumped rod lasers with high compactness, high beam quality and high energy conversion. The numerical simulation of the propagation of optical beams inside these lasers plays a central role in the optimization of the whole design process and leads to a reduction of the development time. On the other hand, the industrial design of rod lasers requires intensive simulation, and an effective use of high-performance computers allows the so-called "time-to-market" to be reduced. It also gives the possibility of increasing the level of detail in the simulation without increasing the computation time, which is an important requirement in an efficient design process. In this work we present the main issues concerning the development of a parallel version of an industrial code, used at Thomson-CSF LCR to model the propagation of optical beams in diode-pumped rod lasers. This version, developed for MIMD distributed-memory machines, has been obtained exploiting software from a general-purpose parallel numerical library. Numerical software libraries play a significant role in the development of application codes, since they provide building blocks for the solution of computational kernels arising in the modeling of scientific problems. The availability of reliable, accurate and efficient numerical software allows users to develop their application codes without dealing with
details related to the implementation of numerical algorithms, improving at the same time the quality of the computed solutions [2]. In particular, the use of numerical software for High-Performance Computing environments can simplify the porting of application codes to such environments, allowing also the introduction of different mathematical models and numerical techniques that can better exploit the powerful resources of advanced machines. This work has been carried out as a part of the PINEAPL (Parallel Industrial NumErical Applications and Portable Libraries) Project, an ESPRIT Project in the area of High-Performance Computing and Networking, with a duration of three years (1996-1998) [3]. The main goal of the project is to produce a general-purpose library of parallel numerical software suitable for a wide range of computationally intensive industrial applications. This project is a collaborative effort among academic/research partners (CERFACS (F), CPS-CNR (I), IBM Italia (I), Manchester University (UK) and Math-Tech (DK)) and industrial partners (British Aerospace (UK), Danish Hydraulic Institute (DK), Piaggio Veicoli Europei (I) and Thomson-CSF (F)), under the coordination of NAG (UK). The industrial partners are the end-users of the project and have provided several application codes, that are ported on parallel machines using routines from the PINEAPL Library. A first release of the library is already available and includes numerical routines in the areas of Fast Fourier Transforms (FFTs), Nonlinear Optimization and Sparse Linear Algebra.1 The parallel version of the rod laser application code, described in this paper, integrates the 2D FFT routine of the PINEAPL Library.
2 Numerical Simulation of Beam Propagation Inside Rod Lasers
The numerical simulation is mainly devoted to describe the evolution of optical beams, which propagate inside the rod laser following a "zig-zag" trajectory (Figure 1). The final goal is to estimate the deformation and the amplification of laser spots, and to find an optimal coupling between laser beams and pumping devices. The complete simulation is performed using three application codes,
Fig. 1. Schematic picture of a rod laser.
1 For more details see the URL http://www.nag.co.uk/Projects/PINEAPL.html.
modeling the heating of the rod, the propagation of the beams and the transformation due to the reflection on mirrors. The work described here is devoted to the development of a parallel version of the propagation code. The mathematical model of the propagation of an optical beam is based on the Helmholtz equation:
\[ \Delta \Psi + k_0^2 n^2 \Psi = 0, \]
where \(\Psi = \Psi(x, y, z)\) is the (polarized) electromagnetic field, \(n\) is the index of the medium, \(k_0 = 2\pi/\lambda\) is the wave number corresponding to the wavelength \(\lambda\), and \(\Psi\) and \(n\) have complex values. Let us assume for simplicity that the direction of propagation is \(z\). The fast variation of the field along \(z\) is separated from the "modulation", \(\Psi(x, y, z)\), leading to an equation of the form:
\[ \Delta \Psi + 2 i k_z \frac{\partial \Psi}{\partial z} + k_0^2 \left(n^2 - n_z^2\right) \Psi = 0, \]
(1)
where \(k_z\) is a suitable wave number, \(n_z = k_z/k_0\) and \(i = \sqrt{-1}\). The solution of equation (1) is performed using a method in the class of Beam Propagation Methods (BPMs) [6], that are widely used in integrated optics. The BPM used here is based on a splitting of equation (1) into two equations, modelling the propagation and the "lens" diffraction, respectively:
\[ \Delta_{x,y} \Psi + \frac{\partial^2 \Psi}{\partial z^2} + 2 i k_z \frac{\partial \Psi}{\partial z} = 0, \tag{2} \]
\[ 2 i k_z \frac{\partial \Psi}{\partial z} + k_0^2 \left(n^2 - n_z^2\right) \Psi = 0, \tag{3} \]
where \(\Delta_{x,y}\) is the Laplace operator in the (x, y)-plane. The solution is obtained solving in turn equations (2) and (3) along the z-direction, that is repeating the following steps:
- propagation from z to z + dz/2 (eq. (2));
- lens diffraction from z to z + dz (eq. (3));
- propagation from z + dz/2 to z + dz (eq. (2)).
Each propagation step is carried out performing a transformation from the (x, y)-plane to the 2D frequency space, where the solution is known analytically, and coming back to the (x, y)-plane. The transformations are obtained as direct and inverse Discrete Fourier Transforms (DFTs) on complex data arising from the discretization of equation (2) on a uniform grid. The DFTs are computed using FFT algorithms. We note that the computational (x, y)-domain depends on the angle of incidence of the beam at the entry of the rod laser. For large angles, the beam can cross the boundaries of the rod; in this case the domain is unfolded, i.e. it is reflected with respect to the crossed boundaries, in x- or y-direction or both, and the computation is performed in a domain that is twice or four times the original domain.
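For illustration, one split-step update of this kind can be sketched as follows. This is a minimal serial sketch, not the PROBLA implementation: it assumes the paraxial approximation (the second z-derivative in eq. (2) is neglected), and the function and variable names (bpm_step, psi, n, dx, dy) are hypothetical.

```python
import numpy as np

def bpm_step(psi, n, nz, k0, kz, dz, dx, dy):
    """One split-step BPM update of the field psi over a distance dz."""
    ny_pts, nx_pts = psi.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx_pts, d=dx)
    ky = 2 * np.pi * np.fft.fftfreq(ny_pts, d=dy)
    KX, KY = np.meshgrid(kx, ky)
    # Propagation over dz/2: eq. (2) solved in the 2D frequency space
    # (paraxial form, second z-derivative neglected).
    half_prop = np.exp(-1j * (KX**2 + KY**2) * (dz / 2) / (2 * kz))
    psi = np.fft.ifft2(np.fft.fft2(psi) * half_prop)
    # Lens diffraction over dz: eq. (3), local in the (x, y)-plane.
    psi = psi * np.exp(1j * k0**2 * (n**2 - nz**2) * dz / (2 * kz))
    # Propagation over the remaining dz/2.
    psi = np.fft.ifft2(np.fft.fft2(psi) * half_prop)
    return psi
```

Each half-step requires a forward and an inverse 2D FFT, which is why the 2D FFT is the dominant kernel of the propagation code.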
3 Integration of the Parallel 2D FFT Routine
Until now, most of the work has been carried out on a simplified version of the propagation code, named PROBLA, which neglects the coupling with thermal effects and some physical details that increase the complexity of the model. This version is considered as a testbed, to evaluate efforts and improvements concerning the integration of the parallel 2D FFT of the PINEAPL Library. The experience gained in working on the simplified code will be exploited to develop a parallel version of the complete industrial code. PROBLA is written in Fortran 77, apart from some routines for the dynamical allocation of memory, that are written in C. Its structure is sketched in Figure 2.
1. read input parameters;
2. initialize the electro-magnetic field;
3. compute the mean index of the medium;
4. compute the energy of the beam;
5. for i = 1, visual_steps
   5.1 simulate the beam propagation (BPM);
   5.2 rearrange the data to be visualized;
   endfor
6. reconstruct the real field from the computational field;
7. compute the energy of the beam;
8. output the electro-magnetic field to be visualized.
Fig. 2. Structure of the PROBLA code.
The bulk of computation is in step 5.1, as shown by the plots of the execution times reported in Section 4. This step implements the BPM method and hence makes an intensive use of 2D FFTs. The BPM is applied between two successive visualization positions, that is positions along the direction of propagation where selected values of the electro-magnetic field are required for a future visualization of the evolution of the beams along that direction. Taking into account the above considerations, a speedup can be achieved parallelizing the execution of each FFT. Hence, our work has been devoted to the integration of the parallel 2D FFT routine of the PINEAPL Library into the application code. This routine performs a complex-to-complex FFT on a 2D array of data, distributed in row-block form over a 2D logical grid of processors, i.e. assigning approximately the same number of consecutive rows of the array to each processor, moving row by row on the processor grid. In the rod laser application code, such a data distribution corresponds to partitioning the (x, y) computational grid along the x-direction, that is into vertical strips. The algorithm implemented in the parallel FFT routine is the so-called four-step algorithm, that is: 1D FFTs along the rows of the 2D array, transpose,
1D FFTs along the rows, and transpose. All the communication is performed in the transpose steps, and the second transpose can be avoided when a direct transform followed by an inverse transform must be executed, as it is in the PROBLA code. All the communication is performed using the Basic Linear Algebra Communication Subprograms (BLACS) [4], as in most of the PINEAPL Library routines. More details are given in [1]. An efficient integration of the parallel FFT routine requires the introduction of the row-block distribution in the whole application code, to avoid unnecessary data movements among different stages of the simulation. Therefore, the different steps of PROBLA have been analyzed and modified to use this data distribution. In the BPM step (5.1), the solution in the frequency space and the computation of the lens diffraction are local processes, and hence are performed without any communication among processors, doing only minor changes in the corresponding parts of the application software. A few changes have been introduced to rearrange the data to be visualized (step 5.2), although this step does not require any communication. The initialization of the electro-magnetic field (step 2) is local too, and has been only slightly modified to be run in parallel. Global communications are required in some initial and final steps, such as the computation of the mean index of the medium (step 3) and of the energy of the beam (steps 4 and 7). These steps have been reduced to global sum operations, that are executed using ad hoc routines from the BLACS. Modifications have been introduced to reconstruct the real field from the computational field (step 6), when the unfolding along the x-direction is performed. In this case, processors holding rows that are symmetric with respect to the central row of the computational domain exchange data to reconstruct the field in the effective domain. The input/output has not been parallelized and is performed by only one processor, which distributes the input parameters to the others and gathers the data to be printed out. This allows the same input/output files to be kept as in the sequential code; moreover, it has been observed that generally the input/output does not account for a large percentage of the total execution time (see Section 4).
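The four-step decomposition itself can be sketched in serial form as follows. This is only an illustrative sketch under the assumption that the 2D FFT is factored exactly as described above; the function name is hypothetical, and the actual PINEAPL routine works on row-blocks distributed over processors, with the transposes implemented as BLACS communication.

```python
import numpy as np

def four_step_fft2(a):
    """2D FFT as: 1D FFTs along rows, transpose, 1D FFTs along rows, transpose."""
    a = np.fft.fft(a, axis=1)   # step 1: 1D FFTs along the rows
    a = a.T                     # step 2: transpose (the communication step in parallel)
    a = np.fft.fft(a, axis=1)   # step 3: 1D FFTs along the rows of the transposed array
    return a.T                  # step 4: transpose back (skippable for a forward/inverse pair)

# Check against a reference 2D FFT on a small complex array.
x = np.random.rand(4, 8) + 1j * np.random.rand(4, 8)
assert np.allclose(four_step_fft2(x), np.fft.fft2(x))
```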
4 Numerical Experiments
First experiments have been carried out to evaluate the efficiency of the parallel version of the rod laser application code, using two sets of benchmarks of interest for Thomson-CSF. In the following, they are referred to as Benchmark 1 and Benchmark 2. In both benchmark sets, the number of grid points in the y-direction, ny, is kept fixed, while the number of grid points in the x-direction, nx, varies. In Benchmark 1 nx = 64, 128, 256, 512, 1024 and ny = 64; in Benchmark 2 nx = 64, 128, 256, 512, 1024 and ny = 128. Moreover, the increment along z is dz = 10 µm, the number of visualization steps, i.e. the number of times the BPM step is repeated, is 100, and the domain is unfolded along x, thus the number of grid points in this direction is doubled.
The experiments have been performed on an IBM SP2, at CPS-CNR. This machine has 12 Power2 Super Chip thin nodes (160 MHz), each with 512 MBytes of memory and a 128 Kbytes L1 data cache; the nodes are connected via an SP switch, with a peak bi-directional bandwidth of 110 MBytes/sec. The SP2 has the AIX 4.2.1 operating system; the BLACS 1.1 are implemented on top of the IBM proprietary version 2.3 of the MPI message-passing library [5]. The Fortran part of the parallel code has been compiled with the XL Fortran compiler, vers. 4.1, and the C part with the XL C compiler, vers. 3.1.4. Figure 3 shows the execution times, in seconds, of a single BPM step and of the whole PROBLA (including input/output), on 2, 4, 8, 12 processors, for Benchmark 1. Remembering that a BPM step is performed 100 times, we see
Fig. 3. Execution times of a BPM step and of PROBLA for Benchmark 1.
that it accounts for a very large percentage of the total execution time. This percentage decreases as the number of processors increases; for example, it is more than 98% on 2 processors and reaches the lowest value of 95% on 12 processors. One reason for such a reduction is the load imbalance generated by unfolding along the x-direction. In this case, the rearrangement of the data to be visualized (Figure 2, step 5.2) and the reconstruction of the real field from the computational field (Figure 2, step 6) are mainly performed by half the processors, that is by the processors that hold the effective domain, and hence the distribution of the computational work is not balanced. One more reason is the sequential input/output; its time is at most 1.5% of the total execution time on 2 processors, while it reaches 3.3% on 12 processors. Figure 4 shows the speedup of a BPM step and of PROBLA (including input/output) with respect to the original sequential code, for Benchmark 1.
2 The speedup is here defined as Sp = T1/Tp, where T1 is the execution time of the original sequential code on one processor and Tp is the execution time of the parallel code on p processors, for the same problem.
The parallel code has been compared with the sequential code to see the effective gain obtained with the parallelization. The speedup lines of a BPM step and of
Fig. 4. Speedup of a BPM step and of PROBLA for Benchmark 1.
PROBLA have similar behaviours; the values concerning PROBLA are generally slightly lower, according to the previous considerations on the execution times. The speedup plots show that for each number of processors there is a "best problem size" where the parallel code gets the largest speedup. For example, the largest speedup of the BPM step on 4 processors is obtained for nx = 64 and is 3.25, while on 8 processors it is obtained for nx = 256 and is 5.35 (the value of ny is not considered because it is constant). On the other hand, we can see that there is a "best number of processors" for each problem size, where the efficiency reaches the largest value.3 For example, for the BPM step with nx = 64 such best number is 4, with an efficiency of 0.78, while for nx = 256 it is 8 with an efficiency of 0.67. Using more processors than the optimal number does not increase the performance, as it is evident for nx = 64 and nx = 128. Experiments on more than 12 processors should be performed to confirm this on problems with larger sizes. The behaviour discussed above is due to the use of the cache memory and to the communications performed in the parallel global transpose operation inside the FFT routine. Finally, we note that there is no gain in terms of execution time when only 2 processors are used, except for nx = 64. This is because the sequential FFT used in the original code is generally faster than the parallel FFT on one processor and is comparable with it on 2 processors. This is probably due to the fact that the sequential FFT routine does not perform transpositions of data as the parallel FFT routine does; moreover, the sequential FFT has been implemented specifically for radix-2 transforms, that are the transforms
3 The efficiency is defined
as
Sp/p,
i.e. as the speedup divided by the number of
819 considered in PROBLA, while the parallel F F T has been implemented to perform more general mixed-radix transforms. The previous considerations apply also to the execution times and the speedup of B P M and P R O B L A for Benchmark 2, shown in Figures 5 and 6. Since the BPM- Benchmark2 100 "-
50:
..
"'.
~
"'-.
"~- . ~
. -. . . . . . . . .
:
NX=64 NY=128-*-NX=128NY=128~NX=256NY=128-e NX=512NY=128 ~, NX=1024NY=128~ -
30
1000 ~". . . . . 50001-
PROBLA- Benchmark2 , '
, ""-.
f'--.. | ...........
"'~ . . . . . . . . .........
"...........
, ~_
NX=64 NY=128~ - I NX=128 NY=128+ / NX=256NY=128,a,- J NX=S12NY=28 x / NX=1024 NY=128 -
J
200(
20'
I00[
~0
50e
................ "
'"
" ........
-
.... i . . --~
"'"--'"'" ". . . . . . .
7. '''--.:.........
""
......
i..................... ,~
20C
..............
I0C
........
.........i ........... iiiiiiiii:i : l
i
4
8 numberof processors
12
4
i
8 numberof processors
12
Fig. 5. Execution times of a BPM step and of PROBLA for Benchmark 2.
Fig. 6. Speedup of a BPM step and of PROBLA for Benchmark 2. value of rty has been doubled with respect to Benchmark 1, the best number of processors after which the performance degrades is generally larger, and is visible only for the cases n,: = 64 and nx = 128. Moreover, the speedup values are generally higher than in Benchmark 1; on 12 processors, the B P M step reaches a speedup of 9 and the whole P R O B L A code a speedup of 8.
820
5
Conclusions
A parallel version of an industrial application code has been developed, which integrates the 2D F F T routine of the parallel numerical library developed in the E S P R I T Project P I N E A P L . First experiments have been performed to evaluate the performance of the parallel application code, obtaining satisfactory speedup values. Work is presently devoted to evaluate the scalability of the parallel code. Preliminary tests have been performed on a single B P M step, obtaining encouraging results; values of scaled efficiency 4 between 0.75 and 0.84 have been measured on 4 and 8 processors, using problems of size nx x ny, with n~:/p = 256,512, 1024 and ny = 128.
References 1. Carracciuolo, L., de Bono, I., De Cesare, M.L., di Serafino, D., Perla, F.: Development of a Parallel Two-dimensional Mixed-radix FFT Routine. Tech. Rep. TR-978, Center for Research on Parallel Computing and Supercomputers (CPS) - CNR, Naples, Italy (1997) 2. di Serafino, D., Maddalena, L., Messina, P., Murli, A.,: Some Perspectives on HighPerformance mathematical Software. In A. Murli and G. Toraldo eds., Computational Issues in High Performance Software for Nonlinear Optimization, Kluwer Academic Pub. (to appear) 3. di Serafino, D., Maddalena, L., Murli, A.,: PINEAPL: a European Project to Develop a Parallel Numerical Library for Industrial Applications. In Euro-Par'97 Parallel Processing, C. Lengauer, M. Griebl and S. Gorlateh eds., Lecture Notes in Computer Science 1300 (1997), Springer, 1333-1339. 4. Dongarra, J.J., Whaley, R.C.: A Users' Guide to the BLACS v 1.0. LAPACK Working Note No. 94, Technical Report CS-95-281, Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN (1995) 5. Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.J.: MPh The Complete Reference. MIT Press, Cambridge, MA (1996) 6. Van Roey, J., Van der Donk, J., Lagasse, P.E.: Beam-propagation method: analysis and assessment. J. Opt. Soc. Am. 71 (1981)
4 The scaled efficiency is here defined as SEp = T1(nx, ny)/Tp(p x nx, ny), where Ti(r, s) is the execution time of the parallel code on i processors, for a problem of size r x s.
Fast Parallel Hermite Normal Form Computation of Matrices over F[x]
Clemens Wagner
Rechenzentrum der Universität Karlsruhe, Germany
clemens@exp-math.uni-essen.de
Abstract. We present an algorithm for computing the Hermite normal form of a polynomial matrix and an unimodular transformation matrix on a distributed computer network. We provide an algorithm for reducing the off-diagonal entries which is a combination of the standard algorithm and the reduce off-dlagonal algorithm given by Chou and Collins. This algorithm is parametrised by an integer variable. We provide a technique for producing small multiplier matrices if the input matrix is not full-rank, and give an upper bound for the degrees of the entries in the multiplier matrix.
1 Introduction
We study the problem of computing the Hermite normal form and an unimodular transforming matrix of a large, rectangular input matrix over F[x]. Our motivation comes from the area of convolutional encoders where we need the transforming matrix to construct the decoder of a given minimal, basic encoder (cf. [6]). Since the degrees of the entries in the transforming matrix determine the size of the decoder's circuit, we are interested in a transforming matrix whose entries have small degrees.
Definition 1. A matrix H ∈ Mat_{m×n}(F[x]) with rank r ∈ N0 is in Hermite normal form (HNF) if
1. the first r columns of H are nonzero, and the last n − r columns are zero;
2. for 1 ≤ j ≤ r, let i_j be the index of the first nonzero entry in the jth column; then 1 ≤ i_1 < i_2 < ... < i_{r−1} < i_r ≤ m;
3. the pseudo diagonal entries H_{i_j,j} for 1 ≤ j ≤ r are monic;
4. for 1 ≤ j ≤ r and 1 ≤ k < j the degree of the off-diagonal entry H_{i_j,k} is less than deg(H_{i_j,j}).
We call 1 ≤ i_1 < i_2 < ... < i_{r−1} < i_r ≤ m the row indices of the pseudo diagonal entries of H. If only the first two conditions are fulfilled we say H is in echelon form. Two matrices A, B ∈ Mat_{m×n}(F[x]) are right equivalent if there exists an unimodular matrix V ∈ GL_n(F[x]) such that AV = B. We call V a transforming matrix or multiplier.
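For illustration (this example is not from the paper), a 3 × 3 matrix over Q[x] with rank r = 2 satisfying these conditions is
\[
H = \begin{pmatrix} x^2 + 1 & 0 & 0 \\ x & 0 & 0 \\ x + 2 & x^3 + x & 0 \end{pmatrix},
\]
with row indices i_1 = 1 and i_2 = 3 of the pseudo diagonal entries: the last column is zero, both pseudo diagonal entries x^2 + 1 and x^3 + x are monic, and the off-diagonal entry H_{3,1} = x + 2 has degree less than deg(x^3 + x).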
Theorem 1. For each matrix A ∈ Mat_{m×n}(F[x]) there exists a unique, right equivalent matrix H ∈ Mat_{m×n}(F[x]) in Hermite normal form.
There are many well-known algorithms for computing the Hermite normal form and a multiplier of an integer matrix. Gaussian elimination based methods, as in Sims [13, p. 323], eliminate above-diagonal entries of a whole row by extended greatest common divisor computations. There are many different specialisations of this general algorithm (e.g. [2], [4], [8]) which can be generalised to arbitrary Euclidean domains [14]. Gaussian elimination has worst-case exponential time and space complexity for matrices over F_{2^k}[x] and Q[x] [9]. But it was shown by Wagner [14], [15] that some variants of Gaussian elimination have a better practical performance than algorithms guaranteeing a polynomial time and space complexity.
2 Parallel Implementation
A parallel computer P := {π_0, ..., π_{N−1}} consists of N processors with distinct memory and a communication network. Let A ∈ Mat_{m×n}(F[x]), and let
ℓ(l, k) := ⌊l/N⌋ + 1 if k < l mod N, and ℓ(l, k) := ⌊l/N⌋ otherwise,
for each 0 ≤ k ≤ N − 1. Each processor π_k on the parallel computer stores an (ℓ(m, k) + 1) × n matrix A^(k) and an ℓ(n, k) × n matrix V^(k), which are submatrices of A and the multiplier V ∈ GL_n(F[x]), respectively. The additional row ℓ(m, k) + 1 of A^(k) is used for computations. We will refer to this row as the computation row.
Fig. 1. Distribution of a matrix on a parallel computer (a & b.1); broadcasting a part of a row (b.1 & b.2)
We distribute the input matrix A by storing the ith row of A in row i' of matrix A^(k) on processor π_k, where k := (i − 1) mod N and i' := ⌊(i − 1)/N⌋ + 1
for 1 ≤ i ≤ m. Note: the computation row of every A^(k) will not be set here. We construct the identity matrix I_n in the matrices V^(k) in a corresponding way. Figure 1 shows how a matrix with eight rows (a) is distributed on four processors of a parallel computer (b.1). In contrast to Gaussian elimination over fields, Gaussian elimination over Euclidean domains needs several steps for eliminating the entries in a row. Over Euclidean domains we have just division with remainder instead of exact division. This is the reason why we distribute rows instead of columns on the parallel computer. The main HNF algorithm is given by pseudocode in Algorithm 1.
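A small executable sketch of this round-robin row distribution (illustrative only; the function names are not from the paper):

```python
def owner_and_local_row(i, N):
    """Return (processor index k, local row index i') for global row i (1-based),
    under k = (i-1) mod N and i' = (i-1)//N + 1."""
    return (i - 1) % N, (i - 1) // N + 1

def local_row_count(l, k, N):
    """Number of rows of an l-row matrix stored on processor k (the quantity l(l, k))."""
    return l // N + (1 if k < l % N else 0)

# Example: eight rows on four processors, as in Figure 1.
print([owner_and_local_row(i, 4) for i in range(1, 9)])
# [(0, 1), (1, 1), (2, 1), (3, 1), (0, 2), (1, 2), (2, 2), (3, 2)]
print([local_row_count(8, k, 4) for k in range(4)])  # [2, 2, 2, 2]
```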
PARALLEL-HNF(A^(k), V^(k), b)
1   input: k — processor index; A^(k) — π_k-part of a distributed m × n matrix A; V^(k) — π_k-part of the distributed n × n identity matrix V; b — non-negative integer
2   r ← 0
3   for i ← 1 to m do
4       g ← (i − 1) mod N
5       if g = k then
6           t ← ⌊(i − 1)/N⌋ + 1
7           broadcast (A^(k)_{t,r+1}, ..., A^(k)_{t,n}) to all other processors
8       else
9           t ← ℓ(m, k) + 1   ▷ index of the computation row
10          receive (A^(k)_{t,r+1}, ..., A^(k)_{t,n}) from π_g
11      A^(k) ← COMPUTE-GCD(A^(k), V^(k), t, r + 1, n)
12      if A^(k)_{t,r+1} ≠ 0 then
13          r ← r + 1
14          i_r ← i
15  (A^(k), V^(k)) ← PARALLEL-ROD(A^(k), V^(k), r, b, i_1, ..., i_r)
16  return (A^(k), V^(k))
Algorithm 1. Parallel HNF computation
Bli,jl = gcd(Bi,jl,... , Bi,j2) and B'i,j = 0 for Jl < j _< j2. This algorithm computes B' from B by unimodular column operations (swapping of two columns, multiplying a column with an unit, and adding a multiple of one column to another column) and does the same transformations on C for computing C'. There are several different methods to obtain /3' from B (e. g. [1], [3], [71, [14]). The variable r corresponds to the number of linearly independent rows of A with index less than i. After execution of the for-loop r is the rank of A, il, .. 9 , ir are the indices of the pseudo diagonal entries of A, and A is in echelon form. The technique of computation rows make it easy to reuse already implemented sequential algorithms for the gcd computation. If we use a g c d algorithm whose execution depends only from the entries in row Ai,. we need no communication for function COMPUTE-GCD. The algorithms of Blankinship [1], Bradley [3] or Majewski-Havas [12] and the Multiple-Remainder-GCD as in Wagner [14] are examples for that. The Norm-Driven-Sorting-GCD due to Havas-Majewski-Matthews [8] needs communication for computing column norms of A, and it was shown by examples in Wagner [14], that this algorithm is not useful for parallel HNF computation.
3
Reducing the Off-Diagonal Entries
After leaving the for-loop in Algorithm 1 the current matrix A is in echelon form, and the current V is a multiplier matrix. There are two important algorithms to obtain the Hermite normal form from a matrix in echelon form. There is a row based (standard), as in [11], [13] and a column based method, due to Chou-Collins [5]. Algorithm 2 is a hybrid technique which is controlled by parameter b C No. If we choose b equal to zero the algorithm does not change the input matrices. For b = 1 this algorithm is a parallel variant of the reduction method of Chou-Collins, and if b + 1 _> r = rank(A) we have a parallel variant of the standard reduction algorithm. The function MONIC-MULTIPLIER takes f E ]F[x] as input and gives 1
if f ~ 0 and a is the leading coefficient of f
1
if f = 0
as output. The calls of this function are needed to guarantee point 3. in the definition of the Hermite normal form. For f, g E IF[x] \ {0} the expression f / g is equal to (the unique) q 6 IF[x] given by f = g . q + r with deg(r) < deg(g) where deg(0) := - o o . Algorithm 2 parts the input matrix into stripes of width b, and it reduces all off-diagonM entries in each stripe. The macro reduction order of the stripes is from right to left, and the micro reduction order reduces the entries in the stripes
825 PARALLEL-ROD(A (k) , V (k) , r, b, il . . . . , it) l input k processor index A (k) 7~k-part o f a d i s t r i b u t e d m • n m a t r i x A V (k) ~ k - p a r t o f a d i s t r i b u t e d n • n m a t r i x V r, b, i l , . 9 9 , i~ n o n - n e g a t i v e i n t e g e r s
2
ifr >0andb>0then
3
f4-r
4
repeat
5
14-f-1
6 7 8 9 10 n 12 13 14 15 16
f +---m a x { l , f - b} f o r j~-- f + l t o r do ge-- (i 3 - 1 ) m o d N h 4- m i n { j - 1, l} ifg---k then t ~ [(ij - 1)/NJ + 1 b r o a d c a s t \( ~At ,(k) f l " "* l 'a(k) x t , h ~ A(k)~ t,3 2 t o all o t h e r else t 4- ~(m, k) + 1 :> I n d e x o f c o m p u t a t i o n row r e c e i v e \(~(k) (k) ) f r o m ~r9 ' ' t , f . . . .. A(k) t h Ato f o r s +-- f t o h d o
17
q
18
A 9(k) ,s
/A k),j 4--
A (k) *,s
- -
q "
A (k! *,3
19 20 21
v,(2 4- v,(2 - q . if j < l + 1 t h e n A ~ +- MONIC-MULTIPLIER(A~,k))A(~
22 23
V(,~) 4- MONIC-MULTIPLIER(A}k;)V(,~ ) until f = 1 p 4- (ia - 1) mod N i f p = nk t h e n
24
25
t 4- g ( m , k) q- 1 b r o a d c a s t A (k) t,1 t o all o t h e r
26
27 28 29 30
else t 4- L(i - i)/NJ + 1 r e c e i v e A~,k) f r o m .
31
A (k) 4- MONIC-MULTIPLIER(AIk))A!kl 9 ,1 3
32 33
V,(~) 4- MONIC-MuLTIPLIER(AIk~)V,(,] ) return
( A (k) , V (k))
Algorithm
2. Parallel reduction of off-diagonal entries
f r o m t o p t o b o t t o m . T h e m a c r o r e d u c t i o n o r d e r is s i m i l a r t o t h e o r d e r of C h o u C o l l i n s , w h i l e t h e m i c r o r e d u c t i o n o r d e r is s i m i l a r t o the s t a n d a r d r e d u c t i o n order. F i g u r e 2 shows t h e r e d u c t i o n o r d e r of the s t a n d a r d m e t h o d (a), h y b r i d w i t h b = 2 (b) a n d for C h o u - C o l l i n s (c), w h e r e the c o n s e c u t i v e n u m b e r s in t h e s q u a r e s d e s c r i b e the o r d e r of r e d u c t i o n .
826 The reduction order of Chou-Collins needs smaller degrees than the standard reduction order, since the method of Chou-Collins uses only reduced column vectors of the input matrix for reducing other columns. On the other hand the standard order needs less communication operations (broadcasts) which we will show in the next section.
* 3* 5 6 * S 910 * b (a)
:
4* 56* 7 8 1 *] 9'10 2 31" : b
' b (b)
I
7* 84* 9 5 2 *1 106 3 l I* iblblblbl (c)
Fig. 2. Reducing of diagonal entries: (a) Standard, (b) hybrid, (c) Chou-Collins
4
Analysis
and
Performance
Examples
L e m m a 1. Given a matrix A C Mat.~x,~(1F[x]) with r := rank(A) and 1 < b < r - 1. Let q := [r - 2/bJ. Computing the Hermite normal form with Algorithm 1 needs at most .~ + q(q + 1)4 + r 2 broadcasts if the implementation of COMPUTE-GCD needs no communication. Pro@ Algorithm 1 needs m broadcasts at most. The repeat loop in Algorithm 2 is executed q + 1 times. In the last run of this loop f is equal to 1 and it does r - 1 broadcasts. In the ith run for 1 < i < q f is equal to r - ib and it does r - f = ib broadcasts. There is one further broadcast after the loop, and we get q
m+r-
l +(Eib)+ i=1
l=m+
q ( q + l------~)b+r 2 []
broadcasts.
For a m • n m a t r i x with rank r Algorithm 1 needs m + r broadcasts if b > r - 1. We could reduce the number of broadcasts for b ~ r - 1. Algorithm -
3 We modify Algorithm 1 in the following way.
Replace Line 7 by b r o a d c a s t Alk,) t o all o t h e r
827 Replace Line 10 by r e c e i v e A~,k: f r o m 7rg - Insert the following lines after Line 13: -
A(k,) <-- MONm-MULTIPLmR(Afkd)A!k,~) V,(kr) /--- MONIC-MULTIPLIER(A},kr))V,(,k r) for j(--1 tot-- 1 do a(k)/~(k) q ~-- "~t,j /~ "t,r
A *,j (k) +_ A *,j (k) _ "/ , A *;~ (k) - R e m o v e Line 15.
L e m m a 2. Algorithm 3 needs at most m broadcasts f o r computing the H e r m i t e normal f o r m and a transforming matrix of a m x n input matrix,
It seems that Algorithm 3 is faster than Algorithm i with b < r - 1 in practice where r is the rank of the input matrix. We give some running time examples showing that Algorithm 3 needs larger degrees than Algorithm i with b = 1 for the HNF computation. We have implemented the Mgorithms in C / C + + on the IBM SP2 at the GMD in Sankt Augustin. We have used the xlC compiler and the message passing library MPL (both IBM products). We have used the Sorting-GCD algorithm, due to Majewski-Havas [12], for the implementation of the Compute-GCD function, where we used a heap for determining the polynomial with the largest degree in a subvector. The input matrix is a random 80 x 80 matrix over F2[x], and the degree of the entries are less than or equal to 80. The rank of this matrix is r = 80. Table 1 shows the results of our measurements. We have used 8, 16 and 24 nodes of the SP2. The rows in Table 1 have the following meaning: b-- the value of the parameter b for Algorithm 1. For b = 80 we have taken Algorithm 3. max, degree: The largest degree of all polynomials which occurred during the computation. This value is independent of the number of processors. N = x: Running-time in minutes:seconds.hundredth seconds on the SP2 using x processors. We have used the high performance switch and us-protocol for the communication. tl/tb: The fraction of the running time for the b in this column and the running t{me for b = 1. For b = 1 Algorithm 1 runs faster than for every other b > 1. The relative running times t b / t l become better if the number of processors become larger (with exceptions for b E {2, 4}), since the time for a broadcast depends on the number of processors.
5
Producing Small Multipliers
For every nonzero A E Matm• (F[x]) we denote the largest degree of the entries of A as IIAI] := m a x { d e g ( A i , j ) I 1 < i < m A 1 <_ j < n} .
828 T a b l e 1. Run-times for Algorithm 1 for a 80 x 80 matrix over ]F2[x] depending on parameter b and the number of processors. b 1 max. degree 9965 N=8 2:11.77 1.00 tb/tl N=16 1:36.01 tb/tl 1.00 N=24 1:25.75 1.00 tb/tl
2
4
8
17
12571 2:36.02 1.18 1:46.99 1.11 1:37.37 1.14
13129 3:19.45 1.51 2:09.78 1.35 1:57.04 1.36
19306 4:29.39 2.04 2:46.17 1.73 2:27.46 1.72
27227 6:06.97 2.78 3:40.84 2.30 3:12.58 2.25
20 29009 6:52.61 3.13 4:05.44 2.56 3:34.04 2.50
40 36802 8:47.29 4.00 5:08.17 3.21 4:27.68 3.12
60 8O 40989 43486 9:36.34 11:40.47 4.37 5.32 5:44.60 7:04.60 3.59 4.42 4:56.81 5:58.80 3.46 4.18
3. Given matrices A, H E M a t m x n ( F [ x ] ) \ { 0 } where H : = H N F ( A ) . A multiplier V E G L ~ ( F [ x ] ) satisfying A V = H is unique if and only i f r a n k ( A ) = n. In this case [[V]] _< ( n - 1)IIAll. Lemma
Proof. Let 1 _< il < i2 < . . . < is _< m be the row indices of t h e p s e u d o d i a g o n a l entries of A, a n d let A t , H ' E Mat~(]F[x]) be defined as A t j,, : = A t . , j , , a n d H t j , , : = H t ,"j , , for 1 _< j _< n, respectively. N o t e t h a t H ~ is in H e r m i t e n o r m a l f o r m a n d is right equivalent to A t, so H' = H N F ( A ' ) . T h e n d : = d e t ( A ' ) r 0, a n d it is easy to prove t h a t
Iladj(A')[[ ~ (n - 1)IIA'[I < (n - 1)IIAll where A~ E M a t ~ ( F [ x ] ) is t h e adjoint matrix of A ~. Let V E G L n ( F [ x ] ) a t r a n s f o r m i n g m a t r i x w i t h A ' V = H'. Since
dI~ = A'adj(A') ~
d I ~ H ' = A'adj(A')H' r
H ' = A' ( a d j ( A ' ) H ' )
a n d H / as well as a d j ( A ' ) are unique V m u s t be unique, a n d t h u s IIVI I < _
adj(A')H'
___ (n -
])IIA'II
_< (n -
1)IIAII
A V = H b e c a u s e t h e uniqueness of H . S u p p o s e t h e r e exists a n o t h e r V t E GL~(]F[x]) t h e n H' = A ' V ' b u t this is a c o n t r a d i c t i o n to t h e uniqueness of t h e m u l t i p l i e r V for A~V = H t. [] In t h e case r a n k ( A ) < n the l a r g e s t degree IIVII of a m u l t i p l i e r m a t r i x V can be a r b i t r a r y large. We will m o d i f y G a u s s i a n e l i m i n a t i o n for n o t f u l l - r a n k i n p u t matrices producing small multipliers. Kaltofen, Krishnamoorthy and Saunders [10] e x t e n d t h e i n p u t m a t r i x by the n x n i d e n t i t y m a t r i x to o b t a i n a f u l l - r a n k m + n x n m a t r i x . W e will a p p l y this t e c h n i q u e to our a l g o r i t h m s . Algorithm
plier V
4 Input: A E Matm•
Output: H : = H N F ( A ) and a multi-
829
1. S e t B : : ( i A) 2. Compute G : = H N F ( B ) without computing a transforming matrix using either Algorithm 1 or 3 3. Set H to the first m rows of G and set V to the last n rows of G. A l g o r i t h m 4 c o m p u t e s the H N F of a full-rank m a t r i x even if the i n p u t m a t r i x is not full-rank. Matrix B : = (A) E Mat~+~x~(1F[x]) has an unique multiplier
V ~ GL~(F[x]) satisfying IJvJI _< (n-1)[fNI = (n--1)]FArl for nonzero A which is equal to the returned V. A m o r e accurate analysis leads to the following L e m m a . L e m m a 4. Algorithm ~ computes the Hermite normal form of a matrix A E Mat,~x,~(F[x]) with rank r and a multiplier V E GL~(1F[x]) satisfying
IlVll _< rain{r, n
--
1}IIAII.
Proof. If rank(A) = n we can apply L e m m a 3. Otherwise, w i t h o u t loss of A and progenerality let A 5~ 0, and set B e Mat,~+~x~(F[x]) to B : = (I~), ceed with B as with A in the p r o o f of L e m m a 3. T h e full-rank s u b m a t r i x B ' C Mat~(F[x]) of B m u s t contain at least n - r identity vectors, and thus we get Iladj(B')l I <__rilB'll <_ rilAi[ which completes the proof. [] In Table 2 we give some running t i m e examples for A l g o r i t h m 1 and 4.
Table 2. Running times for Algorithm 1 and 4 for b = 1. Algorithm mxn tlAII 1O 1 500 x 500 1 500 x 500 10 1 512x512 8 1 [0OOx 1000 2
1 4
1 4
6
rank 500 500 512 1000
N 16 32 16 32
128x256
32
128
8
250 x 260
32
250
16
Conclusions
Time 29:35.26 19:00.96 11:47.61 34:50.45 4:34.64 8:39.56 31:39.07 15:32.74
max. degree 7763 7763 161 3117 9200 6395 49400 12457
[IVI[ 4987 4987 55 1997 9200 4092 49400 8000
and Acknowledgements
We have described parallel i m p l e m e n t a t i o n s for Hermite n o r m a l f o r m c o m p u t a tions and we have studied a technique for c o m p u t i n g small multiplier matrices when the input m a t r i x is not full-rank. T h e r u n n i n g time examples in Table 1 d e m o n s t r a t e t h a t we can o b t a i n better p e r f o r m a n c e if we allow higher c o m m u n i c a t i o n costs. We have observed this effect for other examples, too. Algorithm 4 has a good upper b o u n d for the degrees
830 of the computed transforming matrix, and we have seen that this Mgorithm provides very good transforming matrices in practice. We have used earlier implemented sequential gcd algorithms for our parallel implementations. This reuse of sequential code where really well supported by the usage of computation rows. The author is very grateful to the GMD, Sankt Augustin, Germany for offering CPU time on their IBM SP2 installation. Special thanks go to George Havas for his corrections and suggestions.
References 1. W. A. Blankinship. A new version of the Euclidian algorithm. Amer. Math. Monthly, 70(9):742-745, 1963. 2. W. A. Blankinship. Matrix triangulation with integer arithmetic. Comm. ACM,
9(7):513, 1966. 3. G. H. Bradley. Algorithms and bound for the greatest common divisor of n integers and multipliers. Comm. ACM, 13:447, 1970. 4. G. H. Bradley. Algorithms for Hermite and Smith normal matrices and linear Diophantine equations. Math. Comp., 25(116):897-907, 1971. 5. T. W. J. Chou and G. E. Collins. Algorithms for the solution of systems of linear Diophantine equations. SIAM J. Comput., 11(4):687-708, 1982. 6. G. D. Forney. Convolutional codes h Algebraic structure. IEEE Trans. Inform. Theory, IT-16(6):720-738, 1970. 7. G. Havas and B. S. Majewski. Extended gcd calculation. Congr. Numer., 111:104114, 1995. 8. G. Havas, B. S. Majewski, and K. R. Matthews. Extended gcd algorithms. Technical Report TR0302, The University of Queensland, Brisbane, 1995. 9. G. Havas and C. Wagner. Matrix reduction algorithms for euclidean rings. In Proc. of the 3rd Asian Symposium Computer Mathematics, to appear. 10. E. Kaltofen, M. S. Krishnamoorthy, and B. D. Saunders. Parallel algorithms for matrix normal forms. Linear Algebra Appl., 136:189-208, 1990. 11. R. Kannan and A. Bachem. Polynomial algorithms for computing the Smith and Hermite normal forms of an integer matrix. SIAM J. Comput., 8(4):499-507, 1979. 12. B. S. Majewski and G. Havas. A solution to the extended gcd problem. In ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), pages 248-253. ACM Press, 1995. 13. C. C. Sims. Computation with Finitely Presented Groups. Cambridge University Press, 1994. 14. C. Wagner. Normalformberechnung yon Matrizen iiber euklidischen Ringen. PhD thesis, Institut fiir Experimentelle Mathematik, Universits Essen, 1997. Published by Shaker-Verlag, 52013 Aachen/Germany, 1998. 15. C. Wagner. Hermite normal form computation over Euclidean rings. Interner Bericht 71, Rechenzentrum der Universit~t Karlsruhe, 1998. submitted for publication.
Achieving Portability and Efficiency Through Automatic Optimisation: An Investigation in Parallel Image Processing D. Crookes1, P.J. Morrow2, T.J. Brown1 , G. McAleese2 , D. Roantree2 , and I.T.A. Spence1 1
Department of Computer Science, The Queen’s University of Belfast, Belfast BT7 1NN, UK 2 Department of Computing Science, University of Ulster at Coleraine, Coleraine, BT52 7EQ, UK
Abstract. This paper discusses the main achievements of the EPIC project, whose aim was to design a high level programming environment with an associated implementation for portable parallel image processing. The project was funded as part of the EPSRC Portable Software Tools for Parallel Architectures (PSTPA) programme. The paper summarises new portable programming abstractions for image processing, and outlines the automatically optimising implementation which achieves portability of application code and efficiency of implementation on a closely coupled distributed memory parallel system. The paper includes timings for optimised and unoptimised versions of typical image processing algorithms; it draws the main conclusion that it is possible to achieve portability with efficiency, for a specific application, by adopting a high level algebraic programming model, together with a transformation-based optimiser which reclaims the loss of efficiency which an algebraic approach traditionally entails.
1
Introduction
The need for portability of program code for parallel systems is an important but elusive goal. For instance, in an anecdotal keynote address to a recent European conference on parallel computing a distinguished American visitor from a major research facility recounted how the parallel supercomputers which his laboratory uses are renewed typically every three years. After installation, some two years is then spent modifying the suite of programs which researchers at the laboratory use so that they will run on the new hardware. Useful computing can then proceed for a period of about a year, before the machines are replaced again by the next generation; and so the cycle of events repeats. His purpose in relating these events was to highlight the relative difficulty in producing parallel software and porting it between machines. Software for parallel computers is often tailored to the architecture of particular machines in order to deliver optimal performance, particularly in the case of distributed memory machines. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 102–112, 1998. c Springer-Verlag Berlin Heidelberg 1998
103
On the other hand, this complicates the problem of constructing programs and inhibits their portability once constructed. The EPIC project has been undertaken to consider to what extent automatic optimising software tools can take a high level, portable algorithm description, and generate parallel code whose efficiency on a distributed machine rivals than of hand-tuned parallel code. Because of the difficulty of such a problem in a general purpose programming environment, the project has deliberately chosen an application specific approach - in our case, the domain of image processing. The result is the EPIC environment: a high level, portable image processing programming environment, with an implementation which automatically generates very efficient code running on distributed memory parallel systems[1,2]. More specifically, the key objectives of the EPIC project were: 1. To identify a set of programming abstractions for image processing. These abstractions constitute the core of the Extensible Parallel Image Coprocessor (EPIC) model. In the long term, we also wished to see how far these abstractions could be pushed towards general purpose programming abstractions. 2. To construct an (object-oriented) application development environment based on the EPIC model which enables users to construct portable image processing applications. 3. To develop a rule based transformation system which generates parallel code, for distributed memory machines, whose efficiency rivals that of hand tuned parallel code. In the rest of this paper we first summarise the new image processing programming abstractions which were developed at the outset of the project, based on extensions to Image Algebra[3]. We then describe the EPIC application development environment in general, before looking at the off-line optimiser in particular. We present performance figures illustrating the benefit which optimisation brings for a number of simple algorithms and for the three architectures on which the system has now been implemented. Finally we draw conclusions and review the scientific insights which we have gained.
2
EPIC Programming Abstractions for Parallel Image Processing
It is known that efficiency on distributed memory parallel architectures benefits from the predictability and locality of data references. For low level image processing applications this is readily provided by the neighbourhood based operations which are typically used. For these forms of operation one can identify easily any data communication needs associated with an updating operation from the neighbourhood itself. Thus an application specific neighbourhood based programming model offers an advantage over more general purpose language notations where operations would typically be defined in terms of loop constructs and subscripting, from which the extraction of communication requirements is in general non-trivial.
Another advantage of a neighbourhood based programming model is that it can readily be supported by the development of a high level algebraic notation. This has been done for image processing by the development of Image Algebra[3] whose basic operations, plus image and template data types, form the starting point for the EPIC programming notation. Our programming abstractions for image processing have been developed in two stages:
1. Because neighbourhood processing is particularly suited to implementation on parallel machines (since data access patterns are predictable), the basic operations and concepts of Image Algebra were used as the basis of our programming abstractions, with minor extensions.
2. To continue to exploit the locality property of neighbourhood processing, we extended considerably the concept of a neighbourhood. For instance, we introduced sets and sequences of neighbourhoods. Together with variant neighbourhoods of Image Algebra, these novel abstractions give a very powerful and flexible notation which is still nevertheless capable of parallel implementation.
As a result, we can now express various complete image transforms such as the Hadamard transform at a high level[4], and obtain an implementation whose performance is comparable with hand coding. We can also express various kinds of geometric transformation, including downsampling and upsampling operations of the kind used in some forms of wavelet transform, and data permutation operations such as perfect shuffle or the bit reversal permutation which is a component of many standard image transforms such as the FFT. Evaluation of the abstractions was carried out by re-engineering a number of existing (sequential) systems. One large one (on calculating optical flow) has highlighted the need for extending the abstractions to include 3-D and video image processing applications. (The latter is currently the subject of more recent work at QUB[5].) In trying to extend the applicability and expressive power of the above abstractions beyond the domain of image processing, we have insisted on retaining the basic concept of neighbourhood processing (as in (2) above). We have investigated several standard numerical algebra problems, and found that, with facilities for building new compound operators from a group of primitives, sets and sequences, we can code algorithms for problems such as matrix multiplication, LU factorisation, and finding the transitive closure of directed graphs. This work has not yet been reported in detail, but results to date indicate that a number of numerical problems can be expressed in terms of neighbourhood processing; whether it is natural to do so needs further investigation.
3
Implementation and the Need for Optimisation
Although the kind of high level, algebraic notation outlined above is very convenient as a programming notation, there are some forms of inefficiency which necessarily arise from a straightforward library-based implementation approach. There are two main reasons for this:
– The fact that special cases, in which an operand's value is known and could be exploited manually, cannot be fully exploited, because the implementations of the high level operations must of necessity be general purpose.
– The fact that implementing expressions involving compound operations will frequently involve replication of overheads.
As a simple example of the first problem, consider a convolution operation between an image and a template (a weighted window used in neighbourhood operators). This operation is normally provided as a general purpose routine applicable to any pair of operands. However it is frequently the case that the template may contain weights of zero or one. Using a general purpose routine will therefore entail unnecessary arithmetic operations. The second problem arises with any expression which involves compound operations using image operands. Each individual operation will normally be implemented by a routine which traverses the image domain (for example using a double loop construct). When a compound expression is implemented this overhead is replicated for each individual operation, leading to a significant loss in efficiency. Of course, a programmer interested primarily in efficiency would manually write additional routines for all the above cases; but this would require operating at a lower level at which the parallelism and communication would be visible, thus reducing portability. The essence of the EPIC environment is that it automatically generates these additional routines, and links them into the runtime environment; in this way it gives the same efficiency as a good programmer, but allows the application developer to continue to operate at the highest level. The environment is based on a rule-based optimiser which aims to apply the same reasoning steps which a programmer would apply manually in generating efficient, tailored versions of the standard routines.
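To make the special-case overhead concrete, the sketch below contrasts a general-purpose convolution routine with a hand-specialised version for a fixed 3x3 template containing only 0/1 weights. The data layout, names and sizes are our own illustrative assumptions and not the actual EPIC library interface; the specialised routine is the kind of code a performance-minded programmer would write by hand and that the optimiser is intended to generate automatically.

    /* Illustrative sketch only; not the actual EPIC routines.
     * General convolution: every weight is multiplied, even when it is 0 or 1. */
    void conv_general(const float *in, float *out, int w, int h,
                      const float *tpl, int tw, int th)
    {
        int cx = tw / 2, cy = th / 2;
        for (int y = cy; y < h - cy; y++)
            for (int x = cx; x < w - cx; x++) {
                float acc = 0.0f;
                for (int ty = 0; ty < th; ty++)
                    for (int tx = 0; tx < tw; tx++)
                        acc += tpl[ty * tw + tx] *
                               in[(y + ty - cy) * w + (x + tx - cx)];
                out[y * w + x] = acc;
            }
    }

    /* Specialised version for a 3x3 cross-shaped template of 0/1 weights:
     * zero weights are dropped and unit weights need no multiplication. */
    void conv_cross_3x3(const float *in, float *out, int w, int h)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++)
                out[y * w + x] = in[(y - 1) * w + x] + in[(y + 1) * w + x] +
                                 in[y * w + x - 1] + in[y * w + x + 1];
    }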
4
EPIC : A Portable Application Development Environment
The second objective defined above was the implementation of an object oriented application development environment for parallel image processing based on the programming abstractions defined within the framework of the EPIC model. We have developed a sophisticated environment, providing C++ as the user's language, with the EPIC abstract machine provided as a range of C++ methods[6,7]. In developing the EPIC environment, we have proposed and integrated several ideas which we believe could have longer term benefits for the design of future systems of this kind. The more significant ideas (or developments of previous ideas and approaches) which appear to us to have particular relevance and importance in this area of the project are now considered.
An Extensible Abstract Machine
The traditional approach to achieving portability across architectures is to define an abstract machine with an instruction set matching the application. In our
case, we provide an image coprocessor with an instruction set which supports our Image Algebra-based programming abstractions. This results in a static instruction set (like a library). However this traditional approach results in inefficiencies which can sometimes make the whole approach less attractive to real application developers. As discussed above, these inefficiencies arise typically for two reasons:
1. Special cases of instructions are often implemented more efficiently by manual programmers, and
2. The array processing overheads for compound instructions are usually cumulative.
We have developed an architecture for an abstract machine in which these inefficiencies can be avoided while retaining the elegance and portability of the abstract machine model. We have proposed the concept of an extensible abstract machine, in which the instruction set has two parts:
1. The basic, static instruction set, and
2. A set of additional, optimised instructions which avoid the inefficiencies mentioned above.
These additional instructions are program-specific, and implement special cases and compound instructions as efficiently as manual coding. As indicated below, these extended instructions are generated and called automatically.
A Self-Optimising Abstract Machine
To enable the programmer to continue to use the static EPIC algebraic programming abstractions, and thus gain clarity, portability and conciseness, while at the same time gain the performance benefits of the types of optimisation mentioned above, we have developed an off-line optimiser which carries out the optimisation and generation task which a performance-minded programmer would carry out. The EPIC environment automatically builds new, optimised instructions which are then linked in to the dynamic, extended part of the EPIC instruction set. This involves the following program execution behaviour:
1. The first time a program is run, it does so using only the static instruction set. But at the same time, syntax trees for compound operations and special cases are dumped to file.
2. Off-line, these syntax trees are transformed into new, optimised routines, and linked into the EPIC coprocessor's extended instruction set.
3. On subsequent execution, the extended instructions are automatically used, with no user input. The only visible difference will be the increase in execution speed.
To demonstrate how these ideas are provided by the EPIC environment, the next section presents the architecture of the EPIC system itself.
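The following sketch illustrates one way the two-part instruction set could be realised at dispatch time. The table layout, the key (shown here as an opcode or a hash of a statement's syntax tree) and the function names are assumptions made for the example, not the actual EPIC implementation; the point is only that an extended, program-specific instruction is preferred when one exists, and the general built-in routine is used otherwise.

    #include <stddef.h>

    #define MAX_BUILTIN  256
    #define MAX_EXTENDED 256

    typedef void (*epic_instr_fn)(void *args);   /* hypothetical handler type */

    struct instr_entry {
        int           key;        /* opcode, or a hash of the statement's syntax tree */
        epic_instr_fn handler;
    };

    /* Static built-in instruction set plus the dynamically extended table. */
    static struct instr_entry builtin[MAX_BUILTIN];
    static struct instr_entry extended[MAX_EXTENDED];
    static size_t n_builtin, n_extended;

    static epic_instr_fn lookup(const struct instr_entry *t, size_t n, int key)
    {
        for (size_t i = 0; i < n; i++)
            if (t[i].key == key)
                return t[i].handler;
        return NULL;
    }

    /* Prefer an optimised extended instruction; otherwise fall back to the
     * general-purpose built-in routine, as on the first run of a program. */
    void dispatch(int key, void *args)
    {
        epic_instr_fn fn = lookup(extended, n_extended, key);
        if (!fn)
            fn = lookup(builtin, n_builtin, key);
        if (fn)
            fn(args);
    }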
The EPIC System Architecture
The EPIC application development environment has three principal components. These are:
1. The C++ class library which forms the user's application programming environment.
2. The parallel coprocessor which is installed on the parallel machine.
3. The off-line optimiser which is the rule-based transformation system used to generate extended instructions from combinations of the basic programming abstractions.
The optimiser is described in more detail in the next section of this paper, but it forms an integral part of the overall system architecture. The overall system architecture is illustrated in Figure 1 below.
Fig. 1. Overall System Architecture (the user's program and C++ class library on the sequential host produce syntax expressions and an instruction stream; the parallel coprocessor comprises a controller process and worker processes, each with a variable table, evaluation stack, control logic and tables of built-in and extended instructions; the off-line optimiser contains the transformation system and code generator)
The Parallel Coprocessor
The parallel coprocessor consists of a controller process and a set of worker processes. Images are geometrically decomposed into horizontal strips or vertical columns. Dynamic redistribution at runtime is supported.
The worker processes perform the actual computations on image data. Each worker process has the internal structure illustrated in the inner box (Figure 1), with a control module, a variable table storing information about program variables including pointers to their data, an evaluation stack, and two tables of instructions. The first is the table of built-in instructions, and the second is the table of extended instructions generated by the optimiser. The control logic can handle the implementation of either form of instruction. The coprocessor is implemented in C, using MPI for communications. In fact, only six MPI routines are used. On the C40 system, we wrote our own MPI routines, making sure they implement only the minimum necessary functionality (which proved a significant performance advantage).
Portability of the System
Our initial goal was to base the EPIC implementation on a network of Texas Instruments C40 processors. We have not only successfully met this objective, but we have also ported the implementation to two other parallel architectures. These are: a quad-processor Unix workstation, and a network of Pentium PCs running Linux. The application development environment is fully operational on all three architectures.
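The paper does not name the six MPI routines the coprocessor relies on, so the skeleton below is only a guess at what such a minimal worker might look like: a controller on rank 0 sends small instruction descriptors, and each worker loops on a receive until it is told to halt. Everything here (tags, the four-integer descriptor, the routine selection) is an assumption for illustration; the point is that a layer this thin is easy to re-implement directly for a specific machine such as the C40 network.

    #include <mpi.h>

    #define TAG_INSTR 1
    #define TAG_HALT  2

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* A real controller would stream instruction descriptors here;
             * this sketch just tells every worker to halt. */
            int halt[4] = {0};
            for (int w = 1; w < size; w++)
                MPI_Send(halt, 4, MPI_INT, w, TAG_HALT, MPI_COMM_WORLD);
        } else {
            int instr[4];                /* opcode + operand ids (assumed) */
            MPI_Status st;
            for (;;) {
                MPI_Recv(instr, 4, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_HALT)
                    break;
                /* execute the built-in or extended instruction on the local
                 * image strip; border swaps would use point-to-point
                 * MPI_Send/MPI_Recv with neighbouring workers. */
            }
        }
        MPI_Finalize();
        return 0;
    }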
5
The Off-Line Optimiser
The EPIC environment includes a rule-based transformation system which can generate new optimised instructions from arbitrary compositions of the built-in instructions. This component is the off-line optimiser shown in Figure 1 and it is in reality an integral feature of the overall system architecture. The off-line optimiser takes the syntactic representations generated by execution of the user's program on the sequential host and generates optimised parallel code to implement the complete operation defined in each statement. The optimised routine, after compilation and linking, becomes a new instruction on the parallel coprocessor. The optimiser is a rule driven system written in Prolog. It uses optimisation techniques which are not in themselves new. For the most part these are loop combining, loop unrolling and loop interchanging, along with the elimination of redundant arithmetic. Several strategies are employed to identify the most appropriate path to follow based on the contents of the syntax graph. This includes trial and error approaches when an obvious course is not evident from the nature of the syntax graph. A pattern-matching approach is used to find the best transformation applicable. The output is initially in the form of an object code syntax graph. A code generation module is then used to generate C code with calls to library routines for such purposes as border swapping. The generated optimised instructions are independent of the number of available processors. Direct calls to MPI, which provides the communication framework, are also embedded in the code.
The capabilities built into the optimiser enable it to handle any composition of the operations provided within the class library, including operations nested to any depth. Also provided is the ability to handle the syntactic representations of variant template operations, including sequences of variant templates.
6
Performance of the EPIC Optimiser
As an illustrative case, consider the example program statement of the form:
EdgeImage = ((absconv(Im,Sv) + absconv(Im,Sh)) > Threshold)*255;
This performs a simple edge detection operation when Sv and Sh are vertical and horizontal gradient operators such as the Sobel operators. When we execute a statement like this without optimisation, a total of 5 of the basic instructions will be needed, each entailing a pass over the image domain. Optimisation can produce a new instruction requiring only one traversal of the image domain to perform the whole operation. Our initial studies showed that a sequential unoptimised version would run up to nearly 5 times slower than a hand-optimised version for statements of this level of complexity (and a factor of around 4.5 on a parallel system). Trials using the EPIC optimiser have now been performed on all three of the architectures on which the EPIC system is implemented. The results are tabulated (Table 1) for two such operations, namely a top hat filter operation and the edge detector example shown above. Times are in seconds and the image size in each case is 256*256.

Table 1. Execution times (seconds)
                               Top Hat Filter   Edge Detector
C40 - 4 Workers
  Before optimisation          0.62             0.39
  After optimisation           0.13             0.10
  Improvement                  4.8 fold         3.9 fold
C40 - 2 Workers
  Before optimisation          1.12             0.47
  After optimisation           0.17             0.16
  Improvement                  6.6 fold         2.93 fold
PC-Linux System
  Before optimisation          1.83             0.94
  After optimisation           0.75             0.53
  Improvement                  2.44 fold        1.77 fold
Quad-Processor Workstation
  Before optimisation          2.87             1.95
  After optimisation           1.74             1.36
  Improvement                  1.65 fold        1.43 fold
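As a sketch of what the generated instruction for the edge-detection statement above could look like, the routine below folds the two absolute convolutions, the addition, the threshold test and the scaling into a single pass over the image. The Sobel templates are written out inline, and the function name and interface are illustrative assumptions, not the code actually emitted by the optimiser.

    #include <math.h>

    /* Hypothetical fused instruction for
     *   EdgeImage = ((absconv(Im,Sv) + absconv(Im,Sh)) > Threshold)*255;
     * one traversal of the image domain instead of five. */
    void edge_detect_fused(const float *im, float *edge, int w, int h,
                           float threshold)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                float gv = (im[(y-1)*w + x+1] + 2*im[y*w + x+1] + im[(y+1)*w + x+1])
                         - (im[(y-1)*w + x-1] + 2*im[y*w + x-1] + im[(y+1)*w + x-1]);
                float gh = (im[(y+1)*w + x-1] + 2*im[(y+1)*w + x] + im[(y+1)*w + x+1])
                         - (im[(y-1)*w + x-1] + 2*im[(y-1)*w + x] + im[(y-1)*w + x+1]);
                edge[y*w + x] = (fabsf(gv) + fabsf(gh) > threshold) ? 255.0f : 0.0f;
            }
    }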
These are of course quite short programs but they do illustrate that on the C40 implementation we are seeing impressive performance improvements from the optimisation. The overheads incurred by the host processor (tree building, tree matching and host-coprocessor communication) are harder to assess. However, on a selection of tests with a (slow!) SUN host and a C40-based coprocessor, the host overheads reduced overall system performance by between 10% and 35%. This demonstrates another significant finding, namely the importance of a fast, closely-coupled host to a parallel coprocessor. It is clear from all our experiments that workstation clusters are going to struggle to obtain parallel efficiency for image processing, because of the communication overheads and process switching times. On such systems, the EPIC optimisation capability will probably not give sufficient speedup to make its use worthwhile (although recent work on lightweight messaging [8] could improve the situation somewhat). However, the C40 system gives very promising results, and the EPIC approach definitely pays off. This is largely because the C40 system is much more closely coupled, and because the implementation of the MPI communication routines is architecture-aware. This is an important lesson for other implementors.
7
Conclusions
The EPIC project has demonstrated the main thesis of our approach to achieving portability with efficiency, namely that in this problem domain:
1. Portability over parallel architectures can be achieved by adopting an algebraic, application-specific programming model, and by retaining the traditional implementation concept of an abstract coprocessor machine - but in a more sophisticated form.
2. The inherent loss of efficiency arising from a standard abstract machine-based approach to implementing an algebraic model can be recovered, by developing a rule-based optimiser which generates the equivalent of new machine instructions dynamically, thus giving an extensible instruction set for the abstract coprocessor.
As part of the project, we have therefore developed a powerful self-optimising tool which demonstrates the feasibility of automatically generating new operations from special cases or compositions of library operations, and have provided a transformation system for program optimisation. An efficient implementation of EPIC has been developed for C40 networks and portability has been achieved through the use of an MPI communications layer. In addition the system has been ported to a quad-processor Unix workstation and a PC-based Linux network. The work described in this paper has contributed to the ongoing debate on how to achieve portability with efficiency, though very much from an application specific viewpoint. The main contributions to this debate are:
– The concept of an extensible abstract machine, for obtaining both portability and efficiency.
– New sophisticated neighbourhood-based abstractions, which are proving capable of expressing some algorithms from other application areas (e.g. numerical problems), and which guarantee parallel efficiency.
– Automatic identification of extended instructions has been achieved, though we are undecided at this stage as to whether or not it is advisable (the programmer could identify these manually).
– The novel concept of a self-optimising machine, made possible by automatic extension of the instruction set by the system generating, linking and calling extended instructions on subsequent program executions.
– Portability across different distributed memory platforms, using a small subset of the MPI interface. For full efficiency, we recommend rewriting these MPI routines for a specific configuration, rather than rely on a general-purpose MPI implementation.
– For image processing applications at least, the benefits of automatic optimisation are most apparent on a closely coupled network of processors. They are not so apparent on a loosely coupled cluster with a general purpose MPI implementation.
– At the highest level, possibly our most significant achievement is to enable programmers to continue to operate using a set of high level primitives, and hence retain application portability, by reclaiming the usual inherent loss of efficiency through sophisticated optimising software tools. It is this concept which should be applicable to a wide range of other application domains.
Acknowledgements This work has been supported by the UK EPSRC within the framework of the PSTPA (Portable Software Tools for Parallel Architectures) initiative. The support received from Transtech Parallel Systems is also gratefully acknowledged.
References
1. Crookes, D., Brown, J., Dong, Y., McAleese, G., Morrow, P.J., Roantree, D. and Spence, I.: A Self-Optimising Coprocessor Model for Portable Parallel Image Processing. Proc. EUROPAR'96. Springer-Verlag (1996) 213-216
2. Crookes, D., Brown, J., Spence, I., Morrow, P., Roantree, D. and McAleese, G.: An Efficient, Portable Software Platform for Parallel Image Processing. Proc. PDP'98, Madrid (1998)
3. Ritter, G.X., Wilson, J.N. and Davidson, J.L.: Image Algebra: An Overview. Computer Vision, Graphics and Image Processing 49 (1990) 297-331
4. Crookes, D., Spence, I.T.A. and Brown, T.J.: Efficient Parallel Image Transforms: A Very High Level Approach. Proc. 1995 World Transputer Congress, IOS Press (1995) 135-143
5. Hill, S.J., Crookes, D. and Bouridane, A.: Abstractions for 3-D and Video Processing. Proc. IMVIP-97, Londonderry (1997)
6. Morrow, P.J., Crookes, D., Brown, J., Dong, Y., McAleese, G., Roantree, D. and Spence, I.: Achieving Scalability, Portability and Efficiency in a High-Level Programming Model for Parallel Architectures. Proc. UKPAR'96. Springer-Verlag (1996) 29-39
7. Morrow, P., Roantree, D., McAleese, G., Crookes, D., Spence, I., Brown, J. and Dong, Y.: A Portable Coprocessor Model for Parallel Image Processing. European Parallel Tools Meeting. ONERA, Chatillon, France (1996)
8. Nicole, D.A. et al.: High Performance Message Passing under Chorus/Mix using Java. Department of Electronics and Computer Science, University of Southampton (1997)
Optimising Parallel Logic Programming Systems for Scalable Machines
Vítor Santos Costa 1 and Ricardo Bianchini 2
1 LIACC and DCC-FCUP, 4150 Porto, Portugal
2 COPPE Systems Engineering, Federal University of Rio de Janeiro, Brazil
Abstract. Parallel logic programming (PLP) systems have obtained good performance on traditional bus-based shared-memory architectures. However, the scalable multiprocessors being developed today pose new challenges. Our experience with a sophisticated PLP system, Andorra-I, demonstrates that indeed performance suffers greatly on modern architectures. In order to improve performance, we perform a detailed analysis of the cache behaviour of all Andorra-I data structures via execution-driven simulation of a DASH-like multiprocessor. Based on this analysis we optimise the Andorra-I code using 5 different techniques. Our results show that the techniques provide significant performance improvements, leading to the conclusion that PLP systems can and should perform well on modern scalable multiprocessors.
1
Introduction
Parallel computers can improve performance of both numerical and symbolic applications. Logic programs are good examples of symbolic applications that often exhibit large amounts of implicit parallelism and that can greatly benefit from parallel computers. Several PLP systems have been developed so far and have obtained good performance for traditional bus-based shared-memory architectures. However, the scalable multiprocessors being developed today pose new challenges, such as the high latency of memory accesses and the demand for scalability. The complexity of PLP systems and the large amount of data they process raise the issue of whether PLP systems can obtain good performance on these new parallel architectures. In order to address this issue, we experiment with a sophisticated PLP system, Andorra-I [8], that exploits both dependent and-parallelism and or-parallelism. Andorra-I is a particularly interesting example of a PLP system, since most of its data structures are similar to the ones of several other PLP systems. Andorra-I has obtained good performance on the Sequent Symmetry, but our experience with it running on modern multiprocessors demonstrates indeed that scalability suffers greatly on these architectures [7]. This paper addresses the question of whether the poor scalability of Andorra-I is inherent to the complex structure of PLP systems or can be improved through careful analysis and tuning. In order to answer this question, we analyse the cache behaviour of all Andorra-I data areas when applied to several different
logic programs. The analysis pinpoints the areas that are responsible for most misses and the main sources of the misses. Based on this analysis we remove the main performance limiting factors in Andorra-I through a small set of optimisations that did not require a redesign of the system. More specifically, we optimise Andorra-I using 5 different techniques: trimming of shared variables, data layout modification, privatisation of shared data structures, lock distribution, and elimination of locking in scheduling. We present the isolated and combined performance improvements provided by the optimisations on a simulated DASH-like multiprocessor with up to 24 processors. In isolation, shared variable trimming and the modification of the data layout produced the greatest improvements. The improvements achieved when all optimisation techniques are combined are substantial. A few of our programs approach linear speedups as a consequence of our modifications. In fact, for one program the speedup of the modified Andorra-I is a factor of 3 higher than that of the original version of the system on 24 processors. Our main conclusion is then that, even though PLP systems are indeed complex and sometimes irregular, these systems can and should scale well on modern scalable multiprocessors.
2
Methodology
In this section we detail the methodology used in our experiments. The experiments consisted of the simulation of the parallel execution of Andorra-I [8, 10]. The Andorra-I parallel logic programming system employs a very interesting method for exploiting and-parallelism, namely to execute determinate goals first and concurrently, where determinate goals are goals that match at most one clause in a program. The Andorra-I system also exploits or-parallelism that arises from the non-determinate goals. Andorra-I requires access to shared memory for both execution and scheduling of work. In order to study the Andorra-I execution more fully, we divided its shared memory into ten different areas. Andorra-I implements the standard Prolog work areas. The Code Space includes the compiled code for every procedure and is read-only during execution of our benchmarks. The Heap Space stores structured terms and variables. The Goal Frame Space keeps the goals yet to be executed. The Choicepoint Stack maintains alternatives for open goals. The Trail Stack records any conditional bindings of variables. Andorra-I also requires several new areas for and/or parallelism. The Or-Scheduler Data Structures are used to manage or-parallelism. The data structures for and-parallelism are in the Worker area. The Binding Arrays are used to implement the SRI model [5] for or-parallelism, by storing conditional bindings. A Lock Array was needed in our port to establish a mapping between a shared memory position (such as a variable in the heap) and a lock. Finally, the Miscellaneous Shared Variables include the remaining data structures. To simulate Andorra-I we ported the system to a detailed on-line, execution-driven simulator. The simulator uses the MINT front-end [9], and a back-end
that simulates the memory and interconnection systems. We simulate a 24-node, DASH-like [4], directly-connected multiprocessor. Each node of the simulated machine contains a single scalar processor, a write buffer, a 128-KB direct-mapped data cache with 64-byte cache blocks, local memory, a full-map directory, and a network interface. We use the DASH write-invalidate protocol with release consistency [3] in order to keep caches coherent. We classify cache misses under this protocol using the algorithm presented in [1].
Fig. 1. Speedups for bt-cluster (idealised and DASH curves, 1 to 24 processors)
Fig. 2. Speedups for tsp (idealised and DASH curves, 1 to 24 processors)
3
Workload and Original Performance
The benchmarks we used in this work are applications representing predominantly and-parallelism, predominantly or-parallelism, and both and- and or-parallelism. We next discuss application performance for the original Andorra-I (more detailed information on the benchmarks can be found in the extended version of this paper [6] and in Dutra's thesis [2]). Note that our results correspond to the first run of an application; results would be somewhat better for other runs. We use two example and-parallel applications, the clustering algorithm for network management from British Telecom, bt-cluster, and a program to calculate approximate solutions to the traveling salesman problem, tsp. To obtain best performance, we rewrote the original applications to make them determinate-only computations. Figure 1 shows the bt-cluster speedups for the simulated architecture as compared to an idealised shared-memory machine, where data items can always be found in cache. The idealised curve shows that the application has excellent and-parallelism and can achieve almost linear speedups up to twenty four processors. Unfortunately, performance for the DASH-like machine is barely acceptable. Figure 2 shows that the tsp application achieves worse speedups than bt-cluster on a modern multiprocessor. The maximum speedup actually decreases for 24 processors, whereas the ideal machine would achieve a speedup of 20 for 24 processors. Figure 3 illustrates the number and sources of cache misses
per data area in the bt-cluster application running on 16 processors as a representative example. The figure shows that the overall miss rate of bt-cluster is dominated by true and false sharing misses from the Worker and Misc areas. This suggests that the system could be much improved by reducing false sharing and studying activity in the Worker and Misc areas.
We use two or-parallel applications. Our first application, chat80, is an example from the well-known natural language question-answering system chat-80, written at the University of Edinburgh by Pereira and Warren. The second application, fp, is an example query for a knowledge-based system for the automatic generation of floor plans. This application should at least in theory have significant or-parallelism. Figure 5 shows the speedups for the chat80 application from 1 to 24 processors. These speedups are very similar to those obtained by Andorra-I on the Sequent Symmetry architecture. In contrast, the DASH curve reaches a maximum speedup of 4.2 for 16 processors. Figure 6 shows the speedups for the fp application. The theoretical speedup is very good, in fact quite close to linear, in sharp contrast to the actual speedup for the DASH-like machine.
Fig. 3. Misses by data area for bt-cluster
Fig. 4. Misses by data area for chat80
Fig. 5. Speedups for chat80
Fig. 6. Speedups for fp
Fig. 7. Speedups for pan2
Fig. 8. Misses by data area for pan2
Figure 4 shows the number and source of misses for chat80 running on 16 processors, again as an example of this type of application. Note that chat80 does not have enough parallelism to feed 16 processors, suggesting that most sharing misses should result from the or-parallel scheduling areas, OrSch and ChoiceP. Indeed, the figure shows that these areas are responsible for a large number of sharing misses, but the areas causing the most misses are Worker and Misc as in the and-parallel applications, indicating again that these two areas should be optimised.
As an example of an and/or-parallel application we used a program to generate naval flight allocations, based on a system developed by Software Sciences and the University of Leeds for the Royal Navy. Figure 7 shows the speedups for pan2. The idealised curve shows that the application has less parallelism than all other applications; the ideal speedup does not even reach 12 on 24 processors. When run on the DASH simulator, pan2 exhibits unacceptable speedups for all numbers of processors; speedup starts out at about 1.8 for 2 processors and slowly improves to a maximum of only 4.8 for 24 processors. Figure 8 shows the distribution of cache misses by the different Andorra-I data areas for 16 processors. In this case, the Worker area clearly dominates, since the contribution from the Misc area is not as significant as in the and-parallel benchmarks. Note that there is more true than false sharing activity in Worker. The true sharing probably results from idle processors looking for work.
4
Optimisation Techniques and Performance
The previous analysis suggests that relatively high miss rates may be causing the poor scalability of Andorra-I. It is interesting to note that most misses come from fixed layout areas, such as Worker and Misc, and not from the execution stacks, as one would assume. We next discuss how several optimisations can be applied to the system, particularly in order to improve the utilisation of the Worker and Misc areas.
The first two optimisations were prompted by our simulation-based analysis of caching behaviour, and they are the ones that give the best improvement. The other three were based on our original intuitions regarding the system, and were the ones we would have performed first without simulation data. We studied performance for three applications, bt-cluster, chat80 and pan2. A detailed discussion of our experiments and results can be found in [6]. In the remainder of this section, we simply summarise the impact of each of the techniques studied when applied in isolation.
Variable Trimming. In this technique we investigate the areas that have unexpected sharing, and try to eliminate this sharing if possible. For two of our applications, the Misc area gave a surprisingly significant contribution to the number of misses. The area is mostly written at the beginning of the execution to set up execution parameters. During execution it is used by the reconfigurer and to keep reduction and failure counters. By investigating each component in the area, we detected that the counters were the major source of misses. As they are only used for research and debugging purposes, we were able to eliminate them from the Andorra-I code. The results in [6] show that chat80 benefits the most from this optimisation. This is because the failure counter is never updated by and-parallel applications, but often updated by this or-parallel application. The optimisation does not impact the and-parallel benchmarks as much, leading to less than 10% speedup improvements.
Data Layout Modification. All benchmarks but pan2 exhibit a high false sharing rate, showing a need for this technique. On 16 processors, 15% of the misses in pan2 are false sharing misses, whereas in the other applications false sharing causes between 40% (chat80) and 51% (bt-cluster) of all misses. These results suggest that improving false sharing is of paramount importance. According to our detailed analysis of caching behaviour, false sharing misses are concentrated in the Worker, OrSch, ChoiceP and BA areas, besides the Misc area optimised by the previous technique. The Worker and OrSch data areas are allocated statically. This indicates that we can effectively reduce false sharing. We applied two common techniques to tackle false sharing, padding between fields that belonged to different workers or that were logically independent, and field reordering to separate fields that were physically close but logically distinct. Although these are well-known techniques, padding required careful analysis, as it increases eviction misses significantly. The field reordering technique was not easily applied either, as the relationships between fields are quite complex. Padding may lead to serious performance degradation for the dynamic data areas, such as ChoiceP and BA. This restricted our options for layout modification to just field reordering for these areas. The BA area was the target of one final data layout modification, since the analysis of cache behaviour surprised us with a high number of false sharing misses in this area for chat80. Further investigation showed that this was a memory allocation problem. The engines'
top of stacks were being shmalloc'ed separately and appeared as part of the BA area in the analysis. This increased sharing misses in the area and was especially bad for the or-parallel applications, as different processors' tops of stack would end up in the same cache line. We addressed the problem by moving these pointers into the Worker area, where they logically belong. Our results show that the bt-cluster and chat80 applications benefit the most from this optimisation; speedup improvements can be as significant as 60%. In contrast, the pan2 application achieves improvements of less than 10% from this optimisation.
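The padding and field reordering described above can be illustrated with a toy version of a per-worker record; the actual Andorra-I Worker layout is not given in the paper, so all field names here are assumptions. The idea is simply that fields written privately by one worker are grouped together and the record is padded and aligned to the 64-byte block size of the simulated machine, so that updates by one worker can never invalidate a line cached by another.

    #define CACHE_LINE 64      /* block size of the simulated DASH-like machine */

    /* Hypothetical per-worker record: hot, privately written fields grouped
     * together, then padded out to a cache-line multiple. */
    struct worker {
        long  reductions;          /* written only by this worker */
        void *top_of_stack;        /* e.g. the top-of-stack pointer moved out of BA */
        void *goal_queue;
        char  pad[CACHE_LINE - ((sizeof(long) + 2 * sizeof(void *)) % CACHE_LINE)];
    };

    /* One padded, line-aligned slot per worker instead of tightly packed fields
     * (the aligned attribute is GCC-specific). */
    struct worker workers[24] __attribute__((aligned(CACHE_LINE)));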
Privatisation of Shared Variables. This technique reduces the number of shared memory accesses by making local copies in each node of the machine. In the best case, the shared variables are read-only and hence local copies can actually be allocated in private memory. The high number of references to Worker suggested that privatisation could be applied there. In fact, Andorra-I did already use private copies of the variables in Worker and there was little room for improvement. The Locks and Code data areas are the major candidates for privatisation in Andorra-I. The Locks area only includes pointers to the actual locks, is thus read-only during execution, and can be easily privatised. Another area that is also read-only during parallel execution of our benchmarks is Code. Unfortunately, logic programs in general can change the database and, therefore, update Code, making privatisation complex. Our results show that privatisation improves speedups by up to 10% at most and that the impact of this optimisation decreases as the number of processors increases.
Lock Distribution. This technique was considered to reduce contention on accesses to logical variables, and-scheduling, or-scheduling, and stack management. The original implementation used a single array of locks to implement these operations. In the worst case, several workers would contend for the same lock, causing contention. To improve scalability, we implemented different lock data structures for different purposes. We expected best results for or-parallel applications, as the optimisation prevents different teams from contending on accesses to logical variables. The cost of this optimisation is that, if the arrays of locks are shared, there will be more expensive remote cache misses. Our results show that the bt-cluster and chat80 applications benefit somewhat from this optimisation, but that the pan2 application already exhibited a significant number of misses in the Locks area and suffers a slowdown.
Elimination of Locking in Scheduling. This technique improves performance in benchmarks with significant and-parallelism by testing whether there is available work before actually locking the work queue. This modification is equivalent to replacing a test_and_set lock with a test_and_test_and_set lock. This optimisation provides a small speedup improvement for pan2, as it avoids locking when there is no and-work. For bt-cluster the technique does not improve speedups, as this application exhibits enough and-work to keep processors busy.
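The difference between the two locking disciplines mentioned above can be sketched as follows, using GCC atomic built-ins as a stand-in for the actual Andorra-I lock primitives. The scheduler benefit comes from the first, ordinary read: an idle worker spins in its own cache and only performs the invalidating atomic operation when the lock (or the work queue it guards) appears to be free.

    /* Plain test_and_set acquisition: every attempt is an atomic write, so the
     * line holding the lock bounces between the spinning processors. */
    void lock_tas(volatile int *lock)
    {
        while (__sync_lock_test_and_set(lock, 1))
            ;                              /* spin */
    }

    /* test_and_test_and_set: spin on a cheap local read first and attempt the
     * atomic operation only when the lock looks free. */
    void lock_ttas(volatile int *lock)
    {
        for (;;) {
            while (*lock)                  /* read hits in the local cache */
                ;
            if (!__sync_lock_test_and_set(lock, 1))
                return;
        }
    }

    void unlock(volatile int *lock)
    {
        __sync_lock_release(lock);
    }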
Fig. 9. Speedups for bt-cluster
Fig. 10. Speedups for tsp
Fig. 11. Misses by data area for bt-cluster
Fig. 12. Misses by data area for chat80
5
Combined Performance of Optimisation Techniques
We next discuss the overall system performance with all optimisations combined. We compare speedups against the idealised and original results. The idealised speedups were recalculated for the new version of Andorra-I, but, as is shown in the figures, the optimisations did not have any significant impact for the idealised machine. Figures 9 and 10 show the speedups for the two and-parallel applications running on top of the modified Andorra-I system. The maximum speedup for bt-cluster jumped from 12 to 20, whereas the maximum speedup for tsp jumped from 6.3 to 19. This indicates that the realistic machine is now able to exploit the available parallelism more fully. The explanation for the better speedups is a significant decrease in miss rates. For bt-cluster, the new version of Andorra-I exhibits a miss rate of only 0.6% for 16 processors, versus the 1.6% of the previous version. In the case of tsp, the optimisations decreased the miss rate from 3% to 1.2%, again on 16 processors. Figure 11 shows the number and source of misses for bt-cluster on 16 processors. Note that the figure keeps the same Y-axis as in Figure 3 to simplify
comparisons against the cache behaviour of the original version of Andorra-I. The figure shows that the number of misses in the Worker area was reduced by a factor of 4, while the number of misses in the Misc area was reduced by an order of magnitude. The figure also shows that there is still significant true sharing in Worker, but false sharing is much less significant. The number of misses from Misc is now almost irrelevant.
The or-parallel benchmarks also show remarkable improvements due to the combination of the optimisation techniques we applied. Figure 13 shows the speedups for chat80 and Figure 14 shows the speedups for fp. The maximum speedup for chat80 almost doubles from one version of the system to the other. Note that speedups for the optimised system still flatten out on 16 processors, but at a much better efficiency. The other benchmark, fp, displays our most impressive result. The speedup for 24 processors jumps from 6.2 with the original Andorra-I system to 20 when all our optimisations are applied. This result represents more than a three-fold improvement. Figure 12 shows the distribution of misses for chat80 with 16 processors. The figure demonstrates that the number of misses in the Worker and Misc areas was reduced by an order of magnitude. The large number of eviction and cold start misses in the Code area remains however. Sharing misses are now concentrated in the OrSch and ChoiceP areas, as they should be.
Figure 15 shows the speedups of the new version of Andorra-I for the pan2 benchmark. In this case, the improvement resulting from our optimisations was quite small. Figure 16 shows the cache miss distribution for the optimised Andorra-I. The main source of misses was true sharing originating in the Worker region. A more detailed analysis proved that these misses originate from lack of work. Workers are searching each other's queues and generating misses. We are investigating more sophisticated scheduling strategies to address this problem.
Fig. 13. Speedups for chat80
Fig. 14. Speedups for fp
Fig. 15. Speedups for pan2
Fig. 16. Misses by data area for pan2
6
Conclusions and Future Work
Andorra-I is an example of an and/or-parallel system originally designed for traditional bus-based shared-memory architectures. We have demonstrated that the system can also achieve good performance on scalable shared-memory systems. The key to these results was the extensive data available from detailed simulations of Andorra-I. This information showed that there was no need to restructure the system or its schedulers. Instead, performance could be dramatically improved by focusing on accesses to shared data.
We believe there is potential for improving the performance of PLP systems even further. To prove so will require more radical changes to data structures within Andorra-I itself, as the system was simply not designed for such large numbers of processors. Last, but not least, we are interested in studying the performance of other parallel logic programming systems, such as the systems that exploit independent and-parallelism.
Acknowledgements
The authors would like to thank Leonidas Kontothanassis and Jack Veenstra for their help with the simulation infrastructure, and Rong Yang, Tony Beaumont and D. H. D. Warren for their work on Andorra-I. This paper results from collaboration work with Inês Dutra, and has benefited from Ms da Silva's studies. Vítor Santos Costa would like to thank support from the Praxis PROLOPPE and FCT MELODIA projects. Ricardo Bianchini would like to thank the support of the Brazilian CNPq.
References
1. R. Bianchini and L. I. Kontothanassis. Algorithms for categorizing multiprocessor communication under invalidate and update-based coherence protocols. In Proceedings of the 28th Annual Simulation Symposium, April 1995.
2. Inês Dutra. Distributing And- and Or-Work in the Andorra-I Parallel Logic Programming System. PhD thesis, University of Bristol, Department of Computer Science, February 1995.
3. D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. Proceedings of the 17th International Symposium on Computer Architecture, pages 148-159, May 1990.
4. D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH prototype: Logic overhead and performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, Jan 1993.
5. Ewing Lusk, David H. D. Warren, Seif Haridi, et al. The Aurora Or-parallel Prolog System. New Generation Computing, 7(2,3):243-271, 1990.
6. V. Santos Costa and R. Bianchini. Optimising Parallel Logic Programming Systems for Scalable Machines. Technical Report DCC-97-7, DCC-FC & LIACC, UP, October 1997.
7. V. Santos Costa, R. Bianchini, and I. C. Dutra. Evaluating the impact of coherence protocols on parallel logic programming systems. In Proceedings of the 5th EUROMICRO Workshop on Parallel and Distributed Processing, pages 376-381, 1997.
8. V. Santos Costa, D. H. D. Warren, and R. Yang. Andorra-I: A Parallel Prolog System that Transparently Exploits both And- and Or-Parallelism. In Third ACM SIGPLAN PPOPP, pages 83-93. ACM Press, April 1991.
9. J. E. Veenstra and R. J. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Proceedings of MASCOTS '94, 1994.
10. Rong Yang, Tony Beaumont, Inês Dutra, Vítor Santos Costa, and David H. D. Warren. Performance of the Compiler-Based Andorra-I System. In Proceedings of the Tenth International Conference on Logic Programming, pages 150-166. MIT Press, June 1993.
Experiments with Binding Schemes in LOGFLOW
Zsolt Németh and Péter Kacsuk
MTA Computer and Automation Research Institute, H-1518 Budapest, P.O. Box 63, Hungary
{zsnemeth, kacsuk}@sztaki.hu http://www.lpds.sztaki.hu
Abstract. The handling of variables is a crucial issue in designing a parallel Prolog system. In the framework of the LOGFLOW project some experiments were made with binding methods. The work is an analysis of possible schemes, a modification of an existing one and a plan for future implementations of LOGFLOW on new architectures. The conditions, principles and results are presented in this paper.
1
Introduction
Logic programming languages offer a high degree of inherent parallelism [4]. The central question in implementing an OR-parallel Prolog system is how the conditional bindings can be handled consistently and effectively. There are a number of techniques for treating conditional bindings. They can be categorized as stack sharing [1][3][5][6], stack copying [11] and recomputation methods. All these techniques have a common feature: references (either physical or logical) are allowed to point from an environment to any other environment. However, in a distributed memory system these environments may reside on different processing elements, and dereferencing a variable through several processing elements is a tremendously expensive operation. LOGFLOW is a distributed-memory implementation of Prolog [8]. Its abstract execution model is the logicflow model based on dataflow principles. Prolog programs are transformed into the so-called Dataflow Search Graph (DSG). Nodes in this graph represent Prolog procedures. The execution is governed by token streams. Tokens represent a given state of the machine where the computation is continued. LOGFLOW is based entirely on the closed binding environment scheme. Thus, tokens holding an environment are allowed to be distributed over the processor space freely, since there are no external references from the token. The implementation shows some of the drawbacks of the closed binding scheme. There is a long line of research on combining the benefits of different binding schemes (e.g. [10][12][14]). The overheads of the closed binding scheme in LOGFLOW were analysed and a hybrid binding method was elaborated; the idea basically relies on the Quasi Closed Environment (QCE [12]) but, in fact, the new scheme is an amalgamation of several binding techniques.
2
The closed binding environment
A closed environment in Conery's definition is a set of frames E so that no pointers or links originating from E dereference to slots in frames that are not members of E [2]. There are two special procedures to maintain the closed state of an environment frame: the (2-stage) unification and the back-unification [7]. The environment closing is a costly operation both in time and in memory consumption. The most important contributors to this overhead are the following. Compound terms must be scanned each time a unification or back-unification takes place. Scanning a compound term means checking each argument of the term to see whether it is a reference to a variable or not and performing the necessary actions. Since compound terms (structures and lists) are the natural data structures in logic programming languages, they are used extensively. Structures must not be shared by different environments because their arguments may be changed by binding variables or by modifying a reference item during the environment closing. If a token is duplicated, all the structures in its environment must be copied for each new environment. This results in an enormous memory consumption and slows down the computation.
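To illustrate why duplicating a token is expensive under this scheme, the toy fragment below deep-copies a term; the node layout is an assumption made for the example, not the actual LOGFLOW representation, but it shows that the work grows with the total size of every structure reachable from the environment being duplicated.

    #include <stdlib.h>

    /* Toy term representation, assumed for illustration only. */
    enum tag { VAR, CONST, STRUCT };

    struct term {
        enum tag      tag;
        int           functor;     /* constant value or functor identifier */
        int           arity;
        struct term **args;        /* arity sub-terms for STRUCT nodes */
    };

    /* Duplicating a token means deep-copying every structure it holds,
     * since structures may not be shared between closed environments. */
    struct term *copy_term(const struct term *t)
    {
        struct term *c = malloc(sizeof *c);
        c->tag = t->tag;
        c->functor = t->functor;
        c->arity = t->arity;
        c->args = NULL;
        if (t->tag == STRUCT && t->arity > 0) {
            c->args = malloc(t->arity * sizeof *c->args);
            for (int i = 0; i < t->arity; i++)
                c->args[i] = copy_term(t->args[i]);   /* cost is O(term size) */
        }
        return c;
    }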
3
The proposed solutions
The previously listed overheads may occur because of the very special handling of addresses in the closed binding environment scheme. The proposed solutions try to combine the local addressing scheme of the closed environment with the global addressing scheme of non-closed environments. In [10] molecules are introduced to avoid copying when all the arguments of a structure become ground. In [14] the so-called tagged variable scheme is used, which is a variation of hash tables for dataflow machines. In [12] variables are divided into two groups according to whether they can occur in a structure or not. Variables occurring in a structure are stored as global ones whereas others are still stored according to the local addressing scheme. This distinction does not affect the location of the variables but the way they can be accessed. In such a way, the most expensive part of the closing procedure, the structure scanning and copying, can be deferred until the task is really migrated between the processors.
4
Combining the closed and non-closed binding schemes in LOGFLOW
The new hybrid binding scheme alloys many ideas from different binding methods. The basis is still a variation of Conery's closed environment. By default, variables are handled in the usual way according to the closed binding environment principles. Variables are made global, and thus are handled differently, when they are in the arguments of a structure. The fundamental elements of the new binding scheme are the following:
1. The special storage of global variables. The variable access and the duplication of environments should be done efficiently. Any well-known global method is a good candidate (e.g. binding arrays, directory trees, etc.); finally
a hashing method was chosen. It is a hash technique on variables rather than on environments as in Borgwardt's hash windows [1]. There are comparisons of the two hashing methods in [5].
2. The way the global variables are detected, created and named according to a global naming scheme. The naming scheme is very similar to that in the binding array technique [3].
3. Binding the global variables to a value. Basic considerations were: how to represent unbound variables, and where to put newly bound ones.
4. Handling different OR-branches (the question of conditional global variables). As a result of the decisions made at 3, the duplication of environments at OR-nodes is easy and efficient - there is no need to take extra care of unbound variables.
5. The new binding scheme operates within a single PE. Some special procedures must be done when the computation migrates between PEs. The problem shows many common features with the time stamping method.
None of the above mentioned binding techniques can solve all the problems arising in LOGFLOW, but their combination could be a promising candidate. Details of the design and implementation principles can be found in [13].
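As a toy illustration of hashing on variables rather than on environments, the fragment below keys a binding table on a variable's global name (assumed here to be an integer serial number); the layout is our own assumption and deliberately ignores how conditional bindings in different OR-branches would be distinguished.

    #include <stdlib.h>

    #define NBUCKETS 1024

    /* Toy global-variable table: one entry per globally named variable,
     * chained by a hash of the variable's global name. */
    struct gvar {
        int          name;       /* global name, e.g. creation serial number */
        void        *value;      /* NULL while unbound */
        struct gvar *next;
    };

    static struct gvar *buckets[NBUCKETS];

    static struct gvar *gvar_lookup(int name)
    {
        unsigned h = (unsigned)name % NBUCKETS;
        for (struct gvar *g = buckets[h]; g; g = g->next)
            if (g->name == name)
                return g;
        struct gvar *g = malloc(sizeof *g);    /* create on first use */
        g->name = name;
        g->value = NULL;
        g->next = buckets[h];
        buckets[h] = g;
        return g;
    }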
5
Evaluation of the hybrid binding method
The hybrid binding scheme was implemented and tested on programs of different types and sizes. Test results can be summarized as follows:
- the cost of the closing procedure within a single processing element (intra-processor overhead) was significantly reduced;
- the cost of task migration (inter-processor overhead) became slightly higher due to the procedures involved.
Tests showed some further properties of the binding scheme, e.g.:
- the bigger the structures are, the more time can be saved by the new binding method;
- when other kinds of optimization are present (e.g. a ground term table) the new scheme can add a little to the performance.
6
Conclusions and further work
The goal of the revision of the binding method was to improve the performance of a distributed Prolog system. While efforts were made to achieve this goal, the test results pointed out an interesting conclusion which can set the direction of future research work. The test results suggested that this kind of hybrid binding scheme is an excellent candidate for a binding environment on multithreaded architectures like Datarol-II and KUMP/D, where the intra-processor overhead associated with closing procedures is reduced by the binding method whereas the inter-processor copying overhead is eliminated by the facilities of the architecture. This idea is presented in [9].
There are ongoing research projects on implementing LOGFLOW on a kind of hybrid von Neumann-dataflow multithreaded architecture. It has been proven that the runtime model of the Datarol-II family and that of LOGFLOW are very similar. Furthermore, the thread scheduling policies and granularities of the two models are quite close to each other. The efficient implementation of LOGFLOW on Datarol-II can be greatly supported by a binding method like the one presented in this paper.
7
Acknowledgments
The work reported in this paper was partially supported by the National Research Grant 'Massively Parallel Implementation of Prolog on Multithreaded Architectures, Particularly on Datarol-II' registered under No. T-022106.
References
1. P. Borgwardt: Parallel Prolog Using Stack Segments on Shared-Memory Multiprocessors. Proceedings of the 1984 Symposium on Logic Programming.
2. J.S. Conery: Binding Environments for Parallel Logic Programs in Non-Shared Memory Multiprocessors. Proceedings of the 1987 Symp. on Logic Programming, 1987.
3. M. Carlsson: Design and Implementation of an OR-Parallel Prolog Engine. SICS Dissertation Series 02.
4. J. Chassin de Kergommeaux, P. Codognet: Parallel Logic Programming Systems. ACM Computing Surveys, Vol. 26, No. 3, Sept. 1994.
5. A. Ciepielewski, B. Hausman: Performance Evaluation of a Storage Model for OR-Parallel Execution of Logic Programs. SICS 86003 Research Report.
6. B. Hausman, A. Ciepielewski, S. Haridi: OR-Parallel Prolog Made Efficient on Shared Memory Multiprocessors. SICS R87006 Research Report.
7. P. Kacsuk: Distributed Data Driven Prolog Abstract Machine. In: P. Kacsuk, M.J. Wise: Implementations of Distributed Prolog. Wiley, 1992.
8. P. Kacsuk, Zs. Németh and Zs. Puskás: Tools for Mapping, Load Balancing and Monitoring in the LOGFLOW Parallel Prolog Project. Parallel Computing Journal, Elsevier, Vol. 22, No. 13, Feb. 1997, pp. 1853-1882.
9. P. Kacsuk, M. Amamiya: A Multithreaded Implementation Concept of Prolog for Datarol-II. Proceedings of ISHPC97.
10. L.V. Kalé, B. Ramkumar: The Reduce-Or Process Model for Parallel Logic Programming on Non-Shared Memory Machines. In: P. Kacsuk, M.J. Wise: Implementations of Distributed Prolog. Wiley, 1992.
11. R. Karlsson: A High Performance OR-Parallel Prolog System. SICS Dissertation Series 07, March 1992.
12. H. Kim, J.-L. Gaudiot: A Binding Environment for Processing Logic Programs on Large-Scale Parallel Architectures. Technical Report, University of Southern California, 1994.
13. Zs. Németh, P. Kacsuk: Analysis and Improvement of the Variable Binding Scheme in LOGFLOW. Workshop on Parallelism and Implementation Technologies for (Constraint) Logic Programming Languages, Port Jefferson, 1997.
14. A.V.S. Sastry, L.M. Patnaik: OR-Parallel Evaluation of Logic Programs on a Multi-Ring Dataflow Machine. New Generation Computing, 10 (1991), pp. 23-53.
Experimental Implementation of Parallel TRAM on Massively Parallel Computer
Kazuhiro Ogata, Hiromichi Hirata, Shigenori Ioroi and Kokichi Futatsugi
JAIST, JAPAN ({ogata, h-hirata, ioroi, kokichi}@jaist.ac.jp)
Abstract. MPTRAM is an abstract machine for parallel rewriting that is intended to be reasonably implemented on massively parallel computers. MPTRAM has been implemented on a Cray T3E carrying 128 PEs with MPI. It rewrites terms efficiently in parallel (about 65 times faster than TRAM) if message traffic is light. But we have to reduce the overhead of message passing somehow, e.g. by using the shared globally addressable memory subsystem of the Cray T3E, so that MPTRAM can be used to implement algebraic specification languages.
1 Introduction
Algebraic specification languages such as OBJ3 and CafeOBJ allow users to use computers to verify some properties of software systems and observe their dynamic behavior at the early stage of software development. However, there are still some problems that have to be solved about algebraic specification languages. One of them is the improvement of their execution (rewriting) speed. That is why we have designed some rewriting abstract machines. TRAM [5] is an abstract machine for rewriting [4] that may be used to efficiently implement algebraic specification languages. Its rewriting strategy is the user-definable E-strategy [2]. Parallel TRAM [7] (PTRAM) is a parallel variant of TRAM that is intended to be implemented on shared-memory multiprocessors. It allows users to control parallel rewriting using the parallel E-strategy, which is an extension of the E-strategy and is also a subset of the concurrent E-strategy [3]. Furthermore, we have altered PTRAM so as to reasonably implement it on massively parallel computers. The altered version is called Massively Parallel TRAM (MPTRAM). We have experimentally implemented MPTRAM on a Cray T3E [1] carrying 128 processing elements with MPI [8]. In this paper, we briefly describe parallel rewriting with the parallel E-strategy, PTRAM and MPTRAM, and report some experiments carried out on the Cray T3E.
2 Parallel TRAM
Outline of Parallel Rewriting. Parallel TRAM (PTRAM) uses as its rewriting strategy the parallel E-strategy [7], which is an extension of the E-strategy [2] and is also a subset of the concurrent E-strategy [3], so as to control parallel
rewriting. The strategy allows each operation to have its own parallel local strategy. Parallel local strategies are specified for an n-ary operation by using lists that basically consist of natural numbers and sets of positive numbers, or, to be more exact, that are defined by the following extended BNF notation:
Definition 1 (Syntax of parallel local strategies).
⟨ParallelLocalStrategy⟩ ::= ( ) | ( ⟨SerialElem⟩* 0 )
⟨SerialElem⟩ ::= 0 | ⟨ArgNum⟩ | ⟨ParallelArgs⟩
⟨ArgNum⟩ ::= 1 | 2 | ... | n
⟨ParallelArgs⟩ ::= { ⟨ArgNum⟩+ }
A positive number x (1 ≤ x ≤ n) in parallel local strategies denotes the xth argument of a term whose top operation (if the term is f(t1, ..., tn), f is the top operation) is that n-ary operation, and zero stands for the term itself. The rewriting of terms with the parallel E-strategy is briefly described, followed by the operational semantics. A term is rewritten according to the parallel local strategy of its top operation. If the first element of the strategy is some positive number x, the xth argument is rewritten and the rewriting of the original term continues according to the remainder of the strategy. If the first one is zero, the term is first replaced with a new term and then the new one is rewritten according to the parallel local strategy of its top operation if there exists a rewrite rule whose left-hand side matches the term; otherwise the rewriting of the original term continues according to the remainder of the strategy. If the first one is a set of positive numbers, all the subterms corresponding to the positive numbers in the set are rewritten in parallel and then the rewriting of the original term continues according to the remainder of the strategy. The following is the operational semantics written as rewrite rules:
Definition 2 (Operational semantics of the parallel E-strategy).
eval(t) → if(eval?(t), t, reduce(t, strategy(t)))
reduce(t, nil) → mark(t)
reduce(t, c(0, l)) → if(redex?(t), eval(contractum(t)), reduce(t, l))
reduce(t, c(x, l)) → reduce(replace(t, x, eval(arg(t, x))), l)
reduce(t, c(m, l)) → reduce(fork(t, m), l)
fork(t, empty) → t
fork(t, p(x, m)) → exit(eval(arg(t, x)), x, fork(t, m))
exit(s, x, t) → replace(t, x, s)
eval rewrites a term according to the parallel local strategy of its top operation by calling reduce if the term has not been rewritten yet. t and s are variables for terms, x for positive numbers (that is why reduce(t, c(x, l)) does not overlap with reduce(t, c(0, l))), l for ⟨ParallelLocalStrategy⟩ and m for ⟨ParallelArgs⟩, and the other symbols are operations. c and nil, and p and empty are constructors for representing ⟨ParallelLocalStrategy⟩ and ⟨ParallelArgs⟩ respectively. exit and if are given ({1 3} 0) and (1 0) as their strategies, and the other operations are given eager local strategies such as (1 2 ... 0).
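These rules can be read as a small interpreter. The following Python sketch follows Definition 2 closely but is only an illustration, not the TRAM implementation: terms are plain (operation, argument-list) tuples, the fork over a set of argument positions is simulated sequentially, and the id-based marking of already-rewritten terms is a simplification.

def eval_term(t, strategy, rules, evaluated=None):
    evaluated = evaluated if evaluated is not None else set()
    if id(t) in evaluated:                      # eval?(t): already rewritten
        return t
    return reduce(t, list(strategy.get(t[0], [])), strategy, rules, evaluated)

def reduce(t, strat, strategy, rules, evaluated):
    if not strat:                               # reduce(t, nil) -> mark(t)
        evaluated.add(id(t))
        return t
    head, rest = strat[0], strat[1:]
    if head == 0:                               # try to rewrite the term itself
        contractum = rules(t)
        if contractum is not None:
            return eval_term(contractum, strategy, rules, evaluated)
        return reduce(t, rest, strategy, rules, evaluated)
    op, args = t
    args = list(args)
    if isinstance(head, set):                   # sequential stand-in for the parallel fork
        for x in sorted(head):
            args[x - 1] = eval_term(args[x - 1], strategy, rules, evaluated)
    else:                                       # head is a single argument position
        args[head - 1] = eval_term(args[head - 1], strategy, rules, evaluated)
    return reduce((op, args), rest, strategy, rules, evaluated)

# Example (assumed encoding): Peano addition plus(s(x), y) -> s(plus(x, y)), plus(0, y) -> y.
def peano_rules(t):
    op, args = t
    if op == "plus" and args[0] == ("0", []):
        return args[1]
    if op == "plus" and args[0][0] == "s":
        return ("s", [("plus", [args[0][1][0], args[1]])])
    return None

strategies = {"plus": [{1, 2}, 0], "s": [1, 0], "0": []}
two = ("s", [("s", [("0", [])])])
print(eval_term(("plus", [two, two]), strategies, peano_rules))   # four in Peano notation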
848 ,."lit ~Blil'l:i~li$ii~i'"-,~
......
..........
~
.................
Fig. 1. Rough sketch of PTRAM architecture
PTRAM Architecture. PTRAM is intended to be implemented on shared-memory multiprocessors with a small number of processors. Fig. 1 shows a rough sketch of the PTRAM architecture. There are three kinds of processing units (an interpreter, a rule compiler and a term compiler), and six kinds of regions. DNET and CR are stable regions and are shared with all processors. The three mutable regions SL, STACK and VAR are given to each processor, but CODE, on which terms are allocated, is shared with all processors so that terms can be shared with all processors. One of the processors, called the main processor, also has a rule compiler and a term compiler. Given a set of rewrite rules, with the rule compiler the main processor encodes the left-hand sides into a discrimination net, which is a kind of decision tree for pattern matching, and translates the right-hand sides into pairs of matching program templates and strategy list templates. The discrimination net is allocated on DNET, and the pairs on CR. After that, given a term, with the term compiler the main processor translates it into a pair of a matching program and a strategy list. The matching program is allocated on CODE, and the strategy list on the main processor's SL. And then, in order to rewrite the original term, the main processor interprets the matching program according to the strategy list with its interpreter, STACK and VAR, and DNET. If a processor meets a situation where some arguments may be evaluated in parallel and there are some available (idle) processors, it asks them to rewrite the arguments independently of it. If there is no idle processor at that time, the processor itself rewrites the arguments. The strategy list [6] of a term is made from the matching program compiled from the term and the parallel local strategies of the subterms' top operations. It indicates the order in which the term is rewritten. The elements of the strategy list are basically addresses at which the matching programs of the subterms and some instructions are stored. There are three instructions for parallel rewriting: fork, join and exit. fork creates a new process for parallel rewriting, join delays its caller process until all of its child processes terminate, and exit terminates its caller process after reporting the termination to the parent process. exit does not have to explicitly return the term that its caller process has rewritten to the parent process because terms are allocated on the shared region CODE. Strategy lists may be regarded as process queues. A chunk of consecutive elements of a strategy list represents a process. Each process has two possible states: active and pending. Each process queue contains zero or more pending processes, and zero or one active process, which is on the top of the queue if it exists.
Fig. 2. Rough sketch of MPTRAM architecture
A processor whose process queue contains an active process is busy, and otherwise idle. There is no special processor that manages processors, but there is a table recording the processors' states that is shared with all processors. Each processor itself keeps a record of its own state in the table, and looks at the table so as to find idle processors.
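A minimal sketch of such a state table is given below; the array-of-flags layout, the lock around the scan-and-claim step and all names are assumptions made for illustration, not PTRAM's actual representation.

from threading import Lock

IDLE, BUSY = 0, 1

class StateTable:
    def __init__(self, n_processors):
        self.state = [IDLE] * n_processors
        self.lock = Lock()                 # protects the scan-and-claim step

    def set_state(self, proc_id, new_state):
        self.state[proc_id] = new_state    # only proc_id writes its own slot

    def claim_idle(self, asking_proc):
        # Find some idle processor and mark it busy on behalf of the caller.
        with self.lock:
            for pid, s in enumerate(self.state):
                if pid != asking_proc and s == IDLE:
                    self.state[pid] = BUSY
                    return pid
        return None                        # no idle processor: the caller does the work itself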
3 Massively Parallel TRAM
We have altered PTRAM to reasonably implement it on massively parallel computers. The altered version is called Massively Parallel TRAM (MPTRAM). Fig. 2 shows a rough sketch of the MPTRAM architecture. MPTRAM uses the message passing model as its parallel computational model, and the simple master/worker model as the basic parallel algorithm. The master manages all workers by using two containers (IdleStack and WaitList) and one table (ForkTable). Each worker has three possible states: idle, waiting and busy. Idle workers are pushed on the stack IdleStack, and waiting workers that are waiting for results from their child processes are added to the linked list WaitList because waiting workers may become busy at some time. Workers that are in neither IdleStack nor WaitList are busy. At first all the workers are idle. Each worker has the three processing units and the six regions so that the amount of messages passed between processors should not explode. That is why each worker can collect garbage in its own CODE independently of the others. ForkTable will be described later. Given a set of rewrite rules, the master receives and forwards it to all the workers. Each worker compiles it with its rule compiler in the same way as PTRAM. Given a term afterward, the master receives and forwards it to some worker, e.g. Worker1, and then the worker becomes busy and compiles it with its term compiler like PTRAM. Subsequently, the worker interprets the matching program according to the strategy list with its interpreter so as to rewrite the original term. If a worker meets the fork instruction, it asks the master if there is some available worker. The master searches IdleStack and then WaitList for an available worker, and changes the available worker to busy and sends its ID to the client worker if one exists, or informs the client of no available worker otherwise. The client worker asks the child worker with the ID to rewrite some argument if
[Plot: normalized rewriting speed of MPTRAM relative to TRAM against the number of workers (up to 127), one curve per threshold value.]
Fig. 3. Experimental results of computing the 34th Fibonacci number
it receives the ID from the master, or otherwise rewrites the argument by itself. If a worker meets the join instruction, it basically asks the master whether it can go on or must wait for its child workers, in which case it becomes waiting. The master judges this by looking at ForkTable, in which it keeps a record of how many child workers a worker is waiting for at each join point, that is, a place where the join instruction is referred to in the worker's strategy list. If a worker meets the exit instruction, it sends the result term to the parent worker, which is either waiting or busy, and notifies the master that it has finished a requested rewriting. After the notification, the master decreases the number of child workers the parent worker is waiting for at the corresponding join point in ForkTable, and makes the parent worker busy and has the parent worker resume if the number becomes zero and there is no other process above the process corresponding to the join point in the parent worker's strategy list. Each worker must have some table in which it keeps a record of references to terms that it has asked other workers to rewrite. Since those references become necessary so that a worker can go on rewriting after executing the join instruction, the necessary references are put just after each point at which the join instruction is referred to in the strategy list.
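The master's bookkeeping described above can be summarised in the following Python sketch. It only illustrates the roles of IdleStack, WaitList and ForkTable: worker IDs are plain integers, the check that no other process lies above the join point is omitted, and in the real system these interactions are MPI messages rather than method calls.

class Master:
    def __init__(self, worker_ids):
        self.idle_stack = list(worker_ids)   # idle workers (LIFO)
        self.wait_list = []                  # waiting workers (may become busy again)
        self.fork_table = {}                 # (parent, join_point) -> outstanding children

    def request_worker(self, parent, join_point):
        # Called when `parent` executes fork; returns a worker ID or None.
        if self.idle_stack:
            child = self.idle_stack.pop()
        elif self.wait_list:
            child = self.wait_list.pop(0)
        else:
            return None                      # parent must rewrite the argument itself
        key = (parent, join_point)
        self.fork_table[key] = self.fork_table.get(key, 0) + 1
        return child

    def join(self, parent, join_point):
        # Called when `parent` reaches join; True means it may continue.
        if self.fork_table.get((parent, join_point), 0) == 0:
            return True
        self.wait_list.append(parent)        # parent becomes waiting
        return False

    def exit_notify(self, child, parent, join_point):
        # Called after `child` has sent its result term back to `parent`.
        key = (parent, join_point)
        self.fork_table[key] -= 1
        self.idle_stack.append(child)
        if self.fork_table[key] == 0 and parent in self.wait_list:
            self.wait_list.remove(parent)    # parent becomes busy and resumes
            return parent
        return None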
4 Implementation of MPTRAM on T3E with MPI
Implementation. We have experimentally implemented MPTRAM on a Cray T3E [1] with MPI [8]. The Cray T3E used carries 128 user processing elements (PEs), each of which has a DECchip 21164 64-bit super-scalar RISC processor (300 MHz DEC Alpha EV5) and 64 MB of local memory. The PEs are connected by a bidirectional three-dimensional torus system interconnect network. The MPTRAM system uses two intracommunicators called Master and Workers, and one intercommunicator called Section between Master and Workers. An idle worker waits for a message from the master, via the intercommunicator Section, informing it that some busy worker is going to ask it to rewrite some term. After it receives such a message, it waits for a message that includes a term from a busy worker via the intracommunicator Workers. When a waiting worker waits for some messages from the master or from workers that the waiting worker has asked to evaluate some subterms, it is necessary to use the nonblocking receive operation
because the waiting worker cannot generally predict the order in which messages will arrive.
Preliminary Experiment. A set of rewrite rules that compute Fibonacci numbers in parallel was used to assess the MPTRAM system on the Cray T3E. The rewrite rules used primitive integers (DEC Alpha EV5's integers). They also used a threshold to prevent an excessive number of processes. If the nth Fibonacci number, where n is less than the threshold, is computed, the rewrite rules compute it sequentially, not in parallel. We had the MPTRAM system compute (rewrite) the 34th Fibonacci number with the rewrite rules, varying the number of workers used from 1 to 127 and the threshold from 24 to 27. Fig. 3 shows the normalized speed of the MPTRAM system with respect to TRAM. The results corresponding to each threshold except 24 show a similar tendency. The rewriting speed drastically increases at some point: 90, 60 and 40 workers if the thresholds are 25, 26 and 27 respectively. This seems to relate to how many processes are created, because the fork instruction is executed 88, 54 and 33 times if the thresholds are 25, 26 and 27 respectively. The rewriting speed drastically increases after the number of workers exceeds the number of forks executed in each case. If the threshold is 24, the fork instruction is executed 143 times, which is more than the number of actual PEs. That is why this result shows a different tendency from the others. We used another set of rewrite rules, in which natural numbers were represented à la Peano, to compute the 25th Fibonacci number so as to investigate how the amount of messages passed affects the increase in rewriting speed. The threshold was 14. The speedup was not over twice even with 90 workers, while the speedup was approximately nine times if primitive integers were used. Hence we have to reduce the overhead of message passing somehow, e.g. by using the shared globally addressable memory subsystem of the Cray T3E, so that MPTRAM can be used to efficiently implement algebraic specification languages.

References
1. Cray T3E: http://www.cray.com/products/systems/crayt3e
2. Futatsugi, K., Goguen, J. A., Jouannaud, J. P. and Meseguer, J.: Principles of OBJ2. Proc. of POPL '85. (1985) 52-66
3. Goguen, J., Kirchner, C. and Meseguer, J.: Concurrent Term Rewriting as a Model of Computation. LNCS 279 (Graph Reduction '86), Springer-Verlag. (1986) 53-93
4. Klop, J. W.: Term Rewriting Systems. In Handbook of Logic in Computer Science, Vol. 2, Oxford University Press. (1992) 1-116
5. Ogata, K., Ohhara, K. and Futatsugi, K.: TRAM: An Abstract Machine for Order-Sorted Conditional Term Rewriting Systems. LNCS 1232 (RTA '97), Springer-Verlag. (1997) 335-338
6. Ogata, K. and Futatsugi, K.: Implementation of Term Rewritings with the Evaluation Strategy. LNCS 1292 (PLILP '97), Springer-Verlag. (1997) 225-239
7. Ogata, K., Kondo, M., Ioroi, S. and Futatsugi, K.: Design and Implementation of Parallel TRAM. LNCS 1300 (Euro-Par '97), Springer-Verlag. (1997) 1209-1216
8. Snir, M., Otto, S.W., Lederman, S.H., Walker, D.W. and Dongarra, J.: MPI: The Complete Reference. The MIT Press. 1996
Parallel Temporal Tableaux*
R. I. Scott†, M. D. Fisher‡ and J. A. Keane†
† Department of Computation, UMIST, Manchester, UK; ‡ Department of Computing & Mathematics, Manchester Metropolitan University, UK.
Abstract. Temporal logic is useful for reasoning about time-varying properties. To achieve this it is important to validate temporal logic specifications. The decision problem for propositional temporal logic is PSPACE-complete and thus automated tableau-based theorem provers are both computationally and resource intensive. Here we analyse a standard temporal tableau method using parallelism in an attempt both to reduce the amount of storage needed and to improve efficiency. Two shared memory parallel systems are presented which are based on this decomposition of the algorithm but differ in the configuration of parallel processes.
1 Introduction
Temporal logic has been shown [4] to be a powerful tool for reasoning about properties that vary over time. It is important that we are able to test specifications expressed in temporal logic for validity. A decision procedure for determining the validity of a temporal logic formula is given in [6]. As the decision problem for propositional temporal logic (PTL) is PSPACE-complete, and because temporal specifications are typically large, an automated theorem prover based on this algorithm would be computationally intensive and would potentially require a large amount of memory. The aim of this work is to address ways in which we can minimise the amount of memory required and reduce the computational intensity by utilising parallel systems within temporal tableaux. Wolper's algorithm [6] has two phases in which a graph (representing possible models of the temporal formula) is constructed and, when construction is complete, systematically reduced. Parallelising Wolper's algorithm has two potential benefits: (1) the reduction of the amount of memory consumed by deleting nodes from the graph during the construction phase; and (2) improving overall efficiency by decreasing the overall time taken. The paper is structured as follows: in §2 we define PTL; in §3 we describe Wolper's decision procedure for PTL; in §4 we analyse the algorithm in order to identify those operations which can be performed on the graph in parallel;
* This work was partially supported by EPSRC grant GR/J48979
from this, in §5 we outline a shared memory parallel program design and show how, by grouping together different operations, the granularity of processes can be refined; in §6 we present results of the parallel temporal tableaux algorithm on a range of parallel systems; conclusions are presented in §7.
2 Propositional Temporal Logic (PTL)
Here we consider a discrete linear future-time temporal logic [1]. The alphabet of this logic is that of classical propositional logic augmented with the unary operators □ ("always"), ◇ ("sometime"), ○ ("next-time") and the binary operators U ("until") and W ("unless"). Informally, □f means that f is true in all future states, ◇f means that f is true in some future state, ○f means that f is true in the next state and fUg means that f is true in all future states until g becomes true. fWg is a weaker version of fUg where g is not guaranteed to occur. An interpretation for PTL is a pair (σ, s0) where σ provides an interpretation for the propositions of the language in each state and s0 is some initial state. As we are dealing with a linear model of time, each state has a unique successor state. Further discussion of temporal logics can be found in [1].
3 Wolper's Decision Procedure for PTL
Wolper's method for determining the validity of a temporal logic formula is an extension of the semantic tableau decision procedure for propositional logic [1]. As a refutation procedure, it attempts to construct a model for the negation of the formula to be tested. If a model cannot be found, then the negation is unsatisfiable and hence the initial formula is valid. The algorithm works on the observation that a PTL formula can be decomposed into subformulae that are true in the current state or the next state in time. Thus we can attempt to construct a model state by state and test for satisfiability. The algorithm consists of two phases: Graph Construction and Graph Reduction. In the construction phase the logical structure of the formula is abstracted into the form of a graph, whose nodes are labelled with sets of formulae. Models are represented by infinite paths in the graph. The graph is built by applying certain rules to the formulae depending on whether they are disjunctive or conjunctive. Once construction is complete, nodes are deleted according to some reduction rules. If this reduction phase deletes the entire graph then the original formula is valid. We can classify temporal logic formulae into four types: α formulae (conjunctions); β formulae (disjunctions); next-time formulae (those whose leading connective is ○) and literals (atomic propositions and their negations). Elementary formulae are defined to be either literals or next-time formulae. During construction, tableau expansion rules are applied to α and β formulae; these are:
α → {{α1, α2}} and β → {{β1}, {β2}}.
The α and β sub-formulae are given in Table 1.
Table 1. α and β formulae
α          α1     α2
A1 ∧ A2    A1     A2
□A         A      ○□A

β          β1     β2
◇B         B      ○◇B
B1 U B2    B2     B1 ∧ ○(B1 U B2)
B1 W B2    B2     B1 ∧ ○(B1 W B2)
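Table 1 translates directly into two small decomposition functions. The sketch below assumes a hypothetical tuple encoding of formulae, e.g. ("and", A, B), ("always", A), ("sometime", B), ("next", A), ("until", B1, B2), ("unless", B1, B2); the constructor names are ours, not part of the paper.

def alpha_parts(f):
    # Return (alpha1, alpha2) for conjunctive formulae, else None.
    if f[0] == "and":
        return f[1], f[2]
    if f[0] == "always":                        # []A: A and ()[]A
        return f[1], ("next", f)
    return None

def beta_parts(f):
    # Return (beta1, beta2) for disjunctive formulae, else None.
    if f[0] == "sometime":                      # <>B: B or ()<>B
        return f[1], ("next", f)
    if f[0] in ("until", "unless"):             # B1 U/W B2: B2 or (B1 and ()(B1 U/W B2))
        return f[2], ("and", f[1], ("next", f))
    return None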
We now semi-formally outline the graph construction and graph reduction stages of the algorithm (for a more formal description, see [6]).
3.1 Graph Construction
The digraph G(N, E, L), where N is the set of nodes (which may be marked or unmarked) in the graph, E is the set of edges and L is a function which maps a node n to a set of formulae L_n, is constructed as follows:
1. Start with N := {n1}, where n1 is the initial node and is labelled with the set of formulae L_{n1} := {f}, where f, the initial formula, is the negation of the formula to be tested. E := ∅.
2. While there exists an unmarked node n ∈ N, labelled with the set of formulae L_n, repeatedly apply the following rules:
(a) if f ∈ L_n where f is an unmarked non-elementary formula, then for all S_i in the tableau expansion rule (f → {S_i}), create a new node n' labelled with L_{n'} := L_n − {f} ∪ {S_i} ∪ {f*}, where f* is f marked. N := N ∪ {n'} and E := E ∪ {(n, n')}.
(b) if L_n contains only elementary and/or marked formulae, create a new node n' where L_{n'} := {f | ○f ∈ L_n}, N := N ∪ {n'} and E := E ∪ {(n, n')}.
N.B. when we write N := N ∪ {n'}, we mean that n' is added to N only if ∀n_i ∈ N. L_{n_i} ≠ L_{n'}; otherwise an edge to the existing node is created. Nodes added by rule 2(b) are called prestates and the parent of a prestate is called a state.
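A compact Python sketch of this construction phase is given below. Node labels are frozensets of formulae; the expansion rule is passed in as a function expand(f) that returns the sets {S_i} for a non-elementary formula and None for elementary or marked formulae (such a function can be assembled from the Table 1 helpers sketched earlier, marking being modelled by wrapping a formula as ("*", f)). This illustrates rules 1 and 2 only and terminates when the set of reachable labels is finite; it is not the implementation evaluated in the paper.

def build_graph(initial_formula, expand):
    labels = {}                      # node id -> frozenset of formulae
    edges = set()
    prestates = set()
    by_label = {}                    # label -> node id, to reuse identical labels
    unmarked = []                    # worklist of nodes still to be processed

    def is_next(f):
        return isinstance(f, tuple) and f[0] == "next"

    def add_node(label, parent=None, prestate=False):
        node = by_label.get(label)
        if node is None:
            node = len(labels)
            labels[node] = label
            by_label[label] = node
            if prestate:
                prestates.add(node)
            unmarked.append(node)
        if parent is not None:
            edges.add((parent, node))
        return node

    add_node(frozenset([initial_formula]))
    while unmarked:
        n = unmarked.pop()
        label = labels[n]
        target = next((f for f in label if expand(f) is not None), None)
        if target is not None:       # rule 2(a): expand one non-elementary formula
            for s_i in expand(target):
                add_node(label - {target} | set(s_i) | {("*", target)}, parent=n)
        else:                        # rule 2(b): create the next prestate
            add_node(frozenset(f[1] for f in label if is_next(f)),
                     parent=n, prestate=True)
    return labels, edges, prestates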
3.2 Graph Reduction
The digraph G is reduced by repeatedly applying the following reduction rules:
1. Delete nodes containing complementary formulae: ∀n ∈ N. ∃f ∈ L_n. (¬f ∈ L_n ⇒ (N = N − {n})).
2. Delete nodes with no successors: ∀n ∈ N. (∀n_i ∈ N. (n, n_i) ∉ E) ⇒ (N = N − {n}).
3. Delete prestates containing unsatisfiable eventualities:
(a) ∀n ∈ N. prestate(n). ∃◇f ∈ L_n ∧ ∀n_i ∈ N. reachable_from(n_i, n) ∧ f ∉ L_{n_i} ⇒ (N = N − {n})
(b) ∀n ∈ N. prestate(n). ∃fUg ∈ L_n ∧ ∀n_i ∈ N. reachable_from(n_i, n) ∧ g ∉ L_{n_i} ⇒ (N = N − {n})
N.B. We say that the formulae ◇g and fUg have associated eventuality g because both formulae imply that eventually there is a state in which g is satisfied. The predicate reachable_from(n_i, n_j) is true iff there exists a directed path in the graph from n_j to n_i.
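The reduction phase can be sketched in the same style, operating on the labels, edges and prestates produced by the construction sketch above. Negation is assumed to be encoded as ("not", f) and eventualities are read off ("sometime", g) and ("until", f, g) formulae; passes are repeated until no rule fires. If nothing survives, the original (unnegated) formula is valid.

def reduce_graph(labels, edges, prestates):
    def negate(f):
        return f[1] if isinstance(f, tuple) and f[0] == "not" else ("not", f)

    def reachable(n):                         # nodes reachable from n (including n)
        seen, stack = {n}, [n]
        while stack:
            m = stack.pop()
            for (a, b) in edges:
                if a == m and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    def eventualities(label):
        return [f[2] if f[0] == "until" else f[1]
                for f in label
                if isinstance(f, tuple) and f[0] in ("sometime", "until")]

    changed = True
    while changed:
        changed = False
        for n in list(labels):
            label = labels[n]
            contradictory = any(negate(f) in label for f in label)          # rule 1
            no_successor = all(a != n for (a, b) in edges)                   # rule 2
            unsatisfied = n in prestates and any(                            # rule 3
                all(g not in labels[m] for m in reachable(n))
                for g in eventualities(label))
            if contradictory or no_successor or unsatisfied:
                del labels[n]
                edges -= {e for e in edges if n in e}
                prestates.discard(n)
                changed = True
    return labels, edges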
4 Opportunities for Parallelism
To identify possibilities for parallel activity, we divide the sequential algorithm into separate sub-problems. As described above, expansion rules are repeatedly applied to the sets of formulae labelling pre-states. A new node is then added to the graph for each of the new formula sets derived (if no node exists with an identical label) and then a prestate is derived from these states. These prestates can be expanded in parallel as can the two formula sets created upon the application of a β expansion rule. When a set of formulae is fully expanded, a new node is added to the graph only if there is no other node with the same label. Therefore, the graph has to be searched for duplicate labels. Here, two approaches were considered: (1) nodes are added to the graph and a marker process can be used to detect duplicate labels; or (2) the graph is searched for a label before any node is added to the graph. The advantage of the first approach is that it does not restrict the amount of parallel activity, but the graph may become unnecessarily large and the marker process could become overloaded. The second approach means that the graph size is kept to a minimum as no speculative work is performed by processes applying tableau expansion rules and adding nodes to the graph. Searching the graph before adding a node can be done in parallel, but synchronisation is required between a process searching the graph and a process adding a new node to the graph. The label of any node added to the graph must be visible to a process concurrently searching the graph for a label to be certain that no two nodes have the same label. This constraint may result in a process having to lock the entire graph while it searches for a particular label. The graph is reduced by deleting nodes whose labels contain contradictory formulae, which have no successors, or whose labels contain unsatisfied eventualities.
The first of these rules can be applied during construction by simply discarding the contradictory formula set, as no global checking of the graph is required. The other two reduction rules need more consideration. If a process is to delete a node during construction because it has no successors, it must ensure that the node will never have any children, and similarly, to delete a node because it contains an unsatisfied eventuality a process must be certain that the eventuality will never be satisfied. Coordination is again needed between a process deleting nodes from the graph and a process simultaneously traversing the graph, for instance during search, to prevent the actions of the latter from being corrupted. Therefore, it may be more efficient to concurrently mark nodes for deletion, yet actually delete them sequentially at some future point. As a result we can divide the algorithm into the following operations which may be performed in parallel (these operations are similar to those presented in
[3]):
OP1 - applying tableau expansion rules to sets of formulae;
OP2 - searching the graph for duplicate node labels;
OP3 - adding a new node to the graph;
OP4 - identifying nodes which contain contradictory formulae;
OP5 - identifying nodes with no successors;
OP6 - identifying prestates which contain unsatisfied eventualities;
OP7 - deleting nodes identified by OP4, OP5 or OP6.

5 Program Design
Heterogeneous
Processes - Model 1
This model consists of four processes: two c o n s t r u c t o r s ; a searcher; and a checker. Each process has read and write access to a shared structure - the graph. The constructors obtain and expand unmarked labels from the graph (0P1). Upon the application of a ~ rule, two sets of formulae are created, a constructor retains one set and pushes the other onto a shared queue (to which both constructors have access). Sets which contain complementary formulae are simply discarded by the constructor (0P4). Once a formula set has been obtained from the graph (or the shared queue), the constructor does not require further access to the graph and thus the amount of coordination required between constructors is small. Therefore, we allow multiple instances of the constructor process in our model. Fully expanded formula sets are passed to the searcher via queue 1.
857 eventuality look-up table
GRAP H
\ ~
queue 1
\ ]
)
queue2
Fig. 1. Program Design with 4 Heterogeneous Processes
The searcher performs 0P2 and OP3 and instructs the checker t h a t the graph has been updated via queue 2. The checker is responsible for the identification of nodes which can be deleted while the graph is being constructed. The process of identifying nodes that can be deleted is similar to the problem of garbage collection in functional p r o g r a m m i n g languages [5]. The two methods described below for 0P5 and 0P6 are similar to reference count garbage collection m e t h o d [5]. To perform 0P5 a process has to be sure that a node with no successor nodes can never have any more before marking it for deletion. This is achieved by m~intaining a reference count for each node. When a node is added to the graph, its reference count is set to 1. The node's count is incremented when a/3 rule is applied during its expansion, and is decremented when a formula set is discarded due to 0P4 or when a node or edge is added from it. Thus, if a node's reference count is equal to zero and it has no children, a process can be sure t h a t no more successors will be added to the node. In addition to building up a picture of which eventualities are satisfied at which nodes, a process performing 0P6 must also identify when an eventuality can never be satisfied, i.e. when a node cannot reach any more nodes as a result of a node or edge being added to the graph. When a node is added to the graph, this new label has to be checked to see whether it satisfies any of the eventualities that can reach it and when an edge (hi, n2) is added to the graph, all the nodes that n2 can reach must be checked against all the eventualities t h a t can reach nl. Therefore a process has to identify the subgraph that can be reached by a particular node. The identification of subgraphs is complicated by the existence of loops in the graph and some measure has to be taken to detect this. As the process traverses the graph identifying nodes in a subgraph, it tags each node through which it passes, thus ensuring that it passes through a node once only. The system must record where eventualities are satisfied and so an Eventuality Look- Up Table is maintained which contains information on to which node an eventuality is associated and which nodes satisfy it. In order to delete a node because it is labelled with an unsatisfied eventuality a process must be certain
858 that the node can have no more descendants. This is achieved by maintaining a set, reachsetn, of node identifiers for each node n. This set contains the identifiers of the nodes in the subgraph reachable from n which could yet have more successors, i.e. their reference count is not zero. These reachsets can be maintained during the identification of subgraphs. As both 0P5 and 0P6 involve the identification of subgraphs, we combine them into one checker process. The checker is notified of any additions to the graph by the searcher. If a node, n, is added it obtains any eventualities associated with the subgraph that can reach n, updating the relevant reachsets as it does so, and checks t h e m against the label of n. If an edge (hi, n2) is added it identifies the subgraph, SG1, that can reach nl and the subgraph, SG~, reachable from n2, again updating the relevant reachsets and checks the labels in SG2 against any eventualities associated with SG1. The actions of a process deleting nodes from the graph (0PT) and those of a process traversing the graph have to be synchronised to ensure that the former process does not corrupt the latter. Locking each node upon access could remove this possibility but could prove unnecessarily restrictive, especially in an example where no nodes can be deleted concurrently. Here the approach taken for 0P7 is to suspend the parallel processes and delete nodes sequentially when a predefined number of nodes can be deleted. 5.2
Homogeneous
Processes
~ Model
2
Model ] was designed to minimise inter-process synchronisation by grouping similar operations into the same process. By statically determining the configuration of the processes in this way it is unlikely that the load across the four processes will be balanced. Neither is model 1 particularly scalable: we can allow more instances of the constructor process but we are restricted to single instances of the seacher and checker processes because of the need for sequential update. An alternative model has four homogeneous coarse-grained processes which sequentially perform {:}P1 to 0P6. Here {:]P2, 0P3, {3P5 and 0P6 are only performed once a formula set has been fully expanded and so no process is idle as long as there are enough sets of formulae needing expansion. We now have more than one process able to update the graph and so some synchronisation is required. Before 0P3 is performed we must ensure that the graph contains no node with label similar to the new label. Thus, we have to deal with the possibility that two processes m a y a t t e m p t to add nodes with identical labels at the same time. To facilitate this only one insertion point for new nodes is permitted. When a process has completed 0P2, it a t t e m p t s to lock the last node of a linked list (the point of insertion). If, when the lock is obtained, the locked node is no longer the last node of the linked list, the process knows that more nodes have been added by another process and {3P2 is not complete and so the process again searches to the end of the linked-list and a t t e m p t s to lock the last node. If the locked node is still the last node in the list, then the process can be sure that no more nodes will be added until it releases the lock and thus the graph can never contain two nodes with identical labels. Similarly, tiP5 and 0P6 need access to
859 the entire graph and so some synchronisation is required - a node is locked when its reference count or its reachset is updated. 0P7 is again performed at some fixed points during construction.
6
Results
The two models were implemented on a Silicon Graphics 4-processor shared m e m o r y machine. A sequential version was also implemented for comparison purposes. These programs were tested on four reasonably large formulae which have previously been used to test sequential temporal tableau systems [2]. The sizes of the four formulae are summarised in Table 2. Table 2. Summary of Test Formulae Formula' L e n g t h / c h a r s N u m b e r of S u b f o r m u l a e 213 1 876 2 422 142 3 116 99 130 4 418
Each program - sequential, model I and model 2 - require some preprocessing, this involves converting the formula to post-fix form and building a formula table so that a formula can subsequently be referred to simply by an identifier. These preprocessing times are not included in Table 3, containing the timings of the three programs on the four formulae. As an indicator of the nature of the computation involved, it is useful to compare the m a x i m u m size of the graph produced in the sequential case, where deletions are carried out once the graph is fully constructed, with that of the parallel case, when nodes are deleted during construction. These are also shown in Table 3. Table 3. Timings for Each Model & Maximum Graph Size for Sequential and Parallel Models TIME M A X G R A P H SIZE F o r m u l a Sequential Model 1 I Model 2 Sequential I Parallel 1 0m41.08s 0m32.97 ~0m16.57s 1 1 2 28m31.33s 19m27.28s 13m47.13s 197 84 3 0m31.29s 0m49.24s 0m19.87s 257 257 4 3m58.62s 4m51.68s 2m19.32s 134 130 1
As can be seen in Table 3, Model 2 exhibits some speed-up on all four formulas whereas Model 1 performs better than the sequential implementation on only
860 two occasions. This may be explained by the dependencies between operations, for example DP2 and 0P3 rely on 0P1, and by the fact that it is difficult to predict the nature of a computation for a given example. Model 1 assigns operations to processors statically, whereas Model 2 is more dynamic, operations 2, 3, 5 and 6 are performed on demand. Model 2 does not exhibit ideal speed-up because of the amount of inter-process synchronisation required when accessing the graph and because of the extra work needed to delete nodes concurrently. 7
Conclusions
and
Further
Work
Automated reasoning methods for temporal logics consume a large amount of memory and can be slow. This restricts the size of temporal theorem that can be validated, Our aim, in parallelising the decision procedure for these logics, was two-fold: to reduce the time taken to prove a theorem; and to reduce the size of graph produced in proving a theorem. The algorithm was divided into seven concurrent operations. Two parallel models were designed and implemented along with a sequential algorithm. The heterogeneous model proved to be slower than the sequential algorithm for two of the four formulas tested, whereas the second homogeneous model exhibited satisfactory speed-ups for all four formulas. This can be attributed to the more dynamic nature of the second model. We have also compared the m a x i m m n size of the graph produced for the sequential and parallel cases. The sequential algorithm deletes nodes from the graph only when construction is complete whereas the parallel models a t t e m p t to delete as many nodes as possible during construction. A considerable amount of deletion was possible for formula 2 (over 50%), but very little parallel deletion was achieved for the other formulas. If the amount of concurrent reduction possible could be predicted by some pre-analysis, efficiency may be improved by deciding not to test for concurrent reduction if it were deemed unlikely. Unfortunately, formulas 2 and 4 are similar in length and structure and yet formula 2 allows much more reduction. The parallel temporal logic theorem prover outlined above achieves reasonable speed-up on all the temporal formulas tested. We have also shown that significant parallel graph reduction is possible for some but not all formulas. Thus, for some cases, a parallel theorem prover may not be able to handle larger formulae by concurrently reducing the graph. References 1. M. Ben-Ari. Mathematical Logic for Computer Science. Prentice Hall, 1993. 2. G. D. Gough. Decision Procedures for Temporal Logic. Technical Report UMCS89-10-1, Department of Comp. Sci., Univ. of Manchester, 1984. 3. R. Johnson. A BlackBoard Approach to Parallel Temporal Tableaux. In Proc. of Artificial Intelligence, Methodologies, Systems, and Applications, World Scientific, 1994.
861 4. Z. Manna and A. Pneuli. Verification of Concurrent Programs: The Temporal Framework. In R.S. Boyer and J.S. Moore (Eds.), The Correctness Problem in Computer Science, Academic Press, 1982. 5. R. Plasmeijer and M. van Eekelen. Functional Programming and Parallel Graph Rewriting. Addison-Wesley, 1993. 6. P. Wolper. Temporal Logic Can Be More Expressive. Information and Control, 56, p72-99, 1983.
Workshop 10+17+21+22: Theory and Algorithms for Parallel Computation. Bill McColl and David Walker, Co-chairmen
Theory and Algorithms for Parallel Computation
Bridging models such as BSP and LogP are playing an increasingly important role in scalable parallel computing. They provide a convenient architectureindependent framework for the design, analysis, implementation and comparison of parallel algorithms and data structures. The Theory and Algorithms sessions at EuroPar '98 will cover a number of important issues concerning models of parallel computation. The topics covered will include: The BSP model and its relationship to LogP, BSP algorithms for PC clusters, distributed shared memory models, computational complexity of parallel algorithms, BSP algorithms for discrete event simulation, cost modelling of shared data types, systolic algorithms, cache optimisation, optimal scheduling of task graphs.
Parallel I/O
In recent years it has become increasingly apparent that I/O presents significant challenges to the goal of making parallel applications truly portable, while at the same time maintaining good performance. The need for a parallel I/O standard has become particularly acute as more and more large-scale scientific applications have migrated to parallel systems. Many of these applications produce hundreds of gigabytes of output, which would result in a serious bottleneck if handled sequentially. Parallel I/O is also of central importance in very large-scale problems requiring out-of-core solution, and is closely related to the subject of distributed file systems. Hardware vendors have developed customized parallel I/O subsystems for their particular platforms, but at most this only addresses portability within a vendor's family of systems. Recently, the MPI Forum published a standard specification for parallel I/O called MPI-IO. This has not yet been widely implemented, and much research remains to be done before it can be adequately assessed. The Scalable I/O Initiative has also addressed related issues. Nevertheless, parallel I/O remains a wide open field, and a fertile research area. The parallel I/O session of EuroPar'98 contains three papers that address different aspects of parallel I/O. The first, by Jensch, Lüling, and Sensen, is of interest to anyone who has ever been frustrated by the slow response time of a popular web site. They consider the problem of how the average response
time can be speeded up by mapping the site's files over multiple disks in such a way that the overall throughput from the disks to the outside world is maximised. This is done by assigning files to disks in such a way that collisions, i.e., quasi-simultaneous accesses to files on the same disk, are minimised. A graph theoretic approach is used to solve the optimisation problem, and found to be effective when compared with a random assignment strategy. The second paper, by Schikuta, Fuerle, and Wanek, describes ViPIOS, the Vienna Parallel I/O System. ViPIOS uses a client-server approach to parallel I/O which seeks to achieve maximum I/O bandwidth by reacting dynamically to the behavior of the application. Applications are the clients whose I/O requests are serviced by one or more ViPIOS servers. ViPIOS can be used either as an independent sub-system, or as a runtime library. The last paper, by Dickens and Thakur, describes different approaches to two-phase I/O, and presents performance results for the Intel Paragon. The two-phase approach is a technique often used for performing parallel I/O in which processors coordinate their I/O requests so that a few requests for large contiguous blocks of data are made, rather than a larger number of requests for fragmented data. This reduces latency and improves I/O performance. One phase combines I/O requests, and the other phase actually performs the I/O. The three papers presented in this session give some idea of the diversity of ideas currently being pursued in the search for a solution to the parallel I/O problem. For those interested in finding out more about the subject, the parallel I/O archive at Dartmouth is an excellent place to begin: http://www.cs.dartmouth.edu/pario/.
BSP, LogP, and Oblivious Programs
Jörn Eisenbiegler, Welf Löwe, Wolf Zimmermann
Institut für Programmstrukturen und Datenorganisation, Universität Karlsruhe, 76128 Karlsruhe, Germany, {eisen|loewe|zimmer}@ipd.info.uni-karlsruhe.de
Abstract. We compare the BSP and the LogP model from a practical point of view. Using compilation instead of interpretation improves the (best known) simulations of BSP programs on LogP machines by a factor of O(log P) for oblivious programs. We show that the runtime decreases for classes of oblivious BSP programs if they are compiled into LogP programs instead of executed directly using a BSP runtime library. Measurements support the statements above.
1 Introduction
Parallel programming suffers from the lack of a uniform and commonly accepted machine model providing abstract programming of parallel machines and describing their costs adequately. Two candidates, the BSP model (Valiant [9]) and the LogP model (Culler et al. [4]), have been considered in an increasing number of papers. The comparison of the two models in [2] determines the delays for a simulation of LogP programs on the BSP machine and vice versa. For our observations we make two additional assumptions: we only consider oblivious programs and message passing architectures. A program is oblivious if source and destination processors of communications are statically determined.¹ The target machines are processor-memory nodes connected by a communication network. We explicitly exclude shared memory machines. Virtual shared memory architectures are covered since they implicitly require communication via the interconnection network for remote memory operations. We only mention send and receive communications and deliberately ignore remote store and load operations. The first part of our paper shows that oblivious BSP programs can be compiled to the LogP machine. We further show that compilation reduces the delay of the simulation of oblivious BSP programs on the LogP machine by a factor of O(log(P)) compared to the result in [2], i.e., there is asymptotically no delay for the compiled LogP program compared to the BSP program. Even better, it turns out that the compiled LogP program could outperform a direct execution of the BSP program on the same architecture. To sharpen this observation, we consider three classes of oblivious programs: first we study Multiple-Program-Multiple-Data (MPMD) solutions. Those programs are in general hard to partition into
¹ Many algorithms, especially those in scientific computing, are oblivious, e.g. Matrix Multiplication, Discrete Simulation, Fast Fourier Transform, etc.
supersteps, i.e. into global phases of receive, compute, and send operations. As an example, we discuss the optimal broadcast problem. The second and third classes of problems allow Single-Program-Multiple-Data (SPMD) solutions and there is a natural partition into supersteps. They differ in the data dependencies between the phases: in the second class there are sparse dependencies between the phases, in the third class these dependencies are dense. We call a data dependency sparse iff the BSP communication of each superstep lasts longer than the corresponding communication in the compiled LogP program. As a representative of the second class of problems, we discuss a numeric wave simulation. The fast Fourier transform is a representative of the third class. Measurements of the parameters and run time results for an optimal broadcast, the simulation and the FFT support our theoretical results.
2
The
Machine
Models
This section describes the two machine models in a little more detail. In order to distinguish between the parameters of both models, we use capital letters for the parameters of the LogP model and small letters for the parameters of the BSP model.
2.1 The LogP Model
The LogP model assumes a finite number P of processors with local memory, which are connected by a data network. It abstracts from the network topology, presuming that the position of a processor in the network has no effect on communication costs. Each processor has its own clock; synchronization and communication are done via message passing. All send and receive operations are initiated by the processor which sends or receives, respectively. From the programmer's point of view, the network has no direct connection to the local memory. All communication is done via the processor. In the LogP model, communication costs are determined by the parameters L, O, and G. Sending a message costs time O (overhead) on the processor. The time the network connection of this processor is busy with sending the message into the network is bounded by G (gap). A processor can not send or receive two messages within time G, but if a processor returns from a send or receive routine, the difference between gap and overhead can be used for computation. The time between the end of sending a message and the start of receiving this message is defined as the latency L. There are at most ⌈L/G⌉ messages in transit from any or to any processor at any time; otherwise, the communication stalls. We only consider programs satisfying this capacity constraint. If the sending processor is still busy with sending the last bytes of a message while the receiving processor is already busy with receiving, the send and the receive overhead for this message overlap. In this case the latency is negative. This happens on many systems, especially for long messages or if the communication protocol is too complicated. L, O, and G have been determined for quite a number of machines; all works confirmed
runtime predictions based on the parameters by measurements. In contrast to [1] and [5], we assume the LogP parameters to be constants (as proposed in the early LogP works [4]). This assumption is admissible if the message size does not vary in a single program.
2.2 The BSP Model
The BSP (bulk synchronous parallel) machine was defined by Valiant [9]. We refer to its modification by McColl in [8,6]. Like the LogP model, the BSP model assumes a finite number P of processors with local memory, a local clock, and a network connection to an arbitrary network. It also abstracts from the network topology. In contrast to the LogP model, the BSP machine can explicitly (barrier) synchronize all processors. The synchronization barriers subdivide the calculation into supersteps. All send operations in a superstep i are guaranteed to be completed before superstep i + 1. In the BSP model as invented by Valiant, processors communicate via remote memory access. For oblivious programs we may focus on message based communication: Consider the remote memory accesses of one superstep. If processor π_i reads from processor π_j, then in the general case π_i should send a request to π_j and π_j sends its answer. However, for oblivious algorithms, it is already known that π_i reads from the memory of π_j. Thus, the request can be saved in the case of oblivious algorithms: it is sufficient to send the result of the read request from π_j to π_i. A write of processor π_i to the memory of processor π_j is equivalent to sending a message from π_i to π_j containing the memory address and the value to be written. The cost model uses two parameters: the time for the barrier synchronization l, and the reciprocal of the network bandwidth g. With these parameters, the time for one superstep is bounded by l + 2h·g + w, where h is the maximal number of messages sent or received by one processor and w is the maximal computation time needed by one processor in this superstep. A BSP machine is able to route an ⌈l/g⌉-relation in a superstep, which is a capacity constraint analogous to the LogP model. The total computation time is the sum of the times for all supersteps. Like for the LogP model, we assume the parameters to be constant.
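As a point of reference for the comparison that follows, the two cost formulas of this section can be written down directly. The parameter values in the example below are invented; only the formulas come from the models.

def logp_message_time(L, O, G):
    # One point-to-point message: send overhead, latency, receive overhead.
    return O + L + O

def bsp_superstep_time(l, g, h, w):
    # Upper bound l + 2*h*g + w for one superstep (h = max messages per processor).
    return l + 2 * h * g + w

if __name__ == "__main__":
    # Hypothetical parameters in microseconds, purely for illustration.
    print(logp_message_time(L=5.0, O=2.0, G=1.0))            # 9.0
    print(bsp_superstep_time(l=50.0, g=1.5, h=8, w=120.0))   # 194.0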
3 BSP vs. LogP for Oblivious Algorithms
In this section, we discuss the compilation of oblivious BSP programs to the LogP machine. First, we discuss how the communication between subsequent supersteps can be mapped onto the LogP machine without exceeding the LogP capacity constraints. Second, we define the actual compilation and prove execution time bounds for the compiled LogP programs. Third, we compare the direct execution of a BSP program on a target machine with the execution of the (compiled) LogP program. Therefore, we conclude this section by determining lower bounds for the BSP and LogP parameters, respectively, for the same target machine.
3.1 Communication of a Superstep on the LogP Machine
For simplicity, we assume that a h-relation is implemented, i.e. each processor sends and receives exactly h messages. It is well known that a "pipelined" communication can be computed using edge coloring on a bipartite graph (U, V, E) where U and V are the set of processors and (u, v) E E, iff u communicates with v. Each color, represented by an integer j E { 0 , . . . , k - 1) where k is the number of required colors, defines set of non-conflicting communications that can be started simultaneously. A send(v) on processor u is scheduled at time j - m a x ( O , G) and recv(u) on processor v at time L d- O + j . max(O, G). Since we consider oblivious BSP algorithms, the edge coloring of the communication graph for each superstep (and therefore the communication phase itself) can be computed prior to execution of the BSP algorithm. Thus, the time for edge coloring (O(]E I log(IV I + ]U]) due to [3]) can be ignored when considering the execution time of the BSP algorithm. It is easy to see that the schedule obtained from the above algorithm does not violate the capacity LogP constraints if L > (h - 1) - max(O, G). If follows L e m m a 1. I f L > (h - 1) max(O, G), then every fixed h-relation can be implemented on the LogP machine such that its execution time is L + 2 0 + (h 1) max(O, G). If L < (h - 1) max(O, G) the scheduling algorithm must be modified to avoid stalling. First we discuss the simplified model where each channel is allowed to contain an arbitrary number of messages. In this case, the first receive operation is performed after the last send operation. The same approach as above then yields execution time 2 0 + 2 ( h - 1 ) max(O, G) since the first receive operation can be performed at time O + ( h - 1 ) m a x ( O , G) instead o f O + L < ( h - l ) max(O, G). If the number of messages is bounded, the message is received greedily, i.e. as soon as possible after a send operation on the processor is finished at the time when the message arrives. This does not increase the overall execution time of a communication phase since no new gaps are introduced. Thus, the following lemma holds: L e m m a 2. If L < ( h - 1)max(O, G), then every fixed h-relation can be implemented on the LogP machine such that its execution time is 2 0 + 2(h -
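A minimal sketch of the schedule just described (my own illustration with made-up processor names, not code from the paper) for the stall-free case L ≥ (h − 1)·max(O, G):

```python
def schedule_h_relation(colored_edges, L, O, G):
    """colored_edges: (color_j, sender, receiver) triples from an edge coloring
    of the bipartite communication graph. Returns (sender, receiver,
    send_time, recv_time) tuples following the two scheduling formulas:
    send at j*max(O, G), receive at L + O + j*max(O, G)."""
    g = max(O, G)
    return [(u, v, j * g, L + O + j * g) for j, u, v in colored_edges]

# Example: a 3-relation between processors p0 and p1 (colors 0, 1, 2),
# parameter values taken from Table 1 (microseconds)
edges = [(j, "p0", "p1") for j in range(3)] + [(j, "p1", "p0") for j in range(3)]
for row in schedule_h_relation(edges, L=17.1, O=9.0, G=9.8):
    print(row)
```

The last receive starts at L + O + (h − 1)·max(O, G) and finishes one overhead O later, which reproduces the L + 2O + (h − 1)·max(O, G) bound of Lemma 1.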
3.2 Execution Time Bounds for the Compiled LogP Programs
The actual compilation of an oblivious BSP algorithm to the LogP machine is now straightforward: each BSP processor corresponds one to one to a LogP processor. Beginning with the first superstep, we map the tasks of each BSP processor to the corresponding LogP processor in the same order. Communication is mapped as described in the previous subsection. Since a processor can proceed with its execution as soon as it has received all its messages of the preceding superstep, there is no need for a barrier synchronization. Together with Lemmas 1 and 2, this observation leads to the following theorem.
Theorem 1 (Simulation of BSP on LogP). Every superstep of an oblivious BSP algorithm with work w and h remote memory accesses can be implemented on the LogP machine in time w + 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G)).

If we choose g = max(O, G) and l = max(L, (h − 1) max(O, G)) + 2O − max(G − O, 0), then the execution time of a BSP algorithm in the BSP model and of the compiled BSP algorithm in the LogP model is the same. In particular, the bound for the simulation in [2] is improved by a factor of log P for oblivious BSP programs.
3.3 Interpretation vs. Compilation
For the comparison of a direct execution of a BSP program with the compiled LogP program, the parameters for the BSP machine and the LogP machine, respectively, cannot be chosen arbitrarily. They are determined by the target architecture; for comparable runtime predictions we must choose the smallest admissible values for the respective model. A superstep implementing an h-relation costs w + l + h·g in the BSP model. According to Theorem 1, it can be executed in time w + 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G)) on the LogP machine. The speedup is the ratio of these two numbers. Easy calculations prove the following

Corollary 1 (Interpreting vs. Compiling). Let M be a parallel computer with BSP parameters l, g, and P and LogP parameters L, O, G, and P. Let A be an oblivious BSP algorithm where each superstep executes at most h remote memory accesses. If l ≥ O + max(L, (h − 1) max(O, G)) and g ≥ max(O, G), then the execution time of the compiled BSP algorithm on the LogP model is not slower than the execution time of the BSP algorithm on the BSP model. If one of these inequalities is strict, then the execution of the compiled BSP algorithm is faster.
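The corollary can be checked numerically; the following sketch (my own illustration, not code from the paper) compares the BSP prediction of a superstep with the compiled-program bound of Theorem 1:

```python
def bsp_cost(w, h, l, g):
    # BSP prediction for one superstep implementing an h-relation
    return w + l + h * g

def compiled_logp_cost(w, h, L, O, G):
    # Bound from Theorem 1 for the same superstep after compilation to LogP
    m = max(O, G)
    return w + 2 * O + (h - 1) * m + max(L, (h - 1) * m)

# Example with the IBM RS6000/SP parameters of Table 1 (microseconds)
w, h = 1000.0, 4
print(bsp_cost(w, h, l=502.0, g=30.1))
print(compiled_logp_cost(w, h, L=17.1, O=9.0, G=9.8))
```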
3.4 Lower Bounds for the Parameters
Let g* and l* (resp. max*(O, G) and L*) be the smallest values of the BSP parameters (resp. LogP parameters) admissible on a certain architecture. Our simulation implies that g* = O(max*(O, G)) and l* = O(L*). Together with the simulation result of [2], which proves a constant delay for the simulation of LogP algorithms on the BSP machine, we obtain

Theorem 2 (BSP and LogP parameters). For the smallest values of the BSP parameters g* and l* (resp. LogP parameters max*(O, G) and L*) achievable on any given topology, it holds that
g* = Θ(max(O*, G*))
and l* = Θ(L*),
provided that both machines run oblivious programs.
So far we have only considered the worst-case behavior of the two models. They are equivalent for oblivious programs in the sense that bi-simulations are possible with constant delay. We now consider the time for implementing a barrier synchronization and a packet router on topology networks.

Theorem 3. Let d be the diameter of a point-to-point network with P processors. Let d_g be the degree of the processors of the network. Each processor may route up to d_g messages in a single time step, but it receives and sends only one message at each time step². On any topology, the time l* of its BSP model is l* = Ω(max(d(P), log P)).

Proof. A lower bound for synchronization is broadcasting. Hence, d(P) is obviously a lower bound for l*. Assume an optimal broadcast tree [7] could be embedded optimally on the given topology. Its depth for P processors is O(log P), and, since the tree is optimal, a broadcast cannot complete faster. Hence, synchronization takes at least time Ω(log P).

Remark 1. Actually Ω(P^{1/3}) is a physical lower bound for l* and L* under the assumption that the processors must be laid out (in 3 dimensions) and signals have a run duration. Then the minimal average distance is Ω(P^{1/3}). Due to the small constant factors of this bound, we may abstract from the layout and model signal delay by discrete hops from one processor to its neighbors.

For many packet routing problems (specific h-relations) and topologies, the latency is considerably smaller than l*, which could lead to a speedup of the LogP programs compared to oblivious BSP programs. This hypothesis is addressed by the next section.
4 Subclasses of Oblivious Programs
In general, there is a higher effort in designing LogP programs instead of BSP programs. E.g., the proof that the BSP capacity constraints are guaranteed reduces to simply counting the number of messages sent in a superstep. Due to the asynchronous execution model, the same guarantee is not easy to prove for the LogP machine. BSP programs cannot deadlock, LogP programs can. Hence, for all considered classes, we will have to evaluate whether:
- the additional effort in programming directly a LogP program pays off in efficiency, or
- compilation of BSP programs to LogP programs speeds up the program's execution, or
- such a compilation cannot guarantee a speedup.
For our practical runtime comparisons, we determine the parameters of the two machine models for the IBM RS6000/SP with 16 processors³.

² This behavior led to the LogP model, where only receiving and sending a message takes processor time while routing is done by the communication network.
³ The IBM RS6000/SP is an IBM SP, but uses other processors. Therefore, we compiled the BSP tools without processor-specific optimizations.
Table 1. LogP vs. BSP parameters (IBM RS6000/SP), message size ≤ 16 bytes.
BSP:  l = 502 µs, g = 30.1 µs.
LogP: L = 17.1 µs, O = 9.0 µs, G = 9.8 µs.
4.1 The Parameters
We use the native message passing library for the LogP model and the Oxford BSP tools (see http://www.BSP-worldwide.org) for the BSP model. The results are shown in Table 1. For all measurements in this and the following sections we used the compiler option -O3 -qstrict and, for the BSP library, the option -flibrary-level 2.
4.2 MPMD-Programs
If an efficient parallel solution for a problem is hard to partition into supersteps, the BSP model is not appropriate. For those programs the LogP model seems preferable. Due to its asynchronous execution model, it avoids unnecessary dependencies between processors caused by synchronization. However, if the problem is low level, as our example broadcast is, the solution may be hidden in a library. It is not hard to extend the BSP model by a set of such library functions. Their implementation could be tuned using the LogP model.

Optimal Broadcast: A basic algorithm on distributed memory machines is the optimal broadcast of data from one processor to all others, which is used in many applications. For the LogP model, an optimal solution is given by Karp et al. in [7]. Each processor that has received the item immediately initiates a repeated sending to processors which have not yet received the item, until there are no such processors. We achieve the optimal BSP broadcast for our machine if the first processor sends messages to all others in one superstep. After synchronization the other processors receive their messages. This is optimal for our target machine, since all other implementations would need at least two synchronization steps, which alone cost more than the algorithm given above. The measured runtimes of broadcast for LogP and BSP on our machine can be seen in Table 2.

Remark 2. In the measurements for the BSP parameters, we used tagged messages where the tag identifies the sender. For an optimal broadcast, this identification of the sender is not necessary. This explains the difference between estimation and measurements. In contrast to the LogP model, it is not possible on the IBM RS6000/SP to receive a message and send it immediately without a gap. The LogP estimations ignore this property and are therefore too small.
However, even if we assume max(O, G) = g and L = l, the optimal LogP broadcast is a lower bound for the optimal BSP broadcast.
Table 2. Predictions vs. runtimes of broadcast, wave simulation, and FFT.
                         BSP measured | BSP estimated | LogP measured | LogP estimated
broadcast                822 µs       | 980 µs        | 101 µs        | 99.6 µs
simulation (n=1,000)     6.84 s       | 6.35 s        | 1.50 s        | 1.56 s
simulation (n=100,000)   20.6 s       | 19.4 s        | 14.9 s        | 14.24 s
FFT (n=1024)             3.16 ms      | 3.51 ms       | 2.65 ms       | 2.67 ms
FFT (n=16384)            34.8 ms      | 35.6 ms       | 39.3 ms       | 35.2 ms
The latter requires global synchronization barriers between subsequent send and receive operations. These synchronizations lead to delays on other processors if the LogP broadcast tree is not balanced by chance. For each send and receive pair, all processors have to be synchronized instead of two.
4.3 SPMD-Programs with Sparse Dependencies
Assume that a BSP process sends at most h messages to the subsequent supersteps. The dependencies in the BSP program are sparse for M if for the LogP and BSP parameters of M it holds that

l + 2h·g > 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G))    (1)
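A direct way to read inequality (1): the left-hand side is the BSP communication cost of a superstep, the right-hand side the LogP cost after compilation (Theorem 1). A small helper of my own (not from the paper) that evaluates the condition:

```python
def dependencies_are_sparse(h, l, g, L, O, G):
    """True iff inequality (1) holds, i.e. compiling the BSP program to LogP
    makes each communication phase of an h-relation strictly cheaper."""
    m = max(O, G)
    return l + 2 * h * g > 2 * O + (h - 1) * m + max(L, (h - 1) * m)

# The wave simulation below routes a 2-relation; with the Table 1 parameters:
print(dependencies_are_sparse(h=2, l=502.0, g=30.1, L=17.1, O=9.0, G=9.8))  # True
```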
If a BSP program communicates at most an h-relation from one superstep to the next, these communication costs are bounded by the right-hand side of inequality (1) in the compiled LogP program (due to Theorem 1). Then we expect a speedup if the BSP program is compiled instead of directly executed. However, if the problem size n is large compared to P, computation costs could dominate communication costs. Hence, if the problem scales arbitrarily, the speedup gained by compilation could approach zero for increasing n.

Wave Simulation: For simulating a one-dimensional wave, a new value for every simulated point is recalculated in every time step according to the current value of this point (y_0), its two neighbors (y_{-1}, y_{+1}), and the value of this point one time step before (y'_0). This update is performed by the function

f(y_0, y'_0, y_{-1}, y_{+1}) = 2·y_0 − y'_0 + (Δt/Δx)² · (y_{-1} − 2·y_0 + y_{+1}).
Since the recalculation of one node needs the values of its direct neighbors only, it is optimal to distribute the data balanced and block by block on the processors. Figure 1 sketches two successive computational steps and the required communication. The two programs for the LogP and BSP model only differ in the communication phase of each computation step. In both models, the processes communicate with their neighbors, i.e., the communication phase must route a 2-relation. The BSP program additionally performs a synchronization of all processors. For our target machine, the data dependencies are sparse, i.e., each communication phase is faster on the LogP machine than on the BSP machine, cf. Table 2.
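One local simulation step might look as follows (a sketch under the assumption of a simple halo exchange; function and variable names are mine, not the authors'):

```python
def wave_step(y, y_prev, dt_over_dx):
    """One local update step of the 1D wave simulation on a block of points.
    y, y_prev: current and previous values including one ghost cell at each end
    (filled by the 2-relation exchange with the neighboring processors)."""
    c = dt_over_dx ** 2
    y_new = y[:]  # keep ghost cells, update interior points only
    for i in range(1, len(y) - 1):
        y_new[i] = 2 * y[i] - y_prev[i] + c * (y[i - 1] - 2 * y[i] + y[i + 1])
    return y_new
```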
Fig. 1. A communication phase of the Wave Simulation.
4.4 SPMD-Programs with Dense Dependencies
We call the data dependencies of a BSP program dense for a target machine if they are not sparse. Obviously, for those programs compilation does not pay off, as the following example shows.

Fast Fourier Transform: We consider a one-dimensional parallel FFT and assume the input vector v to be of size n = 2^k, k ∈ ℕ. Furthermore, let n ≥ 2·P. Then the following algorithm requires only one communication phase: v is initially distributed block-by-block and we may perform the first log(n/P) operations locally. Then v is redistributed in a cyclic way, which requires an all-to-all communication. The remaining operations can also be executed locally. The computation is balanced, and a barrier synchronization is done implicitly on the LogP machine with the communication. Hence, we expect no noteworthy differences in the runtimes. The measurements in Table 2 confirm this hypothesis.
Remark 3. For the redistribution of data from block-wise to cyclic, we used a LogP-library routine which gathers data by memcopy calls with variable data length. For the BSP, such a routine does not exist and was therefore implemented by hand. This implementation allowed the compiler to use more efficient copying routines and explains the difference of the LogP runtime to its estimation and to the BSP runtime.
5 Conclusions
We have shown how to compile oblivious BSP algorithms to LogP machines. This approach improves the best known simulation delay of BSP programs on the LogP machine [2] by a factor of O(log P). It turns out that the two models are asymptotically equivalent for oblivious programs. We identified a subclass of oblivious programs that are potentially more efficient if directly designed for the LogP machine. Due to more comfortable programming, other programs are preferably designed for the BSP machine. However, among those we could identify another subclass that is more efficient if compiled to the LogP machine instead of executed directly. Others are not. Our measurements determine the parameters for a LogP and a BSP abstraction of an IBM RS6000/SP with 16 processors. They compare broadcast, wave simulation, and FFT programs as representatives of the described subclasses of oblivious programs. Predictions and measurements of the programs in both models confirm our observations. Further work could combine the best parts of both worlds. We could use the same classification to identify parts of BSP programs that are executed directly, namely the non-oblivious parts and those which do not profit from compilation, and compile the other parts. Low-level programs from the first class, like the broadcast, could be designed and tuned for the LogP machine and inserted into BSP programs as black boxes.
References
1. Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model. In 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95-105, 1995.
2. Gianfranco Bilardi, Kieran T. Herley, Andrea Pietracaprina, Geppino Pucci, and Paul Spirakis. BSP vs LogP. In SPAA '96: 8th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 25-32. ACM Press, June 1996.
3. R. Cole and J. Hopcroft. On edge coloring bipartite graphs. SIAM Journal on Computing, 11(3):540-546, 1982.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 93), pages 235-261, 1993.
5. Jörn Eisenbiegler, Welf Löwe, and Andreas Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC'97), pages 149-155. IEEE Computer Society Press, March 1997.
6. Jonathan M. D. Hill, Paul I. Crumpton, and David A. Burgess. Theory, practice, and a tool for BSP performance prediction. In Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par'96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 697-705. Springer, August 1996.
7. R. M. Karp, A. Sahay, E. E. Santos, and K. E. Schauser. Optimal broadcast and summation in the LogP model. ACM Symposium on Parallel Algorithms and Architectures, 1993.
8. W. F. McColl. Scalable computing. In Jan van Leeuwen, editor, Computer Science Today, number 1000 in Lecture Notes in Computer Science, pages 46-61. Springer, 1995.
9. Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), August 1990.
Parallel Computation on Interval Graphs Using PC Clusters: Algorithms and Experiments

A. Ferreira¹, I. Guérin Lassous², K. Marcus³ and A. Rau-Chaplin⁴

¹ CNRS, INRIA, Projet SLOOP, BP 93, 06902 Sophia Antipolis, France. ferreira@sophia.inria.fr.
² LIAFA - Université Paris 7, Case 7014, 2, place Jussieu, F-75251 Paris Cedex 05. guerin@liafa.jussieu.fr.
³ Eurecom, 2229, route des Crêtes - BP 193, 06901 Sophia Antipolis cedex - France. marcus@eurecom.fr.
⁴ Faculty of Computer Science, Dalhousie University, P.O. Box 1000, Halifax, NS, Canada B3J 2X4. arc@cs.dal.ca.

Abstract. The use of PC clusters interconnected by high performance local networks is one of the major current trends in parallel/distributed computing. We give coarse-grained, BSP-like, parallel algorithms to solve many problems arising in the context of interval graphs, namely connected components, maximum weighted clique, BFS and DFS trees, minimum interval covering, maximum independent set and minimum dominating set. All of the described p-processor parallel algorithms require only a constant or O(log p) number of communication rounds and are efficient in practice, as demonstrated by our experimental results obtained on a Fast Ethernet based PC cluster.
1 Introduction
The use of PC clusters interconnected by high performance local networks with raw throughput close to 1 Gb/s and latency smaller than 10 µs is one of the major current trends in parallel/distributed computing. The local networks are either realized with off-the-shelf hardware (e.g. Myrinet and Fast Ethernet), or application-driven devices, in which case additional functionalities are built in, mainly at the memory access level. Such cluster-based machines (called henceforth PCC's) typically utilize some flavour of Unix and any number of widely available software packages that support multi-threading, collective communication, automatic load-balance, and others. Note that such packages typically simplify the programmer's task by both providing new functionality and by promoting a view of the cluster as a single virtual machine. Clusters based on off-the-shelf hardware can yield effective parallel systems for a fraction of the price of machines using special purpose hardware. This kind of progress may thus be the key to a much wider acceptance of parallel computing, which has been postponed so far, perhaps primarily due to issues of cost and complexity. Although a great deal of effort has been undertaken on system-level and programming environment issues as described above, little attention has been paid to methodologies for the design of algorithms for this kind of parallel systems.
Despite the availability of a large number of built-in and/or highly optimized procedures, algorithms are still designed at the machine level, and claims to portability rest only on the fact that they are implemented using communication libraries such as PVM or MPI. In this paper we show that theoretical (BSP-like) coarse-grained models are well adapted to PCC's. In particular, algorithms designed for such models are portable and their theoretical and practical performance are closely related. Furthermore, they allow a reduction of the costs associated with software development since the main design paradigm is the use of existing sequential algorithms and communication sub-routines, usually provided with the systems. Our approach will be to study a class of problems from the start of the algorithm design task until the implementation of the algorithms on a PCC. The class of problems will be those arising on a family of intervals on the real line, which can model a number of applications in scheduling, circuit design, traffic control, genetics, and others [16].

Previous Work. This class of problems has been studied extensively in the parallel setting and many work-optimal fine-grained PRAM algorithms have been described in the literature [3, 14, 15, 16]. Their sequential complexity is O(n log n) in all cases. Whereas fine-grained PRAM algorithms are likely to be efficient on fine-grained shared memory architectures, it is common knowledge that they tend to be impractical on PCC's due to their failure to exploit locality. Therefore, there has been a recent growth of interest in coarse-grained computational models [4, 5, 18] and the design of coarse-grained algorithms [5, 6, 8, 10, 13]. The BSP model, described by Valiant [18], uses slackness in the number of processors and memory mapping via hash functions to hide communication latency and provide for the efficient execution of fine-grained PRAM algorithms on coarse-grained hardware. Culler et al. introduced the LogP model which, using Valiant's BSP model as a starting point, focuses on the technological trend from fine-grained parallel machines towards coarse-grained systems and advocates portable parallel algorithm design [4]. Other coarse-grained models focus more on utilizing local computation and minimizing global operations. These include the Coarse-Grained Multicomputer (CGM(n, p)) model used in this paper [5], where n is the input size and p the number of processing elements. In this mixed sequential/parallel setting, there are three important measures of any coarse-grained algorithm, namely, the amount of local computation required, the number and type of global communication phases required, and the scalability of the algorithm, that is, the range of values for the ratio n/p for which the algorithm is efficient and applicable. We refer to [5, 9, 10] for more details on this model. Recently, Cáceres et al. [2] showed that many problems in general graphs, such as list ranking, connected components and others, can be solved in O(log p) communication rounds in BSP and CGM. However, unlike general graphs, interval graphs can be more easily partitioned and treated in the distributed memory setting. Since each interval is given by its two extreme points, they can be sorted
by left and/or right endpoints and distributed according to this ordering. This partitioning allows us to design less complex parallel algorithms; moreover, the derived algorithms are easier to implement and faster both in theory and in practice. The following result will be used in the remainder to achieve a constant number of communication rounds in the solution of many problems.

Theorem 1. [13] Given a set S of n items stored O(n/p) per processor on a CGM(n, p), n/p ≥ p, sorting S takes a constant number of communication rounds. □

The algorithms proposed for the CGM are independent of the communication network. Moreover, it was proved that the main collective communication operations can be implemented by a constant number of calls to global sort ([5]). Hence, by Theorem 1, these operations take a constant number of communication rounds. However, in practice these operations will be implemented through built-in, optimized system-level routines. In the remainder, let Ts(n, p) denote the time complexity of a global sort in the CGM.

Our Work. We describe constant communication round coarse-grained parallel algorithms to solve a set of the standard problems arising in the context of interval graphs [16], namely connected components [3], maximum weighted clique [15] and breadth-first-search (BFS) and depth-first-search (DFS) trees [14]. We also propose O(log p) communication round algorithms for optimization problems such as minimum interval covering, maximum independent set [15] and minimum dominating set [17]. In order to demonstrate the practicability of our approach, we implemented three of the above algorithms on a PCC interconnected by a Fast Ethernet backbone. Because of the paradigms used, the programs were easy to develop and are quite portable. The results presented in this paper show that high performance can be achieved with off-the-shelf PCC's along with the right model for algorithm design. Interestingly, super-linear speedups were observed in some cases due to memory swapping effects. Using multiple processors allows us to effectively utilize more RAM and therefore allows computation on data sets that are simply too large to be effectively processed on single processor machines. In Section 2 the required basic operations are described. Then, in Section 3, chosen problems in the interval family model are presented, and solutions are proposed using the basic operations from Section 2. In Section 4, we describe experiments on a Fast Ethernet based PCC. We close the paper with some conclusions and directions for further research.
2 Basic Operations
In the CGM model, any parallel prefix (suffix) associative function to be performed in an array of elements can be done in O(1) communication steps, since
each processor can compute the function locally, and then with a total exchange all the processors get to know the partial results of all the other processors and can compute the final result for each element in the array. We will also use the pointer-jump operation, to identify the elements in a linked list. This operation can be easily done in O(log p) communication steps; at each step, each processor keeps track of the pointers of its elements.
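A sequential reference for the pointer-jumping idea mentioned above (my own sketch; in the CGM setting each doubling round would be preceded by an exchange of the needed pointers between processors):

```python
import math

def pointer_jump(succ):
    """succ[i] is the successor of element i in a linked list (the tail points
    to itself). After about log2(n) doubling rounds every element points to the
    tail of its list, which identifies the list it belongs to."""
    n = len(succ)
    ptr = list(succ)
    for _ in range(math.ceil(math.log2(max(n, 2)))):
        ptr = [ptr[ptr[i]] for i in range(n)]  # doubling step: jump twice as far
    return ptr

# Example: a single list 0 -> 1 -> 2 -> 3 (3 is the tail)
print(pointer_jump([1, 2, 3, 3]))  # [3, 3, 3, 3]
```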
2.1 Interval Operations
In the following algorithms two functions will be widely used, Ominright and Omaxright [1]. Given an interval I, Omaxright(I) (Ominright(I)) denotes, among all intervals that intersect I, the one whose right endpoint is the furthest right (left). The formal definition is the following.
Omaxright(I_i) = I_j, if b_j = max{b_k | a_k ≤ b_i ≤ b_k}; nil, otherwise.
The function Omaxright can be computed with time complexity O(Ts(n, p)), as follows.
1. Sort the left endpoints of the intervals in ascending order as a'_1, a'_2, ..., a'_n.
2. Compute the prefix maxima of the corresponding sequence b'_1, b'_2, ..., b'_n of right endpoints and let the result be b''_1, b''_2, ..., b''_n (b''_k = max_{1 ≤ i ≤ k} {b'_i}).
3. For every i (1 ≤ i ≤ n) compute the rank r(i) of b_i with respect to a'_1, a'_2, ..., a'_n.
4. For every i (1 ≤ i ≤ n), set Omaxright(I_i) = I_j such that b_j = b''_{r(i)} and b_i ≠ b''_{r(i)}; otherwise set Omaxright(I_i) = nil.
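A compact sequential sketch of this computation (my own illustration; a CGM implementation would replace the sort and prefix maxima by their parallel counterparts):

```python
import bisect

def omaxright(intervals):
    """intervals: list of (a_i, b_i). Returns for each interval the index of
    Omaxright(I_i), or None, following the four steps above."""
    a_sorted = sorted(a for a, _ in intervals)
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    # prefix maxima of right endpoints, taken in order of increasing left endpoint
    prefix_max, best = [], None
    for idx in order:
        if best is None or intervals[idx][1] > intervals[best][1]:
            best = idx
        prefix_max.append(best)
    result = []
    for a_i, b_i in intervals:
        r = bisect.bisect_right(a_sorted, b_i) - 1  # rank of b_i among the a'_k
        j = prefix_max[r] if r >= 0 else None
        result.append(j if j is not None and intervals[j][1] != b_i else None)
    return result

print(omaxright([(0, 3), (2, 5), (4, 9), (10, 12)]))  # [1, 2, None, None]
```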
We also define the parameter First(ℐ) as the segment I which "ends first", that is, whose right endpoint is the furthest left:

First(ℐ) = I_j, with b_j = min{b_i | 1 ≤ i ≤ n}.

To compute it, we only need to compute the minimum of the sequence of right endpoints of intervals in the family ℐ. Finally, we will use the function next : ℐ → ℐ defined as

next(I_i) = I_j, if b_j = min{b_k | b_i < a_k}; nil, otherwise.
That is, next(I_i) is the interval that ends farthest to the left among all the intervals beginning after the end of I_i. To compute next(I_i), 1 ≤ i ≤ n, we use the same algorithm used for Omaxright(I_i), with a new step 2.
1. Sort the left endpoints of the intervals in ascending order as a'_1, a'_2, ..., a'_n.
2. Compute the suffix minima of the corresponding sequence b'_1, b'_2, ..., b'_n of right endpoints and let the result be b''_1, b''_2, ..., b''_n (b''_k = min_{k ≤ i ≤ n} {b'_i}).
3. For every i (1 ≤ i ≤ n) compute the rank r(i) of b_i with respect to a'_1, a'_2, ..., a'_n.
4. For every i (1 ≤ i ≤ n), set Next(I_i) = I_j such that b_j = b''_{r(i)} and b_i ≠ b''_{r(i)}; otherwise set Next(I_i) = nil.
It is easy to see that the above procedure implements the definition of next(I_i), with the same complexity as for computing Omaxright(I_i).
3 Interval Graph Problems and Algorithms
Formally, given a set of n intervals ℐ = {I_1, I_2, ..., I_n} on a line, the corresponding interval graph G = (V, E) has the set of nodes V = {v_1, ..., v_n}, and there is an edge in E between nodes v_i, v_j if and only if I_i ∩ I_j ≠ ∅. In this section, solutions for some important problems in interval graphs are proposed for the CGM model. Some of these algorithms use techniques derived from their corresponding PRAM algorithms while others require different methods, e.g. to compute the connected components, as shown below.
3.1 Maximum Weighted Clique
A clique is a set of nodes that are mutually adjacent. In the maximum weighted clique problem for an interval graph, we want to know the maximum weight of such a set, given weights p(I_i) > 0 on the intervals, and identify a maximum weighted clique by marking its nodes. The CGM algorithm is as follows:
1. Sort the endpoints of the segments such that each processor receives 2n/p endpoints.
2. Assign to each endpoint c_i a weight w_i defined by

w_i = p(I_j), if c_i = a_j for some 1 ≤ j ≤ n; −p(I_j), if c_i = b_j for some 1 ≤ j ≤ n.
3. Compute the prefix sum of the resulting weighted sequence - the maximum obtained is the cardinality of a maximum clique; let d_1, ..., d_2n denote the resulting sequence.
4. Consider the sequence e_1, ..., e_2n obtained by replacing every d_j corresponding to a right endpoint of an interval with −1, and compute the rightmost maximum of the resulting sequence; this occurs at a_k.
5. Broadcast a_k. Every interval I_u such that a_u ≤ a_k ≤ b_u is marked to be in the final maximum weighted clique.
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7].

Theorem 2. The maximum weighted clique problem in an interval graph of size n can be solved on a CGM(n, p) in O(Ts(n, p) + n/p) time, with a constant number of communication rounds. □
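The sweep underlying this algorithm can be tested sequentially with a few lines (my own sketch, not the authors' code; the parallel version distributes the sorted endpoints and uses a parallel prefix sum):

```python
def max_weighted_clique_point(intervals, weight):
    """intervals: list of (a_i, b_i); weight: list of p(I_i) > 0.
    Returns (best_weight, point), where point is covered by a set of
    intervals of maximum total weight (a maximum weighted clique)."""
    events = []
    for i, (a, b) in enumerate(intervals):
        events.append((a, 0, weight[i]))   # left endpoint: +p(I_i)
        events.append((b, 1, -weight[i]))  # right endpoint: -p(I_i)
    events.sort()
    best, cur, best_point = 0, 0, None
    for x, _, w in events:
        cur += w
        if cur > best:
            best, best_point = cur, x
    return best, best_point

print(max_weighted_clique_point([(0, 4), (2, 6), (5, 8)], [1, 2, 3]))  # (5, 5)
```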
3.2 Connected Components
The connected components of a graph G are the maximal connected subgraphs of G. The connected components problem consists of assigning to each node the label of the connected component that contains it. For the CGM(n, p) we have the following algorithm:
1. Sort the intervals by left endpoints, distributing n/p elements to each processor.
2. Each processor P_i computes the connected components of the subgraph corresponding to its n/p intervals, giving labels to the components and associating the labels to the nodes.
3. Each processor detects the farthest right segment amongst its n/p intervals - its tail t_i - and broadcasts it (with its label) to all other processors.
4. Each processor checks if any of the tails intersects its components, and updates its local labels using in each case the smallest such new label.
5. Each processor P_i records the pair (t_i, new label) and sends it to processor P_0.
6. Processor P_0 performs a connected components algorithm on the tails, updates the tail labels using the smallest such new labels, and sends the tails and their new labels to all processors.
7. Each processor updates the labeling accordingly.
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7].

Theorem 3. The connected components problem in interval graphs can be solved on a CGM(n, p) in O(Ts(n, p) + n/p) time, with a constant number of communication rounds. □
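For the local step (step 2), each processor can label its own block of intervals, sorted by left endpoint, with a simple sweep; a sketch of that sequential building block (assumed names, not from the paper):

```python
def label_components_sorted(intervals):
    """intervals: list of (a_i, b_i) sorted by left endpoint.
    Returns component labels: a new component starts whenever an interval
    begins after the rightmost endpoint seen so far."""
    labels, label = [], -1
    reach = float("-inf")  # rightmost endpoint of the current component
    for a, b in intervals:
        if a > reach:
            label += 1     # gap found: start a new component
        labels.append(label)
        reach = max(reach, b)
    return labels

print(label_components_sorted([(0, 2), (1, 5), (6, 8), (7, 9)]))  # [0, 0, 1, 1]
```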
3.3 BFS and DFS Tree
The problem of finding a Breadth First Search tree in an interval graph reduces to the problem of computing the function Omaxright described earlier. The tree given by the edges (I_i, Omaxright(I_i)) is a BFS tree [14], and the tree formed by the edges (I_i, Ominright(I_i)) is a DFS tree [14]. The algorithm is the following:
1. Compute Omaxright(I_i), for 1 ≤ i ≤ n.
2. Let father(I_i) = Omaxright(I_i).
3. The edges (I_i, father(I_i)) form a BFS tree.
With the appropriate modifications, this algorithm may be used to find a DFS tree. The obtained BFS and DFS trees have their roots in the segments ending farthest to the right in each connected component. With respect to its complexity, the algorithm takes a constant number of communication steps and requires a total running time of O(Ts (n, p) + n/p).
Theorem 4. Given an interval graph G, BFS and DFS trees can be found using a CGM(n, p) in O(Ts(n, p) + n/p) time, with a constant number of communication rounds. □
3.4 Minimum Interval Covering
Given a family ℐ of intervals and a special interval J = (J_a, J_b), the problem of the minimum interval covering is to find a subset 𝒥 ⊆ ℐ such that J ⊆ ∪(𝒥) and |𝒥| is minimum; i.e., to find the minimum number of intervals in ℐ needed to cover J. To solve this problem we may only consider the intervals I_i = (a_i, b_i) ∈ ℐ such that b_i ≥ J_a and a_i ≤ J_b. Let ℐ_J be the family of the intervals in ℐ satisfying this condition. An algorithm to solve this problem is as follows:
1. Compute Omaxright(I), I ∈ ℐ_J.
2. Find the interval I_init such that b_init = max{b_k | a_k ≤ J_a}.
3. Mark I_init and all the intervals in the path given by the Omaxright pointers beginning at I_init.
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7]. T h e o r e m 5. The minimum interval covering problem in interval graphs can be solved using a CGM(n,p) in O(Ts(n,p) + logp) time, with O(logp) communication rounds. [3
3.5
M a x i m u m I n d e p e n d e n t Set and M i n i m u m D o m i n a t i n g Set
The first problem consists of finding a largest set of mutually non-overlapping intervals in the family 5, called the maximum independent set. The second problem consists of finding a minimum dominating set, i.e., a minimum set of intervals which are adjacent to all remaining intervals in the family Z. To solve these problems, we simply show a coarse-grained implementation of the algorithms proposed in [17]. In fact, it can be shown that both problems can be solved by building a linked list from First(Z): MAXIMUM INDEPENDENT SET:
I. 2. 3. 4.
Compute First(S) Compute next(li), for i, i < i < n Let father(/0 = next([i), for i, 1 < i < n Using the pointer-jump operation, mark all the intervals in the linked list given by father and beginning at First(S)
MINIMUM DOMINATING SET:
882 1. 2. 3. 4.
Compute F i r s t ( l ) Compute 0 m a x r i g h t ( / / ) , f o r 1 < i < n Compute n e x t ( / / ) , f o r 1 < i < n Let father(Ii) = 0maxright (next(//)),
for
1 < i < n
5. Using the pointer-jump operation, mark all the intervals in the linked list given by father and beginning at Omaxright (First(1))
The number of communication rounds in each of the algorithms is O(logp), giving us a total time complexity of O(Ts (n, p) + logp) and O(log p) communication rounds. Their correctness stems from the arguments in [17]. T h e o r e m 6. The maximum independent set and the minimum dominating set problems in interval graphs can be solved using a CaM(n, p) in O (rs( ~, p)+logp) time, with O(logp) communication rounds. []
4
Experimental Results
This section describes the implementations of three of the algorithms presented previously. Our aim here is to demonstrate that these algorithms are not only theoretically efficient but that they lead to simple fast codes in practice. They were implemented on a Fast Ethernet-PCC platform which consists of a set of 12 Pentium Pro 200 Mhz processors each with 64M of RAM that are linked by a 100Mb/s Fast Ethernet network. The processors run the Linux Operating System and the programs are written in C utilizing the P VM communication library [11] for all interprocessor communications. Since many of the algorithms rely on sorting, the choice of the sorting m e t h o d was critical. In the following we first present the implemented sort and its performance before describing the implementation and performance of our algorithms. 4.1
Global Sort
The sorting algorithm implemented is described in [12]. The algorithm requires a constant number of communication steps and its single drawback is that d a t a may not be equally distributed at the end of the sort. Nevertheless, a partial sum procedure and a routing can be used to redistribute the data with a constant number of communication rounds so that each processor stores ~p data in its memory. Figure 1 shows the execution time for the global sort on an array of integers with the data redistributed in comparison to the sequential performance of quicksort (from the standard C library). The results shown are the average of ten execution times over ten different inputs generated randomly. The abscissa represents n the size of the input array and the ordinate the execution time in seconds. For less than 7,000,000 integers, the achieved speedup is about 2.5 for four processors and 6 for twelve processors. Beyond the size of 7,000,000 integers, the
883 90(1
~-
, , , , sequential qutcksort - CQM parallel s ~ t ~xlh 4 PCs
8(F, 60~70~
/
/
CGM parallel sort with 12 PCs .......
500 400
300 2O0
100 0 0
5e+t,6
le+03
15e+I)7 2e+I)7
2,5e+07
3e+~7 3.5e+ff/ 4e+g7 4.5e+07
Fig. 1. Sorting on the Fast Ethernet-PCC.
memory swapping effects increase significantly the execution time on a single processor and super-linear speedup is obtained, with 10 for four processors and 35 for twelve processors. Sorting 40,000,000 integers takes less than one minute with twelve processors. 4.2
Maximum
Weighted Clique
The algorithm requires a constant number of communication rounds and rt 0 9-fi) local operations. Note that only the local computations involved in O(-~l the sort require 0(~-1o9 ~-) operations, whereas all the other steps require only p p 0 ( ~ ) operations.
sequentral MWC - CGM M W C ",~~th 4 PC~ ..... CGM MWC wifl~ 12 PCs ......
6(X)
5IX} [
400 ~-
300
2(I0
100
0 le+06
2e+~6
3e+06
4e+06
5e+06
6e+06
7e~6
Fig. 2. Maximum Weighted Clique on the Fast Ethernet-PCC. Figure 2 presents the execution time when the number of intervals ncreases.
884 With a graph having less than 1,000,000 intervals, the speedup is 2.5 with four processors, whereas it is equal to 7 for twelve processors. Beyond 1,000,000 intervals, the speedup is 10 for four processors and 35 for twelve processors, these superlinear timings being due to memory swapping effects. Again note that with this algorithm larger data sets than in the sequential case can be handled in a reasonable time. 4.3
Connected Components
As in the maximum weighted clique algorithm above, here also only the sort requires O(~log-~) local operations all the other steps being linear in p.
500 sequential C C - C G M C C with d PCs ..... C G M C C ,,vtth 12 PCs
450
. . . . . . .
400 350 300 250 200 15(1 /'
100 "i
50
iii ..............................
.........
0 2c+06
4e+06
6e+06
8c+06
J Ic+07
1.2c+07
.... 1,4e+07
Fig. 3. Connected Components on the Fast Ethernet-PCC. Figure 3 shows the execution time in seconds as the size of the input increases. The achieved speedup is approximately 2.5 for four processors and 7 for twelve processors for a graph having at most 2,000,000 intervals. With more intervals, the speedup is 8 for four processors and 30 for twelve processors. Also observe that with one processor, at most 2 million of data can be processed whereas twelve processors can process 12 million of data reasonably. Beyond 12 million data items the execution time increases more steeply due to memory swapping effects, even with twelve processors. 4.4
BFS Tree
The achieved speedup is 2 for four processors and 6 for twelve processors with at most 2 million of intervals. Beyond this size, the speedup becomes 7 for four processors and 20 for twelve processors. The measured times are slower than those obtained for the previous algorithms, because two steps of the function Omaxright require. O(~log. p ~p) operations, whereas for the previous problems only one step reqmred this number of local operations.
885 400 sequential BFS - CGM BFS with 4 PCs ...... CGM BFS with 12 PCs .......
350 ] 30~1 250 200
150
/
I00
0 9 ..
Y
i./"
5O
2e+06
t 4e+06
i 6e+06
i 8e+06
i le+07
/
L2e+07
1.4e+07
Fig. 4. BFS Tree on the Fast Ethernet-PCC.
5
Conclusion
In this paper we have shown how to solve many important problems on interval graphs using a coarse-grained parallel computer such as a cluster of PC's. The proposed algorithms were shown to be theoretically efficient, easy to implement and fast in practice. We believe this can largely be attributed to the use of the CGM model which accounts for distributed memory effects, mixes sequential and parallel coding, and encourages the use of a constant or very small number of communication rounds. Note that the use of the CGM model, which was primarily developed for algorithm design in the context of interconnection networks, has led to efficient implementations even in the context of a bus-based network like Ethernet. We speculate that this is due to several factors including: 1) the model focuses on sending a small number of large messages rather than a large number of small ones 2) it relies on standard, and typically well optimized, communications operations and 3) it focuses on reducing the number of communication rounds and therefore the number on interdependencies between rounds. Of course at some point such bus-based networks always become saturated and more attention must be paid to bandwidth and broadcast conflict concerns, particularly as one scales up. We are currently exploring how such concerns can best be dealt with within the context of a CGM-like model. References 1. A.A. Bertossi and M.A. Bonuccelli. Some Parallel Algorithms on Interval Graphs. Discrete Applied Mathematics, 16:101-111, 1987. 2. E. Caceres, F. Dehne, A. Ferreira, P. Flocchini, I. Rieping, A. Roncato, N. Santoro, and S. Song. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In Proc. of ICALP'9 Z pages 131-143. Lecture Notes in Computer Science. Springer-Verlag, 1997.
886
3. R. Cole and U. Vishkin. The accelerated centroid decomposition technique for optimal tree evaluation in logarithmic time. Algorithmica, 3:329-346, 1988. 4. D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Shacuser, E. Santos, R. Subramonian, and T. yon Eicken. LogP: Towards a realistic model of parallel computation. In Proc. 4th ACM SIGPLAN Syrup. on Princ. and Practice of Parallel Programming, pages 1-12, 1993. 5. F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th A CM Syrup. on Computational Geometry, pages 298-307, 1993. 6. M. Diallo, A. Ferreira, and A. Rau-Chaplin. Communication-efficient deterministic parallel algorithms for planar point location and 2d Voronoi diagram. In Proceedin9s of the 15th Symposium on Theoretical Aspects of Computer Science - STACS'98, Lecture Notes in Computer Science, Paris, France, February 1998. Springer Verlag. 7. A. Ferreira, I. Guerin-Lassous, K. Marcus, and A. Rau-Chaplin. Parallel computation of interval graphs on PC clusters: Algorithms and experiments. RR LIAFA 97-30, University of Paris 7, http://www.liafa.jussieu.fr/-guerin/biblio.html, 1997. 8. A. Ferreira, C. Kenyon, A. Ran-Chaplin, and S. Ub~da. d-dimensional range search on multicomputers. Algorithmica, in press. Special Issue on Coarse Grained Algorithms. 9. A. Ferreira and M. Morvan. Models for parallel algorithm design: An introduction. In A. Migdalas, P. Pardalos, and S. Storoy, editors, Parallel Computing in Optimization, pages 1-26. Kluwer Academic Publisher, Boston (USA), 1997. 10. A. Ferreira, A. Rau-Chaplin, and S. Ub~da. Scalable 2d convex hull and triangulation for coarse grained multicomputers. In Proc. of the 6 th IEEE Symposium on Parallel and Distributed Processing, San Antonio, USA, pages 561-569. IEEE Press, October 1995. 11. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R Manchek, and V. Sunderman. PVM: Parallel Virtual Machine- A Users' Guide and Tutorial for Networked Parallel Computing, 1994. 12. A.V Gerbessiotis and L.G Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, pages 251-267, 1994. 13. M.T. Goodrich. Communication-efficient parallel sorting. In Proc. of 28th Syrup. on Theory of Computing, 1996. 14. S.K. Kim. Optimal Parallel Algorithms on Sorted Intervals. In Proc. 27th Annual Allerton Conference Communication, Control and Computing, volume 1, pages 766-775, 1990. 15. A. Moitra and R. Johnson. PT-Optimal Algorithms for Interval Graphs. In Proc. 26th Annual Allerton Conference Communication, Control and Computing, volume 1, pages 274-282, 1988. 16. S. Olariu. Parallel graph algorithms. In A. Zomaya, editor, Handbook of Parallel and Distributed Computing, pages 355-403. McGraw-Hill, 1996. 17. S. Olariu, J.L. Schwing, and J. Zhang. Optimal Parallel Algorithms for Problems Modelled by a Family of Intervals. IEEE Transactions on Parallel and Distributed Systems, 3(3):364-374, 1992. 18. L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103-111, 1990.
Adaptable Distributed Shared Memory: A Formal Definition* Jordi Bataller and Jos5 M. Bernab~u-AubAn Dept. de Sistemes Informatics i Computaci6, Universitat Polit~cnica de Valencia
A b s t r a c t . In this paper we introduce a new model of consistency for distributed shared memory called Mume. It offers only the essentials to be considered as a shared memory system. This allows an efficient implementation on a message passing system, and due to this, it can be used to emulate other memory models.
1
Introduction
Distributed shared m e m o r y (DSM) is a p a r a d i g m for p r o g r a m m i n g parallel and distributed systems. It offers the agents of the system a shared space address that they use to communicate each other. The main problem of a DSM implementation is performance, specially when it is implemented on top of a message passing system. This, of course, depends on the consistency the DSM system offers. Many systems have been proposed, each one supporting a different level of consistency. Unfortunately, experience has shown t h a t none is well suited for the whole range of problems. This is also true for different implementations of the same m e m o r y model, since no single one is able to efficiently support all of the possible d a t a access patterns. In this paper, we introduce a novel model called Mume. Following the direction of the main techniques to improve shared m e m o r y systems, M u m e tries to give the p r o g r a m m e r means to tell the system how to carry out the work more efficiently. Mume is a low level layer close to the level of the message passing interface. M u m e interface only offers the strictly necessary requirements to be considered as shared m e m o r y thus allowing an efficient implementation. The interface also includes three types of synchronization primitives, namely, total ordering, causal ordering and mutual exclusion. Because the M u m e interface is low enough, it can be used either to develop final applications or to build new primitives and use t h e m in turn to develop final programs. This latter possibility allows, for example, to emulate the main m e m o r y models proposed to date. This paper describes Mume from a formal point of view. * This work has been supported in part by research grants TIC93-0304 and TIC960729
888
2
Formal Definition of Mume
The formalism we use was presented in [3]. It is a simple extension of the I / O a u t o m a t a model, [5]. We define a m e m o r y model by giving the set of allowable executions for it. An execution is an ordered sequence of actions. The basic orders we consider for an execution c~ are Process Order ( < ~ o ) , Writes-to ( ~ ) and Causal Relation (<~R)" Process Order is the order in which a processor issues actions. Writes-to is the order defined between a write and the read t h a t gets the value left by the write. Causal Relation is the transitive closure of the sum of PO and ~-~. We also consider the Mutual Exclusion ( < ~ ) order for executions produced by models t h a t include a synchronization primitive. Obviously, the order is defined only on the synchronized actions and they are ordered as they appear in the execution. An execution is admissible for a model if it can be rearranged in a way that the new sequence respects a given ordering and it is serial. An execution is serial if every read gets the value written by the immediately preceding write on the same variable. We express this concept by saying t h a t a given execution is <~-serializable, this is, serializable respecting the ordering <.~. In the case of Mume, the actions or primitives which compose executions are classified into ordinary ones (read and write) and synchronizations (sync). Mume considers that there exists a set of independent m e m o r y deposits (caches). Caches apply ordinary actions as they arrive, if they are not synchronized. Ordinary actions must be labelled to indicate to which cache they should be applied. Of course, read actions can only operate at one cache. Formally, the ordinary actions are: write(i,x,v)[A] and read(i,x,v)[a], where i C 7) (set of processes), x E ]2~ (set of data variables), v is the value being read or written and A C C, a C C (set of caches). Because Mume ordinary actions are independent of each other by default, it is necessary to synchronize them to set the order in which they are applied to caches. The synchronizations are the following ones: TOsync(i, STO, t), CRsync(i, scR, t) are MEsync(i, SME, t), where i C 7), sTO C YSro, SCR E VScr~, 8ME E ~'~,9~g (sets of synchronization variables) and t E S = {acq, tel} (synchronization type : acquire or release). The actions issued by a process between an acquire and its following release are the ones that are synchronized. We say t h a t they are X-synchronized being X the type of synchronization: TO, C R or ME. Actions of different processors are related if they are synchronized by synchronizations acting on the same variable. We say t h a t they are X-related, being X TO, CR or ME. The meaning of each type of synchronization is the following. TOsync ensures t h a t actions TO-related are performed in the same total order to any cache. It also ensures t h a t if two actions of the same process are TO-related, then, Process Order is respected. CRsyne maintains causal consistency: it ensures t h a t actions CR-related are applied to any cache in an order consistent with the causal dependences a m o n g the CR-related actions. MEsync is the strictest synchronization. It behaves like a TO-synchronization and, in addition, it guarantees mutual exclusion. Furthermore, when a process obtains
889 the exclusion it is sure t h a t it will see all the writes made by all processes which held the exclusion in the past. Formally, the Mume m e m o r y model is defined by requiring its executions to satisfy the properties T O s y n c , C R s y n c , and M E s y n c . Those properties are defined as follows: T O s y n c : an execution (~ respects TOsync iff Vc 9 C ~ is (<~osy~c Ic) serializable. - C R s y n c : an execution c~ respects CRsync iff Vc 9 C c~ is (<~R~y~c [c)serializable. M E s y n c : an execution a respects M E s y n c iff Vc 9 C, (~ is (<~/iEsync -
-
I( C, sync) )-serializable. In this definition, < ~ y ~ is the ordering t h a t is defined considering only the X-related actions of an execution. Ic denotes the restriction of to c. 3
Emulation
of Known
DSM
Models
The primitives of Mume can be used to emulate other m e m o r y models including Sequential consistency, P R A M , Causal consistency, Processor consistency, Cache consistency, Slow consistency, Local consistency as well as synchronized models such as Release consistency. The references for these models can be found in [3]. We want to point out that each model can be emulated in different ways. For instance, any model can be trivially emulated by using a single cache for all the processes. But as m a n y systems are implemented this way, we have chosen to give emulations in which one memory deposit is used per process, writes act on every cache, and reads act on the "local" cache. Now, as an example, we explain how Mume can emulate some of the cited models. The proofs can be found in [2].
Sequential. The executions of this model satisfy t h a t they are <po-serializable. To emulate it, we have to TO-relate every action. Formally: -
Mu.
e =
C =
VsTo =
{ o ) , Vsc
= (),
= {}).
- Labelling of ordinary actions: Vi E 7): read(i,., .)[mi] write(i,.,-)[{mj}jep]. -The first and last action of every process are TOsync(i,o, acq) and
TOsync( i, o, rel). PRAM. The executions of this model satisfy that, for all process /,they are < po ]i-serializable. To emulate the P R A M model, actions of the same process must be T O related to achieve PO and to ensure t h a t writes of t h a t process are performed in the same order at each cache. Formally:
- M u m e = (7), V~, C = { r n i } i ~ , VsTo = {oi)iep, Vscn = {}, Vs~s -- {}). - Labelling of ordinary actions: Vi 9 7): read(i,., .)[mi] write(i,., .)[{mj}j~p]. -The first and last action of every process are TOsyne(i, oi,acq) and T O sync( i, oi, r el ) .
890
Causal. The executions of this model satisfy that, for all process /,they are < c R Ii-serializable 9 In order to emulate this model, every action of each process must be CRrelated. Formally: - M u m e = (7), ]2~, C = {rni}iep, ])sTo = {}, ])Scn = {r}, ]2s~e = {}). - Labelling of ordinary actions: Vi E 7): read(i,., .)[mi] write(i,., ")[{mi}iep]. -The first and last action of every process are CRsyne(i,r, acq) and C Rsync( i, r, rel). Cache. The executions of this model satisfy that, for all variable x,they are <po Ix-serializable. It can be emulated by TO-relating actions operating on the same d a t a variable Formally: -
Mume
= (P,
C = {indic.,
VSTo = {Ov} cv , Vsc
= {},
= {}.
- Labelling of ordinary actions: Yi E 7): read(i,., .)[mi] write(i,., ")[{mi}ie~]. - Each action is surrounded by the pair TOsyne(i,o~,aeq) and TOsyne(i, o~, rel) if it operates on the d a t a variable v.
Processor. Every execution of this model satisfies that, for any pair of processes i and j, the execution can be <po Ii-serialized and
Related
Work
and
Concluding
Remarks
In this paper we have defined a new m e m o r y model called Mume. M u m e ordinary operations are close to the message passing interface since the p r o g r a m m e r has to declare to which caches they have to be applied. In spite of this, ordinary operations still possess m e m o r y semantics because a write can be hidden by a subsequent one, this is, not all written values have to be read. A p a r t of ordinary operations, M u m e also have three types of synchronizations in order to relate the ordinary ones. One of the choices followed to improve the performance of shared m e m o r y systems has been to give the p r o g r a m m e r means to tell the system useful information so the system can know which is the best way to do things or when
891 actions should be performed. An example of this are synchronized models such as Release consistency. These models use synchronizations not only to achieve mutual exclusion but also to tell the system when to update the data. Another case, [4], is to allow the p r o g r a m m e r to classify the variables of his p r o g r a m s according to the expected access p a t t e r n of each one. This permits the system to apply the best strategy to maintain the consistency of each variable. C o m p a r e d to other synchronized models, Mume has two synchronizations more. T h e y help to relate ordinary actions without the need of using 'mutual exclusion. Furthermore, because the ordinary actions of M u m e are marked with the target cache, the p r o g r a m m e r has more means to influence the performance t h a n in other systems. For example, some systems have to use complex protocols (write-shared protocols) in order to eliminate the problem of false sharing. T h e overhead introduced is significant and m a y reduce the gains dramatically. In Mume, the p r o g r a m m e r can eliminate false sharing by fine tuning concrete and particular points of his program, carefully choosing the caches to where ordinary actions are applied. But it is not m a n d a t o r y to use M u m e at the lowest level. Just because it is low level, libraries can be developed in order to emulate different m e m o r y models or to support different access patterns to variables. If libraries are used, the p r o g r a m m e r still have the choice of fine tuning a program. The "high level" program can be translated to the level of the native primitives where such tuning can be done. Because we are concerned with formal issues, we have presented here the formal definition of Mume. You can find a more elaborate version of this work along with the correctness proofs of the emulations of other m e m o r y models in [2]. You can also see [1] to find a wider discussion on the practical issues of Mume. We are currently developing an implementation of M u m e for a loosely coupled system in order to compare it with other m e m o r y systems and to evaluate its expected benefits.
References 1. J. Bataller and J. Bernabeu. Constructive and adaptable distributed shared memory. In Proc. of the 3rd Int'l Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 15-22, March 1998. 2. :I. BatMler and J. Bernabeu-Auban. An adaptable dsm model: a formal description. Technical Report II-DSIC-17/97, Dep. de Sistemes Informatics i Computacio (DSIC), Universitat Politecnica de Valencia (Spain), 1997. 3. J. Bataller and J. Bernabeu-Auban. Synchronized DSM models. In Europar'9% Third Intl. Conf. on Parallel Processing, Lecture notes on computer science, pages 468-475. Springer-Verlag, August 1997. 4. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. A C M Transactions on Computer Systems, 13(3):205-243, August 1995. 5. N. Lynch. I/O Automata: A model for discrete event system. Technical Report MIT/LCS/TM-351, Laboratory for Computer Science, Massachusetts Institute of Tecnology, March 1988.
Parameterized Parallel Complexity
Marco Cesati(1) and Miriam Di Ianni(2)
(1) Dip. di Informatica, Sistemi e Produzione, Univ. "Tor Vergata", via di Tor Vergata, I-00133 Roma, Italy. cesati@uniroma2.it
(2) Istituto di Elettronica, Universita di Perugia, via G. Duranti 1/A, I-06123 Perugia, Italy. diianni@istel.ing.unipg.it
Abstract. We introduce a framework to study the parallel complexity of parameterized problems, and we propose some analogs of NC.
1 Introduction
The theory of NP-completeness [8] is a theoretical framework to explain the apparent asymptotical intractability of many problems. Yet, while many natural problems are intractable in the limit, the way by which they arrive at the intractable behaviour can vary considerably. For instance, deciding if the nodes of a graph can be properly colored by k colors is NP-complete even for a constant k >= 3 [8], while there exists an algorithm deciding if the edges of an n-node graph can be covered by k nodes in time O(n) for any fixed value of k. The parameterized complexity setting [6, 7] has been introduced in order to overcome the intrinsic inability of the standard NP-completeness model to give insight into this variety of behaviours. In this paper we investigate the issue of which problems do admit efficient fixed parameter parallel algorithms. A first attempt to formalize the concept of efficiently fixed parameter parallelizable problems has been pursued by Bodlaender, Downey and Fellows: in a one-page abstract [3] they suggested the introduction of the class PNC as the parameterized analogue of NC. However, neither theoretical results nor applications to concrete natural problems were presented. We now want to give a deeper insight into this concept. According to the degree of efficiency we are interested in, several kinds of efficient parallelization for parameterized problems can be considered. In section 2, after reviewing some basic concepts of the parameterized complexity theory, we define the two classes of efficiently parallelizable parameterized problems, PNC and FPP, and we study their relationship with the class of sequentially tractable parameterized problems (FPT). We also present a non-trivial tool for proving FPP-membership of parameterized graph problems based on the concept of treewidth and on the results in [1, 4, 2]. In section 3 we study the relationship between NC, PNC, and FPP. In section 4 we give two alternative characterizations of both FPP and
PNC, and we use them to prove the PNC-completeness of two parameterized structural problems.
2 The Classes PNC and FPP
Let Σ be a finite alphabet. A parameterized problem is a set L ⊆ Σ* × Σ*. Typically, the second component represents a parameter k ∈ ℕ. The kth slice of the problem is defined as L_k = { x ∈ Σ* : (x, k) ∈ L }. The class FPT of fixed parameter tractable problems contains all parameterized problems that have a solving algorithm with running time bounded by f(k)|x|^α, where (x, k) is the instance of the problem, k is the parameter, f is an arbitrary function and α is a constant independent of x and k. From now on, with the term "parallel algorithm" we always refer to a PRAM algorithm. The first class of efficiently parallelizable parameterized problems, PNC, has been defined in [3]. PNC (the parameterized analog of NC) contains all parameterized problems which have a parallel solving algorithm with at most g(k)|x|^β processors and running time bounded by f(k)(log |x|)^h(k), where (x, k) is the instance of the problem, k is the parameter, f, g and h are arbitrary functions, and β is a constant independent of x and k. The definition of PNC is coherent with the definition of FPT; indeed we can prove that PNC is a subset of FPT. A drawback with the definition of PNC is the exponent in the logarithm which bounds the running time. We thus introduce the class of fixed-parameter parallelizable problems FPP, which contains all parameterized problems having a parallel solving algorithm with at most g(k)|x|^β processors and running time bounded by f(k)(log |x|)^α, where (x, k) is the instance of the problem, k is the parameter, f and g are arbitrary functions, and α and β are constants independent of x and k. Observe that, by definition, FPP ⊆ PNC. Several parameterized parallel problems can be proved to be included in FPP by directly showing a PRAM algorithm for them. Among them we recall parameterized VERTEX COVER and MAX LEAF SPANNING TREE. However, we can now present a different way of proving FPP-membership, based on the concept of treewidth. Treewidth was introduced by Robertson and Seymour [9], and it has proved to be a useful tool in the design of graph algorithms. Bodlaender and Hagerup [4] showed that there exists an optimal parallel algorithm on an EREW PRAM using O((log n)^2) time and O(n) space which is able to construct a minimum-width tree decomposition of G or correctly decide that tw(G) > k. Therefore parameterized TREEWIDTH belongs to FPP. The concept of treewidth turns out to be a useful tool for proving FPP-membership. In [1] it is shown that many problems that are (likely) intractable for general graphs are in NC when restricted to graphs of bounded treewidth. In particular, the authors prove that some graph properties (MS and EMS properties) are verifiable in NC for graphs of bounded treewidth. A careful study of the proofs in [1] reveals that such properties are also verifiable in FPP under the same hypothesis, since any optimal tree decomposition of width k can be transformed in O(log n) time and O(n) operations on an EREW PRAM into
a binary tree decomposition of depth O(log n) and width at most 3k + 2 [4]. Therefore: all problems involving MS or EMS properties are in FPP when restricted to graphs of parameterized treewidth. Thus, for any (E)MS property P, the following BOUNDED TREEWIDTH GRAPHS VERIFYING P problem belongs to FPP: given a graph G, does G satisfy P with tw(G) <= k (k being the parameter)? Lists of graph problems defined over (E)MS properties can be found in [2]. In a few cases it is possible to prove FPP-membership even without restrictions on the treewidth, since some properties directly imply a bound on the treewidth. Therefore several natural problems (like FEEDBACK VERTEX SET) are in FPP, because (i) yes-instances have bounded treewidth, and (ii) the corresponding property is (E)MS.
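For quick reference, the resource bounds of the three classes defined above can be written side by side (this is only a restatement of the definitions, with (x, k) the instance and f, g, h arbitrary functions):

    FPT:  sequential time  f(k)·|x|^α
    PNC:  g(k)·|x|^β processors, parallel time  f(k)·(log|x|)^h(k)
    FPP:  g(k)·|x|^β processors, parallel time  f(k)·(log|x|)^α

so that FPP ⊆ PNC ⊆ FPT.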
3 Relationship Between FPP, PNC, and NC
Trivially, the slices of every parameterized problem belonging to FPP or PNC are included in NC. It is now worthwhile to verify whether the converse is true, that is, whether the parameterized versions of all problems whose slices belong to NC are included in FPP or PNC. Consider the CLIQUE problem. Each slice CLIQUE_k is trivially in NC. Indeed, CLIQUE_k can easily be decided by a parallel algorithm that uses O(n^k) processors and requires constant time. Since CLIQUE is likely not in FPT [7], it is also likely not in PNC. Thus, slice-membership in NC is not a sufficient condition for membership in PNC. Therefore, the classes FPP and PNC are somewhat "orthogonal" to NC, in the same way that the definition of FPT is orthogonal to that of P. Nevertheless, there is a strong relation between P-completeness and FPT-completeness, where completeness for FPT is defined with respect to parameterized reductions preserving membership in FPP or PNC. Indeed, consider the WEIGHTED CIRCUIT VALUE problem WCV: its instance consists of the description of a logical decision circuit C and of an input word w; the parameter k is the Hamming weight of w, and the question is whether C accepts w. We have proved the following. Lemma 1. WCV is in FPT, while each slice WCV_k is P-complete. The previous lemma has an important consequence. Let L, L' be parameterized problems. We say that L FPP-reduces (PNC-reduces) to L' if there are an FPP-algorithm (PNC-algorithm) Φ and a function f such that Φ(x, k) computes (x', k' = f(k)) and (x, k) ∈ L if and only if (x', k') ∈ L'. It is not hard to prove the following. Corollary 1. Let <= be either the FPP-reduction or the PNC-reduction. If there exists an FPT-hard problem L with respect to <= such that all its slices L_k are included in NC, then P = NC. Corollary 1 implies that every FPT-hard parameterized problem must have at least one slice which is not in NC unless P = NC. Thus, the notion of FPT-completeness does not add anything to the classical notion of P-completeness
to characterize hardly parallelizable parameterized problems. In particular, to distinguish between parallel intractability in the two contexts, we must find a hardly parallelizable parameterized problem with slices in NC. This is the aim of the next section. However, we can hope to characterize "FPP's intractability" by proving the PNC-completeness of some problem (under the hypothesis FPP ≠ PNC), but we have no way to characterize "PNC's intractability".
4 Alternative Characterizations of FPP and PNC
NC is primarily defined as the class of problems that can be decided by uniform families of polynomial size, polylogarithmic depth circuits. Subsequently, it was proved that uniform families of circuits and PRAM algorithms are polynomially related. In this paper, we have taken the inverse approach, by initially defining the classes FPP and PNC in terms of PRAM algorithms. In this section, we extend the relation between PRAM algorithms and uniform families of circuits to the parameterized setting. In this case, uniform means that the circuit C_n^k, able to decide instances of size n when the value of the parameter is k, can be derived in space f(k)(log n)^α, for some function f and constant α. It is possible to prove the following theorem by essentially using the same proof technique as in [10]. Theorem 1. A uniform circuit family of size s(n, k) and depth d(n, k) can be
simulated by a PRAM-PRIORITY algorithm using O(s(n, k)) active processors and running in time O(d(n, k)). Conversely, there are a constant c and a polynomial p such that, for any FPP (PNC) algorithm Φ operating in time T(n, k) with processor bound P(n, k), there exist a constant d_Φ and, for any pair (n, k), a circuit C_n^k of size at most d_Φ p(T(n, k), P(n, k), n) and depth cT(n, k) realizing the same input-output behaviour of Φ on inputs of size n. Furthermore, the family {C_n^k} is uniform. The previous theorem allows us to show the first PNC-complete problem. It is denoted as BOUNDED SIZE-BOUNDED DEPTH-CIRCUIT VALUE PROBLEM (in short, BS-BD-CVP) and is defined as follows: given a constant β, three functions f, g and h, a boolean circuit C with n input lines, an input vector x and an integral parameter k, decide if the circuit C (having size g(k)n^β and depth h(k)(log n)^f(k)) accepts x. We have proved the following. Corollary 2. BS-BD-CVP is PNC-complete with respect to FPP-reductions. A second characterization of the classes of parameterized parallel complexity is based on random access alternating Turing machines. There is a strong relation between the time required by a random access alternating Turing machine and the depth of a simulating circuit, and between the space required by a random access alternating Turing machine and the size of a simulating circuit [5]. Such a relation also holds in the parameterized setting. For the sake of brevity, we do not state the corresponding theorem. We only want to remark that such a characterization allows us to define another PNC-complete problem,
RANDOM ACCESS ALTERNATING TURING MACHINE COMPUTATION (in short, RA-ATMC): given a random access alternating Turing machine AT, an input word x and an integral parameter k, does AT(x) halt within O(f(k)(log |x|)^k) steps with SPACE*(n) ∈ O(g(k) log n)? Corollary 3. RA-ATMC is PNC-complete with respect to FPP-reductions.
References
1. Stefan Arnborg, Jens Lagergren, and Detlef Seese. Easy problems for tree-decomposable graphs. Journal of Algorithms, 12, 308-340, 1991.
2. Hans L. Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Technical Report, Department of Computer Science, Utrecht University, 1995.
3. Hans L. Bodlaender, Rodney G. Downey, and Michael R. Fellows. Applications of parameterized complexity to problems of parallel and distributed computation, 1994. Unpublished extended abstract.
4. Hans L. Bodlaender and Torben Hagerup. Parallel algorithms with optimal speedup for bounded treewidth. Technical Report UU-CS-1995-25, Department of Computer Science, Utrecht University, 1995.
5. A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer. Alternation. J. of the ACM, 28, 114-133, 1981.
6. Rodney G. Downey and Michael R. Fellows. Parameterized computational feasibility. Feasible Mathematics II, 219-244. Birkhauser, Boston, 1994.
7. Rodney G. Downey and Michael R. Fellows. Fixed-parameter tractability and completeness II: On completeness for W[1]. TCS, 141, 109-131, 1995.
8. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, 1979.
9. Neil Robertson and Paul D. Seymour. Graph Minors. II. Algorithmic aspects of tree-width. Journal of Algorithms, 7, 309-322, 1986.
10. Larry Stockmeyer and Uzi Vishkin. Simulation of parallel random access machines by circuits. SIAM J. Comput., 13(2), 409-422, 1984.
Asynchronous (Time-Warp) Versus Synchronous (Event-Horizon) Simulation Time Advance in BSP
Mauricio Marin
Programming Research Group, Computing Laboratory, University of Oxford
mmarin@comlab.ox.ac.uk
Abstract. This paper compares the very fundamental concepts behind two approaches to optimistic parallel discrete-event simulation on BSP computers [10]. We refer to (i) asynchronous simulation time advance as it is realised in the BSP implementation of Time Warp [3], and (ii) synchronous time advance as it is realised in the BSP implementation of Breathing Time Buckets [8]. Our results suggest that asynchronous time advance can potentially lead to more efficient and scalable simulations in BSP.
1 Introduction
As BSP is a (bulk) synchronous model of parallel computing [10], the obvious method of parallel discrete-event simulation [2, 7] on BSP computers seems to be synchronisation protocols based on synchronous time advance, such as Breathing Time Buckets [8]. In this paper we show that using synchronous time advance on a BSP computer might not be a good idea for two reasons. Firstly, synchronous simulation methods force the execution of periodical parallel min-reductions during which no interesting advance in simulation time can be made. These operations take some number of BSP supersteps to complete [10], and the simulation is stopped during their execution. The cumulative cost of these operations is relevant since the total number of min-reductions is equal to the total number of supersteps used to process the simulation events. Secondly and more importantly, synchronous methods impose barriers in simulation time which tend to reduce the rate of time advance per superstep of the simulation. This increases the total number of supersteps dedicated to parallel event-processing. Equivalently, time barriers tend to decrease the number of events that are processed in each superstep. In BSP, supersteps are delimited by the barrier synchronisation of all the processors. In turn this operation takes some amount of real time to complete and it can become comparatively expensive in simulation models where the amount of computation involved in the processing of events is small (e.g., queuing networks). As many real life simulations can easily demand the execution of hundreds of thousands of event-processing supersteps, a significant reduction of the total number of supersteps can also cause a significant reduction of the total running time of the simulation. In addition, the cost of the remaining barrier
synchronisation of processors can be amortised by the processing of a comparatively larger number of events per superstep. On the other hand, approaches based on asynchronous time advance do not stop the simulation to perform periodical min-reductions, and simulation time barriers are not required (cf., [6]). For symmetric work-loads and excluding min-reduction supersteps, we have observed that as the size of the simulation model scales up, the rate of simulation time advance per superstep decreases comparatively faster using synchronous time advance. In particular, synchronous time advance requires on average O(sqrt(P)/ln ln P) times more event-processing supersteps than asynchronous time advance, with P being the total number of logical processes (LPs). Moreover, for any simulation model, we show that the number of supersteps for the asynchronous approach is at most the supersteps required by the synchronous approach, and in the end the total number of supersteps required by the asynchronous approach is optimal for any model. In addition, for the same symmetric work-loads, we have observed that load balance tends to be better under asynchronous time advance for moderate levels of aggregation of LPs onto processors (here load balance refers to balance in computation and communication as it is understood in BSP), whereas load balance is optimal in both approaches when the aggregation of LPs is large enough (the domain of current practical simulations). Obviously the nature of simulation work-loads is completely irregular and it is hard (perhaps impossible) to draw general conclusions about the overall running time achieved by protocols based on the two approaches analysed in this paper. We contribute with the analysis of an important performance indicator, namely the total number of supersteps. In our view, the results presented in this paper suggest that the domain of simulation models and current BSP machines in which protocols based on synchronous time advance can be efficient should be expected to be limited. As to the development of general purpose BSP simulation environments, we conjecture that protocols based on asynchronous time advance offer better opportunities to achieve scalable performance.
2 Two Approaches to Time Advance
Approaches based on the event-horizon concept [9], like Breathing Time Buckets [8], advance in simulation time in a synchronous manner by consuming supersteps as in figure 1.a. Note that here we do not consider the additional supersteps required for min-reductions which compute a new global event-horizon on each cycle. That is, each cycle of the pseudo-code shown in this figure should be followed by a min-reduction. Approaches like Time Warp [3], on the other hand, advance in simulation time in an asynchronous manner [6] by consuming supersteps as in figure 1.b. Note that only silent min-reductions are needed in these approaches since values such as global virtual time (GVT) [3] are calculated without stopping the simulation. That is, these min-reductions are carried out using the same supersteps used to simulate events. We call these two styles of superstep advance SYNC and ASYNC respectively.
Lemma 1. The total number of supersteps required by ASYNC is optimal.
Proof: Dependencies among events form trees. If the occurrence of event e generates event e1, then e1 is a child of e. If e1 is scheduled to occur in a different processor and e takes place at superstep s, then e1 must be processed at least in superstep s + 1. But the simulation of e1 may be delayed up to some superstep s' > s + 1 if earlier events take place in e1's processor at superstep s'. In this case, each descendant of e1 may only take place at superstep >= s', and if a descendant of e1 takes place in another processor, it may occur at least in superstep s' + 1. In this sense we say that every event has an initial "energy" which indicates the minimum superstep at which it can take place. Figure 1.b helps us to realise that, given any set of events to be processed during the whole simulation, the superstep counters are only updated with the initial energy of chronological events that cannot be processed in earlier supersteps. This rule is inductively applied in each processor from the first to the last superstep. In this way the simulation, as mapped onto the processors, is completed in the minimum number of supersteps. □
Lemma 2. The total number of supersteps required by ASYNC is at most the
supersteps of SYNC.
Proof: A subset E_H of all simulation events are the events that mark the event-horizon times. These events are not necessarily causally related and they may be generated in different processors. Also note that these events are the least timestamped messages buffered during the respective SYNC supersteps. The size |E_H| of this subset is the total number of supersteps required by SYNC, since by construction these events cannot occur in the same superstep. Consider the simulation with ASYNC. The first chronological event in E_H is necessarily the first event processed by one of the processors in its second superstep. However, the remaining processors may advance farther in time during their first superstep since they are not barrier synchronised by event-horizon times. This implies that more than one horizon event may be processed in the second superstep of ASYNC. A similar argument applies to the following supersteps so that, in general, SYNC is not optimal in supersteps. Both approaches require an identical number of supersteps when, for example, each event in E_H is causally related so that they must be simulated sequentially. □
Lemma 3. Under the assumption of unlimited memory, Time Warp as realised in BSP can approximate the supersteps of ASYNC within log P supersteps.
Proof: The lemma follows trivially by letting Time Warp process every available event (with time within the simulation period), correct erroneous computations accordingly, and start a new GVT calculation in each superstep. □
Note that processing every available event per superstep may lead to large roll-back overheads. However, near-optimal supersteps can be achieved at low overheads by limiting the number of events processed in each superstep [6].
Generate N initial pending events;
Tz := ∞;   [event horizon time]
Sz := ∅;   [buffer]
loop
    if TimeNextEvent() > Tz then
        SStep := SStep + 1; Schedule(Sz); Tz := ∞; Sz := ∅;
    endif
    e := NextEvent();
    e.t := e.t + TimeIncrement();
    p := e.p;                       [e occurs in processor p]
    e.p := SelectProcessor();
    if e.p ≠ p then
        Sz := Sz ∪ {e}; Tz := MinTime(Sz);
    else
        Schedule(e);
    endif
endloop
(a) SYNC
Generate N initial pending events;
[e.s indicates the minimal superstep at which the event e may take place in processor e.p.]
loop
    e := NextEvent();
    p := e.p;                       [e occurs in processor p]
    if e.s > SStep[p] then SStep[p] := e.s; endif
    e.t := e.t + TimeIncrement();
    e.p := SelectProcessor();
    if p = e.p then e.s := SStep[p]; else e.s := SStep[p] + 1; endif
    Schedule(e);
endloop
The total number of supersteps is the maximum of the P values in array SStep.
(b) ASYNC
Fig. 1. Sequential programs describing the rate of superstep advance in two approaches to parallel simulation. This program, called the hold-model [11], simulates work-loads of systems such as queuing networks. Schedule() stores events in the set of pending events. NextEvent() retrieves from this set the event with the least time. TimeIncrement() returns random time values. SelectProcessor() returns a number between 0 and P - 1 selected uniformly at random. The variable/array SStep maintains the current number of supersteps of the simulated BSP machine.
3 Average Case Analysis
It is not difficult to see that the number of supersteps per unit simulation time, denoted by Sp, required by SYNC and ASYNC is the optimal, Sp = 1, for a fully connected communication topology when the TimeIncrement function of Figure 1 returns 1. However, when this function returns random values, say exponentially distributed, the Sp values increase noticeably, as shown in the following analysis (details in [5]). We assume one LP per processor. Suppose that the initial event-list has N pending events e with timestamps e.t. Let X be a continuous random variable with p.d.f. f(x) and c.d.f. F(x). A hold operation consists of (i) retrieving the event e* with the least timestamp from the event-list, (ii) creating an event e with timestamp e.t = e*.t + X, and (iii) storing e in the event-list; e* is discarded so that N remains constant throughout the whole sequence of hold operations. It is known [11] that after executing a long sequence of hold operations the probability distribution of the times e.t approaches a steady state distribution with p.d.f. g(y) = (1 - F(y))/μ, where μ = E[X]. We represent the e.t values with the random variable Y. The values of Y are measured relative to the timestamp of the last event e* retrieved by the last hold operation. Let G(y) be the c.d.f. of Y. SYNC: We use the above defined f(x) and g(y) probability density functions to calculate SYNC's Sp, say S_p^s, as follows. Let A be the minimum of the N random variables Y. Since Prob[A > x] = (1 - G(x))^N, the c.d.f. of A is M_A(x) = 1 - (1 - G(x))^N. Let B be the minimum of the N random variables resulting from the sum of X and Y. The variable B represents the time of the event-horizon. The c.d.f. of X + Y, say F2(x), can be calculated using F2(x) = ∫ F(x - t) g(t) dt or F2(x) = ∫ G(x - t) f(t) dt. The c.d.f. of B is then M_B(x) = 1 - (1 - F2(x))^N. Note that E[B] - E[A] is the average time advance per superstep. Thus if the simulation ends at time T >> 1, then it will require an average of T/(E[B] - E[A]) supersteps to complete. Therefore S_p^s = 1/(E[B] - E[A]). Let us consider the negative exponential distribution with mean μ = 1 for the time increments X. In this case f(x) = g(x) = e^{-x}, therefore E[A] is given by
E[A] = \int_0^\infty (1 - M_A(t))\,dt = \int_0^\infty e^{-Nt}\,dt = \frac{1}{N},

and E[B] is

E[B] = \int_0^\infty (e^{-t} + t\,e^{-t})^N\,dt = \sum_{i=0}^{N} \binom{N}{i} \int_0^\infty t^i e^{-Nt}\,dt = \sum_{i=0}^{N} \binom{N}{i} \frac{i!}{N^{i+1}}.

Using the approximation given in [4] (pp. 112-117) we obtain E[B] - E[A] \approx \sqrt{\pi/(2N)}, so that

S_p^s = \frac{1}{E[B] - E[A]} \approx \sqrt{2N/\pi} \approx 0.8\,\sqrt{N}.
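The closed form above can be checked numerically. The following small C program is an illustrative sketch only, not part of the original analysis: it directly samples the steady-state pending times Y_i and increments X_i as unit-mean exponentials (the assumption stated above) and estimates E[B] - E[A], which should approach sqrt(pi/(2N)) for large N. The values of N and the number of trials are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265358979323846

/* Exp(1) random variate by inversion. */
static double expo(void)
{
    return -log(1.0 - (double)rand() / ((double)RAND_MAX + 1.0));
}

int main(void)
{
    const int N = 10000;          /* number of pending events (assumed value)   */
    const int TRIALS = 1000;      /* Monte Carlo repetitions (assumed value)    */
    double sum = 0.0;

    for (int t = 0; t < TRIALS; t++) {
        double A = HUGE_VAL, B = HUGE_VAL;
        for (int i = 0; i < N; i++) {
            double y = expo();    /* steady-state pending time Y_i              */
            double b = y + expo();/* candidate event-horizon value Y_i + X_i    */
            if (y < A) A = y;     /* A = min Y_i                                */
            if (b < B) B = b;     /* B = min (Y_i + X_i)                        */
        }
        sum += B - A;
    }
    printf("estimated E[B]-E[A] = %g, sqrt(pi/(2N)) = %g\n",
           sum / TRIALS, sqrt(PI / (2.0 * N)));
    return 0;
}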
Let us now consider SYNC's event-efficiency E_f, say E_f^s, which is defined as the ratio of the optimal number of events to the average maximum number of events processed in each processor per superstep. This is a measure of load balance in computation and communication for the studied symmetric work-load. On average, a total of m events take place in each superstep. Given a sensible distribution of LPs onto the P processors, by Valiant's theorem [10] we learn that the average maximum number of events per superstep should tend to the optimal m/P as m scales up. Thus for m large enough the efficiency is close to 1. For the exponential distribution, m ≈ sqrt(πN/2) ≈ 1.25 sqrt(N) [5, 9]. Define D = N/P. Regression analysis on simulation data from the program in figure 1.a (validated with numerical evaluation of the expected maximum of P binomial random variables) produces (asymptotically) for the exponential distribution [5],
E_f^s = D^{1/4} / (P^{1/4} ln P + D^{1/4}).

ASYNC: In this case there is no global synchronisation in simulation time. In a given superstep each LP p_i simulates events with time less than the minimum event time of all new events (messages) arriving to p_i from other LPs by the next superstep. For a large number of LPs (which justifies parallel simulation itself), it is reasonable to assume that a significant number of LPs will work with their own statistically independent local event-horizon, the average instance of this local event-horizon being the average minimum of D independent random variables Z = X + Y. Thus, from the previous results for SYNC, these comments lead to a first view of ASYNC's Sp, say S_p^a ≈ O(sqrt(D)) for a fixed number of LPs P. That is, we conjecture that for P >> 1 and D >> 1, S_p^a asymptotically tends to the Sp value of a SYNC simulation with just D events. In the following we provide evidence supporting this claim. The LPs advance the simulation by a generally different amount of time in each superstep. Note that the average time advance per superstep is by definition 1/S_p^a. Let us consider a lower bound T_P for this time advance. Consider all the order statistics associated with the set of N random variables Z = X + Y. Thus T_P is the average among the first P values E[Z_(i)], where the lower bound T_P comes from the assumption that the first P order statistics of Z are all located in a different LP. However, the actual average cannot be much larger than T_P, since any difference Z_i - Z_k increases very slowly with N. For example, for the exponential distribution, the maximum Z_N increases only logarithmically with N [1]. In this case we conjecture that the true average time advance per superstep is ≈ 1.25/sqrt(D). Let us then compare T_P with this quantity. Unfortunately, calculating T_P is mathematically intractable. Thus we performed the following experiment. For large N, say N = 10^4, we generated I = 10^3 different instances of a set of N random values X + Y, with X and Y being exponentially distributed. For each instance i we calculated the partial sums Sum[i, P] = (1/P) Σ_{k=1}^{P} Z_(k) with 1 <= P <= N, so that T_P for a particular D is given by T_P = (1/I) Σ_{i=1}^{I} Sum[i, N/D]. The results show that in the range D >= 10, T_P behaves like O(1/sqrt(D)). Assuming T_P = β P^α, least squares regression on
the points (log T_P, log P) produced α = 0.47738 and β = 0.01194, which should be compared with the conjectured α = 0.5 and β = 0.0125 resulting from the curve 1.25/sqrt(D). We now consider the effect of P on S_p^a when D is fixed. Let X_k be the sum of k >= 1 random variables with p.d.f. f(x), so that they represent the time increments of events forming a thread of causally related events. Then, for any positive difference between LP time advances, say Δ = T_j - T_i, there is some non-zero probability that T_j < T_i + ε_i + X_k < T_j + ε_j for some ε_i, ε_j >= 0, such that T_i + ε_i + X_k is the time of the next event e_j in LP p_j after superstep s. This event e_j is processed in some superstep s' with s + 1 <= s' <= s + k + 1. An increase in P implies an increase in the diversity of T_i values and thereby an increase in the probability of events of type "e_j". Thus, for fixed D, S_p^a may increase by about k supersteps with P. For the exponential distribution the k values are Poisson. The average k for the maximum time interval Δ = Z_N - Z_1 is bounded from above by Z_N = O(ln N). So a conservative upper bound for the growth of S_p^a with P is O(ln P). However, a similar experiment to the one described above tells us Z_P < ln ln P in the range D >= 10. This leads to S_p^a ≈ (ln ln P) sqrt(D). This expression was validated with simulation results obtained with the program in figure 1.b. Regression analysis on data from the same program produces (asymptotically) for the exponential distribution [5],
E_f^a = (ln P · ln D) / ((ln ln P)^{1/2} ln P + D^{1/4}).
Comments: Note that the ratio S_p^s/S_p^a behaves in practice as O(sqrt(P)) for large systems. For fixed P and large D values, SYNC's efficiency is better than ASYNC's efficiency. For large P, P^{1/4} goes to infinity faster than (ln ln P)^{1/2} ln P. Thus for large P and small D (practical simulation models) the efficiency of ASYNC is better than the SYNC efficiency. However, the expression for E_f^a was obtained considering just one LP per processor. As more LPs are put on the processors, E_f^a improves noticeably. This is so because the variance of the cumulative sum of co-resident LP time advances is reduced as the number of LPs per processor increases. Consequently, the variance of the number of events simulated in each processor decreases [5]. Conversely, E_f^s is insensitive to the relation D_lp D = m/P, with D_lp being the number of LPs per processor. Another important point here is that we can always move ASYNC towards SYNC by imposing upper limits on the number of events processed in each processor and superstep (this filters peaks). If these limits are sufficiently small, ASYNC degenerates to SYNC [5]. To compare the two approaches on more practical grounds we performed experiments with programs similar to those shown in figure 1. First note that the work-load generated by these programs is equivalent to a fully connected queuing network where each node (LP) contains infinite servers. The service times are given by the time increments of events. In our experiments we considered
a more realistic queuing network: one non-preemptive server per node, exponential service times, fully connected topology, and unlimited queue capacities with a moderate number of jobs flowing throughout the network (closed system). Figure 2 shows the results. The comparison is made in terms of R_S = S_p^s/S_p^a and R_E = E_f^s/E_f^a. The plotted data clearly show that ASYNC outperforms SYNC in a wide range of parameters (the minimal value of R_S in figure 2.a is 1.9). The data indicate that R_S increases as sqrt(P) and R_E decreases when N and P scale up simultaneously. In other words, the results show that ASYNC has better scalability than SYNC. In addition, the level of aggregation of LPs onto processors contributes to further increase R_S and decrease R_E. In particular, for fixed P, the ratio R_S increases as sqrt(D_lp) in figure 2.a.
[Figure 2 appears here: plots of (a) R_S and (b) R_E against the number of processors P, for D_lp = 2, 8 and 32.]
Fig. 2. Fully-connected non-preemptive single-server queuing network with exponential service times. Figure a: R_S = supersteps-SYNC/supersteps-ASYNC. Figure b: R_E = AvgMaxEv-SYNC/AvgMaxEv-ASYNC, where AvgMaxEv is the average maximum number of events per superstep. D_lp = number of LPs per processor, and P = number of processors. Initially D = 8 "jobs" are scheduled in each LP (server).
References
1. R. Felderman and L. Kleinrock. "An upper bound on the improvement of asynchronous versus synchronous distributed processing". In SCS Multiconference on Distributed Simulation V.22, pages 131-136, Jan. 1990.
2. R.M. Fujimoto. "Parallel discrete event simulation". Comm. ACM, 33(10):30-53, Oct. 1990.
3. D.R. Jefferson. "Virtual Time". ACM Trans. Prog. Lang. and Syst., 7(3):404-425, July 1985.
4. D. E. Knuth. "The Art of Computer Programming, Vol. 1, Fundamental Algorithms". Addison-Wesley, Reading, Mass., 1973.
5. M. Marin. "Simulation time advance in BSP". Technical Report, Oxford University, May 1998. http://www.comlab.ox.ac.uk/oucl/groups/bsp/.
6. M. Marin. "Time Warp on BSP Computers". Technical Report, Oxford University, Feb. 1998. To appear in 12th European Simulation Multiconference.
7. D.M. Nicol and R. Fujimoto. "Parallel simulation today". Annals of Operations Research, 53:249-285, 1994.
8. J.S. Steinman. "SPEEDES: A multiple-synchronization environment for parallel discrete event simulation". International Journal in Computer Simulation, 2(3):251-286, 1992.
9. J.S. Steinman. "Discrete-event simulation and the event-horizon". In 8th Workshop on Parallel and Distributed Simulation (PADS'94), pages 39-49, 1994.
10. L.G. Valiant. "A bridging model for parallel computation". Comm. ACM, 33:103-111, Aug. 1990.
11. J.G. Vaucher. "On the distribution of event times for the notices in a simulation event list". INFOR, 15(2):171-182, May 1977.
Scalable Sharing Methods Can Support a Simple Performance Model
Jonathan Nash
Scalable Systems and Algorithms Group, School of Computer Studies, The University of Leeds, Leeds LS2 9JT, West Yorkshire, UK
Abstract. The Bulk Synchronous Parallelism (BSP) model provides a simple and elegant cost model, as a result of using supersteps to develop parallel software. This paper demonstrates how the cost model can be preserved when developing software for irregular problems, which typically require dynamic load balancing and introduce runtime task dependencies. The solution introduces shared data types within a superstep, which support weakened forms of shared data consistency for scalable performance. An example of a priority queue to support a solution of the travelling salesman problem is given, with predicted and observed performance results provided for 256 processors of a Cray T3D MPP.
1 Introduction
The Bulk Synchronous Parallelism (BSP) model [7, 12] provides a simple and elegant cost model, as a result of using supersteps to develop parallel software. A key feature is the independent execution of processors in generating remote accesses, allowing a superstep cost to be characterised by an h-relation [7]. Specifically, given a machine with network performance g, barrier cost L and computational performance s, a superstep can be costed as gh + sw + L, where h and w are the maximum usage of each resource by any processor. The BSP model seems less suited to the efficient support of irregular problems, which require dynamic load balancing and introduce runtime task dependencies [9]. This paper describes work on supporting irregular problems with scalable high performance, while preserving the BSP-style cost model (based on earlier experiences with the WPRAM model [10]). The key idea has been to develop scalable data structures which support dynamic sharing patterns [8, 9, 3], using highly concurrent implementations provided by the use of weakened forms of shared data consistency [4, 1]. This highly concurrent behaviour approximates the superstep operation, allowing the BSP-style cost model to be preserved. Section 2 describes the use of abstractions for sharing in parallel systems. Section 3 uses the example of a priority queue to describe the techniques to support scalable high performance. Section 4 describes the extension of the BSP cost model to characterise the priority queue performance, with Section 5 providing predicted and observed performance, both for the queue and its use in the travelling salesman problem, for 256 processors of a Cray T3D. Section 6 provides a summary and points to some current work.
Fig. 1. (a) Generic SADT; (b) PriQueue#1 SADT; (c) PriQueue#2 SADT
2 Abstractions for Sharing in Parallel Systems
The programming framework is based on the idea of shared abstract data types (SADTs) [5]. An abstract data type (ADT) encapsulates local data, and provides methods to change the state of the data (and return some result). An SADT augments an ADT to include concurrent method invocation, and must specify when changes in state are visible to the methods. SADTs support the key sharing patterns in a parallel system [1, 2, 6], allowing the encapsulation of possibly complex concurrency issues. This well defined layer of abstraction allows for optimised implementations on specific platforms, code portability and re-use. The diagrams in Figure 1 point to the four main features of an SADT, both in generic terms, and related specifically to the support of two alternative priority queue (PriQueue) SADT implementation strategies. A more detailed description can be found in [3].
Local data: Each processor maintains a local state. This data might be replicated or partitioned across the processors, depending on the access patterns which the SADT generates. The PriQueue maintains a partitioned segment of the data set on each processor.
Data access functions: In order to support the external interface methods, the internal SADT data will have to be accessed. The first PriQueue implementation adopts a bulk synchronous approach in which the Enqueue and Dequeue operations will access local data items, with dynamic data migration occurring within regular superstep periods [3]. The second implementation supports a shared data structure across the processors, and so Enqueue and Dequeue operations may result in the access of remote data.
Global consistency: This is typically only explicitly defined if the data access functions, defined above, make only local data accesses. In the case of the first PriQueue, the consistency is specified by the load balancing superstep frequency and the number of data items migrated. The second implementation does not contain any explicit consistency routines, since a shared data structure is used to directly store and retrieve the data items.
Support libraries: Both the data access functions and the global consistency protocols will typically need to make use of serial and parallel libraries.
The first PriQueue requires libraries to support a serial heap for local data access, and various migration strategies are supported for the load balancing superstep. The second implementation also makes use of the heap library, together with routines to support the access of a shared array data structure, so that data items on other processors may be accessed [3] (see Section 3.3). SADTs are utilised within a bulk synchronous superstep, to support dynamic data sharing patterns in a portable and scalable manner:
    Enqueue sequence #1; Barrier
    Enqueue sequence #2; Barrier
    Dequeue operations
If a FIFO Queue SADT is being accessed [8], with the FIFO ordering defined at the barrier points (and an arbitrary ordering between barriers), then the Dequeue operations would initially retrieve items from the first Enqueue sequence. If the SADT was the PriQueue defined earlier, the first barrier could be removed, since this will not affect the ordering of the data items (an explicit priority ordering is maintained). The second barrier guarantees that all Enqueue operations have completed, and the following Dequeues will begin to access the globally highest priority items. This barrier could also be removed if this strict guarantee is not required, allowing processors to access the currently highest priority tasks (or if the creation of the data items is dictated by runtime task dependencies [3]). While a serialised implementation of the PriQueue might be practical for a small-scale parallel machine, highly parallel computers consisting of tens or hundreds of processors require high degrees of concurrency, if performance is to scale. The next section describes how this has been achieved for the PriQueue, based around the second PriQueue implementation strategy described above.
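As an illustration only (the paper itself does not give a concrete language binding), the superstep usage pattern above might look as follows in C. PriQueue_Enqueue, PriQueue_Dequeue and Barrier are hypothetical names standing in for the SADT methods and the BSP barrier:

/* Hypothetical C binding for the PriQueue SADT usage pattern (sketch only). */
typedef struct { double priority; void *data; } Item;

extern void PriQueue_Enqueue(Item item);   /* assumed SADT method            */
extern int  PriQueue_Dequeue(Item *item);  /* assumed; returns 0 when empty  */
extern void Barrier(void);                 /* assumed BSP barrier            */

void superstep_example(Item *batch1, int n1, Item *batch2, int n2)
{
    /* First Enqueue sequence: each processor inserts its own items. */
    for (int i = 0; i < n1; i++) PriQueue_Enqueue(batch1[i]);
    Barrier();  /* could be dropped for the PriQueue: priority order is kept anyway */

    /* Second Enqueue sequence. */
    for (int i = 0; i < n2; i++) PriQueue_Enqueue(batch2[i]);
    Barrier();  /* guarantees the following Dequeues see the globally highest items */

    Item it;
    while (PriQueue_Dequeue(&it)) {
        /* process the retrieved task ... */
    }
}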
3 Techniques for Supporting Scalable Performance
The example of the travelling salesman problem (TSP) is given here, to illustrate the use of the PriQueue SADT (full details of the solution are presented in [3]). As shown in Figure 2(a), the solution of the TSP makes use of the PriQueue to store and retrieve new tasks, representing tours, which are dynamically created within a tree-based computation. The priority is related to the length of the (partially generated) tour. An Accumulator SADT is also used to note the current best tour, to which new tours can be compared (for reasons of brevity, further details of this SADT are excluded). A lower level lightweight Lock SADT is also used to support the operation of the PriQueue and Accumulator [3]. The next sections describe the use of weak data consistency, bulk synchronous parallelism (1) and fine-grain parallelism to support the PriQueue with scalable performance. Specific implementation details are provided for the Cray T3D.
(1) The algorithm which solves the TSP in fact only uses a single superstep.
Fig. 2. (a) Structure of the TSP solution; (b) PriQueue implementation

Table 1. The SHMEM operations

Bulk synchronous operations   Action
Get (buf, x, Buf, i)          buf[0..x-1] = Buf_i[0..x-1]
Put (buf, x, Buf, i)          Buf_i[0..x-1] = buf[0..x-1]
Quiet ()                      suspend upon completion of Put(s)
Barrier ()                    synchronise all processors

Fine-grain operations         Action
w = Swap (c, W, i)            w = W_i; W_i = c
w = Inc (W, i)                w = W_i; W_i = W_i + 1
Wait (w, c)                   suspend until location w ≠ c
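Read as C prototypes, the Table 1 interface might be declared roughly as follows. This is an illustrative sketch only: the exact types and names of the underlying Cray SHMEM calls are not given here, so everything below is an assumption layered on the semantics in the table.

/* Hypothetical C view of the Table 1 operations (sketch; types assumed). */
void Get(long *buf, int x, long *Buf, int i);   /* buf[0..x-1] = Buf_i[0..x-1]      */
void Put(long *buf, int x, long *Buf, int i);   /* Buf_i[0..x-1] = buf[0..x-1]      */
void Quiet(void);                               /* wait for outstanding Puts        */
void Barrier(void);                             /* synchronise all processors       */

long Swap(long c, long *W, int i);              /* atomically: w = W_i; W_i = c     */
long Inc(long *W, int i);                       /* atomically: w = W_i; W_i = W_i+1 */
void Wait(long *w, long c);                     /* suspend until *w changes from c  */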
3.1 Weak Data Consistency
A key characteristic for scalable performance was to use weakened data consistency [4, 1]. Sequential consistency can provide a global priority ordering on the PriQueue elements, but typically through the use of a serialising lock. This becomes the bottleneck in large systems, limiting performance. The implementation here removes this lock by removing the strict ordering guarantee. Each processor 0..p-1 holds a locally ordered data segment, as shown in Figure 2(b). These are accessed cyclically by the processors when storing and retrieving data, distributing the priorities approximately evenly. A processor is then only guaranteed to remove one of the p highest priorities (Section 5 will show that this does not have any substantial effect on the operation of the algorithm). This approach provides for a highly concurrent implementation (see below), allowing scalable performance characteristics as the number of processors grows.
Bulk Synchronous Parallelism
For the implementation study on the Cray T3D, the SHMEM library of operations has been used, as given in Table 1. In the table, V/ denotes a variable V which is accessed on processor i, in contrast to v, which is a local variable. SHMEM allows the direct access of remote memory locations through the specification of a processor-address pair, supporting high bandwidth and low latency operations. The BSP programming model can be supported through the use of the P u t and G e t operations, which write to and read from remote memories
910 O p e r a t i o n Cost Swap/Inc D + 2g Get D + 4gx
Get and Put
Swap and Inc
Put Quiet Wait Barrier
2gx D L
Cost Value g 0.04-0.06 D 0.75-1.27 L 2.00 s 0.0533 t 0.0664
Fig. 3. (a) Modelling SHMEM; (b) SHMEM costing; (c) Cost values
respectively, and a Barrier operation. Put supports the pipelining of multiple requests, to amortize the network latency cost (and allow overlapping of communication with subsequent local computation), with Quiet being used to suspend until these complete. The Put and Get operations can be used to maintain a priority ordering on a given remote data segment of the PriQueue, through a simple binary search method.
3.3
Fine-grain Parallelism
The PriQueue concurrency can be exploited by operations which support finegrain parallelism. Looking back to Figure 2(b), it was mentioned that the local data segments are accessed cyclically by the processors. The SHMEM library provides an atomic increment operation, Inc, which can be used to coordinate the global access of these segments, as shown in Table 1. The result in w can be taken modulus p to obtain the required PriQueue data segment. Figure 2(b) shows that each data segment also contains a number of locks and local flags, which are used to control the access to the d a t a segments [3]. The implementation of these locks, through a lower level Lock SADT [3], can use the SHMEM atomic Swap operation, in which a word c is exchanged with the current contents of W on processor i. This operation allows for the construction of a scalable linked list [8]. In addition, Wait can be used to support fine-grain pointto-point synchronisation methods [10], which are typically used to coordinate the access of the shared linked lists [8]. The next section will show how the specific hardware support for these operations allows for the construction of high performance software.
4
A BSP-style
Performance
Model
This section describes how a model of the performance of parallel applications can be constructed from a simple BSP-like cost model, as described in the introduction. The key to preserving this costing approach has been the scalable implementation of the PriQueue SADT, which preserves the independent superstep behaviour of the processors.
911 Table 2. PriQueue SADT costing Operation Cost Enqueue (max,tot) Inc(max,tot) + Get (1,4*max,4*max) +Put (10,max,max) + 882*max*s + max*Add_Heap(HEAP_SIZE) + 2 { Acquire(max,tot/p) + Release(max,tot/p) } Dequeue (max,tot) Inc(max,tot) + Get (1,max,max)+Get (9,3*max,a*max)+Put (1,max,max) + ll07*max*s + max*Sub_Heap(HEAP_SIZE) + 2 { Acquire(max,tot/p) + Release(max,tot/p) }
4.1
Characterising Machine Performance
As described in the introduction, a superstep cost can be modelled as g h + s w + L , where g, s and L cost the machine operations (network access, local computation and barrier synchronisation respectively), with h and w specifying their maxim u m usage by any one processor within the superstep. In order to effectively model the operations for fine-grain parallelism, two additional machine costs, D and t, are required. The cost D measures the round-trip network latency, and is incurred when there exists a data dependency within a superstep (for example, when accessing the results of an atomic increment). The cost t represents the time to transfer words within the local memory, and is present due to the fact that irregular applications typically access complex local data structures. So the cost of a superstep can now be represented by gh + sw + tn + Dc + L, where n and c again reflect the m~ximum resource usage. 4.2
Characterising the Performance of the SHMEM
Library
Figure 3(a) shows how the SHMEM operations which access the network are costed. For the Swap and Inc operations, performance is measured by the latency term D and the network access cost g. For the Put and Get operations, the time g is incurred at both the sending and receiving ends. Figure 3(b) provides a summary of all the costs. Put allows the pipelining of multiple remote accesses (totaling x words), with the processor suspending using Quiet, or Barrier, until they complete. Put and Get requests are split into packets of 4 words, with an acknowledgement packet returning for a Put, and the data returned for a Get. However, the effective network bandwidth is halved for the Get (hence the factor of 4 in the table), since each packet must return before the next packet can be sent. Figure 3(c) gives each of the cost parameters for the Cray T3D. The terms of g and D for each operation were derived from network performance results under randomised and deterministic traffic patterns. 4.3
Characterising the Performance of the PriQueue and TSP
As described in Section 4.1, the performance at the machine level can be modelled using the terms gh + sw + tn + Dc + L, where h, w, n and c represent the
912 Table 3. Costs for maintaining an ordered heap segment Operation Cost Add_Heap (n) Put(18,n,n) + 2*Get(9,n,n) Sub_Heap (n) Put(18,n,n) q- 3*Get(9,n,n) Table 4. Lock SADT costing Operation Cost Acquire (max,tot) Swap(max,tot) + Swap(max,max) + 37*max*s IRelease (max,tot) Put(1,tot,tot) + 36*tot*s
maximum usage of each associated machine resource. When characterising the performance of parallel software, two terms are used to model the workload at a given level of abstraction. The term tot measures the total usage of a given resource across all processors, whereas max measures the maximum resource usage by any given processor. For example, the performance of the solution to the travelling salesman problem given in [5] can be described as: Enqueue (max,tot) + Dequeue (max,tot) + Accum_Read (max,tot) + Accum_Update (u) + Compute (max) In this case, tot and max refer to the number of generated tours. The costs Enqueue and Dequeue are for the PriQueue, and Accurn_Read and Accurn_Update the Accumulator (the u term has been found by experiment to be a small constant number). A Compute term measures the local work required for a given number of tours. In this example, SADT access occurs within a single superstep. Processors independently produce and consume tours through the access of the PriQueue. The only exception is when all items have been temporarily removed from the PriQueue and new items are being enqueued (causing any idle processors to momentarily suspend execution), and at the termination stage. The former case typically only occurs for a short time at the start of execution, when the first few PriQueue items are being generated. Thus, the independent superstep behaviour is preserved at the application level. The highly concurrent behaviour of the SADTs preserves this state of affairs (see below and Section 3.3), allowing the simple BSP cost model to be employed. Table 2 gives the costing of the Enqueue and Dequeue operations. These include access to the Inc, Put and Get SHMEM operations. The Inc cost represents a potential source of contention, including a term of tot, reflecting its use to coordinate the access of the local data segments which make up the PriQueue, as described in Section 3.3. The PriQueue costing also includes the need to maintain a local priority ordering within each local data segment after an Enqueue or Dequeue occurs, using Add_Heap and Sub_Heap respectively. These costs are parameterised by the size of the local segment (2 HEAP-SIZE items). Also included are costs to Acquire and Release a lock at a local segment. The term of tot/p in these costs reflects the scalable distributed implementation, in which the local data segments are accessed cyclically by the processors.
913 Table 5. Parameterised SHMEM costs Operation Cost Swap/Inc (max,tot) m a x * Get (x,max,tot) max * Put (x,max,tot) max *
D + tot 9 2 * g D § tot 9 4 * g * x D -I- t o t * 2 * g * x
Performance of the PriQueue 300
i
i
i
i
i
Performance of the TSP i
256
Enqueue Predicted -+-Dequeue -o-Predicted -x--,
280
~
i
260
128 64
32 9 16 @
240 I 220 [?3"" . . . .
180 160
.
2 I
I
I
4
8 16 32 64 128 256 Processors
I
I
I
I
1;
I
I
20 cities predicted 21 cities predicted linear
I
// / /'7
I
I
I
//
~ . ~.,..J--;~. -+-.,'v," -El--."19" ,~.-:~---. . . . 7.(.~"
// ,/
8 4
2
I
,,//~-4~
//."
J/i
I I
2 4 8 16 32 64 128256 Processors
Fig. 4. (a) PriQueue performance (b) TSP performance
Tables 3 and 4 give the costs for maintaining the priority ordering of the data segments and the Lock SADT. The ordering uses a binary search to exchange data items within the segment, resulting in the HEAP_SIZE term. The Lock SADT implementation is too involved to describe here, but further information can be found in [3]. Finally, Table 5 gives the cost of the SHMEM operations, when parameterised by max and tot. It can be seen, for example, that when a Swap operation is concurrently performed on a given shared word, the max term gives the usage of the network latency resource D (since max * D gives the total time that any processor incurs the latency resource), and the tot term relates to the usage of the network resource g (since tot * g provides the total contention at the memory module holding the word being accessed).
5 Predicted and Observed Performance
The performance of the PriQueue is given in Figure 4(a), in which all processors continuously generate Enqueue and Dequeue requests. It can be seen that the predicted and observed performance agree closely. Figure 4(b) shows the TSP performance for 20 and 21 city problems. Speedups of over 150 on 256 processors for a 21 city problem are typical, reducing the time from 53 seconds on one processor to around a third of a second on 256 processors. It can be seen that the predicted times are again close, using the performance models derived for the SADTs. The values for tot and max, described in Section 4.3, were taken from actual runs of the TSP solution, on a given number of processors. Figure 5(a) shows the number of tasks generated for both problems. The total tasks, mean tasks (i.e. tot/p) and maximum tasks dealt with by any processor are
[Figure 5 appears here: (a) the total, mean and maximum number of tasks per processor for the 20- and 21-city problems, and (b) the PriQueue and Accumulator SADT overheads, both plotted against the number of processors.]
lj6
I
I
I
64 128 256
2'
4'
1 1 63 2' 6' 4 8I Processors
' 256 128
Fig. 5. (a) Number of generated tasks (b) SADT overheads
given. The total number of tasks stays quite constant as the processors vary. Obviously, a larger sample set is required if specific conclusions are to be drawn on the change in task numbers. The maximum number of tasks is very close to the mean value, even for large numbers of processors, demonstrating the good dynamic load balancing properties of the PriQueue. Figure 5(b) presents the SADT overheads for both problem sizes. The 20 city problem overhead is chiefly due to the rising cost of Accumulator updates. The 21 city problem overheads increase slowly from 23% for 4 processors up to 37% for 256 processors. The PriQueue overheads essentially decreases linearly with the number of processors. This is due to the even load balance of tasks, together with the highly concurrent implementation of the PriQueue providing scalable high performance. The Accumulator overheads show an approximately linear increase, due to its serial implementation updating replicated copies on each processor in turn. This is not a serious limitation for the TSP, since the update frequency is very low, but for larger number of processors a tree-based solution may offer improved performance.
6 Summary and Current Work
This paper has described an approach to the development of scalable high performance parallel software for irregular problems, and how this can lead to the support of a simple analytical costing, derived from the BSP model. Further work is required to describe the assumptions under which the cost model can be used (in particular, the degree of concurrency within the sharing methods), and to expand the number of machines to demonstrate the generality of the approach. However, results for the Cray T3D have demonstrated both the scalability and the high level of performance which can be achieved. Current work centres on the support of an adaptive unstructured 3D mesh SADT, for solving computational fluid dynamics problems, from existing MPI codes [11]. Each processor maintains a local state which consists of both partitioned internal mesh elements and replicated shared halo elements, with global consistency points defining when these halos are updated.
Acknowledgements Thanks to Professor Peter Dew and the members of the TallShiP and Pallas research projects for their helpful comments and the use of the serial code for the travelling salesman problem.
References
1. C. Clemencon, B. Mukherjee and K. Schwan, Distributed Shared Abstractions (DSA) on Multiprocessors, IEEE Transactions on Software Engineering, vol 22(2), pp 132-152, February 1996.
2. J. Darlington and H. W. To, Building Parallel Applications Without Programming, Abstract Machine Models for Highly Parallel Computers, (eds J. R. Davy and P. M. Dew) Oxford University Press, pp 140-154, 1995.
3. P. M. Dew and J. M. Nash, The High Performance Solution of Irregular Problems, MPPM'97: Massively Parallel Programming Models Workshop, Royal Society of Arts, London, November 1997 (to be published in IEEE Press).
4. K. Gharachorloo, S. V. Adve, A. Gupta and J. H. Hennessy, Programming for Different Memory Consistency Models, Journal of Parallel and Distributed Computing, vol 15, pp 399-407, 1992.
5. D. M. Goodeve and S. A. Dobson, Programming with Shared Data Abstractions, Irregular'97, Paderborn, Germany, June 1997.
6. L. V. Kale and A. B. Sinha, Information sharing mechanisms in parallel programs, Proceedings of the 8th International Parallel Processing Symposium, pp 461-468, April 1994.
7. W. F. McColl, An Architecture Independent Programming Model For Scalable Parallel Computing, Portability and Performance for Parallel Processing, J. Ferrante and A. J. G. Hey eds, John Wiley and Sons, 1993.
8. J. M. Nash, P. M. Dew and M. E. Dyer, A Scalable Concurrent Queue on a Message Passing Machine, The Computer Journal 39(6), pp 483-495, 1996.
9. J. M. Nash, P. M. Dew, J. R. Davy and M. E. Dyer, Scalable Dynamic Load Balancing using a Highly Concurrent Shared Data Type, The 2nd European School of Computer Science: Parallel Programming Environments for High Performance Computing, pp 123-128, April 1996.
10. J. M. Nash, P. M. Dew, J. R. Davy and M. E. Dyer, Implementation Issues Relating to the WPRAM Model for Scalable Computing, Euro-Par'96, Lyon, France, pp 319-326, 1996.
11. P. M. Selwood, M. Berzins and P. M. Dew, 3D Parallel Mesh Adaptivity: Data-Structures and Algorithms, Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1997.
12. L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM 33, pp 103-111, 1990.
EDPEPPS : A Toolset for the Design and Performance Evaluation of Parallel Applications T. Delaitre, M.J. Zemerly, P. Vekariya, G.R. Justo, J. Bourgeois, F. Schinkmann, F. Spies, S. Randoux, and S.C. Winter Centre for Parallel Computing, Cavendish School of Computer Science, University of Westminster 115 New Cavendish Street, London W1M 8JS [email protected] http://www.cpc.wmin.ac.uk/~edpepps
Abstract. This paper describes a performance-oriented environment for the design of portable parallel software. The environment consists of a graphical design tool based on the PVM communication library for building parallel algorithms, a state-of-the-art simulation engine, a CPU characteriser and a visualisation tool for animation of program execution and visualisation of platform and network performance measures and statistics. The toolset is used to model a virtual machine composed of a cluster of workstations interconnected by a local area network. The simulation model used is modular and its components are interchangeable which allows easy re-configuration of the platform. Both communication and CPU models are validated.
1 Introduction
A major obstacle to the widespread adoption of parallel computing in industry is the difficulty in program development due mainly to lack of parallel programming design tools. In particular, there is a need for performance-oriented tools, and especially for clusters of heterogeneous workstations, to allow the software designer to choose between design alternatives such as different parallelisation strategies or paradigms. A portable message-passing environment such as Parallel Virtual Machine (PVM) [12] permits a heterogeneous collection of networked computers to be viewed by an application as a single distributed-memory parallel machine. The issue of portability can be of great importance to programmers but optimality of performance is not guaranteed following a port to another platform with different characteristics. In essence, the application might be re-engineered for every platform [26]. Traditionally, parallel program development methods start with parallelising and porting a sequential code on the target machine and
The EDPEPPS (Environment for the Design and Performance Evaluation of Portable Parallel Software) project is funded by an EPSRC PSTPA Grant No.: GR/K40468 and also by EC Contract Nos.: CIPA-C193-0251 and CP-93-5383.
running it to measure and analyse its performance. Re-designing the parallelisation strategy is required when the performance achieved is not satisfactory. This is a time-consuming process and usually entails long hours of debugging before reaching an acceptable performance from the parallel program. Rapid prototyping is a useful approach to the design of (high-performance) parallel software in that complete algorithms, outline designs, or even rough schemes can be evaluated at a relatively early stage in the program development life-cycle, with respect to possible platform configurations and mapping strategies. Modifying the platform configurations and mappings will permit the prototype design to be refined, and this process may continue in an evolutionary fashion throughout the life-cycle before any parallel coding takes place. The EDPEPPS toolset described here is based on a rapid prototyping philosophy and comprises four main tools:
– A graphical design tool (PVMGraph) for designing parallel applications.
– A simulation utility (SES/Workbench [24]) based on discrete-event simulation.
– A CPU performance prediction tool (Chronos) which characterises computational blocks within the C/PVM code based on basic operations in C.
– A visualisation tool (PVMVis) for animation of program execution using traces generated by the simulator and visualisation of platform and network performance measures and statistics.
Other tools used in the environment for integration purposes are:
– A trace instrumentation utility (Tape/PVM) [19].
– A translator (SimPVM) [7] from C/PVM code to a queueing network graphical representation.
– A modified version of the SAGE++ toolkit [3] for restructuring C source code. In the original version the C files were passed through the C-preprocessor (cpp) and then processed by the SAGE++ parser (pC++2dep), which creates a parse tree containing nodes for the individual statements and expressions in the code (stored in .dep files). The modification is needed because pC++2dep does not understand preprocessor directives such as #define and #include. Therefore, rules for these directives were added to the grammar of pC++2dep and new node types for them were introduced in the parse tree.
– A translator (C2Graph) from existing C/PVM parallel applications into the PVMGraph graphical representation, based on the modified SAGE++ toolkit. This is provided to allow users to experiment with the toolset on already written parallel applications.
The advantage of the EDPEPPS toolset is that the cyclic process of design-simulate-visualise is executed within the same environment. The EDPEPPS toolset also allows generation of code for both simulation and real execution to run on the target platform if required. The toolset is also modular and extensible to allow modifications and change of platforms and design as and when required.
This paper describes the various tools within the EDPEPPS environment and presents a case study for illustration and validation of the models used. In the next section we describe several modelling tools with similar aims to EDPEPPS and we highlight the differences between them. In section 3 we describe the different tools in the EDPEPPS toolset. In section 4 we present results obtained from the case study. Finally, in section 5 we present conclusions and future work.
2 Parallel System Performance Modelling Tools
The current trend in parallel software modelling tools is to support all the software performance engineering activities in an integrated environment [22]. A typical toolset should be based on at least three main tools: a graphical design tool, a simulation facility and a visualisation tool [22]. The graphical design tool and the visualisation tool should coexist within the same environment to allow information about the program behaviour to be related to its design. Many existing toolsets consist of only a subset of these tools but visualisation is usually a separate tool. In addition, the modelling of the operating system is usually not addressed. The HAMLET toolset [23] supports the development of real-time applications based on transputers and PowerPCs. HAMLET consists of a design entry system (DES), a specification simulator (HASTE), a debugger and monitor (INQUEST), and a trace analysis tool (TATOO). However, the tools are not tightly integrated as in the case of EDPEPPS but are applied separately on the output of each other. Also no animation tool is provided. HeNCE (Heterogeneous Network Computing Environment) [1] is an X-window based software environment designed to assist scientists in developing parallel programs that run on a network of computers. HeNCE provides the programmer with a high level of abstraction for specifying parallelism as opposed to real parallel code in EDPEPPS. HeNCE is composed of integrated graphical tools for creating, compiling, executing, and analysing HeNCE programs. HeNCE relies on the PVM system for process initialisation and communication. HeNCE displays an event-ordered animation of application execution. The ALPSTONE project [17] comprises performance-oriented tools to guide a parallel programmer. The process starts with an abstract, BACS (Basel Algorithm Classification Scheme), description [5]. This is in the form of a macroscopic abstraction of program properties, such as process topology and execution structure, data partitioning and distribution descriptions, and interaction specifications. From this description, it is possible to generate a time model of the algorithm which allows performance estimation and prediction of the algorithm runtime on a particular system with different data and system sizes. If the prediction promises good performance implementation and verification can start. This can be helped with a skeleton definition language (ALWAN or PEMPIProgramming Environment for MPI), which can derive a time model in terms of the BACS abstraction, and a portability platform (TIANA), which translates the program to C with code for a virtual machine such as PVM.
The VPE project [21] aims to design and monitor parallel programs in the same tool. The design is described as a graph where the nodes represent sequential computation or a reference to another VPE graph. Performance analysis and graph animation are not used here, but the design aspect of this work is elaborate. The PARADE project [27] is mainly oriented on the animation aspects. PARADE is divided into a general animation approach which is called POLKA, and specific animation developments such as PVM with PVaniM, Threads with GThreads and HPF. This work does not include any graphical design of parallel programs, thus, the predefined animations and views can decrease the user understanding. One of the most important aspects of POLKA is the ability of classification between general and specific concepts. In the SEPP project [6] (Software Engineering for Parallel Processing) a toolset based on six types of tools has been developed. There are static design tools, dynamic support tools, debugging tools, behaviour analysis tools, simulation tools and visualisation tools [14,10,15]. These tools are integrated within the GRADE environment [16,11]. The GRAPNEL application programming interface currently supports the PVM message passing library. The TOPSYS (TOols for Parallel SYStems) project [2] aims to develop a portable environment which integrates tools that help programmers cope with every step of the software development cycle of parallel applications. The TOPSYS environment contains tools which support specification and design, coding and debugging, and optimisation of multiprocessor programs. The TOPSYS environment comprises: a CASE environment for the design and specification of applications (SAMTOP) including code generation and mapping support, a multi-processor operating system, a high level debugger (DETOP), a visualiser (VISTOP), and a performance analyser (PATOP). The tools are based on the MMK operating system which has been implemented for Intel’s iPSC/2 hypercube. The tools were later ported to PVM in [18]. A more detailed review of parallel programming design tools and environments can be found in [8].
3 Description of the Integrated Toolset
The advantages of the EDPEPPS toolset over traditional parallel design methods are that it offers a rapid prototyping approach to parallel software development, allows performance analysis to be done without accessing the target platform, helps the user to take decisions about scalability and sizing of the target platform, offers modularity and extensibility through layered partitioning of the model and allows the software designer to perform the cycle of design-simulate-analysis in the same environment without having to leave the toolset. Figure 1 shows the components of the EDPEPPS toolset. The process starts with the graphical design tool (PVMGraph), step (1) in the figure, by building a graph representing a parallel program design based on the PVM programming model. The graph is composed of computational tasks and communications. The
tool provides graphical representations of PVM calls which the user can select to build the required design. The software designer can then generate (by the click of a button) C/PVM code (.c files) for both simulation and real execution. The toolset also provides a tool (C2Graph), step (0), to translate already developed parallel applications into a graphical representation suitable for PVMGraph. The software designer can then experiment with the toolset by changing the parallelisation model or other parameters, such as the number of processors or processor types, to optimise the code.
[Figure 1 block diagram: PVMGraph (1) produces .c, .def and .app files; the simulation path runs the Tape/PVM preprocessor (2), PIC/Chronos (3), the SimPVM translator (4), the SES/Workbench simulation stages (5) and PVMVis (6); the real execution path runs the Tape/PVM preprocessor (7) followed by compilation and real execution (8); C2Graph (0) reverse-engineers existing C applications into the PVMGraph representation.]
Fig. 1. The EDPEPPS Integrated Environment.
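To make the design-generate cycle concrete, the following is a hypothetical sketch of the kind of C/PVM code a PVMGraph task box might expand to; the task name, message tags and work decomposition are illustrative assumptions, not actual PVMGraph output:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int tids[4];
        int nspawned, i, item, result;

        pvm_mytid();                                   /* enrol this task in PVM */
        nspawned = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

        for (i = 0; i < nspawned; i++) {               /* send one work item per worker */
            item = i;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&item, 1, 1);
            pvm_send(tids[i], 1);                      /* message tag 1 = work */
        }
        for (i = 0; i < nspawned; i++) {               /* collect the results */
            pvm_recv(-1, 2);                           /* message tag 2 = result */
            pvm_upkint(&result, 1, 1);
            printf("result: %d\n", result);
        }
        pvm_exit();
        return 0;
    }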
In the simulation path each C/PVM source code obtained from the PVMGraph is instrumented using a slightly modified version of the Tape/PVM trace pre-processor, step (2), [19]. The output is then parsed using the Program Instruction Characteriser (PIC), step (3), which forms a part of the CPU performance prediction tool called “Chronos” which inserts cputime calls at the end of each computational block. The instrumented C source files are translated using the SimPVM Translator [7], step (4), into a queueing network representation suitable for Workbench graph (.grf file). SES/Workbench, step (5), translates the graph file into the Workbench object oriented simulation language called SES/sim [25] using an SES utility (sestran). The sim file is then used to generate an executable model using some SES/Workbench utilities, libraries, declarations and the PVM platform model. The simulation is based on discrete-event modelling. SES/Workbench has been used both to develop and simulate the platform models. Thus the Workbench simulation engine is an intrinsic part of the toolset. All these simulation actions are hidden from the user and are executed from the PVMGraph window by a click on the simulation button and hence shown in the EDPEPPS environment in Figure 1 in one box under “Sim-
ulation Stages”. The simulation executable is run using three input files containing parameters concerning the target virtual environment (e.g. number of hosts, host names, architecture, the UDP communication characteristics and the timing costs for the set of instructions used by Chronos [4]). The UDP model and the instruction costs are obtained by benchmarking (benchmarks are provided off-line) the host machines in the network. The simulation outputs are the execution time, a Tape/PVM trace file and a statistics file about the virtual machine. These files are then used by the visualisation tool (PVMVis), step (6), in conjunction with the currently loaded application to animate the design and visualise the performance of the system. The design can be modified and the same cycle is repeated until a satisfactory performance is achieved. In the real execution path the Tape/PVM pre-processor, step (7), is used to instrument the C source files and these are then compiled and executed, step (8), to produce the Tape/PVM trace file required for the visualisation/animation process. This step can be used for validation of simulation results but only when the target machine is accessible. The visualisation tool (PVMVis) offers the designer graphical views (animation) representing the execution of the designed parallel application as well as the visualisation of its performance. PVMGraph and PVMVis are incorporated within the same Graphical User Interface where the designer can switch between these two possible modes. The performance visualisation presents graphical plots, bar charts, space-time charts, and histograms for performance measures concerning the platform at three levels (the message passing layer, the operating system layer and the hardware layer). The following sections describe the main tools within EDPEPPS.

3.1 PVMGraph
PVMGraph is a graphical programming environment to support the design and implementation of parallel applications. PVMGraph offers a simple but yet expressive graphical representation and manipulation for the components of a parallel application. The main function of PVMGraph is to allow the parallel software designer or programmer to develop PVM applications using a combination of graphical objects and text. Graphical objects are composed of boxes which represent tasks (which may include computation) and arrows which represent communications. The communication actions are divided into two groups: input and output. The PVM actions (calls) are numbered to represent the link between the graph and text in the parallel program. Also different types and shapes of arrows are used to represent different types of PVM communication calls. Parallel programs (PVM/C) can be automatically generated after the completion of the design. Additionally, the designer may enter PVM/C code directly into the objects. The graphical objects and textual files are stored separately to enable the designer to re-use parts of existing applications [13]. A PVMGraph on-line1 demonstration is available on the Web. 1
http://www.cpc.wmin.ac.uk/~edpepps/demo html/ppf.html
3.2 PVMVis
The main objective of this tool is to offer the designer graphical views and animation representing the execution and performance of the designed parallel application from the point of view of the hardware, the design and the network. The animation is an event-based process and is used to locate undesirable behaviour such as deadlocks or performance bottlenecks. The animation view in PVMVis is similar to the design view in PVMGraph except that the palette is not shown and two extra components for performance analysis are added: a barchart view and a platform view. The barchart view shows historical states for the simulation and the platform view shows some statistics for selected performance measures at three levels: the message passing layer, the operating system layer and the hardware layer.
3.3 The CPU Performance Prediction Tool: Chronos
The Program Instruction Characteriser (PIC) is called only in the simulation path to estimate the time taken by computational blocks within a parallel algorithm. PIC characterises a workload by a number of high-level language instructions (e.g. float addition) [4], taking into account the effect of instruction and data caches. Assumptions have been made to reduce the number of possible machine instructions to 43 (see [4] for more details on these assumptions). The costs associated with the various instructions are kept in a file in the hardware layer accessible by the SES utilities. These costs are obtained by benchmarking the instructions on different machines. PIC first parses an instrumented C/PVM program using the modified pC++2dep (mpC++2dep) tool from the SAGE++ toolkit. In the second stage, PIC traverses the parse tree using the SAGE++ library and inserts cputime calls with the number of machine instructions within each sequential C code fragment. The cputime call is a simple function with a fixed number of parameters (a total of 31). This is different from the number of machine instructions because the instruction cache duplicates some of the instructions (hit or miss). Each parameter of the cputime function represents the number of times each instruction is executed within the sequential C code fragment. The only exception is the last parameter, which determines whether the instruction cache is hit or miss for the code fragment in question.
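As a purely illustrative sketch, an instrumented fragment might look as follows; the cputime signature below is assumed and abbreviated (the real call carries 31 instruction-class counts plus the instruction-cache indicator), and the counts are hypothetical:

    /* stub declaration with an assumed, shortened parameter list */
    void cputime(int flt_mul, int flt_add, int int_add, int loads, int icache_hit);

    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        int i;
        for (i = 0; i < n; i++)            /* sequential computational block */
            sum += a[i] * b[i];
        /* call inserted by PIC at the end of the block: per-class instruction counts */
        cputime(n, n, n, 2 * n, 1);
        return sum;
    }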
3.4 C2Graph
As mentioned before, this tool allows existing PVM applications to be converted into the EDPEPPS format. The C/PVM application files are first parsed with mpC++2dep to get the .dep files. These files are then traversed by the C2Graph translator using the SAGE++ library. The translator also takes into account the PVM calls in the original code and generates their corresponding graphical representation in the PVMGraph files.
The translator then determines the master process, positions it with the other tasks by calculating appropriate coordinates for them in the PVMGraph screen, and writes the PVMGraph definition files (.def) for each task. The translator finally writes the application file (.app) required for PVMGraph.
3.5 SimPVM Translator
From PVMGraph graphical and textual objects, executable and “simulatable” PVM programs can be generated. The “simulatable” code generated by PVMGraph is written in a special intermediary language called SimPVM, which defines an interface between PVMGraph and SES/Workbench [7]. To simulate the application, a model of the intended platform must also be available. Thus, the simulation model is partitioned into two sub-models: a dynamic model described in SimPVM, which consists of the application software description and some aspects of the platform (e.g. number of hardware nodes) and a static model which represents the underlying parallel platform. The SimPVM language contains C instructions, PVM and PVM group (PVMG) functions, and simulation constructs such as computation delay and probabilistic functions.
3.6 The EDPEPPS Simulation Model
The EDPEPPS simulation model consists of the PVM platform model library and the PVM programs for simulation. The PVM platform model is partitioned into four layers: the message passing layer, the group functions layer which sits on top of the message passing layer, the operating system layer and the hardware layer. Modularity and extensibility are two key criteria in simulation modelling, therefore layers are decomposed into modules which permit a re-configuration of the entire PVM platform model. The modelled configuration consists of a PVM environment which uses the TCP/IP protocol, and a cluster of heterogeneous workstations connected to a 10 Mbit/s Ethernet network.
[Figure 2 layers: an application layer (PVM applications); a message-passing layer (group server, group library, PVM daemon, PVM library); an operating system layer (scheduling, communication, TCP/IP); and a hardware layer (CPU, Ethernet network).]
Fig. 2. Simulation model architecture.
The message-passing layer models a single (parallel) virtual machine dedicated to a user. It is composed of a daemon, which resides on each host making up the virtual machine, a group server and the libraries (PVM and PVMG), which provide an interface to PVM services. The daemon and the group server act primarily as message routers. They are modelled as automatons or state machines which are a common construct for handling events. The LIBPVM library allows a task to interact with the daemon and other tasks. The PVM library is structured into two layers. The top layer includes most PVM programming interface functions and the bottom layer models the communication interface with the local daemon and other tasks. Only this layer needs to be modified if the Message Passing Interface (MPI [20]) model is to be supported. The major components in the operating system layer are the System Call Interface, the Process Scheduler, and the Communication Module. The Communication Module is structured into 3 sub-layers: the Socket Layer, the Transport Layer and the Network Layer. The Socket Layer provides a communications endpoint within a domain. The Transport Layer defines the communication protocol (either TCP or UDP). The Network Layer models the Internet Protocol (IP). The Hardware Layer is comprised of hosts, each with a CPU layer, and the communications subnet (Ethernet). Each host is modelled as a single server queue with a time-sliced round-robin scheduling policy. The communications subnet is Ethernet, whose performance depends on the number of active hosts and the packet characteristics. Resource contention is modelled using the CSMA/CD (Carrier Sense Multiple Access with Collision Detection) protocol. The basic notion behind this protocol is that a broadcast has two phases: propagation and transmission. During propagation, packet collisions can occur. During transmission, the carrier sense mechanism causes the other hosts to hold their packets.
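As a rough, simplified sketch of the two-phase broadcast described above (an assumption for illustration only, not the actual EDPEPPS model, which also accounts for collisions and retries during the propagation phase):

    /* time for one Ethernet broadcast: propagation phase (the window in which
     * collisions can occur) followed by the transmission phase, during which
     * carrier sense makes the other hosts hold their packets */
    double ethernet_broadcast_time(double packet_bits, double bandwidth_bps,
                                   double propagation_s)
    {
        return propagation_s + packet_bits / bandwidth_bps;
    }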
4 Case Studies

4.1 Bessel Equation
The computationally intensive application chosen here to validate the CPU model is the Bessel equation, which is a differential equation defined by:

x² d²y/dx² + x dy/dx + (x² − m²) y = 0

When m is an integer, solutions to the Bessel differential equation are given by the Bessel function of the second kind [4], also called the Neumann function or Weber function. The Bessel function of the second kind is translated in C and executed 100000 times. The results for four machines are given in Figure 3. The average error is about 6.13%. The errors for the 486 machine are larger than those of other machines and this is probably due to the fact that there is only one cache used for both data and instructions in the 486 machine. However, for the superscalar
machines, Pentium and SuperSparc, the results are encouraging with errors of 1.7% and 3.4% respectively.
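The hand-translated series code is not reproduced in the paper; a comparable workload can be sketched with the POSIX Bessel function of the second kind (compile with -lm; the argument values are arbitrary and chosen only for illustration):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double acc = 0.0;
        int i;
        for (i = 0; i < 100000; i++)                 /* executed 100000 times, as in the experiment */
            acc += yn(2, 1.0 + (i % 100) * 0.1);     /* Bessel function of the second kind, Y_n(x) */
        printf("%f\n", acc);                         /* prevent the loop from being optimized away */
        return 0;
    }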
[Figure 3, "Resolution of the Bessel equation of the second kind": bar chart of real execution time, simulated time and their difference (in %) on four machines — Horus (Pentium 150 MHz, FreeBSD 2.1, 1.70%), Cheops (SparcStation 20, SuperSparc 60 MHz, Solaris 2.4, 3.40%), Anubis (SparcStation 5, MicroSparcII 70 MHz, Solaris 2.3, 5%) and Osiris (486DX4/100 MHz, FreeBSD 2.0.5, 18%).]
Fig. 3. Comparison between real execution and simulation.
4.2 CCITT H.261 Decoder
The application chosen here to demonstrate the capabilities of the environment in the search for the optimal design is the Pipeline Processor Farm (PPF) model [9] of a standard image processing algorithm, the CCITT H.261 decoder [9]. Figure 4 shows how the H.261 algorithm decomposes into a three-stage pipeline: frame initialisation (T1); frame decoder loop (T2) with a farm of 5 tasks; and frame output (T3).
Fig. 4. PPF topology for a three-stage pipeline.
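A hypothetical C/PVM sketch of one farm worker in the middle stage follows; the task identifiers, message tags, data layout and the decode_group helper are all assumptions made for illustration, not code from the case study:

    #include "pvm3.h"

    #define TAG_WORK   1
    #define TAG_RESULT 2

    extern void decode_group(int *group, int len);   /* sequential H.261 decoding work (assumed helper) */

    void t2_worker(int t1_tid, int t3_tid)
    {
        int group[64];
        for (;;) {
            pvm_recv(t1_tid, TAG_WORK);              /* take a group of macroblocks from stage T1 */
            pvm_upkint(group, 64, 1);
            decode_group(group, 64);
            pvm_initsend(PvmDataDefault);
            pvm_pkint(group, 64, 1);
            pvm_send(t3_tid, TAG_RESULT);            /* forward decoded data to stage T3 */
        }
    }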
The first and last stages are inherently sequential, whereas the middle stage contains considerable data parallelism. The same topological variation in the PPF model leads directly to performance variation in the algorithm, which, typically, is only poorly understood at the outset of design. One of the main purposes of the simulation tool in this case is to enable a designer to identify the optimal topology, quickly and easily, without resorting to run-time experimentation. Two experiments for 1 and 5 frames were carried out. The number of processors in Stage T2 is varied from 1 to 5 (T1 and T3 were mapped on the same processor). In every case, the load is evenly balanced between processors. The target platform is a heterogeneous network of up to 6 workstations (SUN4s, SuperSparcs and PCs). Timings for the three computational stages of the algorithm were extracted from [9] and inserted as time delays. Figure 5 shows the simulated and real experimental results for speed-up.
Fig. 5. Comparison between speed-ups of simulation and real experiments for the H.261 PPF algorithm.
As expected, the figure shows that the 5-frame scenario performs better than the 1-frame scenario, since the pipeline is fuller in the former case. The difference between simulated and real speed-ups is below 10% even though the PPF simulation results do not include packing costs.
5 Conclusion
This paper has described the EDPEPPS environment which is based on a performance-oriented parallel program design method. The environment supports graphical design, performance prediction through modelling and simulation and visualisation of predicted program behaviour. The designer is not required to leave the graphical design environment to view the program’s behaviour, since
the visualisation is an animation of the graphical program description. It is intended that this environment will encourage a philosophy of program design, based on a rapid synthesis-evaluation design cycle, in the emerging breed of parallel programmers. Success of the environment depends critically on the accuracy of the underlying simulation system. Preliminary validation experiments showed average errors between the simulation and the real execution of less than 10%. An important future direction of our work is to extend the simulation model to support other platforms, such as MPI. The modularity and flexibility of our model will ensure that the PVM layer may be re-used where appropriate in the development of the MPI model component. Another planned extension to our environment is the integration of a distributed debugger such as the DDBG [16].
References
1. A. Beguelin, et al. HeNCE: A Heterogeneous Network Computing Environment, Scientific Programming, Vol. 3, No. 1, pp 49-60.
2. T. Bemmerl. The TOPSYS Architecture, In H. Burkhart, editor, CONPAR90-VAPP IV Conf., Zurich, Switzerland, Springer, September 1995, Lecture Notes in Computer Science, 457, pp 732-743.
3. F. Bodin, et al. Sage++: An Object-Oriented Toolkit and Class Library for Building Fortran and C++ Restructuring Tools, Proc. 2nd Annual Object-Oriented Numerics Conf., 1994. http://www.extreme.indiana.edu/sage/docs.html.
4. J. Bourgeois. CPU Modelling in EDPEPPS, EDPEPPS EPSRC Project (GR/K40468), D3.1.6, EDPEPPS/35, Centre for Parallel Computing, University of Westminster, London, June 1997.
5. H. Burkhart, et al. BACS: Basel Algorithm Classification Scheme, version 1.1, Tech. Report 93-3, Universität Basel, URZ+IFI, 1993.
6. T. Delaitre, et al. Simulation of Parallel Systems in SEPP, in: A. Pataricza, ed., The 8th Symposium on Microcomputer and Microprocessor Applications 1 (1994), pp 294-303.
7. T. Delaitre, et al. Final Syntax Specification of SimPVM, EDPEPPS EPSRC Project (GR/K40468) D2.1.4, EDPEPPS/22, Centre for Parallel Computing, University of Westminster, London, March 1997.
8. T. Delaitre, M.J. Zemerly, and G.R. Justo, Literature Review 2, EDPEPPS EPSRC Project (GR/K40468) D6.2.2, EDPEPPS/32, Centre for Parallel Computing, University of Westminster, London, May 1997.
9. A.C. Downton, R.W.S. Tregidgo and A. Cuhadar, Top-down structured parallelisation of embedded image processing applications, in: IEE Proc.-Vis. Image Signal Process. 141(6) (1994) 431-437.
10. G. Dozsa, T. Fadgyas and P. Kacsuk, A Graphical Programming Language for Parallel Programs, in: A. Pataricza, E. Selenyi and A. Somogyi, ed., Proc. of the Symposium on Microcomputer and Microprocessor Applications (1994), pp 304-314.
11. G. Dozsa, P. Kacsuk and T. Fadgyas, Development of Graphical Parallel Programs in PVM Environments, Proc. of DAPSYS'96, pp 33-40.
12. A. Geist, et al. PVM: Parallel Virtual Machine, MIT Press, 1994.
13. G.R. Justo, PVMGraph: A Graphical Editor for the Design of PVM Programs, EDPEPPS EPSRC Project (GR/K40468) D2.3.3, EDPEPPS/5, Centre for Parallel Computing, University of Westminster, February 1996.
14. P. Kacsuk, P. Dozsa and T. Fadgyas, Designing Parallel Programs by the Graphical Language GRAPNEL, Microprocessing and Microprogramming 41 (1996), pp 625-643.
15. P. Kacsuk, et al. Visual Parallel Programming in Monads-DPV, in: López Zapata, ed., Proc. of the 4th Euromicro Workshop on Parallel and Distributed Processing (1996), pp 344-351.
16. P. Kacsuk, et al. A Graphical Development and Debugging Environment for Parallel Programs, Parallel Computing, 22:1747-1770, 1997.
17. W. Kuhn and H. Burkhart. The ALPSTONE Project: An Overview of a Performance Modelling Environment, In 2nd Int. Conf. on HiPC'96, McGraw Hill 1996, pp 491-496.
18. T. Ludwig, et al. The TOOL-SET - An Integrated Tool Environment for PVM, In EuroPVM'95, Lyon, France, September 1995. Tech. Rep. 95-02, Ecole Normale Superieure de Lyon.
19. E. Maillet, TAPE/PVM: An Efficient Performance Monitor for PVM Applications - User Guide. LMC-IMAG, ftp://ftp.imag.fr/ in pub/APACHE/TAPE, March 1995.
20. M.P.I. Forum. MPI: A Message Passing Interface Standard. The Int. Journal of Supercomputer Applications and High-Performance Computing, 8(3/4), 1994.
21. P. Newton, J. Dongarra, Overview of VPE: A Visual Environment for Message-Passing, Heterogeneous Computing Workshop, 1995.
22. C. Pancake, M. Simmons and J. Yan, Performance Evaluation Tools for Parallel and Distributed Systems, Computer 28 (1995), pp 16-19.
23. P. Pouzet, J. Paris and V. Jorrand, Parallel Application Design: The Simulation Approach with HASTE, in: W. Gentzsch and U. Harms, ed., HPCN 2 (1994), pp 379-393.
24. Scientific and Engineering Software Inc. SES/workbench Reference Manual, Release 3.1, Scientific Engineering Software Inc., 1996.
25. K. Sheehan and M. Esslinger, The SES/sim Modeling Language, Proc. The Society for Computer Simulation, San Diego CA, July 1989, pp 25-32.
26. A. Reinefeld and V. Schnecke, Portability vs Efficiency? Parallel Applications on PVM and Parix, in Parallel Programming and Applications, P. Fritzson and L. Finmo eds., (IOS Press, 1995), pp 35-49.
27. J.T. Stasko. The PARADE Environment for Visualizing Parallel Program Executions, Technical Report GIT-GVU-95-03, Graphics, Visualization and Usability Center, Georgia Inst. of Tech., 1994.
Long Operand Arithmetic on Instruction Systolic Computer Architectures and Its Application in RSA Cryptography

Bertil Schmidt 1, Manfred Schimmler 2, and Heiko Schröder 3

1 Lehrstuhl für Informatik I, RWTH Aachen, 52056 Aachen, Germany, [email protected]
2 Inst. f. Datenverarbeitungsanlagen, TU Braunschweig,
38106 Braunschweig, Germany, [email protected]
3 Department of Computer Studies, Loughborough University, Loughborough, LE11 3TU, England, [email protected]

Abstract. Instruction systolic arrays have been developed in order to combine the speed and simplicity of systolic arrays with the flexibility of MIMD parallel computer systems. Instruction systolic arrays are available as square arrays of small RISC processors capable of performing integer and floating point arithmetic. In this paper we show that the systolic control flow can be used for an efficient implementation of arithmetic operations on long operands, e.g. 1024 bits. The demand for long operand arithmetic arises in the field of cryptography. It is shown how the new arithmetic leads to a high-speed implementation for RSA encryption and decryption.

1 Introduction

Instruction systolic arrays (ISAs) provide programmable high-performance hardware for specific computationally intensive applications [5]. Typically, such an array is connected to a sequential host, thus operating like a coprocessor which solves only the computationally intensive tasks within a global application. The ISA model is a mesh-connected processor grid, which combines the advantages of special purpose systolic arrays with the flexible programmability of general purpose machines [3]. In this paper we illustrate how the capabilities of ISAs are exploited to derive efficient parallel algorithms for addition, subtraction, multiplication and division of long operands. The demand for long operand arithmetic arises in the field of cryptography, e.g. RSA encryption and decryption. Their implementations on Systola 1024 show that the concept of the ISA is very suitable for long operand arithmetic and results in significant run time savings. The ISA concept is explained in detail in Section 2. Section 3 gives an overview of the architecture of Systola 1024. It is documented how the ISA has been integrated on a low-cost add-on board for commercial PCs. The new ISA algorithms for long operand arithmetic are explained in Sections 4 to 6. The
implementation of RSA based on these arithmetic routines is given in Section 7. Section 8 discusses its performance and concludes the paper.
2 Principle of the ISA
The basic architecture of the ISA is a quadratic n × n array of identical processors, each connected to its four direct neighbours by data wires. The array is synchronized by a global clock. The processors are controlled by instructions, row selectors and column selectors. The instructions are input in the upper left corner of the processor array, and from there they move step by step in horizontal and vertical direction through the array. This guarantees that within each diagonal of the array the same instruction is active during each clock cycle. In clock cycle k + 1 processors (i + 1, j) and (i, j + 1) execute the instruction that has been executed by processor (i, j) in clock cycle k. The selectors also move systolically through the array: row-selectors horizontally from left to right, column-selectors vertically from top to bottom (Fig. 1). The selectors mask the execution of the instructions within the processors, i.e. an instruction is executed if and only if both selector bits, currently in that processor, are equal to one. This construct leads to a very flexible structure which creates the possibility of very efficient solutions for a large variety of applications.
Fig. 1. Control flow in an ISA

Every processor has read and write access to its own memory. Besides that, it has a designated communication register (C-register) that can also be read by the four neighbour processors. Within each clock phase reading access is always performed before writing access. Thus, two adjacent processors can exchange data within a single clock cycle in which both processors overwrite the contents of their own C-register with the contents of the C-register of their neighbour. This convention avoids read/write conflicts and also creates the possibility to broadcast information across a whole row or column with one single instruction:

Row broadcast: Each processor reads the value from its left neighbour. Since the execution of this operation is pipelined along the row, the same value is propagated from one C-register to the next, until it finally arrives at the rightmost processor. Note that the row broadcast requires only a single instruction.

The period of ISA programs is the number of their instructions, which is 2n - 2 clock cycles less than their execution time. The period describes the minimal time from the first input of an instruction of this program to the first
input of an instruction of the next program. In the following the period of ISA programs is used to specify their time-complexity. This is appropriate because they will be used as subroutines of much larger ISA programs in Section 7.
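For intuition, the effect of the single-instruction row broadcast described above can be emulated sequentially (a sketch only; on the ISA the copy is performed by the skewed execution of one instruction, not by a loop):

    #include <stdio.h>

    int main(void)
    {
        int C[8] = {42, 0, 0, 0, 0, 0, 0, 0};   /* C-registers of one processor row */
        int j;
        for (j = 1; j < 8; j++)                  /* clock cycle in which the instruction reaches processor j */
            C[j] = C[j - 1];                     /* C := C[WEST] */
        for (j = 0; j < 8; j++)
            printf("%d ", C[j]);                 /* the leftmost value now fills the whole row */
        printf("\n");
        return 0;
    }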
3 Architecture of Systola 1024
The ISATEC Systola 1024 parallel computer is a low-cost add-on board for standard PCs. The ISA on the board is a 4 × 4 array of processor chips. Each chip contains 64 processors, arranged as an 8 × 8 square. This provides 1024 processors on the board. In order to exploit the computation capabilities of this unit, it is necessary to provide data and control information at an extremely high speed. Therefore, a cascaded memory concept, consisting of interface processors and board RAM, is implemented on board that forms a fast input and output environment for the parallel processing unit (see Fig. 2).
Fig. 2. Data paths in the parallel computer Systola 1024
4 Addition and Subtraction of Long Operands
The difficulty of partitioning an addition in an array where only neighbours can talk to each other is the carry propagation. If it meets a sum value consisting of ones only, then an incoming carry produces a new carry in the most significant digit. In the worst case this can hold for all blocks of the partitioned addition, such that the carry-bit of the first block propagates through all other blocks. Therefore, we use an accumulation technique based on the generate and propagate signals of a carry-look-ahead adder. If every processor knows whether it has to propagate an incoming carry from the left to the right, it can set a flag (e.g. the zero-flag) to one if and only if it propagates an incoming carry. Furthermore every processor can store the carry generated by itself in its C-register. With the accumulation operation (*) if zeroflag then C := C[WEST] all carry-bits travel to the processor one position left of their destinations. This is again because of the skewed instruction execution: Suppose processors (i, j-1), (i, j), and (i, j+1) have set their zero-flag to one (in order to propagate a carry) and processor (i, j-2) has generated a carry-bit one. The instruction (*) is executed in processor (i, j-1), which replaces its carry-bit (0) with that of processor (i, j-2). In the next clock cycle the same instruction is executed in processor (i, j), where again the own carry (0) is replaced by the value of the left neighbour (which has been set to one in the previous instruction cycle). In the following cycle, processor (i, j+1) replaces its carry with the one from the left neighbour, etc. With the final assignment C := C[WEST] all carries are at the
place where they are needed. Note that the carry propagation requires only two instructions, although the carry-bits can travel over a distance of up to n - 1 processors. Of course, this technique only works if no processor propagating a carry has generated a carry-bit on its own. But this is obviously impossible. The operands to be added are distributed over the rows of the ISA. If the length of the operands is 512/1024 bits, every processor in a row of 32 processors gets 16/32 bits. The LSBs are stored in the leftmost and the MSBs in the rightmost processor of the row. Every processor has 16 bits of the operand A in register RA and 16 bits of B in register RB. The operations RSUM:=RA + RB; C:=carry; store the sum of RA and RB in RSUM and the carry-bit in C. Now every processor must find out whether it has to propagate an incoming carry. This is the case if and only if the value of RSUM consists of ones only. Thus, we can complete the long addition as pointed out above with if (RSUM + 1 = 0) then C:=C[WEST]; RSUM:=C[WEST] + RSUM;

Table 1. Addition of two 24-bit numbers in a row with 6 processors

    Processor   1     2     3     4     5     6
    A           0101  1100  1010  0010  1101  1000   Every pr. gets 4 bits of A and B; LSB/MSB in leftmost/rightmost bit in pr. 1/6.
    B           0001  0011  0101  1010  1010  0100
    SUM         0100  1111  1111  1001  0000  1100   SUM := A + B
    C           1     0     0     0     1     0      C-register stores the carry-bit
    Zeroflag    0     1     1     0     0     0      if (SUM+1=0) then zeroflag:=1 else zeroflag:=0
    C           1     1     1     0     1     0      C after propagation: if zeroflag then C:=C[WEST]
    SUM         0100  0000  0000  0101  0000  0010   SUM := SUM + C[WEST]

The subtraction works in principle in the same way as the addition, using RDIFF:=RA-RB, the propagation condition if (RDIFF=0) and the carry-bit subtraction RDIFF:=RDIFF-C[WEST]. Due to the systolic control flow the n-bit addition/subtraction is possible using O(n) ISA processors with constant period. The complete 512/1024-bit addition/subtraction in one Systola 1024 processor-row requires only 5/8 instructions. However, having 32 processor-rows, 32 different additions/subtractions can be calculated in parallel in the same time.
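The row-wide scheme can be emulated sequentially as follows (a sketch for intuition only; on the ISA the three phases correspond to the handful of systolic instructions given above):

    #include <stdint.h>

    #define W 32   /* 32 processors of 16 bits each -> 512-bit operands */

    /* words are stored least significant word first, as in the processor row */
    void long_add(const uint16_t A[W], const uint16_t B[W], uint16_t SUM[W])
    {
        uint16_t C[W];      /* carry generated by each processor */
        int zero[W];        /* zero-flag: this word propagates an incoming carry */
        int j;

        for (j = 0; j < W; j++) {                 /* RSUM := RA + RB; C := carry */
            uint32_t s = (uint32_t)A[j] + B[j];
            SUM[j]  = (uint16_t)s;
            C[j]    = (uint16_t)(s >> 16);
            zero[j] = (SUM[j] == 0xFFFF);         /* if (RSUM + 1 = 0) */
        }
        for (j = 1; j < W; j++)                   /* if zeroflag then C := C[WEST] (skewed execution) */
            if (zero[j])
                C[j] = C[j - 1];
        for (j = 1; j < W; j++)                   /* RSUM := C[WEST] + RSUM */
            SUM[j] += C[j - 1];
    }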
5 Multiplication of Long Operands
The idea behind the multiplication on the ISA is related to the school method for integer multiplication: One operand is cut into pieces and all pieces are multiplied with the other operand in parallel. The results are then shifted with respect to each other in an appropriate way and added to form the final result. The ISA program for the m × n-bit (m = 16·p, n = 16·q) multiplication in each processor-row is explained below. Every processor stores 16 bits of the first operand in register RA, the least significant 16 bits in the leftmost and the most significant 16 bits in the rightmost processor. The complete second operand is stored in RB0, ..., RBq-1 in each processor. At the end the most significant m bits
of the result will be distributed over the rows in RPq. The least significant n bits will be stored in the first processor of each row in RP0, ..., RPq-1.

    RP0..RPq := RA * RB0..RBq-1       // compute 16 x 16·q-bit multiplication
    C := RP0                          // load least significant result in C-register
    for i := 1 to q do                // add RP0..RPq along processor-row in q steps:
    begin                             //   add 16-bit more significant word in C[EAST] and RPi;
      RPi-1 := C                      //   sum is stored in C and RPi;
      C := C[EAST] + RPi + carry      //   this addition is repeated up to the most significant word
    end
    RPq := C                          // store most significant result
    C := carry
    if RPq + 1 = 0 then C := C[WEST]  // propagate the carry-bit of the last addition as described in Section 4
    RPq := C[WEST] + RPq

The implementation on Systola 1024 requires 223/507 instructions for one single 512 × 512 / 1024 × 1024-bit multiplication in one row of 32 processors.
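Sequentially, the same school-method partitioning looks as follows (a sketch for intuition; on the ISA the inner additions are the systolic C[EAST] accumulations above):

    #include <stdint.h>

    /* A has p 16-bit words, B has q 16-bit words, least significant word first;
     * result must provide p+q words */
    void long_mul(const uint16_t *A, int p, const uint16_t *B, int q, uint16_t *result)
    {
        int i, j, k;
        for (k = 0; k < p + q; k++)
            result[k] = 0;
        for (i = 0; i < p; i++) {                 /* each "processor" owns the word A[i] */
            uint32_t carry = 0;
            for (j = 0; j < q; j++) {             /* A[i] times the whole of B, shifted by i words */
                uint32_t t = (uint32_t)A[i] * B[j] + result[i + j] + carry;
                result[i + j] = (uint16_t)t;
                carry = t >> 16;
            }
            result[i + q] = (uint16_t)carry;      /* most significant word of this partial product */
        }
    }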
6 Division of Long Operands
Division can be efficiently reduced to multiplication and subtraction by using the Newton-Raphson method. For a value B the reciprocal value is computed by an iteration that converges rapidly towards 1/B. To achieve a fast convergence (2^m bits precision in m iteration steps), 0.5 < B < 1 must hold for B. The iteration is given by the equations

x0 := 1,   xi+1 := (2 − B · xi) · xi     (1)

Obviously, any division Q = A/B can be computed by multiplying A with 1/B. For division on the ISA we assume B to be in the range between 0.5 and 1. If this is not the case, B is initially shifted in the ISA such that the most significant one is the MSB of the rightmost word of B. In order to understand the choice of intermediate operand lengths in our division method it is useful to consider the behaviour of the convergence in the Newton-Raphson algorithm. Let xi differ from the precise value of 1/B by some ε. Then the following holds for xi+1:

xi+1 = (2 − B · xi) · xi = (2 − B · (1/B − ε)) · (1/B − ε) = (1 + B · ε) · (1/B − ε) = 1/B − B · ε²     (2)

Since B < 1, this means that the precision of xi+1 (i.e. the number of correct leading digits) is at least double the precision of xi. This means that we can restrict the operand lengths to 2^i bits for B and xi-1 while computing the i-th iteration xi. E.g. the first four iterations can be performed in one processor, since the required precision of the operands is at most 16 bits. Only the last iteration step needs full precision of m bits. The subtractions and multiplications are performed efficiently on the ISA as explained above.
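The quadratic convergence is easy to observe with a small floating-point sketch of iteration (1) (for illustration only; the ISA implementation works on long fixed-point operands whose length doubles each step):

    #include <stdio.h>

    int main(void)
    {
        double B = 0.75;                        /* any value with 0.5 < B < 1 */
        double x = 1.0;                         /* x0 := 1 */
        int i;
        for (i = 1; i <= 6; i++) {
            x = (2.0 - B * x) * x;              /* xi+1 := (2 - B*xi) * xi */
            printf("iteration %d: error %.3e\n", i, 1.0 / B - x);
        }
        return 0;
    }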
7 Parallel Implementation of RSA Encryption
For obtaining a high-speed implementation of RSA encryption a fast modular exponentiation (M^e mod N) is necessary. To achieve a sufficient degree of security the operand-lengths have to be relatively large (currently ranging from
512 up to 2048 bits). Modular exponentiation can be accomplished by iterating modular multiplications using the square-and-multiply algorithm [2]. The difficulty in implementing an efficient modular multiplication (A·B mod N) is the modular reduction step. In the case of RSA encryption the modulus N is known in advance of each modular multiplication. Thus, the precomputation of 1/N requires only a negligible amount of work. Now, the division x div N can be replaced by the more efficient multiplication x · (1/N). Of course, we cannot produce 1/N precisely. We instead produce an approximation of 1/N such that multiplying x by it is close enough to x div N; i.e. the result needs only a small correction by at most 4 subtractions of N. Thus, the modular multiplication for RSA can be implemented efficiently on the ISA by three multiplications and at most four subtractions. One single n-bit modular multiplication for n = 512/1024 requires 581/1709 instructions in one processor-row of Systola 1024.
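For small word sizes the square-and-multiply scheme can be sketched as below (illustration only: the operands are assumed already reduced and small enough that the products fit into 64 bits, and the % operator stands in for the multiply-by-approximate-1/N reduction used on the ISA):

    #include <stdint.h>

    /* computes M^e mod N, scanning the exponent from the most significant bit;
     * M and N must be below 2^32 so that the intermediate products do not overflow */
    uint64_t mod_exp(uint64_t M, uint64_t e, uint64_t N)
    {
        uint64_t result = 1;
        int bit;
        for (bit = 63; bit >= 0; bit--) {
            result = (result * result) % N;      /* square for every exponent bit */
            if ((e >> bit) & 1)
                result = (result * M) % N;       /* multiply when the bit is set */
        }
        return result;
    }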
8 Performance Evaluation and Conclusions
The performance of the parallel implementation is compared with an optimized sequential implementation on a PC (see Table 2). The current ISA prototype Systola 1024 is a machine based on technology far from being state-of-the-art. Extrapolating to technology used for processors such as the Pentium II 266 MHz would lead to a speedup of the ISA by a factor of at least 80, thus resulting in encryption/decryption speeds of more than 500 KBit/s for 1024-bit RSA (without taking advantage of the Chinese Remainder Theorem). Such performance can be expected from the already announced next generation Systola board Systola 4096.

Table 2. Performance of RSA encryption

                                            Systola 1024   Pentium II 266   Speedup
    512-bit RSA with full length exponent   128 KBit/s     18 KBit/s        7
    1024-bit RSA with full length exponent  56 KBit/s      6 KBit/s         9

We have presented ideas for fast implementations of addition, subtraction, multiplication and division on operands that are too long to be handled within one processor. We take advantage of the ability of the ISA to implement accumulation operations extremely efficiently. We have used the results for finding a high-speed instruction systolic implementation of RSA encryption and decryption. This leads to efficient software solutions with respect to performance and hardware cost. We also used the long operand arithmetic routines for the implementation of a prime number generator for RSA keys on Systola 1024 [1].
References
1. Hahnel, T.: The Rabin-Miller Prime Number Test on Systola 1024 on the Background of Cryptography. Master Thesis, University of Karlsruhe (1998)
2. Knuth, D.E.: The Art of Computer Programming: Seminumerical Algorithms. Volume 2, Reading, Addison-Wesley, second edition (1981)
3. Kunde, M., et al.: The Instruction Systolic Array and its Relation to other Models of Parallel Computers. Parallel Computing 7 (1988) 25-39
4. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public key cryptosystems. Comm. of the ACM 21 (1978) 120-126
5. Schmidt, B., Schimmler, M., Schröder, H.: Morphological Hough Transform on the Instruction Systolic Array. Euro-Par'97, LNCS 1300, Springer (1997) 798-806
Hardware Cache Optimization for Parallel Multimedia Applications

C. Kulkarni 1, F. Catthoor 1,2 and H. De Man 1,2

1 IMEC, Kapeldreef 75, B-3001 Leuven, Belgium
2 Professor at the Katholieke Universiteit Leuven
Abstract. In this paper, we present a methodology to improve hardware cache utilization by program transformations so as to achieve lower power requirements for real-time multimedia applications. Our methodology is targeted towards embedded parallel multimedia and DSP processors. This methodology takes into account many program parameters like the locality of data, size of data structures, access structures of large array variables, regularity of loop nests and the size and type of cache, with the objective of improving cache performance for lower power. Experiments on real life demonstrators illustrate the fact that our methodology is able to achieve significant gains in power requirements while meeting all other system constraints. We also present some results about software controlled caches and give a comparison between both types of caches and an insight into where the largest gains lie.
1 Introduction
Parallel machines were mainly, if not exclusively, being used in scientific communities until recently. Lately, the rapid growth of real-time multimedia processing (RMP) applications has brought new challenges in terms of the required processing (computing) power, power requirements and a host of other issues. For these types of applications, especially video, the processing power of traditional uniprocessors is no longer sufficient, which has led to the introduction of small- and medium-scale parallelism in this field too, but then mostly oriented towards single chip systems for cost reasons. Today, many parallel video and multimedia processors are emerging ([20] and its references), increasing the importance of parallelization techniques. However, the cost functions to be used in these new application fields are no longer purely performance based. Power is also a crucial factor, and has to be optimized for a given throughput. RMP applications are usually memory intensive and most of the power consumption is due to the memory accesses for data transfers, i.e. in the memory hierarchy [22]. Power management and reduction is becoming a major issue in such applications [3]. Many of the current algorithm standards (including the ones which are still under development like MPEG-x and object-oriented coders [2]) are based on a hierarchy of subsystems. The main data-dependencies and irregularities are situated in the higher layers. However, from there several submodules are called,
such as motion estimation/compensation, wavelet schemes, DCT etc. These lend themselves to extensive compile-time analysis enabling more efficient use of caches 1, as will be shown in the paper. This is in contrast with much of the previous work in a more "general" application context. The performance of these systems is related to the critical path in the algorithm (requiring only local analysis, as shown with the bold line in figure 1 where segments represent loop nests), whereas power requirements are related to the global data flow over conditional and/or parallel paths (as shown with the other lines in figure 1). Also this is a significant difference with the previous work (see below). These observations are the motivations for us to come up with an optimization strategy to utilize especially the cache hierarchy efficiently for low power.
Fig. 1. Illustration of the difference between speed oriented critical path optimization which operates on the worst-case path in the global condition tree (with mutually exclusive branches) and power optimization which acts on the global flow. Our target architecture is based on (parallel) programmable video or multimedia processors (TMS320C6x, TriMedia, etc.) as this is a current trend to use in RMP applications [7] (see section 2). Most of the work related to efficient utilization of caches has been directed towards optimization of the throughput by means of (local) loop nest transformations to improve locality [1, 15] and loop blocking transformation to improve cache utilization [14]. Some work [17] has been reported on the data organization for improved cache performance in embedded processors but also they do not take into account a power oriented model. Instead they try to reduce cache conflicts in order to increase the throughput. Similar work regarding reduction of cache conflicts and improvement of register allocation has been reported in [10, 15]. Architectural techniques to reduce the overhead in cache related power are available in [23]. Recently a study has been presented in [8] which reports on the effect with respect to power requirements on the type of cache used in an architecture. But this work also does not address the basic issue of improving the cache utilization by means of program transformations. Program transformations for reducing memory related power and area overhead are discussed in [5, 6]. These transformations are part of our ATOMIUM [16] global data transfer and storage oriented approach and have focussed upto now on uni-processors and customized memory organizations. Program transfor1 in this paper unless specifically mentioned cache generally means a hardware controlled cache
925 mations for improving cache usage for software controlled caches are discussed in [12]. Some other complementary aspects of our work on system level m e m o r y m a n a g e m e n t for parallel processor m a p p i n g have been discussed in [4]. The paper is organised as follows. In section 2, the correlation between cache and power is discussed. Section 3 gives a formalized methodology to explore system level optimization for efficient cache usage. This is followed by discussion of two real life demonstrators and their results in section 4. Section 5 presents conclusions from the above study and proposes some future directions.
2 Power Models for Hardware Controlled Caches
The relation between the extent of d a t a caching and power is explored in this section. Consider the simple but representative architecture model that we are using in this paper as shown in figure 2 (extensions to this model are feasible e.g. as in [12]). The power model used in our work comprises a power function which is dependent upon access frequency, size of memory, number of read/write ports and the number of bits that can be accessed in every (data) fetch operation.
(Figure 2 annotations: the PEs can be programmed individually; distributed shared memory architecture; each cache is dual-ported, fully associative and write-back.)
Fig. 2. The target architecture used, where each processing element (PE) has its own cache hierarchy (C~) without prefetching or branch prediction. Let P be the total power, N be the total number of accesses to a m e m o r y (can be SRAM or D R A M or any other type of memory) 2 and S be the size of the memory. The total power is then given by:
P = N x F(S)    (1)
where F is a polynomial function with the different coefficients representing a vendor specific energy model per access. F is completely technology dependent. For an architecture with no cache at all, i.e. only a CPU and (off-chip) main memory, power can be approximated as below:
P_total ~ P_main = N_main x F(S_main)    (2)
2 Power consumption varies for different types of SRAM or DRAM depending on the individual power models (vendor and architecture specific)
Let us introduce a level one cache as in figure 2; then we observe that the total number of accesses is distributed into a number of cache accesses (N_cache) and a number of main memory accesses (N_main). The total power is now given by:
P_total = P_main + P_cache    (3)
P_main = (N_mc + N_main) x F(S_main)    (4)
P_cache = (N_mc + N_cache) x F(S_cache)    (5)
Here the term N_mc represents the amount of traffic between the two levels of memory. It is a useful indicator of the power overhead due to various characteristics like block size, associativity, replacement policy etc. for an architecture with hierarchy. Let the factor of gain for an on-chip cache access compared to that of an off-chip main memory access be α. Here α is technology dependent. Then we have:
F(S_cache) = α x F(S_main), where α < 1.    (6)
The constants in this formula are instantiated for a proprietary vendor model. In the literature several more detailed cache power models have been reported recently (e.g. [11]), but we do not need these because we only require relative comparisons.
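To make the relative comparison concrete, the following C fragment evaluates the model for a single processing element with and without a level-one cache. It is only a sketch: the polynomial coefficients of F, the value of α, the access counts and the hit rate are assumed placeholder values, not figures from the proprietary vendor model used in the paper.

#include <stdio.h>

/* Hypothetical instantiation of the power model of this section: F(S) is a
 * polynomial in the memory size S and alpha scales it for the on-chip level
 * (equation (6)).  All coefficients, sizes and counts are illustrative. */
static double F(double size_kbytes) {
    return 0.5 + 0.01 * size_kbytes + 0.0001 * size_kbytes * size_kbytes;
}

int main(void) {
    double S_main = 4096.0;            /* main memory size in Kbytes (assumed)   */
    double alpha  = 0.1;               /* on-chip vs. off-chip factor, alpha < 1 */
    double N      = 1.0e6;             /* total accesses issued by the processor */
    double hit    = 0.9;               /* assumed cache hit rate                 */

    double N_cache = hit * N;          /* accesses served by the level-one cache */
    double N_mc    = (1.0 - hit) * N;  /* traffic between the two memory levels  */
    double N_main  = 0.0;              /* direct accesses bypassing the cache    */

    double P_flat  = N * F(S_main);                           /* equation (2) */
    double P_main  = (N_mc + N_main) * F(S_main);             /* equation (4) */
    double P_cache = (N_mc + N_cache) * (alpha * F(S_main));  /* (5) with (6) */
    double P_total = P_main + P_cache;                        /* equation (3) */

    printf("no cache      : %g\n", P_flat);
    printf("with L1 cache : %g (ratio %g)\n", P_total, P_total / P_flat);
    return 0;
}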
3 Methodology
The steps comprising our methodology in this work are shown in figure 3. We assume that all the parallelization related steps (i.e d a t a a n d / o r task level partitioning) are done before applying this methodology. We only focus on the cache related issues here since other issues in our work have been addressed in [4, 5].
(Figure 3 flow: Original Source Code -> Data Flow and Locality Analysis and High-Level In-Place Estimation -> Global Data and Control Flow Transformations -> Data Re-use Decisions -> Advanced In-Place Mapping -> To Native C Compiler)
Fig. 3. Proposed methodology The first step, D a t a Flow and Locality Analysis, is necessary to identify the parts and d a t a structures in the program where there is the most potential gain or where the performance bottlenecks are located. Next, a high-level estimation is applied of the potential in-place optimization for d a t a [6]. In step 3, Loop
and Control flow transformations are applied [4, 5], but over a much more global scope than typically done in compilers [24, 13], so as to improve the regularity and locality for power reduction. The constraints are the given cycle budget, the code size limitations and the data dependencies. In the Data Re-use Decisions step, different data re-use trees [18] are explored and the best hierarchy mapping (from the power point of view) is obtained for a particular application. Finally, during Advanced In-place Mapping and Caching, both inter-signal and detailed intra-signal in-place mapping [6] are applied to obtain better cache utilization.
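As a simple illustration of the kind of global data-flow transformation meant here (not taken from the QSDPCM or voice coder code), the following C fragment shows how merging a producer and a consumer loop removes a large intermediate signal, so that the transferred value stays in the cache or even in a register:

/* Before: the full intermediate array diff[] is produced by one loop nest and
 * consumed by another, so every element travels through the memory hierarchy. */
void before(const int a[], const int b[], int out[], int n) {
    static int diff[1024];                    /* assumes n <= 1024 */
    for (int i = 0; i < n; i++) diff[i] = a[i] - b[i];
    for (int i = 0; i < n; i++) out[i] = diff[i] * diff[i];
}

/* After: merging the loops removes the intermediate signal; each value stays
 * local between its production and its consumption. */
void after(const int a[], const int b[], int out[], int n) {
    for (int i = 0; i < n; i++) {
        int d = a[i] - b[i];
        out[i] = d * d;
    }
}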
4 Experimental Results
In this section we illustrate the validity and effectiveness of the methodology on two real-life applications, namely a Quadtree Structured DPCM video compression algorithm and a voice coder algorithm, using a DLX machine simulator and the Dinero III cache simulator [9]. It is important to emphasize that the results of the hardware controlled cache for both test vehicles are based on relatively small test data (see footnote 3) due to the small address space available in the DLX simulator. But we also observe that relative gains for larger frame sizes will be almost similar for hardware controlled caches. Hence comparing relative gains for hardware and software controlled caches is justified.
4.1 QSDPCM Application
The Quadtree Structured Difference Pulse Code Modulation (QSDPCM) [21] technique is an interframe adaptive coding technique for video signals. The algorithm consists of two basic parts: the first part is motion estimation and the second is a wavelet-like quadtree mean decomposition of the motion compensated frame-to-frame difference signal. A global view of the QSDPCM coding algorithm is given in figure 4(a). It is worth noting that the algorithm spans 20 pages of complex C code. It should be stressed that the functional partitioning in the QSDPCM algorithm was not based on ad hoc algorithmic issues but based on our methodology proposed in [4] where the best possible one was selected for power. As motivated above, a small cache size of 64 words is chosen for all the functions of the QSDPCM algorithm throughout this paper (i.e. in figure 2, C 1 = C 2 = . . . . Cn=64 words). The results of applying the above methodology for a hardware controlled cache are given in table 1. Here we give the power requirement of the most memory intensive parts of the QSDPCM algorithm for different levels of optimization. We observe that the initial specification consumes a relatively large amount of power. The gains due to the different steps of the methodology are quite clear 3 This means that e.g. for QSDPCM the frame size for software controlled cache was the real frame of 288 x 528, whereas for hardware controlled caching this was scaled down to 48 • 64
(Figure 4(b) legend and axis labels: initial (loop blocked) case vs. globally transformed case; functions SS24, V4, V2, V1, QC, RECO, UPSA.)
Fig. 4. (a) The QSDPCM algorithm (b) Gain in number of misses between the globally optimized version and the initial loop blocked version for the QSDPCM algorithm
from the columns 3, 4 and 5. The functions V4 and V2 have almost similar gains due to the different steps in the methodology. This indicates that apart from gains due to improving locality by means of global transformations and data re-use (column 3), there are equal gains due to only applying the aggressive in-place mapping and caching step (column 4). The various sources of power consumption are described below. In general we observe that the gain in power for the above methodology is quite significant for hardware controlled caches.
Table 1. Normalized power requirements of the QSDPCM algorithm for hardware controlled cache
Function     | Initial (no transf.) | Modified (globally transformed) | Initial, loop blocked and cached | In-placed caching on globally transformed
Subsamp2/4   | 0.57 | 0.0282 | 0.218  | 0.117
V4           | 0.89 | 0.0753 | 0.398  | 0.344
V2           | 1.00 | 0.0721 | 0.381  | 0.334
V1           | 1.51 | 0.1161 | 0.452  | 0.597
QuadConstr.  | 0.28 | 0.073  | 0.0230 | 0.140
Table 2 gives the power consumption for the various functions of QSDPCM algorithm for a software controlled cache. Note that initially the major power consumption is in functions V4,V2,V1 and QuadConstr. These are due to multiple reads from the frame memories as well as large number of intermediate signals. After the global transformations, the power consumption is reduced close to the absolute minimum for the given algorithm. All other sources of power consumption are reduced (eliminated) to a negligible amount. We observe a gain (on average) of a factor 9 globally. This illustrates the huge impact of global transformations combined with more aggressive cache exploitation at compile time, even on a fully predefined processor architecture. Figure 4(b) shows the number of misses for each function in the Q S D P C M algorithm for initial and optimized versions. The initial version has been optimized for caching with loop blocking, whereas the entire methodology presented
Table 2. Normalized power requirements of the QSDPCM algorithm for software controlled cache
Function     | Initial (no transf.) | Modified (globally transformed) | Initial, in-placed locally and cached | In-placed caching on globally transformed
Subsamp2/4   | 0.21 | 0.092 | 0.095 | 0.0359
V4           | 0.75 | 0.002 | 0.374 | 0.0010
V2           | 1.90 | 0.005 | 0.492 | 0.0024
V1           | 3.29 | 0.208 | 0.644 | 0.0836
QuadConstr.  | 0.71 | 0.18  | 0.149 | 0.074
in section 3 is applied for the optimized version. We observe a gain of (on average) 26%. This shows that our methodology is not only beneficial for power but also for performance (speed). For the software controlled case, an even larger gain of about 40% was measured however. It is quite clear that the gain due to a software controlled cache is relatively large compared to that of the hardware controlled cache. The reasons for this lie in the two major parameters that govern a hardware controlled cache. First, if the updating policy of the cache is not taken into account in the code at compile time, a large amount of power is consumed in the write-backs (copy-backs). So a write-through policy consumes more power than a write-back one. The second issue is the block size. In general, increasing the block size reduces the miss rate. But it also increases cache pollution (see footnote 4) in case of algorithms with less than average spatial locality. It can be concluded that more software control over the cache has a very positive impact on power as well as on speed. This is an important consideration for the cache designers of (embedded) multi-media processors.
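The write-policy argument can be made concrete with a first-order traffic count. The following C sketch compares the main-memory words moved under a write-through and a write-back policy; all access counts are assumed example values and the model ignores second-order effects such as write buffers:

#include <stdio.h>

int main(void) {
    long reads  = 800000, writes = 200000;   /* assumed access mix               */
    long misses = 50000;                     /* assumed cache misses             */
    long dirty_evictions = 30000;            /* assumed dirty blocks written back */
    long block_words = 8;                    /* words moved per block refill      */

    /* write-through: every store also goes to main memory */
    long wt_traffic = misses * block_words + writes;
    /* write-back: only evicted dirty blocks go to main memory */
    long wb_traffic = misses * block_words + dirty_evictions * block_words;

    printf("PE accesses (reads+writes)     : %ld\n", reads + writes);
    printf("write-through main-memory words: %ld\n", wt_traffic);
    printf("write-back    main-memory words: %ld\n", wb_traffic);
    return 0;
}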
4.2 Voice Coder Application
We have used a Linear Predictive Coding (LPC) vocoder which provides an analysis-synthesis approach for speech signals [19]. In our algorithm, speech characteristics (energy, transfer function and pitch) are extracted from a frame of 240 consecutive samples of speech. Consecutive frames share 60 overlapping samples. All information to reproduce the speech signal with an acceptable quality is coded in 54 bits. The general characteristics of the algorithm are presented in figure 5(a). It is worth noting that the steps LPC analysis and Pitch detection use autocorrelation to obtain the best matching predictor values and pitch (using a Hamming window). The autocorrelation function has irregular access structures due to algebraic manipulations, and in-place mapping performs better on irregular access structures than just loop blocking [12]. Figure 5(b) shows the power requirements of the initial version (with loop blocking) and the in-place mapped version. Here only the in-place mapping and 4 Cache pollution is a phenomenon wherein data which are not accessed (or demanded) are brought in due to larger block sizes
(Figure 5 block diagram and chart labels: Speech at 128 Kbps -> Pre-emphasis and High Pass filter -> Pitch Detection / LPC Analysis -> Vector Quantization -> Coding -> Coded Speech at 800 bps; legend: original (loop blocked) code vs. in-placed code; x-axis: block size (bytes).)
Fig. 5. (a) The different parts of the voice coder algorithm and the associated global dependencies (b)Gain in terms of energy for a voice coder autocorrelation function for different block sizes on DLX simulator
caching steps of section 3 are used, so as to demonstrate that for irregular access structures it performs much better from the point of view of cache utilization than just loop blocking. We have performed these experiments for varying block sizes and have observed that our approach indeed performs much better for all the block sizes. A gain of a factor 2 is observed.
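For reference, a minimal C sketch of the autocorrelation kernel discussed above is given below (240-sample frame as in the text, Hamming window, prediction order passed by the caller). The triangular index pattern of the inner loop illustrates the kind of irregular access structure for which in-place mapping pays off; the kernel itself is a textbook formulation, not an excerpt of the vocoder code:

#include <math.h>

#define FRAME 240

void autocorr(const double x[FRAME], double r[], int order) {
    const double pi = 3.14159265358979323846;
    double w[FRAME];
    /* apply a Hamming window to the speech frame */
    for (int n = 0; n < FRAME; n++)
        w[n] = x[n] * (0.54 - 0.46 * cos(2.0 * pi * n / (FRAME - 1)));
    /* autocorrelation lags 0..order; note the shrinking inner range */
    for (int k = 0; k <= order; k++) {
        double acc = 0.0;
        for (int n = k; n < FRAME; n++)
            acc += w[n] * w[n - k];
        r[k] = acc;
    }
}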
5 Conclusion
T h e main contributions of this paper are: (1) a formalized methodology for power-oriented source-level transformations to improve cache utilization in a hardware controlled cache context (2) the effectiveness of this methodology is demonstrated on two real-life test vehicles: the Q S D P C M video compression algorithm and a voice coder (3) insight is provided into the differences in power and speed gains between hardware and software controlled caches, with consequences for low-power cache design in embedded multi-media applications.
References 1. J.Anderson, S.Amarasinghe, M.Lam, "Data and computation transformations for multiprocessors", in 5th ACM S I G P L A N Symposium on Principles and Practice of Parallel Programming, pp.39-50, August 1995. 2. V.Bhaskaran and K.Konstantinides, "Image and video compression standards: Algorithms and Architectures", Kluwer Academic publishers, Boston, MA, 1995. 3. R.W.Broderson, "The network computer and its future", Proc. [EEE Int. SolidState Circuits Conf.,San Francisco, CA, pp.32-36, Feb.1997. 4. K.Danckaert, F.Catthoor and H.De Man, "System-level memory management for weakly parallel image processing", In proe. EUROPAR-96, Lecture notes in computer science series, vol. 1124, Lyon, Aug 1996.
931 5. E.De Greef, F.Catthoor, H.De Man, "Program transformation strategies for reduced power and memory size in pseudo-regular multimedia applications", accepted for publication in IEEE Trans. on Circuits and Systems for Video Technology, 1998. 6. E.De Greef, F.Catthoor, H.De Man, "Memory Size Reduction through Storage Order Optimization for Embedded Parallel Multimedia Applications", Intnl. Parallel Proc. Symp.(IPPS)in Proc. Workshop on "Parallel Processing and Multimedia", Geneva, Switzerland, pp.84-98, April 1997. 7. T.Halfhill and J.Montgomery, "Chip fashion: multi-media chips", Byte Magazine, pp.171-178, Nov 1995. 8. P.Hicks, M.Walnock and R.M.Owens, "Analysis of power consumption in memory hierarchies", In Proc. Int'l symposium on low power electronics and design, pp.239242, Monterey, CA, Aug 1997. 9. Marck Hill, "Dinero III cache simulator", Online document available via http ://www. cs. wisc. edu/ markhill, 1989.
10. M.Jimenez, J.M.Liaberia, A.Fernandez and E.Morancho, "A unified transformation technique for multilevel blocking", In proc. EUROPAR-96, Lecture notes in computer science series, vol. 1124, Lyon, Aug 1996. 11. Milind Kamble and Kanad Ghose, "Analytical Energy Dissipation Models for Low Power Caches", /n Proc. Int'l symposium on low power electronics and design, pp.143-148, Monterey, Ca., Aug 1997. 12. C.Kulkarni,F.Catthoor and H.De Man, "Code transformations for low power caching in embedded multimedia processors", Proc. of Intnl. Parallel Processing Symposium (IPPS), pp.292-297,Orlando,FL,April 1998. 13. D.Kulkarni and M.Stumm, "Linear loop transformations in optimizing compilers for parallel machines", The Australian computer journal, pp.41-50, May 1995. 14. M.Lam, E.Rothberg and M.Wolf, " The cache performance and optimizations of blocked algorithms", In Proc. ASPLOS-IV, pp.63-74, Santa Clara, CA, 1991. 15. N.Manjikian and T.Abdelrahman, "Array data layout for reduction of cache conflicts", In Proc. of 8th Int'l conference on parallel and distributed computing systems, September, 1995. 16. L.Nachtergaele, F.Catthoor, F.Balasa, F.Franssen, E.De Greef, H.Samsom and H.De Man, "Optimisation of memory organisation and hierarchy for decreased size and power in video and image processing systems", Proc. Intnl. Workshop on Memory Technology, Design and Testing, San Jose CA, pp.82-87, Aug. 1995. 17. P.R.Panda, N.D.Dutt and A.Nicolau, " Memory data organization for improved cache performance in embedded processor applications", In Proc. ISSS-96, pp.9095, La Jolla, CA, Nov 1996. 18. J.P. Diguet, S. Wuytack, F.Catthoor and H.De Man, "Formalized methodology for data reuse exploration in hierarchical memory mappings", In Proc. Int'l symposium on low power electronics and design, pp.30-36, Monterey, CA, Aug 1997. 19. L.R.Rabiner and R.W.Schafer, "Digital signal processing of speech signals", Prentice hall Int'l Inc., Englewood cliffs, N J, 1988. 20. I<.Ronner, J.Kneip and P.Pirsch, "A highly parallel single chip video processor", Proc. of the 3rd International Workshop, Algorithms and Parallel VLSI architectures III, Elsevier, 1995. 21. P.Strobach, "QSDPCM - A New Technique in Scene Adaptive Coding," Proc. 4th Eur. Signal Processing Conf., EUSIPCO-88, Grenoble, France, Elsevier Publ., Amsterdam, pp.1141-1144, Sep. 1988. 22. V.Tiwari, S.Malik and A.Wolfe, "Instruction level power analysis and optimization of software", Journal of VLSI signal processing systems, vol. 13, pp.223-238, 1996.
932 23. Uming Ko, P.T.Balsara and Ashmini K.Nanda, "Energy optimization of multi-level processor cache architectures", In Proc. Int'l symposium on low power electronics and design, pp.45-49, Dana Point, CA, 1995. 24. M.E.Woff and M.Lam, "A loop transformation theory and an algorithm to maximize parallelism", I E E E Trans. on parallel and distributed systems, pp.452-471, Oct 1991.
Parallel Solutions of Simple Indexed Recurrence Equations
Yosi Ben-Asher 1 and Gady Haber 2
1 Dep. of Math. and CS., Haifa University, 31905 Haifa, Israel, [email protected]
2 IBM Science and Technology, Haifa, Israel, [email protected]
Abstract. We consider a type of recurrence equations called "Simple Indexed Recurrences" (SIR) wherein ordinary recurrences of the form X[i] = op_i(X[i-1], X[i]) (i = 1...n) are extended to X[g(i)] = op_i(X[f(i)], X[g(i)]), such that op_i is an associative binary operation, f, g : {1...n} -> {1...m} and g is distinct.(1) This extends our capabilities for parallelizing loops of the form:
for i = 1 to n { X[i] = op_i(X[i-1], X[i]) }
to the form:
for i = 1 to n { X[g(i)] = op_i(X[f(i)], X[g(i)]) }.
An efficient solution is presented for the special case where we know how to compute the inverse of the op_i operator. The algorithm requires O(log n) steps with O(n/log n) processors. Furthermore, we present a practical and more improved version of the non-optimal algorithm for SIR presented in [1], which uses repeated iterations of pointer jumping. A sequence of experiments was performed to test the effect of synchronous and asynchronous message-passing executions of the algorithm for p << n processors. This algorithm computes the final values of X[] in (n/p) * log p steps and n * log p work, with p processors. The experiments show that pointer jumping requires O(n) work in most practical cases of SIR loops, thus forming a more practical solution.
1 Introduction
O r d i n a r y recurrence e q u a t i o n s of t h e f o r m X~ = o p i ( X i - l , X i ) i = 1 . . . n can be generalized to w h a t we refer to as Indexed Recurrence (IR) e q u a t i o n s . In I R e q u a t i o n s , g e n e r a l i n d e x i n g functions of the f o r m Xg(o = opi(X/(i),Xg(i)) r e p l a c e t h e i, i - 1 i n d e x i n g of t h e o r d i n a r y recurrences. Efficient p a r a l l e l s o l u t i o n s to such e q u a t i o n s can be used to p a r a l l e l i z e s e q u e n t i a l l o o p s of t h e f o r m : for i = 1 to n { X[g(i)] = opi(X[f(i)],X[g(i)]) } if t h e following c o n d i t i o n s are m e t :
1. opi(x, y) is a n y s e g m e n t of code t h a t is equivalent to a b i n a r y a s s o c i a t i v e o p e r a t o r a n d m a y b e d e p e n d e n t on i. 2. t h e index f u n c t i o n s f , g : {1..n} ~ {1..m} do n o t include references to e l e m e n t s of t h e X ~ array. 1 This paper is a continuation of the work on IR equations presented in [1].
3. the index function g is distinct, i.e., for every pair of indexes i, j: if i != j then g(i) != g(j).
In practice, such solutions can be used to parallelize sequential loops within given source code segments.

2 An Efficient SIR Algorithm for the Case Where the Inverse of op_i is Known
for the
Case
Where
the
In this section we describe an efficient parallel algorithm for S I R loops, providing we know how to compute the inverse operation op{ 1. The algorithm requires O(logn) steps with O(n/logn) processors. The first step in the algorithm is to translate the S I R loop into its upside down tree representation T, in which every vertex i (1 < i < n) represents the g(i) index and every edge e : {i, j} represents the dependency opi(Xg(j), Xg(i) where g(j) = f(i) and j < i. For example consider the following upside-down tree representation of a S I R loop: X[l..n] initializ,4 by A[I]... A[n] X[l] := A[I] X[2I := X D ] " x[2l X[a] := X D ] " X[3]
X[4]
:= X[3]' X[4] X [ M := x[3l" X[~] X[6] := X[3]" X[6]
A S I R L o o p for i ~ 1 . . , 6
An Upside-down
Tree Representation
of t h e l o o p
We then apply the Euler Tour circuit on T in order to compute all the prefixes of the vertices in T. An Euler Tour in a graph is a path that traverses each edge exactly once and returns to its starting point. The Euler tour circuit which operates on a tree assumes that the tree is directed and that it is represented by an Adjacency List. In the directed tree version T_D of T, each undirected edge {i, j} in T has two directed copies - one for (i, j) and one for (j, i). For example, consider the directed version of the tree for the above example, and its Adjacency List representation:
1 : (1, 3) -> (1, 2) -> Nil
2 : (2, 1) -> Nil
3 : (3, 1) -> (3, 4) -> (3, 5) -> (3, 6) -> Nil
4 : (4, 3) -> Nil
5 : (5, 3) -> Nil
6 : (6, 3) -> Nil
T_D - The Directed Version of T        Adjacency List Representation of T_D
Our main concern when coming to construct the adjacency list of To, is finding the list of all the children of each vertex i in TD efficiently; i.e., finding all vertices j < i which satisfy that g(j) = f(i). Sorting all the pairs < f(1), g(1) > , . . . , < f(n), g(n) > using f(i) as the sorting key, will automatically group together all vertices which have the same parent. The sorting process can be done efficiently using the Integer Sort algorithm [3] which operates on keys in the range of [1..n]. After the adjacency list of To is formed, we can construct the Euler tour of TD in a single step according to the following rule:
EtourLink(v, u) = Next(u, v)   if Next(u, v) != Nil
                  AdjList(u)   otherwise
by allocating a processor to each pair of edges (v, u), (u, v) in T_D. Once the Euler tour of T_D is constructed, in the form of a linked list, we can compute the prefixes of all the vertices of T_D, using the efficient parallel prefix algorithm on the linked list [2], after initializing all the edges of the tree with the following weights:
- The weight of each forward edge (f(i), g(i)) in T_D is initialized by A_g(i), where A is the array of initial values of the X array of SIR.
- The weight of each backward edge (g(i), f(i)) is initialized by the inverse value A_g(i)^(-1).
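The weighting scheme can be checked on the running example with a small sequential program. The following C fragment walks the Euler tour edge by edge, multiplying by A[child] on forward edges and dividing on backward edges (the inverse operation), and prints the same prefixes as listed below; the numeric values of A and the explicit tour are assumptions for the example only, and a parallel implementation would compute the prefixes with list ranking instead of this sequential scan:

#include <stdio.h>

int main(void) {
    double A[7] = {0.0, 2.0, 3.0, 5.0, 7.0, 11.0, 13.0};   /* A[1..6], assumed values */
    /* Euler tour of T_D for the example tree, visiting child 2 first:
     * (1,2)(2,1)(1,3)(3,4)(4,3)(3,5)(5,3)(3,6)(6,3)(3,1) */
    int from[] = {1, 2, 1, 3, 4, 3, 5, 3, 6, 3};
    int to[]   = {2, 1, 3, 4, 3, 5, 3, 6, 3, 1};
    double prefix = A[1];                  /* the root vertex contributes A[1] */
    printf("X[1] = %g\n", prefix);
    for (int e = 0; e < 10; e++) {
        int u = from[e], v = to[e];
        if (u < v) {                       /* forward edge (parent,child): weight A[child]  */
            prefix *= A[v];
            printf("X[%d] = %g\n", v, prefix);
        } else {                           /* backward edge (child,parent): inverse weight  */
            prefix /= A[u];
        }
    }
    return 0;
}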
Euler Tour of T_D with initial weights - The Prefixes for each vertex:
X[1] = A[1]
X[2] = A[1] * A[2]
X[3] = A[1] * A[2] / A[2] * A[3]
X[4] = A[1] * A[2] / A[2] * A[3] * A[4]
X[5] = A[1] * A[2] / A[2] * A[3] * A[4] / A[4] * A[5]
X[6] = A[1] * A[2] / A[2] * A[3] * A[4] / A[4] * A[5] / A[5] * A[6]

3 Parallel Solution for SIR by Using Pointer Jumping
Consider for example the SIR problem A'[2i] := A'[i+1] * A'[2i] (for i = 1, 2..n), where g(i) = 2i and f(i) = i+1. The final value of each element A'[i] is a product of a varying number of items. We refer to this sequence of multiplications of every element i in A'[] as the trace of A'[i]. In general the trace of each element A'[g(i)] satisfies, for all i = 1...n:
A'[g(i)] = A[f(j_k)] (+) ... (+) A[f(j_1)] (+) A[g(i)]
such that:
- j_1 = i.
- for t = 2...k the indices j_t satisfy j_t < j_{t-1} and g(j_t) = f(j_{t-1}).
- j_k is the last index for which g(j_k) = f(j_{k-1}), i.e., there is no 1 <= j_{k+1} < j_k such that g(j_{k+1}) = f(j_k).
The above rule suggests a simple method for computing W[g(i)] in parallel. Let A-t[g(i)] denote the sub-trace with t + 1 rightmost elements in the trace of A'[g(i)], i.e., A-t[g(i)] = A[f(jk-t)] @ . . . | A[f(i)] G A[g(i)], and now consider the c o n c a t e n a t i o n (or m u l t i p l i c a t i o n ) of two "successive" sub-traces: A-(tl+t2)[g(i)] = A -tl [g(j)] | A -t2 [g(i)] where g(j) = f ( j k - t 2 ) and A[f(jk-t2)] is the last element in A -t2[g(i)]. The proposed algorithm (as shown in [1]) is a simple greedy algorithm t h a t keeps iterating until all traces are completed. In each iteration all possible concatenations of successive sub-traces are computed in parallel where the concatenation operation of sub-traces A -tt [g(j)], A -t~ [g(j)] can be implemented as follows: 1. The value of a sub-trace d-t[g(i)] is stored in its array element A[g(i)]. 2. A pointer Next[g(i)] points to the sub-trace A-t~[g(j)] to be concatenated to A-~2[g(i)] (to form A -(tl+t2) [g(i)]). Hence, A[Next[g(i)]] contains the value of the sub-trace A -tl [g(j)].
Following is the code of an improved and more practical version of the algorithm, for which we ran our experiments with p << n processors. The new algorithm improves the above algorithm in the following points:
- It works with a compressed array of size n (the number of loop iterations) instead of the original array size, which is of size m > n, by mapping every A'[g(i)] element to an auxiliary array X[i].
- The index function f(i) is transformed to a decreasing function which never refers to future indexes of X'. This reduces the total number of iterations required by the algorithm from an order of log n to log p, improving the execution time to (n/p) * log p. Suppose each processor computes n/p traces and uses a sequence of passes wherein it calculates the traces sequentially one after another. Since f() always points backwards to earlier elements in the input array, after each step i we already have the final values of the first 2^i elements, requiring a total of log p steps in order to complete the entire array of size n. The same argument, however, does not hold for the case where f() can refer to future elements in the array.

Input: Initialized arrays A[1..m], g[1..m], f[1..m], (+)   (where (+) is a binary associative operator)
Output: Array X[1..n] where X[i] will hold the value of A'[g[i]] at the end of the loop.

Initialize auxiliary arrays X[1..n], G[1..m], Next[1..n] to zero.
forall t in {1..n} do in parallel
    G[g[t]] := t;    (the auxiliary array G represents the inverse of g, i.e. G[t] = g^(-1)(t))
end parallel-for
forall i in {1..n} do in parallel
    u := G[f[i]];    (the iteration where A'[f(i)] was modified by the loop)
    if (u > i or u = 0) {    (the trace of A'[g(i)] is of size two)
        X[i] := A[f[i]] (+) A[g[i]];    (i.e. A'[g(i)] = A[f(i)] (+) A[g(i)])
        Next[i] := 0;
    }
    else {
        if (f[u] >= f[i] or f[u] = 0) {    (the trace of A'[g(i)] is of size three)
            X[i] := A[f[u]] (+) A[f[i]] (+) A[g[i]];
            Next[i] := 0;
        } else {    (the trace of A'[g(i)] is of size greater than three)
            X[i] := A[f[i]] (+) A[g[i]];
            Next[i] := G[f[u]];
        }
    }
end parallel-for    (initialization stage)

for t = 1 to log n do    (the algorithm performs, at most, log n iterations)
    forall i in {(2^(t-1) + 1)..n} do in parallel
        if (Next[i] > 0) then {    (apply the concatenation operation followed by the pointer updating)
            X[i] := X[Next[i]] (+) X[i];
            Next[i] := Next[Next[i]];
        }
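A sequential C simulation of the iteration phase may help in reading the listing; it assumes an integer "+" as the associative operator and example contents for X[] and Next[] after the initialization stage, and it simply sweeps all elements in every pass instead of the index window {2^(t-1)+1..n} used above:

#include <stdio.h>

#define N 8

int main(void) {
    /* assumed state after initialization: X[i] holds the last items of its
     * trace, Next[i] points to the sub-trace still to be concatenated
     * (0 means the trace is already complete). */
    long X[N + 1]    = {0, 5, 7, 3, 9, 4, 6, 2, 8};
    int  Next[N + 1] = {0, 0, 1, 2, 3, 4, 5, 6, 7};

    int active = 1;
    while (active) {
        active = 0;
        long newX[N + 1];
        int  newNext[N + 1];
        for (int i = 1; i <= N; i++) {            /* one synchronous parallel step */
            newX[i] = X[i];
            newNext[i] = Next[i];
            if (Next[i] > 0) {
                newX[i]    = X[Next[i]] + X[i];   /* concatenate sub-traces */
                newNext[i] = Next[Next[i]];       /* pointer jumping        */
                active = 1;
            }
        }
        for (int i = 1; i <= N; i++) { X[i] = newX[i]; Next[i] = newNext[i]; }
    }
    for (int i = 1; i <= N; i++) printf("X[%d] = %ld\n", i, X[i]);
    return 0;
}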
The code for the iteration phases

4 Experimental Results for Message Passing Sync and Async SIR
For the message-passing version of SIR we assume that the initial values A[1..rn] are partitioned between the processors, with each processor holding ~ elements.
937 Thus the first step is for each processor to fetch the O(~) elements of A[] that it will use during the initialization stage. Each processor will sort the requested elements according to their processor destination, and then fetch them using p "big-messages" (i.e., using message packaging). Two variants of this algorithm have been tested: S y n c SIR: where all processors compute one iteration of the main loop synchronously. A s y n e S I R ( k ) : where each processor performs 1 iterations of the main-loop before passing the control to the next processor. The number l is chosen randomly (every time) from the range 1 . . . k , where k is the parameter of the algorithm. Thus, by choosing different values of k we can change the amount of asynchronous deviation of the execution. We simulated different nmnbers of processors, p = 4, 8, 16, 32, 6"1,128,256,512, 1024 with n -- 500,000. The simulation of the algorithm was made on a sequential machine, since this allowed us an exact and simple measure of the total size of generated messages (communication), the work (in units of C instructions ) and the execution time (also in units of C instructions). All diagrams contain an artificial curve (e.g., f(p) -~ ~ . log n) for comparison. We first chose to test a SIR loop where g(i) = i and f(i) = i - 1, as this maximizes the length of each trace and consequently the expected work. For Sync SIR, we already know the results: execution time approximately ~ -logp, work n. logp, total of p. logp big messages and communication (total number of data items that have been sent) o f p . logp. The results of the execution time in fig. 1 verify this computation and show that increasing k cost of Async SIR(k) can reduce the speedups significantly. Another observation regarding Async SIR(k) is that for a relatively small number of processors the impact of k is much larger than with a large number of processors. According to fig. 1,the difference between the various Async SIR(k) results, becomes constant (independent on p) for values greater than 32. This is because of the relatively small number of A[]% elements that are allocated to each processor (.~ ~ p). The same results are obtained for the work of the algorithms as described in fig. 1. The communication and number of big messages is exactly p . logp, as expected. These experiments were repeated for a random setting f(i) = random(1...n). In general, we hoped to reach optimal performances of execution time ~ and work O(n). The execution time has improved, and sync S I R (see fig. 2) is now between ~ and ~ . logp. In addition, the effect of k in Async SIR(k) on the execution time is reduced, and for k = 1, 5, x/P we get an execution time that is less than ~- 9 logn. In particular, we can see that the work (fig. 2) behaves like O(n) rather than n logp. The communication (fig. 2) is below n. Unlike the case f(i) --- i, the effect of k in Async SIR(k) on the communication is negligible (all the results for k = 1, 5, v ~ are the same). In addition, asynchronous execution seems to improve the communication compared to syne SIR. An asynchronous execution might, for example, cause the middle processor to advance its Next[i] pointers before other processors started to work. This might reduce the communication that would have occured between
938
\
...............
<.......... ii: .....................
Fig. 1. Execution time and Work for f(~) : i - 1,
these processors if the middle processor hadn't advanced its pointers. The total number of messages was not affected by the random setting and (as is described in fig. 2) is around plogp.
Fig. 2. Execution time, Work, Comm. and big-messages for f ( i ) = random(1..n).
References 1. Y. Ben-Asher and G. H~ber. Parallel solutions of indexed recurrence equations. In In Proceedings of the IPPS'97 con/erence, Geneva, /997.
939 2. L. Rudolph Kruskal C. P. and M. Snir. The power of parallel prefix. I E E E Transactions on Computers, 34(C):965-968, 1985. 3. S. Rajasekaran, S. Sen. On parallel integer sorting. Technical Report To appear in ACTA INFORMATICA, Department of Computer Science, Duke University, 1987.
Scheduling Fork Graphs under LogP with an Unbounded Number of Processors
Iskander Kort and Denis Trystram
LMC-IMAG, BP 53, Domaine Universitaire, 38041 Grenoble Cedex 9, France
{kort, trystram}@imag.fr
A b s t r a c t This paper deals with the problem of scheduling a specific precedence task graph , namely the Fork graph, under the LogP model. LogP is a computational model more sophisticated than the usual ones which was introduced to be closer to actual machines. We present a scheduling algorithm for this kind of graphs. Our algorithm is optimal under some assumptions especially when the messages have the same size and when the gap is equal to the overhead.
1 Motivation
The last decade was characterized by the huge development of m a n y kinds of parallel computing systems. It is well-known today that a universal c o m p u t a tional model can not unify all these varieties. P R A M is probably the most important theoretical computational model. It was introduced for s h a r e d - m e m o r y parallel computers. The main drawback of P R A M is that it does not allow to take into account the communications through an interconnection network in a distributed-memory machine. Practical P R A M implementations have often bad performances. Many a t t e m p t s to define standard c o m p u t a t i o n a l models have been proposed. More realistic models such as BSP and LogP [3] appeared recently. They incorporate some critical parameters related to communications. The LogP model is getting more and more popular. A LogP machine is described by four parameters L,o,g and P. P a r a m e t e r L is the interconnection network latency. Parameter o is the overhead on processors due to local m a n a g e m e n t of communications. P a r a m e t e r g represents the m i n i m u m duration between two consecutive communication events of the same type. P a r a m e t e r P corresponds to the number of processors. Moreover, the model assumes t h a t at most [ ~ ] messages from (resp. to) a processor m a y be in transit at any time. We present in this paper an optimal scheduling algorithm under the LogP model for a specific task graph, namely the Fork graph. We assume t h a t the number of processors is unbounded. In the remainder of this section we give some definitions and notations concerning Fork graphs, then we briefly describe some related work. In Sect. 2, the scheduling algorithm is presented.
1.1 About the Fork Graph
A Fork graph is a tree of height one. It consists of a root task denoted by T_0 preceding n leaf tasks T_1, ..., T_n. The root sends, after its completion, a message to each leaf task. In the remainder of this paper, the symbol F will refer to a Fork graph with n leaves. In addition, we denote by w_i the execution time of task T_i, i in {0, ..., n}. The processors are denoted by P_i, i = 0, 1, .... We assume, without loss of generality, that task T_0 is always scheduled on processor P_0. Let S be a schedule of F; then df_i(S) denotes the completion time of processor P_i in S, where i in {0, 1, ...}. Furthermore, df(S) denotes the makespan (length of S) and df* the length of an optimal schedule of F. The contents of the messages sent by T_0 must be considered when dealing with scheduling under LogP. Indeed, assume that T_0 sends the same data to some tasks T_i and T_j. Then assigning these tasks to the same processor (say P_i, i != 0) saves a communication. Two extreme situations are generally considered. In the first one, it is assumed that T_0 sends the same data to all the leaves. This is called a common data semantics. In the second situation, the messages sent by T_0 are assumed to be pairwise different. This is called an independent data semantics [4].
1.2 Related Work
The problem of scheduling tree structures with communication delays and an unbounded number of processors has received much interest recently. The m a j o r part of the available results focused on the extended R a y w a r d - S m i t h model. Chr@tienne has proposed a polynomial-time algorithm for scheduling Fork graphs with arbitrary communication and c o m p u t a t i o n delays [1]. He has also showed that finding an optimal schedule for a tree with a height of at least two is NPHard [2]. More recently, some results about scheduling trees under LogP have been presented. In [6], Verriet has proved that finding optimal Fork schedules is N P - H a r d when a common data semantics is considered. In [5], the authors have presented a polynomial-time algorithm that determines optimal linear schedules for inverse trees under some assumptions.
2 An Optimal Scheduling Algorithm
We present in this section an optimal scheduling algorithm for F when the number of processors is unbounded (P > n). Furthermore, we assume that:
- The messages sent by the root have the same size.
- The gap is equal to the overhead: g = o. This is the case for systems where the communication software is a bottleneck.
- An independent data semantics is considered.
- w_i > w_{i+1} for all i in {1, ..., n-1}.
We start by presenting some dominant properties related to Fork schedules.
Lemma 1. Each of the following properties is dominant:
π1: Each leaf task T_i such that w_i <= o is assigned to P_0.
π2: Processor P_0 executes T_0 first, then it sends the messages to the tasks T_i which are assigned to other processors. These messages are sent according to decreasing values of w_i. Finally, P_0 executes the tasks that were assigned to it in an arbitrary order.
π3: Each processor P_i (i != 0) executes at most one task, as early as possible (just after receiving its message).
Every schedule that satisfies properties π1-π3 will be called a dominant schedule. Such schedules are completely determined when the subset A* of the leaves that should be assigned to P_0 is known. We propose in the sequel an algorithm (called CLUSTERFORK) that computes this subset. Initially, each task is assigned to a distinct processor. The algorithm manages two variables b and B which are a lower bound and an upper bound on df* respectively. More specifically, at any time we have b <= df* <= B. These bounds are refined as the algorithm proceeds. CLUSTERFORK explores the tasks from T_1 to T_n. For each task T_i, i in {1, ..., n}, one of the following situations may be encountered. If the completion time of T_i in the current schedule is not less than B, then T_i is assigned to P_0. If df_i is not greater than b, then the algorithm does not assign this task to P_0. Finally, if df_i is in the range ]b, B[, then the algorithm checks whether there is a schedule S of F such that df(S) < df_i. If such a schedule exists, then T_i is assigned to P_0 and B is set to df_i; otherwise T_i is not assigned to P_0 and b is set to df_i.
Theorem 1. Let A* be a subset produced by CLUSTERFORK; then any dominant schedule S* associated with A* is optimal.
Finally, it is easy to see that CLUSTERFORK has a computational complexity of O(n^2).
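For intuition, the initial completion times used by CLUSTERFORK (see the comment at the top of Algorithm 1) follow directly from the LogP parameters under the assumptions above; the decomposition below is a reading of that formula, not an additional result:

% Completion time of leaf T_i when it is the i-th message sent by the root
% (dominant schedule, g = o, one leaf per extra processor):
\[
  df_i \;=\; \underbrace{w_0}_{\text{root}}
        \;+\; \underbrace{i\,o}_{\text{send overheads/gaps}}
        \;+\; L
        \;+\; \underbrace{o}_{\text{receive overhead}}
        \;+\; w_i
        \;=\; w_0 + (i+1)\,o + L + w_i .
\]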
3 Concluding Remarks
In this paper we presented an optimal polynomial time scheduling algorithm for Fork graphs under the LogP model and using an unbounded number of processors. This problem was solved under some assumptions which hold for a current parallel machine namely the IBM-SP. We remark t h a t a slight modification in the problem parameters (for instance when the messages are the same) leads to an NP-hardness result. Indeed, in this latter case, scheduling at most one task on each processor Pi, i # 0 is no more a dominant property and gathering some tasks on the same processor m a y lead to better schedules.
References 1. P. Chrfitienne. Task Scheduling over Distributed Memory Machines. In North Holland, editor, International Workshop on Parallel and Distributed Algorithms, 1989.
Algorithm 1 CLUSTERFORK
Begin
  {Initially: df_i = w_0 + (i+1)*o + L + w_i, for all 1 <= i <= n}
  b := 0;            {lower bound on df*}
  B := MAXVAL;       {upper bound on df*}
  for (i := 1 to n) do
    if (df_i >= B) then
      x_i := 1;      {T_i is assigned to P_0}
      UPDATE_DF(i);
    else if (df_i > b) then
      {Look for a schedule such that df < df_i}
      j := i; k := 0; v := df_0;
      while (j <= n) do
        if (df_j - k*o > df_i) then
          {The assignment of T_j to P_0 is necessary}
          if (v + w_j - o > df_i) then
            j := n + 1;
          else
            v := v + w_j - o; k := k + 1;
        j := j + 1;
      if (j = n + 2) then
        {The looked-for schedule does not exist}
        x_i := 0; b := df_i;
      else
        {df* <= df_i}
        x_i := 1; B := df_i;
        UPDATE_DF(i);   {Update processors' completion times}
    else {df_i <= b}
      x_i := 0;
End CLUSTERFORK
2. P. Chr~tienne. Complexity of Tree-scheduling with Interprocessor Communication Delays. Technical Report 90.5, MASI , Pierre and Marie Curie University, Paris, 1990. 3. D. E. Culler et al. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78-85, November 1996. 4. L. Finta and Z. Liu. Complexity of Task Graph Scheduling with Fixed Communication Capacity. International Journal of Foundations of Computer Science, 8(1):4366, 1997. 5. W. LSwe, M. Middendorf, and W. Zimmermann. Scheduling Inverse Trees under the Communication Model of the LogP-Machine. Theoretical Computer Science, 1997. to appear. 6. J. Verriet. Scheduling Tree-structured Programs in the Logp Model. Technical Report UU-CS-1997-18, Department of Computer Science, Utrecht University, 1997.
A Data Layout Strategy for Parallel Web Servers*
Jörg Jensch, Reinhard Lüling, and Norbert Sensen
Department of Mathematics and Computer Science, University of Paderborn, Germany
{jaglay, rl, sensen}@uni-paderborn.de
Abstract. In this paper a new mechanism for mapping data items onto the storage devices of a parallel web server is presented. The method is based on careful observation of the effects that limit the performance of parallel web servers, and on studying the access patterns of these servers. On the basis of these observations, a graph theoretic concept is developed, and partitioning algorithms are used to allocate the data items. The resulting strategy is investigated and compared to other methods using experiments based on typical access patterns from web servers that are in daily use.
1
Introduction
Because of the exponential growth in terms of number of users of the World Wide Web, the traffic on popular web sites increases dramatically. To resolve capacity requirements on the server systems hosting popular web sites, one way to strengthen the capability of a server system is to use parallel server systems that contain a number of disks attached to different processors which are connected by a single bus or a scalable communication network. The same solution can be applied for very large web sites which cannot be stored on one disk but use a number of disks connected to a single processor system. The problem that shows up if a number of disks are used to store the overall amount of data is to balance the number of requests issued to the disks as evenly as possible in order to achieve the largest overall throughput from the disks. The requests that have to be served by one disk are strictly dependent on the d a t a layout of the server system, i.e. the mapping of the files stored on the server onto the different storage subsystems (disks). The investigation of a new concept for the data layout of parallel web servers is the focal point of this paper. The problem is of major relevance for the performance of web servers and can be assumed to become even more important if the current technological trends concerning the bandwidth offered by storage devices, processors and wide area * This work was partly supported by the MWF Project "Die Virtuelle Wissensfabrik", the EU project SICMA, and the DFG Sonderforschungsbereich 1511 "Massive Parallelits Algorithmen, Entwurfsmethoden, Anwendungen".
945 communication networks are regarded: Whereas the performance of processors and external communication networks (ATM, Gigabit Ethernet) has been dramatically increased over the last years, the read/write performance of storage devices shows only little increase over the last years. Thus, the performance of a server system can only be increased by the use of a larger number of disks t h a t delivers d a t a elements in parallel to the external communication network. Different approaches can be found in the literature in order to increase the performance of web servers: NCSA [7] and SWEB [4] have built a multiworkstation H T T P server based on round-robin domain name resolution(DNS) to assign requests to a set of workstations. In [5] this DNS-strategy is extended, the time-to-live (TTL) periods are used to distribute the load more evenly. In the past different researchers tried to identify the workload behavior with emphasis on the development of a caching strategy. NCSA developers analyzed user access patterns for system configurations to characterize the W W W traffic in terms of request count, request d a t a volume, and requested resources [9]. Arlitt and Williamsion [3] studied workloads of different Internet Web servers. Their emphasis placed on finding universal invariants characterizing Web servers workloads. Bestavros et. al [2] tried to characterize the degree of temporal and spatial locality in typical Web server reference streams. A general survey on file allocation is given in [13]. In [9] the problem of declustering is solved by transforming the problem onto a M A X - C U T problem. The d a t a layout method that is presented in this paper is based on the following idea: The goal of the layout is to minimize the number of items t h a t resides on the same disk and are requested often simultaneously. So we examine the access pattern of the past and place those items on different disks. Since the items which are often simultaneously requested are roughly the same at following days, this strategy leads to a good d a t a layout. The model of our parallel web server is presented in section 2. In section 3 we show that typical access patterns remain constant over some time, and collisions are typical patterns of a web server. In section 4 the d a t a layout strategy is presented and the strategy is studied in detail, compared with other methods and limits of the strategy are shown. The paper finishes on some concluding remarks in section 5.
2
Model
This section presents the model of the parallel web server that is used to describe the d a t a layout algorithm in the next sections. Thus, in our model we concentrate on the aspects that are i m p o r t a n t for the presentation in the rest of the paper. A parallel web server is built by the following entities: A number of processing modules t h a t are connected by some kind of network or bus architecture, a number of communication devices t h a t connect the processing modules to the external clients accessing the server and requesting information, and a n u m b e r of storage devices (disks) that are connected to the processing modules.
946 The parallel web server stores d a t a items (files) on the disks and works as follows: - The server is able to accept one or more requests arriving via the c o m m u n i cation devices f r o m the external clients per time step. - T h e server forwards a request to the disk holding the requested item. We assume here, t h a t each item is only stored once on the whole disk pool. - Every disk can accept only one request per time step. If more t h a n one request is sent to a disk per time step, these requests queue up. - The processing time for each request on the disk is constant and takes one time unit, so one disk can process only one request in one time unit. - All disks are independent, so t h a t the server can process a m a x i m u m of n requests per time step if n is the n u m b e r of disks. In case t h a t two requests are forwarded to the same disk in one time step, one request can be served and the other request has to wait for one t i m e unit. This conflict resolution can be done arbitrarily, i.e. we do not assume a specific protocol here. In case t h a t two (or more) requests access the same storage device (disk) we say, t h a t these requests are colliding, i.e. these requests are in collision. We say t h a t two requests collide if they arrive on the web server within a time interval of time a. We have chosen a to be one second. The aim of our work is now to develop a d a t a layout s t r a t e g y in order to m a p the d a t a items in a way onto the disks, t h a t the requests t h a t arrive at the server lead to a m i n i m a l n u m b e r of collisions and therefore to a m i n i m a l latency in answering the d a t a requests issued by the external clients.
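A direct way to obtain the collision counts used later is to scan a time-sorted request log with a sliding window of length δ (one second in this paper). The following C fragment is a minimal sketch of that counting step; the log contents, the number of files and the window are assumed example inputs:

#include <stdio.h>
#include <stdlib.h>

#define FILES 4

typedef struct { double t; int file; } Request;

static int cmp(const void *a, const void *b) {
    double d = ((const Request *)a)->t - ((const Request *)b)->t;
    return (d > 0) - (d < 0);
}

int main(void) {
    Request reqs[] = { {0.1,0}, {0.4,2}, {0.9,0}, {2.0,1}, {2.3,3}, {5.0,2} };
    int m = sizeof reqs / sizeof reqs[0];
    double delta = 1.0;                       /* collision window in seconds */
    long c[FILES][FILES] = {{0}};

    qsort(reqs, m, sizeof(Request), cmp);     /* order requests by arrival time */
    for (int a = 0; a < m; a++)
        for (int b = a + 1; b < m && reqs[b].t - reqs[a].t <= delta; b++)
            if (reqs[a].file != reqs[b].file) {
                c[reqs[a].file][reqs[b].file]++;
                c[reqs[b].file][reqs[a].file]++;
            }

    for (int i = 0; i < FILES; i++)
        for (int j = i + 1; j < FILES; j++)
            if (c[i][j]) printf("c(%d,%d) = %ld\n", i, j, c[i][j]);
    return 0;
}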
3 Monitoring the access to web servers
In order to increase the overall performance of a parallel web server it is n o t i m p o r t a n t to balance the overall n u m b e r of requests issued to the disks as evenly as possible, but to avoid t h a t a larger n u m b e r of requests are s u b m i t t e d to a single disk. This means t h a t in each small time interval the load has to be distributed as evenly as possible minimizing the latency time for a request this way. Thus, the aim is to minimize the n u m b e r of collisions on the disks. In order to get an impression of the collisions t h a t occur we have taken the log files f r o m the web server of the University of P a d e r b o r n (www.uni-paderborn.de) for N o v e m b e r and December 1997. We built all tuples (i, j) of files i and j stored on the web server in P a d e r b o r n and m e a s u r e d the n u m b e r of collisions, i.e. the n u m b e r of hits t h a t we m a d e on files i and j in a time interval a. Figure 1 shows the n u m b e r of collisions for all pairs of files for two consecutive days sorted by the collisions of the first day. T h e x-axis lists all tuples (i, j) of files according to the collisions on the first day, and the y-axis lists the n u m b e r of collisions for each pair of files. Now it is interesting to observe t h a t the collisions are very similar on the two consecutive days. So if we develop an algorithm, t h a t minimizes the collisions for a typical access p a t t e r n of the web server m o n i t o r e d at one day, these collisions will also
Fig. 1. Distribution of collisions
be eliminated on the next day. Thus, we can take the similarity of access patterns into account for the construction of our data layout strategy. This is the basic observation and the foundation of our data layout principle.
4 Data layout strategies
As described above, the data layout strategy aims at minimizing the number of collisions for a typical access pattern. The mapping that is computed in this way can be assumed to also avoid a large number of collisions in the future, as the distribution of collisions is very similar on consecutive days. In the following we will first explain the algorithm used to compute the data layout, and then evaluate the performance of this algorithm in detail.

4.1 Algorithm
Throughout the rest of the paper we define F to be the set of data items (files) and {(t_1, f_1), (t_2, f_2), ..., (t_m, f_m), ...} to be an access pattern for a web server, with t_i being the time when the request for data item (file) f_i in F arrives at the server. Then c(i, j) = |{ {(t, i), (t', j)} : |t - t'| <= δ }| is defined as the number of collisions of files i and j for the given access pattern, where δ is the collision interval introduced in section 2. Our strategy is to distribute the objects stored on the web server onto the given disks in such a way that collisions are minimized. For a given access pattern this leads to the following algorithmic problem:
given: A set of data items F, the access pattern, and a given number of storage devices n.
question: Determine a mapping π : F -> {1, ..., n} that minimizes the sum of c(i, j) over all i, j in F with π(i) = π(j).
It is easy to map this problem to the MAX-CUT problem. The MAX-CUT problem is defined as follows:
given: A graph G = (V, E), weights w(e) in IN and a number of partitions n.
question: Determine a partition of V into n subsets, such that the sum of the weights of the edges having their endpoints in different subsets is maximal.
The mapping is done in such a way that the data items (files) are the nodes of the graph that has to be partitioned, and the number of collisions c(i, j) determines the weight of edge {i, j}. The MAX-CUT problem is known to be NP-hard [6]. However, there are good polynomial approximation algorithms which deliver good solutions. In our experiments we make use of the PARTY library [12], containing an efficient implementation of an extension of the partitioning algorithm described in [8].
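The following C fragment sketches how the collision matrix drives the placement. It replaces the MAX-CUT heuristic of the PARTY library by a much simpler greedy rule (each file goes to the disk with the smallest collision weight towards the files already placed there), so it only illustrates the data flow of the method, not the quality of the partitioning; the collision counts are assumed example data:

#include <stdio.h>

#define FILES 5
#define DISKS 2

int main(void) {
    long c[FILES][FILES] = {              /* symmetric collision counts c(i,j) */
        {0, 9, 1, 0, 2},
        {9, 0, 0, 3, 0},
        {1, 0, 0, 7, 0},
        {0, 3, 7, 0, 1},
        {2, 0, 0, 1, 0},
    };
    int disk[FILES];

    for (int i = 0; i < FILES; i++) {     /* greedy placement, file by file */
        long best_cost = -1; int best_disk = 0;
        for (int d = 0; d < DISKS; d++) {
            long cost = 0;
            for (int j = 0; j < i; j++)
                if (disk[j] == d) cost += c[i][j];
            if (best_cost < 0 || cost < best_cost) { best_cost = cost; best_disk = d; }
        }
        disk[i] = best_disk;
    }

    long remaining = 0;                   /* collisions left on the same disk */
    for (int i = 0; i < FILES; i++)
        for (int j = i + 1; j < FILES; j++)
            if (disk[i] == disk[j]) remaining += c[i][j];

    for (int i = 0; i < FILES; i++) printf("file %d -> disk %d\n", i, disk[i]);
    printf("remaining (uncut) collisions: %ld\n", remaining);
    return 0;
}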
4.2 Collision Resolving if Access Patterns are Known in Advance
In a first step we examine the gain of our algorithm if the access pattern, and therefore the collisions, are known in advance. We take the access statistics of one day and determine the number of collisions of each pair of data items. On the basis of these statistics, we build the graph, partition the graph, and check how many collisions have remained and how many have been resolved. In the following we compare the results of our algorithm with the random mapping strategy for a number of access patterns. Each access pattern exactly represents all requests that were issued to the server during one day. We compare the number of collisions induced by the access pattern (k_a) with the number of remaining collisions that occur when applying the mapping algorithm described above (k_r). The factor f describes the ratio between the number of remaining collisions for the random mapping (which is k_a/n) and the mapping that is determined by the algorithm (k_r).
3152 12100 11367 10800 12021 8671 1305 1579
4.18 2.99 5.06 2.50 4.9712.52 5.00t2.50 5.2012.40 4.9812.51 3.4613.61 3.7113.37
In Table 1 the statistics and results of the partition of one week are shown. The table shows that all collisions up to 4 - 5 percent can be resolved and t h a t the algorithm has about 2 to 3 times the performance of the r a n d o m mapping. 4.3
Realistic
optimization
In this section we examine how many collisions can be resolved if we use the access statistics of the past. This approach can only be successful if the collisions of successive days have some similarity; we already examined this similarity in Section 3. Table 2 presents the average results for a number of days where the mapping was determined from the access pattern of day i − 1 and this mapping was used for collision avoidance on day i. The table shows the factor fpd comparing the performance of the random mapping method with the algorithm presented above, in relation to the number of disks n. It also shows the percentage of remaining collisions kr,pd [%] that could not be resolved. The number of remaining collisions with this mapping is clearly lower than with a random mapping, and the advantage of our mapping increases with the number of available disks.

Table 2. Comparison of random placement with the algorithm using the access pattern of the previous day (pd) and with the algorithm using the access pattern of the same day (sd)
  n   kr,pd [%]   fpd    kr,sd [%]   fsd    (ka/n − kr,pd) / (ka/n − kr,sd)
  2     45.2      1.11     43.1      1.16     0.70
  3     27.4      1.22     24.9      1.34     0.70
  4     18.7      1.34     16.3      1.54     0.72
  5     13.7      1.47     11.3      1.78     0.72
  6     10.5      1.61      8.2      2.05     0.73
  7      8.2      1.76      6.0      2.41     0.73
  8      6.7      1.87      4.7      2.73     0.74
The table also shows the loss which results from the fact that the access patterns of successive days are not identical. To see this, the corresponding percentage of remaining collisions kr,sd [%] and the factor fsd are given, which could be achieved if the access pattern of each day were known in advance. The value of the term (ka/n − kr,pd) / (ka/n − kr,sd) compares the improvement over the random mapping achieved by the algorithm when it only knows the access pattern of the day before with the improvement achieved when it knows the exact access pattern. The results show that the performance of our method decreases by about 30 percent if the data layout is computed on the basis of the access pattern from day i − 1 instead of day i, when the access pattern of day i is applied. This loss seems to be nearly independent of the number of disks, though it becomes somewhat smaller for larger numbers of disks. In general the results show that a data layout based on the access pattern of a previous day leads to a large reduction of the collisions later on.
4.4 Using a Number of Access Patterns to Determine the Data Layout
Up to now we have only used the access pattern of one day to compute the data layout. Here we present results for the case that a number of access patterns were used to determine this layout; all collisions from these patterns were taken into account. The results in Table 3 were obtained by taking the m days preceding a fixed date i and determining the data layout using all the collisions that occur in these m previous days. The experiments were made for different start dates i, so average, maximum and minimum could be studied. The table presents the percentage of the collisions that could not be resolved for the access pattern of date i using the different data layouts. It shows that the best results were achieved with a data layout determined from only a few days of access patterns; a larger number of access patterns does not increase the overall performance of the data layout method.

Table 3. Results of using larger access patterns for the determination of the data layout (percentage of unresolved collisions)

  days    2 partitions (avg / max / min)    8 partitions (avg / max / min)
   1        44.65 / 44.96 / 43.95              6.58 / 7.08 / 5.96
   2        44.49 / 44.92 / 43.60              6.40 / 6.90 / 5.80
   3        44.35 / 44.91 / 43.26              6.33 / 6.69 / 5.71
   4        44.38 / 44.85 / 43.43              6.35 / 6.93 / 5.78
   5        44.48 / 45.21 / 43.47              6.28 / 6.83 / 5.63
   6        44.48 / 45.15 / 43.66              6.26 / 6.85 / 5.64
   7        44.48 / 45.24 / 43.51              6.30 / 6.88 / 5.53
   8        44.42 / 44.94 / 43.73              6.26 / 6.88 / 5.68
   9        44.46 / 45.37 / 43.63              6.24 / 6.86 / 5.58
  10        44.39 / 45.22 / 43.32              6.25 / 6.71 / 5.58
  11        44.46 / 45.45 / 43.36                –  / 6.80 / 5.63
  12        44.38 / 45.07 / 43.40              6.28 / 6.75 / 5.64
  13        44.40 / 45.30 / 43.47              6.33 / 6.83 / 5.73
  14        44.37 / 44.91 / 43.43              6.29 / 7.00 / 5.67
4.5 Update Frequency for the Data Layout
Using the results of the previous sections we know that the data layout method presented here provides reasonable performance improvements over randomization strategies when it takes the access pattern into account. The results also show that a typical access pattern used to determine the data layout should contain the access information of only a few days; no better results could be obtained with more information. The question is now: if the data layout is determined on a day i taking the access patterns of days i − 1, i − 2, ..., i − x into account (for small x), how will it perform on a day i + d, that is, d days in the future? This answers the question of how often the data layout has to be updated to achieve the best results. The following experiment was done to answer this question: a data layout was determined taking the access pattern of day i into account. Then the access pattern of day i + d was applied to this data layout and the number of collisions that had not been resolved was measured. The results for different values of d (x-axis) are shown in Figure 2. As the results were gathered for different starting days i, average, maximum and minimum values could be determined for every d. The results show that the number of unresolved collisions increases steadily, thus the data layout should be updated as often as possible (taking the latest access pattern into account) to achieve the best results.

Fig. 2. Influence of time since last update of data layout on performance (percentage of unresolved collisions plotted against days after last partitioning)
5 Conclusions

This paper presents a new strategy for allocating data items onto the storage devices of a parallel or sequential web server that contains a number of disks to store this data. The method is based on the observation that requests that often arise in parallel on the web server should be forwarded to different storage devices. Using this observation, a graph-theoretic formulation is developed and the problem is reduced to a MAX-CUT problem that is solved using heuristics. The results show that the performance improvements of the strategy are considerable compared to the randomization strategies that are usually used, and that the performance improvement scales up with the number of disks used in the parallel server. The investigations presented in the paper also show that it is not necessary to observe the accesses to the parallel web server for a long time to determine a typical access pattern on which to base the data layout, and that the data layout has to be updated from time to time, since for a fixed data layout the number of resolved collisions decreases steadily. Future work will focus on an extended method that determines the number of duplicates for data items, taking different restrictions (in terms of disk capacity or number of copies) into account.
References

1. M. Adler, S. Chakrabarti, M. Mitzenmacher, L. Rasmussen: Parallel Randomized Load Balancing. Proc. 27th Annual ACM Symp. on Theory of Computing (STOC '95), 1995.
2. Virgilio Almeida, Azer Bestavros, Mark Crovella and Adriana de Oliveira: Characterizing Reference Locality in the WWW. Proceedings of PDIS'96: The IEEE Conference on Parallel and Distributed Information Systems, December 1996.
3. Martin Arlitt and Carey Williamson: Web server workload characterization: The search for invariants. Proceedings of ACM SIGMETRICS'96, 1996.
4. D. Andresen, T. Yang, V. Holmedahl and O. Ibarra: SWEB: Towards a Scalable WWW Server on MultiComputers. Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), April 1996.
5. M. Colajanni and P.S. Yu: Adaptive TTL schemes for Load Balancing of Distributed Web Servers. ACM SIGMETRICS Performance Evaluation Review, volume 25(2), 1997.
6. R. M. Karp: Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher (eds.), Complexity of Computer Computations, pages 85-103, 1972.
7. Eric Dean Katz, Michelle Butler and Robert McGrath: A scalable HTTP server: The NCSA prototype. Computer Networks and ISDN Systems, volume 27, pages 155-164, 1994.
8. B.W. Kernighan and S. Lin: An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, pages 291-307, February 1970.
9. T.T. Kwan, R.E. McGrath and D.A. Reed: NCSA World Wide Web Server: Design and Performance. IEEE Computer, 28(11), pages 68-74, November 1995.
10. Y.H. Liu, P. Dantzig, C.E. Wu, J. Challenger and L.M. Ni: A distributed Web server and its performance analysis on multiple platforms. Proceedings of the 16th International Conference on Distributed Computing Systems, IEEE, 1996.
11. D.R. Liu and S. Shekhar: Partitioning Similarity Graphs: A Framework for Declustering Problems. Information Systems, volume 21(6), pages 475-496, 1996.
12. R. Preis and R. Diekmann: The PARTY Partitioning Library - User Guide. Technical Report tr-rsfb-96-024, University of Paderborn.
13. Benjamin W. Wah: File placement on distributed computer systems. Computer: Magazine of the IEEE Computer Society, 17(1), January 1984.
ViPIOS: The Vienna Parallel Input/Output System*

Erich Schikuta, Thomas Fuerle and Helmut Wanek

Institute for Applied Computer Science and Information Systems, Department of Data Engineering, University of Vienna, Rathausstr. 19/4, A-1010 Vienna, Austria
schiki@ifs.univie.ac.at
Abstract. In this paper we present the Vienna Parallel Input Output System (ViPIOS), a novel approach to enhance the I/O performance of high performance applications. It is a client-server based tool combining capabilities found in parallel I/O runtime libraries and parallel file systems.
1 Introduction
In the last few years the applications in high performance computing (Grand Challenges [1]) have shifted from being CPU-bound to being I/O-bound. Performance can no longer be scaled up by increasing the number of CPUs, but only by increasing the bandwidth of the I/O subsystem. This situation is commonly known as the I/O bottleneck in high performance computing ([5]). In reaction, all leading hardware vendors of multiprocessor systems have provided powerful concurrent I/O subsystems. Accordingly, researchers have focused on the design of appropriate programming tools to take advantage of the available hardware resources.

1.1 The ViPIOS Approach
Conventionally, two different directions in developing programming support are distinguished: runtime libraries for high-performance languages (e.g. Passion [6]) and parallel file systems (e.g. IBM Vesta [4]). We see a solution to the parallel I/O problem in a combination of both approaches, which results in a dedicated, smart, concurrently executing runtime system that gathers all available information about the application process, both during compilation and during runtime execution. Initially it can provide the optimal fitting data access profile for the application, and it may then react to the execution behavior dynamically, allowing it to reach optimal performance by aiming for maximum I/O bandwidth.

* This work was carried out as part of the research project "Language, Compiler, and Advanced Data Structure Support for Parallel I/O Operations" supported by the Austrian Science Foundation (FWF Grant P11006-MAT).
This approach led to the design and development of the Vienna Parallel Input Output System, ViPIOS ([2, 3]). ViPIOS is an I/O runtime system which provides efficient access to persistent files by optimizing the data layout on the disks and allowing parallel read/write operations. ViPIOS is targeted as a supporting I/O module for high performance languages (e.g. HPF).
2 System Architecture
The basic idea to solve the I/O bottleneck in ViPIOS is de-coupling: the disk access operations are de-coupled from the application and performed by an independent I/O subsystem, ViPIOS. This leads to the situation that an application just sends general I/O requests to ViPIOS, which performs the actual disk accesses in turn. This idea is illustrated in Figure 1.

Fig. 1. Disk access de-coupling

Fig. 2. ViPIOS system architecture
Thus ViPIOS's system architecture is built upon a set of cooperating server processes, which accomplish the requests of the application client processes. Each application process AP is linked by the ViPIOS interface VI to the ViPIOS servers VS (see Figure 2). The server processes run independently on all or a number of dedicated processing nodes of the underlying MPP. It is also possible for an application client and a server to share the same processor. Generally each application process is assigned exactly one ViPIOS server (called the buddy server of the application), but one ViPIOS server can serve a number of application processes, i.e. there is a one-to-many relationship between servers and application processes (see Figure 3). The other ViPIOS servers are called foe servers with respect to the application.
Fig. 3. "Buddy" and "Foe" Servers

Fig. 4. ViPIOS server architecture

2.1 ViPIOS Server
A ViPIOS server process consists of several functional units as depicted in Figure 4. Basically we differentiate between 3 layers:

- The Interface layer provides the connection to the "outside world" (i.e. applications, programmers, compilers, etc.). Different interfaces are supported by interface modules to allow flexibility and extensibility. Until now we have implemented an HPF interface module (aiming for the VFC, the HPF derivative of Vienna FORTRAN), a (basic) MPI-IO interface module, and the specific ViPIOS interface, which is also the interface for the specialized modules.
- The Kernel layer is responsible for all server specific tasks.
- The Disk Manager layer provides the access to the available and supported disk sub-systems. This layer too is modularized to allow extensibility and to simplify the porting of the system. At the moment ADIO [7], MPI-IO, and Unix style file systems are supported.

The ViPIOS kernel layer is built up of four cooperating functional units:

- The Message Manager is responsible for the external (to the applications) and internal (to other ViPIOS servers) communication.
- The Fragmenter can be seen as "ViPIOS's brain". It represents a smart data administration tool, which models different distribution strategies and makes decisions on the effective data layout, administration, and ViPIOS actions.
- The Directory Manager stores the meta information of the data. We designed 3 different modes of operation: centralized (one dedicated ViPIOS directory server), replicated (all servers store the whole directory information), and localized (each server knows only the directory information of the data it is storing) management. Until now only localized management is implemented.
- The Memory Manager is responsible for prefetching, caching and buffer management.
Requests are issued by an application via a call to one of the functions of the ViPIOS interface, which in turn translates this call into a request message that is sent to the buddy server. The local directory of the buddy server holds all the information necessary to map a client's request to the physical files on the disks. The fragmenter uses this information to decompose (fragment) a request into sub-requests which can be resolved locally and sub-requests which have to be communicated to other ViPIOS servers (foe servers). The I/O subsystem actually performs the necessary disk accesses and the transmission of data to/from the AP.
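A minimal sketch of this fragmentation step is given below. It assumes a hypothetical block-granular directory and a fixed block size; the function and variable names are illustrative only and are not taken from the ViPIOS API.

```python
def fragment_request(request, directory, my_server_id):
    """Split a client's byte-range request into sub-requests that can be
    served locally and sub-requests that must be forwarded to foe servers.
    `directory` maps (file, block) -> owning server id; block size is fixed."""
    BLOCK = 64 * 1024
    local, remote = [], {}
    fname, offset, length = request
    pos = offset
    while pos < offset + length:
        block = pos // BLOCK
        end = min((block + 1) * BLOCK, offset + length)
        owner = directory[(fname, block)]
        piece = (fname, pos, end - pos)
        if owner == my_server_id:
            local.append(piece)
        else:
            remote.setdefault(owner, []).append(piece)
        pos = end
    return local, remote

# Example: blocks of file "f" alternate between servers 0 and 1.
directory = {("f", b): b % 2 for b in range(4)}
print(fragment_request(("f", 0, 200 * 1024), directory, my_server_id=0))
```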
2.2 System Modes

ViPIOS can be used in 3 different system modes:

- runtime library,
- dependent system, or
- independent system.

These modes are depicted in Figure 5.

Fig. 5. ViPIOS system modes
Runtime Library. Application programs can be linked with a ViPIOS runtime module, which performs all disk I/O requests of the program. In this case ViPIOS is not running on independent servers, but as part of the application. The ViPIOS interface is therefore not only issuing the requested data action, but also performing it itself. This mode provides only restricted functionality due to the missing independent I/O system; parallelism can only be expressed by the application (i.e. the programmer).

Dependent System. In this case ViPIOS runs as an independent module in parallel to the application, but is started together with the application. This is caused by the MPI¹-specific characteristic that cooperating processes have to be started together in the same communication world; processes of different worlds cannot communicate at present. This mode allows smart parallel data administration but requires a preceding preparation phase.

Fig. 6. ViPIOS data abstraction
Independent System. This is the mode of choice to achieve the highest possible I/O bandwidth by exploiting all available data administration possibilities. In this case ViPIOS runs similarly to a parallel file system or a database server, waiting for applications to connect via the ViPIOS interface. This connection is realized by a proprietary communication layer bypassing MPI. We implemented two different approaches, one by using PVM, the other by patching MPI. A third promising approach currently being evaluated employs PVMPI, a possible upcoming standard under development for coupling MPI worlds by PVM layers.

3 Data Abstraction in ViPIOS
ViPIOS provides a data independent view of the stored data to the application processes. Three independent layers in the ViPIOS architecture can be distinguished, which are represented by file pointer types in ViPIOS:

- Problem layer. Defines the problem specific data distribution among the cooperating parallel processes (View file pointer).
- File layer. Provides a composed view of the persistently stored data in the system (Global file pointer).
- Data layer. Defines the physical data distribution among the available disks (Local file pointer).

¹ The MPI standard is the underlying message passing tool of ViPIOS to ensure portability.
Thus data independence in ViPIOS separates these layers conceptually from each other, providing mapping functions between them: logical data independence between the problem and the file layer, and physical data independence between the file and the data layer. This concept is depicted in Figure 6, which shows a cyclic data distribution.
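To illustrate how such mapping functions might compose, here is a hypothetical sketch (not the ViPIOS API): the function names and the CYCLIC(1) view / round-robin striping assumptions are mine.

```python
def view_to_global(view_index, process_id, num_processes):
    """Problem layer -> file layer: element k of a CYCLIC(1) view held by
    `process_id` corresponds to this position in the logical file."""
    return view_index * num_processes + process_id

def global_to_local(global_index, num_disks):
    """File layer -> data layer: logical file position mapped to
    (disk, local offset) when the file is striped round-robin over disks."""
    return global_index % num_disks, global_index // num_disks

# Process 1 of 4 accesses its 3rd view element; the data lives on 2 disks.
g = view_to_global(2, process_id=1, num_processes=4)   # global position 9
print(g, global_to_local(g, num_disks=2))               # (disk 1, offset 4)
```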
4 Conclusions and Future Work
Conclusions and Future Work
In this paper we presented the Vienna Parallel Input Output System (ViPIOS), a novel approach to parallel I / O based on a client server concept which combines the advantages of existing parallel file systems and parallel I / O libraries. We described the underlying design principles of our approach and gave an in-depth presentation of the developed system.
References 1. A Report by the Committee on Physical, Math., and Eng. Sciences Federal Coordinating Council for Science, Eng. and Technology. High-Performance Computing and Communications, Grand Challenges 1993 Report, pages 41 - 64. Committee on Physical, Math., and Eng. Sciences Federal Coordinating Council for Science, Eng. and Technology, Washington D.C., October 1993. 2. Peter Brezany, Thomas A. Mueck, and Erich Schikuta. Language, compiler and parallel database support for I/O intensive applications. In Proceedings of the International Conference on High Performance Computing and Networking, volume 919 of Lecture Notes in Computer Science, pages 14-20, Milan, Italy, May 1995. Springer-Verlag. also available as Technical Report of the Inst. f. Software Technology and Parallel Systems, University of Vienna, TR95-8, 1995. 3. Peter Brezany, Thomas A. Mueck, and Erich Schikuta. A software architecture for massively parallel input-output. In Third International Workshop PARA'96 (Applied Parallel Computing - Industrial Computation and Optimization), volume 1186 of Lecture Notes in Computer Science, pages 85-96, Lyngby, Denmark, August 1996. Springer-Verlag. Also available as Technical Report of the Inst. f. Angewandte Informatik u. Informationssysteme, University of Vienna, TR 96202. 4. Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. A CM Transactions on Computer Systems, 14(3):225-264, August 1996. 5. Juan Miguel del Rosario and Alok Choudhary. High performance l/O for parallel computers: Problems and prospects. IEEE Computer, 27(3):59-68, March 1994. 6. Rajeev Thakur, Alok Choudhary, Rajesh Bordawekar, Sachin More, and Sivaramakrishna Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996. 7. Rajeev Thakur, William Gropp, and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 180-187, October 1996.
A Performance Study of Two-Phase I/O

Phillip M. Dickens
Department of Computer Science, Illinois Institute of Technology

Rajeev Thakur
Mathematics and Computer Science Division, Argonne National Laboratory
Abstract. Massively parallel computers are increasingly being used to solve large, I/O-intensive applications in many different fields. For such applications, the I/O subsystem represents a significant obstacle in the way of achieving good performance. While massively parallel architectures do, in general, provide parallel I/O hardware, this alone is not sufficient to guarantee good performance. The problem is that in many applications each processor initiates many small I/O requests rather than fewer larger requests, resulting in significant performance penalties due to the high latency associated with I/O. However, it is often the case that in the aggregate the I/O requests are significantly fewer and larger. Two-phase I/O is a technique that captures and exploits this aggregate information to recombine I/O requests such that fewer and larger requests are generated, reducing latency and improving performance. In this paper, we describe our efforts to obtain high performance using two-phase I/O. In particular, we describe our first implementation, which produced a sustained bandwidth of 78 MBytes per second, and discuss the steps taken to increase this bandwidth to 420 MBytes per second.
1 Introduction
Massively parallel computers are increasingly used to solve large, I/O-intensive applications in several different disciplines. However, in many such applications the I/O subsystem performs poorly and represents a significant obstacle to achieving good performance. The problem is generally not with the hardware; many parallel I/O subsystems offer excellent performance. Rather, the problem arises from other factors, primarily the I/O patterns exhibited by many parallel scientific applications [1, 5]. In particular, each processor tends to make a large number of small I/O requests, incurring the high cost of I/O on each such request. The technique of collective I/O has been developed to better utilize the parallel I/O subsystem [2, 7, 8]. In this approach, the processors exchange information about their individual I/O requests to develop a picture of the aggregate I/O request. Based on this global knowledge, I/O requests are combined and submitted in their proper order, making a much more efficient use of the I/O subsystem.
Two significant implementation techniques for collective I/O are two-phase I/O [2, 7] and disk-directed I/O [4, 6]. In the two-phase approach, the application processors collectively determine and carry out the optimized strategy; in this paper we deal only with the two-phase approach. Consider a collective read operation. If the data is distributed across the processors in a way that conforms to the way it is stored on disk, each processor can read its local array in one large I/O request. This distribution is termed the conforming distribution, and represents the optimal I/O performance. Now assume the array is not distributed across the processors in a conforming manner. The processors can still perform the read operation assuming the conforming distribution, and then use interprocessor communication to redistribute the data to the desired distribution. Since interprocessor communication is orders of magnitude faster than I/O calls, it is possible to obtain performance that approaches that of the conforming distribution. The question then arises as to the best way to implement the two-phase I/O algorithm. While much has been published regarding the performance gains obtained with this technique, relatively little has been written about specific implementation issues and how these issues affect performance. In this paper, we focus on the implementation issues that arise when implementing a two-phase I/O algorithm. We start by describing a very simple implementation and go through a sequence of steps in which we explore different optimizations of this basic implementation. At each step, we describe the modification to the algorithm and discuss its impact on performance.
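To make the two phases concrete, the following single-process Python simulation sketches a two-phase collective write for column-block data and a row-block conforming distribution. It is an illustration under assumed data layouts, not the authors' Paragon implementation; the function name, array sizes and file path are hypothetical.

```python
import numpy as np

def two_phase_write(local_blocks, p, path):
    """Toy two-phase collective write, simulated in a single process.
    Each of the p "processors" holds a column block of an n x n array;
    the conforming distribution is by row blocks, so data is first
    redistributed (phase 1) and then written in p large chunks (phase 2)."""
    n = local_blocks[0].shape[0]
    rows = n // p
    # Phase 1: all-to-all exchange -- "processor" r collects its row block.
    write_bufs = [
        np.hstack([local_blocks[c][r * rows:(r + 1) * rows, :] for c in range(p)])
        for r in range(p)
    ]
    # Phase 2: each "processor" issues one large contiguous write.
    with open(path, "wb") as f:
        for buf in write_bufs:
            f.write(buf.tobytes())

if __name__ == "__main__":
    p, n = 4, 8
    a = np.arange(n * n, dtype=np.int32).reshape(n, n)
    cols = n // p
    blocks = [a[:, c * cols:(c + 1) * cols].copy() for c in range(p)]
    two_phase_write(blocks, p, "/tmp/twophase.bin")
    assert np.fromfile("/tmp/twophase.bin", dtype=np.int32).reshape(n, n).tolist() == a.tolist()
```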
2 Experimental Design
We performed all experiments on the Intel Paragon located at the California Institute of Technology. This Paragon has 381 compute nodes and an I/O subsystem with 64 SCSI I/O processors, each of which controls a 4 GB Seagate drive. We define the maximum I/O performance as that obtained when each processor writes its portion of the file assuming the conforming distribution. We define the naive approach as each processor performing its own independent file access without any global strategy, i.e. no collective I/O. We assume an SPMD computation, where each processor operates on the portion of the global array that is located in its local memory. We study the costs of writing a two-dimensional array to disk for various two-phase I/O implementations. Our metric of interest is the percentage of the maximum bandwidth achieved by each approach. The application does nothing except make repeated calls to the two-phase I/O routine. Our experiments involved a two-dimensional array of integers with dimensions 4096 × 4096 (for a total file size of 64 megabytes). We used MPI for all communications. To conserve space, we present one graph which shows the performance of each of the various approaches.
3 Experimental Results
The initial implementation of the two-phase I/O algorithm is quite simple. First, the processors exchange information related to their individual I/O requirements to determine the collective I/O requirement. Next, each processor goes through a series of sends and receives to redistribute the data into the conforming distribution. When a processor receives a portion of its data it performs a simple byte-by-byte copy into its write buffer. When a processor sends a portion of its data it performs a byte-by-byte copy from its local array into the send buffer. After all data has been exchanged, each processor performs its write operation in one large request. This initial implementation uses both blocking sends and blocking receives. The cost associated with the naive implementation (no collective I/O) is the latency incurred from issuing many small I/O requests. There are four primary costs associated with the simple implementation of two-phase I/O. First is the extra buffer space required for the write buffer and the communication buffers. Second is the cost of copying data into the communication buffers, copying data from the communication buffers to the write buffer, and copying data from the local array into the write buffer. Third is interprocessor communication, and fourth is the actual writing of the data to disk. The results for the non-collective I/O implementation are depicted in the curve labeled the naive approach in Figure 1. The initial implementation of the two-phase I/O algorithm is depicted in the curve labeled Step1. As noted, the graph depicts the percentage of the maximum bandwidth achieved by each approach for 16, 64 and 256 processors. With 16 processors, each processor must allocate a four-megabyte write buffer and the messages passed between the processors are one megabyte. With these values, the time to perform the many small I/O operations is virtually the same as the extra costs associated with the two-phase approach. When we move to 64 processors, however, each processor allocates a much smaller write buffer (one megabyte), and the messages between the processors are much smaller (131072 bytes); thus its relative performance is improved. With the naive approach, however, the additional processors are all issuing many small I/O requests, resulting in significant contention for the IOPs and the communication network. With 64 processors, this very simple implementation of the two-phase I/O algorithm improves performance by a factor of 3; with 256 processors, it improves performance by a factor of 40.

3.1 Step2: Reducing Copy Costs
As noted above, two-phase I/O requires a significant amount of copying. For this reason, we would expect modifications that reduce the cost of copying data to have a significant impact on performance. The next step then is to change all of the copy routines to use the memcpy() library call whenever possible. This optimization, labeled Step2 in Figure 1, has a tremendous impact on performance. As can be seen, with the optimized copying routine this implementation of the two-phase I/O algorithm outperforms both the naive approach and the initial implementation. The improvement in performance over the initial implementation varies between a factor of 1.7 and a factor of 15. The reason for such a tremendous difference is that memcpy() uses a block move instruction and requires only four assembly language instructions per block of memory, whereas performing a byte-by-byte copy requires 20 assembly language instructions per byte.
3.2 Step3: Asynchronous Communication
The next step we investigated was performing asynchronous rather than synchronous communication. In this implementation, a processor first posts all of its sends (using MPI_Isend) and then posts all of its receives (using MPI_Irecv). After posting its communications, the processor copies any of its own data into the write buffer. The processor then waits for all of the messages it needs to receive, copying the data into the write buffer as they arrive. It then waits for all of the send operations to complete and performs the write. The trade-offs in this implementation are quite interesting. In previous versions of the algorithm, a processor would alternate between sending a message and waiting to receive a message, blocking until a specific message arrived before posting its next send. The advantage is that the processor frees the send and receive buffers immediately, thus maintaining only one communication buffer at any time; with reduced buffering requirements the cost of paging is decreased. There are two advantages to using asynchronous communication. First, the underlying system can complete any message that is ready without having to wait for a particular message. Second, the processor can perform the copying between its local array and its write buffer with the asynchronous communication going on in the background. The primary drawback of this approach is that each asynchronous request requires its own communication buffer, so the cost of buffering is considerably higher, reducing performance. The results are shown in the curve labeled Step3 in Figure 1. With 16 processors, the asynchronous approach further improves performance by approximately 20%. With 64 processors, however, this improvement over the previous implementation is reduced to approximately 5%, and there is no noticeable improvement with 256 processors. The reason for the decrease in benefit with 64 and 256 processors is that as the number of processors increases, the time required to perform the actual write to disk begins to dominate the costs of the algorithm; thus asynchronous communication is less important than it is with a smaller number of processors.
3.3 Step4: Reversing the Order of the Asynchronous Communication
The next approach is to reverse the order of the asynchronous sends and receives. Thus a processor first posts all of its asynchronous receives and then posts its asynchronous sends. The idea behind this optimization is that MPI communications are generally much faster if the receive has been posted (and thus a buffer in which to receive the message has been provided) before the message is sent [3]. Thus pre-allocating all of the receive buffers before the corresponding sends are initiated should improve the interprocessor communication costs. There is of course the same issue discussed above: pre-allocating all of the communication buffers can result in increased paging activity. The results are given in the curve labeled Step4 in Figure 1. With 16 processors, reversing the order of sends and receives provides a 15% improvement over the previous approach. Again this improvement in performance decreases as the number of processors is increased: with 64 processors there is an improvement of approximately 5%, and the improvement disappears with 256 processors. This is again due to the fact that the write time begins to dominate the cost of the algorithm as the number of processors increases.
3.4 Step5: Combining Synchronous with Asynchronous Communication
The final optimization we pursued was to combine asynchronous receives with synchronous sends. The idea behind this optimization is that posting all of the receive buffers will improve communication costs, while releasing the send buffer after each use will reduce the buffering costs. The results are shown in the curve labeled Step5 in Figure 1. This implementation results in performance gains of up to 10%.
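A sketch of this combined pattern using mpi4py is shown below; this is an assumption made for illustration (the authors' implementation is not mpi4py-based), and the function name and buffer layout are hypothetical. The point is the ordering: all receives are pre-posted, sends are blocking so each send buffer is reusable immediately, and the write buffer is assembled once all pieces have arrived.

```python
import numpy as np
from mpi4py import MPI

def step5_exchange(local_piece_for, my_piece_size, comm):
    """local_piece_for[d] is the int32 data this rank must send to rank d.
    Returns the assembled contiguous write buffer for this rank."""
    rank, size = comm.Get_rank(), comm.Get_size()
    recv_bufs = [np.empty(my_piece_size, dtype=np.int32) for _ in range(size)]
    # Pre-post all receives so matching sends find a buffer waiting.
    reqs = [comm.Irecv(recv_bufs[src], source=src, tag=0)
            for src in range(size) if src != rank]
    # Blocking sends: the send buffer is free again as soon as Send returns.
    for dst in range(size):
        if dst != rank:
            comm.Send(local_piece_for[dst], dest=dst, tag=0)
    recv_bufs[rank][:] = local_piece_for[rank]      # this rank's own data
    MPI.Request.Waitall(reqs)
    return np.concatenate(recv_bufs)                # one large write follows
```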
Fig. 1. Improvement in performance (fraction of maximum bandwidth) as a function of the implementation and the number of processors, for a 4096 × 4096 array of integers.
3.5 Scalability of Two-Phase I/O

Fig. 2. The bandwidth achieved by each approach (naive, Step1-Step5) as the file size is increased and the number of processors is a constant 256 (bandwidth in MBytes/s vs. total file size in megabytes).
For completeness, we examine the behavior of these two-phase I/O implementations as the amount of data being written increases and the number of processors remains a constant 256. We looked at total file sizes of 16 megabytes, 64 megabytes, 256 megabytes and 1 gigabyte. The results are shown in Figure 2, where the bandwidth achieved by each approach is given as a function of the total file size; for comparison, the maximum bandwidth is also shown. As can be seen, neither the naive approach nor the first two-phase implementation scales well. It is interesting to note that Step3, where all communication is asynchronous and the sends are posted before the receives, performs more poorly in the limit than does Step2, where all communication is synchronous. Also, in the limit there is a small improvement in performance between Step2 and Step4, and both approaches appear to scale relatively well. The combination of asynchronous receives and synchronous sends scales very well and, in the limit, approaches the optimal performance. It is important to note that in the limit the initial implementation of two-phase I/O resulted in a bandwidth of 78 megabytes per second, while the final implementation resulted in a bandwidth of 420 megabytes per second.
4 Conclusions
In this paper, we investigated the impact on performance of various implementation techniques for two-phase I/O, and outlined the steps we followed to obtain high performance. We began with a very simple implementation of two-phase I/O that provided a bandwidth of 78 megabytes per second, and ended with an optimized implementation that provided 420 megabytes per second. During the course of this analysis, we provided a detailed look at the trade-offs involved in the implementation of a two-phase I/O algorithm. Current research is aimed at extending these results to other parallel architectures such as the IBM SP2.
References

1. Crandall, P., Aydt, R., Chien, A. and Reed, D.: Input-Output Characteristics of Scalable Parallel Applications. In Proceedings of Supercomputing '95, ACM Press, December 1995.
2. Del Rosario, J., Bordawekar, R. and Choudhary, A.: Improved parallel I/O via a two-phase run-time access strategy. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 56-70, Newport Beach, CA, 1993.
3. Gropp, W., Lusk, E. and Skjellum, A.: Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, Cambridge, Massachusetts, 1996.
4. Kotz, D.: Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
5. Kotz, D. and Nieuwejaar, N.: Dynamic file-access characteristics of a production parallel scientific workload. In Supercomputing '94, pages 640-649, November 1994.
6. Kotz, D.: Expanding the potential for disk-directed I/O. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 490-495, IEEE Computer Society Press, 1995.
7. Thakur, R. and Choudhary, A.: An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays. Scientific Programming, 5(4):301-317, Winter 1996.
8. Thakur, R., Choudhary, A., More, S. and Kuditipudi, S.: Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.
Workshop 13+14: Architectures and Networks

Kieran Herley and David Snelling, Co-chairmen

This workshop embraces two broad themes: routing and communication networks, and parallel computer architecture. The first theme is devoted to communication in parallel computers and covers all aspects of the theory and practice of the design, analysis and evaluation of interconnection networks and their routing algorithms. The continued vitality of this area is reflected by the strength and quality of the papers presented in Sessions 1 and 3 (Routing Networks and Algorithms I and II) of this workshop. Although recent years have seen computer architecture undergoing a period of contraction and a tendency toward uniformity, in the form of shared memory systems composed of superscalar processors, there is nonetheless no lack of innovation in the field. The computer architecture theme, represented by the papers of Sessions 2 and 4 (Computer Architecture I and II), highlights not only innovation within the converging area of superscalar systems, but the continued presence of novel ideas as well. A total of 32 submissions were received (19 on the theme of routing and communication networks and 13 on the computer architecture theme), of which 15 were accepted by the workshop committee for presentation at the conference. Session 1 is devoted to the issue of routing algorithms for unstructured or loosely structured communication on various networks. The paper by S. R. Donaldson, J. M. D. Hill and D. B. Skillicorn entitled "Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP" addresses the problem of supporting the well-known BSP model on networks of workstations running TCP/IP. Techniques that offer high throughput with low variance are presented and evaluated experimentally with encouraging results. N. Agrawal and C. P. Ravikumar consider novel techniques for efficient wormhole routing where deadlock is handled not by avoidance but by detection and recovery in their paper "Adaptive Routing Based on Deadlock Recovery". An adaptive routing technique is presented that ensures that deadlocks are infrequent, and two schemes for deadlock recovery are described and evaluated. M. Ould-Khaoua's paper "On the Optimal Topology for Multicomputers with Fully-Adaptive Wormhole Routing: Torus or Hypercube?" revisits the hypercube versus torus comparison for wormhole routing networks. Whereas previous studies, generally based on nonadaptive routing, indicated the superiority of torus topologies, the paper suggests that under certain conditions the hypercube may offer superior performance when adaptive techniques are employed. The fundamental h-relation routing problem is the focus of the paper "Constant Thinning Protocol for Routing in Complete Networks" by A. Kautonen, V. Leppänen
and M. Penttonen. They present and evaluate a randomized algorithm for this problem on the OCPC model, with encouraging results. T. Grün and M. A. Hillebrand's paper, "NAS Integer Sort on Multi-Threaded Shared Memory Machines", proposes novel techniques for improving performance in the already novel environment of multi-threaded computer architecture. They use both the commercial Tera MTA and the experimental SB-PRAM machines as platforms for these techniques. In "Analysing a Multistreamed Superscalar Speculative Instruction Fetch Mechanism", R. R. dos Santos and P. O. A. Navaux address the possibility of a more conservative step toward multithreading within the more traditional superscalar environment. Their primary target is the performance loss due to instruction fetch latency resulting from mispredicted branches. Not surprisingly, we find detailed consideration of implementation costs critical in the design of special-purpose processor arrays. In "Design of Processor Arrays for Real-time Applications", D. Fimmel and R. Merker study various algorithms for optimizing the layout of uniform processes on processor arrays. Session 3 is largely devoted to routing algorithms for relatively structured communication such as gossiping or all-to-all scattering. The paper "Interval Routing and Layered Cross Product: Compact Routing Schemes for Butterflies, Mesh of Trees and Fat Trees" by T. Calamoneri and M. Di Ianni shows how shortest-path routing information may be maintained in a compact (space-efficient) way for a class of interconnections that includes the butterfly, the mesh of trees and certain fat-trees. The results rely on an elegant exploitation of interval routing techniques. The gossiping problem involves simultaneously broadcasting a set of packets, where each node is the origin of a distinct packet and must receive a copy of the packet that originates at every other node. This problem is the subject of both the paper "Gossiping Large Packets on Full-Port Tori" by U. Meyer and J. F. Sibeyn and "Time-Optimal Gossip in Noncombining 2-D Tori with Constant Buffers" by M. Soch and P. Tvrdík. The former presents a near-optimal algorithm for two- and higher-dimensional tori that is time-independent (each node repeats the same simple routing action for the duration of the algorithm's execution), whereas the latter presents a time-optimal algorithm for two-dimensional tori that requires only a constant amount of buffer space per node. The paper "Divide-and-Conquer Algorithms on Two-Dimensional Meshes" by M. Valero-García, A. González, L. Díaz de Cerio and D. Royo explores techniques for supporting divide-and-conquer computations on the two-dimensional mesh. The approach provides efficient solutions to this problem under both the wormhole and store-and-forward routing regimes. The all-to-all-scatter problem involves routing a set of distinct packets, one for each source-destination node pair, where each packet is to be routed from its source to its destination. The paper "Average Distance and All-to-All Scatter in Kautz Networks" by P. Salinger and P. Tvrdík describes an asymptotically optimal algorithm for this problem on the Kautz network. Modern ccNUMA systems represent the merger of easy-to-build distributed memory systems and easy-to-program SMP systems. In their paper
"Reactive Proxies: a Flexible Protocol Extension to Reduce ccNUMA Node Controller Contention", S. A. M. Talbot and P. H. J. Kelly propose and evaluate a technique for distributing the contention that arises naturally in ccNUMA systems when processes share a data structure. The approach is novel in that the degree of distribution is driven by the degree of contention, and well-behaved applications do not suffer because of the protocol. The flip side of performance is always reliability. T. Skeie presents techniques for "Handling Multiple Faults in Wormhole Mesh Networks". The infrastructure for large distributed systems needs to provide both performance and reliability; the presence of multiple paths between nodes on a mesh provides both, and in this case fault tolerance is achieved without dramatic losses in performance. In another instance of novel approaches in modern computer architecture, "Shared Control - Supporting Control Parallelism using a SIMD-like Architecture", N. Abu-Ghazaleh and P. Wilsey provide mechanisms for supporting control parallelism in SIMD systems. The aim is to provide flexibility in conditional and similar computation that does not require the globally redundant computation necessary in traditional SIMD systems. The members of the workshop committees wish to extend their sincere thanks to those who acted as reviewers during the selection process for their invaluable help, and to all those who by their submissions have contributed to the success of this workshop.
Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP

Stephen R. Donaldson¹, Jonathan M.D. Hill¹, and David B. Skillicorn²

¹ Oxford University Computing Laboratory, UK
² CISC, Queen's University, Canada
Abstract. The BSP cost model measures the cost of communication using a single architectural parameter, g, which measures the permeability of the network to continuous traffic. Architectures such as networks of workstations pose particular problems for high-performance communication because it is hard to achieve high throughput, and even harder to do so predictably. Yet both of these are required for BSP to be effective. We present a technique for controlling applied communication load that achieves both. Traffic is presented to the communication network at a rate chosen to maximise throughput and minimise its variance. Performance improvements as large as a factor of two over MPI can be achieved.
1 Introduction
The BSP (Bulk Synchronous Parallel) model [10, 8] views a parallel machine as a set of processor-memory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP calculation consists of a sequence of supersteps. Each superstep involves all of the processors and consists of three phases: (1) processor-memory pairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processors' memories; and (3) all processors synchronise. The BSP cost model treats communication as an aggregate operation of the entire executing architecture, and models the cost of delivery using a single architectural parameter, the permeability, g. This parameter can be intuitively understood as the time taken for a processor to communicate a single word to a remote processor, in the steady state where all processors are simultaneously communicating. The value of the g parameter depends upon: (1) the bisection bandwidth of the communication network topology; (2) the protocols used to interface with and within the communication network; (3) buffer management by both the processors and the communication network; and (4) the routing strategy used in the communication network. However, the BSP runtime system also makes an important contribution to performance by acting to improve the effective value of g through the way it uses the architectural facilities.
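For reference, the cost of a single superstep in this model is usually written as follows (a standard BSP formula, not quoted verbatim from this paper), where w_i is the local computation performed by processor i, h is the largest number of words sent or received by any processor, and l is the cost of the barrier synchronisation:

\[ \mathrm{cost} \;=\; \max_{0 \le i < p} w_i \;+\; h \cdot g \;+\; l \]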
For example, [6] shows how orders-of-magnitude improvements in g can be obtained, for architectures using point-to-point connections, by packing messages before transmission and by altering the order of transmission to avoid contention at receivers. In this paper, we address the problem raised by shared-media networks and protocols such as TCP/IP, where there is far greater potential to waste bandwidth. For example, if two processors try to send more or less simultaneously, a collision on the ether means that neither succeeds, and transmission capacity is permanently lost. The problem is compounded because it is hard for each processor to learn anything of the global state of the network. Nevertheless, as we shall show, significant performance improvements are possible. We describe techniques, specific to our implementation of BSPlib [5], that ensure that the variation in g is minimised for programs running over bus-based Ethernet networks. Compared to alternative communication libraries such as Argonne's implementation of MPI [2], these techniques not only give an absolute improvement in mean communication throughput, but also a considerably smaller standard deviation. Good performance over such networks is of practical importance because networks of workstations are increasingly used as practical parallel computers.
2 Minimising g in Bus-Based Ethernet Networks
Ethernet (IEEE 802.3) is a bus-based protocol in which the medium access protocol, 1-persistent CSMA/CD (Carrier Sense Multiple Access with Collision Detection), proceeds as follows. A station wishing to send a frame listens to the medium for transmission activity by another station. If no activity is sensed, the station begins transmission and continues to listen on the channel for a collision. After twice the propagation delay, 2τ, of the medium, no collision can occur, as all stations sensing the medium would detect that it is in use and will not send data. However, a collision may occur during the 2τ window. On detection, the transmitting station broadcasts a jamming signal onto the network to ensure that all stations are notified of the collision. The station recovers from a collision by using a binary exponential back-off algorithm that re-attempts the transmission after t × 2τ, where t is a random variable chosen uniformly from the interval [0, 2^k] and k is the number of collisions this attempted transmission has experienced. For Ethernet, the protocol allows k to reach ten, and then allows another six attempts at k = 10 (see, for example, King [7]). Analysis of this protocol (Tasaka [9]) shows that S → 0 as G → ∞ (where S is the rate of successful transmissions and G the rate at which messages are presented for delivery), whereas for a p-processor BSP computer one would expect that S → B as G → ∞, from which one could conclude that g = p/B, where B is a measure of the bandwidth. In the case of BSP computation over Ethernet, the effect of the exponential backoff is exaggerated (larger delays for the same amount of traffic) because access to the medium is often synchronised by the previous barrier synchronisation and the subsequent computation phase.
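The back-off rule just described can be sketched as follows; this is a simplified illustration based only on the description above (real 802.3 hardware works in discrete slot times), and the function name and return convention are hypothetical.

```python
import random

def backoff_delay(attempt, tau):
    """Binary exponential back-off as described in the text: after the k-th
    collision wait t * 2*tau, with t drawn uniformly from [0, 2**k];
    k is capped at 10, and the frame is dropped after 16 attempts."""
    if attempt > 16:
        raise RuntimeError("excessive collisions: frame dropped")
    k = min(attempt, 10)
    t = random.uniform(0, 2 ** k)
    return t * 2 * tau

# Expected delay grows roughly as 2**k * tau until the cap at k = 10.
print([round(backoff_delay(a, tau=25.65e-6), 6) for a in (1, 5, 10, 16)])
```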
Fig. 1. Schematic of aggregated protocol layers and associated applied loads (BSPlib applies load T to the transport layer (TCP); load G and successful traffic S refer to the cable)
For perfectly balanced computations and barrier synchronisations, all processors attempt to get their first message onto the Ethernet at the same time. All fail and back off. In the first phase of the exponential back-off algorithm, each of the p processors chooses a uniformly distributed wait period in the interval [0, 4τ]. Thus the expected number of processors attempting retransmission in the interval [0, 2τ] is p/2, making secondary collisions very likely. If the processors are not perfectly balanced, and a processor gains access to the medium after a short contention period, then that processor will hold the medium for the transmission of the packet, which takes just over 1000 µs for 10 Mbps Ethernet. With high probability, many of the other processors will be synchronised by this successful transmission due to the 1-persistence of the protocol. The remaining processors will then contend as in the perfectly balanced scenario. In terms of the performance model, this corresponds to a high applied load, G, albeit for a short interval of time. If S were (at least) linear in G then this burstiness of the applied load would not be detrimental to the throughput and would average out.
Fig. 2. Plot of applied load (G) against successful transmissions (S)

Fig. 3. Contention expectation for a particular slot as a function of p
Fortunately, the BSP model allows assumptions to be made at the global level based on local data presented for transmission. At the end of a superstep, and before any user data is communicated, BSPlib performs a reduction in which all processors determine the amount of communication that each processor intends to send. From this, the number of processors involved in the communication is determined. For the rest of the communication, the number of processors involved and the amount of data are used to regulate (at the transport level) the rate at which data is presented for transmission on the Ethernet. By using BSD socket options (TCP_NODELAY), the data presented at the transport layer are delivered immediately to the MAC layer (ignoring the depth of the protocol stack and the availability of a suitable window size). Thus, by pacing the transport layer, pacing can be achieved at the MAC or link layer. This has the effect of removing burstiness from the applied load. Most performance analyses give a model of random-access broadcast networks which provides an analytic, often approximate, result for the successful traffic, S, in terms of the offered load, G. Hammond and O'Reilly [3] present a model for slotted 1-persistent CSMA/CD in which the successful traffic, S (or efficiency achieved), can be determined in terms of the offered load G, the end-to-end propagation delay τ (bounded by 25.65 µs, i.e., the time taken for a signal to propagate 2500 m of cable and 4 repeaters), and E, the frame transmit time (which for 10 Mbps Ethernet with a maximum frame size of 1500 bytes is approximately 1200 µs). Figure 2 shows the predicted rate of successful traffic against applied load, assuming that the jamming time is equal to τ. Since both S, the rate of successful transmissions, and G are normalised with respect to E, S is also the channel efficiency achieved on the cable. T, shown in Figure 1, also normalised with respect to E, is the load applied by BSPlib on the transport layer. Our objective is to pace the injection of messages into the transport layer such that T, on average, is equal to a steady-state value of S without much variance. The value of T determines the position on the S-G curve of Figure 2 in a steady state; in particular, T can be chosen to maximise S. If the applied load is to the right of the maximum throughput in Figure 2, then small increases in the mean load lead to a decrease in channel efficiency, which in turn increases the backlog in terms of retries and further increases the load. Working to the right of the maximum therefore exposes the system to these instabilities, which manifest themselves in variances in the communication bandwidth, a metric we try to minimise [4, 1]. In contrast, when working to the left of the maximum, small increases in the applied load are accompanied by increases in the channel efficiency, which helps cope with the increased load, and therefore instabilities are unlikely. As an aside, the Ethernet exponential backoff handles the instabilities towards the right by rescheduling failed transmissions further and further into the future, which decreases the applied load. In BSPlib, the mechanism of pacing the transport layer is achieved by using a form of statistical time-division multiplexing that works as follows. The frame size and the number of processors involved in the communication are known. As the processors' clocks are not necessarily synchronised, it is not possible to allow
the processors access in accordance with some permutation, a technique applied successfully in more tightly-coupled architectures [6]. Thus the processors choose a slot, q, uniformly at random in the interval [0 .. Q − 1] (where Q is the number of processors communicating at the end of a particular superstep), and schedule their transmission for this slot. The choice of a random slot is important if the clocks are not synchronised, as it ensures that the processors do not repeatedly choose a bad communication schedule. Each processor waits for time qε after the start of the cycle, where ε is a slot time, before passing another packet to the transport layer. The length of the slot, ε, is chosen based on the maximum time that the slot can occupy the physical medium, and takes into account collisions that might occur when good throughput is being achieved. The mechanism is designed to allow the medium to operate at the steady state that achieves a high throughput. Since the burstiness of communication has been smoothed by this slotting protocol, the erratic behaviour of the low-level protocol is avoided, and a high utilisation of the medium is ensured. An alternative protocol, not considered here, would be to implement a deterministic token bus protocol in which each station can only send data whilst holding a "token". This scheme was not considered viable as it is inefficient for small amounts of traffic due to the need for the explicit token pass when the token-holding processor has no message for the "next to-go" processor. In the worst case this would double the communication time. Also, token mechanisms protect shared resources such as a single Ethernet bus; however, a network may be partitioned into several independent segments, or processors may be connected via a switch. In this case, the token bus protocol would ensure only a single processor has access to the medium at any time, therefore wasting bandwidth. In contrast, the parameters used in the slotting mechanism can be trivially adjusted to take advantage of a switch-based medium. For example, for a full-duplex cross-bar switch, Q can be assumed to be 1 (Q = 2 for half-duplex), and ε encapsulates the rate at which the switch and protocol stacks of sender and receiver can absorb messages in a steady state. If the back-plane capacity of the switch is less than the capacity of the sum of the links, then Q and ε can be adjusted accordingly. Therefore, the randomised slotting mechanism is superior to a deterministic token bus scheme, as Q and ε can be used to model a large variety of LAN interconnects.
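The essence of this pacing can be sketched in a few lines of C. The fragment below is only an illustration under assumed names (paced_send, slot_us and pkt are not part of BSPlib): it sets the TCP_NODELAY option mentioned above so that each write is handed down immediately, and it delays each packet by a randomly chosen multiple of the slot time ε within a cycle of Q slots.

#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Pace packets onto the transport layer: one packet per cycle of Q slots,
 * transmitted in a uniformly random slot q, i.e. q*eps after the cycle start.
 * Illustrative sketch only, not the BSPlib implementation.                   */
static void paced_send(int sock, const char *buf, size_t nbytes,
                       int Q, useconds_t slot_us, size_t pkt)
{
    int one = 1;
    /* Hand data to the MAC layer immediately (disable Nagle's algorithm). */
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    for (size_t off = 0; off < nbytes; off += pkt) {
        int q = rand() % Q;                         /* slot in [0 .. Q-1] */
        usleep((useconds_t)q * slot_us);            /* wait q*eps         */
        size_t len = nbytes - off < pkt ? nbytes - off : pkt;
        write(sock, buf + off, len);                /* one packet         */
        usleep((useconds_t)(Q - q) * slot_us);      /* finish the cycle   */
    }
}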
3 Determining the Value of ε
In any steady state, T = S because, if this were not the case, then either unbounded capacity for the protocol stacks would be required, or the stacks would dispense packets faster than they arrive, and hence contradict the steady-state assumption. Since ε is the slot time, packets are delivered with a mean rate of 1/ε packets per μs. Normalising this with respect to the frame size E gives a value for T = E/ε packets per unit frame-time. We therefore choose an S value from the curve and infer a value of the slot size ε = E/S, as S = T in a steady state. Choosing a value of S = 80% and E = 1200μs gives a slot size of 1500μs.
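As a quick numerical check of this relation (purely illustrative; the values are the ones quoted above), the slot size follows directly from the chosen efficiency:

#include <stdio.h>

int main(void)
{
    double E = 1200.0;  /* frame transmit time in microseconds (1500 bytes at 10 Mbps) */
    double S = 0.80;    /* target channel efficiency taken from the S-G curve          */
    printf("slot size = %.0f us\n", E / S);  /* prints 1500 us, since T = S and T = E/eps */
    return 0;
}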
Fig. 4. Delivery time as a function of slot time for a cyclic shift of 25,000 words per processor, p = 2, 4, 6, 8, data for mpich shown at 1500 although slots are not used
In practice, while the maximum possible value for τ is known, the end-to-end propagation delay of the particular network segment is not, and this influences the slot size via the contention interval modelled in Figure 2. The analytic model assumes a Poisson arrival process, whereas for a finite number of stations the arrival process is defined by independent Bernoulli trials (the limiting case of this process, as the number of processors increases, is Poisson, and approximates the finite case after the number of processors reaches about 20 [3]). More complicated topologies could also be considered where more than one segment is used. The slot size ε can be determined empirically by running trials in which the slot size is varied and its effect on throughput measured. The experiments involved a 10Mbps Ethernet networked collection of workstations. Each workstation is a 266MHz Pentium Pro processor with 64MB of memory running Solaris 2. The experiments were carried out using the TCP/IP implementation of BSPlib. The machines and network were dedicated to the experiment, although the Ethernet segment was occasionally used for other traffic as it was a subset of a teaching facility. Figure 4(a) to Figure 4(d) plot the time it takes to realise a cyclic-shift communication pattern (each processor bsp_hpputs a 25,000 word message into the memory of the processor to its right) for various slot sizes (ε ∈ [0, 2000]) and for 2, 4, 6 and 8 processors. The figures show the delivery time as a function
[Figure 5 panels: (a) p = 4, mean per slot-size; (b) p = 4, standard deviation per slot-size; (c) p = 8, mean per slot-size; (d) p = 8, standard deviation per slot-size.]
Fig. 5. Mean and standard deviation of delivery times of data from Figures 4(a,b)
of slot size, oversampled 10 times. The horizontal line towards the bottom of each graph gives the minimum possible delivery time based on bits transmitted divided by theoretical bandwidth. Results for an MPI implementation of the same algorithm running on top of the Argonne implementation of MPI (mpich) [2] are also shown on these graphs. In this case, the data is presented at a slot size of 1500 μs (even though mpich does not slot), so only one (oversampled) measurement is shown. The dotted horizontal line in the centre of these figures is the mean delivery time of the MPI implementation. The BSP slot time should be chosen to minimise the mean delivery time. Choosing a small slot time gives some good delivery times, but the scatter is large. In practice, a good choice is the smallest slot time for which the scatter is small. For p = 2 this is 1200 μs, for p = 4 it is 1450 μs, for p = 6 it is 1650 μs, and for p = 8 it is 1700 μs. Notice that these points do not necessarily provide the minimum delivery times, but they provide the best combination of small delivery times and small variance in these times. Figure 4(b) shows a particularly interesting case at ε = 1500, as both the mean transfer rate and standard deviation of the BSPlib benchmark is much smaller than those of the corresponding mpich program. This slot-size can be clearly seen in Figure 5(c) and Figure 5(d) where the scatter caused by the
Fig. 6. Delivery time as a function of slot time for a cyclic shift of 8,300 words per processor, p = 4, 8, data for mpich shown at 1500 although slots are not used
oversampling at each slot size in Figure 4(b) has been removed by only displaying the mean and standard deviation of the oversampling. In contrast, the mean and largest outlier of the mpich program in Figure 4(d) is clearly lower than the corresponding BSPlib program when a slot size of 1500 is used. For larger configurations, the slot size that gives the best behaviour increases, and the mean value of g for BSPlib quickly becomes worse than that for mpich. An increase in the best choice of slot size from Figure 4(a) to Figure 4(d) should be expected, as the probability P(n) of n processors choosing a particular slot is binomially distributed. Thus as p increases, so does the expectation E{X ≥ 2} of the amount of contention for the slot, where
$$P(n) = \binom{p}{n}\left(\frac{1}{p}\right)^{n}\left(1 - \frac{1}{p}\right)^{p-n} \qquad\text{and}\qquad E\{X \ge 2\} = \sum_{i=2}^{p} i\,P(i).$$
Figure 3 shows that for p ≈ 20 and greater, the dependence on p is minimal, and therefore the increase in slot size reaches a fixed point. Below twenty processors the dependence varies by at most 26%. The limit as p → ∞ gives E{X ≥ 2} = 1 − 1/e ≈ 0.63, as shown in the figure. The same is true of the probability of contention, but the range is very small: from 0.25 at p = 2 to 1 − 2/e ≈ 0.26 as p → ∞.
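The two limits can be checked numerically with the short program below (an illustrative calculation only, not part of BSPlib); it evaluates P(n) and sums the contention expectation and probability for growing p:

#include <stdio.h>
#include <math.h>

/* Probability that exactly n of the p processors choose one given slot. */
static double P(int n, int p)
{
    double c = 1.0;
    for (int i = 0; i < n; i++)            /* binomial coefficient C(p, n) */
        c = c * (p - i) / (i + 1);
    return c * pow(1.0 / p, n) * pow(1.0 - 1.0 / p, p - n);
}

int main(void)
{
    for (int p = 2; p <= 64; p *= 2) {
        double ex = 0.0, pr = 0.0;
        for (int n = 2; n <= p; n++) {
            ex += n * P(n, p);             /* E{X >= 2} = sum_{i>=2} i P(i) */
            pr += P(n, p);                 /* Pr{X >= 2}                    */
        }
        printf("p=%2d  E{X>=2}=%.3f  Pr{X>=2}=%.3f\n", p, ex, pr);
    }
    return 0;   /* values approach 1 - 1/e = 0.632 and 1 - 2/e = 0.264 */
}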
In the mpich implementation [2] of MPI, large communications are presented to the socket layer in a single unit. However, in the BSPlib implementation all communications are split into packets containing at most 1418 bytes, so that we can pace the submission of packets using slotting. For this benchmark, each BSPlib process sends 71 small packets in contrast to mpich's single large message. Therefore, when p is small we would expect BSPlib to perform worse than mpich due to the extra passes through the protocol stack, and for larger values of p we would expect that the benefits of slotting out-weigh the extra passes through the protocol stack. Figures 4(a) to 4(d) show an opposite trend. As can be seen from Figures 4(b)-(d), as p increases, there is a noticeable "hump" in the data as the slot size increases. This phenomenon is not explained
[Figure 7 panels: (a) p = 8, mean per slot-size; (b) p = 8, standard deviation per slot-size.]
Fig. 7. Mean and standard deviation of delivery times of data from Figure 6(b)
by the discussion above. The problem arises because we are modelling the communication as though it were directly accessing the Ethernet, without taking into account the TCP/IP stack. What we are observing is the TCP acknowledgement packets, which interfere with data traffic as they are not controlled by our slotting mechanism. The effect of this is to increase the optimum slot size to a value that ensures that there is enough extra bandwidth on the medium such that the extra acknowledgement packets do not negatively impact the transmission of data packets. Implementations of TCP use a delayed acknowledgement scheme where multiple packets can be acknowledged by a single acknowledgement transmission. To minimise delays, a 200ms timeout timer is set when TCP receives data [11]. If during this 200ms period data is sent in the reverse direction then the pending acknowledgement is piggy-backed onto this data packet, acknowledging all data received since the timer was set. If the timer expires, the data received up to that point is acknowledged in a packet without a payload (a 512 bit packet). In the benchmark program that determines the optimal slot size, a cyclic shift communication pattern is used. When p > 2 there is no reverse traffic during the data exchange upon which to piggy-back acknowledgements. If the entire communication takes less than 200ms then only p acknowledgement packets will be generated for each superstep; as the total time exceeds 200ms, considerably more acknowledgement packets are generated. In Figure 4(a) the communication takes approximately 200ms and a minimal number of acknowledgements are generated, as can be seen by the lack of a hump. In Figures 4(b)-(d), the size of the humps increases in line with the increased number of acknowledgements. The mpich program does not suffer as severely from this artifact as BSPlib. When slotting is not used (for example in mpich) there is potential for a rapid injection of packets onto the network by a single processor for a single destination, which means that it is likely that more packets arrive at their destination before the delayed acknowledgement timer expires. This reduces the number of acknowledgement packets. When slotting is used, packets are paced onto the network with a mean inter-packet time between the same source-destination pair of pε. This drasti-
cally decreases the possibility of accumulated delayed acknowledgements. For example, in Figure 4(c), as the total time for communication is approximately 800ms, and as the slot size steadily increases, the number of acknowledgements increases. This in turn steadily increases the standard deviation and mean of the communication time. From the figure it can be seen that this suddenly drops off when the slot size becomes large, as the probability of collision decreases due to the under-utilisation of the network. The global nature of BSP communication means that data acknowledgement and error recovery can be provided at the superstep level as opposed to the packet-by-packet basis of TCP/IP. By moving to UDP/IP, we can implement acknowledgements and error recovery within the framework of slotting. This lower-level communication scheme is under development, although the hypothesis that it is the acknowledgements limiting the scalability of slotting can be tested by performing a benchmark on a dataset size that requires a total communication time that is less than 200ms. Figures 6(a)-(d) show the slotting benchmark for an 8333-relation where there are no obvious humps. In all configurations the mean and standard deviations of the BSPlib results are considerably smaller than mpich. Also, as can be seen from Figure 7, the optimal slot size at p = 8 is approximately 1200μs.
4 Conclusions
We have addressed the ability of the BSP runtime system to improve the performance of shared-media systems using TCP/IP. Using BSP's global perspective on communication allows each processor to pace its transmission to maximise throughput of the system as a whole. We show a significant improvement over MPI on the same problem. The approach provides high throughput, but also stable throughput, because the standard deviation of delivery times is small. This maintains the accuracy of the cost model, and ensures the scalability of systems. Acknowledgements. The work of Jonathan Hill was supported in part by the EPSRC Portable Software Tools for Parallel Architectures Initiative, as Research Grant GR/K40765 "A BSP Programming Environment", October 1995-September 1998. David Skillicorn is supported in part by the Natural Science and Engineering Research Council of Canada.
References 1. S. R. Donaldson, J. M. D. Hill, and D. B. Skillicorn. Communication performance optimisation requires minimising variance. In High Performance Computing and Networking (HPCN'98), Amsterdam, April 1998. 2. W. Gropp and E. Lusk. A high-performance MPI implementation on a shared-memory vector supercomputer. Parallel Computing, 22(11):1513-1526, Jan. 1997.
3. J. L. Hammond and P. J. P. O'Reilly. Performance Analysis of Local Computer Networks. Addison-Wesley, 1987. 4. J. M. D. Hill, S. Donaldson, and D. B. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Oxford University Computing Laboratory, October 1997. 5. J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP Programming Library. Parallel Computing, to appear 1998. See www.bsp-worldwide.org for more details. 6. J. M. D. Hill and D. B. Skillicorn. Lessons learned from implementing BSP. Journal of Future Generation Computer Systems, 13(4-5):327-335, April 1998. 7. P. J. B. King. Computer and Communication Systems Performance Modelling. International Series in Computer Science. Prentice Hall, 1990. 8. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249-274, Fall 1997. 9. S. Tasaka. Performance Analysis of Multiple Access Protocols. Computer Systems Series. MIT Press, 1986. 10. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990. 11. G. R. Wright and W. R. Stevens. TCP/IP Illustrated, Volume 2. Addison-Wesley, 1995.
Adaptive Routing Based on Deadlock Recovery Nidhi Agrawal 1 and C.P. Ravikumar 2 1 Hughes Software Systems, Sector 18, Electronic City, Gurgaon, INDIA 2 Dept. of Elec. Engg., Indian Institute of Technology, New Delhi, INDIA
Abstract. This paper presents a deadlock-recovery based fully adaptive routing for any interconnection network topology. The routing is simple, adaptive and is based on calculating the probabilities of routing at each node to neighbors, depending upon the static and dynamic conditions of the network. The probability of routing to the i-th neighbor at any node is a function of the traffic and the distance from the neighbor to the destination. Since with our routing algorithm deadlocks are rare, deadlock recovery is a better solution. We also propose here two deadlock recovery schemes. Since deadlocks occur due to cyclic dependencies, these cycles are broken by allowing one of the messages involved in the deadlock to take an alternate path consisting of buffers reserved for such messages. These buffers can be centralized buffers accessible to all neighboring nodes, or can be a set of virtual channels. The performance of our algorithm is compared with other recently proposed deadlock recovery schemes. The 2-Phase routing is found to be superior compared to the other schemes in terms of network throughput and mean delay.
1 Introduction
Interprocessor communication is the bottleneck in achieving high performance in Message Passing Processors (MPPs). Various routing algorithms exist in the literature for wormhole-routed [4] interconnection networks. Most routing algorithms achieve deadlock-freedom by restricting the routing; for example, the Turn Model [5] router for hypercubes and meshes ensures deadlock-freedom by prohibiting certain turns in the routing algorithm. Many routing techniques achieve deadlock-freedom and adaptivity through the use of virtual channels [3,4]. Not only do virtual channels increase hardware complexity, they also reduce the speed of the router. Chien [2] has shown that virtual channels adversely affect the router performance, and multiplexing more than 2 to 3 virtual channels over one physical channel results in poor router performance. From the discussion above, it is clear that achieving deadlock-freedom and adaptivity at the cost of increased router complexity and increased delays is not a good solution. In this paper we propose fully adaptive routing based on deadlock recovery. The algorithm breaks deadlocks without adding significant hardware. We introduce the notion of a Routing Probability at each node. P_i(j, k) denotes the probability of routing a message at node i to a neighboring node k, destined for j. Entirely chaotic routing can be achieved by selecting P_i(j, k) = 1/d_i. On
the other hand, a deterministic routing algorithm can be modeled by selecting P_i(j, l) = 1 for some 1 ≤ l ≤ d_i, and P_i(j, k) = 0 for all k ≠ l; here d_i is the degree of node i. In this paper we propose a simple heuristic to calculate the routing probabilities at any node depending on the distance from the destination and/or congestion at the neighboring node. Our algorithm dynamically computes the probabilities, thus providing adaptivity. A message is declared to be deadlocked at any node i if the message header has to wait at node i for more than a predefined time called TimeOut. Since deadlocks are infrequent, it is more logical to recover from deadlocks rather than avoiding them. Various schemes for deadlock recovery are available in the literature [1, 6]. Compressionless routing [6] is a deadlock recovery scheme based on killing the deadlocked packets and transmitting again. This scheme requires padding flits to be attached with the message when the message length is less than the network diameter. Moreover, messages that are killed and rerouted suffer large delays. A deadlock recovery scheme called Disha [1] uses a single flit buffer at each node. The disadvantage of this scheme is that at any time only one message can be routed onto deadlock-free buffers, thus we can recover from only one deadlock at a time; furthermore the token-based mutual exclusion on the central virtual network is implemented by adding additional hardware. We propose here two schemes for deadlock recovery. Our first scheme makes use of multiple central buffers; on the other hand our second scheme, also called 2-Phase Routing, divides the network into two virtual networks. Both the schemes eliminate the disadvantages associated with the previously proposed recovery schemes. The rest of the paper is organized as follows: the next section describes the routing algorithm, Section 3 describes the proposed deadlock recovery schemes, the results of simulation are given in Section 4, and Section 5 concludes the paper.
2 Adaptive Routing
2.1 Algorithm
Various heuristics may be used to calculate the routing probabilities P_i(j, k) introduced in Section 1. In this section we propose a simple heuristic as shown below:
$$P_i(j,k) = \frac{1/W_k}{\sum_{k' \in Nbr(i)} 1/W_{k'}}, \qquad \forall k \in Nbr(i) \qquad (1)$$
Here Nbr(i) is the set of immediate neighbors of node i and W_k is a weight assigned to the neighbor k, which is a linear function of the distance δ(j, k) and the congestion at the neighboring node k, denoted as Q_k. The weight W_k is assigned as follows:
$$W_k = \alpha\,\delta(j,k) + \beta\,Q_k \qquad (2)$$
Each node is assumed to have the knowledge of the fault-status of its neighbors and the congestion at each neighboring node. The congestion at any node k, also denoted by Q_k, is the average amount of time a header has to spend at k, normalized against a timeout period TimeOut. To measure the value of Q_k, node k makes use of as many timers as there are neighbors. If a header waits for more than a TimeOut period, the message is declared to be deadlocked. For most regular interconnection networks, the normalized distance function δ() is easy to evaluate. The scaling factors α and β are non-negative real constants which dictate the nature of the routing algorithm; by selecting β = 0, we obtain distance-optimal routing. Similarly, by selecting α = 0, we obtain a hot-potato routing algorithm. We found optimum values of the ratio α/β, by simulation, for uniform and bit-reversal traffic (see Section 4).
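A router implementing Equations (1) and (2) only needs the distance and congestion estimates of its immediate neighbors. The sketch below shows one possible evaluation; the array layout and the fault handling are assumptions made here for illustration, and it presumes W_k > 0 and at least one healthy neighbor.

/* Compute P_i(j,k) for the d neighbors of node i from delta(j,k) (dist[])
 * and the normalized congestion Q_k (cong[]).  Faulty neighbors get 0.   */
static void routing_probs(int d, const double *dist, const double *cong,
                          const int *faulty, double alpha, double beta,
                          double *prob)
{
    double sum = 0.0;
    for (int k = 0; k < d; k++) {
        if (faulty[k]) { prob[k] = 0.0; continue; }
        double W = alpha * dist[k] + beta * cong[k];   /* Equation (2) */
        prob[k] = 1.0 / W;
        sum += prob[k];
    }
    for (int k = 0; k < d; k++)                        /* Equation (1) */
        if (!faulty[k])
            prob[k] /= sum;
}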
2.2 Detection of deadlocks
The selection of the TimeOut interval greatly affects the router performance. If the TimeOut period is very large, the deadlocked messages remain in the network and block many other messages; on the other hand, if the TimeOut period is small, false deadlocks are declared. Whenever node i receives a message header H, a timer T(i, H) starts counting the number of clock cycles the header has to wait before being forwarded. If for any header H, T(i, H) exceeds TimeOut at node i, then the header is declared to be deadlocked. The optimum TimeOut interval can be found by simulations (see Section 4). We also measured the frequency of deadlocks due to the proposed routing algorithm. It is observed that under maximum load less than 5% of messages are deadlocked for a 64-node hypercube and TimeOut = 16 clocks.
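The detection itself amounts to a per-header counter compared against TimeOut on every router clock; the following fragment is a minimal illustration of that idea (the structure and function names are invented here, not taken from the paper).

#include <stdbool.h>

#define TIMEOUT 16   /* clock cycles; the optimum value found in Section 4 */

struct header_wait { int cycles_waiting; };   /* T(i,H) for one blocked header */

/* Advance the timer of a blocked header by one clock; report a deadlock
 * once the waiting time exceeds TimeOut, so that recovery can be started. */
static bool tick_and_check(struct header_wait *w)
{
    w->cycles_waiting++;
    return w->cycles_waiting > TIMEOUT;
}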
3 Deadlock-Recovery
3.1 Recovery Scheme 1
After detecting a deadlock, the deadlock cycle is broken by switching one of the messages involved in the deadlock cycle to the central buffers kept aside at each node for routing deadlocked messages. Each node, say i, has K + 1 central buffers numbered 0, 1, 2, ..., K. These buffers are accessible by all neighboring nodes of i. The k-th central buffers of all the nodes form a central virtual network VN_k, 0 ≤ k ≤ K. On any central virtual network, only one message is allowed to travel at a time. Thus, at most K deadlocks can be recovered simultaneously. The permission to break deadlocks is given in a "round-robin" fashion to all the nodes. This may be implemented using K tokens corresponding to each central virtual network. Token i corresponds to the central virtual network VN_i. These tokens are nothing but packets of one flit size, which rotate on VN_0 along a cycle including all healthy nodes of the network. Each token carries the address of the next node in the cycle. Further details on token management and implementation are given below:
- Token synchronization: The tokens are synchronized with the router clock, i.e. each token remains at a node for one clock cycle while circulating. The worst
case waiting time to capture a token is (N × Avg.Dist.)/K clock cycles, where N is the number of nodes and Avg.Dist. is the average distance of the network.
- Capturing a token: When a node detects a deadlock and concurrently receives a token j along the central buffer 0 carrying the address of the node, the token is captured by the node.
- Regenerating a token: In order to make sure that messages do not get deadlocked on central virtual networks, a captured token j at a node N_i is regenerated after δ(N_i, D) clock cycles, ensuring that there is at most one packet on the network, where D is the destination node.
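The token rules of scheme 1 can be pictured as follows; the field names and the surrounding router state are hypothetical and serve only to illustrate the capture and regeneration conditions described above.

#include <stdbool.h>

struct token { int vn; int next_node; };   /* one-flit token of network VN_vn */

/* Called when a token arrives on central buffer 0 of node `me`.  It is
 * captured only if this node has detected a deadlock and the token carries
 * the node's own address; otherwise it is forwarded.  A captured token is
 * regenerated delta(me, D) cycles later (D = destination of the deadlocked
 * message), so at most one packet travels on that central virtual network. */
static bool on_token(struct token *t, int me, bool deadlocked,
                     int dist_to_dest, int *regen_countdown, int next_healthy)
{
    if (deadlocked && t->next_node == me) {
        *regen_countdown = dist_to_dest;
        return true;                       /* token captured                   */
    }
    t->next_node = next_healthy;           /* pass it to the next healthy node */
    return false;
}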
3.2 Recovery Scheme-2
The recovery scheme-2, also called 2-phase routing, discussed in this section provides simultaneous recovery from all the deadlocks. Each physical channel is divided into two virtual channels. The network can be viewed as two virtual networks: the Adaptive Virtual Network (AVN) and the Deadlock-free Virtual Network (DVN). All the messages are initiated in the AVN. Routing is carried out without any restrictions following the routing algorithm of Section 2. Whenever any message gets deadlocked, the routing enters phase 2 and the message is switched over to the DVN. The routing procedure followed on the DVN can be any deadlock-free routing, for example, e-cube routing for hypercubes, X-Y routing and negative-first routing for meshes and k-ary n-cubes. We describe here briefly a deadlock-free, Hamiltonian-path based routing to route on the DVN, which is partially adaptive and works for various topologies like hypercubes, k-ary n-cubes, star graphs, meshes etc. The nodes of the DVN are numbered according to a Hamiltonian number denoted as Hamilt(i). The DVN is further partitioned into two subnetworks, the Higher Virtual Subnetwork (HSN) and the Lower Virtual Subnetwork (LSN). In the HSN, there is an edge from any node i to any other node j iff Hamilt(i) < Hamilt(j). There is an edge from i to j in the lower network iff Hamilt(i) > Hamilt(j). Each of these subnetworks is acyclic, thus routing in each of these networks is deadlock-free [7]. The routing function R_1 (R_2) of Equation 3 (4) corresponds to routing in the LSN (HSN) from i to j, where E_L (E_H) is the edge set of the LSN (HSN).
$$R_1(i,j) = \{\,k : (i,k) \in E_L,\ Hamilt(i) > Hamilt(k) \ge Hamilt(j)\,\} \qquad (3)$$
$$R_2(i,j) = \{\,k : (i,k) \in E_H,\ Hamilt(i) < Hamilt(k) \le Hamilt(j)\,\} \qquad (4)$$
The routing functions defined by Equations 3 and 4 are deterministic. In a modification of the routing function of Equation 3, if we allow a message to switch from the LSN to the HSN when a faulty or congested node is encountered, without permitting the message to return to the LSN, the routing function R_1 continues to remain deadlock-free (see Theorem 1). A similar argument holds for the function R_2. The resulting partially adaptive routing is called Adapt-Hamilt. Theorem 1. The adaptive routing Adapt-Hamilt is deadlock-free.
Fig. 1. (A) HSN and LSN for a 2-dimensional hypercube (B) CDG for R_1 and R_2
Fig. 2. CDG for adaptive routing in a 2-d Hypercube
Proof. We have seen that the routing function R_1 of Equation 3 is deadlock-free, implying that the channel dependency graph (CDG) induced by this function is acyclic [4] (see Figure 1). The modified routing introduces additional edges in the CDG. These additional edges are of the form (x_h, y_l), as shown in Figure 2, where x_h is a node of the higher-network CDG and y_l is a node of the lower-network CDG. If the adaptive routing function Adapt-Hamilt allows the additional edges of the form (x_h, y_l), then edges of the form (y_l, x_h) are not permitted, so the CDG remains acyclic. Hence the theorem. □
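For concreteness, the deterministic part of the DVN routing (Equations 3 and 4) can be written as a simple next-hop selection; the arrays Hamilt[] and nbr[] are assumed data structures used only for illustration, not the authors' router design.

/* Pick a DVN next hop from node i towards node j: an LSN move (R1) if the
 * destination has a smaller Hamiltonian number, an HSN move (R2) otherwise. */
static int dvn_next_hop(int i, int j, const int *Hamilt,
                        const int *nbr, int deg)
{
    for (int a = 0; a < deg; a++) {
        int k = nbr[a];
        if (Hamilt[i] > Hamilt[j]) {                 /* R1 on the LSN */
            if (Hamilt[i] > Hamilt[k] && Hamilt[k] >= Hamilt[j])
                return k;
        } else {                                     /* R2 on the HSN */
            if (Hamilt[i] < Hamilt[k] && Hamilt[k] <= Hamilt[j])
                return k;
        }
    }
    return -1;   /* no admissible channel free at this moment */
}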
4 Performance Study
We conducted experiments to measure the performance of our adaptive routing with various deadlock recovery schemes on a 256-node hypercube. The performance metrics considered are the throughput and the mean delay. Throughput of the network is defined as the mean number of messages going out of the network per node per clock cycle. Mean delay is the mean number of clock cycles elapsed from the time the first flit of the message enters the network to the time the last flit leaves the network. It may be noted that throughput and latency are measured for various applied loads, thus throughput-latency graphs do not represent a function. Each node generates messages with a Poisson distribution. Two kinds of traffic patterns are generated: uniform traffic and bit-reversal traffic. In uniform traffic the destinations are generated uniformly, with each node being equally probable; on the other hand, in the bit-reversal traffic pattern the destination node address is obtained by reversing the bits of the source node address; for example, source s_1 s_2 ... s_n sends a message to the destination node D = s_n s_{n-1} ... s_2 s_1. Messages are 8 flits long. The results are taken for 4000 messages, out of which
the information for the first 2000 messages is discarded (warm-up period). It takes one clock cycle to transfer a flit across a physical channel. The header is assumed to be one flit long. The experiment is repeated several times by varying the seed and the average value is plotted.
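As an example of the traffic generation, the bit-reversal destination can be computed as below (a stand-alone illustration; n is the number of address bits, e.g. n = 8 for the 256-node hypercube).

/* Destination for bit-reversal traffic: source s1 s2 ... sn sends to
 * the node whose address is the bit-reversal sn ... s2 s1.            */
static unsigned bit_reverse(unsigned src, int n)
{
    unsigned dst = 0;
    for (int b = 0; b < n; b++)
        dst = (dst << 1) | ((src >> b) & 1u);
    return dst;
}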
4.1 Finding optimum TimeOut and α/β
For fine tuning the TimeOut period for each type of traffic, we conducted experiments. The network performance is measured for various TimeOut periods. Figure 3 shows the throughput versus latency curves for various TimeOut periods under the bit-reversal traffic pattern. It may be observed that latency increases slowly initially, up to a throughput value of 0.02 messages per node per clock cycle, and then there is a sharp increase. Since the highest throughput is achieved with TimeOut = 16, it is chosen as the optimum value. Similarly, for uniform traffic the optimum TimeOut value is also found to be 16. The plot is not shown due to the lack of space. For both types of traffic model, we found the optimum value of the ratio α/β. Figure 4 shows the performance of the routing algorithm for various values of α/β for uniform traffic. The optimum value of α/β is the value which gives the highest peak performance. The throughput reaches its maximum value up to a certain applied load and then starts reducing; this is the unstable condition of the network. From Figure 4 it can be noticed that the optimum value is α/β = 10, because for this value the highest throughput is achieved before saturation. Similarly, for bit-reversal traffic the optimum value of α/β is found to be 5, as shown in Figure 5.
4.2 Comparison of various schemes
We compared the performance of the proposed routing algorithm for various deadlock recovery schemes. Both the proposed recovery schemes are compared with the existing deadlock recovery scheme Disha. Figure 6 shows a comparison of the three recovery schemes under uniform traffic conditions. The throughput value saturates at 10% with both Disha and scheme-1, and at 21% with 2-phase routing. It is also observed that by adding more and more central virtual buffers (or tokens) the performance of scheme-1 improves. In Figure 6, the performance of scheme-1 is measured for 3 central virtual buffers at each node. We analyze the performance of the proposed routing in the presence of node faults. It is observed that there is a negligible degradation in the performance of the routing for various fault conditions. The plots are not shown here due to the lack of space. The performance is also compared for the three recovery schemes under bit-reversal traffic. Figure 7 shows the comparison. It may be observed that the recovery scheme-1 outperforms the recovery scheme Disha, and the 2-phase routing performs far better than both scheme-1 and Disha. Figure 7 shows that after saturation the mean delay increases sharply; however, the throughput value does not increase. In this case saturation occurs only at 6%.
5 Conclusion
We have presented a fully adaptive, restriction-free routing algorithm. Since deadlocks are rare, we considered deadlock recovery instead of traditional deadlock avoidance. We have presented two deadlock recovery schemes, the recovery scheme-1 and 2-phase routing. The recovery scheme-1 provides simultaneous recovery from K deadlocks, where K is the number of central buffers at each node; on the other hand, 2-phase routing provides simultaneous recovery from all the deadlocks. The performance of both the proposed recovery schemes is compared with the recovery scheme Disha [1]. It is observed that the recovery scheme-1 outperforms the recovery procedure of [1] under both uniform and bit-reversal traffic. Also, the 2-phase routing performs far better than both Disha and the recovery scheme-1.
References 1. K.V. Anjan and T.M. Pinkston, "Disha: A deadlock Recovery Scheme for Fully Adaptive Routing", Proc. of the 19th Int. Parallel Processing Symposium, 1995. 2. A.A. Chien, "A Cost and Speed Model for k-ary n-cube Wormhole Routers", IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 2, pages 150-162, 1998. 3. W. Dally and H. Aoki, "Deadlock-free Adaptive Routing in Multicomputer Networks using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, 1993, pages 466-475. 4. W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks", IEEE Transactions on Computers, C-36(5), 1987, pages 547-553. 5. C.J. Glass and L.M. Ni, "The turn model for Adaptive routing", Proc. of the 19th Int. Symp. on Comp. Architecture, 1992, pages 278-287. 6. J. Kim, Z. Liu and A. Chien, "Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing", IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 3, pages 229-244, 1997. 7. X. Lin, P.K. McKinley and L.M. Ni, "Deadlock-free multicast wormhole routing in 2D mesh multicomputers", IEEE Trans. on Parallel and Distributed Systems, Vol. 5, No. 8, 1994, pages 793-804.
Fig. 3. Tuning the TimeOut interval for bit-reversal traffic
Fig. 4. Tuning α/β for uniform traffic
Fig. 5. Tuning α/β for bit-reversal traffic
Fig. 7. Throughput vs. mean delay for the three recovery schemes under bit-reversal traffic
Fig. 6. Throughput vs. mean delay for the three recovery schemes under uniform traffic
On the Optimal Network for Multicomputers: Torus or Hypercube? Mohamed Ould-Khaoua Department of Computer Science, University of Strathclyde, Glasgow G1 1XH, UK.
Abstract. This paper examines the relative performance merits of the torus and hypercube when adaptive routing is used. The comparative analysis takes into account channel bandwidth constraints imposed by VLSI and multiple-chip technology. This study concludes that it is the hypercube which exhibits the superior performance, and therefore is a better candidate as a high-performance network for future multicomputers with adaptive routing.
1 Introduction
The hypercube and torus are the most common instances of k-ary n-cubes [5]. The former has been used in early multicomputers [3, 10] while the latter has become popular in recent systems [4, 8]. This move towards the torus has been mainly influenced by Dally's study [5]. When systems are laid out on a VLSI chip, Dally has shown that under the constant wiring density constraint, the 2 and 3-D torus outperform the hypercube due to their higher bandwidth channels. Abraham [1] and Agrawal [2] have argued that the wiring density argument is applicable where a network is implemented on a VLSI chip, but not in situations where it is partitioned over many chips. In such circumstances, they have identified that the most critical bandwidth constraint is imposed by the chip pin-out. Both authors have concluded that it is the hypercube which exhibits better performance under this new constraint. Wormhole routing [11] has also promoted the use of high-diameter networks, like the torus, as it makes latency independent of the message distance in the absence of blocking. In wormhole routing, a message is broken into flits for transmission and flow control. The header flit governs the route, and the remaining data flits follow in a pipeline. If the header is blocked, the data flits are blocked in situ. Most previous comparative analyses of the torus and hypercube [1, 2, 5, 6] have used deterministic routing, where a message always uses the same network path between a given pair of nodes. Deterministic routing has been widely adopted in practice [3, 8, 10] because it is simple and deadlock-free. However, messages cannot use alternative paths to avoid congested channels. Fully-adaptive routing has often been suggested to overcome this limitation by enabling messages to explore all the available paths in the network. Duato [6] has recently
proposed a fully-adaptive routing, which achieves deadlock-freedom with minimal hardware requirements. The Cray T3E [4] is an example of a recent machine that uses Duato's routing algorithm. The torus continues to be a popular topology even in multicomputers which employ adaptive routing. However, before adaptive routing can be widely adopted in practical systems, it is necessary to determine which of the competing topologies are able to fully exploit its performance benefits. To this end, this paper re-assesses the relative performance merits of the torus and hypercube in the context of adaptive routing. The study compares the performance of the 2 and 3-D torus to that of the hypercube. The analysis uses Duato's fully-adaptive routing [6]. The present study uses queueing models developed in [9] to examine network performance under uniform traffic. Results presented in the next section reveal that it is the hypercube which provides the optimal performance under both the constant wiring density and pin-out constraints, and thus is the best candidate as a high-performance network for future multicomputers with fully-adaptive routing.
2 Performance Comparison
The torus has higher bandwidth channels than its hypercube counterpart under the constant wiring density and pin-out constraints. The detailed derivation of the exact relationship between the channel width of the torus in terms of that of the hypercube under both constraints can be found in [1, 2, 5]. The router's switch in the hypercube is larger than that in the torus due to its larger number of physical and virtual channels. As a consequence, the switching delay in the hypercube should be higher due to the additional complexity. Comparable switching delays in the two networks can be obtained if the routers have comparable switch sizes. This can be achieved by normalising the total number of channels in the routers of the two networks. When mapped onto the 2D plane, the hypercube ends up with longer wires, and thus higher wire delays, than the torus. However, delays due to long wires can be reduced by using pipelined channels as suggested in [12]. The performance of the networks is examined below both when the wire delay is taken into account and when it is ignored. For illustration, network sizes of N=64 and 1024 nodes are examined. The channel width in the hypercube is one bit. The channel width in the torus is normalised to that of the hypercube. The message length (M) is 128 flits. A physical channel in the hypercube has V=2 virtual channels. The total number of virtual channels per router in the torus is normalised to that in the hypercube. Figures 1(a) and (b) depict latency results in the torus and hypercube under the constant wiring density constraint and when the effects of wire delays are taken into account, in the 64 and 1024 node systems respectively. The figures reveal that the torus is able to exploit its wider channels to provide a lower latency than the hypercube under light to moderate traffic. However, as traffic increases its performance degrades as message blocking rises, offsetting any advantage of
Fig. 1. The performance of the torus and hypercube including the effects of wiring delays. (a) N=64, (b)N=1024.
Fig. 2. The performance of the torus and hypercube ignoring the effects of wiring delays. (a) N=64, (b) N=1024.
having wider channels. Figures 2(a) and (b) show latency when the effects of wire delays are ignored. The torus outperforms the hypercube under light to moderate traffic, but loses its edge to the hypercube under heavy traffic. The difference in performance between the two networks increases in favour of the hypercube for larger network sizes. Figures 1 and 2 together reveal an important finding about the torus, and that is, even though wires are longer in the hypercube, the torus is more sensitive to the effects of wire delays. This is because a message in the torus crosses, on average, a larger number of routers, and therefore requires a longer service time to reach its destination. Since the ratio in channel width in the hypercube and torus decreases under the constant pin-out constraint, we can conclude that the hypercube is even more favourable when the networks are subjected to this condition.
3 Conclusion
This paper has compared the performance merits of the torus and hypercube in the context of adaptive routing. The results have revealed that the hypercube has superior performance characteristics to the torus, and therefore is a better candidate as a high-performance network for future multicomputers that use adaptive routing.
References 1. S. Abraham, Issues in the architecture of direct interconnection networks schemes for multiprocessors, Ph.D. thesis, Univ. of Illinois at Urbana-Champaign (1991). 2. A. Agarwal, Limits on interconnection network performance, IEEE Trans. Parallel & Distributed Systems, Vol. 2(4), (1991) 398-412. 3. R. Arlauskas, iPSC/2 system: A second generation hypercube, Proc. 3rd ACM Conf. Hypercube Concurrent Computers and Applications, Vol. 1 (1988). 4. Cray Research Inc., The Cray T3E scalable parallel processing system, on Cray's web page at http://www.cray.com/PUBLIC/product-info/T3E/. 5. W.J. Dally, Performance analysis of k-ary n-cubes interconnection networks, IEEE Trans. Computers, Vol. 39(6) (1990) 775-785. 6. J. Duato, A new theory of deadlock-free adaptive routing in wormhole routing networks, IEEE Trans. Parallel & Distributed Systems, Vol. 4(12) (1993) 1320-1331. 7. J. Duato, M.P. Malumbres, Optimal topology for distributed shared-memory multiprocessors: hypercubes again?, Proc. EuroPar'96, Lyon, France, (1996) 205-212. 8. R.E. Kessler, J.L. Schwarzmeier, CRAY T3D: A new dimension for Cray Research, in CompCon, Spring (1993) 176-182. 9. M. Ould-Khaoua, An Analytical Model of Duato's Fully-Adaptive Routing Algorithm in k-Ary n-Cubes, to appear in the Proc. 27th Int. Conf. Parallel Processing, 1998. 10. C.L. Seitz, The Cosmic Cube, CACM, Vol. 28 (1985) 22-33. 11. C.L. Seitz, The hypercube communication chip, Dep. Comp. Sci., CalTech, Display File 5182:DF:85 (1985). 12. S.L. Scott, J.R. Goodman, The impact of pipelined channels on k-ary n-cube networks, IEEE Trans. Parallel & Distributed Systems, Vol. 5(1) (1994) 2-16.
Constant Thinning Protocol for Routing h-Relations in Complete Networks Anssi Kautonen 1, Ville Leppänen 2, and Martti Penttonen 1 1 University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 Joensuu, Finland, {Anssi.Kautonen, Martti.Penttonen}@cs.joensuu.fi
2 University of Turku, Department of Computer Science and Turku Centre for Computer Science, Lemminkäisenkatu 14 A, 20520 Turku, Finland, Ville.Leppanen@cs.utu.fi
Abstract. We propose a simple protocol, called the constant thinning protocol, for routing in a complete network under the OCPC assumption, analyze it, and compare it with some other routing protocols.
1 Introduction
Parallel programmers would welcome the "flat" shared memory, because it would make PRAM [15] style programming possible and thus make it easier to utilize the rich culture of parallel algorithms written for the PRAM model. As large memories with a large number of simultaneous accesses do not seem feasible, the only possible way to build a PRAM type parallel computer appears to be to build it from processors with local memories. The fine granularity of parallelism and global memory access, which makes the PRAM model so desirable for algorithm designers, sets very high demands for data communication. Fortunately, memory accesses at the rate of the processor clock are not necessary; due to the parallel slackness principle [16], latency in access does not imply inefficiency. It is enough to route an h-relation efficiently. By definition, in an h-relation there is a complete graph, where each node (processor) has at most h packets to send, and it is the target of at most h packets. We assume the OCPC (Optical Communication Parallel Computer) or 1-collision assumption [1]: if two or more packets arrive at a node simultaneously, all fail. An implementation of an h-relation is work-optimal at cost c if all packets arrive at their targets in time ch. The first attempt to implement an h-relation is to use a greedy routing algorithm. By the greedy principle, one tries to send packets as fast as one can. The fatal drawback of the greedy algorithm is the livelock: packets can cause mutual failure of sending until eternity. In a situation when each of two processors has one packet targeted to the same processor, they always fail. Another routing algorithm was proposed by Anderson and Miller [1], and it was improved by Valiant [15]. They realize work-optimally an h-relation for h ∈ Ω(log p), where p is the number of processors. Other algorithms with even lower latency were provided by [6, 7, 2, 3, 11]. Contrary to these, the h-relation
algorithm of Geréb-Graus and Tsantilas [5], GGT for short, has the advantage of being direct, i.e. the packets go to their targets directly, without intermediate nodes. For other results related to direct or indirect routing, see also [4, 14, 8, 9, 12, 13]. Actually our new algorithm was inspired by [13], although the latter deals with the continuous routing problem, not the h-relation.

proc GGT(h, ε, α)
    for i = 0 to log_{1/(1−ε)} h do
        Transmit((1 − ε)^i h, ε, α)

proc Transmit(h, ε, α)
    for all processors P pardo
        for ⌈(1/(1−ε))(εh + max{εh, 4α ln p})⌉ times do
            choose an unsent packet x at random
            attempt to send x with probability (unsent packets)/h
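To make the 1-collision assumption concrete, the following fragment simulates one time slot of this kind of randomized direct sending; the data layout is our own choice for illustration and does not reproduce the GGT bookkeeping.

/* One OCPC time slot: every processor i with sends[i] != 0 transmits one
 * packet to target[i]; a packet succeeds only if its target receives
 * exactly one packet in this slot, otherwise all arrivals there fail.    */
static int ocpc_slot(int p, const int *target, const int *sends,
                     int *arrivals /* scratch, length p */, int *success)
{
    int delivered = 0;
    for (int i = 0; i < p; i++) arrivals[i] = 0;
    for (int i = 0; i < p; i++)
        if (sends[i]) arrivals[target[i]]++;
    for (int i = 0; i < p; i++) {
        success[i] = sends[i] && arrivals[target[i]] == 1;
        delivered += success[i];
    }
    return delivered;
}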
2 Thinning Protocols
The throughput of the greedy routing of randomly addressed packets is characterized by
$$\left(1 - \frac{1}{h}\right)^{h-1} > \frac{1}{e},$$
where 1/h is the probability that one of the h − 1 competing processors is sending to the same processor at the same time. (For all x > 0, (1 − 1/x)^{x−1} > e^{−1}.) This would be the throughput if all processors would create and send a new randomly addressed packet, which is not the case in routing an h-relation. When failed packets are resent again and again, the risk of livelock grows. A solution is to decrease the sending probability. In the GGT algorithm, the sending probability of packets varies between 1 and 1 − ε (0 < ε < 1). The transmission of packets is thus 'thinned' by a factor of 1 to 1/(1 − ε), preventing the livelock. We now propose a very simple routing protocol, where thinning is more explicit.
proc CT(h, h0, t, t0)    % 1 ≤ h0 ≤ h, 1 ≤ t0 ≤ t
    for all processors do
        while packets remain do
            Transmit h randomly chosen packets (if so many remain)
                within time window [1 .. ⌈th⌉]
            h := max{(1 − e^{−1/t0}) h, h0}

A drawback of thinning is that a processor cannot successfully send a packet at those moments when it does not even try to send. Thinning by factor t would thus imply inefficiency by factor t. But this is somewhat balanced by the success probability, which increases from 1/e to e^{−1/t}.
The expected throughput with thinning is thus characterized by the function te^{1/t}, whose growth is very modest: for t = 1, 1.2, 1.5, 2.0 and 3.0 the values of te^{1/t} are 2.7, 2.8, 2.9, 3.3, and 4.2. Still, even small thinning greatly increases the robustness against livelock. For very small numbers of packets, constant thinning alone does not guarantee high success probability. For that reason, CT has a minimum size h0 ∈ Θ(log p) for the thinning window, to prevent repeated collisions.
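One way to read CT is as the following per-processor simulation step, which fixes concrete data structures of our own choosing and is therefore an illustration rather than the authors' implementation: packets are scheduled into random slots of the current window, and the window parameter then shrinks geometrically but never below h0.

#include <stdlib.h>
#include <math.h>

/* One CT round: schedule up to h of the remaining packets into random slots
 * of the window [1 .. ceil(t*h)], then return the shrunk window parameter
 * max{(1 - e^{-1/t0}) h, h0}.                                               */
static double ct_round(double h, double h0, double t, double t0,
                       int remaining, int *slot /* length >= remaining */)
{
    int window = (int)ceil(t * h);
    int tosend = remaining < (int)h ? remaining : (int)h;
    for (int i = 0; i < tosend; i++)
        slot[i] = 1 + rand() % window;     /* random slot in [1 .. ceil(t*h)] */
    return fmax((1.0 - exp(-1.0 / t0)) * h, h0);
}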
3 Analysis
The analyses for GGT and CT are rather similar. One can prove that
1. the number of unsent packets decreases geometrically from h to log p,
2. the rest of the packets can be routed in time O(log p log log p).

Theorem 1. For h ∈ Ω(log p log log p), CT routes any h-relation in time O(h) with high probability.

Proof. Consider a round of the while loop, where h' > k0 log p = h0 for a suitable constant k0. We shall show that if in the beginning of a round the routing task is an h'-relation, it will be a ch'-relation after the round with high probability. To ease the calculation of probabilities, we worsen the routing task a little by completing the initial h'-relation to a full h'-relation. We do this by adding 'dummy packets' with proper destinations to those processors not having initially h' packets. We assume that the dummy packets participate in routing as the other packets. In the analysis below, the dummy packets can collide with normal packets as well as with other dummy packets, but in the actual algorithm attempting to route a dummy packet corresponds to an unallocated time slot (unable to cause any collisions). Thus the number of successful normal packets is always better in the actual situation. We first show that, with high probability, the number of packets decreases to ch' or below. As a packet has the probability 1/(th') of being sent at a given moment of time (from the interval [1 .. ⌈th'⌉]), and by the h'-relation assumption at most h' packets have the same target, the probability of success for a packet at a given moment of time is at least
$$\frac{1}{th'}\times\left(1-\frac{1}{th'}\right)^{h'-1} \ge \frac{1}{e^{1/t}\cdot th'},$$
since
$$\left(1-\frac{1}{th'}\right)^{m} \ge \left(1-\frac{1}{th'}\right)^{h'-1} \text{ for any } m \le h'-1$$
(consider m as the actual number of other packets with the same target) and
$$\left(1-\frac{1}{th'}\right)^{h'-1} \ge \left(\left(1-\frac{1}{th'}\right)^{th'-1}\right)^{1/t} \ge e^{-1/t}$$
for h' ≥ 1 and t ≥ 1. For each packet, the probability of success during the procedure call is at least
$$\frac{1}{e^{1/t}\cdot th'}\times th' = e^{-1/t}.$$
Hence, the expected number of successful packets of a processor within these th' units of time is E_t = h'e^{−1/t}. By applying the Chernoff bound [10]
$$\Pr(N < (1-\epsilon)E) \le e^{-\epsilon^2 E/2}$$
and choosing (1 − ε)E_t = E_{t0}, we get ε = 1 − E_{t0}/E_t ≈ 1/t0 − 1/t, and
$$\Pr\left(N < h'e^{-1/t_0}\right) \le e^{-0.5\,\epsilon^2\,h'e^{-1/t}} \le \frac{1}{p^{2.5}}$$
for h' ≥ h0 = k0 log p for some constant k0. Hence, in a processor the number of outgoing packets decreases by the compression factor c = 1 − 1/e^{1/t0} with probability 1 − 1/p^{2.5}, and in all p processors and in all log_{1/c} p < √p rounds with probability 1 − 1/p. Guaranteeing that the number of incoming packets is at most ch' for each processor is analyzed analogously. Clearly, the full h'-relation (completed with dummy packets) decreases to a ch'-relation. Removing the dummy packets from the system can only decrease the degree of the relation. Within log_{1/c} h geometrically converging phases, or O(ht) time, there remain no more than O(log p) packets in each processor. Observe that by choosing a larger h0 in the analysis above, we can easily show the same progression in the degree of the relation with probability 1 − p^{−α} for any positive constant α. For the rest of the algorithm, when h' < h0 packets remain, the success probability of a packet is
$$\frac{h'}{th_0}\times\left(1-\frac{1}{th_0}\right)^{h'-1} \ge \frac{h'}{h_0\,t\,e^{1/t}}.$$
Thus the expectation of the number of sending times for this packet is te^{1/t}h0/h'. The sum of all these expectations, until all packets have been sent, is
$$te^{1/t}h_0\left(\frac{1}{h_0}+\frac{1}{h_0-1}+\cdots+\frac{1}{2}+1\right) = E \in O(h_0\log h_0).$$
By another form of Chernoff bound [10]
$$\Pr(T > r) \le 2^{-r} \quad\text{for } r \ge 6E,$$
we see that
$$\Pr(T > k_1 h_0 \log h_0) \le \frac{1}{p^{2.5}}$$
for some constant k_1, and therefore all packets can be transmitted to their targets in time O(log p log log p) with high probability. By combining the two phases we see that all packets can be routed in time O(h + log p log log p) with high probability. By choosing h ∈ Ω(log p log log p) we complete the proof. □
4 Experiments
We ran some experiments to get practical experience of the new algorithm, see Table 1. The results are averages of 500 experiments.
Table 1. Routing cost of the CT algorithm with h = log² p and h0 = log₂ p. Thinning parameters t = 1.2, 1.5, 2.0 were tried. For comparison, results of GGT with h = log² p, ε = 0.25, α = 0.1 are presented.
[Table 1 columns: p; t = 1.2; t = 1.5; t = 2.0; GGT]
By the results in Table 1, the constant thinning algorithm CT achieves the cost 3.7 with slackness h = log² p. With a suitable combination of ε and α, GGT comes below 5 but is clearly slower than CT. Note that e ≈ 2.7 is the lower bound of the cost. In addition to mere numbers, our routing simulator shows the progress of routing graphically (see Figure 1). In CT the throughput of packets is constant for most of the time, while GGT follows a saw blade pattern. The "tail" of the graph is very critical. Too little thinning or too small an h0 grows the tail, as the algorithm approaches the greedy algorithm.
Fig. 1. Graphical output of a CT simulation in the case p = 1024, h = 128, h0 = 16, t = t0 = 1.2. The picture is divided in three horizontal bands by ragged borderlines at 1/t = 83% and at 1/(te^{1/t}) = 36% of all processors. The bottom band represents successful processors, the middle band failed processors, and the top band passive processors. The vertical lines at intervals of cth, c²th, ..., until c^i th < h0 = 16 (c = 1 − 1/e^{1/t0}) separate the phases of the algorithm. The total time in the picture is 456 = 3.56h.
References 1. R.J. Anderson and G.L. Miller. Optical communication for pointer based algorithms. Technical Report CRI-88-14, Computer Science Department, University of Southern California, LA, 1988. 2. A. Czumaj, F. Meyer auf der Heide, and V. Stemann. Shared memory simulations with triple-logarithmic delay. Proc. ESA'95, 46-59, 1995. 3. M. Dietzfelbinger and F. Meyer auf der Heide. Simple, efficient shared memory simulations. Proc. SPAA'93, 110-119. 4. J. Håstad, T. Leighton, and B. Rogoff. Analysis of backoff protocols for multiple access channels. SIAM J. Comput. 25:740-774, 1996. 5. M. Geréb-Graus and T. Tsantilas. Efficient optical communication in parallel computers. In Proc. SPAA'92, pp. 41-48, June 1992. 6. L.A. Goldberg, M. Jerrum, T. Leighton, and S. Rao. A doubly logarithmic communication algorithm for the completely connected optical communication parallel computer. In Proc. SPAA'93, pp. 300-309, June 1993. 7. L.A. Goldberg, Y. Matias, and S. Rao. An optical simulation of shared memory. In Proc. SPAA'94, pp. 257-267, June 1994. 8. L.A. Goldberg and P.D. MacKenzie. Analysis of Practical Backoff Protocols for Contention Resolution with Multiple Servers. Proc. SODA'96, pp. 554-563. 9. L.A. Goldberg and P.D. MacKenzie. Contention Resolution with Guaranteed Constant Expected Delay. Proc. FOCS'97, pp. 213-222. 10. T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters 33:305-308, 1989. 11. R.M. Karp, M. Luby, and F. Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. Proc. STOC'92, pp. 318-326. 12. A. Kautonen, V. Leppänen, and M. Penttonen. Simulations of PRAM on Complete Optical Networks. In Proc. EuroPar'96, LNCS 1124:307-310. 13. M. Paterson and A. Srinivasan. Contention resolution with bounded delay. In Proc. FOCS'95, pp. 104-113. 14. P. Raghavan and E. Upfal. Stochastic Contention Resolution With Short Delays. In Proc. STOC'95, pp. 229-237. 15. L.G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Vol. A, 943-971, 1990. 16. L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103-111, 1990.
NAS Integer Sort on Multi-threaded Shared Memory Machines*
Thomas Grün¹ and Mark A. Hillebrand¹
¹ Computer Science Department, University of the Saarland, Postfach 151150, Geb. 45, 66041 Saarbrücken, Germany ([email protected], [email protected])
Abstract. Multi-threaded shared memory machines, like the commercial Tera MTA or the experimental SB-PRAM, have an extremely good performance on the Integer Sort benchmark of the NAS Parallel Benchmark Suite and are expected to scale. The number of CPU cycles is an order of magnitude lower than the numbers reported for general purpose distributed memory or shared memory machines; even vector computers are slower. The reasons for this behavior are investigated. It turns out that both machines can take advantage of a fetch-and-add operation and that due to multi-threading no time is lost waiting for memory accesses to complete. Except for non-scalable vector computers, the Cray T3E, which supports fetch-and-add but not multi-threading, is the only parallel computer that could challenge these machines.
1 Introduction
One of the first programs that ran on a node processor of the Tera MTA (multi-threaded architecture) [2, 3] was the Integer Sort (IS) from the Parallel Benchmark suite [8, 14] of the Numerical Aerospace Simulation Facility (NAS) at NASA Ames Research Center. According to press releases of Tera Corporation [17], a single node processor of the Tera machine was able to beat a one-processor Cray T90, which hitherto held the record for one-processor machines, by more than 30 percent. The SB-PRAM [1, 9, 12, 15] has many features in common with the Tera MTA, although it is inspired by a totally different idea, viz. realizing the PRAM model from theoretical computer science [18]. Both machines differ from most of today's shared memory parallel computers in two aspects: First, they implement the UMA (uniform memory access) model instead of the NUMA (non-uniform memory access) scheme found in today's scalable shared memory machines like SGI Origin, Sun Starfire or Sequent NUMA-Q. In other words, they do not employ data caches with a cache coherence protocol, but hide memory latency by means of multi-threading. Second, they provide special hardware for fetch-and-add instructions in the memory system. Commercial microprocessors, which are employed in most of today's parallel computers, do not offer this kind
* This work was partly supported by the German Science Foundation (DFG) under contract SFB 124, TP D4.
of hardware support for running efficient parallel programs. An exception is the Cray T3E, which is a UMA machine with fetch-and-add support but without multi-threading. The IS benchmark is part of the NAS (Numerical Aerospace Simulation Facility) Parallel Benchmark Suite (NPB) [8]. It models the ranking step of a counting sort (a kind of bucket sort) application, which occurs for instance in particle simulations. The IS benchmark takes a list L of small integers as an input and computes for every element x ∈ L the rank r(x) as its position in the sorted list. On this benchmark, the best program on a 4-processor SB-PRAM spends 3.3 cycles per element (CPE) on the average, and the single-node Tera machine needs 2.4 cycles. Both machines are expected to scale without significant loss of efficiency. A Cray T90 node approaches 5 CPE and the numbers reported for the MPI-based IS sample code on various distributed and shared memory computers (including the Cray T3E) are greater than 25 CPE. Besides, scalability is a problem on these machines. In this article we explain why these numbers differ over such a wide range and investigate whether there are better implementations than the sample code for CC-NUMA machines and the Cray T3E. The paper is organized as follows. Section 2 discusses the IS benchmark specification. Section 3 addresses multi-threaded shared memory machines and Section 4 discusses a fast algorithm for vector computers. Section 5 presents results of the MPI-based message-passing sample code and Section 6 investigates whether CC-NUMA machines or the Cray T3E UMA machine could improve on these results. Section 7 concludes.
2 The NAS Integer Sort Benchmark
The Integer Sort (IS) program is part of the NAS Numerical Parallel Benchmark suite [8]. Despite its name it does not sort but rank an array of small integers. The first revision (NPB 1) was a "paper and pencil" benchmark specification, so as not to preclude any programming tricks for a particular machine. The benchmark specifies that an array keys[] of N keys is filled using a fully specified random number generator that produces Gaussian distributed random integers in the range [0, Bmax - 1]. This key generation and a final verification step, which checks the results, are excluded from the timing. There are several "classes" of the benchmark, which define the parameter values N and Bmax. We concentrate on class A (N = 2^23, Bmax = 2^19); the parameters for class B and C are higher by factors of 4 and 16, respectively. The timed part of the benchmark consists of 10 iterations of (a) a modification of two keys, (b) the actual ranking step, and (c) a partial verification that tests if the well known rank of 5 keys has been computed correctly. A typical implementation of the inner loops that perform the actual ranking is listed in Table 1. The NPB 2 revision of the benchmark intends to supplement the original benchmark description with sample implementations that can be used as a starting point for machine-specific optimizations. There is a serial implementation (NPB 2.3-serial) and an implementation based on the message passing standard
Table 1. Inner loops of the NAS IS benchmark

(1) for i = 1 to B:  count[i] = 0                        // clear count array
(2) for i = 1 to N:  count[key[i]]++                     // count keys
(3) for i = 1 to B:  pos[i] = sum_{k=0}^{i-1} count[k]   // calc start position
(4) for i = 1 to N:  rank[i] = pos[key[i]]++             // ONLY NPB 1
MPI (NPB 2.3b2). Unfortunately, the fourth loop of the benchmark, which was present in earlier sample implementations of NPB 1, has been omitted in NPB 2. However, the fourth loop is identical to the second loop except for storing the result, and does not require a new programming technique. Therefore, we use the NPB 2 variant of IS, class A, throughout this paper.
Performance measures. The original performance measure of NAS IS is the elapsed time for 10 iterations of the inner loops. The main performance measure in our investigation is the number of CPU cycles per key element (CPE), because it emphasizes the architectural aspect rather than technology. We will point out situations in which CPE is an unfair measure.
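For concreteness, the ranking step of Table 1 can be written as a plain serial C sketch. The function and variable names below simply follow the table; this is a reference version of the loops, not any machine-specific implementation from the paper.

#include <stdlib.h>

/* Serial sketch of the IS ranking step (loops 1-4 of Table 1).
   N keys in key[0..N-1], each in the range [0, B-1].              */
void rank_keys(const int *key, int *rank, long N, long B)
{
    int *count = calloc(B, sizeof(int));    /* loop 1: clear count array   */
    int *pos   = malloc(B * sizeof(int));

    for (long i = 0; i < N; i++)            /* loop 2: count keys          */
        count[key[i]]++;

    pos[0] = 0;                             /* loop 3: calc start position */
    for (long x = 1; x < B; x++)
        pos[x] = pos[x - 1] + count[x - 1];

    for (long i = 0; i < N; i++)            /* loop 4: rank (NPB 1 only)   */
        rank[i] = pos[key[i]]++;

    free(count);
    free(pos);
}

Loop 2 dominates the work since it runs over all N keys, while loops 1 and 3 only run over the B = N/16 key values; this ratio reappears in the cycle counts derived later in the paper.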
3 Multi-threaded SMMs
First, we sketch the SB-PRAM design and highlight the Tera MTA differences. Then we describe the result of the best SB-PRAM implementation and indicate the algorithmic changes in the Tera MTA implementation.

3.1 Hardware Design
The SB-PRAM [1, 9, 12] is a shared memory computer with up to 128 processors and an equal number of memory modules, which are connected by a butterfly network. Processors send their requests to access a memory location via the butterfly network to the appropriate memory modules and receive an answer packet if the request was of type LOAD. There are no caches in the memory system. Three key concepts are employed to yield a uniform load distribution on the memory modules and to hide memory latency: (a) synchronous multi-threading of 32 virtual processors (VP) with a delayed load; (b) hashing of subsequent logical addresses to physical addresses that are spread over all memory modules; (c) butterfly network with input buffers and combining of packets. The combining facility is extended to do parallel prefix computations, which, due to synchronous execution, look to the programmer as if the VPs executed the operations one after another in a predefined order (sequential semantics). The design ensures that the VPs are never stalled in practice, neither by network congestion on the way to the memory modules nor by answer packets that arrive too late. Although this issue has been intensively investigated in [7]
and [19] using different and detailed simulators, the main criticism on the SB-PRAM is that these investigations were based on simulations only. Therefore, a 64-processor machine is being built as a proof of concept. At the time of this writing, the re-design of an earlier 4-processor prototype [4] has been completed. The 64-processor machine is expected to be running in summer 1998. In [10] a variant of the SB-PRAM, called High Performance PRAM (HPP), has been sketched. Due to modest architectural changes and using top 1995 technology, the HPP is expected to achieve the tenfold performance of the SB-PRAM on average compiler-generated programs.
Tera MTA. The design goal of the Tera MTA [2, 3, 17] was to build a powerful multi-purpose parallel computer with special emphasis on numerical programs. In order to meet this goal, it was determined early in the design phase to employ GaAs technology and liquid cooling. A processor chip runs at ≥ 294 MHz and has a power dissipation of 6 KW per processor. Similar to the SB-PRAM, the Tera MTA hides memory latency by means of multi-threading. Unlike the SB-PRAM, it does not schedule a fixed number of VPs round robin, but it can support up to 128 threads, all of which can have up to 8 active memory requests. New threads can be created using a low overhead mechanism; the penalty for an instruction that attempts to use a register that is waiting for a memory request to arrive is only one cycle. The memory system of the Tera MTA is completely different from that of the SB-PRAM: the network topology is a kind of 3-dimensional torus. The messages are processed by a randomized routing scheme that may detour packets if the output link in the right direction is either overloaded or out of order. To our knowledge there is no detailed, technical description or a correctness proof of the network available to the public. The machine supports fetch-and-add instructions; the requests are not combined in the network, but serviced sequentially at the memory modules. Because the Tera threads may become asynchronous due to race conditions in the network, the machine does not offer the sequential semantics of the SB-PRAM.
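To illustrate the role of the fetch-and-add primitive that both machines provide, the following C sketch expresses the counting and ranking updates with C11 atomics standing in for the hardware operation. This is an illustration only: syncadd/mpadd are executed in the memory system of these machines, and the SB-PRAM additionally guarantees a deterministic order among virtual processors that an ordinary atomic does not give.

#include <stdatomic.h>

/* Loop 2: each thread counts a share of the keys; the add happens
   atomically, so no locks on count[] are needed.                        */
void count_keys(const int *key, long lo, long hi, atomic_int *count)
{
    for (long i = lo; i < hi; i++)
        atomic_fetch_add(&count[key[i]], 1);
}

/* Loop 3: drawing start positions from one shared counter.  This is a
   correct prefix sum only if the adds happen in increasing x order,
   which the SB-PRAM's synchronous execution guarantees; other machines
   need a different parallelisation (see Sect. 3.2).                      */
void start_positions(atomic_int *count, long lo, long hi, atomic_long *ranksum)
{
    for (long x = lo; x < hi; x++) {
        int  c = atomic_load(&count[x]);
        long p = atomic_fetch_add(ranksum, (long)c);
        atomic_store(&count[x], (int)p);     /* count[] reused as pos[]   */
    }
}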
3.2 Benchmark Implementation
Due to some limitations of our gcc compiler port for the SB-PRAM, we have hand-coded the loop bodies in assembly language and manually unrolled the inner loops of the benchmark. For details we refer to [11]. We now present how many instructions are necessary at least to implement the IS benchmark on the SB-PRAM. Hence, we hand-code the loop bodies in assembly language and neglect loop overhead. As on every other machine, the count array can be cleared with a single "store with auto increment" instruction, if every VP is assigned a different but contiguous portion of the count array. The arrays count[] and pos[] are used one at a time and can be coalesced into a single cp[] array, which saves a few address computations. Because there are no data dependencies between different iterations of loop 2, all PROCS virtual processors are mapped round robin onto the keys and cp arrays and synchronously execute loops 2 and 3. Two successive
(a) elem1 = M(k_ptr1 += 2*PROCS);     elem2 = M(k_ptr2 += 2*PROCS);
    cp_ptr1 = elem1 + cp_base;        cp_ptr2 = elem2 + cp_base;
    syncadd(cp_ptr1, 1);              syncadd(cp_ptr2, 1);

(b) elem1 = M(cp_ptr1 += 2*PROCS);    elem2 = M(cp_ptr2 += 2*PROCS);
    elem1 = mpadd(ranksum, elem1);    elem2 = mpadd(ranksum, elem2);
    M(cp_ptr1) = elem1;               M(cp_ptr2) = elem2;

Fig. 1. Interleaved loop bodies: (a) Loop 2, and (b) Loop 3.
loop bodies are interleaved in order to fill all load delay slots (see assembly code in Figure 1.a). The first instruction combines the incrementing of a properly initialized, private key pointer k_ptr by 2*PROCS with the loading of the next key. In the second instruction of the loop body, the pointer cp_ptr is computed as cp[] indexed by key, and the syncadd operation (mpadd without result) increments this entry. Figure 1.b lists the assembly code for the third loop, which also employs the interleaving technique. In the first instruction, the pointer cp_ptr is adjusted and an entry of cp[] is read. Then, the accumulated rank is computed in the multi-prefix add on ranksum, and the result is written back to the cp array. This loop relies on synchronous execution, because the order of mpadd instructions among VPs is crucial. Loop 2 contributes most instructions to the timed portion of the code, because it is executed N times whereas loops 1 and 3 are executed only Bmax = N/16 times. Thus, CPE(ranking) = 1/16 + 3 + 3/16 = 3.25 is optimal for the SB-PRAM. The benchmark implemented using assembly language macros and a 64-fold unrolled loop achieves 3.30 CPE. A similarly tuned C version is slower by roughly one cycle, because the compiler does not use "load with auto-add of 2*PROCS" but generates two separate instructions instead. We have run the benchmark on our instruction-level simulator as well as on the real 4-processor machine. The run times differ by less than 0.1%, a fact that validates our network simulations which predicted that memory stalls are extremely unlikely. The run time obtained by a 128-processor simulation is only 1% slower than 1/128th of the one-processor run time, i.e. speedup is nearly linear. Because no caches are employed in the SB-PRAM, the run times of class B and C can be simply and accurately predicted; they are higher by a factor of 4, respectively 16. The High-Performance-PRAM HPP would gain a factor of 5.6 in absolute speed, but the CPE value would be worse, because the machine can issue memory requests at only a third of the instruction rate. For a fairer comparison, the CPU clock should be replaced by the memory request rate.
Tera MTA. The Tera MTA system has a sophisticated parallelizing compiler, which processes the sequential source code and automatically generates threads when needed. The assembler output of loop 2, which is listed in [2], shows a 2-instruction loop body that is 5-fold unrolled. The memory operations are "load next key" and "fetch & increment count". An integer addition plus the loop
assume: V1 ← 0..63

(a) 1: V2 ← key[V1 + k]          (b) 2.1: count[V2] ← V1
    2: V3 ← count[V2] + 1            2.2: V4 ← count[V2] - V1
    3: count[V2] ← V3                2.3: if (V4 ≠ 0) check

Fig. 2. Loop 2 vector code: (a) straightforward, (b) correction
overhead are hidden in the non-memory operations of the ten instructions of the loop body. Note that the CPU clock and the memory request rate are equal in the super-scalar Tera MTA design. Since the Tera MTA lacks synchronous execution, the third loop cannot be parallelized as on the SB-PRAM. Instead, every processor works on a contiguous block of the count array, computes the local prefix sums and outputs the global sum of its elements. In a second step, the prefix sum of the global sums is computed. Then, each processor adds the appropriate global prefix sum to its local elements. This procedure requires 3 instructions plus a small logarithmic term. For the class A benchmark it is less than 4 instructions. The Tera MTA requires less than 1/16 + 2 + 4/16 = 2.3125 CPE. The improvement compared to the SB-PRAM comes from the ability to execute one memory operation and two other (integer, jump, ...) operations per instruction. We do not have access to a Tera MTA ourselves. The performance figures contained in this section are derived from [2, 5] and personal communication with P. Briggs (Tera Computer Corporation) and L. Carter (SDSC). The numbers reported in [2] are 1.53 seconds and 5.25 < CPE < 5.3125 for NPB 1. Assuming a CPE value of 5.25, the 294 MHz machine has an overhead of at least (1.53 · 294 · 10^6)/(5.25 · 10 · 2^23) - 1 = 2.14%, i.e. hiding memory latency works well for the one-processor prototype. The fourth inner loop of the benchmark, which is not present in NPB 2, accounts for 3 CPE. If we attribute all overhead to the three loops of IS (NPB 2) and conservatively assume CPE(NPB 2) = 2.3125, we end up with a run time of < 0.68 seconds. Carter [5], who ran the NPB 2 benchmark on a 145 MHz machine, achieved a run time of 2.05 seconds, which corresponds to an overhead of 50 percent. He believes that on the early machine flipped bits in the network packets induced retransmissions, because the memory overhead figures for other benchmarks dropped as the hardware became more stable.
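The block-wise parallelisation of loop 3 sketched above can be written in C roughly as follows. The helper barrier(), the thread numbering and the serial prefix over the block totals are assumptions of this illustration, not taken from the Tera code.

#define MAX_THREADS 128
extern void barrier(void);               /* assumed synchronisation primitive */

static long block_total[MAX_THREADS];    /* per-block sums                    */
static long block_offset[MAX_THREADS];   /* prefix over block_total[]         */

/* Loop 3 without synchronous execution: thread tid owns a contiguous block
   of count[] (B assumed divisible by nthreads for brevity).                 */
void prefix_block(long *count, long B, int tid, int nthreads)
{
    long lo = tid * (B / nthreads), hi = lo + B / nthreads;

    long sum = 0;                         /* phase 1: local exclusive prefix  */
    for (long x = lo; x < hi; x++) {
        long c = count[x];
        count[x] = sum;
        sum += c;
    }
    block_total[tid] = sum;
    barrier();

    if (tid == 0) {                       /* phase 2: prefix of block totals  */
        long acc = 0;
        for (int t = 0; t < nthreads; t++) {
            block_offset[t] = acc;
            acc += block_total[t];
        }
    }
    barrier();

    for (long x = lo; x < hi; x++)        /* phase 3: add global offset       */
        count[x] += block_offset[tid];
}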
4 Vector Computers
In [5], a Tera MTA node is compared to a Cray T90 vector processor node on the NPB 2-serial sample code of IS. Vector processors have instructions for computing prefix sums on vector registers. Thus, the third b e n c h m a r k loop can be parallelized in the same way as on the Tera MTA. The difficult part is loop 2, where even a single vector processor encounters problems. Figure 2.a lists the straightforward, but wrong, implementation of loop 2. Successive, contiguous stripes of key [] are read into vector register V2. Then,
Table 2. IS (class A) benchmark results

                      CPU clock        minimum                    maximum
 Machine Name           [MHz]   #procs   time      CPE     #procs   time     CPE
 SB-PRAM                    7        1   39.55     3.30         4    9.89     3.30
 SB-PRAM simulation         7                                 128    0.31     3.33
 Tera MTA                 294        1    0.68     2.32
 Cray T90                 440        1    1.11     5.82
 SB-PRAM (MPI)              7        2  160.8     26.85
 IBM SP2 WN                66        2   29.1     27.38       128    0.6     60.42
 SGI Origin 2000          195        2   20.5     95.30        32    1.6    119.02
 Cray T3E-900             450        2   12.7    136.26       256    0.4    258.15
V3 is filled by reading the corresponding count[] entries and incrementing them. Afterwards V3 is written back. If a specific key value v is contained twice in V2 at positions k1 and k2, count[v] is incremented only once, because the value which finally ends up in count[v] contains count_old[v]+1 and not the incremented V3[k1] value. The Cray compiler can cure the situation using a patented algorithm [6] invented by Booth. Before writing V3 back in line 3, the code listed in Figure 2.b tests whether duplicates occur, and, if necessary, processes them in the scalar unit. The run time of IS (NPB 2), as reported in [5], is 1.11 s, which corresponds to 5.82 CPE. The NPB 2.3-serial sample code contains an unnecessary store that possibly has not been detected by the compiler. Hence, the optimal CPE for a single vector computer could be lower by 1 CPE. The above code for loop 2 can be easily parallelized by maintaining a private copy count_t[] of the count array on each node computer. When p denotes the number of participating processors, the third loop becomes a parallel prefix computation pos[i] = Σ_{k=0}^{i-1} Σ_{t=0}^{p-1} count_t[k]. Memory consumption as well as the execution time of loop 3 scale with p. If p becomes very large, a two-pass radix sort, as described by Zagha [20], can improve performance.
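A scalar C sketch of this parallelisation: each processor counts its keys into a private array, and pos[] is then obtained from the double sum above. The two-dimensional counts array and the function names are illustrative assumptions, not the Cray code.

/* Loop 2 on processor t: count a slice of the keys into a private array. */
void count_private(const int *key, long lo, long hi, long *count_t)
{
    for (long i = lo; i < hi; i++)
        count_t[key[i]]++;
}

/* Combine the p private arrays: pos[i] = sum_{k<i} sum_t counts[t][k].   */
void combine_counts(long **counts, long *pos, int p, long B)
{
    long running = 0;
    for (long k = 0; k < B; k++) {
        pos[k] = running;
        for (int t = 0; t < p; t++)
            running += counts[t][k];
    }
}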
5 Comparison with MPI-based sample code
Table 2 lists the benchmark results discussed so far and compares them with run times of MPI-based sample code. We implemented the necessary MPI library functions on the SB-PRAM and optimized the sample code to the extent that an optimizing compiler could also achieve [11]. The times for the other machines have been taken from benchmark results published by NAS. The IBM SP2 is a distributed memory machine (DMM) with an extremely fast interconnection network, the SGI Origin 2000 a cache-coherent NUMA, and the Cray T3E a UMA shared memory machine (SMM) without cache coherence.
1 Personal Communication with Larry Carter and Max Dechantsreiter.
Sample code implementation. The keys are distributed on all participating p computers before the timing starts. In a first local step, each computer sorts its keys using only the b most significant bits into 2^b buckets. Because the keys are put into a sorted order, this step alone requires more instructions than IS (NPB 1). Then, a mapping of buckets to processors, which balances the load, is computed. Afterwards, all local bucket parts are sent to their destinations in a single, global, all-to-all communication step. Now, every processor owns a distinct, contiguous key range and can count its keys locally. The pos[] entries can be computed from count[] like in the Tera MTA implementation. The sample code algorithm is guided by the idea of doing one, central all-to-all communication. The price for this procedure is to rearrange the keys locally, such that keys that are sent to the same destination are stored contiguously in memory. A plus of the implementation is the good cache efficiency: the key arrays are accessed sequentially, so that the initial cache miss time is amortized over the complete cache line, and the count arrays fit into the second level cache. With fewer than p = 16 processors another strategy with less communication overhead would be possible: every processor could count its keys in a local count array and pos[] could be computed like on vector processors after transposing the p × Bmax count matrix. If p < 16, then only p · Bmax < 16 · 2^19 data elements would have to be sent over the network. However, the sample code is written for a large number of processors and slow communication networks. The CPE of the SB-PRAM is 26.85, of which 9 cycles belong to memory instructions operating on key elements. The SB-PRAM issues one instruction per cycle and encounters no memory waits. The other machines have superscalar processors, but suffer from memory waits. The impact of wait states on the CPE is the greater, the higher the clock rate of the specific machine is. Table 2 highlights this fact, which limits the benefit of cache-based architectures from future technological improvements compared to vector computers or multi-threaded machines.
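The communication pattern of the sample code can be sketched with MPI as follows. For brevity the sketch assumes one bucket per processor, a key range and processor count that are powers of two, and omits the load-balancing bucket-to-processor map of the real sample code; the helper ilog2 and all names are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int ilog2(int x) { int l = 0; while (x > 1) { x >>= 1; l++; } return l; }

/* One ranking iteration: redistribute keys with a single all-to-all,
   then count the locally owned key range (loops 1-3 run locally).       */
void rank_step(const int *keys, int n_local, int log2_bmax, MPI_Comm comm)
{
    int p, me;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &me);
    int shift = log2_bmax - ilog2(p);        /* top bits pick the destination */

    int *sendcnt = calloc(p, sizeof(int));   /* keys per destination          */
    for (int i = 0; i < n_local; i++) sendcnt[keys[i] >> shift]++;

    int *sdispl = calloc(p + 1, sizeof(int));/* rearrange keys contiguously   */
    for (int d = 0; d < p; d++) sdispl[d + 1] = sdispl[d] + sendcnt[d];
    int *sendbuf = malloc(n_local * sizeof(int));
    int *fill = malloc(p * sizeof(int));
    memcpy(fill, sdispl, p * sizeof(int));
    for (int i = 0; i < n_local; i++) sendbuf[fill[keys[i] >> shift]++] = keys[i];

    int *recvcnt = malloc(p * sizeof(int));  /* one global all-to-all step    */
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, comm);
    int *rdispl = calloc(p + 1, sizeof(int));
    for (int d = 0; d < p; d++) rdispl[d + 1] = rdispl[d] + recvcnt[d];
    int n_recv = rdispl[p];
    int *mykeys = malloc(n_recv * sizeof(int));
    MPI_Alltoallv(sendbuf, sendcnt, sdispl, MPI_INT,
                  mykeys, recvcnt, rdispl, MPI_INT, comm);

    long range = 1L << shift;                /* locally owned key range       */
    int  base  = me << shift;
    int *count = calloc(range, sizeof(int));
    for (int i = 0; i < n_recv; i++) count[mykeys[i] - base]++;
    /* ... prefix sums over count[] give the local part of pos[] ...          */

    free(sendcnt); free(sdispl); free(sendbuf); free(fill);
    free(recvcnt); free(rdispl); free(mykeys); free(count);
}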
6 Other Algorithms
The MPI-based sample code is tailored for distributed memory machines. We now investigate whether there are better algorithms for CC-NUMA machines or the Cray T3E.
CC-NUMA. Cache-coherent non-uniform-memory-access SMMs [13], like the SGI Origin, have overcome the argument "SMMs do not scale" by other means than the SB-PRAM or the Tera MTA. While multi-threading relies on having enough parallelism available to keep the processors busy, CC-NUMA machines try to minimize the average memory latency by caching, at the risk of running idle on cache misses. We now investigate whether the direct approach of locking the count[] elements for updating could improve performance so much that the results of multi-threaded SMMs could be reached. Therefore, we calculate the average memory
latency introduced by accessing count[keys[i]] in loop 2. We assume that on a p-processor machine 1/p of the count[] elements are cached on every processor and the miss penalty is 20 cycles (i.e. 100 ns on a 200 MHz computer). Thus, the average memory latency is

    t = (1/p) · 1 + ((p-1)/p) · 20
        (cache hit)   (cache miss)
On machines with more than p = 16 processors, t is greater than 18.8. In other words, the exchange of count[] entries between the coherent caches is rather expensive. If the remaining cycles for the program are taken into account, the CPE is at least an order of magnitude higher than on multi-threaded SMMs with fetch-and-add.
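A few lines of C make this latency model concrete; the values simply evaluate the formula above and are not additional measurements.

#include <stdio.h>

/* Average latency of accessing count[keys[i]]: a hit costs 1 cycle, a miss
   20 cycles, and 1/p of the entries are assumed to be cached locally.     */
double avg_latency(int p)
{
    return (1.0 / p) * 1.0 + ((double)(p - 1) / p) * 20.0;
}

int main(void)
{
    for (int p = 2; p <= 128; p *= 2)
        printf("p = %3d   t = %5.2f cycles\n", p, avg_latency(p));
    return 0;
}

For p = 16 this gives t ≈ 18.8 cycles, the value quoted above, and t approaches the full 20-cycle miss penalty as p grows.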
Cray T3E. A Cray T3E node computer consists of a DEC Alpha processor with up to 2 GByte of local RAM. Additionally, the system logic provides a notion of a logically addressable shared memory through the E-registers [16]. Up to 512 E-registers can be accessed by the user to trigger shared memory accesses. In addition to load and store, E-registers also support fetch & add and an especially fast fetch & increment operation. Although the machine does not support multi-threading in hardware, memory wait conditions can be avoided by using multiple E-registers to simulate multi-threading in software. With this technique, alternate key load and count increment operations can be performed at the maximal network clock of 75 MHz. The third loop of IS can also be parallelized in the same way as on the Tera MTA. Thus, less than 3 network cycles per key element are possible.
7 Conclusion
We have investigated the performance of the IS benchmark on the SB-PRAM and the Tera MTA, two scalable SMMs which employ multi-threading to hide memory latency instead of caching like most of today's computers. Besides multi-threading, both machines take advantage of a fetch-and-add instruction to update shared data structures conflict-free. These two properties lead to a performance that is an order of magnitude higher compared to published results of other scalable parallel computers, both DMMs and SMMs. In particular, the experimental 7 MHz SB-PRAM prototype can compete with modern parallel computers on this benchmark. The Tera MTA makes use of expensive top technology, and achieves better performance figures than non-scalable top vector processors. Although CC-NUMA SMMs could possibly improve performance compared to DMMs, they are still an order of magnitude slower than multi-threaded machines. Only the Cray T3E, which supports an atomic fetch & increment operation, could challenge the performance of the Tera MTA by emulating multi-threading in software.
Acknowledgments. The authors would like to thank Larry Carter, Max Dechantsreiter, and Preston Briggs for helpful discussions and all people who contributed to the SB-PRAM project [15].
References
1. F. Abolhassan, R. Drefenstedt, J. Keller, W. J. Paul, and D. Scheerer. On the Physical Design of PRAMs. Computer Journal, 36(8):756-762, December 1993.
2. R. Alverson, P. Briggs, S. Coatney, S. Kahan, and R. Korry. Tera Hardware-Software Cooperation. In Proc. of Supercomputing '97, San Jose, CA, November 1997.
3. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The TERA Computer System. In Proc. of Intl. Conf. on Supercomputing, June 1990.
4. P. Bach, M. Braun, A. Formella, J. Friedrich, T. Grün, and C. Lichtenau. Building the 4 Processor SB-PRAM Prototype. In Proc. of the 30th Hawaii International Conference on System Sciences, pages 14-23, January 1997.
5. J. Boisseau, L. Carter, K.S. Gatlin, A. Majumdar, and S. Snavely. NAS Benchmarks on the Tera MTA. In Proc. of Workshop on Multi-Threaded Execution, Architecture and Compilation (M-TEAC 98), Las Vegas, February 1998.
6. M. Booth. US Patent 5247696. See http://www.patents.ibm.com, September 1993.
7. C. Engelmann and J. Keller. Simulation-based comparison of hash functions for emulated shared memory. In Proc. PARLE (Parallel Architectures and Languages Europe), pages 1-11, 1993.
8. D. Bailey et al. The NAS Parallel Benchmarks. RNR Technical Report RNR-94-007, NASA Ames Research Center, March 1994. See also http://science.nas.nasa.gov/Software/NPB.
9. A. Formella, T. Grün, and C.W. Kessler. The SB-PRAM: Concept, Design and Construction. In Draft Proceedings of 3rd International Working Conference on Massively Parallel Programming Models (MPPM-97), November 1997. See also http://www-wjp.cs.uni-sb.de/~formella/mppm.ps.gz.
10. A. Formella, J. Keller, and T. Walle. HPP: A High-Performance-PRAM. In Proceedings of the 2nd Europar, volume II of LNCS 1124, pages 425-434. Springer, August 1996.
11. T. Grün and M. A. Hillebrand. NAS Integersort on the SB-PRAM. Manuscript, available via http://www-wjp/~tgr/NASIS, May 1998.
12. T. Grün, T. Rauber, and J. Röhrig. Support for Efficient Programming on the SB-PRAM. International Journal of Parallel Programming, 26(3):209-240, June 1998.
13. D.E. Lenoski and W.-D. Weber. Scalable Shared Memory Multiprocessing. Morgan Kaufmann Publishers, 1995.
14. NAS Parallel Benchmarks Home Page. http://science.nas.nasa.gov/Software/NPB/.
15. SB-PRAM Home Page. http://www-wjp.cs.uni-sb.de/sbpram.
16. S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proc. of the VII ASPLOS (Architectural Support for Programming Languages and Operating Systems) Conference, pages 26-36. ACM, October 1996.
17. Tera Computer Corporation Home Page. http://www.tera.com/.
18. J. van Leeuwen, editor. Handbook of Theoretical Computer Science, volume A, pages 869-941. Elsevier, 1990.
19. T. Walle. Das Netzwerk der SB-PRAM. PhD thesis, University of the Saarland, 1997. In German.
20. M. Zagha and G.E. Blelloch. Radix Sort for Vector Multiprocessors. In Proc. of Supercomputing '91, pages 712-721, New York, NY, November 1991.
Verifying a Performance Estimator for Parallel DBMSs
E.W. Dempster¹, N.T. Tomov¹, J. Lü¹, C.S. Pua¹, M.H. Williams¹, A. Burger¹, and H. Taylor¹ and P. Broughton²
¹ Department of Computing and Electrical Engineering, Heriot-Watt University, Riccarton, Edinburgh, Scotland, EH14 4AS, UK
² International Computers Limited, High Performance Technology, Wenlock Way, West Gorton, Manchester, England, M12 5DR, UK
Abstract. Although database systems are a natural application for parallel machines, their uptake has been slower than anticipated. This problem can be alleviated to some extent by the development of tools to predict the performance of parallel database systems and provide the user with simple graphic visualisations of particular scenarios. However, in view of the complexities of these systems, verification of such tools can be very difficult. This paper describes how both process algebra and simulation are being used to verify the STEADY parallel DBMS performance estimator.
1 Introduction
Database systems are an ideal application area for parallel computers. The inherent parallelism in database applications can be exploited by running them on suitable parallel platforms to enhance their performance - a fact which has attracted significant commercial interest. A number of general purpose parallel machines are currently available that support different parallel database systems, including adaptations of standard commercial DBMSs produced by vendors such as Oracle, Informix and Ingres. For a platform based on non-shared memory, such systems distribute their data (and associated query processing) amongst the available processing elements in such a way as to balance the load [1]. However, despite the potential benefits of parallel database systems, their uptake has been slower than expected. While information processing businesses have a strong interest in obtaining improved performance from current DBMSs, they have a substantial investment in existing database systems, running generally on mainframe computers. Such users need to be convinced that the benefits outweigh the costs of migrating to a parallel environment. They need assistance to assess what parallel database platform configuration is required and what improvements in performance can be achieved at what cost. They need tools to tune the performance of their applications on a parallel database platform to ensure that they take best advantage of such a platform. They need help in determining how the system should be upgraded as the load on it increases, and
how it should be reorganised to meet anticipated changes in usage of database capacities. Application sizing is the process of determining the database and machine configuration (and hence the cost) required to meet the performance requirements of a particular application. Capacity planning, on the other hand, is concerned with determining the impact on performance of changes to the parallel database system or its workload. Data placement is the process of determining how to lay out data on a distributed memory database architecture to yield good (optimal) performance for the application. All three activities require suitable tools to predict the performance of parallel database systems for the particular application. They determine whether a given hardware/software configuration will meet a user’s performance requirements, how performance will change as the load changes, and how the user’s data should be distributed to achieve good performance [2]. Such tools would rely on a model of transaction throughput, response time and resource utilisation for given system configurations, workloads and data distributions. To complicate matters further, since a parallel machine may host several different DBMSs, such tools should be capable of supporting models of different parallel DBMSs and of providing facilities for describing different features of applications running on these DBMSs. This paper presents the results from an initial verification exercise of a parallel DBMS performance estimation tool called STEADY. The predictions produced by components of STEADY are compared with results from a simulation and with those derived from a process algebra model. The rest of the paper is organised as follows. Section 2 briefly describes STEADY. Sections 3 and 4 provide a comparison between predictions of components of STEADY and those obtained from a simulation and a process algebra model. Section 5 provides a summary and conclusion.
2 STEADY
STEADY is designed to predict maximum transaction throughput, resource utilisation and response time given a transaction arrival rate. The maximum throughput value is derived by analysing the workload and identifying the system bottlenecks. Given a transaction arrival rate, lower than the maximum throughput, the response time is derived using an analytical queuing model. During the process of calculating the response time, the resource utilisation can also be obtained. The basic STEADY system originally worked with an Ingres Cluster model, and has been extended to model the Informix OnLine Extended Parallel Server (XPS) and the Oracle7 Parallel Server with the Parallel Query Option. These two systems are representative of two different parallel DBMS architectures: shared disk and shared nothing. The underlying hardware platform for both systems consists of a number of processing elements (PEs) on which instances of the DBMS are running. PEs communicate using a fast interconnecting network. The platform which STEADY currently supports is the ICL GoldRush machine [3].
[Diagram: STEADY architecture. Application Layer: Profiler, DPtool and Query Pre-processor (inputs: queries, relations, DP strategy, DBMS config; outputs: query trees, relation profiles, data layout & page heat). DBMS Kernel Layer (Task Block Generator): Query Paralleliser, Cache Model Component and Modeller Kernel (cache hit ratio estimation, task blocks, workload statistics). Platform Layer (Maximum Throughput Estimator): Evaluator and Builder (resource usage blocks, analytical formulae, maximum throughput). Output Layer: Queue Waiting Time Estimator and Response Time Estimator (response time information, resource utilisation, maximum throughput). All components are driven via the STEADY Graphical User Interface.]
Fig. 1. STEADY Architecture

In a shared disk DBMS (Oracle 7), each of the PEs of the parallel machine may access and modify data residing on the disks of any of the PEs of the configuration. In a shared nothing DBMS (Informix XPS), the disks of a PE are not directly accessible to other PEs. STEADY takes as input the execution plans of SQL queries, represented as annotated query trees. The query trees capture the order in which relational operators are executed and the method for computing each operator. Within STEADY the annotated query trees undergo several transformations until they reach a form - the resource usage representation - which allows the prediction of response time and resource utilisation. This process of transformation can be followed in Fig. 1 which illustrates the architecture of STEADY. In addition to the graphical user interface the system comprises four parts. The application layer consists of the Profiler, DPTool and the Query PreProcessor. The Profiler is a statistical tool, primarily responsible for generat-
ing base relation profiles and estimating the number of tuples resulting from data operations. DPTool is used to generate data placement schemes using different strategies, and estimates the access frequency (heat) of different pages in each relation. DPTool provides the necessary heat information along with the generated data layout to the Cache Model Component. Both the Profiler and DPTool make use of the annotated query tree format of the original SQL queries. The DBMS kernel layer consists of the Cache Model Component, the Query Paralleliser and the Modeller Kernel. The Cache Model Component estimates the cache hit ratio for pages from different relations [4]. The Query Paralleliser transforms the query tree into a task block structure. Each task block represents one or more phases in the execution of the relational operators within the query trees. Examples of phases are the following: building a hash table in the first phase of a hash join operator; merging two sorted streams of tuples in the last phase of a merge-sort join operator; performing a full table scan of a relation. The task blocks are organised as the nodes of a tree with dependencies among blocks represented by links. The dependencies specify ordering constraints among execution phases. One form of dependency is a pipeline dependency where a sending and a receiving block communicate tuples through a pipeline. Another form is full dependency which requires the previous block to complete before the next one can start. In addition, the task block tree structure captures the inter- and intra-operator parallelism within the query. The Query Paralleliser is able to construct the tree based on knowledge of the parallelisation techniques and the load balancing mechanism employed by the parallel DBMS. The Modeller Kernel takes as input the relation profiles, data layout, estimated cache hit ratios and the task block profile of the query, produced by the Query Paralleliser. It produces workload profiles in terms of the numbers of basic operations which are to be executed on each PE in the course of a transaction. This results in a set of workload statistics. Together with this, the Modeller Kernel fills in the details of the task blocks by expanding each execution phase within the block into a corresponding sequence of basic operations. For example, a full table scan of a relation is an execution phase process which is represented by the following sequence of operations: wait for a page lock; obtain the page lock; read the page; select tuples from the page; send selected tuples to the next phase. This sequence will be executed once for each page of the relation. The obtained locks are released within the body of a different task block which represents the commit phase of the transaction the query belongs to. The platform layer consists of the Evaluator and the Builder. The task block profiles of the queries are mapped by the Evaluator into sequences of resource usages. The Evaluator contains a hardware platform model which is generated by the Builder from a set of analytical formulae. This enables the Evaluator to map each basic operation into the appropriate sequence of resource usages. For example, a basic page read operation may be mapped to the following sequence, which makes explicit the order of usage and time consumption associated with the read:
cpu 32µs; disk3 150ms; cpu 50µs
Apart from producing these resource usage profiles, the evaluator also gives an initial estimation of the maximum throughput value. A resource usage representation is then produced which captures the pattern of resource consumption of the original query by mapping the task blocks to resource blocks. The Queue Waiting Time Estimator and the Response Time Estimator of the output layer of STEADY work with this representation to estimate individual resource utilisation and response time and, from these, overall query response time. The details of these processes are the subject of a future paper.
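As an illustration of how a basic operation maps to such a sequence of resource usages, the following C sketch gives one possible in-memory representation. The struct, the helper and the values simply mirror the page-read sequence quoted above; they are our own illustrative choice and not STEADY's actual data structures.

#include <string.h>

/* One step of a resource usage sequence: which resource is consumed and
   for how long (in microseconds).                                        */
typedef struct {
    const char *resource;     /* e.g. "cpu", "disk3"                      */
    double      usecs;
} ResourceUsage;

/* The basic page-read operation from the text as a usage sequence.       */
static const ResourceUsage page_read[] = {
    { "cpu",   32.0 },        /* issue the read                           */
    { "disk3", 150000.0 },    /* disk access                              */
    { "cpu",   50.0 },        /* process the returned page                */
};

/* Total service demand of a sequence on one named resource - the kind of
   per-resource figure from which utilisation is later estimated.         */
double demand_on(const ResourceUsage *seq, int n, const char *res)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        if (strcmp(seq[i].resource, res) == 0)
            total += seq[i].usecs;
    return total;
}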
3 Verification by Simulation
When estimating the performance of complex systems using analytical methods, two basic approaches are available: 1) to build a high-level abstract model and find an exact solution for the model, or 2) to build a fairly detailed model and find an approximate solution for it — finding exact solutions for detailed models is in general not feasible. In the case of STEADY, the second approach has been adopted. Part of the verification of STEADY is to determine the level of accuracy of results that can be achieved using the approximation algorithm described in the previous section. This can be achieved by finding solutions, within error bounds, for the model using discrete event simulation and then comparing these with the figures that are obtained using STEADY. The simulation is based on the resource usage profiles of transactions, which are generated by STEADY’s platform layer. Transactions are implemented as simulation processes which access shared resources according to the specifications in these profiles. New transactions are generated according to the overall arrival rate distribution and the respective transaction frequencies. Simulation output includes statistics on resources (including utilisation) and transaction response times. A number of examples have been selected and experiments conducted to compare the results obtained from simulation with those predicted by STEADY. One example is given in Figure 2 which shows the average utilisation of resources as predicted by simulation and by STEADY. By contrast Figure 3 provides a comparison of average transaction response times. In all the experiments results for average utilisation show very good agreement whereas those for response time are more variable. This is being investigated further.
4 Verification by Process Algebra
The second way in which the STEADY approach is being “verified” is through process algebras. A process algebra is a mathematical theory which models communication and concurrent systems. In the early versions of process algebras,
[Plot: resource utilisation (0 to 0.9) against transaction arrival rate (trxs/sec, 0 to 140); curves for Utilisation A, B and C, each from simulation and from STEADY.]
Fig. 2. Utilisation of resources A, B and C
[Plot: response time (msec, 10 to 100) against transaction arrival rate (trxs/sec, 0 to 140); curves for Response Time T2 from simulation and from STEADY.]
Fig. 3. Mean response time of t2
time was abstracted away. A classical example of such an algebra is CCS (Calculus of Communicating Systems) [7]. The form of process algebra which has been selected for use here is known as a Performance Evaluation Process Algebra (PEPA) [6]. This is a stochastic process algebra which has been developed to investigate the impact of the computational features of process algebras upon performance modelling. In PEPA, a system is expressed as an interaction of components, each of which engages in some specific activity. These components correspond to the parts of the system or the events in the behaviour of the system. The behaviour of each component is defined by the activities in which it can engage. Every activity has an action type and an associated duration (which is represented by a parameter known as the activity rate), and is written as (α, r) where α is an activity and r the activity rate. To illustrate the PEPA approach, consider the simplest of models, comprising a single processing element with one processing unit and one disk (see fig 4). Suppose that transactions arrive at the transaction manager (TM) module, which on receipt of a transaction passes it to the concurrency control unit (CCU). The latter sends a message to the distributed lock manager (DLM) requesting a lock; the DLM grants the lock and replies accordingly. The CCU sends a message to the buffer manager (BM) which reads from disk and returns the data. Finally the CCU releases the lock and commits the transaction.
[Diagram: a single processing element with components TM (TM0, TM1), CCU (Q, D, P), DLM (DLM0, DLM1) and BM.]
Fig. 4. Single Processing Element Configuration

This is expressed in PEPA notation as follows:

#Q0 = (tm2ccu,infty).(ccu2dlm,r_sgn0).Q0;
#Q1 = (dlm2ccu,infty).(lock_granted,r_sgn).Q1;
#Q = Q0 <> Q1;
#D = (release_lock,infty).(commit,infty).(ccu2dlm0,r_sgn0).(ccu2tm,r_sgn0).D;
#P = (lock_granted,infty).P1;
#P1 = (ccu2bm,r_sgn0).(bm2ccu,infty).(release_lock,r_sgn).(commit,r_commit).P;
#CCU = (Q <> D) <commit, release_lock, lock_granted> P;
#DLM0 = (ccu2dlm,infty).(dlm2ccu,r_gnt).DLM0;
#DLM1 = (ccu2dlm0,infty).(release,r_sgn).DLM1;
#DLM = DLM0 <> DLM1;
#BM = (ccu2bm,infty).(deliver,r_deliver).(bm2ccu,r_sgn0).BM;
#TM0 = (request,r_req).(tm2ccu,r_sgn0).TM0;
#TM1 = (ccu2tm,infty).(reply,r_reply).TM1;
#TM = TM0 <> TM1;
#PEPA = (TM <> BM <> DLM) <tm2ccu,ccu2tm,bm2ccu,ccu2bm,ccu2dlm,ccu2dlm0,dlm2ccu> CCU;
PEPA

Here the notation (α, r).REST is the basic mechanism by which the behaviour of a component is expressed. It denotes the fact that the component will carry out activity (α, r), which has action type α and a duration of mean 1/r, followed by the rest of the behaviour (REST). The notation P <L> Q, where P and Q are components and L is a set of action types, denotes the fact that P and Q proceed independently and concurrently with any activity whose action type is not contained in L. However, for any activity whose action type is included in L, components P and Q must synchronise to achieve the activity. These shared activities will only be enabled in P <L> Q when they are enabled in both P and Q, i.e. one component may be blocked waiting for the other component to be ready to participate. For this very simple configuration, a very simple benchmark has been adopted in which a single relation with tuple size of 100 bytes is scanned by a query performing a simple search. The page size is set to 1024 bytes and each page is assumed to be 70% full. Each fragment of the relation is assumed to consist of a single page. The average time to perform a disk read is taken as 41.11 milliseconds. Results for this very simple example are given in Table 1.
 Query arrival rate            Throughput (tps)
                      data in    data in    data in    data in
                      1 frag     2 frags    3 frags    4 frags
        5.0            4.996      4.962      4.816      4.512
       10.0            9.919      9.015      7.339      5.858
       25.0           20.418     11.924      8.058      6.059
      100.0           24.019     12.089      8.077      6.063
 STEADY Max
 throughput (tps)     24.324     12.162      8.108      6.081

Table 1. System performance of PEPA model and STEADY: throughput (in tps)
In order to solve such a model, the method reduces the model to a set of states. In the case of this very simple example, the number of states generated is 1028. However, when one considers slightly more complex models, such as configurations involving more than one processing element, the number of states
increases rapidly (approximately 810,000 states for two processing elements) and the computation required to solve it becomes too large. In order to handle such cases we are currently experimenting with the flow equivalent aggregation method [8] to decompose the models into manageable sub-models and find solutions to these. The results so far are encouraging although no complete solution is yet available.
5 Conclusions
This paper has discussed the verification of an analytical model (STEADY) for estimating the performance of parallel relational DBMSs. Two different methods of verification were investigated. The first approach was to apply simulation to a set of problems. The same problems were also solved using STEADY. The results of the comparison between STEADY and the simulation were mixed, as some compared very favourably whereas others require further investigation. Although the simulation alone is not sufficient to verify all aspects of STEADY, it has proved to be valuable in analysing, choosing and developing approximation algorithms for mathematical solutions of detailed analytical models. The second, more theoretical, method of verification that was investigated was a process algebra system known as Performance Evaluation Process Algebra (PEPA). This system was used to verify the components of STEADY which produce both the resource usage profiles and the first estimation of maximum system throughput. The system used to verify STEADY was a simple system as PEPA becomes unmanageable when more complex configurations are considered. Further work on PEPA is being carried out to handle more complex configurations, whilst keeping the system manageable. The processes described are intended as an initial stage in the verification of STEADY, being performed in parallel to validation against actual measured performance.
6 Acknowledgments
The authors acknowledge the support received from the Engineering and Physical Sciences Research Council under the PSTPA programme (GR/K40345) and from the Commission of the European Union under the Framework IV programme, for the Mercury project (ESPRIT IV 20089). They also wish to thank Arthur Fitzjohn of ICL and Shaoyu Zhou of Microsoft Corp. for their contribution to this work.
References
1. K. Hua, C. Lee, and H. Young. "An efficient load balancing strategy for shared-nothing database systems." Proceedings of DEXA'92 conference, Valencia, Spain, September 1992, pages 469-474.
2. M. B. Ibiza-Espiga and M. H. Williams. "Data placement strategy for a parallel database system." Proceedings of Database and Expert Systems Applications 92, Spain, 1992. Springer-Verlag, pages 48-54.
3. P. Watson and G. Catlow. "The architecture of the ICL GoldRush MegaSERVER." Proceedings of the 13th British National Conference on Databases (BNCOD 13), Manchester, U.K., July 1995, pages 250-262.
4. S. Zhou, N. Tomov, M.H. Williams, A. Burger, and H. Taylor. "Cache Modelling in a Performance Evaluator of Parallel Database Systems." Proceedings of the Fifth International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE Computer Society Press, January 1997, pages 46-50.
5. L. Kleinrock. "Queueing Systems, Volume 1: Theory." John Wiley & Sons, Inc., Canada, 1975.
6. J. Hillston. "A Compositional Approach for Performance Modelling." PhD Thesis, University of Edinburgh, 1994.
7. R. Milner. "Communication and Concurrency." Prentice Hall International, UK, 1989.
8. K. Kant. "Introduction to Computer System Performance Evaluation." McGraw-Hill Inc., Singapore, 1992.
Analysing a Multistreamed Superscalar Speculative Instruction Fetch Mechanism
Rafael R. dos Santos* and Philippe O. A. Navaux**
Informatics Institute CPGCC/Federal University of Rio Grande do Sul, P.O. Box 15064, 91501-970 Porto Alegre - RS, Brazil
Abstract. This work presents a new model for multistream speculative instruction fetch in superscalar architectures. The performance evaluation of a superscalar architecture with this feature is presented in order to validate the model and to compare its performance with a real superscalar architecture. This model intends to eliminate the instruction fetch latency introduced by branch instructions in superscalar pipelines. Finally, some considerations about the model are presented, as well as suggestions and remarks for future work.
1 Introduction
Even using accurate branch prediction mechanisms, current superscalar architectures offer less performance than an ideal architecture. When a branch is encountered, the branch predictor can predict it as taken or not taken. In the first case, contiguous instructions are already in the fetch stage when the prediction is made, and these instructions are not part of the predicted path. In the second case, nothing occurs if the prediction is correct. But for both cases, we are assuming that the branch prediction mechanism is efficient. The problem usually is that branches are predicted as taken, and each time this occurs a flow interruption also occurs. The time required to refill the fetch buffer and put the correct instructions into the instruction queue is greater than the time necessary to execute these instructions. If there are many functional units and flow interruptions occur frequently, it should be expected that the instruction queue will be empty for many cycles. This means that the instruction fetch mechanism must be designed to keep the instruction queue filled with instructions that can be scheduled for execution on free functional units. A superscalar architecture cannot execute instructions faster than it can fetch instructions from the instruction cache. So, designing an efficient fetch mechanism is more important than increasing the number of resources, because it is not
* Ph.D. Student - E-mail: [email protected] - UNISC - University of Santa Cruz do Sul - Santa Cruz - RS - Brazil
** Ph.D. - E-mail: [email protected] - CPGCC/Federal University of Rio Grande do Sul
possible to increase the performance only by increasing the number of functional units. The question is how many instructions could be fetched per cycle? Where are these instructions coming from? An alternative approach to branch prediction is, instead of predicting whether the branch is taken or not taken, to execute speculatively both the taken and not-taken paths and cancel the execution of the incorrect path as soon as the branch result is known [3].
2 Fetching instructions from multiple streams
Talcott [9] refers to a technique called Fetch Taken and not-Taken Paths to reduce the branch problem. In [1] a new model based on this scheme was presented. In this model, both paths of a conditional branch are fetched and put into the fetch buffer. When the branch prediction stage transfers instructions from the fetch buffer to the instruction queue and reaches a branch in the fetch buffer, both possible paths of this branch and their instructions are already in the fetch buffer. In this case, no delay is introduced when the branch is predicted as taken. The transfer is redirected to the predicted path without delay. To enable this operation, the fetch stage must detect the branch instruction just when it is fetched. In the next cycle, it must also start to fetch from both paths. When the branch is predicted, the transfer of instructions to the instruction queue is not interrupted. In the multistreamed superscalar architecture proposed in [6], the fetch stage was modified to enable fetching both paths of a branch instruction.
Fig. 1. Fetch Buffer Structure (e.g. fetch depth equal to 4 streams)

Figure 1 suggests a new buffer structure in the multistream pipeline. The number of stream buffers defines the fetch depth. Each stream has four independent elements: Program Counter, Status Bit, Children-List and Fetch Buffer. The fetched instructions are stored in the Fetch Buffer. The PC is used to point to the instructions that are in the stream. The status bit indicates whether the stream structure is busy, and the Children-List stores the identification of the children streams. The Children-List also stores the branch address and the identifier of its children streams, allowing the predict stage to start the transfer of instructions of the new stream when a branch is predicted, without any delay.
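A possible C rendering of this stream structure is given below; the field names, sizes and child-list layout are illustrative assumptions derived from the description, not the authors' actual design.

#include <stdint.h>
#include <stdbool.h>

#define FETCH_DEPTH   4     /* number of stream structures, as in Fig. 1       */
#define MAX_CHILDREN  4     /* branches tracked per stream (assumed)           */
#define BUF_INSTS     32    /* instructions held per fetch buffer (assumed)    */

/* One Children-List entry: a branch in this stream and the stream structure
   that holds the instructions of its taken path.                              */
typedef struct {
    uint32_t branch_addr;
    int      child_stream;
} ChildEntry;

/* One stream structure: Program Counter, status bit, Children-List and
   fetch buffer.                                                               */
typedef struct {
    uint32_t   pc;                       /* next address to fetch              */
    bool       busy;                     /* status bit: structure allocated    */
    ChildEntry children[MAX_CHILDREN];   /* Children-List                      */
    int        n_children;
    uint32_t   buffer[BUF_INSTS];        /* fetched instructions               */
    int        n_insts;
} Stream;

static Stream streams[FETCH_DEPTH];      /* fetch depth = 4 streams            */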
2.1 Multistream Architecture Operation
The fetch stage fetches instructions and puts them into the fetch buffer. When a conditional branch instruction is detected, a new stream is generated and initialized, provided that there are available resources. Stream generation consists of updating the Children List with the branch address and the identification of a new stream structure that will store the instructions related to the new stream. For each detected branch there are two possible paths. The instructions on the not-taken path are fetched and stored in the same stream structure that holds the conditional branch, whereas the taken path and its instructions are stored in the new stream structure. The stream structure initialization consists of setting the PC to the target address and setting the status bit to indicate that the structure has been allocated. The predict stage transfers instructions from the fetch buffer to the instruction queue (as in conventional superscalar architectures), looking for branch instructions. When it finds a branch instruction, it also makes a prediction. When the prediction is taken, this stage simply concatenates the instructions of the children stream of this branch, discarding the fall-through instructions. When the prediction is not taken, the children stream is discarded and the fall-through instructions continue to be transferred to the instruction queue; a sketch of this operation is given below. When a stream is discarded, all children streams originated by this stream are also discarded through a recursive operation.
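The following sketch (ours, not taken from [1] or [6]) illustrates the predict-stage bookkeeping just described, reusing the hypothetical Stream structure from the previous listing:

```python
def discard_stream(streams, sid):
    """Recursively free a stream and all of its children."""
    for _, child in streams[sid].children:
        discard_stream(streams, child)
    streams[sid].busy = False
    streams[sid].children.clear()
    streams[sid].buffer.clear()

def predict_branch(streams, sid, branch_pc, taken):
    """Return the stream from which the transfer continues after a prediction."""
    child = next((c for addr, c in streams[sid].children if addr == branch_pc), None)
    if child is None:
        return sid                 # taken path was not pre-fetched
    if taken:
        return child               # concatenate the child stream, no refill delay
    discard_stream(streams, child) # prediction is not taken: drop the child subtree
    return sid
```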
3 Experimental Framework
For this work, we used 4 benchmarks from the SPECint95 suite (compress, go, ijpeg, li). We also used an execution-driven simulator to generate traces. This simulator implements the SPARC V7 instruction set and simulates the execution in a scalar pipelined fashion; it is the reference machine. For the superscalar simulations, we used 2 trace-driven simulators which executed the chosen benchmarks based on the traces generated by the first simulator. These 2 simulators are respectively called Real and Mulflux, as described below:
- Real Simulator: simulates the execution of programs using two-level branch prediction [4, 5].
- Mulflux Simulator: simulates the execution of programs using the proposed speculative instruction fetch mechanism [6].
4 Performance Analysis
In this section, we analyze the performance of the multistreamed model based on the data extracted from several simulations. The experiments simulate
different configurations of the real machine and the multistreamed machine. The tests range from a fetchwidth of 2 up to 8 instructions per cycle. We can point out that in all situations the number of cycles with no dispatch is smaller than for the real machine. In the real machine, the no-dispatch percentage decreases with the increase of the fetchwidth, from 40.50% for a fetchwidth of 2 to 39.30% for a fetchwidth of 8 instructions, as shown in Figure 2. In the Mulflux machine, this percentage increases together with the fetchwidth, because mispredictions and resource conflicts increase as well. Other studies have been developed to investigate more accurate branch predictors and the ideal balancing of the architecture resources needed to support more parallelism through the pipeline. The sum of the three components which cause no dispatch in both machines corresponds to the percentage of cycles with no dispatch.
Fig. 2. No Dispatch in Mulflux with 2 Streams and Real Architecture

The difference between the components is important in the Real machine. The occurrence of an empty queue is the critical component contributing to no-dispatch cycles. This is not the case for the Mulflux machine. It is predictable that, when more instructions are ready to dispatch, the number of functional units becomes the main problem; this is what happens in the Mulflux architecture. The occurrence of an empty queue in the Real machine decreases from 30.05% with a fetchwidth of 2 to 21.79% with a fetchwidth of 8 instructions per cycle. In the Mulflux, the results obtained are 7.73% and 7.80% for the same configurations, thanks to the use of multiple streams.
Fig. 3. Components that Cause Emptying of the Queue in Mulflux with 2 Streams and Real Architecture
Figure 3 shows the efficiency of the multistream model in reducing the occurrence of an empty queue. In the Mulflux, the stream interruptions account for up to 5.24% of the cycles, while in the Real architecture this percentage reaches 27.07%. The multistreamed model enables a reduction of around 74.24% in the occurrence of an empty queue. The effect of stream interruptions was reduced by 80.64% in the Mulflux machine. Another important aspect is that resource conflicts and speculation depth must be considered with more attention when the multistream model is used.
4.1 Impact Analysis of Multistream Implementation
Before the implementation of the multistream model, we must consider the impact on the neighboring blocks, as well as the fetchwidth, resource conflicts and speculation depth. In the previous results we did not consider instruction cache misses. The use of the multistream model could generate more cache misses, and the fetch latency can become important, so the potential of the mechanism can be reduced. We therefore performed experiments using an instruction cache with the same delay cycles as the Intel Pentium in order to observe the performance under realistic conditions. In this cache, the miss latency is 3 cycles for the L1 cache, and 13 cycles when the miss causes an access to the L2 cache. Moreover, in the previous results we considered that the fetchwidth was multiplied by the number of valid stream structures: if the fetchwidth is 8 instructions and the number of valid streams (initialized structures) is 4, then the total, real fetchwidth is 32 instructions per cycle. To avoid problems with the cache size and its configuration, we have proposed a new strategy called Dynamic Split Fetch (DSF), which consists in splitting the total fetchwidth between the valid stream structures [6]. In this case, if there are 4 valid streams and the fetchwidth is 8 instructions per cycle, 2 instructions are fetched for each valid stream and the total fetchwidth is kept at 8 instructions; a sketch of this split is given below.
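A minimal sketch of the DSF split, assuming an even division of the fetchwidth among the valid streams (the exact policy of [6] may differ):

```python
def dsf_split(fetchwidth, valid_streams):
    """Split the total fetchwidth between the valid stream structures."""
    if not valid_streams:
        return {}
    share = max(1, fetchwidth // len(valid_streams))
    budget = fetchwidth
    plan = {}
    for sid in valid_streams:
        plan[sid] = min(share, budget)   # never exceed the total fetchwidth
        budget -= plan[sid]
    return plan

# Example: an 8-instruction fetchwidth split between 4 valid streams -> 2 each
print(dsf_split(8, [0, 1, 2, 3]))   # {0: 2, 1: 2, 2: 2, 3: 2}
```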
4.2 Overall Performance
In this section we discuss the overall performance delivered by the multistreamed mechanism and compare it with the performance of real architectures, observing the aspects discussed in the last section. Increasing the number of functional units delivers more speed-up for the multistreamed architecture, as shown in Figure 4. However, we can observe that the number of no-dispatch cycles increases even when the number of functional units increases. This is due to the speculation depth, which was kept constant for all configurations, with a single branch unit. The machine considered has n generic functional units and only one branch unit, which can stall the dispatch of instructions when saturated. This is shown in Figure 5.
Fig. 4. Multistream Speed-up when the Number of Functional Units Increases

Fig. 5. Components of No Dispatch when the Number of Functional Units Increases
We made several simulations varying the number of functional units of the multistreamed architecture. We wanted to show that the resource conflict is the main factor limiting the performance of this machine, as in an ideal machine. We performed experiments with 5 machines: a Multistream with perfect icache, a Multistream with normal icache, a Multistream with normal icache and DSF (Dynamic Split Fetch), a Real machine with perfect icache and a Real machine with normal icache. The results presented below come from the same configuration for each machine [6]. We used 2 stream structures for the multistream machines. All machines have a fetchwidth of 8 instructions per cycle, a fetch buffer and an instruction queue with 16 and 32 entries respectively, a dispatchwidth of 8 instructions, 8 functional units with 8 reservation stations each, 1 branch unit with 8 reservation stations, 8 result buses and a reorder buffer with 64 entries. Figure 6 shows the no-dispatch cycles spent by each machine. The best case is the multistream machine with perfect icache using a total fetchwidth obtained by multiplying the fetchwidth by the number of valid stream structures. We observe that the Mulflux with normal icache has a percentage of no-dispatch cycles similar to the Mulflux with normal icache using DSF: dividing the fetchwidth among the valid streams does not harm the performance in the considered cases. The Real machines spend more cycles with no dispatch. Figure 7 shows the components that cause no-dispatch cycles in each machine. In the multistream machines, the worst component is the resource conflict (around 45.00% of the no-dispatch cycles). In the Real machines the empty queue accounts for around 60% of the occurrences of no dispatch.
Fig. 6. Percentage of No Dispatch

In the second graph of Figure 7 we can also notice the reduction of the stream interruptions in the multistream machines.
Fig. 7. No Dispatch Components and Emptying Queue Components
5 Conclusions and Future Works
The superscalar speed-up is directly proportional to its IPC (Instructions Per Cycle). It is desirable to obtain a speed-up proportional to the number of functional units present in the architecture. However, the constant flow interruptions flush the instruction queue and decrease the number of ready instructions that could be dispatched. Thus, the IPC is reduced drastically because of the instruction queue flushing, and the desirable speed-up cannot be achieved. The multistreamed model allows a reduction of around 74.24% in the occurrence of an empty queue. The effect of stream breaks was reduced by 80.64% in the Mulflux machine. Another important aspect is that resource conflicts (structural hazards) and speculation depth must be considered carefully when the multistream model is used. In our experiments, the instruction cache showed similar performance in both cases, multistream and real architectures. Dynamic Split Fetch proved to be a worthwhile strategy that allows keeping the number of icache buses. Even though we reduce the occurrence of an empty instruction queue, we have verified that the number of no-dispatch cycles does not decrease as much as we wanted: in the Mulflux machine the no-dispatch cycles are between 28.99% and 33.11%, while in the Real machine they are between 39.30% and 40.50%. The increase of the instruction flow into the instruction queue generates more resource conflicts, as in the ideal architecture. Increasing the number of functional units while keeping the speculation depth did not give good results in our experiments. Because of this, we are looking for the resource balancing and the ideal speculation depth in multistreamed
architectures. The saturation of the branch unit can delay the execution of instructions. Some remarks about the next steps can be pointed out here. We intend to introduce new instruction cache mechanisms in our simulations; such mechanisms have been proposed and we believe that they will be used in the next generations of superscalar microprocessors. Also, after obtaining a good configuration of our architecture, we plan to compare it with other alternatives such as trace processors [8], simultaneous multithreading [2], multiscalar [7] and other alternatives which will certainly be suggested. Such a comparison is very important to get an idea of the potential of such an architecture and of its implementation complexity. The fourth generation of microprocessors cannot be predicted at the moment, but many new schemes have been proposed. The question is how to increase the performance of current microprocessors and how to obtain more machine parallelism while efficiently using the increasing chip density.
References
1. Chaves Filho, Eliseu M. et al. A Superscalar Architecture with Multiple Instruction Streams. In: SBAC-PAD, VIII, Recife, August 1996. Proceedings... SBC/UFPE, 1996, pp. 67-77 (in Portuguese).
2. Eggers, Susan J. et al. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, v.17, n.5, Sep/Oct 1997.
3. Fromm, Richard. Branching on Superscalar Machines: Speculative Execution of Multiple Branch Paths. Project Final Report, December 11, 1995. CS 252: Graduate Computer Architecture.
4. Yeh, Tse-Yu; Patt, Yale N. Two-Level Adaptive Training Branch Prediction. In: ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 24, 1991. Proceedings... New York: ACM, 1991. p. 51-61.
5. Yeh, Tse-Yu; Patt, Yale N. A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In: ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 20, 1993. Proceedings... New York: ACM, 1993. p. 257-266.
6. Santos, Rafael R. dos. A Mechanism for Multistreamed Speculative Instruction Fetch. Porto Alegre: CPGCC/UFRGS, 1997 (M.Sc. Thesis - in Portuguese).
7. Sohi, G. S.; Breach, S. E.; Vijaykumar, T. N. Multiscalar Processors. Computer Architecture News, New York, v.23, n.2, p. 414-425, 1995.
8. Smith, James E.; Vajapeyam, Sriram. Trace Processors: Moving to Fourth-Generation. Computer, Los Alamitos, v.30, n.9, p. 68-74, Sep. 1997.
9. Talcott, Adam R. Reducing the Impact of the Branch Problem in Superpipelined and Superscalar Processors. Santa Barbara: University of California, 1995 (Ph.D. Thesis).
Design of Processor Arrays for Real-Time Applications

Dirk Fimmel and Renate Merker

Department of Electrical Engineering *
Dresden University of Technology
{fimmel, merker}@ieel.et.tu-dresden.de
Abstract. This paper covers the design of processor arrays for algorithms with uniform dependencies. The design constraint is a limited latency of the resulting processor array. As objective of the design, the minimization of the cost of an implementation of the processor array in silicon is considered. Our approach starts with the determination of a set of proper linear allocation functions with respect to the number of processors. It is followed by the computation of a uniform affine scheduling function. Thereby, a module selection and the size of the partitions of a subsequent partitioning are determined. A proposed linearization of the arising optimization problems permits the application of integer linear programming.
1 Introduction
Processor arrays represent an appropriate kind of co-processor for time consuming algorithms. Especially in heterogeneous systems, dedicated hardware for parts of algorithms, e.g. for the motion estimation of MPEG, is required. An admissible latency Lg for the selected part of the algorithm can be derived from the time constraint of the entire algorithm. The objective of the design is a processor array with minimal hardware cost that meets the admissible latency Lg. In this paper, we consider algorithms with uniform dependencies. First, a set of linear allocation functions that lead to a small number of processors is computed. Then, for each allocation function a uniform affine scheduling function is determined. We assume that an LSGP (locally sequential and globally parallel) partitioning can be applied to the processor array. The size of the partitions is derived with respect to the constraint that the latency of the resulting processor array is less than Lg. Furthermore, the processor functionality, i.e. kind and number of modules that have to be implemented in each processor, is determined. The objective of the design is a minimization of hardware costs. We measure hardware costs by the chip area needed to implement modules and registers in silicon.
* The research was supported by the "Deutsche Forschungsgemeinschaft" in the project A1/SFB358.
Related work on the allocation function covers (1) a limited enumeration process to determine a linear allocation function leading to a minimal number of processors of the processor array [15], (2) the determination of a set of linear allocation functions that match a given interconnection network of the processor array [16], (3) the inclusion of a limited radius of the interconnections into the determination of the allocation function [14], and (4) the minimization of the chip area of the processor array by consideration of the processor functionality [3]. An approach to compute a variety of linear allocation and scheduling functions is proposed in [11]. Some notes on the determination of unconstrained minimal scheduling functions for algorithms with uniform dependencies can be found in [1]. Resource constrained scheduling for a given processor functionality is presented in [13]. An approach to optimize the throughput under consideration of the chip area is proposed in [9]. In [2] the approach of [13] is extended to additionally determine the processor functionality in order to minimize a chip area-latency product. The paper is organized as follows. Basics of the design of processor arrays are given in Section 2. In Section 3 the hardware constraints considered in this paper are introduced. A linear program to determine a set of linear allocation functions is presented in Section 4. Section 5 covers the determination of scheduling functions; a linear programming approach is presented in detail. Finally, a short conclusion is given in Section 6.
2 Design of Processor Arrays
In this paper we restrict our attention to the class of algorithms that can be described as systems of uniform recurrence equations (SURE) [5].

Definition 1 (System of uniform recurrence equations). A system of uniform recurrence equations is a set of equations $S_i$ of the following form:

$$S_i:\quad y_i[\mathbf{i}] = F_i(\ldots, y_j[f_{ij}^k(\mathbf{i})], \ldots), \qquad \mathbf{i} \in \mathcal{I},\quad 1 \le i,j \le m, \qquad (1)$$
where $\mathbf{i} \in \mathbb{Z}^n$ is an index vector, $f_{ij}^k(\mathbf{i}) = \mathbf{i} - \mathbf{d}_{ij}^k$ are index functions, the constant vectors $\mathbf{d}_{ij}^k \in \mathbb{Z}^n$ are called dependence vectors, $y_i$ are indexed variables and $F_i$ are arbitrary single valued operations. All equations are defined in the index space $\mathcal{I}$ being a polytope $\mathcal{I} = \{\mathbf{i} \mid H_i\mathbf{i} \ge \mathbf{h}_{i0}\}$, $H_i \in \mathbb{Q}^{m_i \times n}$, $\mathbf{h}_{i0} \in \mathbb{Q}^{m_i}$. We suppose that the SURE has a single assignment form (every instance of a variable $y_i$ is defined only once in the algorithm) and that there exists a partial order of the instances of the equations that satisfies the data dependencies. Next, we introduce a graph representation of the data dependencies of the SURE.

Definition 2 (Reduced dependence graph (RDG)). The equations of the SURE build the $m$ nodes $v_i \in V$ of the reduced dependence graph $(V, \mathcal{E})$. The directed edges $e = (v_i, v_j) \in \mathcal{E}$ are the data dependencies weighted by the dependence vectors $d(e) = \mathbf{d}_{ij}^k$. Source and sink of an edge $e \in \mathcal{E}$ are called $\sigma(e)$ and $\delta(e)$ respectively.
The main task of the design of processor arrays is the determination of the time when and the processor where each instance of the equations of the SURE has to be evaluated. In order to keep the regularity of the algorithm in the resulting processor array, we apply only uniform affine mappings [8] to the SURE.

Definition 3 (Uniform affine scheduling). A uniform affine scheduling function $\tau_i(\mathbf{i})$ assigns an evaluation time to each instance of the equations:

$$\tau_i: \mathbb{Z}^n \to \mathbb{Z}:\quad \tau_i(\mathbf{i}) = \boldsymbol{\tau}^T\mathbf{i} + t_i, \qquad 1 \le i \le m, \qquad (2)$$
where $\boldsymbol{\tau} \in \mathbb{Z}^n$, $t_i \in \mathbb{Z}$.

Definition 4 (Linear processor allocation). A linear allocation function $\pi(\mathbf{i})$ assigns an evaluation processor to each instance of the equations:

$$\pi: \mathbb{Z}^n \to \mathbb{Z}^{n-1}:\quad \pi(\mathbf{i}) = S\mathbf{i}, \qquad (3)$$
where $S \in \mathbb{Z}^{(n-1)\times n}$ is of full row rank. Since $S$ is of full row rank, the vector $\mathbf{u} \in \mathbb{Z}^n$ which is coprime and satisfies $S\mathbf{u} = \mathbf{0}$ and $\mathbf{u} \ne \mathbf{0}$ is uniquely defined and is called the projection vector. Because of the lack of space we refer to [2, 3] for a treatment of uniform affine allocation functions $\pi_i: \mathbb{Z}^n \to \mathbb{Z}^{n-1}: \pi_i(\mathbf{i}) = S\mathbf{i} + \mathbf{p}_i$, $1 \le i \le m$. The importance of the projection vector is due to the fact that those and only those index points of an index space lying on a line spanned by the projection vector $\mathbf{u}$ are mapped onto the same processor. Due to the regularity of the index space and the uniform affine scheduling function, the processor executes the operations associated with those index points one after the other if $\boldsymbol{\tau}^T\mathbf{u} \ne 0$, with a constant time distance $\lambda = |\boldsymbol{\tau}^T\mathbf{u}|$ which is called the iteration interval. The application of a scheduling and an allocation function to a SURE results in a so-called fullsize array.
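As an illustration of Definitions 3 and 4 (our own sketch, not part of the original design flow; the concrete vectors below are placeholder values):

```python
def schedule(tau, t, i):
    """Uniform affine schedule: tau^T * i + t (Definition 3)."""
    return sum(a * b for a, b in zip(tau, i)) + t

def allocate(S, i):
    """Linear allocation: processor coordinates S * i (Definition 4)."""
    return tuple(sum(a * b for a, b in zip(row, i)) for row in S)

# Placeholder example in n = 2 dimensions:
tau, t = (1, 2), 0          # assumed scheduling vector and offset
S = ((1, 0),)               # allocation matrix with projection vector u = (0, 1)
u = (0, 1)
lam = abs(sum(a * b for a, b in zip(tau, u)))   # iteration interval |tau^T u|

for i in [(1, 1), (1, 2), (2, 1)]:
    print(i, "-> processor", allocate(S, i), "at time", schedule(tau, t, i))
print("iteration interval:", lam)
```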
3 Hardware Description
We consider a given set $\mathcal{M}$ of modules which are responsible for evaluating the operations of a processor. Instead of assuming given processors, we want to determine the modules which realize the operations of the processors. First, we introduce some measures needed to describe the modules. To each module $m_l \in \mathcal{M}$ we assign an evaluation time $d_l \in \mathbb{Z}$ in clock cycles needed to execute the operation of module $m_l$, the chip area $c_l \in \mathbb{Z}$ needed to implement the module in silicon and the number $n_l \in \mathbb{Z}$ of instances of that module which are implemented in one processor. If a module $m_l \in \mathcal{M}$ has a pipeline architecture we assign a time offset $o_l$ to that module which determines the time delay after which the next computation can be started on this module; otherwise $o_l = d_l$. Some modules are able to compute different operations, e.g. a multiplication unit is also able to compute an addition. To such modules, different delays $d_{li}$ and offsets $o_{li}$ depending on the operations $F_i$ are assigned.
The assignment of a module $m_l \in \mathcal{M}$ to an operation $F_i$ is denoted by $m(i)$, and the set of modules which are able to perform the operation $F_i$ is $\mathcal{M}_i$. The addressing of the instance of the module $m(i)$ which performs the operation $F_i$ is given by $u_i \in \mathbb{Z}$.
4 Determination of Allocation Functions
The allocation function maps each index vector $\mathbf{i} \in \mathcal{I}$ to a processor $\mathbf{p}$ of the processor space $\mathcal{P} = \{\mathbf{p} \mid \mathbf{p} = S\mathbf{i} \wedge \mathbf{i} \in \mathcal{I}\}$. Our aim is the determination of a set of proper linear allocation functions that lead to processor spaces with a small number of processors. In our approach we approximate the number of processors of the processor space $\mathcal{P}$ by the number of processors of the enclosing constant bounded polytope (cb-polytope) $\mathcal{Q} = \{\mathbf{p} \mid \mathbf{p}_{min} \le \mathbf{p} \le \mathbf{p}_{max}\}$ of $\mathcal{P}$, where $\mathcal{P} \subseteq \mathcal{Q}$ and each face of $\mathcal{Q}$ intersects $\mathcal{P}$ in at least one point. The consideration of the cb-polytope $\mathcal{Q}$ allows the formulation of the search for allocation functions as a linear optimization problem.

Program 1 (Determination of allocation functions)
for j = 1 to n-1
  minimize   $v_j^1 - v_j^2 + 1$,
  subject to $v_j^2 \le \mathbf{s}_j^T\mathbf{w}_l \le v_j^1$, $1 \le l \le |\mathcal{W}|$,
             $\mathbf{s}_j^T\mathbf{b}_k + (1 - r_k)R \ge 1$, $1 \le k \le n-j+1$, (4.1)
             $\sum_{k=1}^{n-j+1} r_k = 1$, $r_k \in \{0,1\}$,
             $v_j^1, v_j^2 \in \mathbb{Z}$, $\mathbf{s}_j \in \mathbb{Z}^n$, (4)
end for
where $\mathcal{W}$ is the set of vertices $\mathbf{w}_l$ of $\mathcal{I}$, $R$ is a sufficiently large constant and the vectors $\mathbf{b}_k$, $1 \le k \le n-j+1$, span the right null space of the matrix $(\mathbf{s}_1, \ldots, \mathbf{s}_{j-1})^T$. Constraint (4.1) is replaced by $\mathbf{s}_1 \ne \mathbf{0}$ for $j = 1$. Constraint (4.1) ensures that the vectors $\mathbf{s}_j$, $1 \le j \le n-1$, are linearly independent, i.e. that $\mathrm{rank}(S) = n-1$. The constant $R$ has to fulfill $R \ge \max_{1 \le k \le n-j+1}\{|\mathbf{s}_j^T\mathbf{b}_k|\}$.
The number of processors of the cb-polytope $\mathcal{Q}$ is $\prod_{j=1}^{n-1}(v_j^1 - v_j^2 + 1)$.
A motivation of our approach is given in the following theorem.

Theorem 1 (Number of processors of an enclosing cb-polytope). If $N$ ($N'$) is the number of processors of the enclosing cb-polytope of the processor space resulting after application of the allocation function defined in Program 1 (of another arbitrary linear allocation function), then $N \le N'$.
Since we are interested in several allocation functions, we replace constraint (4.1) in Program 1 for $j = 1$ by $\mathbf{s}_1^T\mathbf{u}_l \ne 0$, $1 \le l < i$, in order to determine the $i$-th allocation function, where the $\mathbf{u}_l$ are the projection vectors of the previously determined allocation functions.
Example 1. We consider a part of the GSM speech codec system as example. The considered algorithm consists of two equations.
I.  $y_1[i,k] = y_1[i-1,k] + r[i]\,y_2[i-1,k-1]$, $(i,k)^T \in \mathcal{I}$,
II. $y_2[i,k] = y_2[i-1,k-1] + r[i]\,y_1[i-1,k]$, $(i,k)^T \in \mathcal{I}$,

with the index space $\mathcal{I} = \{(i,k)^T \mid 1 \le i \le 8 \wedge 1 \le k \le 120\}$.
The index space with the data dependencies as well as the reduced dependence graph are depicted in Fig. 1.
Fig. 1. Index spaces with data dependencies and reduced dependence graph
The dependence vectors are: I → I: $\mathbf{d}_1 = (1,0)^T$, II → I: $\mathbf{d}_2 = (1,1)^T$, I → II: $\mathbf{d}_3 = (1,0)^T$, II → II: $\mathbf{d}_4 = (1,1)^T$. The set of vertices $\mathcal{W}$ of the index space $\mathcal{I}$ is $\mathcal{W} = \{(1,1), (8,1), (1,120), (8,120)\}$. Application of Program 1 leads to the matrices $S_1 = (1,0)$ and $S_2 = (0,1)$, and hence to the projection vectors $\mathbf{u}_1 = (0,1)^T$ and $\mathbf{u}_2 = (1,0)^T$ respectively. The next section covers the determination of a scheduling function and the processor functionality with respect to each projection vector $\mathbf{u}_i$ computed in Program 1.
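For this small example, the objective of Program 1 can be checked by brute force: for a candidate row $\mathbf{s}_j$, the width $v_j^1 - v_j^2 + 1$ follows from the extreme values of $\mathbf{s}_j^T\mathbf{w}$ over the vertices. The sketch below (ours, for illustration only) does exactly that:

```python
# Vertices of the index space of Example 1
W = [(1, 1), (8, 1), (1, 120), (8, 120)]

def width(s, vertices):
    """v^1 - v^2 + 1 for a candidate allocation row s, evaluated on the vertices."""
    vals = [s[0] * w[0] + s[1] * w[1] for w in vertices]
    return max(vals) - min(vals) + 1

# The rows found by Program 1 and a few alternatives for comparison
for s in [(1, 0), (0, 1), (1, 1), (1, -1)]:
    print("s =", s, "-> processors in the cb-polytope:", width(s, W))
# s = (1, 0) gives 8 processors, s = (0, 1) gives 120, s = (1, 1) gives 127, ...
```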
5 Determination of a Scheduling Function

5.1 General Optimization Problem
In this section we propose an approach to determine a scheduling function as well as a module selection and a conflict-free assignment of the modules to the operations of the SURE. Our objective is the minimization of the cost of a hardware implementation of the processor array, subject to the condition that a given latency Lg is satisfied. A scheduling function is determined for each linear allocation function resulting after the application of Program 1. First, we present a general description of the optimization problem and then go into more detail in the following paragraphs.
Program 2 (Determination of a scheduling function)
  minimize   the chip area C of the processor array
  subject to latency L of the processor array: L ≤ Lg
             causal scheduling function
             selection of modules
             conflict-free access to the modules

5.2 Objective Function
The number of processors $N_p$ is known exactly since the allocation function is fixed at this step. In each processor of the fullsize array we have to implement $n_l$ instances of module $m_l$. The chip area needed to implement the modules of all processors of the fullsize array is therefore $C_f = N_p \sum_{l=1}^{|\mathcal{M}|} n_l c_l$.
Additional chip area is needed to implement (1) registers to store intermediate results, (2) combinatorial logic to control the behaviour of the processors and (3) interconnections between the processors. The costs for the control logic are assumed to be negligible. The number of registers depends on the number of processors as well as the number of data produced in each processor. We approximate the chip area $C_r$ needed to implement the registers in silicon by $C_r = N_p\, m\, c_r$, where $c_r$ is the chip area of one register and $m$ is the number of equations of the SURE. Since the number of processors is independent of the scheduling function and the module selection, $C_r$ can be treated as a constant. An assessment of the effort for the implementation of the interconnections is difficult. Instead of measuring the chip area of interconnections, we propose a minimization of the length or a limitation of the radius of the interconnections, respectively, while determining the allocation function. As a consequence of the above discussion we conclude that it is sufficient to consider the chip area $C_f$ needed to implement the modules as a measure of the hardware costs. Now, we assume that the admissible latency $L_g$ enables a slow down of the fullsize array. Hence, we can decrease the hardware costs by partitioning the fullsize array. We apply the LSGP-partitioning [4] (tiling) by a factor of $K$, i.e. each partition contains $K$ processors of the fullsize array. Each partition represents a processor of the resulting processor array. We assume that the partitioning matches the fullsize array exactly and we neglect the additional hardware needed to control the partitions. Hence, we get the hardware costs of the resulting processor array $C_p = C_f/K$. Since the number of processors $N_p$ is fixed, it is sufficient to consider the chip area $C_f'$ of only one processor of the fullsize array. Our objective is to minimize $C_p' = C_f'/K$, where $C_p = N_p C_p'$ and $C_f = N_p C_f'$.
5.3 Admissible Latency Lg
The latency $L_f$ of the fullsize array is given by $L_f = t^{max} - t^{min}$, where

$$t^{min} \le \boldsymbol{\tau}^T\mathbf{w} + t_i \le t^{max} - d_{m(i),i}, \qquad \forall\, \mathbf{w} \in \mathcal{W},\quad 1 \le i \le m,$$
where $t^{min}, t^{max} \in \mathbb{Z}$ and $\mathcal{W}$ is the set of vertices $\mathbf{w}$ of the index space $\mathcal{I}$. The consideration of only the vertices of $\mathcal{I}$ is justified by the fact that linear functions attain their extreme values at extreme points of a convex space [10]. The relaxation to integer values has only a small effect for sufficiently large index spaces $\mathcal{I}$. The assumption that $L_g \gg L_f$ enables a partitioning of the fullsize array by a factor $K$, where $K L_f \le L_g$. This inequality is a sufficient condition since we can presume that a partitioning exists where the latency $L$ of the resulting processor array satisfies $L \le K L_f$. An intuitive justification yields the opposite case, where the distribution of a sequential program to $K$ processors leads to a speed-up less than or equal to $K$. Next, we linearize the constraints $C_f' = K C_p'$ and $K L_f \le L_g$. The minimal latency $L_{min}$ of a fullsize array is easy to determine by assuming unlimited resources and assigning the fastest possible module to each operation of the SURE. Using $L_{min}$ we can limit $K$ by $1 \le K \le K_{max} = \lfloor L_g/L_{min} \rfloor$. The lower bound of $K$ can be increased by considering the minimal hardware costs for one processor $C_{min}' = \min\{\sum_{l=1}^{|\mathcal{M}|} n_l c_l\}$ satisfying $\exists\, m_l \in \mathcal{M}_i.\ n_l \ge 1$, $1 \le i \le m$.
The application of a resource-constrained scheduling [13] with the modules leading to $C_{min}'$ yields $L_{max}$ and $K_{min} = \max\{1, \lceil L_g/L_{max} \rceil\}$. The introduction of $K_{max} - K_{min} + 1$ binary variables $\gamma_j \in \{0,1\}$ enables an equivalent formulation of the constraints $C_f' = K C_p'$ and $K L_f \le L_g$ as follows:
$$C_f' \le j\, C_p' + (1 - \gamma_j)\, C_{max}', \qquad K_{min} \le j \le K_{max},$$
$$j L_g \ge j K_{min} L_f + \gamma_j (j - K_{min}) L_g, \qquad K_{min} \le j \le K_{max},$$
$$\sum_{j=K_{min}}^{K_{max}} \gamma_j = 1, \qquad (5)$$

where $C_{max}' = \sum_{l=1}^{|\mathcal{M}|} n_l^{max} c_l$, and $n_l^{max}$ is the maximal number of instances of
module $m_l$ which can be implemented in one processor of the fullsize array. A reduction of the number of variables $\gamma_j$ from $K_{max} - K_{min} + 1$ to $\lceil (K_{max} - K_{min} + 1)/z \rceil$, $z \in \mathbb{Z}$, is possible by solving the optimization problem iteratively. In the first iteration only such $j$, $K_{min} \le j \le K_{max}$, are considered that satisfy $j \bmod z = a$, where $a = K_{min} \bmod z$. The second iteration of the optimization problem is solved for $j_{max} - z < j < j_{max} + z$, where $j_{max}$ is the solution of the first iteration.
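A small sketch of how the bounds on the partitioning factor K can be computed from the latencies (illustrative only; the values of L_g, L_min and L_max below are placeholders):

```python
from math import floor, ceil

def k_bounds(L_g, L_min, L_max):
    """Bounds on the LSGP partitioning factor K derived in Section 5.3."""
    K_max = floor(L_g / L_min)          # from 1 <= K <= floor(Lg / Lmin)
    K_min = max(1, ceil(L_g / L_max))   # from the resource-constrained schedule
    return K_min, K_max

# Placeholder numbers: admissible latency 1200, fastest/slowest fullsize latencies
print(k_bounds(1200, 10, 80))   # -> (15, 120)
```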
5.4 Causality Constraint
In order to ensure a valid partial order of the equations preserving the data dependencies, the scheduling function has to satisfy the causality constraint.
Definition 5 (Causality constraint). A scheduling function $\tau_i(\mathbf{i}) = \boldsymbol{\tau}^T\mathbf{i} + t_i$ has to satisfy the following constraint:

$$\boldsymbol{\tau}^T d(e) + t_{\delta(e)} - t_{\sigma(e)} \ge d_{m(\sigma(e)),\sigma(e)}, \qquad \forall\, e \in \mathcal{E}. \qquad (6)$$
5.5 Module Selection and Prevention of Access Conflicts
The assignment of a module to the operation $F_j$ is described by binary variables $\rho_j^k \in \{0,1\}$, where $\sum_{m_k \in \mathcal{M}_j} \rho_j^k = 1$ and $\rho_j^k = 1 \Leftrightarrow m(j) = m_k$, $1 \le j \le m$.
The following resource constraint prevents access conflicts to the module.
Definition 6 (Resource constraint). For a given projection vector $\mathbf{u}$, a uniform affine scheduling function $\tau_i(\mathbf{i}) = \boldsymbol{\tau}^T\mathbf{i} + t_i$ has to satisfy the following constraint:

$$(t_j \bmod \lambda) - (t_k \bmod \lambda) \ge o_{m(j),k},\quad \lambda - (t_j \bmod \lambda) + (t_k \bmod \lambda) \ge o_{m(j),j}, \qquad \text{if } (t_j \bmod \lambda) > (t_k \bmod \lambda),$$
$$(t_k \bmod \lambda) - (t_j \bmod \lambda) \ge o_{m(j),j},\quad \lambda - (t_k \bmod \lambda) + (t_j \bmod \lambda) \ge o_{m(j),k}, \qquad \text{if } (t_j \bmod \lambda) \le (t_k \bmod \lambda), \qquad (7)$$
for all $j \ne k$, $1 \le j,k \le m$, with $m(j) = m(k)$ and $u_j = u_k$, where $\lambda = |\boldsymbol{\tau}^T\mathbf{u}|$. We refer to [2] for an explanation and a linearization of (7).

5.6 Result of the Optimization Problem
The result of the optimization problem in Program 2 is the set of modules that have to be implemented in each processor and the number of processors of the fullsize array building one partition, and hence one processor, of the resulting processor array. Program 2 is solved for each projection vector computed in Program 1. Then we select the projection vector $\mathbf{u}_i$ leading to minimal hardware costs
$$C = C_p + C_r = N_p\Big(\frac{1}{K}\sum_{l=1}^{|\mathcal{M}|} n_l c_l + m\, c_r\Big).$$
Example 2 (continuation of Example 1). The admissible latency $L_g$ is measured in clock cycles and is supposed to be $L_g = 1200$. The considered set of modules is listed in Table 1. We solve the optimization problem of Program 2 separately for the projection vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ determined in Program 1. In order to justify our approach we present some interesting solutions in Table 2 instead of only the best one given by the solver of the optimization problem.

Table 1. Set of modules
        operation   evaluation time d_l   time offset o_l   chip area c_l
                    (clock cycles)        (clock cycles)    (normalized)
  m1    add/mult    3                     3                 297
  m2    add/mult    4                     4                 169
  m3    add/mult    8                     8                 44
  m_r   reg                                                 1
Table 2. Results of the optimization problem (columns: projection vector, Np, Cr = Np m cr, module selection, K, Cp = (Np/K) Σ_l n_l c_l, C = Cr + Cp). For u1 = (0,1)^T, Np = 8 and Cr = 16; the three reported solutions yield Cp = 704, 1352 and 1584 and hence C = 720, 1368 and 1600. For u2 = (1,0)^T, Np = 120 and Cr = 240; the reported solutions yield Cp = 586, 1096 and 1425 and hence C = 826, 1336 and 1665.
Minimal hardware costs occur by using projection vector $\mathbf{u}_1$ and implementing one instance of module $m_3$ in each processor of the resulting processor array. Finally, we want to give some short notes on the computational effort of our example. The worst case is Program 2 with respect to the projection vector $\mathbf{u}_1$: the program consists of 168 constraints with 51 binary, 14 integer and one rational variable. The solution takes 11.3 seconds of CPU time on a SUN SPARC Station 10.

5.7 Selection of Projection Vectors
An alternative approach to the separate computation of a scheduling function for each projection vector consists in consideration of the projection vectors ui as parameters in program 2. We assume that the iteration interval A < Am~x. A linearization of the constraint A = [TTu[ is achievable using four inequalities and one binary variable v. A - - 2VAmax < "rTll < )% A-2(1-v)Ama~ <--TTu<_A,
A E Z, r E { O , 1}.
(8)
Suppose, that we have to select one of P projection vectors ui, 1 < i < P. Using P binary variables ai, 1 < i < P, we replace (8) by: A - 2vAma~: - 2 ( 1 -
2(1 -
-
ai)Amaz <
2(1 -
rTui
<
<_
+ (1 -
l < i < P, .
{0,1},
P
cq=l,
aiC{0,1},
I
i----.1
The projection vector $\mathbf{u}_i$ is selected if $\alpha_i = 1$. Furthermore, we have to include the hardware costs for the registers. The objective function is changed to minimize the hardware costs $C$ of the resulting processor array. New constraints have to be introduced to take the different numbers of processors $N_p^i$ of the fullsize arrays with respect to the projection vectors $\mathbf{u}_i$ into account:

$$C \ge N_p^i\,(C_p' + m\,c_r) - (1 - \alpha_i)\, N_p^i\,(C_{max}' + m\,c_r), \qquad 1 \le i \le P.$$
We do not recommend this approach since the condition of the integer linear program deteriorates strongly.
6 Conclusion
The presented approach is suitable to derive cost minimal processor arrays for algorithms with the requirement of an admissible latency. The arising optimization problems are given in a linearized form which permits the use of standard packages to solve the problems. Possible extensions to our approach are the inclusion of the power consumption and the inclusion of a rough approximation of the effort needed to implement the interconnections.
References
1. A. Darte, Y. Robert: "Constructive Methods for Scheduling Uniform Loop Nests", IEEE Trans. on Parallel and Distributed Systems, Vol. 5, No. 8, pp. 814-822, 1994
2. D. Fimmel, R. Merker: "Determination of the Processor Functionality in the Design of Processor Arrays", Proc. Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 199-208, Zürich, 1997
3. D. Fimmel, R. Merker: "Determination of an Optimal Processor Allocation in the Design of Massively Parallel Processor Arrays", Proc. Int. Conf. on Algorithms and Parallel Processing, pp. 309-322, Melbourne, 1997
4. K. Jainandunsing: "Optimal Partitioning Scheme for Wavefront/Systolic Array Processors", IEEE Proc. Symp. on Circuits and Systems, 1986
5. R.M. Karp, R.E. Miller, S. Winograd: "The organization of computations for uniform recurrence equations", J. of the ACM, vol. 14, pp. 563-590, 1967
6. D.I. Moldovan: "On the Design of Algorithms for VLSI Systolic Arrays", Proceedings of the IEEE, pp. 113-120, January 1983
7. P. Quinton: "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations", IEEE 11-th Int. Symp. on Computer Architecture, Ann Arbor, pp. 208-214, 1984
8. S.K. Rao: "Regular Iterative Algorithms and their Implementations on Processor Arrays", PhD thesis, Stanford University, 1985
9. J. Rossel, F. Catthoor, H. De Man: "Extension to Linear Mapping for Regular Arrays with Complex Processing Elements", Proc. Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 156-167, Princeton, 1990
10. A. Schrijver: Theory of Linear and Integer Programming, John Wiley & Sons, New York, 1986
11. A. Schubert, R. Merker: "Systolization of Recursive Algorithms with DESA", in Proc. 5th Int. Workshop Parcella '90, Mathematical Research, G. Wolf, T. Legendi, U. Schendel (eds.), vol. 2, Akademie-Verlag Berlin, pp. 267-276, 1990
12. J. Teich: "A Compiler for Application-Specific Processor Arrays", PhD thesis, Univ. of Saarland, Verlag Shaker, Aachen, 1993
13. L. Thiele: "Resource Constraint Scheduling of Uniform Algorithms", Int. Journal on VLSI and Signal Processing, Vol. 10, pp. 295-310, 1995
14. Y. Wong, J.M. Delosme: "Optimal Systolic Implementation of n-dimensional Recurrences", Proc. ICCD, pp. 618-621, 1985
15. Y. Wong, J.M. Delosme: "Optimization of Processor Count for Systolic Arrays", Research Report YALEU/DCS/RR-697, Yale Univ., 1989
16. X. Zhong, S. Rajopadhye, I. Wong: "Systematic Generation of Linear Allocation Functions in Systolic Array Design", J. of VLSI Signal Processing, Vol. 4, pp. 279-293, 1992
Interval Routing & Layered Cross Product: Compact Routing Schemes for Butterflies, Mesh of Trees and Fat Trees

Tiziana Calamoneri 1 and Miriam Di Ianni 2

1 Dip. di Scienze dell'Informazione, Università di Roma "la Sapienza", via Salaria 113, I-00198 Roma, Italy. [email protected]
2 Istituto di Elettronica, Università di Perugia, via G. Duranti 1/A, I-06123 Perugia, Italy. [email protected]
Abstract. In this paper we propose compact routing schemes having space and time complexities comparable to a 2-Interval Routing Scheme for the class of networks decomposable as the Layered Cross Product (LCP) of rooted trees. As a consequence, we are able to design a 2-Interval Routing Scheme for butterflies, meshes of trees and fat trees using a fast local routing algorithm. Finally, we show that a compact routing scheme for networks which are the LCP of general graphs cannot be found by using only shortest paths information on the factors.
1 Introduction
The information needed to route messages in parallel and distributed systems must be somehow stored in each node of the network. The simplest solution consists of a complete routing table, stored in each node u, that specifies for each destination v at least one link incident to u and lying on a path from u to v. The required space of such a solution is O(n log δ), where δ is the node degree and n the number of nodes in the network. Efficiency considerations lead to store shortest paths information. For a better and fair use of network resources, one should aim at storing, for each entry of the routing table, as many outgoing links as necessary to describe all shortest paths in the network. Due to limited storage space at each processor, a linear increase of the routing table size in n is not acceptable. Research has then focused on identifying classes of network topologies whose shortest paths information can be succinctly stored, assuming that some "short" labels can be assigned to nodes and links at preprocessing time. In the Interval Routing Scheme (in short, IRS) [7, 12], node-labels belong to the set {1, ..., n}, while link-labels are pairs of node-labels representing cyclic intervals of [1, ..., n]. A message with destination v arriving at a node u is sent by u onto an incident link whose label [v1, v2] is such that v ∈ [v1, v2]. Such an approach allows one to achieve an efficient memory occupation. An IRS is said
optimum if the route traversed by each message is a shortest path from its source to its destination. It is said overall optimum if a message can be routed along any shortest path. In [6, 12] optimum IRSs have been designed for particular network topologies. In [3, 11, 13] the existence of networks that do not admit any optimum IRS has been proved. Multi-label Interval Routing Schemes were introduced [7] to extend the model in order to allow more than one interval to be associated with each link: a k-IRS is a scheme associating at most k intervals to each link. A message whose destination is node v is sent onto a link labeled (I1, ..., Ik) if v ∈ Ii for some 1 ≤ i ≤ k. In [2] a technique for proving lower bounds on the minimum k was developed, and in [4] it has been used to construct n-node networks for which any optimal k-IRS requires k = O(n). It was proved that for some well known interconnection networks, such as the shuffle-exchange, cube connected cycles, butterfly and star graph, each optimal k-IRS requires k = Ω(n^{1/2-ε}), for proper values of ε, to store one shortest path for each pair [5]. Of course, this lower bound still holds to store any shortest path. In this paper, after providing the necessary preliminary definitions (Section 2), we propose overall optimum compact routing schemes (Section 3) based on the same leading idea as Multi-label Interval Routing for all networks which are the Layered Cross Product (LCP) [1] of rooted trees (in short, T-networks). For many commonly used interconnection networks falling into this definition no overall optimum compact routing scheme was known. Among them we recall three widely studied topologies: butterflies, meshes of trees and fat trees. Our compact routing scheme requires as much space and time as required by a 2-IRS. The achievement is particularly meaningful for butterflies because of the result in [5]. Finally, in Section 4, we give a negative result by proving that the knowledge of shortest paths on the factors of a network may not be enough to compute shortest paths on it.
2 Definitions and Preliminary Results
Point-to-point communication networks are usually represented by graphs, whose nodes stand for processors and edges for communication links. We always represent each edge {u, v} by the pair of (oriented) arcs (u, v) and (v, u). An l-layered graph G = (V^1, V^2, ..., V^l, E) consists of l layers of nodes; V^i is the (non-empty) set of nodes in layer i, 1 ≤ i ≤ l; every edge in E connects vertices of two adjacent layers. In particular, a rooted tree T of height h is an h-layered graph, layer i being defined either as the set of nodes having distance i-1 from the root or as the set of nodes having distance h-i from the root. From now on, we call T a root-tree or a leaf-tree according to whether the first or the second way of defining layers is chosen. In Fig. 1, T1 is a root-tree, while T2 is a leaf-tree. Let G1 = (V_1^1, V_1^2, ..., V_1^l, E1) and G2 = (V_2^1, V_2^2, ..., V_2^l, E2) be two l-layered graphs. Their Layered Cross Product (LCP for short) [1] G1 × G2 is an l-layered graph G = (V^1, V^2, ..., V^l, E) where V^i is the cartesian product
of V_1^i and V_2^i, 1 ≤ i ≤ l, and a link ((a, α), (b, β)) belongs to E if and only if (a, b) ∈ E1 and (α, β) ∈ E2. Many common networks are LCPs of trees [1]. Among them, the butterfly with N inputs and N outputs is the LCP of two N-leaves complete binary trees (Fig. 1.a), the mesh of trees of size 2N is the LCP of two N-leaves complete binary trees with paths of length log N attached to their leaves (Fig. 1.b), and the fat tree of height h [10] is the LCP of a complete binary tree and a complete quaternary tree, both of height h (Fig. 1.c).
Fig. 1. Butterfly, Mesh of Trees and Fat-tree as LCP of rooted trees.
Fact 21. Let (a, α), (b, β) ∈ V(G1 × G2). Any shortest path from (a, α) to (b, β) is never shorter than a shortest path from a to b in G1 and a shortest path from α to β in G2.

Fact 22. If G is the LCP of either two root-trees or two leaf-trees then G is a tree.

Observe that the LCP of two trees could also be not connected. This is not a restriction to our discussion, since we deal with connected networks that are the LCP of trees and not with arbitrary LCPs of trees.
Designing Compact Routing Schemes for T-networks
Let G = T1 • T2, where both T1 and T~ are trees. In [12] an IRS for trees has been shown. Thus, consider the two IRSs for T 1 and T 2 and let s and I1, E2 and Z2 be the node- and link-labelings of an IRS for T 1 and T 2, respectively. A node (ul, u2) E V(G) is labeled with a triple (a, a, l) if El(U1) = a, Z2(u2) = a and l is the layer of both ut in Tt and u2 in T2 (and of (ut, u2) in G). Similarly, a link ((ul, u2), (Vl, v2)) E E is labeled with a triple (/1, I~, 1), where/1 = Zl((ut, vt)), I2 = Z~((u~, v2)) and I is the layer of (vl, v2). It is possible to rename all nodes
1032 according to their node-labeling s therefore, in the following we speak a b o u t labeling to m e a n link-labeling and we refer to nodes themselves to m e a n their node-labels. To complete the definition of the c o m p a c t r o u t i n g scheme for G, we m u s t describe algorithm ,4 stored in each node (a, a, la) used to route a packet onto a shortest p a t h connecting the current node (a, c~, la) itself to the destination (t, r, lt). Informally, at each step ,4 tries to take the greedy choice: if b o t h shortest p a t h factors Pl(a, t) and P2(a, r) move towards the same level then A moves in t h a t direction, t h a t is, it chooses link ((a, a, l~), (b,/3, lb)). Otherwise, if a = t (or a = r), then P1 (or Pu) is null, so the other p a t h m u s t be followed 1. Finally, if Pl(a, t) and P2(o:, r) go towards opposite levels, ,4 follows the p a t h going away f r o m level lt. We will prove t h a t in this way a shortest p a t h on G is always used. Now, we are ready to describe algorithm ,4 formally.
Algorithm M ( i n p u t : t, r, lt; o u t p u t : one outgoing link labeled (I1,/2,1)) { The labeling of the outgoing links, a, o~ and la are known constants in each node.} if a = t a n d a = r t h e n extract the packet else if there exists a link labeled (I1,/2, l) s.t. t 6 I1 a n d r 6 / 2 t h e n choose it else if there exists a link having label (I1, Is, l) s.t. t 9 11 a n d (a = r or [la - lt[ < I1 - lt[) t h e n choose it else if there exists a link having label (11,/2, l) s.t. r 6 [2 a n d (a = t or [l~ - lr] < I1 - l~[) t h e n choose it; Notice that algorithm`4, stored in each node, does not increase the a s y m p t h o t i c space complexity with respect to 2-1RS and runs in O(3) time, 6 being the m a x i m u m node degree. As a consequence, if .4 is able to route packets to their destinations we have designed a c o m p a c t routing scheme. In the following, we first show the optimality (Thin. 1) and then its overall o p t i m a l i t y ( T h m . 2). 1. If (a, a, la) transmits a packet, whose destination is (t, r, It), to an adjacent node (b,/3, lb), then (b,/3, lb) belongs to a shortest path from (a, a, la) to
Theorem
(t, ~, l~). Proof. If G is the L C P of either two root-trees or two leaf-trees the s t a t e m e n t is trivially true because, from Fact 22, there exists a unique p a t h between any pair of nodes. It m u s t necessarily be the Layered Cross P r o d u c t of the corresponding paths in the factors. Thus, links labeled (It, h , 1 ) such t h a t t E I1 and v 6 /2 can always be used. Therefore, f r o m now on we shall always suppose t h a t T1 is a root-tree and T2 is a leaf-tree. T h e p r o o f considers the t r u t h of the if-conditions in A. 1. There exists a link (I1, h , l) such that r 6 12 and a = t. Let (~, fl, ill,. 9 ilk, r) be the shortest p a t h in T2. Then, by definition of LCP, there exist b, b l , . . . , bk in T1 such t h a t (a, b, b l , . . . , bk, a) is a p a t h (crossing the same links m o r e t h a n 1 When we say that M follows a path Pi (either i = 1 or i = 2), we mean that r follows an edge on G whose i-th factor belongs to Pi.
1033
<(a,o~,la), (b,fl, lb), (bl,fll,lb,),..., (bk,flk,lb~), (a, 7-,/a)) is a p a t h in G s t a r t i n g at (a, a, la) and ending at (a, 7-, la). Fact 21 ensures t h a t this is a shortest p a t h in G. T h e case with t E / 1 and a = 7- is s y m m e t r i c . 2. T h e r e exists a link labeled (Ii,I2,lb) such t h a t t E I1 and 7- C I~. W.l.o.g., suppose la < lt: b is a child of a and fl is the father of a (Fig. 2). Moreover, since b belongs to a shortest p a t h from a to t, t is a descendant of b. T w o cases are possible: once) such that
- 7" is an ancestor of c~ (Fig. 2.a). Then, the shortest p a t h s f r o m a to t in T1 and f r o m a to ~- in T2 have the s a m e length It - la. It is easy to see t h a t in G there exists a unique shortest p a t h from (a, c~, l~) to (t, 7-, 4). F u r t h e r m o r e , all links in such p a t h are labeled (/1,/2, l) such t h a t t E I1 and 7- C /2. Thus, (b, fl, lb) belongs to a p a t h of length It - la and for Fact 21 it is the shortest path. - a and 7- have a c o m m o n ancestor (Fig. 2.b). Let 7 be the nearest c o m m o n ancestor and let l~ be its layer. By the definition of layers, and since Tz is a root-tree and T2 is a leaf-tree, it m u s t be It < 4.
I~ It
Ic It Ib la
~(b,(t"~)"r
%,"
I,,
}" ]"l (a;a) "
(a, a ) ,,, C(
Ib
'
la
-,/ b
F i g . 2. la < lt; a. T is an acestor of c~; b. a and T have a c o m m o n ancestor.
Let c be one of the descendants of t at layer Ic in T1 and 6 be the node at l a y e r l t belonging to the shortest p a t h f r o m c~ to 7 in T2. T h e r e exists a p a t h f r o m (a, ~, la) to (t, ~i, 4) in G whose links are always labeled (I1,/2, l) such t h a t t E I1 and 7- E / 2 and having length equal to the length of the shortest p a t h from c~ to 6 in T2 (this is easily proved by induction on the length). T h e p a t h in G from (t, 6, 4) to (t, v, 4) chosen by .4 is a shortest one because of ease 1. Since the p a t h from (a, c~, 4 ) to (t, r, 4) passing t h r o u g h (b, fl, 4) has the s a m e length as the p a t h from c~ to r in T~, it is a shortest one. 3. No outgoing link from (a, c~,4) is labeled (I1,/2,1) such t h a t t E I1 and r E / 2 ; f u r t h e r m o r e , neither a = t nor c~ = r. T h e n , one shortest p a t h factor moves towards increasing layers while the other shortest p a t h factor moves towards decreasing layers. Suppose first It > la, t h a t is, t is closer t h a n a to the leaves in T1 while 7- is closer t h a n a to the root in T2.
1034 Notice t h a t it is not possible to have b o t h shortest p a t h s factors m o v i n g towards the leaves. Thus, b o t h shortest p a t h s factors move towards the roots. Then, t and a have a nearest c o m m o n ancestor c at layer lc. On 3?2, either 7is an ancestor of a (see Fig. 3) or r and a have a nearest c o m m o n ancestor (~ at layer ld (Fig. 4). Consider the case of Fig. 3 first. Let b the father of a in T1 and let I1 be the label of (a, b) in T1. Since It - lb > It -- l~ and t E /1, A chooses one link labeled (I1, I2, lb) and ending in (b,/~, lb). Let d be the node belonging to the shortest p a t h from a to t at layer l~ in T1. Node (b,/?, lb) lies on the p a t h
,~
(a,a) ./-a \:. ~/'~" ,,
~ It
Fig. 3. The two shortest paths factors move towards the roots and r is an ancestor of a while t and a have a common ancestor c.
f r o m ( a , a , l~) to (d, a, la) - t h r o u g h a node (c, 7, lc) - of the s a m e length as the shortest p a t h from a to d in T1. From this last node, there is a single shortest p a t h to (t, v, lt) constituted by links always labeled ( I 1 , / 2 , l ) such t h a t t E /1 and r E /2. T h e length of such p a t h from (a, a, la) to (t, r, lt) t h r o u g h (b, fl, lb) is equal to the length of the shortest p a t h f r o m a to t in
T1. Consider now the situation in Fig. 4: both shortest p a t h s factors m o v e towards the roots, t and a have a nearest c o m m o n ancestor c at layer lc, and r and a have a nearest c o m m o n ancestor 5 at layer ld. To go f r o m (a, c~,la) to (t, v, lt) through a shortest p a t h it is necessary to use either first a shortest p a t h in T1 till c or first a shortest p a t h in 372 till 5. Indeed, consider a p a t h t h a t alternates links whose first factor is f r o m a shortest p a t h in T1 with links whose second factor is f r o m a shortest p a t h in ~): let (a, a, l~), (al, a l , / 1 ) , 9 9 (ak, ak, lk) be the first f r a g m e n t of such a p a t h having the first factor as the shortest p a t h in 371, with ai ~ c, i = 1 , . . . , k. If the next step is a link whose second factor belongs to a shortest p a t h towards r in T2, such link ends necessarily in (u, ak-1, lk-1), where u is a child of ak in T1. For the sake of s y m m e t r y , since u and ak-1 are at the s a m e layer in the s a m e subtree rooted at ak (with lk > lc), the distance of (u, ak-1, lk-1) f r o m (t, r, lt) is equal to the distance of (ak-1, a k - 1 , lk-1) f r o m (t, r, lt). Thus, in order to reach (t, r, lt) from (a, a , l a ) at least two steps have been wasted. Hence, necessarily one of the following is a shortest p a t h in G:
1035 - (Fig. 4.a) a p a t h from (a, ~, la) to (b, fl, lb) to (c, 3',1~) to (h, c~,l~) to (t, ~1, lt) to (f, ~, ld) to (t, 7, lt), where 7 is a descendant of a at layer l~ in T2, h and f are descendants of e at layer la and let, respectively, in T1 and 7/is an ancestor of c~ at layer It in T2. T h e length of such a p a t h is 2(la - it) + (it - la) q- 2(ld -- it) = 2(ld -- lc) -~ la -- It -- (Fig. 4.b) a p a t h from (a, a,l~) to (d, cf, la) to (k, v, lt) to (a, O,l~) to (c, ~, l~) to (h, O, la) to (t, v, lt), where d and k are descendant of a at layers Id and It, respectively, in T1 and 0 and cr are descendants of 7 at layers l~ and l~, respectively, in T2. T h e length of such a p a t h is (ld -- lo) + (ld -- lt) + 2(lt -- l~) = 2(ld -- l~) + It -- lo
Thus, the shortest p a t h depends on the sign of It - l a . Since in our hypothesis It > la, the shortest p a t h is the first one, t h a t is the p a t h whose first step goes away f r o m lt. Since node (b, fl, Ib) is such t h a t It - lb > It -- la, then ,4 chooses the first path. W h e n It < l~, the s a m e reasoning applies. Finally, the s a m e discussion holds also if It = la. However, notice t h a t in this case any of the two choices is possible.
Id It/a /b /c d k
/d It la It, Ic ~Oi~'~(C,
: N.,,(t,.c) (t,nT,r
IrXI,'T I ~J,'l I I
~)
,,,._.,e)
(t.~j
/a
~I
_/- [
a
Fig. 4. The two shortest paths factors move towards the roots, t and a have a common ancestor c and 7 and c~ have a common ancestor &
T h e o r e m 2. If a packet must be transmitted from node (a, ~,l~) to node (t, r, lt) and ((a,c~,l~), (al,c~l,lal), ( a 2 , a ~ , l ~ ) , . . . , (t,'r, lt)> is any shortest path from
(a, o~, l~) to (t, r, lt), then algorithm ,4 can possibly use it. Proof. Again, t h a n k s to Fact 22, we shall always suppose t h a t T1 is a roottree and T2 is a leaf-tree. T h e p r o o f is divided according to the t r u t h of the if-conditions in .4 and most of considerations clone in the p r o o f of T h e o r e m 1 are used here. 1. There exists a link ( h , I 2 , 1 ) such t h a t r E /2 and a = t: then (cf. p r o o f of T h i n . l ) G contains a p a t h f r o m (a, a , I,) to (t, w, lt) of length equal to the p a t h f r o m a to 7- in T2. Let k + 1 be such length. Suppose t h a t a p a t h ((a, a, la), (al, a l , 1 , 1 ) , . . . , (ak, ak, l,k), (t, 7", lt)} exists in G t h a t is not found
1036 by .4. Thus, a l does not belong to the shortest p a t h from a to r in T2, t h a t is, if link ((a, a,/~), (a~, a l , l a l ) ) is labeled (/1, I2,lal), ~- ~ I2. Hence, since by definition of L C P p = (a, a l , . . . , a k , 7 " ) m u s t be a p a t h in T2, some ai m u s t be the s a m e as some aj, with i ~ j. But the distance between a and ~in T2 is k + 1, then p must be longer t h a n k + 1, an absurd. T h e case t G / 1 and a = r is s y m m e t r i c . 2. T h e r e exists a link labeled (/1, I s , l ) such t h a t t E / 1 and T E /2. W.l.o.g., suppose la < It and therefore b a child of a and fl the father of a. Since b belongs to a shortest p a t h from a to t, t is a descendant of b. T w o cases are possible: - ~- is an ancestor of a. This case is trivially proved, since there is in G a unique shortest p a t h from (a, a, la) to (t, T, lt) and A finds t h a t p a t h . a and r have a nearest c o m m o n ancestor 7 at layer lc. Recall t h a t It < lc and a shortest p a t h in G m u s t be as long as the p a t h f r o m a to ~- in T2 (cf. p r o o f of T h m . l ) . T h e p r o o f of this case is similar to the one of case 1. of this t h e o r e m and thus o m i t t e d . 3. No outgoing link from (a, a, la) is labeled (/1,12,l) such t h a t t C /1 and 7- E /2. Suppose It > la. We have already proved t h a t one of the following cases m u s t occur: - b o t h shortest p a t h s factors m o v e towards the roots, 7- is an ancestor of while t and a have a nearest c o m m o n ancestor c at layer 1r a l g o r i t h m A chooses a link labeled (/1,12, l) such t h a t t G / 1 and such link belongs to a p a t h from (a, a, la) to (t, T, lt) having the s a m e length as the p a t h from a to t in T1. Again, a reasoning very similar to t h a t one of case 1. of this t h e o r e m applies. - b o t h shortest p a t h s factors move towards the roots, t and a have a nearest c o m m o n ancestor c at layer lc, r and a have a nearest c o m m o n ancestor 5 at layer ld. We have already proved t h a t a shortest p a t h f r o m (a, a, l~) to (t, r, lt) necessarily crosses either first a shortest p a t h in T1 to c or first a shortest p a t h in T2 to J. This implies (cf. p r o o f of T h i n . l ) t h a t a shortest p a t h in G must necessarily be a p a t h f r o m (a, a, l~) to (c, % lc) to (h, a, la) to (t, ~, lt) to (d, 5, ld) to (t, r, lt), where 7 is a descendant of a at layer l~ in T2, h is a descendant of c at layer l~ in T1 and r/ is an ancestor of a at layer It in T2 (Fig. 4.a). Since .4 can choose any link ((a, a, l~), (b,/3, lb ) ) labeled ( / 1 , 5 , lb ) such t h a t It - lb > It -- la and t E / 1 , then it is able to choose any p a t h of this sort, and this proves the assertion. T h e s a m e reasoning applies when 1t < la and when l t : la. -
4 LCP of General Graphs: Driving some Conclusions
In the previous section we have proposed a method to compute a compact routing scheme for all T-networks. In the proofs of correctness, we have strongly used the properties of the factors, in particular the existence of a unique (shortest) path between any couple of nodes. It is immediate to wonder whether our technique can be
extended to networks which are the LCP of more general graphs. In this section we show a negative result in this direction. Namely, we prove a property of the LCP allowing one to deduce that the knowledge of node- and link-labels on the factors (and therefore the knowledge of their shortest paths) does not give enough information to find shortest paths in their LCP.
Theorem 3. There exist a layered graph G = (V, E), LCP of G1 = (V1, E1) and G2 = (V2, E2), a source node (s, σ, ls) ∈ V, a destination node (t, τ, lt) ∈ V and an edge e ∈ E having the following properties:
i. e is the layered cross product of two edges e1 ∈ E1 and e2 ∈ E2;
ii. e belongs to a shortest path from (s, σ, ls) to (t, τ, lt);
iii. neither e1 belongs to a shortest path from s to t in G1, nor e2 belongs to a shortest path from σ to τ in G2.
Proof. The graph in the assertion is shown in Fig. 5, in which the following convention is used: each edge in the drawing of G1 and G2 (and therefore of G) represents a simple chain whose length is determined by the difference of the layers where its extremes lie. Suppose the shortest path from s to t in G1 passes through m and the shortest path from σ to τ in G2 passes through ν. Then, the following relations must hold:
|ls - lt| + 2|lt - lm| < 2|ls - ld| + |ls - lt|   and   2|ls - ln| + |ls - lt| < |ls - lt| + 2|lt - lr|,
that is, |lt - lm| < |ls - ld| and |ls - ln| < |lt - lr|.
Whenever one of the following inequalities holds,
|ls - lt| + 2|lt - lr| < 2|ls - ln| + |ls - lt| + 2|lt - lm|
2|ls - ld| + |ls - lt| < 2|ls - ln| + |ls - lt| + 2|lt - lm|,
either the path through (m, μ1, lm) and (r, ρ, lr) (first inequality) or the path through (n2, ν, ln) and (d, δ, ld) (second inequality) is shorter than the path through (n2, ν, ln) and (m, μ2, lm). That is, the shortest path passes through an edge - either ((m, μ1, lm), (r, ρ, lr)) or ((n2, ν, ln), (d, δ, ld)) - whose factors are not on a shortest path. As a consequence of the previous theorem, we can state the following fact:
Fact 41. Let G be a network that is the LCP of any two graphs G1 and G2. The knowledge of the compact routing schemes (i.e. a node- and link-labeling scheme) on G1 and G2 alone may not be sufficient to deduce a compact routing scheme for G.
Anyway, the previous claim does not forbid one to find some special cases in which the particular structure either of the network itself or of its factors helps in defining a compact routing scheme.
Fig. 5. Edges on a shortest path in G whose factors do not lie on any shortest path in G1 and in G2.
Acknowledgments: We thank Richard B. Tan for a careful reading of earlier drafts and for his helpful suggestions on improving the paper.
Gossiping Large Packets on Full-Port Tori
Ulrich Meyer and Jop F. Sibeyn
Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany.
umeyer, [email protected]
http://www.mpi-sb.mpg.de/~umeyer, ~jopsi/
Abstract. Near-optimal gossiping algorithms are given for two- and higher-dimensional tori. It is assumed that the amount of data each PU contributes is so large that start-up time may be neglected. For two-dimensional tori, an earlier algorithm achieved optimality in an intricate way, with a time-dependent routing pattern. In our algorithms, the PUs forward the received packets in the same way in all steps.
1 Introduction
Meshes and Tori. One of the most thoroughly investigated interconnection schemes for parallel computation is the n × n mesh, in which n^2 processing units, PUs, are connected by a two-dimensional grid of communication links. Its immediate generalizations are d-dimensional n × ... × n meshes. Despite their large diameter, meshes are of great importance due to their simple structure and efficient layout. Tori are the variant of meshes in which the PUs on the outside are connected with "wrap-around links" to the corresponding PUs at the other end of the mesh; thus tori are node-symmetric. Furthermore, for tori the bisection width is twice as large as for meshes, and the diameter is halved. Numerous parallel machines with two- and three-dimensional mesh and torus topologies have been built.
Gossiping. Gossiping is a fundamental communication problem used as a subroutine in parallel sorting with splitters or when solving ordinary differential equations: Initially each of the P PUs holds one packet, which must be routed such that finally all PUs have received all packets (this problem is also called all-to-all broadcast).
Earlier Work. Recently Soch and Tvrdík [5] have analyzed the gossiping problem under the following conditions: Packets of unit size can be transferred in one step between adjacent PUs (store-and-forward model). In each step a PU can exchange packets with all its neighbors (full-port model). The store-and-forward model gives a good approximation of the routing practice if the packets are so large that the start-up times may be neglected. If an algorithm requires k > 1 packets per PU, then this effectively means nothing more than that the packets have to be divided into k chunks each.
In [5], it was shown that on a two-dimensional n1 × n2 torus, the given problem can be solved in ⌈(n1 · n2 - 1)/4⌉ steps if n1, n2 ≥ 3. This is optimal. The algorithm is based on time-arc-disjoint broadcast trees: The action of each PU is time dependent, and thus, for every routing decision, a PU has to perform some non-trivial computation (alternatively these actions can be precomputed, but for a d-dimensional torus with P PUs this requires O(P) storage per processor).
New Results. In this paper, we analyze the same problem as Soch and Tvrdík. Clearly, we cannot improve their optimality. Instead, we try to determine the minimal concessions towards simple time-independent algorithms: In our gossiping algorithms, after some O(d) precomputation on a d-dimensional torus, a PU knows once and for all that packets coming from direction xi have to be forwarded in direction xj, 1 ≤ i, j ≤ d. Time-independence ensures that the routing can be performed with minimal delay; for a fixed-size network the pattern might even be built into the hardware. Also on a system on which the connections must be somehow switched, this is advantageous.
By routing the packets along d edge-disjoint Hamiltonian paths, it is not hard to achieve the optimal number of steps if each PU initially stores k = d packets. Here optimal means a routing time of k · P/(2 · d) steps on a torus with P PUs. For d = 2, the schedule presented in Section 2 is particularly simple, and might be implemented immediately. Our scheme remains applicable for higher dimensions: the proof that edge-disjoint Hamiltonian cycles exist [3, 2, 1] is constructive, but these constructions are fairly complicated. Unfortunately, it is not possible to achieve the optimal number of steps using d fixed edge-disjoint Hamiltonian paths and only one packet per PU which then circulates concurrently along all d paths. In [4] we show for the two-dimensional case that at least P/40 extra steps are needed due to multiple receives.
In Section 3, we investigate a second approach that allows an additional o(P) routing steps. On a d-dimensional torus, it runs in P/(2 · d) + o(P) steps, with only one packet per PU. For d = 2, this is almost optimal and time independent. Therefore, this algorithm might be preferable over the one from [5]. More importantly, for higher dimensions, particularly for the practically relevant case d = 3, this gives the first simple and explicit construction that comes close to optimality. In these algorithms, we construct partial Hamiltonian cycles: on a d-dimensional torus, we construct d cycles, each of which covers P/d + o(P) PUs. These are such that for every cycle, every PU is adjacent to a PU through which this cycle runs.
2 Optimal-Time Algorithms
In this section we describe a simple optimal gossiping algorithm for two-dimensional n1 × n2 tori, assuming that initially each PU holds k = 2 packets.
Basic Case. The PU with index (i, j) lies in row i and column j, 0 ≤ i < n1, 0 ≤ j < n2. PU (0, 0) is located in the upper-left corner. First PU (i, j) determines whether j is odd or even and sets its routing rules as follows:
j even:  T ↔ R; B ↔ L.
j odd:   T ↔ L; B ↔ R.
Here T, B, L, R designate the directions "top", "bottom", "left" and "right", respectively. By T ↔ R, we mean that the packets coming from above should be routed on to the right, and vice-versa. The other ↔ symbols are to be interpreted analogously. Only in the special case j = n2 - 1 do we apply the rule T ↔ R; B ↔ L. The resulting routing scheme is illustrated in the left part of Figure 1.
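To make these rules concrete, the following small Python sketch (our own illustration, not code from the paper) builds the time-independent forwarding tables described above for the basic case with n2 even, and checks that, starting from PU (0, 0), they trace a single Hamiltonian cycle on a 4 × 4 torus, as in the left part of Figure 1.

def build_rules(n1, n2):
    # rules[(i, j)] maps the direction a packet comes from to the direction it leaves
    rules = {}
    for i in range(n1):
        for j in range(n2):
            if j % 2 == 0 or j == n2 - 1:      # even columns and the special last column
                rules[(i, j)] = {'T': 'R', 'R': 'T', 'B': 'L', 'L': 'B'}
            else:                              # odd columns
                rules[(i, j)] = {'T': 'L', 'L': 'T', 'B': 'R', 'R': 'B'}
    return rules

def step(pos, out_dir, n1, n2):
    moves = {'T': (-1, 0), 'B': (1, 0), 'L': (0, -1), 'R': (0, 1)}
    di, dj = moves[out_dir]
    return ((pos[0] + di) % n1, (pos[1] + dj) % n2)

def opposite(d):
    return {'T': 'B', 'B': 'T', 'L': 'R', 'R': 'L'}[d]

def trace_cycle(n1, n2, start=(0, 0), first_out='R'):
    rules = build_rules(n1, n2)
    visited, pos, out_dir = [start], start, first_out
    while True:
        pos = step(pos, out_dir, n1, n2)
        if pos == start:
            return visited
        visited.append(pos)
        out_dir = rules[pos][opposite(out_dir)]   # arrive from the opposite side, forward by rule

cycle = trace_cycle(4, 4)
print(len(cycle), len(set(cycle)))                # 16 16: every PU visited exactly once

Starting the trace with first_out='B' instead of 'R' follows the complementary cycle (the thin lines in Figure 1), which for this 4 × 4 example is edge-disjoint from the first one and carries each PU's second packet.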
Fig. 1. Left: A Hamiltonian cycle on a 4 × 4 torus, whose complement (drawn with thin lines) also gives a Hamiltonian cycle. Right: Partial Hamiltonian cycles on a 4 × 5 torus. The PUs in column 3 lie on only one cycle. Such a PU passes the packets on this cycle that are running forwards to the PU below it, and those that are running backwards to the PU above it.
Odd n_i. Now we consider the case n1 even and n2 odd; the remaining cases can be treated similarly. Here we do not construct complete Hamiltonian cycles, but cycles that visit most PUs and pass within distance one from the remaining PUs. Except for Column n2 - 2, the rules for passing on the packets are the same as in the basic case. In Column n2 - 2 we perform L ↔ R. PUs which do not lie on a given cycle, out-of-cycle-PUs, abbreviated OOC-PUs, are provided with the packets transferred along this cycle by their neighboring on-cycle-PUs, OC-PUs. With respect to different cycles, a PU can be both out-of-cycle and on-cycle. The packets received by an OOC-PU are not forwarded. This can be achieved in such a way that a connection has to transfer only one packet in every step. The resulting routing scheme is illustrated in the right part of Figure 1.
Changing Direction. Each OOC-PU C receives packets from two different OC-PUs, A and B. Let A transfer the packets that are walking forwards, and let B provide the packets in backward direction. If m denotes the cycle length and A and B are not adjacent on the cycle, then m/2 circulation steps are not enough: some packets are received from both directions, while others pass by without
notice. We modify the algorithm as follows. Let l be the number of OC-PUs between A and B; then during the first l steps, A transfers to C the packets that are running forward, and during the last m/2 - l steps those that run backward. B operates oppositely. In our case l = 2 · n2 - 1 for all PUs in Column n2 - 2. Thus, each PU can still compute its routing decisions in constant time.
Theorem 1. If every PU of an n1 × n2 torus holds 2 packets, then gossiping can be performed in ⌈n1 · n2/2⌉ steps.
The considered schemes have the advantage that their precomputation can be performed in constant time. Alternatively, one might use the scheme in [3] for the construction of two edge-disjoint Hamiltonian cycles on any two-dimensional torus.
3 One-Packet Algorithms
In this section we give another practical alternative to the approach of [5]: we present algorithms which require only one packet per PU and are optimal to within o(P) steps. The idea is simple: we construct d edge-disjoint cycles of length P/d + o(P). Each of the cycles must have the special property that if a PU does not lie on such a cycle, this PU must be adjacent to two other PUs belonging to this cycle. Each of these two PUs will transmit the packets from one direction of the cycle to the out-of-cycle-PU.
Two-Dimensional Tori. The construction can be viewed as an extension of the approach described in Section 2. There we interrupted the regular zigzag pattern for one column in order to cope with an odd value for n2. Now we discard the zigzag in n2 - 2 consecutive columns; the only two remaining zigzag columns connect the long stretched row parts to form two partial cycles:
j = 0:            T ↔ R; B ↔ L.
j = 1:            T ↔ L; B ↔ R.
2 ≤ j ≤ n2 - 1:   L ↔ R.
Binding the out-of-cycle-PUs is done exactly as in the algorithm of Section 2; the case of odd n1 and n2 can be treated by inserting one special row. Each of the partial cycles consists of n1 · n2/2 + n1 PUs, so using both directions in parallel one needs n1 · n2/4 + n1/2 steps to spread all the packets within the cycle. For every OOC-PU the two supplying OC-PUs are separated by n2 + 1 other PUs on their cycle; thus we obtain a fully time-independent n1 · n2/4 + O(n1 + n2) step algorithm. Applying the forward/backward switching as presented in Section 2 we get
Theorem 2. If every PU of an n1 × n2 torus holds 1 packet, then gossiping can be performed in n1 · n2/4 + n1/2 + 1 steps.
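To get a feel for the overhead of the simpler, time-independent schedules, the step counts can be compared numerically. The short sketch below is our own illustration; it only plugs concrete torus sizes into the optimum of [5] and into the bounds of Theorems 1 and 2 quoted above.

import math

def optimal_steps(n1, n2):      # time-dependent optimum of [5]
    return math.ceil((n1 * n2 - 1) / 4)

def theorem1_steps(n1, n2):     # k = 2 packets per PU (Section 2)
    return math.ceil(n1 * n2 / 2)

def theorem2_steps(n1, n2):     # 1 packet per PU (this section)
    return n1 * n2 // 4 + n1 // 2 + 1

for n in (8, 16, 64):
    print(n, optimal_steps(n, n), theorem1_steps(n, n), theorem2_steps(n, n))

For a 16 × 16 torus this prints 64, 128 and 73 steps: the two-packet schedule needs 64 steps per packet, matching the optimum, and the one-packet schedule pays only the additive n1/2 + 1 term.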
Fig. 2. Three edge-disjoint cycles used in the one-packet algorithm on a three-dimensional torus.
Three-Dimensional Tori. For three-dimensional tori, we generalize the previous routing strategy, showing more abstractly the underlying approach. We assume that n2 and n3 are divisible by three. The constructed patterns are similar to the bundle of rods in a nuclear power plant. We construct three cycles. These are identical, except that they start in different positions: Cycle j, 0 ≤ j ≤ 2, starts in PU (0, j, 0). The cycles for two-dimensional tori were composed of n1/2 laps: sections of a cycle starting with a zigzag and ending with the traversal of a wrap-around connection. The zigzags were needed to bring us two positions further in order to connect to the next lap. Here, we have three zigzags, bringing us three positions further. The sequence of directions in the zigzag pattern is given by (2, 1, 2, 1, 2, 1). After these zigzags, moves in direction 1 are repeated until a wrap-around connection is traversed, and the next lap can begin. In this way we can fill up one plane (using n2/3 laps), but in order to get to the next plane, there must be a second zigzag type consisting of the following direction pattern: (2, 1, 2, 1, 3, 1). This type is applied every n2/3 laps. The total schedule is illustrated in Figure 2. If PU P = (x, y, z), x ≥ 3, belongs to cycle i, then PUs Pr = (x, (y + 1) mod 3, z) and Pd = (x, y, (z + 1) mod 3) belong to cycle (i + 1) mod 3, and Pl =
(x, (y - 1) mod 3, z) and Pu = (x, y, (z - 1) mod 3) belong to cycle (i - 1) mod 3. In this way, P on cycle i can be supplied with packets which are not lying on its own cycle: it receives packets running forward on cycle (i + 1) mod 3 from Pr, and backward-packets from Pd. Pl provides packets running backward on cycle (i - 1) mod 3; forward-packets are sent by Pu. Each pair of supporting on-cycle-PUs is separated by (n1/3 + 1) · n2 - 1 other PUs. In the upper part of our reactor, exactly two cycles pass through each PU P, and the shifts are such that the connections that are not used by cycle traffic lead to PUs that lie on the other cycle. At most 2 · (n1/3 + 1) · n2 - 1 PUs lie between two supporting on-cycle-PUs. Multiple receives in the OOC-PUs can again be eliminated by the forward/backward switching of Section 2. Otherwise O(n1 · n2) extra steps are necessary.
Theorem 3. If every PU of an n1 × n2 × n3 torus (n2, n3 integer multiples of three) holds 1 packet, then gossiping can be performed in n1 · n2 · n3/6 + n1 · n2/2 + 1 steps.
Higher-Dimensional Tori. The idea from the previous section can be generalized without problem for d-dimensional tori. Now, we construct d edge-disjoint cycles, each of them covering P/d + o(P) PUs. The cycles are numbered 0 through d - 1. Cycle j starts in position (0, j, 0, ..., 0). As before, a cycle is composed of laps, starting with a zigzag and ending with moves in direction 1. Now there are d - 1 types of zigzags, which are used in increasingly exceptional cases. The general pattern telling along which axis a (positive) move is made is as follows:
zigzag(2, 1) = (2, 1, 2, 1).
zigzag(3, 1) = (2, 1, 2, 1, 2, 1),   zigzag(3, 2) = (2, 1, 2, 1, 3, 1).
zigzag(i, 1) = (zigzag(i - 1, 1), 2, 1),
...
zigzag(i, i - 2) = (zigzag(i - 1, i - 2), 2, 1),
zigzag(i, i - 1) = (2, 1, 2, 1, 3, 1, 4, 1, 5, 1, ..., i - 1, 1, i, 1).
Here zigzag(d, j) indicates the j-th zigzag pattern for d-dimensional tori. During lap i, 1 ≤ i ≤ n2/d · n3 · ... · nd, the highest numbered zigzag is applied for which the condition i mod (n2/d · n3 · ... · nj) = 0 is true. Namely, after precisely so many laps, the cycle has filled up the hyperspace spanned by the first j coordinate vectors. Using induction over the number of dimensions we can prove
Theorem 4. If every PU of a d-dimensional n1 × ... × nd torus (n2, ..., nd integer multiples of d) with P PUs holds 1 packet, then gossiping can be performed in P/(2 · d) + P/(2 · n1) + 1 steps.
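The recursive definition of the zigzag patterns translates directly into code. The following Python sketch is our own illustration (not taken from the paper); it assumes the recursion zigzag(i, j) = (zigzag(i - 1, j), 2, 1) for every j ≤ i - 2, exactly as listed above.

def zigzag(d, j):
    if j == d - 1:
        # the highest pattern touches axes 2, 3, ..., d once each, interleaved with axis 1
        pattern = [2, 1, 2, 1]
        for axis in range(3, d + 1):
            pattern += [axis, 1]
        return tuple(pattern)
    # lower patterns reuse the (d-1)-dimensional pattern and append one more zigzag step
    return zigzag(d - 1, j) + (2, 1)

print(zigzag(3, 1))   # (2, 1, 2, 1, 2, 1)
print(zigzag(3, 2))   # (2, 1, 2, 1, 3, 1)
print(zigzag(4, 3))   # (2, 1, 2, 1, 3, 1, 4, 1)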
4 Conclusion
We have completed the analysis of the gossiping problem on full-port store-and-forward tori. In [5] only one interesting aspect of this problem was considered. We have shown that almost equally good performance can be achieved by simpler time-independent algorithms, and have given explicit schemes for higher-dimensional tori as well.
References
1. Alspach, B., J.-C. Bermond, D. Sotteau, 'Decomposition into Cycles I: Hamilton Decompositions,' Proc. Workshop Cycles and Rays, Montreal, 1990.
2. Aubert, J., B. Schneider, 'Decomposition de la Somme Cartesienne d'un Cycle et de l'Union de Deux Cycles Hamiltoniens en Cycles Hamiltoniens,' Discrete Mathematics, 38, pp. 7-16, 1982.
3. Foregger, M.F., 'Hamiltonian Decomposition of Products of Cycles,' Discrete Mathematics, 24, pp. 251-260, 1978.
4. Meyer, U., J.F. Sibeyn, 'Time-Independent Gossiping on Full-Port Tori,' Techn. Rep. MPI-98-1-014, Max-Planck-Institut für Informatik, Saarbrücken, Germany, 1998.
5. Soch, M., P. Tvrdík, 'Optimal Gossip in Store-and-Forward Noncombining 2-D Tori,' Proc. 3rd International Euro-Par Conference, LNCS 1300, pp. 234-241, Springer-Verlag, 1997.
Time-Optimal Gossip in Noncombining 2-D Tori with Constant Buffers*
Michal Soch and Pavel Tvrdík
Department of Computer Science and Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague, Czech Republic
{soch, tvrdik}@sun.felk.cvut.cz
http://cs.felk.cvut.cz/pcg/
Abstract. This paper describes a time-, transmission-, and memory-optimal algorithm for gossiping in general 2-D tori in the noncombining full-duplex and all-port communication model.
1 Introduction
Gossiping, also called all-to-all broadcasting, is a fundamental collective communication problem: each node of a network holds a packet that must be delivered to every other node, and all nodes start simultaneously. In this paper, we consider gossiping in the noncombining store-and-forward all-port full-duplex model. In such a model, a gossip protocol consists of a sequence of rounds, and broadcast trees can be viewed as levels of arc sets, one level for one round. Given a broadcast tree BT(u) rooted in vertex u of a graph G, the arcs crossed by packets in round t are denoted by At(BT(u)). The height of a broadcast tree BT(u), denoted by h(BT(u)), is the number of levels of BT(u), i.e., the number of rounds of a broadcast using this tree. Given BT(u) and i ≤ h(BT(u)), the i-th level subtree of BT(u), denoted by BT^[i](u), is the subtree of BT(u) consisting of the first i levels. Given a graph G, every node of G has to receive |V(G)| - 1 packets and in one round it can receive δ(G) packets in the worst case, where |V(G)| is the size of G and δ(G) is the minimum degree of a node in G. The lower bound on the number of rounds of a gossip in G is therefore τ_g(G) = ⌈(|V(G)| - 1)/δ(G)⌉. If G is a vertex-transitive network, the gossip problem can be solved by constructing a generic broadcast tree, denoted by BT(*), whose root can be placed into any vertex of G. All broadcast trees are therefore isomorphic. The 2-D torus T(m, n) is a cross product of 2 cycles of size m and n. The vertex set of T(m, n) is the cartesian product {0, .., m - 1} × {0, .., n - 1}. Tori are vertex-transitive and the isomorphic copies of the generic broadcast trees are made by translation.
* This research was supported by GAČR Agency under Grant 102/97/1055, FRVŠ Agency under Grant 1251/98, and CTU Grant 3098102336.
[w~ Orn vx Om ux, Wy On vy On uy], where the addition and subtraction is taken modulo m and n, respectively. A gossip in T(m, n) is time- and transmission-optimal iff h (BT(,)) = rg (T(m, n)) _ [ mn-1 ] and all isomorphic copies of BT(*) are pairwise time-arc-disjoint, i.e., two arcs at the same time-level of any two broadcast tree are never m a p p e d on the same arc of T(m, n) and the communication in every round is therefore contention-free. It is known that in tori, time-arc-disjointedness is equivalent to the distinctness of directions of arcs at every level of the generic tree. The set of directions of arcs At(BT(*)) is denoted by dir(At(BT(*))). In a 2-D torus there exist four directions, usually denoted by N,E,W,S. N-S direction is vertical, W - E direction is horizontal. Hence, a sufficient condition for B T ( , ) to guarantee a time-optimal gossip is that for any t < 7g(T(m, n)), every At(BT(*)) is a set of 4 arcs of 4 distinct directions N, E, W, S. In general, a gossip algorithm in T(m, n) m a y require additional buffers in routers for packets which must wait at least one round before they can be sent out and the routers can get rid of them. Let/3(G) denote the m a x i m u m size of auxiliary buffers per router during gossiping in network G. In [2], we have presented a time-optimal gossip protocol for general T(m, n). However, this algorithm is not memory-optimal, it requires auxiliary buffers for O(max(n, m - n)) packets per router. The generic broadcast tree of T ( m , n) in [2] is built in two steps: filling up the m a x i m a l square submesh of odd side + informing the rest of nodes. Additional buffers for p~ckets are required on the sides of the square submesh. In this paper, we present a time- and m e m o r y - o p t i m a l gossip algorithm for T(m, n) which requires ~(T(m, n)) = 3. To keep the size of auxiliary buffers constant, the generic broadcast tree is built from vertical stripes of width 2. Only constant number of packets must be stored for dissemination in directions W and E, which allows to concatenate vertical stripes horizontally.
2
Generic Broadcast T r e e s
for
Optimal Gossip on
T(m,n). 1. For any m > n > 2, there exists a generic BT(*) ofT(m, n) such h(BT(*)) = I m p - l ] and IAt(BT(,)) I = Idir(At(BT(*)))l for all 1 < t <
Theorem that
h(BT(,))
and Z ( T ( m , n)) < 3. If n =
and m is even, one extra round is
needed. Proof. For m = n, the algorithm is trivial a n d / 3 ( T ( m , m)) = 0 if m is odd and /3(T(m, m)) = 1 otherwise. Assume without losing generality m > n. The construction of time-arc-disjoint trees for time- and m e m o r y - o p t i m a l gossip depends slightly on the values of m and n, the number of different cases is 7, the basic idea is, however, the same in all of them, see [3] for details. The m e m o r y requirements of our algorithm are stated in the following table.
1049
,3(T(m,n))
no2 n=3,0m<5n_ 2dd
n>4. even, m odd n > 4 even, m even
2
3
In this paper, we describe only one particular case when m > n > 5 are odd numbers. To make the construction of BT(*) in T(m, n) as simple as possible, the same patterns of arc sets At(BT(*)) are used repeatedly in various rounds t. The whole generic tree BT(,) is then built using several arc patterns. Arc pattern i is depicted on Figures 1 and 2 as a quadruple of ares labeled i. We associate with every pattern i a so called expansion operator, denoted by Fi. The broadcast tree is then specified by a regular expression over expansion operators. If At(BT(*)) has pattern i, an attachment of arcs at level t to (t - 1)-level generic subtree is described as an application of the corresponding operator B:C['] (,) = BT[~-*] (,) &.
F i g . 1. The first phase of building n--1
BT(*) in T(m, n), if
m > 11, m > n > 5 are odd.
rL~3
(a) Y~ = F [ ~ - / 7 2 . (b) Y2 = Y, [ ~ - ~ - F , .
(c) 1/22.
The generic tree is built in two phases. Figure l(a) depicts the first phase of constructing BT(*) if m >_ 1l. If m < 9, the first phase is void. The black circle is the root of BT(,). The square symbols [] denoted the nodes which have to store packets needed in later steps. Repeated application of F1 in Figure l(a) is n--1
followed by F2 which produces ~ 2 level subtree 171 = BT[%-~~ = F~-F2. Note that the expansion by F1 proceeds in N-S direction. Further a~k levels are attached in direction N-S by repeated application of F3 and after applying/14, r~--3
we get n-level subtree Y2 = B T M ( , ) = YIF3-~-F4. If m > 15, the whole process can be repeated by expanding ]/2 in W-E direction, see Figure 1 (c). The second phase depends on m m o d 4 . Let us describe the case of m - l m o d 4 (the other m-9
case is similar). Let Y3 = 1/2 4 . If m = 9, then Ya shrinks to the root. Figure 2 depicts the solution. We interpret the use of expansion operators similarly as in the first phase. In a~:_____klrounds, we expand ]/3 to I/4 =
YaY1F~-FhF~--F4 (see Figure 2(a)). ]/4 is F7 (see Figure 2(5)). Pattern
using FG and
then expanded in 3 [a_~_kJ rounds /~6~F7 diffuses vertically until the
1050
Fig. 2. The second phase of constructing BT(*), if m > n > 5 are odd, m ~_ l m o d 4 . (a) Construction of ]/4. (b) Application of F6 and F7 patterns.
boundary is reached. Finally, if n - 1 rood4 (as shown on Figure 2(b)), one final round is needed, and if n - 3 rood4, after two more applications of Fe, only two uninformed nodes surrounded by informed ones remain. The m e m o r y requirements for this case of rn and n follow easily. In the first phase, 2 packets must be stored to apply/"2 (see the square symbols in Figure l(a)). Then, these buffers can be reused to store packets for application of F4 (see Figure l(b)). The second phase is similar. In every round of the gossip, no more than two packets must be stored at a time. []
3
Conclusions
The minimal-height time-arc-disjoint trees in the proof of Theorem 1 provide a time-, transmission-, and memory-optimal gossip algorithm in noncombining full-duplex all-port 2-D tori. The algorithm requires routers with buffers for at most 3 packets. An interesting problem is to find the exact lower bound on the size of additional buffers. Another open problem is to find a time- and memory-optimal gossip algorithm for 2-D meshes.
References
1. J.-C. Bermond, T. Kodate, and S. Perennes. Gossiping in Cayley graphs by packets. In M. Deza, et al., editors, Combinatorics and Computer Science, LNCS 1120, pages 301-315. Springer, 1995.
2. M. Soch and P. Tvrdík. Optimal gossip in store-and-forward noncombining 2-D tori. In C. Lengauer, et al., editors, Euro-Par'97 Parallel Processing, LNCS 1300, pages 234-241. Springer, 1997. Research report at http://cs.felk.cvut.cz/pcg/.
3. M. Soch and P. Tvrdík. Time-optimal gossip in noncombining 2-D tori with constant buffers. Manuscript at http://cs.felk.cvut.cz/pcg/.
Divide-and-Conquer Algorithms on Two-Dimensional Meshes*
Miguel Valero-García, Antonio González, Luis Díaz de Cerio and Dolors Royo
Dept. d'Arquitectura de Computadors - Universitat Politècnica de Catalunya
c/ Jordi Girona 1-3, Campus Nord - D6, E-08034 Barcelona (Spain)
{miguel, antonio, idiaz, dolors}@ac.upc.es
Abstract. The Reflecting and Growing mappings have been proposed to map parallel divide-and-conquer algorithms onto two-dimensional meshes. The performance of these mappings has been previously analyzed under the assumption that the parallel algorithm is always initiated at the same fixed node of the mesh. In such a scenario, the Reflecting mapping is optimal for meshes with wormhole routing and the Growing mapping is very close to the optimal for meshes with store-and-forward routing. In this paper we consider a more general scenario in which the parallel divide-and-conquer algorithm can be started at an arbitrary node of the mesh. We propose an approach that is simpler than both the Reflecting and Growing mappings, is optimal for wormhole meshes and better than the Growing mapping for store-and-forward meshes.
1
Introduction
The problem of mapping divide-and-conquer algorithms onto two-dimensional meshes was addressed in [2]. First, a binomial tree was proposed to represent divide-and-conquer algorithms. Then, two different mappings (called the Reflecting and Growing mappings) were proposed to embed binomial trees onto two-dimensional meshes. It was shown that the Reflecting mapping is optimal for wormhole routing since the required communication can be carried out in the minimum number of steps (there are no conflicts in the use of links). On the other hand, the Growing mapping was shown to be very close to the optimal for the case of store-and-forward routing. The communication performance of the Reflecting and the Growing mappings was analyzed in [2] under the assumption that the divide-and-conquer algorithm is always started at a fixed node of the mesh. In the following, we use the term fixed-root to refer to this particular scenario. In this paper, we consider a more general scenario in which a divide-and-conquer algorithm can be started at any arbitrary node of the mesh. The term arbitrary-root will be used to refer to this scenario. It will be shown that the Reflecting mapping is still optimal for wormhole routing in the arbitrary-root scenario, but the performance of the Growing mapping for store-and-forward routing can be very poor in some common cases.
* This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-429/95).
1052 An alternative solution for the arbitrary-root scenario is proposed in this paper, which is inspired in a previous work on embedding hypercubes onto meshes and tori [1]. The proposed scheme, which will be called DC-cube embedding, has the following properties: a) it can be applied to meshes with either wormhole or store-and-forward routing, b) it is significantly simpler than the Reflecting and Growing mappings, c) it is optimal for wormhole routing, and d) it is significantly faster than the Growing mapping in some common cases, for store-and-forward routing. A more detailed explanation of the results presented in this paper can be found in [4]. The rest of this paper is organized as follows. Section 2 reviews the Reflecting and Growing mappings [2], which are the basis for our proposal. In section 3, these mappings are extended to the arbitrary-root scenario. Section 4 presents our proposal. Finally, section 5 presents a performance comparison of the different approaches and draws the main conclusions. 2
Background
Rajopadhye and Telle [2] propose the use of a binomial tree to represent a divideand-conquer algorithm. Every node of the binomial tree represents a process that performs the following computations: 1. Receive a problem (of size x) from the parent (the host, if the node is the root of the tree). 2. Solve the problem locally (if 'small enough"), or divide the problem into two subproblems, each of size ax), and spawn a child process to solve one of these parts. In parallel, start solving the other part, by repeating step 2. 3. Get the results from the children and combine them. Repeat until all children's results have been combined. 4. Send the results to the parent (the host, if the node is the root). Step 2 is referred to as the division stage. It has a number of phases corresponding to the different repetitions of step 2 (the number of levels of the binomial tree). The division stage is followed by the combining stage in which the results of the different subprobems are combined to produce the final result. For the sake of simplicity, as in [2], only the division stage will be considered from now on. Two different values for parameter c~ (see step 2) will be considered: o~ = 1 (the problem is replicated) and a = 1/2 (the problem is halved). The execution of a binomial tree on a two-dimensional mesh can be specified in terms of the embedding of the tree onto the mesh. Two different embeddings were proposed in [2]: the Reflecting mapping and the Growing mapping. It was shown that the Reflecting mapping is optimal for wormhole routing since the communications are carried out without conflicts in the use of the mesh links. On the other hand, the Growing mapping is close to the optimal under storeand-forward routing. These performance properties were derived assuming a fixed-root scenario, that is, the root of the tree is always assigned to the same mesh node.
1053
3
Extending the Reflecting and Growing Mappings to the Arbitrary-Root Scenario
In the following, we consider the case in which the divide-and-conquer algorithm can be started at an arbitrary node (a, b) of the mesh (this is called the arbitraryroot scenario). We have first considered two straightforward approaches to extend the Reflecting and Growing mappings to the arbitrary-root scenario. In approach A we build a binomial tree which is an isomorphism of the binomial tree used in [2] for the fixed-root scenario (every label of the new tree is obtained by a fixed permutation of the bits in the old label). The isomorphism is defined so that the root of the binomial tree is mapped (by the corresponding embedding Reflecting or Growing) onto node (a, b). In approach B, the whole problem is first moved from node (a, b) to the starting node according to the original proposal (for the fixed-root scenario). These approaches will be compared with our proposal, that is described in the next section.
4
A New Approach for the Arbitrary-Root Scenario
Our approach to perform a divide-and-conquer computation on a mesh under the arbitrary-root scenario is inspired by the technique proposed in [1] to execute a certain type of parallel algorithms (which will be referred to as CC-cube algorithms) onto multidimensional meshes and tori. A d-dimensional CC-cube algorithm consists of 2^d processes which cooperate to perform a certain computation and communicate using a d-dimensional hypercube topology (the dimensions of the hypercube will be numbered from 0 to d - 1). The operations performed by every process in the CC-cube can be expressed as follows:

do i = 0, d-1
   compute
   exchange information with neighbour in dimension i
enddo

The above case corresponds to a CC-cube which uses the dimensions of the hypercube in increasing order. However, any other ordering in the use of the dimensions is also allowed in CC-cubes. A CC-cube can be executed in a mesh multicomputer by using an appropriate embedding. In particular, the standard and xor embeddings were proposed to map the CC-cube onto multidimensional meshes and tori, respectively. The properties of both embeddings were extensively analyzed in [1, 3]. A divide-and-conquer algorithm can be regarded as a particular case of a CC-cube algorithm. This particular case will be referred to as DC-cube (from Divide-and-Conquer) and has the following peculiarities with regard to CC-cubes:
- a) In every iteration, only a subset of the processes are active (all the processes are active in every iteration of a CC-cube).
Fig. 1. The four iterations of a 4-dimensional DC-cube starting at process 0 and using the hypercube dimensions in ascending order.
- b) Communication between neighbour nodes is always unidirectional (instead of the bidirectional exchange used by CC-cubes). A particular DC-cube is characterized by: (a) a process which is responsible for starting the divide-and-conquer algorithm, and (b) a certain ordering of the hypercube dimensions, which determine the order in which the processes of the DC-cube are activated. Figure 1 shows an example of a 4-dimensional DC-cube starting at process 0, and using the dimensions in ascending order. In this figure, the arrows represent the communications that are carried out in every iteration of the DC-cube. Finally, it can be shown t h a t a binomial tree with the root labelled as l is equivalent to a d-dimensional DC-cube initiated at process l and using the hypercube dimensions in descending order. The standard embedding has been proposed to m a p a 2k-dimensional hypercube onto a 2 k • 2 k mesh. The function S which m a p s a node i (i C [0, 22k - 1]) of the hypercube onto a node (a, b) (a, b E [0, 2 k - 1]) of the mesh is defined as1:
S(i) = (⌊i / 2^k⌋, i mod 2^k)     (1)
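A small sketch of the standard embedding and its inverse (our own illustration; only equation (1) and the expression S^{-1}((a, b)) = a · 2^k + b used below are taken from the text):

def standard_embedding(i, k):
    return (i // 2**k, i % 2**k)   # S(i), equation (1)

def start_process(a, b, k):
    return a * 2**k + b            # S^{-1}((a, b)): the DC-cube process started at mesh node (a, b)

k = 2
assert all(standard_embedding(start_process(a, b, k), k) == (a, b)
           for a in range(2**k) for b in range(2**k))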
The properties of the standard embedding that are more relevant to this paper are: a) It is an embedding with constant distances (this property means t h a t the neighbors in dimension i of the hypercube are found at a constant distance in the mesh, for any pair of neighbor nodes, and b) It has a m i n i m a l average distance. Property (a) is very attractive for the purpose of using the embedding on the arbitrary-root scenario. Property (b) is attractive from the performance point of view. To start a divide-and-conquer algorithm from an arbitrary node (a, b) of the wormhole mesh, we use a DC-cubeinitiated at process S - l ( ( a , b)) = a . 2 k + b, and using the dimensions of the hypercube in descending order. It is easy to see t h a t the standard embedding of such a DC-cube is conflict free, and therefore the approach is optimal. The formal proof of this property is very similar to the 1 The standard embedding can be easily extended to the general case of C-dimensional meshes. This extension is however out of the scope of this paper.
1055 Table 1. Comparision of approaches [
t~
GA 2 T M - 2
I
tr (general c~)
[tr (a = 1) I 2k-l-1
(~ + ~ ) 2 k - 1 + (~3 ,_ ~ /,, 2c~, 2 --1 _
4",(20t2~k--l--1
2 k-l-1 -- 2
te (c~ = 0.5)
~3 (2k -t- 1 -- 2-~_1 )
3 - 2 k - 1 -- 1 2k-1 -t- g1 +
Ge 3" 2 k-1 - 1 2a-1 - 1 -t- c~ -I- c~2 -t- (o!3 -t- c~4) (2~ S
2
(0~--"i- O~2"21r 2-"-ff~l
2k+l -- 2
3
2-3( 1 -- -~-) 2
proof given in [2] for the case of the Reflecting m a p p i n g under the fixed-root scenario. To start a divide-and-conquer algorithm from an arbitrary node (a, b) of the store-and-forward mesh, we use a D e - c u b e initiated at process S - l ( ( a , b)) = a 9 2 k -l- b, and using the dimensions of the hypercube in ascending order. In this way, the distances corresponding to the first iterations of the c o m p u t a t i o n (involving larger messages) are smaller. 5
Comparison of Approaches and Conclusions
As a general consideration, it can be said that the standard embedding is significantly simpler than both Reflecting and Growing mapping. When considering the wormhole arbitrary-root scenario, both the approach A for Reflecting m a p p i n g and the standard embedding of DC-cubes are optimal. The comparison of approaches for the store-and-forward arbitrary-root scenario are summarized in table 1, in terms of the average communication cost (GA for approach A, GB for approach B and S for the standard embedding of DC-cube). The average cost GA is defined as: ]
G_A = (1 / 2^{2k}) · Σ_{a=0}^{2^k - 1} Σ_{b=0}^{2^k - 1} G^{a,b}     (2)
where G a'b is the cost of the division stage when the c o m p u t a t i o n is initiated at node (a, b). This cost is defined in terms of the startup cost incurred in every communication between neighbor nodes (denoted by ts) and the transmission time per message size unit (denoted by te). The average costs GB and S are defined in a similar way. Two particular cases of the term affecting te are distinguished, corresponding to a = 1 and c~ =- 0.5. The expressions in table 1 can be compared in two different cases. In the "small volume" case, the communication cost is assumed to be dominated by the t e r m affecting ts (this will happen when ts/te is large a n d / o r the problem is small). In the "large volume" case, the cost is assumed to be dominated by the t e r m affecting te (this will happen when ~ / t ~ is small a n d / o r the problem is large). The conclusions drawn from table 1 are: - a) In the "small volume" case, approach B is the best, since: GA = S = (4/3)GB.
1056 - b) In the "large volume" case and c~ = 1, the conclusion is exactly the same as case ( a )
- c) In the "large volume" case and a = 0.5, approach S is significantly better t h a n the rest, since:
Note that cases (a) and (c) are expected to be the most frequent since a parallel computer is targeted to solve large problems. Besides, they are also the most relevant since they m a y be very time consuming. Note that in ease (c) the improvement of the standard embedding of DC-cube over approaches A and B is proportional to the number of nodes and therefore it is very high for large systems.
References
1. González, A., Valero-García, M., Díaz de Cerio, L.: Executing Algorithms with Hypercube Topology on Torus Multicomputers. IEEE Transactions on Parallel and Distributed Systems 8 (1995) 803-814
2. Lo, V., Rajopadhye, S., Telle, J.A.: Parallel Divide and Conquer on Meshes. IEEE Transactions on Parallel and Distributed Systems 10 (1996) 1049-1057
3. Matic, S.: Emulation of Hypercube Architecture on Nearest-Neighbor Mesh-Connected Processing Elements. IEEE Transactions on Computers 5 (1990) 698-700
4. Valero-García, M., González, A., Díaz de Cerio, L., Royo, D.: Divide-and-Conquer Algorithms on Two-Dimensional Meshes. Research Report UPC-DAC-1997-30, http://www.ac.upc.es/recerca/reports/INDEX1997DAC.html
All-to-all Scatter in Kautz Networks*
Petr Salinger and Pavel Tvrdík
Department of Computer Science and Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague, Czech Republic
{salinger, tvrdik}@sun.felk.cvut.cz
Abstract. We give lower and upper bounds on the average distance of the Kautz digraph. Using this result, we give a memory-optimal and asymptotically time-optimal algorithm for All-to-All-Scatter in noncombining all-port store-&-forward Kautz networks.
1
Introduction
Shuffle-based interconnection networks, which include shuffle-exchange, de Bruijn and Kautz networks, are alternatives to orthogonal (meshes, tori, hypercubes) and tree-based networks. Binary shuffle-based networks have been shown to have similar properties as fixed-degree hypercubic networks (butterflies, CCCs), but general de Bruijn and Kautz networks form a special family of interconnection topologies whose properties are still poorly understood. Recently, the interest in these networks has come from designers of optical interconnects [2, 3]. In this paper, we focus on d-ary Kautz digraphs. We give lower and upper bounds on the average distance of the Kautz digraph. Using this result, we give a memory-optimal and asymptotically time-optimal algorithm for All-to-All-Scatter in noncombining all-port Kautz networks.
The vertex (arc) set of a graph G is denoted by V(G) (A(G), respectively). The alphabet of d letters, {0, 1, ..., d - 1}, is denoted by Zd. For d ≥ 2 and D ≥ 2, the Kautz digraph of degree d and diameter D, K(d, D), has vertex set V(K(d, D)) = {x1...xD; xi ∈ Zd+1 ∧ xi ≠ xi+1 ∀i ∈ {1, ..., D - 1}} and there is an arc from vertex x = x1...xD to vertex y = y1...yD, x, y ∈ V(K(d, D)), iff xi+1 = yi ∀i ∈ {1, ..., D - 1}. The arc from vertex x1...xD to x2...xD xD+1 is denoted by ⟨x1...xD xD+1⟩.
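As a small illustration of this definition (our own sketch, not code from the paper), K(d, D) can be generated directly from the word description: vertices are the words of length D over Z_{d+1} with no two consecutive letters equal, and x1...xD has an arc to x2...xD y for every letter y different from xD.

from itertools import product

def kautz(d, D):
    vertices = [w for w in product(range(d + 1), repeat=D)
                if all(w[i] != w[i + 1] for i in range(D - 1))]
    arcs = [(x, x[1:] + (y,)) for x in vertices for y in range(d + 1) if y != x[-1]]
    return vertices, arcs

V, A = kautz(2, 3)
print(len(V), len(A))    # 12 vertices and 24 arcs, i.e. (d+1)*d^(D-1) vertices and (d+1)*d^D arcs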
2
T h e Average D i s t a n c e of K a u t z D i g r a p h s
Let N = |V(G)|. The distance between vertices x, y ∈ V(G) is denoted by d(x, y). The average distance of x ∈ V(G), d̄(x), is the average of the distances between x and all the vertices of G (including x): d̄(x) = (1/N) Σ_{y ∈ V(G)} d(x, y). The average distance of G is defined as the average of the average distances of all vertices of G: d̄(G) = (1/N) Σ_{x ∈ V(G)} d̄(x).
* This research has been supported by GAČR grant No. 102/97/1055 and FRVŠ grant No. 1251/98.
In K(d, D), given vertices x = xl ...XD and y = Yl ...YZ), there is a unique shortest dipath from x to y. To calculate d(x, y) and to construct the shortest p a t h from x to y, we have to find the largest overlap of x and y, i.e., the smallest i such that X i + l . . . XD = Y l . . . YD-i. Then d(x, y) = i and the shortest p a t h is obtained by performing left shifts introducing successively Y D + I - i , . . . , YD. This fact allows to compute easily d(x) for any x and consequently d_(K(d, D)). In this section, we give tight upper and lower bounds on d(K(d, D)). Our analysis is based on shortest-path spanning trees, similarly to de Bruijn networks [1]. The upper (lower) bound on the average distance is reached when all vertices are as far (close) as possible from the root. L e m m a 1. In K(d, D), for any vertex x, 1)-
1
( d - 1)(1 - ~ )
+
D d + dD-2(d 2 - 1)
< 3(x) < D -
1
2
- - +
d-
1
(d 2 - 1)d D - l "
There are (d + 1)d vertices matching the upper bound. These are all vertices formed by alternating a and b, a r b C Zd+l. There are (d + 1)d(d - 1) D-2 vertices matching the lower bound. These are all vertices x = x l . . . X D where xiCxD, for all i, l < i < D - 1 .
3
Memory-Optimal and Asymptotically Time-Optimal AAS in Kautz Digraphs
All-to-all scatter (AAS), also called complete exchange, is a fundamental collective communication pattern. Every vertex has a special packet for every other vertex and all vertices start scattering simultaneously. Most of the published AAS algorithms assume that in one communication step, vertices can combine individual packets arbitrarily into longer messages and can recombine incoming compound messages into new compounds. This assumption is not very realistic, since it implies substantial overhead and compound messages can become very large. Hence, we assume the all-port noncombining lock-step model. Every packet travels in the network as a separate entity. All vertices start the scattering simultaneously and transmit packets with the same speed in steps. A router receives packets from all input links in one step, those destined for the local processor stores into the local processor memory, and the remaining ones together with those ejected by the local processor sends out using all output links in the next step. We assume store-and-forward hardware: every input and output buffer of routers can store one packet. If a packet cannot be retransmitted in the next step, it must be stored in an auxiliary buffer of the router. An AAS algorithm is said to be memory-optimal if the routers require auxiliary buffers for 0 ( 1 ) packets. The AAS algorithm is based on lexicographic labeling of input and o u t p u t arcs of vertices and on greedy routing. To describe these notions, we need three auxiliary functions.
1059 Definitionl.
F o r any a E Zd+l
'
b E Zd: C~(a,b) = ~ b + l [ b { bb
F o r a • b E Zd+z: ~(b, a) = # ( a , b ) =
i f a < b, otherwise 9
ifa>b, i f a < b.
- 1
K a u t z vertex x z . . . X D is adjacent fl'om d vertices ~( x z , i ) x l . . . x D - 1 , i = 0 , . . . , d1. T h e arc <~(x], i ) x l . . , x p } , denoted by x$i, has i n p u t label i. Similarly, vertex x l . . . X D is adjacent to d vertices x 2 . . . x D a ( x D , j ) , j = 0,...,d1. T h e arc @1...XDr denoted by z S j , has o u t p u t label j. Hence arc @ 1 . . . X D + l } o f K ( d , D) has input label ~(xl, x2) and o u t p u t label # ( X D , XD+I) and therefore <Xl'''XD+I>
= (:g2"''XD+I)4~(Xl,X2)
:
(Xl...XD)~/-t(XD,XD+I)-
Every K a u t z vertex x = X l . . . X D is the root of a "virtual" complete d-ary tree of height D, whose level l, 1 < 1 < D - 1, consists of all K a u t z vertices x t + l . . . x D y z . . . Y t , constructed from X l . . . X D by l left shift operations, and whose level-D vertices are all K a u t z vertices Y l . . . YD, YZ # XD. This tree is called the greedy routing virtual tree rooted in x, denoted by G R T ( x ) . Obviously, some K a u t z vertices appear more t h a n once in G f t T ( x ) , this depends on the periodicity of x. All G R T ( x ) have the same o u t p u t arc labeling 9 T h e A A S algorithm consists of two similar phases. T h e first phase has d D - 1 rounds. In each round r, 1 < r < d D - l , of the first phase, every root x = X l . . . X D inserts a d-tuple of packets into G R T ( x ) , one packet per o u t p u t arc, and the packets move down the tree using D-step greedy routing 9 Let y = Yl 9 9 9YD be a destination vertex, Yl ~ xD. T h e D - s t e p greedy r o u t i n g in G R T ( x ) sends packets from x to y via vertices x 2 . . . x D y l , x3...XDYlY2, XDyZ 9 9 9Y D - 1 . Transmission of packets f r o m vertices xl 9 9 9x D y ] 9 9 9Yi-1 to vertices X i + l . . . X D Y l . . . Y i , 1 < i < D, is said to be the i-th step of r o u n d r. (We put xD+l -- Yl). The core of the A A S a l g o r i t h m is functions ~1 and ~2. 9
9
9
D e f i n i t i o n 2. L e t x = x z . . . X D
E V(K(d,D)), r fi T h e n O l ( X , r , i ) = y~l'i . . . y D , where r,i
Yz
= c~(xD, i),
r,i : Yk+l
qrk
0 < r < dD-~--l, 0 < i < d-1.
O" r,i (Yk ,ark),
= (~(xk,zk+l) + r div d k - l )
LetO < r < dD-2
__
Z ; 'i
~
r,i Zk+l
= y~,i, k
m o d d,
k = 1,..., D-
1,
k = 1,...,D-
1.
r,i, where 1. T h e n q ~ 2 ( x , r , i ) =zrl ' i . . . z D
X D
k = l,
"" "
D-
R o u t e r in K a u t z vertex x = X l . . . X D protocol the following algorithm:
1. executes in the first phase of the A A S
for r = 0 to d^{D-1} - 1 do_sequentially {
  for i = 0 to d - 1 do_in_parallel {
    take the packet destined for vertex Φ1(x, r, i) from the local memory;
    apply the first step of the D-step greedy routing };
  for j = 2 to D do_sequentially
    for i = 0 to d - 1 do_in_parallel {
      receive a packet from arc x↓i;
      apply the j-th step of the D-step greedy routing };
  for i = 0 to d - 1 do_in_parallel {
    receive a packet from arc x↓i;
    store the packet into the local processor memory } };
The second phase has d^{D-2} rounds and uses (D - 1)-step greedy routing from x = x1...xD to xD y1...y_{D-1}, which is defined similarly, using function Φ2 instead of Φ1.
Theorem 1. The AAS algorithm is memory-optimal and contention-free. Every
arc of K(d, D) in every step of every round is used for transmitting one packet. The algorithm is asymptotically time- and transmission-optimal. Proof. It follows from Definition 2 that for all i and for any two rounds r l r r2, ~ l ( x , rl,i) # ~)l(x, r2, i). Also q~l(x, ri, il) 7~ ~l(x, rl,i2) for any il # i2. We need just to show that the communication is contention-free. Consider round r r,i of the first phase and root x. The packet destined for q>l(x, r, i) = Ylr,i ".. YD is issued via the i-th output arc of x. The first step is contention-free and every vertex receives one packet per input arc. Vertex x 2 . . . xDylr,~ , where Ylr,i = ~(XD, i) gets the packet from its input arc with input label ~(xl, x2) and greedy routing r,i r,i r,~ forwards this packet to vertex x 3 . . . X D Y l Y2 , where Y2 = ~r(Y~'*, (~(Xl, x 2 ) + r div d o) mod d) = ~(yl~,i , (r x~) + r) mod d). Hence, vertex z ~ . . . ~ . y l ~,i will use output arc #(yrj, a(y~,, , ~(xl, x 2 ) + r ) mod d) = (~(xl, x2) § mod d. This is clearly a permutation of input arcs to output arcs. The same argument applies to any step j, j = 3 , . . . , D, of round r. The same argument can be used to prove that the second-phase communication is also contention-free. The total number of steps of the AAS algorithm is clearly v ( d, D) = Dd p - 1 + ( D - l) d D- 2. The lower bound on the number of steps of AAS in digraph G is
~AAS(G)=/
IA(G)I
/'
Therefore T(d, D)
Dd D-1 + (D - 1)d D-2
2d- 1 TAAS(K(d, D)) = -d(K(d, D))(d + 1)d D-2 <- 1 + D ( d + l ) ( d - 1 ) 2 - d
~"
Let Dmin(d) be the smallest diameter such that the slowdown of the AAS algorithm in K(d, D) is at most 5 % for any D >__ Omin(d ). Then Dmi~(2) = 22, Omin (3) ----7, Dmi n (4) = 4, and for d _> 5, Drnin (d) -- 2. 4
Conclusions
We have given tight lower and upper bounds on the average distance in K(d, D). They are very close to the diameter, they can be approximated with D 1 We have also designed a memory-optimal and asymptotically time- and 7/-1"
1061 transmission-optimal all-to-all scatter algorithm in noncombining all-port K (d, D). It is based on greedy routing which treats the Kautz networks as if they were vertex-transitive. Every vertex becomes the root of a virtual complete d-ary tree. For d > 5, the slowdown of the algorithm is within 5% for any diameter D.
References 1. J.-C. Bermond, Z. Liu, and M. Syska. Mean eccentricities of de Bruijn networks. Networks, 30:187-203, 1997. 2. G. Liu, K. Y. Lee, and H. F. Jordan. TDM and TWDM de Bruijn networks and ShuffieNets for optical communications. IEEE Transactions on Computers, 46:695-701, 1997. 3. K. Sivarajan and R. Ramaswami. Lightwave networks based on de Bruijn graphs. IEEE/A CM Trans. Networking, 2(1):70-79, 1994.
Reactive Proxies: A Flexible Protocol Extension to Reduce ccNUMA Node Controller Contention
Sarah A. M. Talbot and Paul H. J. Kelly
Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, United Kingdom
{saint, phjk}@doc.ic.ac.uk
A b s t r a c t . Serialisation can occur when many simultaneous accesses are made to a single node in a distributed shared-memory multiprocessor. In this paper we investigate routing read requests via an intermediate proxy node (where combining is used to reduce contention) in the presence of finite message buffers. We present a reactive approach, which invokes proxying only when contention occurs, and does not require the programmer or compiler to mark widely-shared data. Simulation results show that the hot-spot contention which occurs in pathological examples can be dramatically reduced, while performance on well-behaved applications is unaffected.
1
Introduction
Unpredictable performance anomalies have hampered the acceptance of cache-coherent non-uniform memory access (ccNUMA) architectures. Our aim is to improve performance in certain pathological cases, without reducing performance on well-behaved applications, by reducing the bottlenecks associated with widely-shared data. This paper moves on from our initial work on proxy protocols [1], eliminating the need for application programmers to identify widely-shared data. Each processor's memory and cache is managed by a node controller. In addition to local memory references, the controller must handle requests arriving via the network from other nodes. These requests concern cache lines currently owned by this node, cache line copies, and lines whose home is this node (i.e. the page holding the line was allocated to this node, by the operating system, when it was first accessed). In large configurations, unfortunate ownership migration or home allocations can lead to concentrations of requests at particular nodes. This leads to performance being limited by the service rate (occupancy) of an individual node controller, as demonstrated by Holt et al. [6]. Our proxy protocol, a technique for alleviating read contention, associates one or more proxies with each data block, i.e. nodes which act as intermediaries for reads [1]. In the basic scheme, when a processor suffers a read miss, instead
Fig. 1. Contention is reduced by routing reads via a proxy: (a) without proxies; (b) with two proxy clusters (read line l); (c) read next data block (i.e. read line l + 1).
of directing its read request directly to the location's home node, it sends it to one of the location's proxies. If the proxy has the value, it replies. If not, it forwards the request to the home: when the reply arrives it can be forwarded to all the pending proxy readers and can be retained in the proxy's cache. The main contribution of this paper is to present a reactive version, which uses proxies only when contention occurs, and does not require the application programmer (or compiler) to identify widely-shared data. The rest of the paper is structured as follows: reactive proxies are introduced in Section 2. Our simulated architecture and experimental design are outlined in Section 3. In Section 4, we present the results of simulations of a set of standard benchmark programs. Related work is discussed in Section 5, and in Section 6 we summarise our conclusions and give pointers to further work.
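To make the basic mechanism concrete, the following is a minimal sketch of how a proxy node might handle the messages named in Fig. 2. It is an illustration only, not the authors' controller implementation: the ProxyNode class, the Cache interface and the send() helper are assumptions, and the distributed pending chain is flattened here into a simple per-line list.

```python
# Hedged sketch of proxy-side read handling; message names follow Fig. 2,
# while ProxyNode, Cache and send() are hypothetical helpers.

class ProxyNode:
    def __init__(self, node_id, cache, send):
        self.node_id = node_id
        self.cache = cache       # local cache used to retain proxy copies
        self.pending = {}        # line -> list of clients waiting (pending chain)
        self.send = send         # send(dest, msg_type, **fields)

    def on_proxy_read_request(self, line, client, home):
        if self.cache.has(line):
            # Proxy hit: reply directly, without involving the home node.
            self.send(client, "take_shared", line=line, data=self.cache.get(line))
        elif line in self.pending:
            # A request for this line is already outstanding: combine by
            # appending the client to the pending chain.
            self.pending[line].append(client)
        else:
            # Proxy miss: start a pending chain and ask the home node once.
            self.pending[line] = [client]
            self.send(home, "read_request", line=line, requester=self.node_id)

    def on_take_shared(self, line, data):
        # Data arrives from the home node: retain a copy at the proxy, then
        # forward it to every combined reader on the pending chain.
        self.cache.put(line, data)
        for client in self.pending.pop(line, []):
            self.send(client, "take_shared", line=line, data=data)
```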
2
Reactive Proxies
The severity of node controller contention is both application and architecture dependent [6]. Controllers can be designed so that there is multi-threading of requests (e.g. the Sun S3.mp is able to handle two simultaneous transactions [12]), which slightly alleviates the occupancy problem but does not eliminate it. Some contention is inevitable, and will increase the latency of transactions. The key problem is that queue lengths at controllers, and hence contention, are non-uniformly distributed around the machine. One way of reducing the queues is to distribute the workload to other node controllers, using them as proxies for read requests, as illustrated in Fig. 1. When a processor makes a read request, instead of going directly to the cache line's home, it is routed first to another node. If the proxy node has the line, it replies directly. If not, it requests the value from the home itself, allocates it in its own cache, and replies. Any requests for a particular block which arrive at a proxy before it has obtained a copy from the home node are added to a distributed chain of pending requests for that block, and the reply is forwarded down the pending chain, as illustrated in Fig. 2. It should be noted that write requests are
Fig. 2. Combining of proxy requests: (a) the first request to a proxy has to be forwarded to the home node (proxy_read_request, then read_request to the home); (b) a second client request, arriving before the data is returned, forms a pending chain (take_hole); (c) the data is supplied to the proxy and then passed to each client on the pending chain (take_shared).
not affected by the use of proxies, except for the additional invalidations that may be needed to remove proxy copies (which will be handled as a matter of course by the underlying protocol). The choice of proxy node can be at random, or (as shown in Fig. 1) on the basis of locality. To describe how a client node decides which node to use as a proxy for a read request, we begin with some definitions:
- P: the number of processing nodes.
- H(l): the home node of location l. This is determined by the operating system's memory management policy.
- NPC: the number of proxy clusters, i.e. the number of clusters into which the nodes are partitioned for proxying (e.g. in Fig. 1, NPC=2). The choice of NPC depends on the balance between the degree of combining and the length of the proxy pending chain. NPC=1 will give the highest combining rate, because all proxy read requests for a particular data block will be directed to the same proxy node. As NPC increases, combining will reduce, but the number of clients for each proxy will also be reduced, which will lead to shorter proxy pending chains.
Fig. 3. Bounced read requests are retried via proxies: (a) input buffer full, some read requests bounce; (b) reactive proxy reads.
- PCS(C): the set of nodes which are in the cluster containing client node C. In this paper, PCS(C) is one of NPC disjoint clusters each containing P/NPC nodes, with the grouping based on node number.
- PN(l, C): the proxy node chosen for a given client node C when reading location l. We use a simple hash function to choose the actual proxy from the proxy cluster PCS(C). If PN(l, C) = C, or PN(l, C) = H(l), then client C will send a read request directly to H(l).
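As an illustration of these definitions, the cluster partitioning PCS(C) and the hash PN(l, C) could be realised as in the sketch below. The modular grouping by node number and the block-based hash are assumptions, not the exact functions used by the authors.

```python
# Illustrative proxy selection: P nodes split into NPC clusters by node
# number; the hash is an assumption, not the simulator's function.

def pcs(client, num_nodes, npc):
    """Return the proxy cluster PCS(C) containing the client node."""
    cluster_size = num_nodes // npc
    start = (client // cluster_size) * cluster_size
    return list(range(start, start + cluster_size))

def pn(location, client, home, num_nodes, npc):
    """Choose the proxy node PN(l, C) for a read of location l by client C."""
    cluster = pcs(client, num_nodes, npc)
    block = location_block(location)         # cache-line (data block) number
    proxy = cluster[block % len(cluster)]     # successive blocks -> different proxies
    if proxy == client or proxy == home:
        return home                           # degenerate cases: go straight to H(l)
    return proxy

def location_block(location, line_size=64):
    # Assumed 64-byte cache lines, as in the simulated architecture.
    return location // line_size
```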
The choice of proxy node is, therefore, a two-stage process. When the system is configured, the nodes are partitioned into NPC clusters. Then, whenever a client wants to issue a proxy read, it will use the hashing function PN(l, C) to select one proxy node from PCS(C). This mapping ensures that requests for a given location are routed via a proxy (so that combining occurs), and that reads for successive data blocks go to different proxies (as illustrated in Fig. 1(c)). This will reduce network contention [15] and balance the load more evenly across all the node controllers. In the basic form of proxies, the application programmer uses program directives to mark data structures: all other shared data will be exempt from proxying [1]. If the application programmer makes a poor choice, then the overheads incurred by proxies may outweigh any benefits and degrade performance. These overheads include the extra work done by the proxy nodes handling the messages, proxy node cache pollution, and longer sharing lists. In addition, the programmer may fail to mark data structures that would benefit from proxying. Reactive proxies overcome these problems by taking advantage of the finite buffering of real machines. When a remote read request reaches a full buffer, it will immediately be sent back across the network. With the reactive proxies protocol, the arrival of a buffer-bounced read request will trigger a proxy read (see Fig. 3). This is quite different to the basic proxies protocol, where the user has to decide whether all or selected parts of the shared data are proxied, and proxy reads are always used for data marked for proxying. Instead, proxies are only used when congestion occurs. As soon as the queue length at the destination node has dropped below the limit, read requests will no longer be bounced and proxy reads will not be used.
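The reactive policy itself then reduces to a small rule at the client, sketched below under the same assumptions (pn() from the previous sketch, and hypothetical home_of(), NUM_NODES and NPC values): a read miss goes straight to the home, and only a buffer-bounced read is retried via a proxy.

```python
# Hedged sketch of the client-side reaction to a buffer-bounced read.
# home_of(), NUM_NODES and NPC are assumed to be provided elsewhere.

def on_read_miss(client, line, send):
    # Normal case: reads go directly to the line's home node.
    send(home_of(line), "read_request", line=line, requester=client)

def on_buffer_bounce(client, line, send):
    # The home bounced our read because its input queue exceeded the limit:
    # retry the same read through a proxy instead of simply re-sending it.
    proxy = pn(line, client, home_of(line), NUM_NODES, NPC)
    if proxy == home_of(line):
        send(proxy, "read_request", line=line, requester=client)
    else:
        send(proxy, "proxy_read_request", line=line,
             client=client, home=home_of(line))
```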
The repeated bouncing of read requests which can occur with finite buffers leads to the possibility of deadlock: the underlying protocol has to detect the continuous re-sending of a remote read request, and eventually send a higher-priority read request which is guaranteed service. Read requests from proxy nodes to home nodes will still be subject to buffer bouncing, but the combining and re-routing achieved by proxying reduce the chances of a full input buffer at the home node. The reactive proxy scheme has the twin virtues of simplicity and low overheads. No information needs to be held about past events, and no decision is involved in using a proxy: the protocol state machine is simply set up to trigger a proxy read request in response to the receipt of a buffer-bounced read request.
3
Simulated Architecture and Experimental Design
In our execution-driven simulations, each node contains a processor with an integral first-level cache (FLC), a large second-level cache (SLC), memory (DRAM), and a node controller (see Fig. 4). The node controller receives messages from, and sends messages to, both the network and the processor. The SLC, DRAM, and the node controller are connected using two decoupled buses. This decoupled bus arrangement allows the processor to access the SLC at the same time as the node controller accesses the DRAM. Table 1 summarises the architecture. We simulate a simplified interconnection network, which follows the LogP model [3]. We have parameterised the network and node controller as follows:
- L: the latency experienced in each communication event, 10 cycles for long messages (which include 64 bytes of data, i.e. one cache line), and 5 cycles for all other messages. This represents a fast network, comparable to the point-to-point latency used in [11].
- o: the occupancy of the node controller. Like Holt et al. [6], we have adapted the LogP model to recognise the importance of the occupancy of a node controller, rather than just the overhead of sending and receiving messages. The processes which cause occupancy are simulated in more detail (see Table 2).
- g: the gap between successive sends or receives by a processor, 5 cycles.
- P: the number of processor nodes, 64 processing nodes.
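For reference, these parameters can be restated in one place; the dictionary below simply collects the values given above and is not a configuration file from the simulator.

```python
# The network and controller parameters used in the simulations, restated as
# a plain dictionary (cycle counts as given in the text).
SIM_PARAMS = {
    "L_long": 10,    # latency of long messages carrying a 64-byte cache line (cycles)
    "L_short": 5,    # latency of all other messages (cycles)
    "o": "see Table 2",  # node controller occupancy, modelled per operation
    "g": 5,          # gap between successive sends/receives by a processor (cycles)
    "P": 64,         # number of processing nodes
    "read_request_queue_limit": 8,  # reads bounce once a queue exceeds this
}
```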
We limit our message buffers to eight for read requests. There can be more messages in an input buffer, but once the queue length has risen above eight, all read requests will be bounced back to the sender until the queue length has fallen below the limit. This is done because we are interested in the effect of finite buffering on read requests rather than all messages, and we wished to be certain that all transactions would complete in our protocol. The queue length limit of √P is an arbitrary but reasonable choice. Each cache line has a home node (at page level) which: either holds a valid copy of the line (in SLC and/or DRAM), or knows the identity of a node which does have a valid copy (i.e. the owner); has guaranteed space in DRAM for the line; and holds directory information for the line (head and state of the sharing
Fig. 4. The architecture of a node

Table 1. Details of the simulated architecture

CPU: CPI 1.0; instruction set based on DEC Alpha
Instruction cache: all instruction accesses assumed primary cache hits
First-level data cache: capacity 8 Kbytes; line size 64 bytes; direct mapped, write-through
Second-level cache: capacity 4 Mbytes; line size 64 bytes; direct mapped, write-back
DRAM: capacity infinite; page size 8 Kbytes
Node controller: non-pipelined; service time and occupancy as in Table 2; cycle time 10 ns
Interconnection network: topology full crossbar; incoming message queues limited to 8 read requests
Cache coherence protocol: invalidation-based, sequentially-consistent ccNUMA; home nodes assigned to the first node to reference each page (i.e. "first-touch-after-initialisation"); distributed directory using a singly-linked sharing list, based on the Stanford Distributed-Directory Protocol described by Thapar and Delagi [14]

Table 2. Latencies of the most important node actions

operation               time (cycles)
Acquire SLC bus         2
Release SLC bus         1
SLC lookup              6
SLC line access         18
Acquire MEM bus         3
Release MEM bus         2
DRAM lookup             20
DRAM line access        24
Initiate message send   5
1068 Table 3. Benchmark applications shared data marked for basic proxyln all 64 x 64 grid all 64K points all 8K particles f_array ( p a r t of G _ M e m o r y ) 512 x 512 m a t r i x entire matrix 258 x 258 o c e a n q_multi and rhs_multi 258 x 258 o c e a n fields, fields2, wrk, a n d f r c n g 512 molecules V A R and P F O R C E S problem
B arnes CFD FFT FMM GE Ocean-Contig Ocean-Non-Conti Water-Nsq
size
list). The distributed directory holds the identities of nodes which have cached a particular line in a sharing chain, currently implemented as a singly-linked list. The directory entry for each data block provides the basis for maintaining the consistency of the shared data. Only one node at a time can remove entries from the sharing chain (achieved by locking the head of the sharing chain at the home node), and messages which prompt changes to the sharing chain are ordered by their arrival at the home node. This mechanism is not affected by the protocol additions needed to support proxies. Proxy nodes require a small amount of extra store to be added to the node controller. Specifically we need to be able to identify which data lines have outstanding transactions (and the tags they refer to), and be able to record the identity of the head of the pending proxy chain. In addition, the node controller has to handle the new proxy messages and state changes. We envisage implementing these in software on a programmable node controller, e.g. the MAGIC node controller in Stanford's FLASH [9], or the SCLIC in the Sequent NUMA-Q [10]. The benchmarks and their parameters are summarised in Table 3. GE is a simple Gaussian elimination program, similar to that used by Bianchini and LeBlanc in their study of eager combining [2]. We chose this benchmark because it is an example of widely-shared data. CFD is a computational fluid dynamics application, modelling laminar flow in a square cavity with a lid causing friction [13]. We selected six applications from the SPLASH-2 suite, to give a cross-section of scientific shared memory applications [16]. We used both Ocean benchmark applications, in order to study the effect of proxies on the "tuned for data locality" and "easy to understand" variants. Other work which refers to Ocean can be assumed to be using Ocean-Contig.
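The per-line directory state described above (a head pointer and state for a singly-linked sharing chain, with chain updates serialised by locking the head at the home node) might be represented as in the following sketch; the class and field names are hypothetical.

```python
# Hedged sketch of a home node's directory entry for one cache line,
# following the description above; names and fields are illustrative only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DirectoryEntry:
    state: str = "uncached"      # e.g. "uncached", "shared", "exclusive"
    head: Optional[int] = None   # head node of the singly-linked sharing chain
    owner: Optional[int] = None  # node holding the valid copy, if not the home
    locked: bool = False         # only one node at a time may edit the chain

    def lock_for_update(self) -> bool:
        # Changes to the sharing chain are serialised by locking the head at
        # the home node; requests are ordered by their arrival at the home.
        if self.locked:
            return False
        self.locked = True
        return True

    def unlock(self) -> None:
        self.locked = False
```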
4
Experimental Results
In this work, we concentrate on reactive proxies, but compare the results with basic proxies (which have already been examined in [1]). The performance results for each application are presented in Table 4 in terms of relative speedup with no proxying (i.e. the ratio of the execution time running on 1 processor to the execution time on 64 processing nodes), and percentage changes in execution time when proxies are used. The problem size is kept constant. The relative changes results in Fig. 5 show three different metrics:
Table 4. Benchmark performance for 64 processing nodes: relative speedup with no proxies, and % change in execution time (+ is better) with basic and reactive proxies for 1 to 8 proxy clusters (NPC).

Relative speedup with no proxies: Barnes 43.2, CFD 30.6, FFT 47.4, FMM 36.1, GE 22.0, Ocean-Contig 48.9, Ocean-Non-Contig 50.5, Water-Nsq 55.5.
- messages: the ratio of the total number of messages to the total without proxies,
- execution time: the ratio of the execution time (excluding startup) to the execution time (also excluding startup) without proxies, and
- queueing delay: the ratio of the total time that messages spend waiting for service to the total without proxies.
The message ratios shown in Fig. 6 are: -
- proxy hit rate: the ratio of the number of proxy read requests which are serviced directly by the proxy node, to the total number of proxy read requests (in contrast, a proxy miss would require the proxy to request the data from the home node),
- remote read delay: the ratio of the delay between issuing a read request and receiving the data, to the same delay when proxies are not used,
- buffer bounce ratio: the ratio of the total number of buffer bounce messages to read requests. This gives a measure of how much bouncing there is for an application. This ratio can go above one, since only the initial read request is counted in that total, i.e. the retries are excluded, and
- proxy read ratio: the ratio of the proxy read messages to read requests; this gives a measure of how much proxying is used in an application.
The first point to note from Table 4 is that there is no overall "winner" between basic and reactive proxies, in that neither policy improves the performance of all the applications for all values of proxy clusters.
Fig. 5. Relative changes for 64 processing nodes with reactive proxies
Looking at the results for different values of NPC, for basic proxies there is no value which has a positive effect on the performance of all the benchmarks. However, for reactive proxies, there are two proxy cluster values that improve the performance of all the benchmarks, i.e. NPC=5 and NPC=8 achieve a balance between combining, queue distribution, and the length of the proxy pending chains. Reactive proxies may not always deliver the best performance improvement, but by providing stable points for NPC they are of more use to system designers. It should also be noted that, in general, using reactive proxies reduces the number of messages, because they break the cycle of re-sending read messages in response to a finite buffer bounce (see Fig. 5). Looking at the individual benchmarks:
Barnes. In general, this application benefits from the use of reactive proxies. However, changing the balance of processing by routing read requests via proxy nodes can have more impact than the direct effects of reducing home node congestion. Two examples illustrate this: when NPC=6 for basic proxies, load miss delay is the same as with no proxies, store miss delay has increased slightly from the no-proxy case, yet a reduction of 0.4% in lock and barrier delays results in an overall performance improvement of 0.3%. Conversely, for reactive proxies when NPC=6, the load and store miss delays are the same as when proxies are
Fig. 6. Message ratios for 64 processing nodes with reactive proxies
not used, but a slight 0.1% increase in lock delay gives an overall performance degradation of -0.1%.
CFD. This application benefits from the use of reactive proxies, with performance improvements in the range 4.1% to 5.6%. However, the improvements are not as great as those obtained with basic proxies. The difference is attributable to the delay in triggering each reactive proxy read: for this application it is better to use proxy reads straight away, rather than waiting for read requests to be bounced. It should also be noted that the proxy hit rate oscillates, with peaks at NPC=2, 4, 8 (see Fig. 6). This is due to a correspondence between the chosen proxy node and ownership of the cache line.
FFT. This shows a marked speedup when reactive proxies are used, of between 10.6% and 11.6%. The number of messages decreases with proxies because the buffer bounce ratio is cut from a severe 0.7 with no proxies. The mean queueing delay drops as the number of proxy clusters increases, reflecting the benefit of spreading the proxy read requests. However, this is balanced by a slow increase in the buffer bounce ratio, because as more nodes act as proxies there will be more read requests to the home node, and these read requests will start to bounce as the number of messages in the home node's input queue rises to √P and above.
FMM. There is a marginal speedup compared with no proxies (between 0.3% and 0.4%). This is as expected, given that only the f_array (part of G_Memory) is known to be widely-shared, which was why it was marked for basic proxies. However, the performance improvement is slightly better than that achieved using basic proxies, so the reactive method dynamically detects opportunities for read combining which were not found by code inspection and profiling tools.
GE. This application, which is known to exhibit a high level of sharing of the current pivot row, shows a large speedup in the range 21.4% to 23.2%. However, the improvement is not as good as that obtained using basic proxies. This was to be expected, because proxying is no longer targeted by marking widely-shared data structures. Instead proxying is triggered when a read is rejected because a buffer is full, and so there will be two messages (the read and the buffer bounce) before a proxy read request is sent by the client. It should also be noted that the execution time increases as the number of proxy clusters increases. As the number of nodes acting as proxies goes up, there will be more read requests (from proxies) being sent to the home node, and the read requests are more likely to be bounced, as shown by the buffer bounce ratio for GE in Fig. 6. Finally, the queueing delay is much higher when proxies are in use. This is because without proxies there is a very high level of read messages being bounced (and thus not making it into the input queues). With proxies, the proxy read requests are allowed into the input queues, which increases the mean queue length.
Ocean-Contig. Reactive proxies can degrade the performance of this application (by up to -0.3% at NPC=4), but they achieve performance improvements for more values of NPC than the basic proxy scheme. Unlike basic proxies, reactive proxies reduce the remote read delay by targeting remote read requests that are bounced because of home node congestion. The performance degradation when NPC=1, 3, 4 is attributable to increased barrier delays caused by the redistribution of messages.
Ocean-Non-Contig. This has a high level of remote read requests. These remote read requests result in a high level of buffer bounces, which in turn invoke the reactive proxies protocol. Unfortunately the data is seldom widely-shared, so there is little combining at the proxy nodes, as is illustrated by the low proxy hit rates. With NPC=4, this results in a concentration of messages at a few nodes, overall latency increases, and the execution time suffers. For NPC=5, the queueing delay is reduced in comparison to the no-proxy case, and this has the best execution time. Given these results, we are carrying out further investigations into the hashing schemes suitable for the PN(l, C) function, and the partitioning strategy used to determine PCS(C), to obtain more reliable performance for applications such as Ocean-Non-Contig.
Water-Nsq. Using reactive proxies gives a small speedup compared to no proxies (around 0.2%). However, this is better than with basic proxies, where performance is always worse (in the range -0.5% to -0.7%, see Table 4). The extremely low proxy read ratios show that there is very little proxying, but the high proxy hit rates indicate that when proxy reads are invoked there is a high level of combining. It is encouraging to see that the proxy read ratio is kept low:
this shows that the overheads of proxying (extra messages, cache pollution) are only incurred when they are needed by an application. To summarise, the results show that for reactive proxies, when the number of proxy clusters (NPC) is set to five or eight, the performance of all the benchmarks improves, i.e. they achieve the best balance between combining, queue length distribution, and the length of the proxy pending chains in our simulated system. This is a very encouraging result, because without marking widely-shared data we have obtained sizeable performance improvements for three benchmarks (GE, FFT, and CFD), and had no detrimental effect on the other well-behaved applications. By selecting a suitable NPC for an architecture, the system designers can provide a ccNUMA system with more stable performance. This is in contrast to basic proxies, where although better performance can be obtained for some benchmarks, the strategy relies on judicious marking of widely-shared data for each application.
5
Related Work
A number of measures are available to alleviate the effects of contention for a node, such as improving the node controller service rate [11], and combining in the interconnection network for fetch-and-update operations [4]. Architectures based on clusters of bus-based multiprocessor nodes provide an element of read combining, since caches in the same cluster snoop their shared bus. Caching extra copies of data to speed up retrieval time for remote reads has been explored for hierarchical architectures, including [5]. The proxies approach is different because it does not use a fixed hierarchy: instead it allows requests for copies of successive data lines to be serviced by different proxies. Attempts have been made to identify widely-shared data for combining, including the GLOW extensions to the SCI protocol [8, 7]. GLOW intercepts requests for widely-shared data by providing agents at selected network switch nodes. In their dynamic detection schemes, which avoid the need for programmers to identify widely-shared data, agent detection achieves better results than the combining of [4] by using a sliding window history of recent read requests, but does not improve on the static marking of data. Their best results are with program-counter based prediction (which identifies load instructions that suffer very large miss latency), although this approach has the drawback of requiring customisation of the local node CPUs. In Bianchini and LeBlanc's "eager combining", the programmer identifies specific memory regions for which a small set of server caches are pre-emptively updated [2]. Eager combining uses intermediate nodes which act like proxies for marked pages, i.e. their choice of server node is based on the page address rather than the data block address, so their scheme does not spread the load of messages around the system in the fine-grained way of proxies. In addition, their scheme eagerly updates all proxies whenever a newly-updated value is read, unlike our protocol, where data is allocated in proxies on demand. Our less aggressive scheme reduces cache pollution at the proxies.
6
Conclusions
This paper has presented the reactive proxy technique, discussed the design and implementation of proxying cache coherence protocols, and examined the results of simulating eight benchmark applications. We have shown that proxies benefit some applications immensely, as expected, while other benchmarks with no obvious read contention still showed performance gains under the reactive proxies protocol. There is a tradeoff between the flexibility of reactive proxies and the precision (when used correctly) of basic proxies. However, reactive proxies have the further advantage that a stable value of NPC (the number of proxy clusters) can be established for a given system configuration. This gives us the desired result of improving the performance of some applications, without affecting the performance of well-behaved applications. In addition, with reactive proxies, the application programmer does not have to worry about the architectural implementation of the shared-memory programming model. This is in the spirit of the shared-memory programming paradigm, as opposed to forcing the programmer to restructure algorithms to cater for performance bottlenecks, or marking data structures that are believed to be widely-shared. We are currently doing work based on the Ocean-Non-Contig application to refine our proxy node selection function PN(l, C). In addition, we are continuing our simulation work with different network latency (L) and finite buffer size values. We are also evaluating further variants of the proxy scheme: adaptive proxies, non-caching proxies, and using a separate proxy cache.
Acknowledgements This work was funded by the U.K. Engineering and Physical Sciences Research Council through the CRAMP project GR/J 99117, and a Research Studentship. We would also like to thank Andrew Bennett and Ashley Saulsbury for their work on the ALITE simulator and for porting some of the benchmark programs.
References
1. Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, and Sarah A. M. Talbot. Using proxies to reduce cache controller contention in large shared-memory multiprocessors. In Luc Bougé et al., editors, Euro-Par 96 European Conference on Parallel Architectures, Lyon, volume 1124 of Lecture Notes in Computer Science, pages 445-452. Springer-Verlag, August 1996.
2. Ricardo Bianchini and Thomas J. LeBlanc. Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors. In 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, pages 204-213, October 1994.
3. David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78-85, November 1996.
4. Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer - designing a MIMD shared memory parallel computer. IEEE Transactions on Computers, C-32(2):175-189, February 1983.
5. Seif Haridi and Erik Hagersten. The cache coherence protocol of the Data Diffusion Machine. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE 89 Parallel Architectures and Languages Europe, Eindhoven, volume 365 of Lecture Notes in Computer Science, pages 1-18. Springer-Verlag, June 1989.
6. Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
7. David V. James, Anthony T. Laundrie, Stein Gjessing, and Gurindar S. Sohi. Scalable Coherent Interface. IEEE Computer, 23(6):74-77, June 1990.
8. Stefanos Kaxiras, Stein Gjessing, and James R. Goodman. A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors. In (to appear) 12th ACM International Conference on Supercomputing, Melbourne, July 1998.
9. Jeffrey Kuskin. The FLASH Multiprocessor: designing a flexible and scalable system. PhD thesis, Computer Systems Laboratory, Stanford University, November 1997. Also available as technical report CSL-TR-97-744.
10. Tom Lovett and Russell Clapp. STiNG: a CC-NUMA computer system for the commercial marketplace. 23rd Annual International Symposium on Computer Architecture, Philadelphia, in Computer Architecture News, 24(2):308-317, May 1996.
11. Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. 24th Annual International Symposium on Computer Architecture, Denver, in Computer Architecture News, 25(2):219-228, June 1997.
12. Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the International Conference on Parallel Processing, Vol. 1, pages 1-10, August 1995.
13. B. A. Tanyi. Iterative Solution of the Incompressible Navier-Stokes Equations on a Distributed Memory Parallel Computer. PhD thesis, University of Manchester Institute of Science and Technology, 1993.
14. Manu Thapar and Bruce Delagi. Stanford distributed-directory protocol. IEEE Computer, 23(6):78-80, June 1990.
15. Leslie G. Valiant. Optimality of a two-phase strategy for routing in interconnection networks. IEEE Transactions on Computers, C-32(8):861-863, August 1983.
16. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. Proceedings of the 22nd Annual International Symposium on Computer Architecture, in Computer Architecture News, 23(2):24-36, June 1995.
Handling Multiple Faults in Wormhole Mesh Networks *
Tor Skeie
Department of Informatics, University of Oslo, Box 1080 Blindern, N-0316 Oslo, Norway.
E-mail: torsk@ifi.uio.no
Abstract. We present a fault tolerant method tailored for n-dimensional mesh networks that is able to handle multiple faults, even for two-dimensional meshes. The method does not require the existence of virtual channels. The traditional way of achieving fault tolerance, based on adaptivity and adding virtual channels as the main mechanisms, has not shown the ability to handle multiple faults in wormhole mesh networks. In this paper we propose another strategy to provide a high degree of fault tolerance: we describe a technique which alters the routing function on the fly. The alteration action is always taken locally and distributed to a limited number of non-neighbor nodes.
1
Introduction
In recent years we have seen an increasing interest in multicomputers both in industry and academia. As the number of components in such computers grows, the probability of failing components increases, and therefore the ability to adapt to errors becomes a significant issue in this field. In this paper we study wormhole routed interconnect networks with respect to fault tolerance. The wormhole switching technique that was described by Dally and Seitz [5] is an improvement of the virtual cut-through technique of Kermani and Kleinrock [15]. The concept of wormhole routing has been taken up in many interconnection networks, in particular in the domain of multiprocessor interconnection, see e.g. [18, 20, 21]. A survey of wormhole routing techniques can be found in [19]. Fault tolerance in a communication network is defined as the ability of the network to effectively utilize its redundancy in the presence of faulty links. A failing link may destroy one path in use between two communicating nodes, but given that the failing link has not split the network into two unconnected parts, there will still be other paths over which the two ends may communicate. Much work on fault tolerance in wormhole routed networks has been reported. Linder and Harden described an adaptive routing algorithm for k-ary n-cubes which requires up to 2^n virtual channels per physical channel [16]. Boppana and Chalasani present an algorithm to handle block faults in meshes where faulty links and nodes form rectangular regions [1]. The method requires 2 virtual
* This work is supported by Esprit under the OMI-ARCHES
channels per physical channel, and is able to handle any number of faults as long as the fault regions do not overlap. The technique requires that a fault-free node knows the status of its own links and its neighbors' links. In [2] they generalize this method to also handle non-convex fault regions, requiring 4 virtual channels per physical channel in the non-overlapping case. The method requires fault-region shape-finding messages, and may in many cases mark more links as faulty than those that have actually failed. Several other methods rely on the virtual channel concept as well [3, 4, 6, 8, 9, 11]. However, other approaches have been proposed. Glass and Ni have described a fault tolerant routing algorithm for meshes which does not require virtual channels [13]. Their method is based on the turn model for generating adaptive routing algorithms [10, 12] and needs extra control lines to agree upon allowed turns. Lysne, Skeie and Waadeland [17] also describe fault tolerant routing without using virtual channels. Basically, their methods rely on alteration of the routing function. Unfortunately, both these methods tolerate very few faults for low-dimension meshes. In this paper we present a method for fault tolerance which is different from the above mentioned work in that it both can handle a significant number of faults and does not need virtual channels. The technique relies on a limited amount of non-local status information; the action is taken locally and distributed to a limited number of nodes. The remainder of the paper is organized as follows. Section 2 contains some basic notation. In section 3 we describe the fault tolerant routing method and finally in section 4 we conclude.
2
Preliminaries
The definitions used in this paper adhere to standard notation and definitions in wormhole routing [17]. The following theorem is due to Dally and Seitz [5] and Duato [7]. It will be used extensively within this paper.
Theorem 1. A wormhole network is free from deadlocks if its channel dependency graph is acyclic.
We shall in this paper consider bidirectional n-dimensional mesh networks, also called (k, n)-mesh networks, where k is the radix, n is the number of dimensions and N = k^n is the number of nodes. A (k, n)-mesh is a (k, n)-torus without connecting wraparound links. To simplify presentation of the fault tolerant method, we will describe it in detail for the two-dimensional case and afterwards present the generalization to three and higher dimensions. The four directions of a two-dimensional mesh are labeled north, south, east and west. Furthermore, the channels of a horizontal (vertical) link l are denoted l_east/l_west (l_south/l_north), respectively.
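Theorem 1 turns deadlock freedom into a purely structural property: build the channel dependency graph induced by the routing function and check that it is acyclic. A small sketch of such a check is given below; it uses standard depth-first cycle detection, and the graph representation is an assumption.

```python
# Sketch: check that a channel dependency graph is acyclic (Theorem 1).
# 'deps' maps each channel to the set of channels the routing function may
# use next after it; the representation is an illustrative assumption.

def is_deadlock_free(deps):
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in deps}

    def has_cycle(c):
        colour[c] = GREY
        for nxt in deps.get(c, ()):
            if colour.get(nxt, WHITE) == GREY:
                return True                 # back edge: a cyclic dependency
            if colour.get(nxt, WHITE) == WHITE and has_cycle(nxt):
                return True
        colour[c] = BLACK
        return False

    return not any(colour[c] == WHITE and has_cycle(c) for c in deps)
```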
2.1
The Fault Model
In this paper we will concentrate on link faults, and assume that if a channel fails then the entire link fails in both directions. The link failures are considered to be "hard" in the sense that faulty links will be down for quite a while (hours or even days). Furthermore we assume that each node, in addition to knowing the status of its own links, also has status information regarding the east and north links of some of the nodes in the row (column) that it is positioned in. For a more thorough definition we refer to section 3.2. To provide nodes with the link status information stated above we assume that the network has a control line structure where routers can send/receive specific control messages. The IEEE 1355 [14] STC104 [23] from SGS-Thomson is a router with such a control line structure.
3
A Distributed Fault Tolerant System for Positive-First Routing
In this section we present an algorithm that handles multiple faults for positive-first routing (PF) without using virtual channels. PF is a variant of the turn model [12], which is a method for designing adaptive routing algorithms. The idea in the turn model is to start with a system where every packet can take all possible paths in every router. Then one prohibits just enough turns in order to break cycles in the channel dependency graph. Positive-first routing is defined by the turns shown in figure 1. We make the additional restriction that packets should be routed minimally in a fault-free network, even if there are detours that do not introduce illegal turns.
Fig. 1. The six allowed turns (solid arrows) for positive first routing in the fault free case.
From the above we observe that there is adaptivity in the south-west and north-east directions, because in each of these directions one is allowed to make use of two turns. Routing in the south-east and north-west directions is, however, deterministic. The name positive-first comes from the hindrance of making turns from negative to positive directions.
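A routing function obeying these restrictions can be written directly from the turn rules: offer every minimal direction, but never complete a turn from a negative direction (west, south) into a positive one (east, north). The sketch below illustrates the rule and is not the routers' actual implementation.

```python
# Sketch of minimal positive-first routing in a 2D mesh: given the direction a
# packet was travelling and the offset to its destination, return the output
# directions it may take.  Turns from negative (west, south) to positive
# (east, north) directions are the two prohibited turns.

NEGATIVE = {"west", "south"}
POSITIVE = {"east", "north"}

def pf_outputs(dx, dy, came_from=None):
    """dx > 0 means the destination lies to the east, dy > 0 to the north."""
    wanted = set()
    if dx > 0: wanted.add("east")
    if dx < 0: wanted.add("west")
    if dy > 0: wanted.add("north")
    if dy < 0: wanted.add("south")
    if came_from in NEGATIVE:
        # A packet already travelling west or south may not turn east or north.
        wanted -= POSITIVE
    return wanted
```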
The fault tolerant system is based on altering the routing function in some nodes, so that new paths are defined that avoid the faulty links. For most of the failure situations it will be nodes in two rows and/or columns that need a reprogrammed routing function; however, in some specific situations a cascading update might be necessary (see section 3.3). It will always be the node where one or both of the east and north links recently failed that initiates the necessary updating step to be taken. This means that the updating action to be provided is locally determined. Such a node will be called n_master throughout the remainder of the paper. Alteration messages are then distributed via the control line structure to nodes that need their routing function updated; hence the distributed view of the fault tolerant system. On the other hand, the node whose west and/or south links fail stays passive concerning alteration of the routing function until reception of an updating message. Such a node we denote n_slave. We assume the nodes to have some amount of programmable logic so they can implement the alteration algorithm to respond to the failure situation, and furthermore that nodes can reprogram their routing function on the fly upon reception of updating messages. Below we shall first present the necessary alteration of the routing function for different link failure situations in order to give all packets affected by a faulty link new paths. Next we show that the updated routing function will be deadlock free. Depending on the failure situation we divide the description in two parts:
1. The situation when only one of the east and north links of a node is faulty.
2. The more complex situation where both these links are failing.
In the first situation our methods do not use non-local link status information in order to perform the routing function alteration, while for the second case this becomes necessary. Recall that we denote the node with its east and/or north links failing n_master, and the nodes that receive an updating message from n_master are called n_slave. Furthermore, the east and north links of n_master are designated l_e and l_n, respectively. For clarity, we shall in the methods proposed below assume that rerouting channels (links) are not faulty at the moment the updating action is initialized; in section 3.3 we also handle faulty rerouting channels.
3.1
The East or the North Link of a Node Fails
For this case we show that we can recover without introducing the two prohibited turns (figure 1).
The East Link of a Node Fails. The following method is used to alter the routing function:
Alteration 1: The packets that have paths through l_e_east are rerouted northwards at n_master (Alter 1 in figure 2(a)).
Alteration 2: The packets that have paths through l_e_west that arrive (or originate) at the nodes in the row east of l_e are rerouted northwards as well (Alter 2 in figure 2(a)).
Alteration 3: The packets that have paths through l_e_west that arrive at the nodes in the row above alteration 2 are rerouted westwards (Alter 3 in figure 2(a)).
Alterations 2 and 3 are initialized by n_master distributing an update message to the two corresponding sets of nodes. By altering the routing function of the nodes east of n_slave in the same row, we ensure that packets with original paths through l_e_west are not sent towards n_slave (Alter 2). If they were sent towards n_slave they would be forced to take the prohibited west-north turn. Notice also that packets going in the adaptive south-west direction, which can possibly use l_e_west, must for the same reason be stopped from going further south at the row right above the failing link (Alter 3).
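In routing-table terms, the three alterations amount to overriding a handful of entries in two rows of the mesh. The sketch below is illustrative only: the coordinate convention (x grows eastwards, y grows northwards) and the per-node override table are assumptions.

```python
# Hedged sketch of the alteration for a failing east link l_e of n_master at
# (x, y); 'route' is an assumed dict of per-node override entries.

def alter_for_east_link_fault(route, x, y, mesh_width):
    # Alteration 1: at n_master, packets that would have crossed l_e eastwards
    # are sent north instead.
    route.setdefault((x, y), {})["via_l_e_east"] = "north"

    for xi in range(x + 1, mesh_width):
        # Alteration 2: in the row east of l_e, westbound packets that would
        # have crossed l_e are turned north early, so no prohibited
        # west-to-north turn is ever needed at n_slave.
        route.setdefault((xi, y), {})["via_l_e_west"] = "north"
        # Alteration 3: in the row just above, adaptive south-west packets that
        # could have used l_e keep going west instead of descending.
        route.setdefault((xi, y + 1), {})["via_l_e_west"] = "west"
```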
Fig. 2. Identifying the nodes (visualized as darkened boxes) that will get their routing function altered according to our method. The thick arrows indicate the redirections of the affected packets through the failing link. (a) The east link of n_master fails. (b) The north link of n_master fails.
The North Link of a Node Fails. Handling the north link failure situation of a node is symmetric to the east link one. Due to this symmetry we here only present the alteration of the routing function in a figurative manner (figure 2(b)).
Lemma 1. If alteration of the routing function is done according to our methods when the east or the north link of a node fails, the new routing function is connected and deadlock free.
Proof. From the described alteration methods it is obvious that the new routing function is connected, since nodes in the vicinity of the failing link are updated with respect to the affected packets in both channel directions. It also follows trivially from the turn model that the new routing function is deadlock free since none of the prohibited turns are used.
3.2
Both the East and North Links of a Node Fail
In this more complex failure situation the link status information has to be consulted in order for n_master to decide the proper alteration action. In addition it becomes necessary to introduce the two illegal turns, the south-east and west-north turn, respectively.
Fig. 3. (a) A failure situation where the two node sets Na and Nb cannot communicate without introducing the prohibited double turn, i.e. the south-east and west-north dependencies, respectively. (b) Alteration of the routing function when both east and north links of n_master fail. The thick arrows indicate redirections of affected packets through the failing links (channels).
Let us first give some motivation for the algorithm proposed below; consider the failure situation shown in figure 3(a). The east and north connecting links (marked l_e and l_n, respectively) of node n_master have recently failed; in addition we assume that before this situation arises the links l_1 and l_2 have failed. Notice that since both l_e and l_n are now failing, packets originating in the node set Na destined for the node set Nb, and vice versa, are not able to reach their destinations without introducing the prohibited double turn, the south-east and west-north dependencies, respectively. Thus the key issue is to introduce the prohibited double turn in such a way that the two node sets can communicate with each other, and furthermore such that no cycles in the channel dependency graph are created as a consequence of the introduction of prohibited turns. We shall in the following introduce the prohibited double turn associated with the operating east and north links of a node n_i, either positioned west of n_master in the same
row or positioned south of n_master in the same column. We shall show that this reestablishes connectedness and preserves freedom from deadlocks. If there are several candidate nodes n_i in the same row (column) we pick the east (north) most of these; furthermore we denote this specific node n_ipt (Introduction of Prohibited Turns). Hence, for n_master to be able to decide which node n_i is able to serve as n_ipt, it must have knowledge about the status of the east and north links of all nodes that are in the same row (column) and in addition are positioned west (south) of it. This defines the non-local information required for our method. Below we define the necessary alteration for this specific situation. We present it for the case when n_ipt is located west of n_master. The case when a node south of n_master is selected as n_ipt is symmetric.
Alteration 1: The packets that have paths through l_e_west that arrive at the nodes in the row east of l_e are rerouted northwards. Furthermore, the packets that have paths through l_e_west that arrive at the nodes in the row above the previously specified one are rerouted eastwards (Alter 1 in figure 3(b)).
Alteration 2: The rerouting proposed here consists of new paths for those packets via l_n_south that can reach their destination without introducing the prohibited south-east turn. They could have been rerouted through n_ipt as well; however, we prefer eastwards rerouting in order to spread the traffic. If there are any nodes south of n_master in the same column that have their east link operating, let n_s be the upmost node of these candidates. Furthermore, let N_wpt (reachable Without Prohibited Turn) be the node set defined by the nodes located at the same vertical position as n_s or south of n_s, at the same horizontal position as n_s or west of n_s, and in addition positioned east of n_ipt (figure 3(b)). The packets having paths through l_n_south that arrive at the nodes in the column north of n_master and are destined for the node set N_wpt are rerouted eastwards. Furthermore, the packets having paths through l_n_south that arrive at the nodes in the column east of the previously specified one and are destined for the node set N_wpt are rerouted southwards (Alter 2 in figure 3(b)).
Alteration 3: This alteration redirects those packets that have their detour through n_ipt. The packets having paths through l_e_west and l_n_south that arrive at nodes in the part of the row between n_ipt and n_master (inclusive) are rerouted westwards. In addition, the packets having paths through l_e_west and l_n_south that arrive at nodes in the row above the previously defined one are rerouted westwards as well. Furthermore, the packets with paths through l_e_east and l_n_north are rerouted northwards at n_ipt, and the packets with paths through l_e_west and l_n_south are rerouted southwards in the node north of n_ipt (Alter 3 in figure 3(b)).
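The only non-local decision n_master has to make in this case is which node can host the prohibited double turn. A sketch of that choice is given below, assuming a hypothetical link_ok() predicate for the row/column status information n_master is required to hold; the additional requirement that the candidate has an operating path back to n_master is omitted for brevity, and the preference for a row candidate over a column candidate is an assumption.

```python
# Sketch: choose n_ipt for n_master at (x, y) when both its east and north
# links have failed.  link_ok(node, direction) is assumed status information
# about the east/north links of nodes in n_master's row and column.

def choose_n_ipt(x, y, link_ok):
    # Prefer the east-most candidate west of n_master in its row
    # (row-versus-column preference is an assumption of this sketch).
    for xi in range(x - 1, -1, -1):
        if link_ok((xi, y), "east") and link_ok((xi, y), "north"):
            return ("row", (xi, y))
    # Otherwise the north-most candidate south of n_master in its column.
    for yi in range(y - 1, -1, -1):
        if link_ok((x, yi), "east") and link_ok((x, yi), "north"):
            return ("column", (x, yi))
    return None  # no candidate: one of the unhandled situations described in the text
```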
above for the situation when both east and north links of a node fail, the new routing function is connected and deadlock free. Proof. The proof is divided in two: 1. C o n n e c t e d : It follows trivially from the description of the methods t h a t the new routing function is connected. 2. D e a d l o c k f r e e : Assume that a prohibited double-turn DT/ is introduced as
1083 a consequence of east and north link failures at n,~a~t~,-i. We first prove that DT/ alone cannot form a cycle in the updated channel dependency graph. Secondly we prove t h a t DTi cannot be involved in a cycle together with other prohibited double-turns either. We assume that D T / i s introduced west of nmaster,, the case when DT/ is south of nma~t~, is similar. (i) If DT/ is involved in a cycle the cyclic dependencies must go through the channels connected to n , ~ t ~ . Since both channel directions of nmasteri 'S east and north links are failing, a potential cycle must come via the south link of nma,t~r,. This means that another prohibited south-east or west-north turn must be involved to form a cycle in one of the directions. Therefore, DT/ cannot form a cycle if is the only prohibited turn introduced. (ii) Now assume that there exists another prohibited double-turn DTj south of nmaster,. We know that DTj is introduced as a consequence of the failure of b o t h the east and north links of nrnasterj and consider DTj to be located west of nm~,t~rj. From point (i) above the cyclic dependencies must come through the south link of nm~,t~rj, thus we need yet another prohibited double-turn to form a cycle, and so forth. The case when DTj is introduced south of nm~,t~j is treated analogously. This means that no cycle can be formed even by two or more prohibited double-turns when they are introduced according to our method. Our system is able to handle a significant number of link failure situations. The situations that the proposed method cannot handle are those where both east and north links of nmaster fail simultaneously, and in addition one of the following two conditions apply: 1. There does not exist any node neither west of n,~ast~ in the same row, nor south of n,~a,t~r in the same column that have both its east and north links operating. 2. Any candidate node ni west (south) of node n m a s t e r in the same row (column) with both its east and north links operating, does not have a horizontal (vertical) p a t h to n m a s t e r .
3.3
R e r o u t i n g Channel is Faulty
In the above presentation of alteration algorithms we assumed that the specified rerouting channels (links) were operating. In this section we identify what impact faulty rerouting channels will have on alteration of the routing function. When rerouting is requested along earlier failed channels we know there already exist rerouting paths for these faults, thus we only need to reupdate the routing function along these redirections to also include affected packets associated with the fault currently handled. Moreover, the additional nodes to be visited are exactly defined by the alteration methods specified in sections 3.1 and 3.2. To demonstrate this we give the three possible situations for requesting rerouting along a faulty channel, and refer them to the previous defined alterations: (i) R e q u e s t i n g r e r o u t i n g a l o n g a f a u l t y e a s t o r n o r t h c h a n n e l w h e n o n l y o n e o f t h e s e h a s f a i l e d : According to Alteration I specified in section
1084 3.1, reupdate the routing function in this node (which must be a rtmastero,n) b y letting affected packets from the original fault keep their rerouting northwards if it is the east channel that is failing. With respect to rerouting requested for the north channel direction, let affected packets keep their rerouting eastwards (refer to Alter 1 in figure 2(b)). (ii) R e q u e s t i n g r e r o u t i n g a l o n g a e a s t o r n o r t h c h a n n e l w h e n b o t h o f t h e s e h a v e failed: From Alteration 3 in section 3.2, if niptotd is located west of nmasterotd, reupdate the routing function in nodes between niptotd and nmastCrot, (inclusive) to also include affected packets from the current fault (figure 3(b)). There will be a symmetric action for the case when niptotd is located south of nrnasterotd
9
(iii) R e q u e s t i n g r e r o u t i n g a l o n g a f a u l t y w e s t o r s o u t h c h a n n e l : Recall from Alteration 2, 3 (in section 3.1) and figure 2 (a), that if rerouting is requested for a failing west channel direction the nodes in the row west of this channel and nodes in the row above the previous one need to be updated with respect to the affected packets from current fault. Symmetric updating is necessary when rerouting is requested for a failing south channel (refer to figure 2(b)). The three situations defined above require some additional updating to be initiated. Now, perhaps we in this reupdating see yet another faulty channel. Therefore, the natural question to ask is: will this repeated alteration process terminate successfully? L e m m a 3. The alteration process initialized as a consequence of a broken reroutin 9 channel will terminate successfully if it is done as specified above. Proof. Since there are a limited number of earlier faults to visit, it means that the only possible way to achieve a non terminating alteration process must be as a consequence of a "loop". Let us denote the failing link starting the updating for lcurr~,~t. The rest of the proof is divided in three: (i) First, when rerouting is requested along a faulty east or north channel, the additional updating is only done locally at the source node of this channel. It is therefore obvious that this situation cannot contribute to a "loop". (ii) Second, the only possibility to construct a reupdating "loop" (revisiting lcurre~t) is when the following apply: 1. The node nma,terom and lcurrent a r e located in the same row (column) and 2. lcurrent is between nmastero,~ and niptota. However, the last requirement could not entail correctness since it will mean that the path between nmasterol d and nipto~d does not exist, which contradicts the assumption of the existence of such a path between these two nodes. (iii) Third, if rerouting is requested for a broken west channel additional rerouting is only needed for channels located east and north of this one, which is even further east and north of Icurrent. Hence, Icu~re~t cannot be revisited. T h e same reasoning can be used for a broken south channel. Therefore, we can conclude that a repeated alteration process will converge.
3.4
Three- and Higher-Dimensional Meshes
This section illustrates how the positive-first fault tolerant method can be extended to three- and higher-dimensional meshes. The modification follows naturally from the method presented above. In figure 4 we define the prohibited turns
Fig. 4. The prohibited double-turns for fault free routing in three-dimensional meshes. The referred directions denote the following: U - Up, D - Down, N - North, S - South, E - East and W - West, respectively.
(shown as double-turns) for fault free routing in three-dimensional meshes. We have prohibited just enough turns to avoid cycles in the channel dependency graphs, thus it follows from [12] that the routing function is deadlock free. In the presence of faults we have two main routing function alteration actions, depending on the failure situation, in the three-dimensional case as well:
1. The situations where at least one of the east, north and up links of a node is operating. Affected packets of a newly failing link are redirected through one that is still operating. This means that no prohibited turns need to be introduced. This is analogous to the situations handled in section 3.1.
2. The more complex situation when the east, north and up links are faulty simultaneously. This means that affected packets have to be redirected via a node (nipt) either down, south or west of nmaster. In this case it becomes necessary to introduce the prohibited double-turns, which corresponds to the situation defined in section 3.2.
Note that in the three-dimensional case each link participates in two planes; for example, an east link is visible in both the east-north and east-up planes. Hence, nodes in the two planes that a failing link is located in must have their routing function altered. For each plane, nodes in two rows/columns need to be updated, as in the two-dimensional case. Regarding point 2 above, the additional nodes along the detour through nipt (the node where the prohibited turns are introduced) also need to be updated; refer to Alteration 3 in section 3.2.
Lemma 4. If alteration of the routing function is performed as stated above in the presence of faulty links in three-dimensional meshes, the new routing function is connected and deadlock free.
Proof. The proof can be elaborated along the same lines as the proofs in sections 3.1 and 3.2. Regarding the deadlock issue for the situation when all the east, north and up links of a node are faulty simultaneously, the creation of any cycle requires several prohibited turns. Since, for each of these situations, we associate the prohibited turns with only one single node (nipt), this cannot form cyclic dependencies. Generalization to n dimensions follows the same pattern. For each new dimension one prohibits just enough turns to avoid cycles in the channel dependency graph in the fault free case. Then two main routing function alteration processes are defined along the same lines as described above.
3.5
Simulation Experiments
In order to evaluate the performance of the fault-tolerant system we have done a series of experiments. We have simulated a 16 x 16 mesh in the presence of 1%, 3% and 5% faulty links. The results show graceful degradation in performance, especially for non-uniform traffic and a low number of faults, where the degradation is hardly noticeable. The results are not presented here due to space constraints, but can be found in [22].
4
Conclusion
We have presented a method for achieving fault tolerance in wormhole routed (k, n)-mesh networks. Unlike other techniques, we do not require the existence of virtual channels, and we tolerate multiple faulty links, even for two-dimensional meshes. The proposed concept relies on the network being able to alter the routing function on the fly. The initiation of an alteration is always taken locally and distributed to a small number of non-neighbor nodes via control lines. The IEEE 1355 STC104 [23] router is an example of such a router with a control line structure. For most of the failure situations, the needed alteration of the routing function is fixed and simple. In order to handle more complex situations, such as when both the east and north links of a node (n) fail, we assume knowledge of link status information from some of the other nodes in the same row/column as node n. This non-local information is used by the local master node responding to a failure. The link status information is gathered via the control lines. The implementation cost of our method is basically that it requires each node to have an amount of programmable logic, making it possible to implement the proposed alteration algorithms. Secondly, we assume the network to have a control line structure where nodes can send/receive specific control messages. Extra logic in the nodes and some form of control lines are, however, usual mechanisms for achieving fault tolerance [2, 13]. Our investigations indicate that it is relatively easy to incorporate node faults into our method, where such a fault would be modelled as the failure of all of the node's links. Another interesting direction for further work is to investigate whether our technique can also be extended to handle (k, n)-tori networks.
Acknowledgments. I would like to thank Associate Professor Olav Lysne and Associate Professor Øystein Gran Larsen for their valuable comments and assistance in preparing this paper.
References
1. R. V. Boppana and S. Chalasani. Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Transactions on Computers, 44(7):848-864, 1995.
2. S. Chalasani and R. V. Boppana. Communication in multicomputers with nonconvex faults. IEEE Transactions on Computers, 46(5):616-622, 1997.
3. A. A. Chien and J. H. Kim. Planar-adaptive routing: Low-cost adaptive networks for multiprocessors. Journal of the Association for Computing Machinery, 42(1):91-123, 1995.
4. W. J. Dally and H. Aoki. Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466-475, 1993.
5. W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, C-36(5):547-553, 1987.
6. B. V. Dao, J. Duato, and S. Yalamanchili. Configurable flow control mechanisms for fault-tolerant routing. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 220-229. ACM Press, 1995.
7. J. Duato. A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. Int. Conf. on Parallel Processing, I:142-149, Aug. 1994.
8. J. Duato. A theory to increase the effective redundancy in wormhole networks. Parallel Processing Letters, 4:125-138, 1994.
9. P. T. Gaughan and S. Yalamanchili. A family of fault-tolerant routing protocols for direct multiprocessor networks. IEEE Transactions on Parallel and Distributed Systems, 6(5):482-497, 1995.
10. C. J. Glass and L. M. Ni. The turn model for adaptive routing. In Proceedings of the 19th International Symposium on Computer Architecture, pages 278-287. IEEE CS Press, California, 1992.
11. C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes. In Twenty-Third Annual Int. Symp. on Fault-Tolerant Computing, pages 240-249, 1993.
12. C. J. Glass and L. M. Ni. The turn model for adaptive routing. Journal of the Association for Computing Machinery, 41(5):874-902, 1994.
13. C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes without virtual channels. IEEE Transactions on Parallel and Distributed Systems, 7(6):620-636, June 1996.
14. IEEE 1355-1995. IEEE standard for Heterogeneous InterConnect (HIC) (Low cost, low latency scalable serial interconnect for parallel system construction), 1995.
15. P. Kermani and L. Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3:267-286, 1979.
16. D. H. Linder and J. C. Harden. An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes. IEEE Transactions on Computers, 40(1):2-12, 1991.
17. O. Lysne, T. Skeie and T. Waadeland. One-fault tolerance and beyond in wormhole routed meshes. Microprocessors and Microsystems, Elsevier, 21(7-8):471-481, 1998.
18. M. D. May, P. W. Thompson, and P. H. Welch, editors. Networks, routers and transputers: function, performance and application. IOS Press, 1993.
19. L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26:62-76, 1993.
20. Paragon XP/S product overview. Intel Corp., Supercomputer Systems Div., 1991.
21. C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su. The architecture and programming of the Ametek Series 2010 multicomputer. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena (California), volume I, pages 33-36, 1988.
22. T. Skeie. Topics in Interconnect Networking. PhD Thesis, ISBN 82-7368-190-4, Dept. of Informatics, University of Oslo, 1998.
23. P. W. Thompson and J. D. Lewis. The STC104 asynchronous packet switch. VLSI Design, 2(4):305-314, 1995.
Shared Control - Supporting Control Parallelism Using a SIMD-like Architecture
Nael B. Abu-Ghazaleh (1) and Philip A. Wilsey (2)
(1) Computer Science Dept., State University of New York, Binghamton, NY 13902-6000
(2) Department of ECECS, PO Box 210030, University of Cincinnati, Cincinnati, Ohio 45221-0030
Abstract. SIMD machines are considered special purpose architectures chiefly because of their inability to support control parallelism. This restriction exists because there is a single control unit that is shared at the thread level; concurrent control threads must time-share the control unit. We present an alternative model for building centralized control architectures that better supports control parallelism. This model, called shared control, shares the control unit(s) at the instruction level: in each cycle the control signals for the supported instructions are broadcast to the PEs. In turn, a PE receives its control by synchronizing with the control unit responsible for its current instruction. There are a number of architectural issues that must be resolved. This paper identifies some of these issues and suggests solutions to them. An integrated shared-control/SIMD architecture design (SharC) is presented and used to demonstrate the performance relative to a SIMD architecture.
1
Introduction
Parallel architectures are classified according to their control organization as Multiple Instruction streams Multiple Data streams (MIMD), or Single Instruction stream Multiple Data streams (SIMD) machines. MIMD machines have a distributed control organization: each Processing Element (PE) has a control unit and is able to sequence a control thread (program segment) locally. Conversely, SIMD machines have a centralized control organization: the PEs share one control unit. A single thread executes on the control unit, broadcasting instructions to the PEs for execution. Because the control is shared and the operation is synchronous, SIMD PEs are small and inexpensive. In a centralized control organization (e.g., SIMD [11], [14] and MSIMD [5], [18] machines), an arbitrary number of PEs share a fixed number of control units. Traditionally, sharing of control has been implemented at the thread level; the PEs following the same thread concurrently share a control unit. The presence of application-level control-parallelism causes the performance of this model to drop (proportionately to the degree of control parallelism). This drop occurs
because the control units are time-shared among the threads, with only the set of PEs requiring the currently executing control thread actively engaged in computation. General parallel applications contain control parallelism [3], [7] and, therefore, perform poorly on SIMD machines. Accordingly, SIMD machines have fallen out of favor as a platform for general-purpose parallel processing [10], [15]. This paper presents Shared Control: a model for constructing centralized control architectures that better supports control-parallelism. Under shared control, the control units are shared at the instruction (or atomic function) level. Each PE is assigned a local program, and PEs executing the same instruction, but not necessarily the same thread, receive their control from the same control unit. A control unit is assigned to each instruction, or group of similar instructions, in the instruction set and, thus, repeatedly broadcasts the microinstruction sequences to implement that instruction to the PEs. Each PE receives its control by synchronizing with the control unit corresponding to its current instruction. Thus, all the PEs are able to advance their computation concurrently, regardless of the degree of control parallelism present in the application. The similarity of the hardware to the SIMD model allows the SIMD mode to be supported at little additional cost. With the ability to support control-parallelism efficiently, the major drawback of the SIMD model is overcome. Shared control is a unique architectural paradigm; the classic association between control units and threads, present in all Von-Neumann based architectures, does not exist in this model. Therefore, it introduces several architectural issues that are unique to it. This paper identifies some of these issues and discusses solutions to them. The feasibility of the solutions is demonstrated using a SIMD/shared-control architecture design, SharC. Using a detailed simulator of SharC, the performance of the model is studied for some irregular problems. The remainder of this paper is organized as follows. Section 2 introduces the shared control model. Section 3 presents some architectural issues relating to a general shared control implementation. Section 4 presents a case study of a shared-control architecture. In Section 5, the performance of the architecture is studied. Finally, Section 6 presents some concluding remarks.
2
Shared Control
A shared control architecture is a centralized control architecture where the control units are shared at the operation level. PEs executing the same operation, but not necessarily the same thread, may share the use of the same control unit. An overview of a shared control machine is shown in Figure 1. The control program (microprogram) implementing the instruction set for the shared control mode is partitioned across a number of tightly coupled control units. This partitioning is static; it is carried out at architecture design time. Each control unit repeatedly broadcasts the microprogram sequence assigned to it to the PEs. A PE receives its control from the control unit associated with its current instruction. The PE synchronizes with the control unit by selecting the set
Fig. 1. Overview of a Shared Control Organization
Fig. 2. Implementation of Several Instructions on a Single Control Unit (shaded circles represent setting active sets)
of control signals broadcast by that control unit. PEs are able to advance their computation concurrently; thus, MIMD execution is supported. We first consider the problem of implementing and optimizing shared control using a single control unit; this is a special case that is needed for the general solution. Figure 2 shows the single control unit implementation. The control unit must supply control for all the instructions in every cycle. More precisely, the control unit sequentially issues all the paths through the microprogram, and the PEs conditionally participate in the path corresponding to their current instruction. In the remainder of this section, some of the architectural issues involved in constructing a single-control unit shared control multiprocessor are discussed. Managing the Activity Status: Before every instruction execution stage, the activity bits for the PEs that are interested in this stage must be set (represented by the shaded circles in Figure 2). On SIMD machines setting the activity status before a conditional region requires the following operations on the PEs: (i) save current active set, (ii) evaluate the condition, and (iii) set active bit if condition is true. In addition, at the end of the conditional region the active set saved in
Fig. 3. Activity Set Management
step (i) is restored. The active bit can be tracked by saving the bit directly to an activity bit stack [13], or by using an activity counter that is incremented to indicate a deeper context [12]. Both schemes have a significant overhead (register space, as well as several execution steps). Implementing activity management using the SIMD model adds unacceptable overhead to the shared control cycle time, since it is required several times per cycle in shared control. The condition for the activity of a PE for an instruction stage (called the immediate activity) is contained in the opcode field in the instruction register, allowing the following optimization to be made. The opcode field is decoded into a k-bit vector (as shown in Figure 3). At the beginning of every instruction segment, the bit vector is shifted into the immediate activity bit. Thus, instruction i is decoded into a bit vector consisting of 1 in the i-th position and 0 elsewhere, mirroring the order of the execution regions. Only PEs with a high immediate activity bit participate in the current instruction segment. The register shift can be performed concurrently with the execution of each region at no additional cost; the activity management cost is eliminated. A compositional instruction set: Examining Figure 2, it can be observed that each PE receives the control for all k instructions, but uses only one. Compositional instruction sets relax this restriction by allowing the PEs to use any number of the k instruction segments in each cycle [4]. The output of each execution region is deposited in a temporary register that serves as an input to the next one. The output of the last stage is stored back to the final destination. Thus, an instruction is represented by the subset of the instruction segments that compose it. Composition can be easily incorporated into the activity management scheme in Figure 3.
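The sketch below illustrates the decode-and-shift idea in software. It is purely illustrative (the value of K, the function names and the example opcodes are our assumptions, not the SharC microcode): the opcode is decoded into a one-hot k-bit vector whose bit order mirrors the execution regions, so the bit examined in each region is the immediate activity bit for that region.

    # Illustrative model of the activity-management optimization: one-hot
    # decode of the opcode, one bit consumed per execution region.
    K = 4  # number of instruction segments served by the control unit (assumed)

    def decode(opcode):
        """One-hot decode: bit i is set iff the PE's instruction is segment i."""
        return [1 if i == opcode else 0 for i in range(K)]

    def fundamental_cycle(pe_opcodes, execute):
        """One sweep of the control unit over all K execution regions.
        pe_opcodes: current opcode of each PE; execute(pe, region): run region on pe."""
        vectors = [decode(op) for op in pe_opcodes]
        for region in range(K):
            for pe, vec in enumerate(vectors):
                immediate_activity = vec[region]   # bit shifted in for this region
                if immediate_activity:
                    execute(pe, region)

    # Example: PEs 0..3 currently hold instructions 2, 0, 3 and 2.
    fundamental_cycle([2, 0, 3, 2],
                      lambda pe, r: print(f"PE {pe} runs segment {r}"))

Composition corresponds to allowing more than one bit of the decoded vector to be set, so a PE participates in several consecutive regions of the same sweep.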
3
General Shared Control
A general implementation of the shared control model uses multiple control units to implement the microprogram for the instruction set. There are a number of architecture issues that are introduced by this model (in addition to the ones
present in the single control unit implementation). The discussion in this section will cover the range of possible solutions to each issue, rather than consider specific solutions in detail.
Control I/O pins: At first glance, this is the primary concern regarding the feasibility of shared control; because the model uses multiple control units, the width of the broadcast bus and the number of required pins at the PEs may become prohibitive. Fortunately, the increase in the number of pins is not linear with the number of control units because: (i) each control unit is responsible only for an instruction or a group of instructions; (ii) literal fields and register number fields are not broadcast; they are part of the instruction (they have to be broadcast in SIMD and MSIMD architectures); and (iii) pins that carry the same values on different control units throughout their lifetime are routed as a single physical pin.
The Control Broadcast Network: The control broadcast network is responsible for delivering the control signals from the control units to the PEs. Traditionally, control broadcast has been a bottleneck on centralized control machines; solutions to this problem include pipelining of the broadcast [3], and caching the control units closer to the PEs [16]. With advances in fabrication technology, there is a trend to move processing to memory [8]. For such systems, the control units may be replicated per chip, simplifying the control broadcast problem.
Control Unit Synchronization and Balance: The control stream required by each PE is not supplied on demand as in traditional computers. Rather, the control units have to be synchronized such that the control streams are available when a PE needs them (with minimal waiting time). A possible model for synchronizing the control units is to force them to share a common cycle length, called the fundamental instruction cycle, determined by the slowest control unit. However, the instruction cycle time may vary widely for the instructions in the instruction set, forcing long idle times for PEs with short instructions. Fortunately, there are a number of synchronization models that reduce PE idle time, including: (i) issuing the long instructions infrequently; (ii) allowing the short instructions to execute multiple times while the long instruction is executing; and (iii) breaking long instructions into a series of shorter instructions [2].
Support for Communication and I/O: Support of communication, I/O and other system functions poses the following problem: the system must provide support for both SIMD and MIMD operation at a cost that can be justified against the simple PEs. We focus this discussion on the communication subsystem. There are two options for supporting communication. The first option restricts the support to the SIMD mode; the more difficult problem of supporting MIMD-mode communication is side-stepped. Restricting communication to SIMD mode is inefficient because: (i) all the PEs have to synchronize and switch back to SIMD if any of them needs to communicate, and (ii) because of SIMD semantics, all the PEs must wait for the PE with the worst communication path; a bottleneck of synchronous operation when irregular communication patterns are required [6]. Another alternative is to support MIMD operation using an inexpensive network [1].
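The trade-off behind the second synchronization option can be quantified with a toy calculation. The sketch below is illustrative only; the instruction names and cycle counts are assumed, not taken from the paper. It compares the fraction of a common fundamental cycle that a PE spends doing useful work when its instruction is issued once versus several times per fundamental cycle.

    # Back-of-the-envelope model of PE utilization under a shared
    # fundamental instruction cycle (all numbers are assumptions).
    instr_cycles = {"alu": 4, "branch": 4, "load": 11, "store": 11}

    fundamental = max(instr_cycles.values())   # cycle of the slowest control unit

    def utilization(instr, issues_per_cycle=1):
        """Fraction of the fundamental cycle spent on useful work."""
        useful = min(instr_cycles[instr] * issues_per_cycle, fundamental)
        return useful / fundamental

    for name in instr_cycles:
        print(name, utilization(name), utilization(name, issues_per_cycle=2))

With one issue per fundamental cycle a short instruction leaves most of the cycle idle; letting it issue twice recovers much of that time, which is option (ii) above.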
4
A Case Study: SharC
In this section we present an overview of the architecture of SharC, a proof-of-concept shared control architecture [1]. We preface this discussion by noting that the SharC architecture was designed for possible fabrication within a very small budget (10,000 US dollars for a 64 PE prototype). The PE chip fits within the MOSIS small chip (a very small chip size with 100 I/O pads); much higher integration is possible with more aggressive technology. While SharC does not represent a realistic high-performance design using today's technology, it can still be used as an impartial model to investigate the shared-control performance relative to SIMD performance.
Fig. 4. System Overview
Figure 4 presents an overview of the system architecture. The shared control subsystem consists of 9 control units. The fundamental cycle for all the control units is 11 cycles, with the exception of the load/store control units which require 22 cycles. There are 4 PEs per chip sharing a single memory port. Communication occurs using the same memory port as well: 4 cycles of the 11-cycle fundamental cycle are reserved for the exchange of two messages with the network processor (one each way). Most of the integer operations are mapped to the same control unit and implemented using composition. The fundamental cycle is 11 cycles long, constrained by the access time of the shared memory/communication port. If dedicated (non-shared) memory ports are supplied, this length can be reduced to 4 cycles for most instructions. As was expected, the number of control pins for the shared control mode exceeded the number required by SIMD mode. However, the number of pins only increased from 112 pins for the SIMD mode to 118 pins for the shared control mode; the increase is small because of reasons discussed in Section 3. The average number of Clocks Per Instruction (CPI) for the shared control mode is higher than the CPI for SIMD mode; most SIMD instructions require less than
11 cycles to implement. The reason for the higher CPI is the time required for the additional fetch required for each instruction, as well as the idle cycles required to synchronize the control units. Thus, for pure data-parallel applications, a SIMD implementation yields better performance than shared control. However, as the degree of control parallelism increases, the performance of shared control remains constant, while SIMD performance degrades. The SharC communication network is a packet-switched toroidal mesh network, with a chip (4 PEs) at each node sharing a network processor. The network processor is made simple by using an adaptive deflection routing strategy (no buffering is necessary) [9]. In many cases, deflection routing results in better performance than oblivious routing strategies (where the path taken by a packet is independent of other packets), because excess traffic is deflected away from congested areas, creating better network load balance.
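To make the buffer-free property of deflection routing concrete, the following toy sketch shows one routing decision at a mesh router. It is not the SharC network processor; the port names and arbitration rule are assumptions for illustration. Every incoming packet is assigned some output port in the same cycle, so no packet ever has to be stored.

    # Toy model of one deflection-routing step at a mesh router.
    def route_step(packets, ports=("N", "S", "E", "W")):
        """packets: list of (packet_id, preferred_port). Returns {port: packet_id}."""
        assignment = {}
        deflected = []
        for pid, want in packets:
            if want not in assignment:
                assignment[want] = pid        # wins its productive port
            else:
                deflected.append(pid)         # loses arbitration, will be deflected
        free = [p for p in ports if p not in assignment]
        for pid, port in zip(deflected, free):
            assignment[port] = pid            # deflection: take any free port
        return assignment

    print(route_step([(1, "E"), (2, "E"), (3, "N")]))
    # e.g. {'E': 1, 'N': 3, 'S': 2} - packet 2 is deflected away from the hot port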
5
Performance Analysis
A detailed simulator of a scalable configuration of the SharC design is used to study its performance. The model incorporates a structural description of the control subsystem at the microcode level; it also serves to verify the correctness of the microcode. The test programs were written in assembly language, and assembled using a macro assembler. The assembler supports a SIMD/SPMD programming model; an algorithm is written as a SIMD/SPMD application, and any region can be executed in either mode (allowing mixed-mode programming). A transition from SIMD to MIMD operation is initiated explicitly by the SIMD program (using a special c_mimd instruction). A transition from MIMD back to SIMD occurs when all the PEs have halted (using a halt instruction), implementing a full barrier synchronization.
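The execution discipline of these mode transitions can be summarized by a small sketch. The code is ours and only illustrative (it does not use the SharC mnemonics): a SIMD prologue runs in lock-step, each PE then advances its local program independently until it halts, and the implicit barrier returns all PEs to SIMD mode.

    # Illustrative model of mixed-mode SIMD/MIMD execution with a halt barrier.
    def run_mixed(simd_prologue, local_programs, simd_epilogue):
        simd_prologue()                       # lock-step SIMD region
        halted = [False] * len(local_programs)
        while not all(halted):                # MIMD phase: PEs advance independently
            for pe, program in enumerate(local_programs):
                if not halted[pe]:
                    halted[pe] = program()    # program() returns True once the PE halts
        simd_epilogue()                       # barrier reached: back to SIMD mode

    # Example: each PE counts down from a different value before halting.
    counters = [3, 1, 2]
    def make_program(pe):
        def step():
            counters[pe] -= 1
            return counters[pe] <= 0
        return step

    run_mixed(lambda: print("SIMD block A"),
              [make_program(pe) for pe in range(3)],
              lambda: print("SIMD block B"))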
Fig. 5. Speedup for the N-Queen Problem
Our first example is the N-Queen problem: a classic combinatorial problem that is easy to express but difficult to solve. The N-Queen problem has recently
Fig. 6. Search Competition Algorithm Performance
found applications in optical computing and VLSI routing [17]. We considered the exhaustive case (find all solutions, rather than find a solution). The N-Queen problem is highly irregular and not suited to SIMD implementation. Figure 5 shows the performance of the shared control implementation normalized to that of the SIMD implementation. The SIMD solution was worse than a sequential solution because it failed to extract any parallelism, but incurred the activity management overhead. The speedup levelled off because the parallelism present at the small scales of the problem that were studied is limited. The second application is a database search competition algorithm; an algorithm characteristic of massively parallel database operations. Each PE is initialized with a segment of a database (sorted by keys), and the PEs search for keys in their database segment. As the number of PEs is scaled, the size of the database is scaled (the size of the segment at each PE is kept the same). For a small number of PEs, there is little control parallelism, and SIMD performs better than shared control. As the number of PEs is increased, the paths through the database followed by each PE diverge according to their respective data. The performance of the SIMD implementation drops with the increased number of PEs while shared control performance remains constant. Finally, we consider a parallel case statement with balanced cases (Figure 7). By varying the number of cases in the case statement, the degree of control parallelism is directly controlled. The instruction mix within the SIMD blocks was generated randomly, with a typical RISC profile comprised of 30% load/store instructions, with branches forced to the next address (allowing for the pipelined instruction). The blocks are chosen to be of balanced lengths (10% variation). The program was simulated in three different ways: a SIMD implementation, a shared control (SPMD) implementation, and a mixed mode implementation. The mixed mode implementation executes the switch statement in the shared control mode, but executes block A and block B in SIMD. Figure 8 shows execution times as a function of the number of cases. Not surprisingly, the performance of the SIMD mode degrades (linearly) with the increased control parallelism injected by increasing the number of cases. Both
plural p = p_random(1, n);
[SIMD BLOCK A]
switch(p) {
    case 1: [SIMD BLOCK 1]; break;
    ...
    case n: [SIMD BLOCK n];
}
[SIMD BLOCK B]

Fig. 7. Parameterized Parallel Branching Region
Fig. 8. Simulation results for the architecture on a parameterized branching region
the mixed mode and MIMD execution times did not change significantly with the added control parallelism. The mixed mode implementation is more efficient than the full shared control implementation because of its superior performance on the leading and trailing SIMD blocks.
6
Conclusions
SIMD machines offer an excellent price to performance alternative for applications that fit their restricted control organization. Commercial SIMD machines with superior price to performance ratio continue to be built for specific applications. Unfortunately, centralized control architectures (like the SIMD model) cannot support control parallelism in applications; the control unit has to sequentially broadcast the different control sequences required by each of the control threads. In this paper, we present a model for building centralized control architectures that is capable of supporting control parallelism efficiently. The shared control model is less efficient than the SIMD model on regular code sequences (it requires an additional instruction fetch, and memory space to hold the task
instructions). When used in conjunction with the SIMD model, irregular regions of applications can be executed in shared control, extending the utility of the SIMD model to a wider range of applications. The shared control model is fundamentally different from the SIMD model. Therefore, there is a set of architectural and implementation issues that must be addressed before an efficient shared control implementation can be realized. The performance of SharC was studied using a detailed RTL simulator. Applications were implemented in the SIMD mode, and in the shared control mode. The performance of the shared control implementation was compared to a pure SIMD implementation for several applications (the highly irregular N-queen, and massively parallel database search algorithms were used). Even at the small scales of the problems considered, shared control resulted in significant improvement in performance over the pure SIMD implementation. Unfortunately, larger applications (or problem sizes of the studied applications) could not be studied because: (i) simulating a massively parallel machine on a uniprocessor is inefficient: the simulation run time for the 1024 PE 8-queen problem was in excess of 20 hours on a SunSparc 100 workstation, and (ii) we do not have a compiler for the machine; all the examples were written and coordinated in assembly language. Finally, we used a parameterized conditional region benchmark to demonstrate that shared control is beneficial for applications where even a small degree of control parallelism is present.
References
1. Abu-Ghazaleh, N. B. Shared Control: A Paradigm for Supporting Control Parallelism on SIMD-like Architectures. PhD thesis, University of Cincinnati, July 1997. (in press).
2. Abu-Ghazaleh, N. B., and Wilsey, P. A. Models for the synchronization of control units on shared control architectures. Journal of Parallel and Distributed Computing (1998). (in press).
3. Allen, J. D., and Schimmel, D. E. The impact of pipelining on SIMD architectures. In Proc. of the 9th International Parallel Processing Symp. (April 1995), IEEE Computer Society Press, pp. 380-387.
4. Bagley, R. A., Wilsey, P. A., and Abu-Ghazaleh, N. B. Composing functional unit blocks for efficient interpretation of MIMD code sequences on SIMD processors. In Parallel Processing: CONPAR 94 - VAPP VI (September 1994), B. Buchberger and J. Volkert, Eds., vol. 854 of Lecture Notes in Computer Science, Springer-Verlag, pp. 616-627.
5. Bridges, T. The GPA machine: A generally partitionable MSIMD architecture. In Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Architectures (1990), pp. 196-203.
6. Felderman, R., and Kleinrock, L. An upper bound on the improvement of asynchronous versus synchronous distributed processing. In Proceedings of the SCS Multiconference on Distributed Simulation (January 1990), vol. 22, pp. 131-136.
7. Fox, G. What have we learnt from using real parallel machines to solve real problems? Tech. Rep. C3P-522, Caltech Concurrent Computation Program, California Institute of Technology, Pasadena, CA 91125, March 1988.
8. Gokhale, M., Holmes, B., and Iobst, K. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer (April 1995), 23-31.
9. Greenberg, A., and Goodman, J. Sharp approximate models of deflection routing in mesh networks. IEEE Transactions on Communications 41 (Jul. 1993).
10. Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufman Publishers Inc., San Mateo, CA, 1995.
11. Hillis, W. D. The Connection Machine. The MIT Press, Cambridge, MA, 1985.
12. Keryell, R., and Paris, N. Activity counter: New optimization for the dynamic scheduling of SIMD control flow. In 1993 International Conference on Parallel Processing (Aug. 1993), vol. 2, pp. 184-187.
13. MasPar Computer Corporation. MasPar Assembly Language (MPAS) Reference Manual. Sunnyvale, CA, July 1991.
14. Nickolls, J. The design of the MasPar MP-1. In Proceedings of the 35th IEEE Computer Society International Conference (1990), pp. 25-28.
15. Parhami, B. SIMD machines: Do they have a significant future? Computer Architecture News (September 1995), 19-22.
16. Rockoff, T. SIMD instruction caches. In Proceedings of the Symposium on Parallel Architectures and Algorithms '94 (May 1994), pp. 67-75.
17. Sosic, R., and Gu, J. Fast search algorithms for the N-queens problem. IEEE Transactions on Systems, Man, and Cybernetics (Nov. 1991), 1572-1576.
18. Weems, C., Riseman, E., and Hanson, A. Image understanding architecture: Exploiting potential parallelism in machine vision. Computer (Feb 1992), 65-68.
Generating Parallel Applications of Spatial Interaction Models
John Davy and Wissal Essah
School of Computer Studies, University of Leeds, Leeds, West Yorkshire, UK, LS2 9JT
Abstract. We describe a tool enabling portable parallel applications of spatial interaction modelling to be produced automatically from highlevel specifications of their parameters. The application generator can define a whole family of models, and produces programs which execute only slightly more slowly than corresponding hand-coded versions.
1
Introduction
This paper describes a prototype Application Generator (AG) for rapid development of parallel programs based on Spatial Interaction Models (SIMs). The work was a collaboration between the School of Computer Studies and the Centre for Computational Geography at the University of Leeds, with GMAP Ltd, a University company, as industrial partner. It was supported by the Engineering and Physical Sciences Research Council within its Portable Software Tools for Parallel Architectures programme. An application generator can be characterised as a tool “with which applications can be built by specifying their parameters, usually in a domain-specific language” [1]. Our AG follows precisely this definition:
– both the SIM itself and the various computational tasks which use it are specified by defining parameters in a textual Interface File;
– parallel programs for the required tasks are generated from the Interface File, using templates which implement generic forms of these tasks;
– automatically generated programs deliver identical results to hand-coded versions of the same tasks.
Portability is achieved on two counts. Firstly, the level of abstraction of the Interface File is high enough to be machine-independent. Secondly, the AG generates source-portable programs in Fortran-77 with MPI. Currently the AG has been tested on a 4-processor SGI Power Challenge and a 512-processor Cray T3D. A serial version runs on a range of Sun and SGI workstations. A common criticism of such high-level approaches to program generation is that excessive and unacceptable overheads are introduced in comparison with equivalent hand-coded programs. For this reason, every effort was made to ensure that needless inefficiencies arising from genericity were avoided; also the AG determines an efficient form of the parallel program based on the dimensions
of the model and the number of processors available. With execution times for generated programs no more than 2% greater than the corresponding hand-coded programs, we have demonstrated that acceptable performance from high-level abstractions is indeed possible, at least within a limited application domain. The work is closely related to algorithmic skeletons [3], in which common patterns of parallel algorithms are specified as higher-order functions, each of which has a template for parallel implementation. Our current prototype is only first order, but extensions to cover a wider range of industrial requirements are likely to lead to higher-order versions.
2
Spatial Interaction Modelling
Spatial interaction modelling was developed in the context of the social sciences, notably quantitative geography. A SIM is a set of non-linear equations which defines flows (people, commodities etc) between spatial zones. Such models are of importance to the business sector, academic researchers, and policy makers; they have been used in relation to a wide range of spatial interaction phenomena, such as movements of goods, money, services and persons, and spread of innovation. Many realistic spatial interaction problems involve large data flows, together with computation which grows rapidly with the number of zones the model uses to represent geographical areas. Computer constraints have therefore limited the level of geographical detail that can practically be used for these models. Greater computing power, available by the exploitation of parallel processing, allows models to be used on a finer level with more realistic levels of detail and may enable better quality results. Despite these benefits, the exploitation of parallelism has received limited attention. Recent research [2,5,6] has confirmed the reality of these benefits and the considerable potential of parallelism in this area. In particular, [5] shows the effectiveness of parallel genetic algorithms for solving the network optimisation problem, in comparison with previous heuristic methods. To date, however, such work has proceeded on an ad hoc basis, and there is still a need to provide a more uniform and consistent approach, enabling these technologies to be more readily exploited outside the specialised community of parallel processing. The research reported here aims to rectify this deficiency by developing an application generator to:
– cover a wide range of SIM applications in the social sciences;
– enable efficient parallel implementations over a range of machines;
– be scalable to enable the use of large data sets on appropriate machines.
2.1 A Family of Models
There is a family of closely-related SIMs derived from so-called entropy-maximising methods [7]. A simple origin-constrained model is specified by

T_{ij} = O_i D_j A_i f(c_{ij})                            (1)

A_i = 1 / \sum_j D_j f(c_{ij})                            (2)

where T_{ij} predicts the flow from origin zone i to destination zone j, O_i, D_j represent the ‘sizes’ of i and j, and c_{ij} is a measure of the travel cost from i to j (often simply the distance). The deterrence function f(c_{ij}) decreases as c_{ij} increases and is commonly modelled as either the exponential function e^{-β c_{ij}} or the exponentiation function c_{ij}^{-β}, where β is an unknown impedance factor to be estimated. Equation (2) ensures that the predicted flows satisfy the constraint

\sum_j T_{ij} = O_i                                       (3)

thus equating the ‘size’ of a zone to the number of flows starting there. A single model evaluation involves computing the set of values T_{ij} defined by (1) and (2). Destination-constrained models are an obvious variant of origin-constrained models, in which (1) and (2) are replaced by

T_{ij} = O_i D_j A_j f(c_{ij})                            (4)

A_j = 1 / \sum_i O_i f(c_{ij})                            (5)

thus satisfying the constraint

\sum_i T_{ij} = D_j                                       (6)
Doubly-constrained models are rather more complex:

T_{ij} = O_i D_j A_i B_j f(c_{ij})                        (7)

A_i = 1 / \sum_j D_j B_j f(c_{ij})                        (8)

B_j = 1 / \sum_i O_i A_i f(c_{ij})                        (9)

In this case the predicted flows satisfy constraints similar to both (3) and (6).
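To make the cost of a single model evaluation concrete, the following sketch evaluates an origin-constrained model, equations (1)-(3), with an exponential deterrence function. It is ours and purely illustrative (plain Python, not the Fortran-77/MPI code the paper generates).

    import math

    # Minimal sketch of one origin-constrained model evaluation (equations (1)-(3)).
    def evaluate_sim(O, D, cost, beta):
        """O[i], D[j]: zone sizes; cost[i][j]: travel costs; returns flows T[i][j]."""
        n, m = len(O), len(D)
        f = [[math.exp(-beta * cost[i][j]) for j in range(m)] for i in range(n)]
        T = []
        for i in range(n):
            A_i = 1.0 / sum(D[j] * f[i][j] for j in range(m))          # equation (2)
            T.append([O[i] * D[j] * A_i * f[i][j] for j in range(m)])  # equation (1)
        return T

    # Tiny example: row sums of T reproduce O, i.e. constraint (3) holds by construction.
    T = evaluate_sim(O=[10.0, 20.0], D=[1.0, 2.0, 3.0],
                     cost=[[1.0, 2.0, 3.0], [2.0, 1.0, 2.0]], beta=0.5)
    print([sum(row) for row in T])   # ~[10.0, 20.0]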
2.2 Applications Involving SIMs
SIMs are commonly used to represent human trip behaviour (such as to retail outlets) in network optimisation problems, where some ‘profit’ or other performance indicator of a global network is maximised. A typical problem is to locate some number NF of facilities (e.g. shops, dealerships, hospitals) in a set of NZ distinct zones (where NF < NZ ) in such a way as to maximise the ‘profit’ from the facilities, which is computed from an underlying SIM by evaluating (a variant of) equation (1) for the zones in which facilities are located.
A necessary preliminary to solving this non-linear optimisation problem is the calibration of the model, in which the value of β is estimated from observed values T_{ij}^{obs} of trips (flows) between zones. This is a further non-linear optimisation problem which minimises an appropriate error norm f(β) between T_{ij}^{obs} and the flows T_{ij} predicted by the model. Choice of appropriate error norms is a complex issue [7], which is beyond the scope of this paper; for simplicity we here use only maximum likelihood (10).

f(β) = ( \sum_i \sum_j T_{ij}^{obs} c_{ij} - \sum_i \sum_j T_{ij} c_{ij} ) / \sum_i \sum_j T_{ij}        (10)
The reliability of solutions to network optimisation problems depends on the reliability of the calibration process, which can be addressed computationally. The robustness of the computed value of β in relation to minor changes in T_{ij}^{obs} is assessed by bootstrapping, involving multiple recalibrations of the model with slightly different sets of observed trip values obtained by systematic sampling and replacement from the original T_{ij}^{obs}. Thus the mean and variance of β can be derived. The number of calibrations carried out, B, is the bootstrap size. Clearly, greater values of B lead to more accurate estimates of the mean and variance of β; in practice, the heavy computation places practical limits on B, emphasising the potential benefits of parallel processing.
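The calibration and bootstrap steps can be sketched as follows. This is our illustration, not the paper's implementation: it assumes the reading of the error norm (10) given above and uses a simple interval search over β rather than the Golden Section optimiser used in the real code.

    import random

    # Illustrative calibration and bootstrap loop (assumed error norm, toy search).
    def error_norm(T_pred, T_obs, cost):
        n, m = len(cost), len(cost[0])
        tot_flow = sum(map(sum, T_pred))
        pred_cost = sum(T_pred[i][j] * cost[i][j] for i in range(n) for j in range(m))
        obs_cost = sum(T_obs[i][j] * cost[i][j] for i in range(n) for j in range(m))
        return abs(obs_cost - pred_cost) / tot_flow

    def calibrate(evaluate, T_obs, cost, lo=0.01, hi=5.0, steps=60):
        """Minimise the error norm over beta; evaluate(beta) returns predicted flows."""
        for _ in range(steps):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if error_norm(evaluate(m1), T_obs, cost) < error_norm(evaluate(m2), T_obs, cost):
                hi = m2
            else:
                lo = m1
        return (lo + hi) / 2

    def bootstrap(evaluate, T_obs, cost, B):
        """B recalibrations on trip matrices resampled with replacement."""
        flat = [(i, j) for i in range(len(T_obs)) for j in range(len(T_obs[0]))
                for _ in range(int(T_obs[i][j]))]
        betas = []
        for _ in range(B):
            sample = [[0] * len(T_obs[0]) for _ in T_obs]
            for i, j in random.choices(flat, k=len(flat)):
                sample[i][j] += 1
            betas.append(calibrate(evaluate, sample, cost))
        return betas

Each of the B calibrations is independent, which is exactly the coarse-grained application-level parallelism exploited in the next section.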
3
Parallel Implementation
We have implemented all three application components (calibration, bootstrapping, optimisation) on all three kinds of model (origin-, destination- and doublyconstrained), using Fortran77 with MPI. The aim of this part of the work was to experiment with alternative implementation techniques and derive templates for use in the AG. Details have already been reported in [4]; here there is space only to outline the principles in the origin-constrained case. For all three components parallelism can be obtained within both the model evaluation and the application which evaluates the model: model evaluation is highly data parallel (as implied by (1) and (2)) and the applications can be parallelised by executing independent SIM evaluations in parallel. For calibration we used a ‘Golden Section’ non-linear optimiser, which generates few independent model evaluations. Hence we expected parallelism to be exploited most effectively within the SIM evaluation. On the other hand, bootstrapping involves multiple independent calibrations and thus a high level of more coarsely-grained application-level parallelism was expected. Following [5], a genetic algorithm was used to solve the non-linear network optimisation problem. Here the SIM is used to evaluate the fitness of each member of a population of possible solutions, hence multiple independent SIM evaluations are again involved, leading to plentiful application-level parallelism. To explore the optimal combination of application- and model-based parallelism the P processors are viewed as a logical grid of dimensions papp × pmod .
The application is parallelised between the papp application processors of one row, each of which acts as a master processor for model evaluation, distributing its copy of the model between the pmod model processors in its column.
3.1 Parallelism in the Model
One of the main computational challenges for model evaluation is the large volume of stored data; space is required for the cost matrix c_{ij} and (for calibration and bootstrapping) the trip matrix T_{ij}^{obs}. A data parallel implementation distributes them across the processors using cyclic partitioning by rows: processor 1 receives rows 1, P+1, 2P+1, ..., processor 2 rows 2, P+2, 2P+2, ..., and so on. The matrices are treated slightly differently, as follows:
– each processor reads an (x, y) pair for each zone centroid, to compute its own rows of c_{ij}, stored in dense format;
– each processor stores the corresponding rows of T_{ij}^{obs} (also read from file), in sparse format since most trips are between physically close zones.
This distribution ensures that model evaluation is scalable from the perspective of memory usage, as long as the number of processors increases proportionately to the model size. Partitioning cyclically leads to a better load balance than partitioning contiguously because of the removal of systematic matrix patterns. Evaluation of (1) and (2) may then proceed entirely in parallel with no further communication. It is only necessary to store one row of the T_{ij} matrix, since the error norms for calibration or bootstrapping are cumulatively evaluated in parallel from equation (10). The ‘profit’ in network optimisation is similarly accumulated.
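The cyclic distribution and the local accumulation of scalar results can be sketched as follows (our illustration; the real code is Fortran-77 with MPI, and the final sum would be an MPI reduction rather than a Python loop).

    # Sketch of cyclic row distribution and per-processor accumulation.
    def my_rows(p, P, n_zones):
        return range(p, n_zones, P)          # processor p holds rows p, p+P, p+2P, ...

    def local_contribution(p, P, n_zones, row_term):
        """row_term(i) -> this row's contribution to the error norm or 'profit'."""
        return sum(row_term(i) for i in my_rows(p, P, n_zones))

    # The global value is the sum of the P local contributions.
    P, n_zones = 4, 10
    total = sum(local_contribution(p, P, n_zones, row_term=lambda i: float(i))
                for p in range(P))
    print(total)   # same as sum(range(10)) = 45.0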
3.2 Performance Results
Performance results were obtained on a 512-processor Cray T3D machine, using ‘journey-to-work’ data derived from the 1991 UK census (see [6]). This records the numbers of journeys (Tijobs ) out of and into all 10764 electoral wards. Costs cij are computed from distances between the centroids (xi , yi ) of wards; in the case of intra-ward journeys the costs cii are fixed at some notional distance di depending on ward size. A second, smaller data set contains equivalent information aggregated into 459 electoral districts. These data are representative of a range of other origin-constrained models, such as journeys to shopping centres or car dealerships, and can therefore be used to simulate corresponding location optimisation problems. As most journeys are between relatively close zones the trips matrix has the typical highly sparse pattern. Also, since the data sets satisfy both origin and destination constraints they may be used to assess doubly-constrained calibration and bootstrapping. In all cases an exponential deterrence function was used. For space reasons, only a selection of the results in [4] are given. As expected, calibration obtained best results with all parallelism in the model (papp = 1), whereas bootstrapping and network optimisation enabled
Table 1. Optimisation times (sec) for locating 15 facilities in 10764 zones

   P    papp=1   papp=2   papp=4   papp=8   papp=16
  32     3779
  64     3317     1843
 128     3300     1657      928
 256     3459     1650      833      471
 512     3736     1724      824      418      241
Table 2. Calibration times (sec) for 10764 zones

   P    Time at papp = 1   Speedup
  32         274.3
  64         137.6           1.99
 128          69.2           3.96
 256          35.1           7.8
 512          17.9          15.3
maximum parallelism in the application. Table 1 shows optimisation times for allocating 15 car dealerships within 10764 zones, for varying values of P and papp . Increasing papp always decreases execution time, so the best time is obtained when the maximum value of papp is used (in this case P/32, since the model requires 32 processors to evaluate). Similar results were obtained for the 459-zone model, which can be evaluated on a single processor (ie Papp = P ). Encouragingly, the diagonal entries in Table 1 show a speedup close to linear. Bootstrapping results show a similar pattern to network optimisation. Calibration was predictably less scalable, since only the finer-grain model parallelism was available. However, even here the results were encouraging. Table 2 shows a near-linear improvement in execution time as the number of processors increases 16-fold – the baseline of 32 processors was again the minimum necessary to evaluate the model. Even with the smaller 459-zone model useful performance improvements were obtained by parallelism up to 64 processors (see [4] for details) but performance degraded thereafter. By contrast, bootstrapping and optimisation continued to achieve near-linear speedup up to 512 processors even with the smaller model, because parallelism was exploited at the application level.
4
Specifying a Spatial Interaction Model
A SIM application can be specified by defining its parameters in a textual Interface File. We illustrate this for a simple case in Fig. 1, which defines a locational problem to optimise placement of 15 facilities within 10764 zones. (The need for MAXTRP in this file arises from using static arrays in the Fortran-77 implementation and is not inherent in the model specification.)
SIM              % Model parameters
MODTYP=O         % origin constrained model
DETFUN=exp       % exponential deterrence function
CALIBRATE        % calibration parameters
ERROR=maxlike    % maximum likelihood error norm
BOOTSTRAP        % bootstrapping parameters
BSIZE=256        % size of bootstrap sample
OPTIMISE         % optimisation parameters
LOC=15           % number of locations to be optimised
DATASIZES        % data set details for
NZ=10764         % number of origin zones
MZ=10764         % number of destination zones
MAXTRP=586645    % total number of non-zero trips
                 % in calibration data
Fig. 1. Interface File for simple origin-constrained model
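As an illustration of how such a specification might be consumed, the sketch below parses an Interface File of this shape into a simple dictionary. It is not the actual AG front end (which is written in Fortran, Perl and Unix scripts); the parsing rules are assumptions read off Fig. 1.

    # Illustrative parser for the Interface File format of Fig. 1.
    SECTIONS = {"SIM", "CALIBRATE", "BOOTSTRAP", "OPTIMISE", "DATASIZES"}

    def parse_interface(text):
        spec, current = {}, None
        for raw in text.splitlines():
            line = raw.split("%", 1)[0].strip()      # drop comments
            if not line:
                continue
            if line in SECTIONS:
                current = spec.setdefault(line, {})
            elif "=" in line and current is not None:
                key, value = line.split("=", 1)
                current[key.strip()] = value.strip()
        return spec

    example = """SIM
    MODTYP=O
    DETFUN=exp          % exponential deterrence function
    OPTIMISE
    LOC=15
    """
    print(parse_interface(example))
    # {'SIM': {'MODTYP': 'O', 'DETFUN': 'exp'}, 'OPTIMISE': {'LOC': '15'}}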
This Interface File format could be used as the input to an AG to generate parallel programs using templates based on the codes described in section 3. In effect we would have a generator based on SIMs defined by equations (1) to (9). While interesting as a demonstration of principle, the simplified form of the models is very restrictive. To develop a more powerful AG, we studied the requirements of a wide range of spatial location problems, including examples from retailing, agricultural production, industrial location, and urban spatial structure. From these a more generic SIM formulation was derived. Here we outline the origin-constrained version, without seeking to justify the modelling process.
4.1 A Generic Origin-Constrained Model
We assume that each origin zone produces ‘activities’ or ‘goods’ of several types and that each destination zone has several ‘facilities’ (consumers of activities/goods). This rather general terminology describes a range of different phenomena; for instance (and rather counter-intuitively) ‘goods’ may be m different categories of potential buyers travelling to one of r different car dealerships. The earlier flow value T_{ij} generalises to T_{ij}^{mr}, the flow of good type m from zone i to facility r in zone j. Origin and destination sizes, O_i and D_j, become O_i^m and D_j^r, and we allow for the latter to be exponentiated. Thus (1), (2) and (3) generalise to

T_{ij}^{mr} = O_i^m (D_j^r)^α A_i f(c_{ij})               (11)

A_i = 1 / \sum_j \sum_r (D_j^r)^α f(c_{ij})               (12)

\sum_j \sum_r T_{ij}^{mr} = O_i^m                         (13)
The network optimisation problem is defined as maximising the difference between revenues (D_j^r) and costs (C_j^r) for each facility, i.e. for each r maximise

\sum_{j∈J} (D_j^r - C_j^r)                                (14)

over a subset J of destination zones, subject to (13). Revenues are computed as \sum_i \sum_m p_i^m T_{ij}^{mr}, where p_i^m is the revenue generated from a good of type m from zone i. In the most general case, costs at a facility of type r at j have three components: maintaining the facility, transport, and the costs associated with transactions. This is modelled as

C_j^r = v_j^r (D_j^r)^α + ( \sum_i \sum_m c_{ij} T_{ij}^{mr} ) + ( q_j^r \sum_i \sum_m T_{ij}^{mr} )        (15)

where v_j^r, q_j^r are the unit costs of running facility r at j and making transactions there. The second and third terms are not always needed. Finally, unit transport costs c_{ij} can be modelled by a term proportional to distance together with ‘terminal costs’, such as parking. Thus, in the most general case,

c_{ij} = t d_{ij} + η_i + ρ_j                             (16)

where t is travel cost per unit distance, d_{ij} is a distance measure (we use Euclidean distance), and η_i and ρ_j are the terminal costs at origin and destination. In some cases the t value may be irrelevant (i.e. it is only necessary to have costs proportional to distance) and the terminal costs are not always required. Specifying this more general model requires the values of m, r and α, as well as stating whether the optional parts of the model are included. The values of ρ_j, η_i, v_j^r, q_j^r and p_i^m are data to be read from files at runtime if required. Thus, the previous Interface File (Fig. 1) would be replaced by Fig. 2.
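The role of the optional cost terms can be shown with a small sketch of the objective (14)-(15) for a single good type and facility type. This is ours and illustrative only; the flags mirror the spirit of CFLAG/TFLAG/PFLAG/QFLAG in Fig. 2, but the exact mapping of flags to terms is an assumption.

    # Illustrative facility 'profit' of (14)-(15) with optional cost terms toggled.
    def facility_profit(J, D, T, c, v, q, alpha=1.0,
                        with_transport=False, with_transactions=False):
        """D[j]: revenue at zone j; T[i][j]: flows into j; c[i][j]: unit transport
        cost; v[j], q[j]: unit facility and transaction costs. Single m and r."""
        profit = 0.0
        for j in J:
            cost_j = v[j] * D[j] ** alpha                               # facility term of (15)
            if with_transport:
                cost_j += sum(c[i][j] * T[i][j] for i in range(len(T)))
            if with_transactions:
                cost_j += q[j] * sum(T[i][j] for i in range(len(T)))
            profit += D[j] - cost_j                                     # summand of (14)
        return profit

Switching a term off at generation time, rather than evaluating it with zero coefficients, is exactly the kind of redundant computation the AG removes (Section 5).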
5
The Application Generator
The prototype AG implements the generic model above. It accepts an Interface File as input and generates the required parallel programs using templates based on the codes described in section 3. The user supplies an Interface File as above, or alternatively is prompted for parameters at a simple command-line interface. Each of the CALIBRATE, BOOTSTRAP, OPTIMISE sections is optional. Using the results of section 3, calibration only exploits parallelism in the SIM evaluation, whereas bootstrapping and optimisation use maximum parallelism in the application, subject to limits imposed by the model size, derived from the NZ and MZ parameters. Thus, within the constraints of the program structure chosen, the values of papp and pmod give optimal execution time. The use of a generic model can lead to superfluous and time-consuming computations, such as adding 0 when optional parts of the computation are not present, and needless exponentiation when ALPHA has its (common) value 1. The AG removes all such redundant computation, bringing execution times of
SIM
MODTYP=O
DETFUN=exp
O_ACTIVITY=1     % number of activities at origins
D_FACILITY=1     % number of facilities at destinations
CFLAG=0          % travel costs NOT included in optimisation
TFLAG=0          % terminal and unit travel costs NOT included
PFLAG=0          % revenue costs NOT included in optimisation
QFLAG=0          % transaction costs NOT included in optimisation
ALPHA=1.0        % value of alpha
CALIBRATE
ERROR=maxlike
BOOTSTRAP
BSIZE=256
OPTIMISE
LOC=7
DATASIZES
NZ=10764
MZ=10764
MAXTRP=586645
Fig. 2. Interface file with generic SIM
generated code very close to hand-coded versions. Table 3 shows overheads of less than 2% on the Cray T3D for a variety of model types, application components and values of P (with 10764-zone model). Source code for the AG is a mix of Fortran, Perl and Unix scripts. The templates consist of 2700 lines of code totalling some 97 kbytes, and the rest of the AG consists of 2500 lines (77 kbytes). It has been implemented and tested on an SGI Power Challenge, a Cray T3D, and various Sun and SGI workstations.
6
Evaluation and Conclusions
The prototype AG shows that it is possible to generate non-trivial parallel applications from very high-level specifications, with small overheads compared with direct coding. This is made possible by a restricted application domain, which nevertheless is useful in practice and benefits from parallelism. The current tool is not yet of industrial-strength, on several counts. It has not been adequately evaluated by real users, does not cover the full range of spatial interaction modelling variants, and is limited in the outputs produced. Further generalisation will require more detailed inputs from domain experts, including a standardisation of data formats in a flexible way to permit greater generality. More control over the optimisation process (such as specifying the number of generations for the genetic algorithm) is desirable. Lastly, the user interface betrays the system’s Fortran origins and would benefit from a graphical presentation.
Table 3. Overheads of application generator

application type  model constraint   P   overheads (%)
calibration       origin             1   1
calibration       origin             2   1.26
calibration       origin             8   1.4
calibration       double             1   1.8
calibration       double             2   1.7
calibration       double             8   1.6
optimisation      origin             1   1.6
optimisation      origin             4   1.85
optimisation      origin            16   1.9
Despite its limitations, we believe this work is valuable in showing that rapid parallel application development through a skeleton-like approach can be made to work effectively in an appropriate, well-defined application domain.
Workshop 23
ESPRIT Projects
Ron Perrott and Colin Upstill
Co-chairmen

The ESPRIT Programme has now been in existence for over a decade. One of the areas to benefit substantially from the injection of ESPRIT research and development funding has been parallel computing, or high performance computing. It is therefore appropriate that at this conference there should be two sessions dealing with HPC projects. A wide range of papers were submitted, and those selected represent state-of-the-art projects in various areas of interest. The projects break down into applications which can benefit from the use of HPC and those which deal with the software required for making high performance computers more efficient.

In the former category are papers like Parallel Crew Scheduling in PAROS. This application considers the problem of scheduling airline crews on a network of workstations. It is based on the widely known Carmen system, which is used by most European airlines to schedule their crews among aeroplanes. The objective is to produce a parallel version of the Carmen system, execute it on a network of workstations, and demonstrate the improvements and benefits which can be obtained. Essentially there are two critical parts in the application, dealing with pairing generation and optimisation. The pairing generator distributes the enumeration of pairings over the processors, and the optimiser is based on an iterative Lagrangian heuristic. Initial results indicate the speed-up benefits which can be obtained with this application on a network of workstations.

The second paper deals with CORBA (Common Object Request Broker Architecture) and develops a compliant programming environment for HPC. This is part of an ESPRIT project known as PACHA. The main objective is to help the design of HPC applications using independent software components through the use of distributed objects. It proposes to harness the benefits of distributed and parallel programming using a combination of two standards, namely CORBA and MPI. CORBA provides transparent remote method invocations which are handled by an object request broker that provides a communication infrastructure independent of the underlying network. This project introduces a new kind of object which is referred to as a CORBA parallel object. It allows the aggregation of computing resources to speed up the execution of a software component. The interface is described using an extended IDL to manage data distribution among the objects of the collection. The environment is being implemented using Orbix from Iona Technologies and has already been tested using a signal processing application based on a client-server approach.

The OCEANS project deals with optimising compilers for embedded applications. The objective of OCEANS is to design and implement an optimising
compiler that utilises aggressive analysis techniques and integrates source level restructuring transformations with low level machine-dependent optimisations. This will then provide a prototype framework for iterative compilation, where feedback from the low level is used to guide the selection of a suitable sequence of source level transformations and vice versa. The OCEANS compiler is centred around two major components: a high level restructuring system known as MT1 and a low level system for supporting language transformations and optimisations known as Salto. In order to validate the compiler, four public domain multimedia codes have been selected. Initial results indicate that the results using the initial prototype are satisfactory in comparison with the production compiler, but the implementation still needs to be refined at both the high and low levels.

The fourth paper deals with industrial stochastic simulations on large-scale meta-computers, and reports the achievements of the PROMENVIR (Probabilistic Mechanical Design Environment) project. This project culminated in a pan-European meta-computer demonstration of the use of PROMENVIR running a multibody simulation application. PROMENVIR is an advanced computing tool for performing stochastic analysis of generic physical systems. The tool provides the user with a framework for running a stochastic Monte Carlo simulation using a preferred deterministic solver, and then provides sophisticated means to analyse the results. The European meta-computer consisted of a total of 102 processors geographically distributed among the members of the consortium. The paper details the problems that were encountered in establishing the meta-computer and carrying out the execution of the chosen application. It makes useful contributions on the feasibility and reliability of such an approach.

HIPEC stands for High Performance Computing Visualisation System Supporting Network Electronic Commerce applications. This is an ESPRIT project whose main objective is to integrate advanced high performance technologies to form a generic electronic commerce application. The application is aimed at giving a large number of SMEs a configuration and visualisation tool to support the selling process within the show room and to enlarge their business using the same tool over the Internet. Computer graphics technology plays a primary role in the project, as realistic images are required in such commerce. The particular experiment was based on the generic technology and applied to a bathroom furniture application. The project was shown to be of benefit in this particular application area.

The final paper is the porting of the SEMC3D electromagnetics code to HPF. This was a project within the ESPRIT PHAROS project, in which four industrial simulation codes are ported to HPF. The electromagnetic simulation code was ported onto two machines, namely the Meiko CS2 and the IBM SP2. The paper describes the application and the numerical computational features which are important, and then analyses the porting of the code to a parallel machine. A series of phases such as code cleaning, translation to Fortran 90, and inserting HPF directives are described. This is followed by a series of measurements. The first stage in the port was the conversion of the Fortran 77 code to Fortran
90, for which converted-code single processor times on the IBM SP2 were obtained. This was then followed by conversion to HPF code, and performance times over a range of processors were documented. Finally, a comparison with message passing code was carried out. The speed-ups obtained in the benchmarking of this particular code are described as encouraging by the users, and give evidence of the benefits of using HPF.
Parallel Crew Scheduling in PAROS*

Panayiotis Alefragis 1, Christos Goumopoulos 1, Efthymios Housos 1, Peter Sanders 2, Tuomo Takkula 3, Dag Wedelin 3

1 University of Patras, Patras, Greece
2 Max-Planck-Institut für Informatik, Saarbrücken, Germany
3 Chalmers University of Technology, Göteborg, Sweden
Abstract. We give an overview of the parallelization work done in PAROS. The specific parallelization objective has been to improve the speed of airline crew scheduling on a network of workstations. The work is based on the Carmen System, which is used by most European airlines for this task. We give a brief background to the problem. The two most time-critical parts of this system are the pairing generator and the optimizer. We present a pairing generator which distributes the enumeration of pairings over the processors. This works efficiently on a large number of loosely coupled workstations. The optimizer can be described as an iterative Lagrangian heuristic, and allows only for rather fine-grained parallelization. On low-latency machines, parallelizing the two innermost loops at once works well. A new "active-set" strategy makes more coarse-grained communication possible and even improves the sequential algorithm.
1 The PAROS Project
The PAROS (Parallel large scale automatic scheduling) project is a joint effort between Carmen Systems AB of Göteborg, Lufthansa, the University of Patras and Chalmers University of Technology, and is supported under the European Union ESPRIT HPCN (High Performance Computing and Networking) research program. The aim of the project is to generally improve and extend the use and performance of automatic scheduling methods, stretching the limits of present technology. The starting point of PAROS is the already existing Carmen (Computer Aided Resource Management) System, used for crew scheduling by Lufthansa as well as by most other major European airlines. The crew costs are one of the main operating costs for any large airline, and the Carmen System has already significantly improved the economic efficiency of the schedules for all its users. For a detailed description of the Carmen system and airline crew scheduling in general, see [1]. See also [3, 6, 7] for other approaches to this problem. At present, a typical problem involving the scheduling of a medium size fleet at Lufthansa requires 10-15 hours of computing, for large problems as much as 150 hours. The closer to the actual day of operation the scheduling can take
place, the more efficient it will be with respect to market needs and late changes. Increased speed can also be used to solve larger problems and to increase the solution quality. The speed and quality of the crew scheduling algorithms are therefore important for the overall efficiency of the airline. An important objective of PAROS is to improve the performance of the scheduling process through parallel processing. There exist attempts to parallelize similar systems on high end parallel hardware (see [7]), but our focus is primarily to better utilize a network of workstations and/or multiprocessor workstations, allowing airlines such as Lufthansa to use their already existing hardware more efficiently. In section 2 we give a general description of the crew scheduling process in the Carmen system. The following sections report the progress in parallelizing the main components of the scheduling process: the pairing generator in section 3, and the optimizer in section 4. Section 5 gives some conclusions and pointers for future work.
2 Solution Methodology and the Carmen System
For a given timetable of flight legs (non-stop flights), the task of crew scheduling is to determine the sequence of legs that every crew should follow during one or several workdays, beginning and ending at a home base. Such routes are known as pairings. The complexity of the problem is due to the fact that the crews cannot simply follow the individual aircraft, since the crews on duty have to rest in accordance with very complicated regulations, and there must always be new crews at appropriate locations which are prepared to take over. At Lufthansa, problems are usually solved as weekly problems with up to the order of 10^4 legs. For problems of this kind there is usually some top level heuristic which breaks down the problem into smaller subproblems, often to different kinds of daily problems. Due to the complicated rules, the subproblems are still difficult to handle directly, and are solved using some strategy involving the components of pairing generation and optimization. The main loop of the Carmen solution process is shown in Figure 1. Based on an existing initial or current best solution, a subproblem is created by opening up a part of the problem. The subproblem is defined by a connection matrix containing a number of legal leg connections, to be used for finding a better solution. The connection matrix is input to the pairing generator, which enumeratively generates a huge number of pairings using these connections. Each pairing must be legal according to the contractual rules and airline regulations, and for each pairing a cost is calculated. The generated pairings are sent to the optimizer, which selects a small subset of the pairings in order to minimize the total crew costs, under the constraint that every flight leg must receive a crew. In both steps the main algorithmic difficulty is the combinatorial explosion of possibilities, typical for problems of this kind. Rules and costs are handled by a special rule language, which is translated into runnable code by a rule compiler and then called by the pairing generator.
Fig. 1. Main loop in Carmen solution process.

A typical run of the Carmen system consists of 50-100 daily iterations in which 10^4 to 10^6 pairings are generated in each iteration. Most of the time is spent in the pairing generator and the optimizer. For some typical Lufthansa problems, profiling reveals that 5-10% of the time is consumed in the overall analysis and the subproblem selection, about 70-85% is spent in the pairing generator, and 10-20% in the optimizer. These figures can however vary considerably for different kinds of problems. The optimizer can sometimes take a much larger proportion of the time, primarily depending on the size and other characteristics of the problem, the complexity of the rules that are called in the inner loop of the pairing generator, and various parameter settings. As a general conclusion however, it is clear that the pairing generator and the optimizer are the main bottlenecks of the sequential system, and they have therefore been selected as the primary targets for parallelization.
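A rough Amdahl-style calculation (ours, not from the paper) shows why both components matter: with the generator at about 80% of the run time, speeding up the generator alone cannot give more than a factor of five overall.

#include <cstdio>

int main() {
    /* Profile quoted in the text: roughly 80% generator, 20% everything else. */
    const double gen = 0.80, rest = 0.20;
    const double speedups[] = {2.0, 4.0, 8.0, 16.0};
    for (double s : speedups)
        std::printf("generator speedup %4.1f -> overall %.2fx\n",
                    s, 1.0 / (rest + gen / s));
    return 0;
}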
3 Generator Parallelization
The pairing generation algorithm is a quite straightforward depth-first enumeration that starts from a crew base and builds a chain of legs by following possible connections as defined by the matrix of possible leg connections. The search is heuristically pruned by limiting the number of branches in each node, typically to between 5 and 8. The search is also interrupted if the chain created so far is not legal. The parallelization of the pairing generator is based on a manager/worker scheme, where each worker is given a start leg from which it enumerates a part of the tree. The manager dynamically distributes the start legs and the necessary additional problem information to the workers in a demand driven manner. Each worker generates all legal pairings of the subtree and returns them to the manager. The implementation has been made using PVM, and several efforts have been made to minimize the idle and communication times. Large messages are sent whenever possible. Load balancing is achieved by implementing a dynamic
workload distribution scheme that implicitly takes into account the speed and the current load of each machine. The number of start legs that are sent to each worker also changes dynamically, with a fading algorithm. In the beginning, a large number of start legs is given, and in the end only one start leg per worker is assigned. This scheme reconciles the two goals of minimizing the number of messages and of good load balancing. Efficiency is also improved by pre-fetching: a worker requests the next set of start legs before they are needed, and can then perform computation while its request is being serviced by the manager. The parallel machine can also be extended dynamically. It is possible to add a new host at any time to the virtual parallel machine, and this will cause a new worker to be started automatically. Another feature of the implementation is the suspend/resume feature, which makes sure that only idle workstations are used for generation. Periodically the manager requests the CPU load of the worker, suspends its operation if the CPU load for other tasks is over a specified limit, and resumes its operation if the CPU load is below a specified limit. The performance overhead depends on the rate of the load checking operation, which in any case is small enough (< 1%).
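The paper does not give the exact fading rule; the following C++ sketch shows one plausible schedule in the spirit described (large batches of start legs first, shrinking to one per request), similar to guided self-scheduling. The divisor is an assumption.

#include <algorithm>
#include <cstdio>

/* Assumed fading rule: batch size proportional to the remaining work,
   never below one start leg per request. */
int next_batch(int remaining_legs, int workers) {
    return std::max(1, remaining_legs / (2 * workers));
}

int main() {
    int remaining = 1000;
    const int workers = 10;
    while (remaining > 0) {
        int batch = next_batch(remaining, workers);
        std::printf("assign %3d start legs, %4d left\n", batch, remaining - batch);
        remaining -= batch;
    }
    return 0;
}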
3.1 Scalability
For the problems tested, the overhead for worker initialization depends on the size of the problem and consists mainly of the initialization of the legality system and the distribution of the connection matrix. This overhead is very small compared to the total running time. The data rate for sending generated pairings from a worker to the manager is typically about 6 KB/s, assuming a typical workstation and Lufthansa rules. This is to be compared with the standard Ethernet speed of 1 MB/s, so after its start-up phase the generator is CPU-bound for as many as 200 workstations. Given the high granularity and the asynchronous execution pattern, we therefore expect the generator to scale well up to the order of 100 workstations connected with standard Ethernet, of course provided that the network is not too loaded with other communication.
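The bandwidth argument can be checked with a one-line calculation (our arithmetic, using the figures quoted above): at roughly 6 KB/s per worker, a 1 MB/s Ethernet saturates at around 170 workers, consistent with the order-of-magnitude claim of about 200.

#include <cstdio>

int main() {
    const double per_worker_kb_s = 6.0;     /* pairing data per worker */
    const double network_kb_s = 1024.0;     /* usable Ethernet bandwidth */
    std::printf("workers before the network saturates: ~%.0f\n",
                network_kb_s / per_worker_kb_s);
    return 0;
}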
3.2 Fault Tolerance
Several fault tolerance features have been implemented. To be reliable when many workstations are involved, the parallel generator can recover from task and host failures. The notification mechanism of PVM [4] is used to provide application level fault tolerance to the generator. A worker failure is detected by the manager which also keeps the current computing state of the worker. In case of a failure the state is used for reassigning the unfinished part of the work to another worker. The failure of the manager can be detected by the coordinating process, and a new manager will be started. The recovery is achieved since the manager periodically saves its state to the filesystem (usually NFS) and thus can restart from the last checkpoint. The responsibility of the new manager is to reset the workers and to request only the
generation of the pairings that have not been generated, or have been generated partially. This behavior can save a lot of computing time, especially if the fault appears near the end of the generation work.
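A minimal sketch of the manager's periodic checkpoint, assuming a simple text format and file names invented for the example (the paper only states that the state is saved to the filesystem, usually NFS):

#include <cstdio>
#include <vector>

/* Write the identifiers of the start legs already completed, then replace the
   previous checkpoint (rename is atomic on a local POSIX filesystem; over NFS
   this is only an approximation). */
void checkpoint(const std::vector<int>& finished_start_legs) {
    std::FILE* f = std::fopen("manager.ckpt.tmp", "w");
    if (!f) return;
    for (int leg : finished_start_legs)
        std::fprintf(f, "%d\n", leg);
    std::fclose(f);
    std::rename("manager.ckpt.tmp", "manager.ckpt");
}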
3.3 Experimental Results
We have measured the performance of the parallel generator experimentally. The experiments have been performed on a network of HP 715/100 workstations of roughly equivalent performance, connected by Ethernet at the computing center of the University of Patras, during times when almost exclusive usage of the network and the workstations was possible. The results are shown in Table 1: the elapsed time for a single iteration of the generator, in seconds, as a function of the number of CPUs. The speedup is in all cases almost linear, i.e. the generation time decreases almost linearly with the number of CPUs used.

Table 1. Parallel generator times for typical pairing problems.

problem name   legs  pairings  1 CPU  2 CPUs  4 CPUs  6 CPUs  8 CPUs  10 CPUs
lh_dl_gg        946    396908  26460   13771    7061    4536    3466     2818
lh_dl_splimp    946    318938  20760   10797    5448    3686    2797     2181
lh_wk_gg       6196    594560  31380   16834    8436    5338    4312     3288
lh_dl_kopt     1087    159073  10860    5563    2804    1892    1385     1112

4 Optimizer Parallelization
The problem solved by the optimizer is known as the set covering problem with additional base capacity constraints, and can be expressed as

min { cx : Ax ≥ 1, Cx ≤ d, x binary }

Here A is a 0-1 matrix and C is arbitrary non-negative. In this formulation every variable corresponds to a generated pairing and every constraint corresponds to a leg. The size of these problems usually ranges from 50 × 10^4 to 10^4 × 10^6 (constraints × variables), usually with 5-10 non-zero A-elements per variable. In most cases only a few capacity (≤) constraints are present. This problem in its general form is NP-hard, and in the Carmen System it is solved by an algorithm [9] that can be interpreted as a Lagrangian-based heuristic. Very briefly, the algorithm attempts to perturb the cost function as little as possible to give an integer solution to the LP-relaxation. From the point of view of parallel processing it is worth pointing out that the character of the algorithm is very different, and for the problems we consider also much faster, compared to the common branch and bound approaches to integer programming,
and is rather more similar to an iterative equation solver. This also means that the approach to parallelization is completely different, and it is a challenge to achieve this on a network of workstations. The sequential algorithm can be described in the following simplified way that highlights some overall aspects which are relevant for parallelization. On the top level, the sequential algorithm iteratively modifies a Lagrangian cost vector c̄ which is first initialized to c. The goal of the iteration is to modify the costs so that the sign pattern of the elements of c̄ corresponds to a feasible solution, where c̄_j < 0 means that x_j = 1, and where c̄_j > 0 means that x_j = 0. In the sequential algorithm the iteration proceeds by considering one constraint at a time, where the computation for this constraint modifies the values for the variables of that constraint. Note that this implies that values corresponding to variables not in the constraint are not changed by this update. The overall structure is summarized as
reset all s^i to 0
repeat
    for every constraint i
        r^i = c̄^i - s^i
        s^i = function of r^i, b_i and some parameters
        c̄^i = r^i + s^i
    increase the parameter according to some rule
until no sign changes in c̄
In this pseudocode, c̄^i is the sub-vector of c̄ corresponding to non-zero elements of constraint i. The vector s^i represents the contribution of constraint i to c̄. This contribution is cancelled out before every iteration of constraint i, where a new contribution is computed. The vectors r^i are temporaries. A property relevant to parallelization is that when an iteration of a constraint has updated the reduced costs of its variables, then the following constraints have to use these new updated values. Another property is that the vector s^i is usually a function of the current c̄-values for at most two variables in the constraint, and these variables are referred to as the critical variables of the constraint. In the following subsections we summarize the different approaches to parallelization that have been investigated. In addition to this work, an aggressive optimization of the inner loops was performed, leading to an additional speedup of roughly a factor of 3.
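For concreteness, a small self-contained C++ check of the model min{ cx : Ax ≥ 1, Cx ≤ d, x binary } on a toy instance (data invented for the example; this is not the optimizer itself, only a feasibility and cost evaluation):

#include <cstddef>
#include <cstdio>
#include <vector>

bool feasible(const std::vector<std::vector<int>>& A,
              const std::vector<std::vector<double>>& C,
              const std::vector<double>& d,
              const std::vector<int>& x) {
    for (const auto& row : A) {                         /* every leg covered */
        int cover = 0;
        for (std::size_t j = 0; j < row.size(); ++j) cover += row[j] * x[j];
        if (cover < 1) return false;
    }
    for (std::size_t i = 0; i < C.size(); ++i) {        /* base capacity rows */
        double load = 0.0;
        for (std::size_t j = 0; j < C[i].size(); ++j) load += C[i][j] * x[j];
        if (load > d[i]) return false;
    }
    return true;
}

double cost(const std::vector<double>& c, const std::vector<int>& x) {
    double total = 0.0;
    for (std::size_t j = 0; j < c.size(); ++j) total += c[j] * x[j];
    return total;
}

int main() {
    /* 3 legs, 4 candidate pairings; pairing 3 alone covers all legs. */
    std::vector<std::vector<int>> A = {{1,0,1,0}, {0,1,1,0}, {0,0,1,1}};
    std::vector<std::vector<double>> C = {{1,1,1,1}};
    std::vector<double> d = {3};
    std::vector<double> c = {5,4,7,3};
    std::vector<int> x = {0,0,1,0};
    std::printf("feasible=%d cost=%.1f\n", (int)feasible(A, C, d, x), cost(c, x));
    return 0;
}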
4.1 Constraint Parallel Approach
A first observation is that if constraints have no common variables they may be computed independently of each other, and a graph coloring of the constraints would reveal the constraint groups that could be independent. Unfortunately,
the structure of typical set covering problems (with many variables and few long constraints) is such that this approach is not possible to apply in a straightforward way. The approach can however be modified to let every processor maintain a local copy of c̄, and a subset of the constraints. Each processor then iterates its own constraints and updates its local copy of c̄. The different copies of c̄ must then be kept consistent through communication. Although the nature of the algorithm is such that the c̄ communication can be considerably relaxed, the result for our type of problems did not allow a significant speedup on networks of workstations.
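The graph colouring idea can be illustrated with a greedy first-fit pass (the baseline that the heuristics of [8], used later, improve on by 10-20%): constraints that share a variable are adjacent in the conflict graph, and each colour class can then be iterated independently.

#include <vector>

std::vector<int> first_fit_colouring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> colour(n, -1);
    for (int v = 0; v < n; ++v) {
        std::vector<char> used(n + 1, 0);
        for (int u : adj[v])
            if (colour[u] >= 0) used[colour[u]] = 1;
        int c = 0;
        while (used[c]) ++c;          /* first colour not used by a neighbour */
        colour[v] = c;
    }
    return colour;
}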
4.2 Variable Parallel Approach
Another way of parallelization is to parallelize over the variables and the inner loops of the constraint calculation. The main parts of this calculation that concern the parallelization are 1) collect the c̄ values for the variables in the constraint, 2) find the two least values of r^i, and 3) copy the result back to c̄. When the variables are distributed over the processors, each processor is then responsible for only one piece of c̄ and the corresponding part of the A-matrix. Some of the operations needed for the constraint update can be conveniently done locally, but the double minimum computation requires communication. This operation is associative, and it is therefore possible to get the global minimal elements through a reduction operation on the local critical minimum elements. To minimize communication it is also possible to group the constraints and perform the reduction operation for several independent constraints at a time. This strategy has been successfully implemented on an SGI Origin2000, with a speedup of 7 on 8 processors. However, it cannot be used directly on a network of workstations due to the high latency of the network. We have therefore investigated a relaxation of this algorithm, that does a lazy update of c̄. The approach is based on updating c̄ only for the variables required by the constraint group that will be iterated next. The reduced cost vector is then fully updated, based on the contributions of the previous constraint group, during the reduction operation of the current constraint group, and so overlaps computation with both communication and idle time. The computational part of the algorithm is slightly enlarged, and the relaxation requires that constraint groups with a reasonable number of common variables can be created. The graph colouring heuristics that have been used are based on [8], which gives 10-20% fewer groups than a greedy first fit algorithm, although the computation times are slightly higher. For numerical convergence reasons random noise is added in the solution process, and even with minor changes in the parameters the running times can vary considerably; the solution quality can also be slightly different. Tests are shown in Table 2 for pure set covering problems from the Swedish Railways, and the settings have been adjusted to ensure that a similar solution quality
is preserved on the average (usually well within 1% of the optimum), and the problems shown here are considered as reasonably representative for the average case. Running times are given in seconds for the lazy variable parallel implementation for 1 to 4 processors. The hardware in this test has been 4 × HP 715/100 connected with 10 Mbps switched Ethernet. It can be seen that reasonably good speedup results are obtained even for medium-sized problem instances. For networks of workstations a possibility could here also be to use faster networks such as SCI [5] or Myrinet [2] and streamlined software interfaces.
Table 2. Results for the lazy variable parallel code.

problem name    rows  columns   1 CPU  2 CPUs  3 CPUs  4 CPUs
sj_daily_17sc     58     1915    0.60    2.95    3.53    4.31
sj_daily_14sc     94     7388   17.30    9.45   11.78   11.82
sj_daily_04sc    429    38148  288.73  124.76   82.94   72.00
sj_daily_34sc    419   156197  951.53  634.54  365.90  259.32
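The double minimum needed in step 2 above is associative, so local results can be combined pairwise in a reduction tree (for instance with a user-defined MPI reduction operation). A small C++ sketch of the combine step:

#include <algorithm>
#include <limits>

struct Min2 {
    double m1 = std::numeric_limits<double>::infinity();   /* smallest */
    double m2 = std::numeric_limits<double>::infinity();   /* second smallest */
};

Min2 combine(const Min2& a, const Min2& b) {
    double v[4] = {a.m1, a.m2, b.m1, b.m2};
    std::sort(v, v + 4);
    Min2 out;
    out.m1 = v[0];
    out.m2 = v[1];
    return out;
}

Min2 local_min2(const double* r, int n) {
    Min2 acc;
    for (int j = 0; j < n; ++j) {
        if (r[j] < acc.m1)      { acc.m2 = acc.m1; acc.m1 = r[j]; }
        else if (r[j] < acc.m2) { acc.m2 = r[j]; }
    }
    return acc;
}

int main() {
    double a[] = {5, 2, 9}, b[] = {1, 7};
    Min2 g = combine(local_min2(a, 3), local_min2(b, 2));   /* g.m1 == 1, g.m2 == 2 */
    return (g.m1 == 1 && g.m2 == 2) ? 0 : 1;
}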
4.3 Parallel Active Set Approach
As previously mentioned, the constraint update is usually a function of at most two variables in the constraint. If the set of these critical variables was known in advance, it would in principle be possible to ignore all other variables and receive the same result much faster. This is not possible, but it gives intuitive support for a variation of the sequential algorithm, where a smaller number of variables that are likely to be critical are selected in an active set. The idea is to make the original algorithm work only on the active set, and then sometimes do a scan over all variables to see if more variables should become active. For our problems this is especially appealing considering the close relation to the so-called column generation approach to the crew scheduling problem, see [3]. The approach was first implemented sequentially and turns out to work well on real problems. An additional and important performance benefit is that the scan can be implemented by a columnwise traversal of the constraint matrix, which is much more cache friendly and gives an additional sequential speedup of about 3 for large instances. This modified algorithm also gives a new possibility for parallelization. If the active set is small enough, then the scan will dominate the total execution time, even if the active set is iterated many times between every scan. In contrast to the original algorithm, all communication necessary for a global scan of all variables can be concentrated into a single vector-valued broadcast and reduction operation. Therefore, the parallel active set approach is successful even on a network of workstations connected over Ethernet. Some results for pure set covering problems from Lufthansa are shown in Table 3. The settings have been adjusted to ensure that a similar solution quality is preserved on the average.
Running times are given in seconds for the original production code Probl, the new code with the active set strategy disabled, and with the active set strategy for 1, 2 and 4 processors. The hardware in this test has been 4 × Sun Ultra 1/140 connected with a shared 100 Mbps Ethernet.

Table 3. Results of parallel active set code.

problem name  rows  columns  Probl  no active set  1 CPU  2 CPUs  4 CPUs
lh_d126_02     682   642613   9962           4071    843     412     294
lh_d126_04     154   121714   1256            373    216     105      60
lh_dtl_ll     5287   266966   1560            765    298      78      66
lh_dt58_02    5339   409350   2655           2305    924     406     207

From these results, and other more
extensive tests, it can be concluded that in addition to the sequential speedup, we can obtain a parallel speedup of about a factor of 3 using a network of four workstations. It does not make sense however to increase the number of workstations significantly, since it is then no longer true that the scan dominates the total running time.
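A schematic C++ sketch of the active-set control flow described above (the inner update and the admission test are placeholders, not the actual rules of the algorithm):

#include <cmath>
#include <cstddef>
#include <vector>

/* Iterate only the active variables many times, then scan all columns
   (columnwise in the real implementation, hence cache friendly) to admit
   new candidates. */
void active_set_round(std::vector<double>& cbar,
                      std::vector<char>& active,
                      int inner_iters, double threshold) {
    for (int it = 0; it < inner_iters; ++it)
        for (std::size_t j = 0; j < cbar.size(); ++j)
            if (active[j]) cbar[j] *= 0.99;       /* stand-in for the constraint updates */

    for (std::size_t j = 0; j < cbar.size(); ++j) /* the occasional full scan */
        if (!active[j] && std::fabs(cbar[j]) < threshold) active[j] = 1;
}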
5 Conclusions and Future Work
We have demonstrated a fully successful parallelization of the two most time critical components in the Carmen crew scheduling process, on a network of workstations. For the generator, where a coarse grained parallelism is possible, a good speedup can be obtained also for a large number of workstations, when however issues of fault tolerance become very important. Further work includes the management of multiple APC jobs, a refined task scheduling scheme, preprocessing inside the workers (e.g. elimination of identical columns) and enhanced management of the virtual machine (active time windows of machines). For the optimizer, which requires a much more fine grained parallelism, Figure 2 shows a theoretical model of the expected parallel speedup of the different parallelizations, for a typical large problem. Although the variable parallel approach scales better, the total speedup is higher with the active set since the sequential algorithm is then faster. On the other hand, the active set can be responsible for a lower solution quality for some instances. Further tuning of the parallel active set strategy and other parameters is therefore required to improve the stability of the code both with respect to quality and speedup. It may be possible to combine the two approaches but it is not clear if this would give a significant benefit. Also, the general capacity constraints are not yet handled in any of the parallel implementations. The parallel components have been integrated in a first PAROS prototype system that runs together with existing Carmen components.
Fig. 2. Expected parallel speedup.

Given that the generator and the optimizer are now much faster, we consider parallelization also of some other components, such as the generation of the connection matrix, which should be comparatively simple. Other integration issues are the communication between the generator and the optimizer, which is now done through files, and related issues such as optimizer preprocessing.
References

1. E. Andersson, E. Housos, N. Kohl, and D. Wedelin. OR in the Airline Industry, chapter Crew Pairing Optimization. Kluwer Academic Publishers, Boston, London, Dordrecht, 1997.
2. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A gigabit-per-second Local Area Network. IEEE Micro, 15(1):29-36, Feb. 1995.
3. J. Desrosiers, Y. Dumas, M. Solomon, and F. Soumis. Handbooks in Operations Research and Management Science, chapter Time Constrained Routing and Scheduling. North-Holland, 1995.
4. A. Geist. Advanced programming in PVM. Proceedings of EuroPVM 96, pages 1-6, 1996.
5. IEEE. Standard for the Scalable Coherent Interface (SCI), 1993. IEEE Std 1596-1992.
6. S. Lavoie, M. Minoux, and E. Odier. A new approach for crew pairing problems by column generation with an application to air transportation. European Journal of Operational Research, 35:45-58, 1988.
7. R. Marsten. RALPH: Crew Planning at Delta Air Lines. Technical Report, Cutting Edge Optimization, 1997.
8. A. Mehrotra and M. Trick. A clique generation approach to graph coloring. INFORMS Journal on Computing, 8, 1996.
9. D. Wedelin. An algorithm for large scale 0-1 integer programming with application to airline crew scheduling. Annals of Operations Research, 57, 1995.
Cobra: A CORBA-compliant Programming Environment for High-Performance Computing

Thierry Priol and Christophe René

IRISA - Campus de Beaulieu - 35042 Rennes, France
Abstract. In this paper, we introduce a new concept that we call a parallel CORBA object. It is the basis of the Cobra runtime system that is compliant to the CORBA specification. Cobra is being developed within the PACHA Esprit project. It aims at helping the design of high-performance applications using independent software components through the use of distributed objects. It provides the benefits of distributed and parallel programming using a combination of two standards: CORBA and MPI. To support CORBA parallel objects, we propose to extend the IDL language to support object and data distribution. In this paper, we discuss the concept of CORBA parallel object.
1 Introduction
Thanks to the rapid performance increase of today's computers, it can now be envisaged to couple several compute-intensive numerical codes to simulate complex physical phenomena more accurately. Due to both the increased complexity of these numerical codes and their future developments, a tight coupling of these codes cannot be envisaged. A loose coupling approach based on the use of several components offers a much more attractive solution. With such an approach, each of these components implements a particular processing (pre-processing of data, mathematical solver, post-processing of data). Moreover, several solvers are required to increase the accuracy of simulation. For example, fluid-structure or thermal-structure interactions occur in many fields of engineering. Other components can be devoted to pre-processing (data format conversion) or post-processing of data (visualisation). Each of these components requires specific resources (computing power, graphic interface, specific I/O devices). A component which requires a huge amount of computing power can be parallelised so that it will be seen as a collection of processes to be run on a set of network nodes. Processes within a component have to exchange data and have to synchronise. Therefore, communication has to be performed at different levels: between components and within a component. However, requirements for communication between components or within a component are not the same. Within a component, since performance is critical, low level message-passing is required, whereas between components, although performance is still required, modularity/interoperability and reusability are necessary to develop cost effective applications using generic components.
However, till now, low-level message-passing libraries, such as MPI or PVM, are used to couple codes. It is obvious that this approach does not contribute to the design of applications using independent software components. Such communication libraries were developed for parallel programming, so they do not offer the necessary support for designing components which can be reused by other applications. Solutions already exist to decrease the design complexity of applications. Distributed object-oriented technology is one of these solutions. A complex application can be seen as a collection of objects, which represent the components, running on different machines and interacting together using remote object invocations. An existing standard such as CORBA (Common Object Request Broker Architecture) aims at helping the design of applications using independent software components through the use of CORBA objects¹. CORBA is a distributed software platform which supports distributed object computing. However, exploitation of parallelism within such an object is restricted in the sense that it is limited to a single node within a network. Therefore, both parallel and distributed programming environments have their own limitations which do not allow, alone, the design of high performance applications using a set of reusable software components. This paper aims at introducing a new approach that takes advantage of both parallel and distributed programming systems. It aims at helping programmers to design high performance applications based on the assembling of generic software components. This environment relies on CORBA with extensions to support parallelism across several network nodes within a distributed system. Our contribution concerns extensions to support a new kind of object we call a parallel CORBA object (or parallel object), as well as the integration of message-passing paradigms, mainly MPI, within a parallel object. These extensions exploit as much as possible the functionality offered by CORBA and require few modifications to existing CORBA implementations. The paper is organised as follows. Section 2 gives a short introduction to CORBA. Section 3 describes our extensions to the CORBA specification to support parallelism within an object. Section 4 introduces briefly the Cobra runtime system for the execution of parallel objects. Section 5 describes some related works that share some similarities with our own work. Finally, section 6 draws some conclusions and perspectives.
2 An overview of CORBA
CORBA is a specification from the OMG (Object Management Group) [5] to support distributed object-oriented applications. Such applications can be seen as a collection of independent software components, or CORBA objects. Objects have an interface that is used to describe the operations that can be remotely invoked. An object interface is specified using the Interface Definition Language (IDL). The following example shows a simple IDL interface:

¹ For the remainder of the paper, we will simply use object to name a CORBA object.
interface myservice {
    void put(in double a);
    double myop(inout long i, out long j);
};
CORBA Object ' '", ................................................. ,,( ~:1
I
Object
9
I"
9
i'
!l Implementation l I: ~,..........L................................2J" Server node Client node IDLStubs
[
I
Object
I
adaptor
I
" . . . . - ~ - " O b i e c t RequesiBrok-er~ . . . . . . . . ]"
Fig. 1. CORBA system architecture
In this figure, an object located at the client side is bound to an implementation of an object located at the server side. When a client invokes an operation, communication between the client and the server is performed through the Object Request Broker (ORB) thanks to the IDL stub (client side) and the 1DL skeleton (server side). Stub and skeleton are generated by an IDL compiler taking as input the IDL specification of the object. A CORBA compliant system offers several services for the execution of distributed object-oriented applications. For instance, it provides object registration and activation. 3
Parallel
CORBA
object
CORBA was not originally intended to support parallelism within an object. However, some CORBA implementations provide a multi-threading support for the implementation of objects. Such support is able to exploit simultaneously several processors sharing a physical memory within a single computer. Such level of parallelism does not require modification of the CORBA specification since it concerns only the object implementation at the server side. Instead of having one thread assigned to an operation, it can be implemented using several threads. However, the sharing of a single physical memory does not allow a large
1117 number of processors since it could create memory contention. One objective of our work was to exploit several dozen of nodes available on a network to carry out a parallel execution of an object. To reach this objective, we introduce the concept of parallel CORBA object.
3.1
Execution model
Server 1
Parallel server node
Server n
, g - ......................................
.......................................
' , i t Process I '! qr
"i I Object !i[ Implementation
Client node
i.l
-
'
-
I_.~?L I ISkeletonl
', ',
*
' I
' I
Object , Implementation Ill I I',
I_.IDL I I~e,etonl
--.
'
Object Adapter I
,"
Object Adapter ,
!
"i ', i
I
,i
b,
', ', I
'
l
', CORBA object collection 1
i
I h
'
', ,
j
[Process J
parallel CORBAobject ,
',l ',' u
J"
Fig. 2. Parallel CORBA object service execution model.
The concept of parallel object relies on a SPMD (Single Program Multiple Data) execution model which is now widely used for programming distributed memory parallel computers. A parallel object is a collection of identical objects having their own data so that it complies with the SPMD execution model. Figure 2 illustrates the concept of parallel object. From the client side, there is no difference when calling a parallel object comparing to a standard object. Parallelism is thus hidden to the user. When a call to an operation is performed by a client, such operation is executed by all CORBA objects belonging to the collection. Such parallel execution is handled by the stub that is generated by an Extended-IDL compiler, which is a modified version of the standard IDL compiler.
3.2
Extended-IDL
As for a standard object, a parallel object is associated with an interface that specifies which operations are available. However, this interface is described using an IDL we extended to support parallelism. Extensions to the standard IDL aim at both specifying that an interface corresponds to a parallel object and at distributing parameter values among the collection of objects. Extended-IDL is the name of these extensions.
1118 S p e c i f y i n g t h e d e g r e e o f p a r a l l e l i s m The first IDL extension corresponds to the specification of the number of objects of the collection that will implement the parallel object. Modifications to the IDL language consist in adding two brackets to the IDL interface keyword. A parameter can be added within the two brackets to specify the number of objects belonging to the collection. Such parameter can be a "*", that means that the number of objects belonging to the collection is not specified in the interface. The following example illustrates the proposed extension. interface [*] ComputeFEM { typedef double dmat [I00] [I00] ; void initFEM(in dmat mat, in double p) ; void doFEM(in long niter, out double err) ;
}; In this example, the number of objects will be fixed at runtime depending on the available resources (i.e. the number of network nodes if we assume that each object of the collection is assigned to only one node). The implementation of a parallel object may require a given number of objects in the collection to be able to run correctly. Such number may be inserted within the two brackets of the proposed extension. The following example gives an example of a parallel object service which is made of 4 objects. interface[4] ComputeFgM {
}; Instead of giving a fixed number of objects in the collection, a function may be added to specify a valid number of objects in the collection. The following example illustrates such possibility. In that case, the number of objects in the collection may be only a power of 2. interface [n^2] ComputeFEM {
}; It is the responsibility of the runtime system, in our case Cobra, to check whether the number of network nodes has been allocated according to the specification of the parallel object. IDL allows a new interface to inherit from an existing one. Parallel interface can do the same but with some restrictions. A parallel interface can inherit only from an existing parallel interface. Inheritance from standard interface is forbidden. Moreover, inheritance is allowed only for parallel interfaces that could be implemented by a collection of objects for which the number of objects coincides. The following example illustrates this restriction. interface [*] MatrixComponent
{
}; interface In^2] ComputeFEM
: MatrixComponent
{
1119 In this example, interface ComputeFEM derives from interface MatrixComportent, The new interface has to be implemented using a collection having a power of 2 objects. In the following example, the Extended-IDL compiler will generate an error when compiling because inheritance is not valid : interface[3] MatrixComponent
{
}; interface In'2] ComputeFEM
: MatrixComponent
{
}; S p e c i f y i n g d a t a d i s t r i b u t i o n Our second extension to the IDL language concerns d a t a distribution. The execution of a method on a client side will provoke the execution of the method on every objects of the collection. Since, each object of the collection has it own separate address space, we m u s t envisage how to distribute p a r a m e t e r values for each operation. Attributes and types of operation parameters act on the data distribution. Proposed extension of IDL for d a t a distribution is allowed only for parameters of operations defined in a parallel interface. When a standard IDL type is associated with a p a r a m e t e r with an in mode, each object of the collection will receive the same value. When a p a r a m e t e r of an operation has either an o u t or a i n o u t mode, as a result of the execution of the operation, stub generated by the Extended-IDL compiler will get a value from one of the objects of the collection. T h e IDL language provides multidimensional fixed-size arrays which contains elements of the same type. The size along each dimension has to be specified in the definition. We provide some extensions to allow the distribution of arrays among the objects of a collection. D a t a distribution specifications apply for b o t h in, o u t and i n o u t mode. They are similar to the ones already defined by H P F (High Performance Fortran). The following example gives a brief overview of the proposed extension. interface[*] MatrixComponent { typedef double dmat [iO0] [I00] ; typedef double dvec[iO0]; void matrix_vector_mult (in dist [BLOCK] [*] dmat, in dvec v, out dist[CYCLIC] dvec u);
}; This extension consists in the adding of a new keyword (dist) which specifies how an array is distributed among the objects of the collection. For example, the 2D array m a t is distributed by block of rows. Stubs generated by the Extended IDL compiler do not perform the same work when the p a r a m e t e r is an input or an output parameter. With an input parameter, the stub must scatter the distributed array so that each object of the collection received a subset of the whole array. With an output parameter, the stub must do the reverse operation. Indeed, each object of a collection contains a subset of the array. Therefore, the stub is in charge of gathering d a t a from objects belonging to the collection.
1120
Such gathering may include a redistribution of data if the client is itself a parallel object. In the previous example, the number of objects in the collection is not specified in the interface. Therefore, the number of elements assigned to a particular object can be known only at runtime. It is why a distributed array of a given IDL type is mapped to an unbounded sequence of this IDL type. Unbounded sequence offers the advantage that its length is set up at runtime. We propose to extend the sequence structure to store information related to the distribution.
4
A runtime s y s t e m to support parallel C O R B A objects
The Cobra runtime system [2] aims at providing resource allocation for the execution of parallel objects. It is targeted to a network of PCs connected together using SCI [6]. Resource allocation consists in providing network nodes and shared virtual memory regions for the execution of parallel objects. Resource allocation services are performed by the resource management service (RmProcess) of Cobra. It is used when a parallel service must be bind to a client. We propose to extend the _bind method provided by most of the CORBA implementations. Binding to a parallel object differs from the standard binding method. Indeed, a reference to a virtual parallel machine ( v p m ) is given as an argument of the bind method instead of a single machine. The v p m reference is obtained through the Cobra resource allocator. The following example illustrates how to use a parallel object service within the Cobra runtime: // Obtain a reference from the RmProcess service cobra = RmProcess::_bind("cobra.irisa.fr"); // Create a VPM cobra->mkvpm(vpmname, NumberOfNodes, NORES); // Get a reference to the allocated vpm pap_get_info_vpm(~vpm, vpmname); // Obtain a reference from the parallel object service: MatrixComponent cs = MatrixComponent::_bind ( &vpm ); // Invoke an operation provided by MatrixComponent service cs->matrix_vector_mult( a, b, ~c);
The bind method may be called either by a single object, or by all objects belonging to a collection if the client is itself a parallel object.
5
Related works
Several projects deal with environments for high-performance computing combining the benefits of distributed and parallel programming. The RCS [1], NetSolve [3] and Ninf [8] projects provide an easy way to access linear algebra method libraries which run on remote supercomputer. Each method is described
1121 by a specific interface description language. Clients invoke methods thanks to specific functions. Arguments of these functions specify method name and m e t h o d arguments. These projects propose some mechanisms to manage load balancing on different supercomputers. One drawback of these environments is the difficulty for the user to add new functions in the libraries. Moreover, they are not compliant to relevant standard such as CORBA. The Legion [4] project aims at creating a world wide environment for high-performance computing. A lot of principles of CORBA (such as heterogeneity management and object location) are provided by the Legion run-time, although Legion is not CORBA-eompliant. It manipulates parallel objects to obtain high-performance. All these features are in common with our Cobra run-time. However, Legion provides others services such as load balancing on different hosts, fault-tolerance and security which are not present in Cobra. The PARDIS [7] project proposes a solution very close to our approach because it extends the CORBA object model to a parallel object model. A new IDL type is added: c/sequence (for distributed sequence). It is a generalisation of the COI~BA sequence. This new sequence describes data type, data size, and how data must be distributed among objects. In PARDIS, distribution of objects is let to the programmers. It is the main difference with Cobra for which a resource allocator is provided. Moreover in Cobra, extended-IDL allows to describe object parallel services in more details. 6
Conclusion
and
perspectives
This paper introduced the parallel CORBA object concept. It is a collection of standard CORBA objects. Its interface is described using an Extended-IDL to manage data distribution among the objects of the collection. Cobra is being implemented using Orbix from Iona Tech. It has already been tested for building a signal processing application using a client/server approach [2]. For such application, the most computing part of the application is encapsulated within a parallel CORBA object while the graphical interface is a Java applet. This applet, acting as a client, is connected to the server through the CORBA ORB. Current works are now focusing on the experiment of the coupling of numerical codes. Particular attention will be paid on the performance of the ORB which seems to be the most critical part of the software environment to get the requested performance. It is planned, within the PACHA project, to implement an ORB that fully exploits the performance of the SCI clustering technology while ensuring compatibility with existing ORB through standard protocols such as T C P / I P .
References 1. P. Arbenz, W. Gander, and M. Oettli. The Remote Computation System. In HPCN Europe '96, volume 1067 of LNCS, pages 662-667, 1996. 2. P. Beaugendre, T. Priol, G. Alleon, and D. Delavaux. A client/server approach for hpc applications within a networking environment. In HPCN'98, pages 518-525, April 1998.
1122 3. H. Casanova and J. Dongara. NetSolve: A Network Server for Solving ComputationaJ Science Problems. The International Journal of Supercomputer Applications and High Performance Computing, 11(3):212-223, 1997. 4. A. S. Grimshaw, W. A. Wulf, and the Legion team. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 1(40):39 45, January 1997. 5. Object Management Group. The common object request broker: Architecture and specification 2.1, August 1997. 6. Dolphin Interconnect. Clustar interconnect technology. White Paper, 1998. 7. K. Keahey and D. Gannon. PARDIS: CORBA-based Architecture for Applicationlevel Parallel Distributed Computation. In Proceedings of Supercomputing '97, November 1997. 8. M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi. Ninf: A Network Based Information Library for Global World-Wide Computing Infrastructure. In HPCN Europe '97, volume 1225 of LNCS, pages 491 502, 1997.
OCEANS: Optimising Compilers for Embedded ApplicatioNS* Michel Barreteau 1 , Francois Bodin ~, Peter Brinkhaus 3, Zbigniew Chamski 4, Henri-Pierre Charles 1, Christine Eisenbeis ~, John Gurd 6, Jan Hoogerbrugge 4, Ping Hu 5, William Jalby 1, Peter M. W. Knijnenburg 3, Michael O'Boyle ~, Erven Rohou 2, Rizos Sakellariou 6, Andr~ Seznec 2 , Elena A. StShr 6, Menno Treffers4, and Harry A. G. Wijshoff 3 1 Laboratoire PRISM, Universitd de Versailles, 78035 Versailles, France. 2 IRISA, Campus Universitaire de Beaulieu, 35042 Rennes, France. 3 Department of Computer Science, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands. 4 Philips Research, Information and Software Technology, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands. 5 INRIA, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France. 6 Department of Computer Science, The University, Manchester M13 9PL, U.K. Department of Computer Science, The University, Edinburgh EH9 3JZ, U.K. A b s t r a c t . This paper presents an overview of the activities carried out within the ESPRIT project OCEANS whose objective is to investigate and develop advanced compiler infrastructure for embedded VLIW processors. This combines high and low-level optimisation approaches within an iterative framework for compilation.
1 Introduction
Embedded applications have become increasingly complex during the last few years. Although sophisticated hardware solutions, such as those exploiting instruction-level parallelism, aim to provide improved performance, they also create a burden for application developers. The traditional task of optimising assembly code by hand becomes unrealistic due to the high complexity of the hardware and software. Thus the need for sophisticated compiler technology is evident. Within the OCEANS project, the consortium intends to design and implement an optimising compiler that utilises aggressive analysis techniques and integrates source-level restructuring transformations with low-level, machine-dependent optimisations [1, 14, 16]. A major objective of the project is to provide a prototype framework for iterative compilation, where feedback from the low level is used to guide the selection of a suitable sequence of source-level transformations and vice versa. Currently, the Philips TriMedia (TM1000) VLIW processor [8] is used for the validation of the system. In this paper, we present the work that has been carried out during the first 15 months since the project started (September 1996). This has largely concentrated on the development of the necessary compiler infrastructure.
* This research is supported by the ESPRIT IV reactive LTR project OCEANS, under contract No. 22729.
Fig. 1. The Compilation Process.
An overall description of the system is given in Section 2. Sections 3 and 4 present the high-level and the low-level subsystems respectively, while the steps that have been taken towards their integration are highlighted in Section 5. Finally, some results from the initial validation of the system are shown in Section 6, and the paper is concluded with Section 7.
2 An Overview of the OCEANS Compiler System
The OCEANS compiler is centred around two major components: a high-level restructuring system, MT1, and a low-level system for supporting assembly language transformations and optimisations, SALTO, which is coupled with SEA, a set of classes that provides an abstract view of the assembly code, and with tools for software pipelining (PILO) and register allocation (LoRA). Their interaction is illustrated in Figure 1, which shows the overall organisation of the OCEANS compilation process. In particular, a program is compiled in three main steps:
- First, MT1 performs lexical, syntactical and semantic analysis of a source Fortran program (File.f). Also, a sequence of source program transformations can be applied.
- The restructured source program is then fed into the code generator, which generates sequential assembly code that is annotated with instruction identifiers (used to identify common objects in MT1 and SALTO) and a file written in an Interface Language (File.IL) that provides information on data dependences and control flow graphs.
- Finally, SALTO (coupled with SEA) performs code scheduling and register allocation. At this step guarded instructions are created and resource constraints are taken into account.
The above process is repeated iteratively until a certain level of performance is reached. Thus, different optimisations, both at the source level and the low level, are checked and evaluated. An important feature of the system is the existence of a client-server protocol that has been implemented in order to provide easy access to the compiler over the Internet for all members of the consortium. MT1 and the code generator are located at Leiden, and SALTO, SEA, PILO and LoRA are located at Rennes.
3 High-Level Transformations
Optimizing and restructuring compilers incorporate a number of program transformations that replace program fragments by semantically equivalent fragments to obtain more efficient code for a given target architecture. The problem of finding an optimum order for applying them is commonly known as the phase ordering problem. Within the MT1 compilation system [5] this problem is solved by providing a Transformation Definition Language (TDL) [3] and a Strategy Specification Language (SSL) [2]. Transformations and strategies specified in these languages can be loaded dynamically into the compiler.
3.1 Transformation Definition Language
The TDL is based on pattern matching. The user can specify an input pattern, a transformed output pattern and a condition stating when the transformation can be legally and/or beneficially applied. Patterns may contain expression and statement variables. When a pattern is matched against the code, these variables are bound to actual expressions and code fragments, respectively. The expression and statement variables can be used in turn in the specification of the output pattern and the condition. This mechanism allows one to specify a large number of transformations, such as loop interchange, loop distribution or loop fusion. However, it is not powerful enough to express other important transformations, such as loop unrolling. Therefore, the TDL also allows for user-defined functions in the output pattern. User-defined functions are the interface to the internal data structures of the compiler. In this way, any algorithm for transforming and testing code can be implemented and made accessible to the TDL.
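For instance, loop interchange is the kind of rewrite that an input/output pattern pair can describe. The sketch below is purely illustrative (written in C rather than the Fortran source MT1 actually handles) and shows the effect of such a rule when no loop-carried dependence forbids it:

/* Input pattern: the inner loop walks a[][] column by column. */
void add_before(int m, int n, double a[m][n], const double b[m][n])
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            a[i][j] += b[i][j];
}

/* Output pattern: loops interchanged, so the inner loop now walks
 * contiguous rows; legal here because the statement carries no
 * dependence across either loop. */
void add_after(int m, int n, double a[m][n], const double b[m][n])
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i][j] += b[i][j];
}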
3.2 Strategy Specification Language
The order in which transformations have to be applied is specified using a Strategy Specification Language (SSL). It allows the specification of an optimising strategy at a more abstract level than the source code level. This language contains sequential composition of transformations, a choice construct and two repetitive constructs.
Fig. 2. SEA class hierarchy.
Fig. 3. Typical usage of SEA classes.
An IF statement consists of a transformation that acts as a condition, a THEN part and an optional ELSE part. The transformation in the condition can be applied successfully or not. If it is successful, the transformations in the THEN part are to be executed. Optionally, in the ELSE part a list of transformations can be given which should be executed in case the transformation matched but was not applied successfully due to failing conditions. The two repetitive constructs consist of a transformation to be checked and a statement list to be executed if the condition is true or false, respectively. They consist of a WHILE-ENDWHILE and a REPEAT-UNTIL construct. Examples of how to specify strategies in SSL can be found in [2].
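The intended semantics of the IF construct can be paraphrased as follows; this is only an illustrative sketch in C (the enum values and helper functions are invented, not part of the SSL implementation):

typedef enum { MATCH_FAILED, MATCHED_NOT_APPLIED, APPLIED_OK } TfStatus;

TfStatus try_transformation(const char *name);      /* assumed: attempt one transformation */
void     run_list(const char *const names[]);       /* assumed: run a transformation list  */

void ssl_if(const char *cond, const char *const then_list[], const char *const else_list[])
{
    switch (try_transformation(cond)) {
    case APPLIED_OK:                /* condition applied successfully            */
        run_list(then_list);        /* -> execute the THEN part                  */
        break;
    case MATCHED_NOT_APPLIED:       /* pattern matched but its conditions failed */
        if (else_list)
            run_list(else_list);    /* -> execute the optional ELSE part         */
        break;
    case MATCH_FAILED:              /* no match: nothing to execute              */
        break;
    }
}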
4 Low-Level Optimisations
Low-level optimisations are built on top of SALTO, a retargetable system for assembly language transformation and optimisation [15]. To facilitate the implementation of optimisations, a set of classes has been designed, SEA (SALTO Enhanced Abstraction), that provides an abstract view of the assembly code which is more pertinent to the code scheduling and register allocation problems. The most important features of SEA are that it allows the evaluation of various code transformations before producing the final code, and that it separates the implementation of the global low-level optimisation strategy from the implementation of individual optimisation sequences. The SEA model contains two kinds of objects:
- Code fragments. The following objects can be used: seaINST, an instruction object; seaCF, an unstructured set of code fragments; seaSCF, a structured subset of the control flow graph with a unique entry point; seaSBk, a superblock; seaBBk, a basic block; and finally, seaLoop, a structured piece of code that has loop properties. Figure 2 illustrates the corresponding class hierarchy.
- Transformations, to be applied to subgraphs. All transformations are characterized by the following main methods: preCond() returns the set of control flow subgraphs that qualify for the transformation; apply() applies the transformation to a given subgraph; and finally, getStatus() returns
the status of the transformation after application (success or failure) and the reason for the failure.
The usage of the SEA objects is shown in Figure 3. A transformation is tried on a cloned piece of code; then, according to performance or size criteria, one of the solutions found is chosen and propagated to the low-level program representation using the rebuild() method. The optimisations currently available within SEA are: register renaming, superblock construction [12], guard insertion [11], loop unrolling (also available at the high level), local/superblock scheduling [12], software pipelining, and register allocation. The implementation of software pipelining is based on the tools PILO [17] and LoRA [9], which generate a modulo scheduling of the loop body.
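The clone-and-commit usage of Figure 3 can be paraphrased roughly as below. This is an illustrative C sketch only: SEA itself is a set of classes, and apart from the method names preCond, apply, getStatus and rebuild mentioned above, every identifier here is invented.

typedef struct seaCF seaCF;                      /* opaque code fragment                  */

typedef struct {
    seaCF *(*preCond)(seaCF *cf);                /* subgraphs that qualify                */
    int    (*apply)(seaCF *subgraph);            /* perform the transformation            */
    int    (*getStatus)(void);                   /* success or failure after application  */
} Transformation;

seaCF *clone_fragment(const seaCF *cf);          /* assumed helper                        */
double schedule_cost(const seaCF *cf);           /* assumed: cycles or size criterion     */
void   rebuild(seaCF *program, const seaCF *chosen);   /* propagate the chosen solution  */

void try_and_commit(seaCF *cf, Transformation *t)
{
    seaCF *candidate = clone_fragment(cf);       /* transformation is tried on a copy     */
    seaCF *target = t->preCond(candidate);
    if (target && t->apply(target) && t->getStatus()
        && schedule_cost(candidate) < schedule_cost(cf))
        rebuild(cf, candidate);                  /* keep the better of the two variants   */
}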
5 Integration
5.1 The Interface Language
In order to transmit information between the various components of the compiler, an Interface Language (IL) was designed. This allows the propagation of information, such as data dependences and loop control data, from MT1 to SALTO, as well as feedback information from the scheduled code back to MT1. An IL description consists of three sections: a list of keywords that specifies the list of attributes that can apply to an object; a default level setting that indicates the type of code the objects belong to; and a list of object references which specify the nature, contents and attributes of an object. More details on the IL can be found in [7].
5.2 Information Forwarded and Feedback
Data dependence information is propagated from MT1 to SALTO and is used for memory disambiguation. The feedback from SEA to MT1 (file Report.IL in Figure 1) contains information on the code structure, the basic blocks, as well as a record of the transformations that were applied. Data related to each basic block include the total number of assembly instructions, the critical path for scheduling the code, the number of cycles of the scheduled code, and a grouping depending on the nature of the instructions. Examples can be found in [7]. MT1 uses the feedback from SALTO in order to build an internal data structure that can be accessed by the TDL and the SSL by means of user-defined functions in the condition, which can check for the identity of a code fragment and suggestions made by SALTO. When such a transformation is used as the condition for an IF construct in the SSL, we are able to select the transformations we want to apply to this fragment. Initially, MT1 compiles the program without performing any restructuring and the compiled program is scheduled by SALTO. SALTO identifies the code fragments that can be improved. It reports its diagnostics to a cost model that makes a decision on what kind of restructuring could be performed next. Then, MT1
reads the suggestions for restructuring and performs these. It is intended that a transformation sequence for a given program fragment is selected by following a systematic approach for searching through a domain of possible transformations. First, each different transformation is applied once, and then the same follows for each branch of the tree. The search space is minimised by using a threshold condition for terminating branches that are not likely to yield an optimum result in their descendants. Some preliminary experiments using this strategy can be found in [10]; further work is in progress.
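Schematically, this search over transformation sequences with threshold-based pruning could look as follows; the sketch is illustrative C, and the cost and bound functions are assumptions rather than OCEANS code.

#define N_TRANSFORMS 8                      /* assumed size of the transformation pool       */

typedef struct Fragment Fragment;           /* opaque program fragment                       */
Fragment *apply_transform(const Fragment *f, int t);    /* assumed: NULL if inapplicable     */
double    scheduled_cycles(const Fragment *f);          /* cost reported by the low level    */
double    optimistic_bound(const Fragment *f);          /* assumed lower bound on descendants */

static const Fragment *best;
static double          best_cycles = 1e30;

void search(const Fragment *f, int depth, int max_depth)
{
    double c = scheduled_cycles(f);
    if (c < best_cycles) { best_cycles = c; best = f; }
    if (depth == max_depth || optimistic_bound(f) >= best_cycles)
        return;                             /* threshold: this branch cannot improve         */
    for (int t = 0; t < N_TRANSFORMS; t++) {
        Fragment *child = apply_transform(f, t);   /* one transformation per tree edge       */
        if (child)
            search(child, depth + 1, max_depth);
    }
}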
6 Validation of the Initial System
In order to validate the compiler, four public domain multimedia codes have been selected [4]. These are a low bit-rate encoder/decoder for H.263 bitstreams, an MPEG2 encoder/decoder, an implementation of the CCITT G.711, G.721 and G.723 voice compression standards, and the Persistence of Vision Ray-Tracer for creating 3D graphics. At the high level, initial experimentation aimed at identifying those transformations that appear to be the most crucial in optimising code scheduling. Inspection of the benchmarks revealed that they contain many imperfectly nested double or triple loops with much overhead due to branch delays. In order to deal with such loops, a transformation that converts the imperfectly nested loop into a single loop has been suggested [13]. At the low level, the initial validation of the system has been carried out by applying four different optimisation sequences:
- S0 is the simplest sequence. First, the code is scheduled locally and then register allocation is performed.
- S1(u) is based on unrolling the loop body u times. The unrolled body is transformed into a superblock by guarding instructions. As in S0, register allocation is performed after local scheduling. (A source-level sketch of such unrolling is given below.)
- S2(u) is similar to S1(u) except that register allocation is performed before scheduling. This usually requires fewer registers, allowing this sequence to succeed when S1(u) fails due to a lack of registers.
- S3 consists in applying a software pipelining algorithm.
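As a source-level illustration of what S1(2) does to the first loop of Figure 4 (the real sequence unrolls and guards at the assembly level, and the element types here are assumed), the unrolled body with its remainder iteration looks like this:

void loop1_unrolled_by_2(int xa, int xb, int d[], const int s[], const int om[])
{
    int i;
    for (i = xa; i + 1 < xb; i += 2) {   /* loop body unrolled twice (u = 2)             */
        d[i]     = s[i]     * om[i];
        d[i + 1] = s[i + 1] * om[i + 1];
    }
    if (i < xb)                          /* remainder iteration; at the assembly level   */
        d[i] = s[i] * om[i];             /* this is absorbed by guarded instructions     */
}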
The above optimisation sequences were validated and indicative results, using the most time-consuming loops of H263, are illustrated in Figure 4. Every optimisation sequence has been applied to each of the six selected loops, and the size of the resulting VLIW code and the speed of the loop, i.e. the number of cycles per iteration, were computed. From the table, a well-known result is observed: the more we unroll a loop, the faster it runs (cf. columns S1(2), S1(3), S1(4)), but at the expense of a larger code size. As expected, S2(2) yields poor performance and large code because of the presence of false dependences. Finally, software pipelining (S3) gives the best performance, but at the expense of a very large increase in code size. Note that this transformation failed with the last loop, due to a lack of registers.
Optimisation sequences        S0   S1(2)  S1(3)  S1(4)  S2(2)  S3
Loop 1: for (i=xa; i<xb; i++) { d[i] = s[i]*om[i]; }
  speed                        8    6      5      5      7      3
  size                         8    12     16     20     13     75
Loop 2: for (i=xa; i<xb; i++) { d[i] += s[i]*om[i]; }
  speed                        9    7      6      6      10     5
  size                         9    13     18     22     19     55
Loop 3: for (i=xa; i<xb; i++) { dp[i] += (((unsigned int)(sp[i]+sp2[i]+1))>>1)*om[i]; }
  speed                        12   8      8      7      12     6
  size                         12   16     24     28     24     121
Loop 4: for (i=xa; i<xb; i++) { dp[i] += (((unsigned int)(sp[i]+sp[i+1]+1))>>1)*OM[r[j]][i]; }
  speed                        15   10     9      9      16     6
  size                         15   20     28     34     31     172
Loop 5: for (i=xa; i<xb; i++) { dp[i] += (((uint)(sp[i]+sp2[i]+sp[i+1]+sp2[i+1]+2))>>2)*om[i]; }
  speed                        15   10     10     8      17     7
  size                         15   19     29     33     33     179
Loop 6: for (k=0; k<5; k++) { xint[k]=nx[k]>>1; xh[k]=nx[k]&1; yint[k]=ny[k]>>1; yh[k]=ny[k]&1; s[k]=src+lx2*(y+yint[k])+x+xint[k]; }
  speed                        19   13     12     11     30     -
  size                         19   25     36     44     59     -
Fig. 4. Time-consuming loops extracted from H263.
In most embedded applications, it is necessary to answer global questions such as: "Given a maximum code size, what is the highest performance that can be achieved?", or "Given a performance goal, what is the smallest code size that can be achieved?". Within the OCEANS compiler this trade-off is evaluated quantitatively by applying a novel compiler strategy. Thus, the choice of the most suitable optimisation is made a posteriori, when the impact of each possible transformation is known. More details can be found in [6].
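Once every variant of Figure 4 has been evaluated, the a posteriori choice can be as simple as picking, for each loop, the fastest variant that still fits the remaining code-size budget. The following C fragment is only an illustration of this idea; the actual strategy is the one described in [6].

typedef struct { double cycles_per_iter; int size; } Variant;

/* Return the index of the fastest variant that fits the remaining code-size
 * budget, or -1 if none fits. */
int pick_variant(const Variant v[], int n, int size_budget)
{
    int best = -1;
    for (int k = 0; k < n; k++)
        if (v[k].size <= size_budget &&
            (best < 0 || v[k].cycles_per_iter < v[best].cycles_per_iter))
            best = k;
    return best;
}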
7 Conclusion and Future Work
The previous sections outlined the current status of the OCEANS compiler. Although the results obtained so far, using the initial prototype, are satisfactory (compared with a production compiler), the implementation work still continues on both the high and low levels. A major part of the work during the next months and until the end of the project is devoted to the integration of the two levels, the development of a prototype framework for iterative compilation, and its experimental evaluation and tuning. Finally, it is intended that the system be
made publicly available in due time (at the moment SALTO is available on request).
References 1. B. Aarts, et al. OCEANS: Optimizing Compilers for Embedded Applications. In C. Lengauer, M. Griebl, S. Gorlatch (Eds.), Proceedings of Euro-Par'97, Lecture Notes in Computer Science 1300, Springer-Verlag, 1997, pp. 1351-1356. 2. R. A. M. Bakker, F. Bregt, P. M. W. Knijnenburg, P. Touber, and H. A. G. Wijshoff. Strategy Specification Language. OCEANS Deliverable D1.2a, 1997. 3. A. J. C. Bik, P. J. Brinkhaus, P. M. W. Knijnenburg, P. Touber, and H. A. G. Wijshoff. Transformation Definition Language. OCEANS Deliverable D1.1, 1997. 4. A. J. C. Bik, P. J. Brinkhaus, P. M. W. Knijnenburg, P. Touber, H. A. G. Wijshoff, W. Jalby, H.-P. Charles, M. Barreteau. Identification of Code Kernels and Validation of Initial System. OCEANS Deliverable D3.1c, 1997. 5. A. J. C. Bik and H. A. G. Wijshoff. MT1: A Prototype Restructuring Compiler. Technical Report 93-32, Department of Computer Science, Leiden University, 1993. 6. F. Bodin, Z. Chamski, C. Eisenbeis, E. Rohou, and A. Seznec. GCDS: A Compiler Strategy for Trading Code Size Against Performance in Embedded Applications. Research Report 1153, IRISA, 1997. 7. F. Bodin and E. Rohou. High-level Low-level Interface Language. OCEANS Deliverable D2.3a, 1997. 8. B. Case. Philips Hope to Displace DSPs with VLIW. Microprocessor Report, 8(16), 5 Dec. 1994, pp. 12-15. See also http://www.trimedia-philips.com/ 9. C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: a new model for loop cyclic register allocation. Proceedings of PACT'95 (Cyprus, June 1995). 10. J. Gurd, A. Laffitte, R. Sakellariou, E. A. Stöhr, Y. T. Chu, and M. F. P. O'Boyle. On Compile-Time Cost Models. OCEANS Deliverable 1.2b, 1997. 11. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE, 83(12), Dec. 1995, pp. 1625-1639. 12. W. Hwu, et al. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1), May 1993, pp. 229-248. 13. P. M. W. Knijnenburg. Flattening: VLIW Code Generation for Imperfectly Nested Loops. Proceedings CPC'98, 1998. To appear. 14. OCEANS Web Site at http://www.wi.leidenuniv.nl/Oceans/ 15. E. Rohou, F. Bodin, A. Seznec, G. Le Fol, F. Charot, F. Raimbault. SALTO: System for Assembly-Language Transformation and Optimization. Technical Report 1032, IRISA, June 1996. See also http://www.irisa.fr/caps/Salto/ 16. R. Sakellariou, E. A. Stöhr, and M. F. P. O'Boyle. Compiling Multimedia Applications on a VLIW Architecture. Proceedings of the 13th International Conference on Digital Signal Processing (DSP97) (Santorini, July 1997), vol. 2, IEEE Press, 1997, pp. 1007-1010. 17. J. Wang, C. Eisenbeis, M. Jourdan, and B. Su. Decomposed Software Pipelining: a New Perspective and a New Approach. International Journal on Parallel Processing, 22(3), 1994, pp. 357-379.
Industrial Stochastic Simulations on a European Meta-Computer Ken Meacham, Nick Floros, and Mike Surridge Parallel Applications Centre, 2 Venture Road, Chilworth, Southampton, SO17 7NP, UK.
{kem,nf,ms}@pac.soton.ac.uk http://www.pac.soton.ac.uk/
Abstract. This paper outlines the experiences of running a large stochastic multi-body simulation across a pan-European meta-computer, to demonstrate the use of the PROMENVIR tool within such a large-scale WAN environment. We describe the meta-application management approach developed by PAC and discuss the technical issues raised by this experiment.
1 Introduction
PROMENVIR (PRObabilistic MEchanical desigN enVIRonment) is an advanced meta-computing tool for performing stochastic analysis of generic physical systems, which has been developed within the PROMENVIR ESPRIT project (No. 20189) [1, 2]. The tool provides a user with the framework for running a stochastic Monte Carlo (MC) simulation, using any preferred deterministic solver (e.g. NASTRAN), then provides sophisticated statistical analysis tools to analyse the results. PROMENVIR facilitates investigations into the reliability of components, by analysing how uncertainties in their manufacture, deployment or environment affect key mechanical properties such as stresses. PROMENVIR does this by generating many hundreds of analysis "shots", in which the uncertain parameters are generated from statistical distributions selected by the user. Key output values are then extracted from each analysis shot, and can be analysed to determine the sensitivity of the component with respect to uncertain parameters, correlations between physical properties and behaviour, and clustering characteristics (e.g. failure set distributions). PROMENVIR is an open environment, and can generate analyses for essentially any solver. Since solver runs are independent, the overall simulation is intrinsically parallel and is therefore an ideal candidate for exploiting parallel HPC resources, which may include heterogeneous clusters of workstations, MPP or SMP platforms. These resources may reside within a company Local Area Network (LAN), or may be accessible over a geographically-distributed Wide Area Network (WAN). The only restriction on the use of resources within a PROMENVIR simulation is the availability of hosts and solver licenses. The implication of this is that,
by co-operating with partner sites across a WAN, a PROMENVIR stochastic simulation of many hundreds of solver runs may be executed within a few hours, rather than many days, hence providing a rapid turnaround time for analysis of new designs. One of the main challenges within the PROMENVIR project was therefore to demonstrate the effectiveness of the PROMENVIR package within a pan-European meta-computing (WAN) environment, to solve a large stochastic problem of industrial significance. The Parallel Applications Centre (PAC) has successfully set up and run a series of "WAN experiments", involving partner sites within a European consortium. This paper outlines the experiments which took place and the technical issues which arose, and summarizes the results.
2 Testcase for WAN Experiment
The most widely available solver within the PROMENVIR consortium was the Multi-Body Simulation code SIMAID, developed by CEIT in San Sebastian [3], so it was decided that a suitable demonstration would be set up using this code as the basis for a large stochastic simulation. The testcase chosen was a simulation of a satellite antenna deployment, during which the antenna unfolds to give a planar rim (see Figs 1-3). The antenna was modelled using a set of beams, with springs and dampers at the joints. A single deterministic simulation had already been carried out, which predicted that the outer rim of the antenna would be essentially flat. However, the antenna simulation model was constructed using components with identical properties, and did not take into account the manufacturing tolerances inherent within the components and their assembly.
Fig. 1. SIMAID antenna model (folded)
Fig. 2. SIMAID antenna model (partially deployed)
Fig. 3. SIMAID antenna model (fully deployed)
The PROMENVIR Monte Carlo simulation consisted of 500-1000 shots of the SIMAID solver, for the antenna testcase (at least 500 shots were required for convergence of results). For each shot, values for the critical component parameters (e.g. spring stiffness) were chosen from a normal distribution, which modelled the known tolerances. The results of the stochastic simulation could then be analysed to see the effect of manufacturing tolerances on the planarity of the deployed antenna, and hence the effectiveness of the antenna.
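A minimal sketch of how one shot's input could be drawn is given below; it is illustrative C only (PROMENVIR's own samplers, and the actual tolerance values used, are not shown here), using the Box-Muller transform to obtain a normal deviate.

#include <math.h>
#include <stdlib.h>

/* Standard normal deviate via the Box-Muller transform. */
double normal01(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* in (0,1), avoids log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
}

/* One Monte Carlo shot: perturb a nominal spring stiffness according to a
 * manufacturing tolerance expressed as a standard deviation. */
double sample_stiffness(double nominal, double sigma)
{
    return nominal + sigma * normal01();
}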
3 The European Meta-Computer
A total of 102 CPUs were provided for the experiment, from within the consortium (see Fig. 4). These were restricted to SGI architectures, since SIMAID was only available for the IRIX or IRIX64 operating systems. However, many different types of SGI machines were made available, ranging from older Indigo R3000 workstations up to a 64-CPU Origin (at UPC, Barcelona). The topology of this hardware configuration is shown geographically in Fig. 4. The master host was running at PAC, while slave hosts were used in the UK, Spain, Germany and Italy, at various partner sites. We refer to such a meta-computer as a Parallel Virtual Computer (PVC). SIMAID was installed and licensed on each machine, and access permissions were set up to allow remote access from the master SGI Indy at PAC, on which the PROMENVIR front end and resource manager were running.
4 Technical Issues
4.1 Meta-Application Manager
A fundamental component of the package is the Advanced Parallel Scheduler (APS), which has been developed at the Parallel Applications Centre (PAC). This has been designed as a meta-application manager, which orchestrates the use of the PVC by an application such as PROMENVIR. The APS is capable of making intelligent resourcing decisions, based on performance models (developed at PAC) of the application and resources. These models are used to predict resource usage for a solver in terms of CPU, memory, disk space and I/O traffic, and hence allow capacity planning of the available resources before submission of any jobs. A hosts (PVC) database is set up, using the System Definition tool (SDM), which automatically obtains characteristics for each required host within the LAN or WAN. Performance characteristics for hosts may also be entered here, to be used by the APS. The APS creates daemons on remote hosts, which are responsible for submitting jobs, copying files and communicating load information back to the master workstation. It is capable of initiating UNIX processes directly, but can also be used to control and submit meta-applications via conventional load-sharing software, such as LSF from Platform Computing [4].
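Schematically, the capacity-planning decision can be thought of as follows. This is an illustrative C sketch with invented names; the real APS models and protocols are considerably more elaborate.

typedef struct {
    double cpu_speed;          /* relative CPU performance of the host   */
    double free_memory_mb;
    double free_disk_mb;
} Host;

typedef struct {
    double work_units;         /* predicted CPU work of one solver shot  */
    double memory_mb;          /* predicted memory requirement           */
    double disk_mb;            /* predicted scratch-disk requirement     */
} JobModel;

/* Pick the host with the smallest predicted run time among those that can
 * actually hold the job; return -1 if no host fits. */
int choose_host(const Host h[], int nhosts, const JobModel *job)
{
    int best = -1;
    double best_time = 0.0;
    for (int i = 0; i < nhosts; i++) {
        if (h[i].free_memory_mb < job->memory_mb || h[i].free_disk_mb < job->disk_mb)
            continue;                                   /* capacity check      */
        double t = job->work_units / h[i].cpu_speed;    /* predicted run time  */
        if (best < 0 || t < best_time) { best = i; best_time = t; }
    }
    return best;
}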
PROMENVIR Multi-National Meta-Computer (UK, D, E, I), NCPU = 102
Fig. 4. Meta-computer (PVC) topology for WAN experiment
4.2 Remote Host Access and Security
A major part of the work involved in setting up the WAN experiment was ensuring remote host accessibility. For example, some partners were able to set up, quite quickly, remote access to their hosts via rsh (remote shell); these sites generally consisted of academic institutions, where access is not normally too restricted. However, several partners (including PAC) had security features including firewall machines. This was a major obstacle, since rsh access is normally restricted and, even where available, a firewall machine generally allows access in one direction only. This makes it impossible to set up two-way rsh access between two secure sites through two firewalls (i.e. one at each end). A further requirement on the meta-computer was to make any file transfers between remote sites both robust and secure. For these reasons, we decided to explore the use of the SSH/SCP protocols, which turned out to solve most of our problems and requirements. SSH (Secure Shell) and SCP (Secure Remote Copy), developed by Data Fellows [5, 6], provide secure TCP/IP connections between trusted sites. Features include:
- strong authentication (via RSA keys)
- automatic data encryption and compression
- access through firewalls (via SSH's officially registered port 22)
We were also able to exploit an advanced SSH facility called port forwarding, which enables TCP traffic to be forwarded to a remote host via the secure connection. Various partner sites already had an SSH facility, while others needed to install it. By using the SSH and SCP protocols for all TCP connections, and by configuring partner firewalls to allow SSH access, we were then able to submit tasks transparently to the PVC, via the APS, and receive back results from solver runs, for collection by the PROMENVIR application. This demonstrated sharing of resources between sites for a single meta-application, even between sites which both have firewall security.
5 Running the Experiment
Several smaller-scale tests of PROMENVIR were carried out, firstly using machines that were available and easily accessible (i.e. without firewalls). It was soon found that the unreliability of the network was the major factor in any lost or failed solver runs (shots). Due to the statistical nature of the PROMENVIR environment, it is not critical if some shots are lost, since results will still converge, as long as enough shots are completed in total. However, we were finding that up to 10-20% of the remote tasks could fail, due to slow and unreliable connections to certain sites (particularly to Italy). This figure was unacceptable, particularly if the WAN was intended for use with other (non-stochastic) types of application. We therefore made several improvements to the APS, to enhance robustness and optimise data transfer. For example, rather than attempting single file transfers, we set up the APS to perform multiple retries. If a host could not be contacted after this, it was assumed to be down, and no further tasks were sent to it. As more and more machines were made available, we were able to increase the number of CPUs used for the experiments. Our target was to reach 100 CPUs across the meta-computer.
6 WAN Experiment Findings
6.1 Results
A summary of the most successful experiment is shown in Table 1. This shows a list of the partner sites, and their contributions to the meta-computer in terms of CPUs provided (as SGI workstations or SMP platforms). Of the 8 sites represented in this table, 5 had firewall security.
6.2 Availability and Reliability
Of course, the fact that remote hosts are set up and installed does not necessarily imply that they will be available at the time of running the experiment.
Table 1. PVC host usage statistics during WAN experiment

Partner             Nproc  In PVC  Access  Used  Failed  Successful  Total
PAC                 15     15      15      14    1       150         151
So'ton University   10     10      9       6     0       40          40
UPC                 16     16      16      16    1       275         276
RUS                 12     12      11      9     0       104         104
CASA                15     15      15      14    15      184         199
CEIT                12     12      12      8     0       98          98
ItalDesign          11     11      6       5     2       63          65
Blue                11     11      11      7     6       61          67
Grand Totals        102    102     95      79    25      975         1000

Total CPUs installed: 102. Total CPUs defined in PVC: 102. Total CPUs available: 95. Total CPUs used in WAN: 79.
Elapsed Execution Time: 4:39:16. Approx Single CPU Time: 250 hrs.
Of the 102 CPUs defined in the PVC, only 95 could be contacted at the time of running the experiment. We found that this number fluctuated during the day; this was mainly due to the network load, but was sometimes due to machines being down, or turned off. Of the 95 CPUs available to PROMENVIR via the APS (i.e. with daemons running), 79 were actually used during the simulation to run solver tasks. The reason for this lies in how the APS decides to allocate tasks to hosts. The CPU load is measured on each remote host (via the UNIX uptime command), and sent back to the APS. If this is below a certain threshold and the performance models indicate that it is advantageous to do so, a task will be submitted to that host; otherwise the task will be scheduled elsewhere. We found that jobs were already running on certain hosts (generally those which had not been provided exclusively for the WAN experiment), so these were not used by the APS to run PROMENVIR tasks. However, we also found that some machines, which were apparently not running any jobs, were still not used by the APS; these hosts still appeared overloaded. We eventually traced this problem to certain SGI workstations which were running screen-lock programs, which fully loaded the machine's CPU resources when no other (higher-priority) jobs were running. A total of 1000 SIMAID solver shots were submitted to the meta-computer, of which 975 completed successfully, and the total elapsed time of the PROMENVIR simulation was only 4h:39m:16s. This compares with an approximate time of 250 hours on a typical SGI workstation (a single SIMAID run taking around 15 mins). The 25 shot failures were still mainly due to network problems, which will always be present to some extent within a large meta-computer, unless very fast and reliable links are used (e.g. ATM). A handful of shots were lost,
due to solver problems. In certain extreme cases, combinations of random input parameters can cause the solver to crash, though this is rare.
6.3 Analysis of Results
The detailed analysis of the results is beyond the scope of this paper, and will be published elsewhere, in conjunction with results from further simulations which are being carried out using differing ranges of manufacturing tolerances. However, preliminary results show that, when a stochastic approach is used, there is a broader variation in the deployed geometry, and hence a poorer planarity in the dish than predicted with the deterministic simulation. By using the stochastic approach, further insights may be obtained than were previously available. For example, by examining the planarity variation according to the standard deviation of manufacturing tolerances, it is possible to work back and ask questions such as "what tolerances must be ensured in the manufacturing process, to produce an antenna of a certain planarity?". These types of questions are highly relevant to industry.
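One simple way to quantify the planarity discussed above is the RMS deviation of the antenna rim nodes from their mean plane; the sketch below (in C) is only an illustration of such a metric, not the statistical post-processing actually used in PROMENVIR.

#include <math.h>

/* RMS deviation of the rim nodes' out-of-plane coordinates from their mean;
 * a perfectly planar rim gives 0.  Assumes n > 0. */
double rim_planarity_rms(const double z[], int n)
{
    double mean = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++) mean += z[i];
    mean /= n;
    for (int i = 0; i < n; i++) ss += (z[i] - mean) * (z[i] - mean);
    return sqrt(ss / n);
}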
7 Conclusions
PROMENVIR has been demonstrated as a highly useful tool for running industrial stochastic simulations across a European meta-computer, showing that corporate-scale meta-computing resources really can be used to solve large problems for industry. The APS, developed at PAC, has facilitated management of meta-applications and meta-computing resources (including firewall-protected systems), and is also being used with other meta-computing applications [7, 8]. Results of the WAN experiment have shown that, once a meta-computer has been set up, further experiments can be carried out routinely. However, the overheads of setting up remote sites can be quite large. Furthermore, it has been found to be difficult to obtain full (exclusive) access to machines during a large-scale experiment without the full cooperation of contributing partners. Our experience suggests that, due to a combination of background loads, network performance and human factors (e.g. failure to log out of a machine), around 80% resource availability is the maximum which can realistically be expected for large-scale meta-applications. Finally, there will always be some inherent unreliability within a large-scale meta-computing resource, mainly due to network problems. These must be eliminated as much as possible and, where uncontrollable problems arise, the system and application must be made as robust as possible, to ensure that any simulation does not suffer fatally from individual task failures. Acknowledgements The work reported here was part of the ESPRIT PROMENVIR project. We are grateful to our partners (CASA, UPC, CEIT, RUS, Atos, ItalDesign and Blue Engineering) for a most fruitful collaboration.
References 1. Marczyk, J.: Meta-Computing and Computational Stochastic Mechanics. Computational Stochastic Mechanics in a Meta-Computing Perspective (1997) 1-18, Ed. J. Marczyk 2. PROMENVIR product web page: http://www.atos-group.de/cae/index-promenvir.htm 3. CEIT home page: http://www.ceit.es/ 4. Platform Computing (LSF) home page: http://www.platform.com 5. Data Fellows home page: http://www.datafellows.com/ 6. Further information on SSH may be found at: http://www.cs.hut.fi/ssh/ 7. Upstill, C.: Multinational Corporate Metacomputing. Proceedings of the ACM/IEEE SC97 Conference, San Jose, CA, November 15-21, 1997 8. Hey, A.J.G., Scott, C.J., Surridge, M. and Upstill, C.: Integrating Computation and Information Resources - An MPP Perspective. 3rd Working Conference on Massively Parallel Programming Models (MPPM-97), London, November 12-14, 1997
Porting the SEMC3D Electromagnetics Code to HPF Henri Luzet 1 and L.M. Delves 2 1 SEMCAP, 20 rue Sarinen, Silic 270 RUNGIS 94578, France 2 N.A Software Ltd, 62 Roscoe St, Liverpool L1 9DW, UK delves@nasoftware.co.uk
Abstract. Within the Esprit Pharos project, four industrial simulation codes were ported to HPF1.1 to provide a test of the suitability of this language for handling realistic Fortran77 applications, and to estimate the effort involved. We describe here the port of one of these: the SEMC3D electromagnetic simulation code. An outline of the porting process, and results obtained on the Meiko CS2, are given.
1 The Pharos Project
Pharos was an Esprit-funded project aimed at:
- Bringing together an HPF-oriented toolset from three vendors;
- Using this toolset to develop HPF versions of four industrial-strength codes owned by project partners, to estimate the effectiveness of HPF1.1 and current compilers for handling such codes, and the effort involved in producing effective HPF ports.
The project ran from January 1996 to December 1997; partners included:
Tools Vendors: N.A. Software: HPF compiler and source-level debugger (HPFPlus); Simulog: Fortran77 to Fortran90 conversion tool (FORESYS); Pallas: performance and communications monitor (VAMPIR).
Code Owners: SEMCAP: electromagnetic simulation code (SEMC3D); CISE: finite element structural code (CANT-SD); Matra BAe Dynamics: compressible flow code (Aerolog); debis: linear elastostatics code (DBETSY3D).
HPF Experts: GMD: developers of ADAPTOR, an early and highly efficient HPF research compiler; University of Vienna: developers of Vienna Fortran, a precursor to HPF, and of VFCS, an interactive research compilation system implementing Vienna Fortran and later modified to accept standard HPF.
We describe here the HPF port of SEMC3D, and give benchmark results obtained with this code.
2 The SEMC3D Code
2.1 The Application
The SEMC3D package is an Electromagnetic code developed to simulate EMC (ElectroMagnetic Compatibility) phenomena. It has a wide range of applications. For example, in the automotive industry electric and electronic components and systems must comply with increasingly tougher regulations. They must resist hostile electromagnetic environments and show reduced emissions. The SEMC3D package computes the electromagnetic environment of complex structures made up of metal, dielectric, ferrite and multilayers or human tissues. These structures could be surrounded with ionised gas, laid on a realistic or perfect ground, or fitted with antennas or cables. The electromagnetic environment should be taken as the knowledge of electric and magnetic fields inside and outside the analysed structures. Many source models can be used: NEMP (Nuclear ElectroMagnetic Pulse), plane wave, current injection, lightning and all analytical and digital signals. The connection of the SEMC3D software with specialised Cable Networks codes (such as CRIPTE) allows many applications in the field of EMC problems (emission and/or susceptibility) as well as test cases where EMI (ElectroMagnetic Interference) phenomena occur.
2.2 Numerical and Computational Features
The original Fortran77 source code comprises approximately 3500 lines. The kernel of the SEMC3D code is built around the Leap-Frog scheme in time and space. The Leap-Frog scheme uses only the components located half a step away in space and time from the updated component. This characteristic gives a strong locality in the solution of Maxwell's equations, that is to say the updating of variables is independent of the order of their appearance and each variable depends only on its closest neighbour through another variable. For each cell, 6 variables (Ex, Ey, Ez, Hx, Hy, Hz) are necessary to solve Maxwell's equations. Of course, the high locality of the finite-difference method leads to a domain decomposition of the work space.
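To make this locality concrete, the sketch below shows the shape of one leap-frog update (the Hx component on a uniform free-space grid), written in C purely for illustration; SEMC3D itself is Fortran and also handles materials, sources and boundary conditions, which are omitted here.

/* Half-step of the leap-frog scheme: update Hx from the curl of E.
 * Each point reads only its nearest neighbours, which is the locality
 * exploited by the domain decomposition. */
void update_hx(int nx, int ny, int nz,
               double hx[nx][ny][nz],
               const double ey[nx][ny][nz], const double ez[nx][ny][nz],
               double dt, double mu, double dy, double dz)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny - 1; j++)
            for (int k = 0; k < nz - 1; k++)
                hx[i][j][k] += (dt / mu) *
                    ((ey[i][j][k + 1] - ey[i][j][k]) / dz -
                     (ez[i][j + 1][k] - ez[i][j][k]) / dy);
}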
2.3 Message Passing Version
The usage of electronics in vehicles has increased at a phenomenal rate in the last decade, and continues to do so, for example to enhance automation, driver comfort and vehicle safety. As a result, the computer power needs (CPU time and memory size) steadily increase as more complex and more accurate simulations are required to meet customer needs. The use of MPP systems is one way to satisfy these needs, and a message passing version of the code was developed in the early 90s and tested on many platforms, including Paragon, CS2, CM5 and workstations, with the NX, MPI, PVM and PARMACS libraries.
The design of the parallelisation is founded on a standard domain decomposition approach based on nodal variables, namely there is a correspondence between one calculation node and one 3D element array (IJK representation). This correspondence is used both in the HPF version and in the message passing version, but the implementations differ. The message passing version used a decomposition with overlap. At each step of the computation, the calculations are made in each domain and then the quantities on the boundaries of domains are updated. This is a fine-grain parallelisation which requires very substantial effort in a message passing implementation. A major advantage of HPF is its intrinsic ability to provide a fine-grained parallelisation with modest development effort; and this is what we sought to produce.
The design of parallelisation is founded on a standard domain decomposition approach based on nodal variables, namely there is a correspondence between one calculation node and one 3D element array (IJK representation). This correspondence is used both in the H P F version and in the message passing version, but the implementations differ. The message passing version used a decomposition with overlap. At each step of the computation, the calculations are made in each domain and then the quantities on the boundaries of domains are updated. This is a fine grain parallelisation which requires very substantial effort in a message passing implementation. A major advantage of H P F is its intrinsic ability to provide a fine grained parallelisation with modest development effort; and this is what we sought to produce. 3
Outline
Porting
Procedure
The porting procedure had the same steps for each code; these steps are those foreseen as appropriate at project start: C o d e C l e a n i n g : The original Fortran77 code was tidied up to remove language features which were either: - non-standard Fortran77 (some of these remain in many codes either as legacies from earlier Fortran versions or because they have been widely supported by compilers); - or can be replaced even within Fortran77 by better constructs. This stage proved to be well-supported by the FORESYS tool. However, some work was necessary by hand. T r a n s l a t i o n t o F o r t r a n g 0 : This provides a better base for HPF. The FOREST90 tool (part of the FORESYS suite) proved to give good support for this, providing an automatic translation which was then hand polished. I m p r o v i n g t h e F o r t r a n 9 0 v e r s i o n : to make the source better suited to H P F by utilising array syntax where possible, and by removing further features (storage association, sequence association, assumed size and explicit shape array arguments) which cannot directly be parallelised in HPF. Static storage allocation was replaced by dynamic storage allocation, as part of this process. This stage made use of the FORESYS vectorising tool; but much of the work was done manually. I n s e r t i n g H P F D i r e c t i v e s : was the final stage. It involved also a small amount of code restructuring: loops were restructured to be INDEPENDENTwhere possible; Procedures were marked as HPF_SERIAL where appropriate; - Some I / O statements were moved to improve parallelisation. -
-
4 4.1
Porting
the
SEMC3D Code
T r a n s l a t i o n t o F o r t r a n 90.
As with all of the codes, FORESYS was used to generate a Fortrang0 version semi-automatically. The main features of the Fortran 90 version are :
1143 -
New Fortran 90 syntax used. All variables are declared. Interface modules are provided for every procedure. Sequence association is not used. COMMONblocks have been replaced by modules. Introduction of INTENT attribute for d u m m y arguments. - Removal of obsolete features. -
In the Fortran77 version, all arrays are defined in the m a i n p r o g r a m with their exact dimensions specified via an include file. Fortran 90 is able to introduce dynamic arrays, so we have rewritten the main program using this feature. The next stage was the introduction of array syntax or array operations when possible. This work was very easy because the initial version of SEMC3D was written in Fortran 80 for the Alliant platform. In Fortran 77 the actual and the d u m m y arguments of subroutines can differ in rank. In addition, m a n y "Fortran77" codes use vendor extensions which allow d u m m y and a c t u i to differ i s o in type or kind. The original SEMC3D code uses all these facilities. We used the capacity of Fortran 90 to create user-defined generic procedures to keep the same cMls. After the initial translation was completed we carried out portability tests, and final tuning on several platforms: IBM, Cray, HP with vendor-supplied compilers and on the Sun and Meiko CS2 platforms at Vienna with the NASL FortranPlus compiler. Some minor parts were rewritten to bypass various compiler bugs. 4.2
HPF Port
As noted above, the H P F port was performed using fine grain parallelisation. No m a j o r difficulties were encountered; the code was originally designed to vectorise well, and the majority of loops have no d a t a dependency. C h o o s i n g t h e p a r t i t i o n SEMC3D works within a three dimensional rectangular envelope, which m a p s naturally onto a 3D processor grid: !HPF$ PROCESSORS Grid_3d(number_of_processors(dim=l),
!HPF$ !HPF$
number_of_processors(dim=2), number_of_processors(dim=3))
The user can specify the actual size and shape of the grid in the run c o m m a n d . U s i n g TEMPLATE The simulation space is a box which can be defined by the following TEMPLATE. !HPF$ TEMPLATE,
DIMENSl0N(l:nx,l:ny,l:nz)
:: BOX
1144 where nx, ny and nz are the number of discretization points on each axis. Following our message passing experience, we chose to distribute this TEMPLATE with the (BLOCK,BLOCK,BLOCK) attribute; this was extended to BLI:}CK(mx), BL0CK(my), BLOCK(mz) during tuning with rex, my, mz chosen to optimise load balancing. However, since all directives are set via a single INCLUDE file, the actual mapping used can be trivially changed at the cost of a recompilation. A r r a y s : A l i g n m e n t a n d D i s t r i b u t i o n We distinguish two cases: the main program and the subroutines. In the main program, all the arrays are aligned with the BOX template, and in the subroutines they are distributed according to the inherited mapping. For example: In main
: !HPF$ ALIGN ex(I,J,K) WITH BOX(I,J,K)
In the subroutines : !HPF$ DISTRIBUTE
ex
*(BLOCK,BLOCK,BLOCK)
Major Parallel Constructs Used. The two major sources of parallelism in the HPF code are:
Use of Array Expressions: As noted, SEMC3D is highly vectorisable. Many of the Fortran77 DO loops translate naturally to Fortran90 array assignments (and FORESYS carries out many of these translations automatically). These are very efficiently handled by the NASL HPFPlus compiler.
Use of HPF_SERIAL Procedures: Computationally expensive code sections which could not readily be expressed using array syntax were wrapped in HPF_SERIAL procedures. These are typically called within a loop; the NASL HPFPlus compiler places the calls intelligently by analysing the processor residency of the actual arguments, and the resulting loops run in parallel with very low overheads. Note that this approach can be used for both fine-grain and coarse-grain parallelism; in this application the granularity is low.
4.3 Indirection
Some boundary conditions used in SEMC3D can only be handled using indirection. In the Fortran77 version, these conditions are located with an integer pointer array. In the first translation step we replicated these arrays on all processors, but at a later stage we recoded the one-dimensional pointer array to reflect the 3-D nature of the problem, and used a masked WHERE to retain parallelism:
      integer iptstr(1:nstr)
      do i=1,nstr
        e(iptstr(i)) = 0
      enddo
became
      real, dimension(1:,1:,1:), intent(inout) :: e
      logical, dimension(1:,1:,1:), intent(in) :: iptstr
!HPF$ DISTRIBUTE e *(BLOCK,BLOCK,BLOCK)
!HPF$ ALIGN iptstr(I,J,K) WITH *e(I,J,K)
      WHERE (iptstr(:,:,:))
        e(:,:,:) = 0
      ENDWHERE
5 Benchmarks
We have benchmarked the code using several datasets:
cube_07: a small (44*44*44) test cube used primarily for single-processor testing; primary data requirements 3 MBytes.
cube_08: a modest (54*54*54) test cube; primary data requirements around 6 MB.
cube_09: an 80*80*80 test cube, with primary data requirements around 20 MB. This provides a test problem small enough to be run on a single processor of the Meiko CS2, which has only limited storage (48 MB user-accessible per processor), but large enough to provide a test of the parallelism.
CAR_1: an industrial model of a complete car, with primary data requirements around 100 MB. This represents a typical production problem for the code.
5.1 Single Processor Benchmarks
The first stage in the port was the conversion of the Fortran77 code to Fortran90. This conversion itself has major advantages: the resulting code is cleaner, and written at a much higher level, so that it is easier to maintain and develop. Utilising these advantages in practice means relying on the Fortran90 compiler available on the chosen target system. The NASL HPFPlus compiler also has this reliance, since it targets Fortran90. Fortran90 compilers however still vary in their optimisation level for specific targets. Table 1 shows the single-processor performance of the original Fortran77 code, of the converted Fortran90 code, and of the final HPF code running on one processor of the Vienna Meiko CS2. The compilers used were: Fortran77: F77 SC3.0.1; Fortran90: NASL FortranPlus Release 2.0beta; HPF: NASL HPFPlus Release 2.1. We see that the CS2 FortranPlus compiler has not been comparably optimised to the Sun F77 compiler. Using HPF on one processor introduces a factor of between 2.8 and 3.4 in the single-processor times. Since (surprisingly perhaps) the HPF times are better than the F90 times, all of this is due to the performance of the backend compiler.
1146 Table 1. Single Processor Times (seconds) on the CS2 !Compiler F77 -02 F77 -05 F90 -fast HPF -O
5.2
Cube_07 26.0 16.6 79.4 56.5
Cube_08 46.2 31 106 88
Cube_09 147 101 348 1281
Parallel Performance of the HPF Code
We have run the converted H P F code on the Meiko CS2 at the University of Vienna; this is one of the two project platforms for Pharos. Table 2 shows the performance achieved. The times recorded are for four simulation time periods for cube_08, cube_09; one time period for CAR_I. Table 2. Performance of the HPF version of SEMC3D on the Meiko CS2. Times are in seconds.
Processors  Cube_08  Cube_09  CAR_1
1           89       281      -
2           52       154      -
4           35       76       268
8           13       39.7     152
16          8.6      22.7     86
32          5.2      10.7     45
5.3
Comparison with a Message-Passing MPI Port
As noted above, an MPI/Fortran77 version of the code had already been developed prior to the project start. Interest in H P F stemmed from the cost of producing and maintaining this port; and especially the difficulty of keeping it in step with ongoing development of the serial code. We ran the cube_09 benchmark with both MPI and H P F versions, on the Meiko CS2, with results shown in Table 3. The single processor performance advantage of the Fortran77 compiler is maintained roughly constant over the range of processors used; that is, the speedups achieved by H P F and MPI codes are essentially the same. The MPI figures are roughly a factor 3 better than those achieved with HPF; comparison with Table 1 shows that this is attributable wholly to the lack of optimisation of the backend Fortran90 compiler.
1147 Table 3. MPI (F77-05) vs HPF (HPFPlus -O) foe cube_09. Times are in seconds. Procs 1 2 4 8 16 32
6
MPI    101    57.5   28.3   13.5   6.9    5.7
HPF    281    154    76     39.7   22.7   10.7
Ratio  1:2.8  1:2.7  1:2.7  1:2.9  1:3.3  1:1.9
Discussion
The speedups obtained in the benchmarking are extremely encouraging; what do the results tell us about the use of HPF at this stage in compiler development?
6.1 Friendliness of the HPF Language
This has certainly been confirmed. The HPF version of the code is as easy to read as the Fortran90 version, and certainly much easier to maintain than either the Fortran77 or the message passing version. This last is particularly hard to maintain: the implementation of a fine-grain message passing parallelisation obscures the structure of the original serial code, while this structure is cleanly maintained in the HPF version.
6.2 Requirements on the HPF Compiler
One important aspect of the initial porting process was to identify those requirements on the HPF compiler needed to produce a final scalable working version of the code. Here is a summary of the findings:
Fortran90 Compiler Optimisation. Improved runtime efficiency in the backend Fortran90 compilers would help the benchmarks. The NASL HPFPlus compiler targets Fortran90 directly, and hence can (and necessarily does) rely on the many optimisations available to such a compiler. In the medium term, Fortran90 compilers will generate better code for well-written programs than the equivalent code passed through a Fortran77 compiler. This is already true for some Fortran90 compilers, but not yet on the CS2. HPFPlus Efficiency. Release 2.0 of HPFPlus has substantial optimisations which have reduced the HPF overheads in many codes to between zero and 20%; indeed, here they prove to be negative overheads. HPF overheads are certainly not a significant issue with SEMC3D.
1148 L a n g u a g e S u p p o r t The major language features used in this port were: Multidimensional BLOCK(al) distributions;
-
-
ALIGN;
PROCESSOR arrays with runtime determined size and shape;
- Efficient handling of distributed array arguments; - Array expressions using distributed components; - Autoparallelised DO loops with calls to HPF_SERIAL procedures. These features were sufficient to produce an extremely effective parallel code with parallel efficiency as good as the pre-existing MPI version; the H P F source is however very much easier to maintain, and further develop, than the messagepassing MPI code. 7
Acknowledgements
This work was supported in part via Esprit Project No P20162, Pharos. We are indebted to our colleagues in this project for many helpful discussions, and aid in running on a variety of systems. Especial thanks are due to Thomas Brandes, Junxian Liu, Ruth Lovely, John Merlin and Dave Watson.
HiPEC: High Performance Computing Visualization System Supporting Networked Electronic Commerce Applications* Reinhard Lüling and Olaf Schmidt University of Paderborn (Department of Computer Science) Fürstenallee 11, D-33102 Paderborn, Germany
{rl,merlin}@uni-paderborn,
de
Abstract. This paper presents the basic ideas and the technical background of the Esprit project HiPEC. The HiPEC project aims at integrating HPCN technologies into an electronic commerce scenario to ease the interactive configuration of products and their photorealistic visualization by a high-performance parallel rendering system.
1 Introduction
It is well accepted that electronic commerce systems will be of large benefit for the European industry if these technologies are actively used by companies and integrated into their business as soon as possible. Especially SMEs can benefit largely, as market access and communication to suppliers and larger industries can be eased considerably using Internet technologies. The main objective of the HiPEC project is to integrate advanced HPCN technologies to form a generic electronic commerce application. This application is aimed at giving a large number of SMEs a configuration and visualization tool to support the selling process within the showroom and to enlarge their business using the same tool also on the Internet. The enabling technology behind this tool is a high performance computing system as well as advanced networking technology. The HPC system is used to run the configuration and visualization system, and networks give the end-users access to an HPC system to run the service from their showroom. It is the special aim of this project to make the service as generally usable as possible. This means that open interfaces have to be used that allow a future integration of other components to customize the system to any special needs. The HPC equipment which is used for the electronic commerce application is provided as a service to the end-users in a way that allows the end-users to use these technologies via suitable communication networks. Thus, the end-users can have cost-efficient access to HPC technologies. The potential customers of these technologies are retailers of industrial products. The application range selected here is reselling of bathroom equipment.
* This work is supported by the EU, ESPRIT project 27232 (Domain 6, High Performance Computing and Networking).
a typical scenario, the customer can configure the bathroom equipment in an interactive configuration session. To ease the decision process of the customer, the configuration of the final product is supported by presenting the customer with a photorealistic animation of the composed bathroom. As such a presentation has to be computed in a short time and at the highest quality, it is necessary to use a high performance computing system for the realistic image synthesis. In Section 2 the basic ideas of HiPEC are discussed and a more detailed view of the technical background of the project is given.
2
Approach of the HiPEC Project
The HiPEC project aims at integrating HPCN technologies into an electronic commerce scenario to ease the interactive configuration of products. In the project scenario a retailer is equipped with a PC that is connected to the Internet. This PC holds a local database of product information from different manufacturers of bathroom equipment. Product information that was previously only available in books and brochures is digitized and loaded into the local database, which can be frequently updated by the manufacturer of the bathroom equipment via the Internet. A customer who wants to configure his bathroom can do this using a CAD program on the PC. During this selling process an employee of the reseller guides the selection of products and shows the different scenarios using the local CAD program on the PC. Using the local database, selections can easily be made and different scenarios can be presented to the customer.
Configuration of products is only one aspect. The other, and even more important, aspect is the visualization of a product. Customers want to have a visual impression of their product. This is of special importance in areas where furniture and other consumer goods are configured and offered to customers. As a high quality animation of products can lead to increased attractiveness, the HiPEC project integrates a rendering service that generates photorealistic pictures of the selected product. As these animations can only be generated using HPC systems, a parallel computing system is used to generate these pictures. To generate the image at the highest quality, the architecture of the bathroom is loaded to a remote parallel computer system where a database is installed that contains detailed graphical models of all products. The model of the complete bathroom is built there and forwarded to the parallel system, which generates a picture of the highest quality. The encoded picture is sent back to the PC to be printed and handed over to the customer. Thus, the HiPEC electronic commerce system contains the following components (see Figure 1).
User Interface and Configuration System The User Interface and configuration system is installed on a PC in the showroom of the seller of the product. Together with the seller, the customer configures the final product and has the possibility of using the remote high-end rendering service installed on the parallel computer system for a high-quality visualization.
Fig. 1. Architecture of the HiPEC-System
This visualization might be initiated only at the end of the selling process, or also in between to guide the user. The user is able to store generated product visualizations and configured products in a database. This database is used to set up a virtual showroom on the World Wide Web, which can be visited by potential customers from their PCs at home.
Communication network The configuration and visualization service will be offered in the showroom of the retailers. Thus, communication networks will be used for different purposes. We will use ISDN dial-up lines to connect the PCs installed in the end-users' showrooms to the parallel server. Using the database installed at the site of the parallel server, the bandwidth of these lines is large enough to support the application.
Service Broker The electronic commerce application is realised through a service broker which is installed on frontend computers of the HPC system. The service broker is responsible for managing the user accounts and for performing efficient job-scheduling of rendering jobs on the available hardware resources.
Database of product components To generate a high quality animation, the complete product that is configured on the local PC has to be delivered to the parallel rendering server. To save communication bandwidth, the most important information about the structure of the sub-components, the structure of the surface of the object and other properties is stored in a database that is located on a frontend PC connected to the parallel system. Using this database, the client system only has to communicate a
small amount of information about the configuration. The complete scene that has to be animated is generated on the parallel computing system.
Parallel rendering system Ray tracing and radiosity algorithms currently implemented in image synthesis systems provide the necessary rendering quality, but these methods suffer from their extensive computational costs and their enormous memory requirements. Parallel computing and graphics are two fields of computer science that have much to offer each other. Efficient use of parallel computers requires algorithms where the same computation is performed repeatedly or where separate tasks are performed with little or no coordination, and most computer graphics tasks fit these needs well. On the other hand, most graphics applications have an insatiable appetite for raw computational power, and parallel computers are a good way to provide this power. It can be concluded that there is an obvious synergy between computer graphics and parallel computation. Several sophisticated methods for efficient parallel simulation of illumination effects in complex environments have been developed in the past [1,2,4]. In the HiPEC approach the parallel rendering system is installed on a parallel computing platform that is located at the site of a service provider offering the parallel computing service.
Overall, this application shows the benefits of networked HPC technologies in an electronic commerce application that goes beyond presenting products on the WWW. The technology is general, i.e. it can be applied to a large number of other applications, and it is presented here for a complete industrial area, allowing us to demonstrate the benefits of these technologies.
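To make concrete the point above that the client only needs to send a small description of the configuration, the sketch below shows one plausible shape for such a message, with all detailed geometry left to the product database at the server side. Every type and field name here is an assumption made for illustration; none of it is taken from the HiPEC system itself.

  #include <stdint.h>

  /* Hypothetical compact scene description sent from the showroom PC to the
     rendering service; detailed geometry is looked up remotely by product_id. */
  typedef struct {
      uint32_t product_id;    /* key into the remote model database          */
      float    x, y, z;       /* placement of the item in the room (metres)  */
      float    rotation_deg;  /* orientation about the vertical axis         */
      uint16_t surface_id;    /* chosen finish or material variant           */
  } placed_item_t;

  typedef struct {
      float         room_width, room_depth, room_height;
      uint16_t      item_count;
      placed_item_t items[64];  /* fixed bound chosen arbitrarily for the sketch */
  } scene_request_t;

A request of this form amounts to a few kilobytes at most, which is consistent with the claim above that ISDN dial-up lines, combined with the model database installed at the server site, provide sufficient bandwidth.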
3
Conclusion
The HiPEC project integrates advanced HPC technology into a special electronic commerce application supporting the selling process within the showrooms of retailers of bathroom furniture. In this way even SMEs can profit from the advantages of powerful HPC systems, which are too expensive for a local installation. The design of the HiPEC system is kept as general as possible in order to allow the integration of other components to customize the system to the demands of additional electronic commerce applications.
References
1. Baum, D.R.; Winget, J.M.: Real time radiosity through parallel processing and hardware acceleration, Proc. of SIGGRAPH 90, March 1990, pp. 67-75
2. Chalmers, A.G.; Paddon, D.J.: Parallel processing of progressive refinement radiosity methods, Proc. 2nd Eurographics Workshop on Rendering, May 1991
3. Schmidt, O.; Lange, B.: Interaktive photorealistische Bildgenerierung durch effiziente parallele Simulation der Lichtenergieverteilung, Proc. of Informatik '97, 1997, pp. 476-485, Springer-Verlag
4. Schmidt, O.; Rathert, J.; Reeker, L.: Parallel Simulation of the Global Illumination, accepted for PDPTA '98, Las Vegas, July 1998
Performance Measurement of Interpreted Programs Tia Newhall and Barton P. Miller Computer Sciences Department, University of Wisconsin, Madison, WI 53706-1685
Abstract. In an interpreted execution there is an interdependence between the interpreter’s execution and the interpreted application’s execution; the implementation of the interpreter determines how the application is executed, and the application triggers certain activities in the interpreter. We present a representational model for describing performance data from an interpreted execution that explicitly represents the interaction between the interpreter and the application in terms of both the interpreter and application developer’s view of the execution. We present results of a prototype implementation of a performance tool for interpreted Java programs that is based on our model. Our prototype uses two techniques, dynamic instrumentation and transformational instrumentation, to measure Java programs starting with unmodified Java .class files and an unmodified Java virtual machine. We use performance data from our tool to tune a Java program, and as a result, improve its performance by more than a factor of three.
1
Introduction
An interpreted execution is the execution of one program (the interpreted application) by another (the interpreter); the application code is input to the interpreter, and the interpreter executes the application. Examples include justin-time compiled, interpreted, dynamically compiled, and some simulator executions. Performance measurement of an interpreted execution is complicated because there is an interdependence between the execution of the interpreter and the execution of the application; the implementation of the interpreter determines how the application code is executed and the application code triggers what interpreter code is executed. We present a representational model for describing performance data from an interpreted execution. Our model characterizes this interaction in a way that allows the application developer to look inside the interpreter to understand the fundamental costs associated with the application’s execution, and allows the interpreter developer to characterize the interpreter’s execution in terms of application workloads. This model allows for a concrete description of behaviors in the interpreted execution, and it is a reference point for what is needed to implement a performance tool for measuring interpreted executions. An implementation of our model can answer performance questions about specific interactions between the interpreter and the application. For example, we can represent performance data of Java interpreter activities D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 146–156, 1998. c Springer-Verlag Berlin Heidelberg 1998
like thread context switches, method table lookups, garbage collection, and bytecode instruction execution associated with different method functions in a Java application. We present results from a prototype implementation of our model for measuring the performance of interpreted Java applications and applets. Our prototype tool uses Paradyn’s dynamic instrumentation [4] to dynamically insert and remove instrumentation from the Java virtual machine and Java method byte-codes as the byte-code is interpreted by the Java virtual machine. Our tool requires no modifications to the Java virtual machine nor to the Java source nor class files prior to execution. The difficulties in measuring the performance of interpreted codes is demonstrated by comparing an interpreted code’s execution to a compiled code’s execution. A compiled code is in a form that can be executed directly on a particular operating system/architecture platform. Performance tools for compiled code provide performance measures in terms of platform-specific costs associated with executing the code; process time, number of page faults, and I/O blocking time are all examples of platform-specific measures. In contrast, an interpreted code is in a form that can be executed by the interpreter virtual machine. The interpreter virtual machine is itself an application program that executes on some operating system/architecture platform. One obvious difference between compiled and interpreted application execution is the extra layer of the interpreter program that, in part, determines the application’s performance (see Figure 1). A performance tool for measuring the performance of an interpreted execution needs to measure the interaction between the Application layer (AP) and the InterpreterVM layer (VM). There are potentially two different program developers that would be interested in performance measurement of the interpreted execution: the virtual machine developer and the application program developer. Both want performance data described in terms of platform-specific costs associated with executing parts of their program. However, each views the platform and the program as different layers of the interpreted execution. The VM developer sees the AP as input to the VM program that is run by the Platform layer (left side of the interpreted execution in Figure 1). The AP developer sees the Application layer as the program that is run on the VM (right side of the interpreted execution in Figure 1). For a VM developer, this means characterizing the virtual machine’s performance in terms of Platform layer costs of the VM program’s execution of
[Figure 1 (residue removed): layer diagrams contrasting a compiled execution (Application over Platform (OS/Arch)) with an interpreted execution (Application over InterpreterVM over Platform), annotated with the VM developer's view (the application is input to the VM, which runs on the platform) and the AP developer's view (the application runs on the VM as its platform).]
Fig. 1. Compiled application’s execution vs. Interpreted application’s execution. VM and AP developers view the interpreted execution differently.
its input (AP)—for example, the amount of process time executing VM function invokeMethod while interpreting AP method foo. The application program developer wants performance data that measures the same interaction to be defined in terms of VM-specific costs associated with AP’s execution; for example, the amount of method call context switch time in AP method foo. Another characteristic of an interpreted execution is that the application program has multiple execution forms. By multiple execution forms we mean that AP code is transformed into another form or other forms while it is executed. For example, a Java class is read in by the Java VM in .class file format. It is then transformed into an internal form that the Java VM executes. This differs from compiled code where the application binary is not modified as it executes. Our model characterizes AP’s transformations as a measurable event in the code’s execution, and we represent the relationship between different forms of an AP code object so that performance data measured for one form of the code object can be mapped back, to be viewed in previous forms of the object.
2
Performance Measurement Model
We present a representational model for describing performance data from an interpreted execution that explicitly represents the interaction between the application program and the virtual machine. Our model addresses the problems of representing an interpreted execution and describing performance data from the interpreted execution in a language that both the virtual machine developer and the application program developer can understand. 2.1
Representing an Interpreted Execution
Our representation of an interpreted execution is based on Paradyn’s representation of a program execution as a set of resource hierarchies. A resource is a physical or logical component of a program (a semaphore, a function, and a process are all examples of program resources). A resource hierarchy is a collection of hierarchically related program resources. For example, the Process resource hierarchy views the running program as a set of processes. It consists of a root node that represents all processes in the application, and some number of child resources—one for each process in the running program. Other examples of resource hierarchies are a Code hierarchy for the code view of the program, a Machine hierarchy for the set of machines on which the application is running, and a Synchronization hierarchy for the set of synchronization objects in the application. An application’s execution might be represented as the following set of resource hierarchies: {Process, Machine, Code, SyncObj}. An individual resource is represented by a path from its root node. For example, the function resource main is represented by the path /Code/main.C/main. Its path represents its relationship to other resource objects in the Code hierarchy. Since both the application program and the virtual machine can be viewed as executing programs, we can represent each of their executions as a set of resource hierarchies. For example, AP’s execution might be represented by {Code,
[Figure 2 (residue removed): on the left, the Virtual Machine's resource hierarchies: Machine (cham, grilled), Code (main.c, blah.c), Process (pid 1-3) and SyncObj (message tags); on the right, the Application Program's hierarchies: Code (main, foo, blah), Threads (tid 1-3) and SyncObj (monitors mon1, mon2).]
Fig. 2. Example resource hierarchies for the virtual machine and application program.
Thread, SyncObj} (right half of Figure 2), and the virtual machine’s execution might be represented as {Machine, Process, Code, SyncObj} (left half of Figure 2). The resource hierarchies in this figure represent AP and VM’s executions separately. However, there is an interaction between the execution of the two that must be represented. The relationship between the virtual machine and its input (AP) is that VM runs AP. We represent the VM runs AP relationship as an interaction between program resources of the two; code, process, and synchronization objects in the virtual machine interact with code, process, and synchronization objects in the application program during the interpreted execution. The interpreted execution is the union of AP and VM resource hierarchies. Figure 3 is an example of an interpreted execution represented by {Machine, Code, Process, SyncObj, APCode, APThreads, APSyncObj}. The application program’s multiple execution forms are represented by a set of sets of resource hierarchies—one set for each of the forms that AP takes during its execution, and a set of mapping functions that map resources in one form to resources in another form of AP. For example, in a dynamically compiled Java application, method functions may be translated from byte-code form to native code form during execution. Initially we create one AP Code hierarchy for the byte-code form of AP. As run-time transformations occur, we create a new AP code hierarchy for the native form of AP objects, and create mapping functions that map resources in one form to resources in the other form.
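As a small aside, the path notation used above (e.g. /Code/main.C/main) can be handled with ordinary string prefixes; the C sketch below, whose helper name is our own and not Paradyn's, tests whether a resource lies in the subtree rooted at another resource.

  #include <stdbool.h>
  #include <string.h>

  /* True if `resource` lies in the subtree rooted at `ancestor`, e.g.
     within("/Code/main.C/main", "/Code/main.C") is true, while
     within("/Code/main.Cxx", "/Code/main.C") is not. */
  bool within(const char *resource, const char *ancestor) {
      size_t n = strlen(ancestor);
      return strncmp(resource, ancestor, n) == 0 &&
             (resource[n] == '\0' || resource[n] == '/');
  }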
2.2
Representing Points in an Interpreted Execution
We can think of a program’s execution as a series of events. Executing an instruction, waiting on a synchronization primitive, and accessing a memory page are examples of events. Any event in the program’s execution can be represented
[Figure 3 (residue removed): the combined "VM runs AP" hierarchies: Machine, Code, Process and SyncObj from the VM together with APCode, APThreads and APSyncObj from the application.]
Fig. 3. Resource hierarchies representing the execution of VM runs AP.
as a subset of its active resources—those resources that are currently involved in the event. For example, if the event is “process 1 is executing code in function foo”, then the resource /Code/main.C/foo and the resource /Process/pid 1 are both active when this activity is occurring. These single resource selections from a resource hierarchy are called constraints. The activity specified by the constraint is active when its constraint function is true. Definition 1. A constraint is a single resource selection from a resource hierarchy. It represents a restriction of the hierarchy to some subset of its resources. Definition 2. A constraint function, constrain(r), is a boolean function of resource r that is true when r is active. For example, constrain(/Process/pid 1) is true when process 1 is active (i.e. running).
Constraint functions are resource-specific; all constraint functions test whether their resource argument is active, but the type of test that is done depends on the type of the resource argument. The constraint function for a process resource will test whether the specified process is running. The constraint function for a function resource will test whether the program counter is in the function. Each resource hierarchy exports the constraint functions for the resources in its hierarchy. For example, Code.constrain(/Code/main.C) is the constraint function applied to the Code hierarchy resource main.C. Constraint functions can be combined with AND and OR to create boolean expressions containing constraints on more than one resource. By combining one constraint from each resource hierarchy with the AND operator, we represent different activities in the running program. This representation is called a focus. Definition 3. A focus is a selection of resources (one from each resource hierarchy). It represents an activity in the running program. A focus is active when all of its resources are active (i.e., the AND of the constraint functions).
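A minimal sketch of how constraint functions and foci compose, written in C with invented predicate names rather than the tool's actual interfaces:

  #include <stdbool.h>

  /* A constraint function reports whether its resource is currently active. */
  typedef bool (*constraint_fn)(void);

  static bool pid1_running(void)       { return true;  }  /* stands for an OS query     */
  static bool pc_in_invokeMethod(void) { return true;  }  /* stands for a PC-range test */
  static bool ap_in_foo(void)          { return false; }  /* stands for a VM-level test */

  /* A focus supplies one constraint per resource hierarchy; it is active only
     when every constraint is true (the AND of Definition 3). */
  bool focus_active(const constraint_fn *c, int n) {
      for (int i = 0; i < n; i++)
          if (!c[i]()) return false;
      return true;
  }

  /* Example focus: active while process 1 runs invokeMethod interpreting foo. */
  static const constraint_fn example_focus[] =
      { pid1_running, pc_in_invokeMethod, ap_in_foo };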
If the focus contains resources that are refined on both VM and AP resource hierarchies, then it represents a specific part of virtual machine’s execution of the application program. For example,
</Code/main.C/invokeMethod, /Process/pid 1, /APCode/foo.class/foo> is a focus from Figure 3 that represents when VM’s process 1 is executing function invokeMethod, and interpreting AP method foo. This activity is occurring when the corresponding constraint functions are all true.
2.3
Performance Data for Interpreted Execution
To describe performance data that measure the interaction between the application program and the virtual machine, we provide a mechanism to selectively constrain performance information to any part of the execution. There are two complementary (and occasionally overlapping) ways to do this: constraints are either implicitly specified by metric functions or explicitly specified by foci. A metric function is a time-varying function that measures some aspect of a program execution’s performance. Metric functions consist of time or count functions combined with boolean expressions built from constraint functions and constraint operators. For example:
– CPUtime = [ ] processTime/sec. The amount of process time per second.
– methodCallTime = [Code.constrain(/Code/main.C/invokeMethod)] processTime/sec. The time spent in VM function invokeMethod.
– io wait = [Code.constrain(/Code/libc.so/read) OR Code.constrain(/Code/libc.so/write)] wallTime/sec. The time spent reading or writing.
The second way to constrain performance data is by specifying foci that represent restricted locations in the execution. Foci with both VM and AP resources represent an interaction between VM and AP. The focus </Code/main.C/fetchCode, /APCode/foo.class/foo> represents the part of the execution when the virtual machine is fetching AP code object foo. This part of the execution is active when [Code.constrain(/Code/main.C/fetchCode) AND APCode.constrain(/APCode/foo.class/foo)] is true. If we combine metric functions with this focus, we can represent performance data for the specified interaction. We represent performance data as metric-focus pairs. The AND operator is used to combine a metric with a focus. The following are some example metric–focus pairs (we only show those components of the focus that are refined beyond a hierarchy root node):
1. CPUtime, </APCode/foo.class/foo>: [ ] processTime/sec AND [APCode.constrain(/APCode/foo.class/foo)]
2. CPUtime, </Code/main.C/invokeMethod, /APCode/foo.class/foo>: [ ] processTime/sec AND [Code.constrain(/Code/main.C/invokeMethod) AND APCode.constrain(/APCode/foo.class/foo)]
3. methodCallTime, </APCode/foo.class/foo>: [Code.constrain(/Code/main.C/invokeMethod)] processTime/sec AND [APCode.constrain(/APCode/foo.class/foo)]
Example 1 measures the amount of process time spent in AP function foo. The performance measurements in examples 2 and 3 are identical—both measure the amount of process time spent in VM function invokeMethod while interpreting AP function foo. However, example 2 is represented in a form that is more useful to a VM developer and example 3 is represented in a form that is more useful to an AP developer. Example 3 uses an VM-specific metric. VM-specific metric functions measure activities that are specific to a particular virtual machine. They are designed to present performance data to an AP developer who may have little or no knowledge of the virtual machine; they encode knowledge of the virtual machine in a representation that is closer to the semantics of the application language. Thus, an AP developer can measure VM costs associated with the execution of their program without having to know the details of the implementation of the VM; the methodCallTime metric encodes information about the VM function invokeMethod that is used to compute its value. A final issue is representing performance data for foci with application program resources. An AP object may currently be in one form, while its performance data should be viewable in any of its forms. To do this, AP mapping functions are used to map performance data that is measured in one form of an AP object to a logical view of the same object in any of its other forms.
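One simple way to realise a metric–focus pair is to sample periodically and to accumulate the metric only while every constraint drawn from the metric and the focus holds. The sketch below reuses the constraint_fn type and focus_active helper from the earlier example; it is an illustration under our own assumptions, not the tool's implementation.

  #include <stdbool.h>

  typedef bool (*constraint_fn)(void);              /* as in the earlier sketch */
  bool focus_active(const constraint_fn *c, int n); /* ditto                    */

  /* A metric-focus pair accumulates time only while all its constraints hold. */
  typedef struct {
      double value;               /* accumulated metric value             */
      const constraint_fn *cons;  /* constraints of the metric and focus  */
      int ncons;
  } metric_focus_t;

  void sample(metric_focus_t *mf, double dt) {
      if (focus_active(mf->cons, mf->ncons))
          mf->value += dt;
  }

Mapping functions for the multiple execution forms of AP can then be applied to the accumulated values, so that a cost measured against one form of a method can be reported against another of its views, as described above.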
3
Measuring Interpreted Java applications
We present a tool for measuring the performance of interpreted Java applications and applets running on Sun’s version 1.0.2 of the Java VM [6]. The tool is an implementation of our model for representing performance data from an interpreted execution. The Java VM is an abstract stack-based processor architecture. A Java program consists of a set of classes, each compiled into its own .class file. Each method function is compiled into byte-code instructions that the VM executes. To measure the performance of an interpreted Java application or applet, our performance tool (1) discovers Java program resources as they are loaded by the VM, (2) generates and inserts SPARC instrumentation code into Java VM routines, and (3) generates and inserts Java byte-code instrumentation into Java methods and triggers the Java VM to execute the instrumentation code. Since the Java VM performs delayed loading of class files, new classes can be loaded at any point during the execution. We insert instrumentation code in the VM that notifies our tool when a new .class file is loaded. We parse the VM’s internal form of the class to create application program code resources for the class. At this point, instrumentation requests can be made for the class by specifying metric–focus pairs containing the class’s resources. We use dynamic instrumentation [4] to insert and delete instrumentation into Java method code and Java VM code at any point in the execution. Dynamic instrumentation is a technique where instrumentation code is generated in the heap, a branch instruction is inserted from the function’s instrumentation point to the instrumentation code, and the function’s instructions that were replaced by the branch are relocated to the heap and executed before or after the instrumentation code. Because the SPARC instruction set has instructions to save and restore stack frames, the instrumentation code and the relocated instructions can execute in their own stack frames. Thus instrumentation code will not destroy the values in the function’s stack frame. Using this technique to instrument Java methods is complicated by the fact that a method’s byte-code instructions push and pop operands from their own operand stack. Java instrumentation code should use its own operand stack and have its own execution stack frame. The Java instruction set does not contain instructions to explicitly save and restore execution stack frames or to create new operand stacks. Our solution uses a technique called transformational instrumentation. This technique forces the Java VM to create a new operand stack and execution stack frame for our instrumentation code. The following are the transformational instrumentation steps: 1. The first time an instrumentation request is made for a method, relocate the method byte-code to the heap and expand its size by adding nop byte-code instructions around each instrumentation point. The nop instructions will be replaced with branches to instrumentation code. 2. Get the VM to execute the relocated method byte-code by replacing the first bytes in the original method with a goto w byte-code instruction that branches to the relocated method. Since the goto w instruction is inserted after the VM has
verified that this method byte-code is legal, the VM will execute this instruction even though it branches outside the original method function. 3. Generate instrumentation code in the heap. We generate SPARC instrumentation code in the heap, and use Java’s native methods facility to call our instrumentation code from the Java method byte-code. 4. Insert method call byte-code instructions in the relocated method to call the native method function that will execute the instrumentation code. This will implicitly cause the VM to create a new execution stack frame and value stack for the instrumentation code.
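The first transformational-instrumentation step, copying the method byte-code and padding each instrumentation point with nops that are later overwritten by branches to instrumentation code, can be sketched as below. The byte-code is modelled as a plain byte array, and the function name and padding size are assumptions for illustration only.

  #include <stdlib.h>
  #include <string.h>

  #define OP_NOP 0x00   /* JVM nop opcode */

  /* Copy `code` (len bytes) into a new buffer, inserting `pad` nop bytes before
     each offset listed in ascending order in `points`; the nops reserve room
     that is later patched with calls to the instrumentation code. */
  unsigned char *expand_with_nops(const unsigned char *code, size_t len,
                                  const size_t *points, size_t npoints,
                                  size_t pad, size_t *out_len) {
      unsigned char *out = malloc(len + npoints * pad);
      if (!out) return NULL;
      size_t src = 0, dst = 0;
      for (size_t i = 0; i < npoints; i++) {
          size_t chunk = points[i] - src;        /* bytes up to this point   */
          memcpy(out + dst, code + src, chunk);
          dst += chunk;  src += chunk;
          memset(out + dst, OP_NOP, pad);        /* room for a later patch   */
          dst += pad;
      }
      memcpy(out + dst, code + src, len - src);  /* remainder of the method  */
      *out_len = dst + (len - src);
      return out;
  }

The real tool then replaces the first bytes of the original method with a goto that branches to the relocated copy, as described in step 2 above.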
4
Results
We present results from running a Java application with our performance tool. The application is a CPU scheduling simulator that consists of eleven Java classes and approximately 1200 lines of Java code. Currently, we can represent performance data in terms of VM program resources, AP code resources, and the combination of VM and AP program resources using both foci and VM-specific metrics to describe the interaction. Figure 4 shows the resource hierarchies from the interpreted execution, including the separate VM and AP code hierarchies. We began by looking at the overall CPU utilization of the program (about 98%). We next tried to account for the part of the CPU time due to method call context switching and object creation in the Java VM. To measure these, we created two VM-specific metrics—MethodCall CS measures the time for the Java VM to perform a method call context-switch, and obj create measures the time for the Java VM to create a new object. Both are a result of the VM interpreting certain byte-code instructions in the AP. We measured these values for the Whole Program focus (no constraints). As a result, we found that a large portion (∼ 35%) of the total CPU time is spent handling method context switching and object creation (Figure 5).¹ Because of these results, we first tried to reduce the method call context switching time by in-lining method functions. Figure 6 shows the method functions that are accounting for the most CPU time and the largest number of
¹ With CPU enabled for every Java class, about 45% is due to executing Java code, 5% to method call context switching, and 40–45% to instrumentation overhead.
Fig. 4. Resource Hierarchies from the interpreted execution.
Fig. 5. Time Histogram showing CPU utilization, object creation time, and method call context switching time. This graph shows that method call context switching time plus object creation time account for ∼ 35% of the total CPU time.
method function calls. We found that the nextInterrupt() and isBusy() methods of the Device class were being called often, and were accounting for a relatively large amount of total CPU time. By examining the code we found that the Sim.now() method was also called frequently. These three methods return the value of a private data member, and thus are good candidates for in-lining. After changing the code to in-line calls to these three method functions, the total number of method calls decreased by 31% and the total execution time decreased by 12% (second row in Table 1).
Fig. 6. AP classes and methods that account for the largest % CPU (left) and that are called the most frequently (right). The first column lists the focus, and the second column lists the metric value for the focus.
We next tried to reduce the object creation time. We examined the CPU time for the new version of the AP code, and found that the Sim, Device, Job, and StringBuffer classes accounted for most of the CPU time. The time spent in StringBuffer methods is due to a large number of calls to the append and constructor methods made from the Sim and Device classes. We were able to reduce the number of StringBuffer and String objects created by removing strings that were created but never used, and by creating static data members for parts of strings that were recreated multiple times (in Device.stop() and Device.start()). With these changes we are able to reduce the total execution time by 70% (fourth row in Table 1).
Table 1. Performance results from different versions of the application.

Optimization         Number of Method Calls    Number of Object Creates    Total Execution Time (in seconds)
Original Version     13,373,200                465,140                     389.29
Method in-lining     9,213,400 (31% less)      465,140                     343.99 (-12% change)
Fewer Obj. Creates   9,727,800 (27% less)      17,350 (96% less)           234.13 (-40% change)
Both Changes         5,568,100 (58% less)      17,350                      113.60 (-70% change)
In this example, our tool provided performance data that is difficult to obtain with other performance tools. Our tool provided performance data that described expensive interactions between the Java VM and the Java application, and accounted for these costs in terms of AP resources. With this data, we were easily able to determine what changes to make to the Java application to improve its performance.
5
Related Work
There are many general purpose program performance measurement tools [3,5,7,4] that can be used to measure the performance of the virtual machine. However, these are unable to present performance data in terms of the application program or in terms of the interaction between VM and AP. There are some performance tools that provide performance data in terms of AP’s execution [1,2]. These tools provide performance data in terms of Java application code. However, they do not represent performance data in terms of the Java VM’s execution or in terms of the interaction between the VM and the Java application. If the time values provided by these tools include Java VM method call, object creation, garbage collection, thread context switching, or class file loading activities in the VM, then performance data that explicitly represents this interaction between the Java VM and the Java application’s execution will help an application developer determine how to tune his or her application.
6
Conclusion and Future Work
This paper describes a new approach to performance measurement of interpreted executions that explicitly models the interaction between the interpreter program and the interpreted application program so that performance data can be associated with any part of the execution. Performance data is represented in terms that either an application program developer or an interpreter developer can understand. Currently, we are working to expand our prototype to include a larger set of VM-specific metrics, AP thread and synchronization resource hierarchies, and support for mapping performance data between different views of AP code objects. With support for AP thread and synchronization resources our tool can provide performance data for multi-threaded Java applications.
7
Acknowledgments
Thanks to Marvin Solomon for providing the CPU simulator Java application. This work is supported in part by DARPA contract N66001-97-C-8532, NSF grant CDA-9623632, and Department of Energy Grant DE-FG02-93ER25176. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
References 1. Intuitive Systems. Optimize It. http://www.optimizeit.com/. 155 2. KL Group. JProbe. http://www.klg.com/jprobe/. 155 3. A. Malony, B. Mohr, P. Beckman, D. Gannon, S. Yang, and F. Bodin. Performance Analysis of pC++: A Portable Data-Parallel Programming System for Scalable Parallel Computers. Proceedings of the 8th International Parallel Processing Symposium (IPPS), Cancun, Mexico, pages 75–85, April 1994. 155 4. Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn Parallel Performance Measurement Tools. IEEE Computer 28, 11, November 1995. 147, 152, 155 5. Daniel A. Reed, Ruth A. Aydt, Roger J. Noe, Phillip C. Roth, Keith A. Shields, Bradley W. Schwartz, and Luis F. Tavera. Scalable Performance Analysis: The Pablo Performance Anlysis Environment. Proceedings of the Scalable Parallel Libraries Conference, pages 104–113. IEEE Computer Society, 1993. 155 6. Sun Microsystems Computer Corporation. The Java Virtual Machine Specification, August 21 1995. 152 7. J. C. Yan. Performance Tuning with AIMS – An Automated Instrumentation and Monitoring System for Multicomputers. 27th Hawaii International Conference on System Sciences, Wailea, Hawaii, pages 625–633, January 1994. 155
Analysing an SQL Application with a BSPlib Call-graph Profiling Tool Jonathan M.D. Hill, Stephen A. Jarvis, Constantinos Siniolakis, and Vasil P. Vasilev Oxford University Computing Laboratory, UK.
Abstract. This paper illustrates the use of a post-mortem call-graph profiling tool in the analysis of an SQL query processing application written using BSPlib [4]. Unlike other parallel profiling tools, the architecture independent metric of imbalance in size of communicated data is used to guide program optimisation. We show that by using this metric, BSPlib programs can be optimised in a portable and architecture independent manner. Results are presented to support this claim for unoptimised and optimised versions of a program running on networks of workstations, shared memory multiprocessors and tightly coupled distributed memory parallel machines.
1
Introduction
The Bulk Synchronous Parallel model [8,6] views a parallel machine as a set of processor-memory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP program consists of a sequence of supersteps. Each superstep involves all of the processors and consists of three phases: (1) processor-memory pairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processor’s memories; and (3) all processors barrier synchronise. The BSP cost model [6] asserts that globally balancing computation and communication is the key to optimal parallel design. The rationale for balancing computation is clear as the barrier that marks the end of the superstep ensures that all processors have to wait for the slowest processor before proceeding into the next superstep. It is therefore desirable that all processes enter the barrier at about the same time to minimise idle time. In contrast, the need to balance communication is not so clear-cut. The BSP cost model implicitly asserts that the dominating cost in communicating a message is the cost of crossing the boundary of the network, rather than the cost of internal transit. This is largely true for today’s highly-connected networks, as described in detail in [2]. Therefore, a good design strategy is to ensure that all processors inject the same amount of data into the network and are consequently equally involved in the communication. This paper explores the use of a call-graph profiling tool [3] that exposes imbalance in either computation or communication, and highlights those portions of the program that are amenable to improvement. Our hypothesis is that by D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 157–165, 1998. c Springer-Verlag Berlin Heidelberg 1998
minimising imbalance, significant improvements in the algorithmic complexity of parallel algorithms usually follow. In order to test the hypothesis, the callgraph profiler is used to optimise a database query evaluation program [7]. We show how the profiling tool helps identify problems in the implementation of a randomised sample sort algorithm. It is also shown how the profiling tool guides the improvement of the algorithm. Results are presented which show that significant performance improvements can be achieved without tailoring algorithms to a particular machine or architecture.
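For readers unfamiliar with BSPlib, the fragment below sketches the superstep structure described above (local computation, one-sided communication with bsp_put, then barrier synchronisation) using the core BSPlib primitives. The data sizes and the work performed are placeholders, and the fragment is assumed to be called between bsp_begin and bsp_end.

  #include "bsp.h"

  #define N 1024

  void superstep_example(void) {
      static double local[N], incoming[N];
      bsp_push_reg(incoming, N * sizeof(double));  /* make the destination visible */
      bsp_sync();

      /* (1) compute on data held locally at the start of the superstep */
      for (int i = 0; i < N; i++) local[i] *= 2.0;

      /* (2) communicate: write our block into the next processor's memory */
      int dest = (bsp_pid() + 1) % bsp_nprocs();
      bsp_put(dest, local, incoming, 0, N * sizeof(double));

      /* (3) barrier: the superstep ends here, and the profiler charges the
             computation and communication above to this bsp_sync call site */
      bsp_sync();

      bsp_pop_reg(incoming);
  }

If every process injects roughly the same amount of data before the barrier (a balanced h-relation), no process is left idle waiting for a more heavily loaded one, which is exactly the property the profiler's imbalance figures expose.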
2
Profiling an SQL Database Application
A walk-through of the steps involved in optimising an SQL database query evaluation program is used to demonstrate the use of the call-graph profiling tool. A series of relational queries were implemented in BSP. This was done by transcribing the queries into C function calls and then linking them with a BSPlib library of SQL-like primitives. The test program was designed to take as input a sequence of relations (tables); it would then process the tables and yield as output a sequence of intermediate relations. The input relations were distributed among the processors using a simple block-cyclic distribution. Three input relations ITEM, QNT and TRAN were defined. Six queries were evaluated which created the following intermediary relations: (1) TEMP1, an aggregate sum and a “group-by” rearrangement of the relation TRAN; (2) TEMP2, an equality-join of TEMP1 and ITEM; (3) TEMP3, an aggregate sum and group-by of TEMP2; (4) TEMP4, an equality-join of relations TEMP3 and QNT; (5) TEMP5, a less-than-join of relations TEMP4 and ITEM; and (6) a filter (IN “low 1%”) of the relation TEMP5.
2.1
Using the Profiler to Optimise SQL Queries
Figure 2 shows a screen shot of a call-graph profile for the SQL query processing program running on a sixteen processor Cray T3E. The call-graph contains a series of interior and leaf nodes. The interior nodes represent procedures entered during program execution, whereas the leaf nodes represent the textual position of the end of a superstep, that is, the line of code containing a call to the barrier synchronisation function bsp sync. The path from a leaf to the root of the graph identifies the nesting of procedure calls that were active when bsp sync was executed. This path is termed a call-stack and a collection of call-stacks comprise a call-graph. The costs of shared procedures can be accurately apportioned to their parents via a scheme known as inheritance [5]. This is particularly important when determining how costs are allocated to library functions. Instead of simply reporting that all the time was spent in a parallel sorting algorithm for example, a call-graph profile also attributes costs to the procedures which contained calls to the sort routine. In line with superstep semantics, the cost of all communication issued during a superstep is charged to the barrier that marks the end of the superstep. Similarly, all computation is charged to the end of the superstep. Leaf nodes record the costs accrued during a superstep as: (i) the textual position of the
Fig. 1. Screen shot of the call-graph profiling tool
bsp sync call within the program; (ii) the number of times a particular superstep is executed; and (iii) summaries of computation cost, communication cost, idle time, and the largest amount of data either entering or leaving any process (an h-relation). These costs are given as a maximum, average, and minimum cost over p processors (the average and minimum costs are given as a percentage of the maximum). Interior nodes record similar information to leaf nodes, except that the label of the node is a procedure name and the accumulated cost is inherited from each of the supersteps executed during the lifetime of that procedure. Critical paths are made visible by shading each of the nodes in the graph with a colour ranging from white to red. A red node corresponds to a bottleneck (or ‘hot-spot’!) in the program. In the graphs that follow, the shading for each node is determined by the largest difference between the maximum and average cost, and also the largest percentage-wise deviation between maximum and average cost [3]. Terms of the form (12% | 7%) are used to quantify this balance. This states that the average cost of executing a piece of code on p processors is 12% of the cost of the processor that spent the longest time performing the task; whereas the process with the least amount of work took only 7% of the time of the most loaded process. In the following sections we refine the algorithms used in the SQL program until there is virtually no imbalance, which is identified by terms of the form (100% | 100%). Query Evaluation Stage 1 A portion of the call-graph for the original program (version 1) is shown in Figure 2. The three input relations were initially unevenly distributed among the processors. It is to be expected that an irregular
[Figure 2 (residue removed): call-graph with nodes SQL_Process (176 syncs), SELECT__FROM_t4_item__WHERE_qnt_LT_lim, SELECT__FROM_tran__SUM_qnt__GROUPBY_cd_dp, elim_dup0, apply0 and several ex_query.c supersteps, each annotated with max/avg/min Comp, Comm, Wait and Hrel costs.]
Fig. 2. Query version 1: SQL query evaluation.
distribution of input relations would produce a considerable amount of imbalance in computation and communication when performing operations using these data structures. For example, Figure 2 shows a (54%| 21%) imbalance in h-relation size. To remedy this imbalance, load balancing functions were introduced into the code which ensured that each processor contained approximately an equal sized partition of the input relation. Query Evaluation Stage 2 Load balancing the input relations reduced the amount of communication and computation imbalance by 26%. Further profiles revealed that the imbalance had not been completely eradicated. It appeared that the SQL primitives had inherent imbalances in communication even for perfectly balanced input data. This can be seen in Figure 3. The critical paths identifying the imbalance in communication were followed from the select function to elim dup0. At this point it was easy to identify that the major cause of the communication imbalance, (69%| 54%), was attributed to the function bspsort at line 175. This is shown in the screen-shot in Figure 1. In the same way, following the imbalance in the computation critical paths, (51%| 17%), it was also clear that the primary cause of computation imbalance was attributed to the same function bspsort, this time at line 188 in Figure 1. Figure 1 also shows a pie-chart that gives a breakdown of the accumulated computation time at the superstep at line 188 for all the processes. To give some idea of the type of computation that may cause problems, the underlying algorithm of the bspsort function is (briefly) described. The function bspsort implements a refined variant of the optimal randomised BSP sorting algorithm of [1]. The algorithm consists of seven stages: (1) each processor locally sorts the elements in its possession; (2) each processor selects a random sample of s × p elements (where s is the oversampling factor) which are gathered onto process zero; (3) the samples are sorted and p regular pivots are picked from the s × p2 samples; (4) the pivots are broadcast to all processors; (5) each processor partitions the elements in its possession into the p blocks as induced by the
[Figure 3 (residue removed): the same call-graph after load balancing, rooted at SQL_Process (158 syncs) with the query, elim_dup0 and apply0 nodes, again annotated with max/avg/min Comp, Comm, Wait and Hrel costs.]
Fig. 3. Query version 2: SQL query evaluation after load balance.
pivots; (6) each processor sends partition i to processor i; and (7) a local multiway merge produces the desired effect of a global sort. If stages (6) and (7) are not balanced, then this can only be attributed to a poor selection of splitters in stage (2). Since the random number generator that selects the sample had been extensively tested prior to usage, the only possible cause for the disappointing performance of the algorithm was the choice of the oversampling factor s. The algorithm had, however, been previously tested and the oversampling procedure had been fine tuned by means of extensive experimental results using simple timing functions. The experimental results had suggested that the oversampling factor established by the theoretical analysis of the algorithm had been a gross overestimate and, therefore, at the implementation level it had been decided to employ a considerably reduced factor.
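Stages (2) and (3) of the algorithm described above, sorting the gathered sample of s × p² candidates and picking regularly spaced splitters, can be sketched sequentially as follows. The code picks the p−1 internal boundaries (the text speaks of p pivots); the names are invented and this is an illustration of the selection rule, not the BSPlib implementation inside bspsort.

  #include <stdlib.h>

  static int cmp_double(const void *a, const void *b) {
      double x = *(const double *)a, y = *(const double *)b;
      return (x > y) - (x < y);
  }

  /* Sort the gathered sample of s*p*p candidates and pick p-1 regularly
     spaced splitters; a larger oversampling factor s gives more evenly
     sized partitions at the cost of gathering and sorting a larger sample. */
  void pick_splitters(double *sample, int s, int p, double *splitters) {
      int total = s * p * p;
      qsort(sample, (size_t)total, sizeof(double), cmp_double);
      for (int i = 1; i < p; i++)
          splitters[i - 1] = sample[i * s * p];
  }

The tension visible in Figures 4 and 5 is precisely the choice of s in such a routine: a small experimental s keeps stages (2) and (3) cheap but unbalances stages (6) and (7), while the larger theoretical s balances the final exchange at the cost of the sample sort itself.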
[Figure 4 (residue removed): call-graph rooted at bspsort (10 syncs), with supersteps at sort.c lines 125, 175 and 188 and, within bspreduce_multi, at scan.c lines 133 and 146; each node reports max/avg/min Comp, Comm, Wait and Hrel costs.]
Fig. 4. Sorting version 1: experimental oversampling factor.
Query Evaluation: Improving Parallel Sorting Experiments continued by varying the oversampling factor for the sorting algorithm. A portion of the call-graphs for the optimal experimental and theoretical parameters are exhibited in Figures 4 and 5 respectively. The original experimental results were confirmed by the profile, that is, the algorithm utilising the theoretical oversampling factor (sort version 2) delivered performance that was approximately 50% inferior to that of the algorithm utilising the experimental oversampling factor (sort ver-
[Figure 5 (residue removed): call-graph rooted at bspsort (12 syncs), with supersteps at sort.c lines 125, 135, 175 and 188 and, within bspreduce_multi, at scan.c lines 133 and 146; each node reports max/avg/min Comp, Comm, Wait and Hrel costs.]
Fig. 5. Sorting version 2: theoretical oversampling factor.
sion 1). The computation imbalance of (49% | 15%) in stage (7) of sort version 1, shown at line 188 in Figure 4, shifted to a computation imbalance of (7% | 0%) in stage (2) of sort version 2, shown at line 135 of Figure 5. Similarly, the communication imbalance of (60% | 48%) in stage (6) of sort version 1 shifted to an imbalance of (12% | 7%) in stage (3) at line 125 of sort version 2. It was noted, however, that the communication and computation requirements of stages (6) and (7) in sort version 2 of Figure 5 are highly balanced, (98% | 97%) and (95% | 89%) respectively. Therefore, the theoretical analysis had accurately predicted the oversampling factor value required to achieve load balance. Unfortunately, the sustained improvement gained by balancing the communication pattern of stage (6) of the underlying sorting algorithm – and consequently, the communication requirements of the algorithm – had been largely overwhelmed by the cost of communicating and sorting a larger sample in stages (2) and (3). To remedy this problem the ideas of [1] were adopted. In particular, the unbalanced communication and computation algorithm of stages (2) and (3), which collected and sorted a sample on a single process, was replaced with a simple and efficient parallel sorting algorithm, which sorted the sample set among all the processes. As noted in [1], an appropriate choice for such a parallel sorting algorithm is an efficient variant of the bitonic-sort network. The introduction of the bitonic sorter brought the balance figures of the bspsort node in Figures 4 and 5 to: Computation (99% | 98%), Communication (83% | 70%), and h-relation (99% | 98%). This improved the wall-clock running time of the sorting algorithm by 8.5%. Query Evaluation Stage 3 As the sorting algorithm is central to the implementation of most of the SQL queries, a minor improvement in the sorting algorithm brings about a marked improvement in the performance of the query evaluation as a whole. This can be clearly seen from the lack of any shading of critical paths in Figure 6 (contrast with Figure 2) – all the h-relations in the SQL queries are almost perfectly balanced. To reinforce the conclusion that these improvements are architecture independent, Table 1 shows the wall-clock times for the original and optimised pro-
[Figure 6 (residue removed): call-graph with nodes SQL_Process (186 syncs), SELECT__FROM_t4_item__WHERE_qnt_LT_lim, SELECT__FROM_tran__SUM_qnt__GROUPBY_cd_dp, elim_dup0, apply0 and several ex_query.c supersteps; all h-relations are essentially balanced (100% | 100%).]
Fig. 6. Query version 3: SQL query evaluation, final version.
grams running on a variety of parallel machines. The size of the input relations was chosen to be small, so that the computation time in the algorithms would not dominate. This is particularly important if we are to highlight the improvements in communication cost. A small relation size also enabled the Network of Workstations (NOW) implementation of the algorithm to complete in a reasonable amount of time. For all but one of the results, the optimised version of the program provides an approximate 20% improvement over the original program. This evidence supports our hypothesis that the call-graph profiling tool provides a mechanism that guides architecture independent program optimisation. As an aside, the parallel efficiencies in the table are marginal for the T3E and Origin 2000, due to the communication intensive nature of this query processing application, in combination with the small amount of data held on each process (750 records per process at p = 16). In contrast, a combination of the slow processors, fast communication, and poor cache organisation on the T3D gives super-linear speedup even for this small data set. However, the communication intensive nature of this problem makes it unsuitable for running over loosely coupled distributed memory machines. For example, the results for a NOW exhibit super-linear speedup at p = 2 in the optimised program, yet super-linear slowdown in all other configurations.
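The speedup and gain columns of Table 1 follow from the raw wall-clock times by the usual definitions; the small helper below (the names are ours) makes the arithmetic explicit.

  /* speedup: time on one processor divided by time on p processors (per version);
     gain   : relative improvement of the optimised run over the original run.   */
  double speedup(double t1, double tp)      { return t1 / tp; }
  double gain(double t_unopt, double t_opt) { return 1.0 - t_opt / t_unopt; }
  /* Example: on the Cray T3E at p = 1, gain(6.44, 5.37) is about 0.17, i.e. 17%. */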
3
Conclusions
The performance improvements resulting from the analysis of the call-graph profiles demonstrate that the tool can be used to optimise programs in a portable and architecture independent manner. Unlike other profiling tools, the architecture independent metric – h-relation size – guides the optimisation process. The major benefit of this profiling tool is that the amount of information displayed when visualising a profile for a parallel program is no more complex than that of a sequential program.
Table 1. Wall-clock time in seconds for input relations containing 12,000 records

Machine            p   Unoptimised         Optimised           gain
                       time     speedup    time     speedup
Cray T3E           1   6.44     1.00       5.37     1.00       17%
                   2   4.30     1.50       3.38     1.59       21%
                   4   2.48     2.60       1.85     2.90       25%
                   8   1.23     5.23       1.04     5.17       15%
                  16   0.68     9.43       0.67     8.02        1%
Cray T3D           1   27.18    1.00       22.41    1.00       18%
                   2   13.18    2.06       10.88    2.06       17%
                   4   6.89     3.94       5.70     3.93       17%
                   8   3.29     8.25       3.07     7.30        7%
                  16   1.66     16.34      1.89     11.88     -14%
SGI Origin 2000    1   2.99     1.00       2.42     1.00       19%
                   2   1.65     1.81       1.27     1.91       23%
                   4   1.26     2.37       1.11     2.16       12%
                   8   0.88     3.39       0.77     3.15       13%
Intel NOW1         1   4.96     1.00       4.04     1.00       18%
                   2   78.21    0.06       1.65     2.45       98%
                   4   174.77   0.03       88.76    0.05       49%
                   6   216.42   0.02       101.33   0.04       53%

1. A 10Mbps Ethernet network of 266MHz Pentium Pro processors.
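As a reading aid (our inference; the paper does not spell out the formula), the gain column is consistent with the relative reduction in wall-clock time:

    gain = (unoptimised time - optimised time) / unoptimised time

For example, on the Cray T3E at p = 1 this gives (6.44 - 5.37) / 6.44 ≈ 0.166, i.e. the 17% shown in the table.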
References
1. A. V. Gerbessiotis and C. J. Siniolakis. Deterministic sorting and randomized median finding on the BSP model. In Proceedings of the 8th ACM Symposium on Parallel Algorithms and Architectures, Padova, Italy, June 1996. ACM Press.
2. J. M. D. Hill, S. Donaldson, and D. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Programming Research Group, Oxford University Computing Laboratory, October 1997.
3. J. M. D. Hill, S. Jarvis, C. Siniolakis, and V. P. Vasilev. Portable and architecture independent parallel performance tuning using a call-graph profiling tool. In 6th EuroMicro Workshop on Parallel and Distributed Processing (PDP'98). IEEE Computer Society Press, January 1998.
4. J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP Programming Library. Parallel Computing, to appear 1998. See www.bsp-worldwide.org for more details.
5. S. A. Jarvis. Profiling large-scale lazy functional programs. PhD thesis, Computer Science Department, University of Durham, 1996.
6. D. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249–274, Fall 1997.
7. K. R. Sujithan and J. M. D. Hill. Collection types for database programming in the BSP model. In 5th EuroMicro Workshop on Parallel and Distributed Processing (PDP'97). IEEE Computer Society Press, Jan. 1997.
8. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.
A Graphical Tool for the Visualization and Animation of Communicating Sequential Processes Ali E. Abdallah Department of Computer Science The University of Reading Reading, RG6 6AY, UK [email protected] http://www.fmse.cs.reading.ac.uk/people/aea
Abstract. This paper describes some aspects of an interactive graphical tool designed to exhibit, through animation, the dynamic behaviour of parallel systems of communicating processes. The tool, called VisualNets, provides functionality for visually creating graphical representations of processes, connecting them via channels, defining their behaviours in Hoare's CSP notation and animating the evolution of their visualization with time. The tool is very useful for understanding concurrency, analysing various aspects of distributed message-passing algorithms, detecting deadlocks, identifying computational bottlenecks, and estimating the performance of a class of parallel algorithms on a variety of MIMD parallel machines.
1
Introduction
The process of developing parallel programs is known to be much harder than that of developing sequential programs. It is also not as intuitive. This is especially the case when an explicit message passing model is used. The user usually takes full responsibility for identifying parallelism, decomposing the system into a collection of parallel tasks, arranging appropriate communications between the tasks, and mapping them onto physical processors. Essentially, a parallel system can be viewed as a collection of independent sequential subsystems (processes) which are interconnected through a number of common links (channels) and can communicate and interact by sending and receiving messages on those links. While understanding the behaviour of a sequential process in isolation is a relatively straightforward task, studying the effect of placing a number of such processes in parallel can be very complex indeed. The behaviour of all the processes becomes inter-dependent in ways which are not at first obvious, making it very difficult to comprehend, reason about, and analyse the system as a whole. Although part of this difficulty can be attributed to the introduction of new aspects such as synchronizations and communications, the main problem lies in the inherent complexity of parallel control as opposed to sequential control.
Visualizations are used to assist in absorbing large quantities of information very rapidly. Animations add to their worth by allowing visualizations to evolve with time and, hence, making apparent the dynamic behaviour of complex systems. VisualNets is an interactive graphical tool for facilitating the design and analysis of synchronous networks of communicating systems through visualization and animation.
Fig. 1. A screenshot showing the animation of an insertion sort algorithm
The tool, depicted in Fig. 1, allows the user to interactively create graphical representations of processes, link them via channels to form specific network configurations, describe the behaviour of each process in the network in Hoare’s CSP notation [11], and study an animated simulation of the network in execution. The CSP syntax is checked before the animation starts. The visualization of the network animates communications as they take place by flashing the values being communicated over channel links. The tool builds a timing diagram to accompany the animation and allows a more in-depth analysis to be carried out. Networks can be built either as part of an orchestrated design process, by systematic refinement of functional specifications using techniques described in [1,2], or “on the fly” for rapid prototyping. The benefits of using visualization as an aid for analysing certain aspects of parallel and distributed systems has long been recognised. The main focus
in this area has been on tools for monitoring and visualizing the performance of parallel programs. These tools usually rely on trace data stored during the actual execution of a parallel program in order to create various performance displays. A good overview of relevant work in this area can be found in [10,13]. A different approach aimed at understanding not only the performance of parallel and distributed systems but also their logical behaviours is based on simulation and animation for specific architectures [6,5,12]. The remainder of the paper is organised as follows. Section 2 introduces various features of the tool through a case study and Section 3 concludes the paper and indicates future directions.
2
A Case Study
We will give a brief description of the functionality of the tool through the development of a distributed sorting algorithm based on insertion sort. Given a non-empty list of values, the insertion sort algorithm, isort, starts from the empty list and constructs the final result by successively inserting each element of the input list at the correct position in an accumulated sorted list. Therefore, sorting a list, say [a1, a2, ..., an], can be visualized as going through n successive stages. The ith intermediate stage, say insert_i, holds the value ai, takes the sorted list isort [a1, a2, ..., ai−1] as input from the preceding stage and returns the longer list isort [a1, a2, ..., ai] to the following stage.
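To make the insertion strategy concrete, the following small C fragment (ours, purely illustrative; the function and variable names are hypothetical and do not come from the tool) performs the same job sequentially, with each call of insert_value playing the role of one insert_i stage that extends an already sorted prefix by one element:

    #include <stdio.h>

    /* Insert value a into the sorted array s of length n (capacity at least n+1). */
    static void insert_value(int s[], int n, int a)
    {
        int i = n;
        while (i > 0 && s[i - 1] > a) {  /* shift larger elements one place right */
            s[i] = s[i - 1];
            i--;
        }
        s[i] = a;                        /* place a at its correct position */
    }

    int main(void)
    {
        int input[] = {8, 5, 3, 9, 8, 4, 5};   /* the list sorted in Fig. 2 */
        int sorted[7];
        int k, n = 0;

        for (k = 0; k < 7; k++)                /* stage k+1 corresponds to insert_(k+1) */
            insert_value(sorted, n++, input[k]);
        for (k = 0; k < 7; k++)
            printf("%d ", sorted[k]);          /* prints: 3 4 5 5 8 8 9 */
        printf("\n");
        return 0;
    }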
Fig. 2. Network configuration for parallel insertion sort
2.1
Creating a Graphical Network
The network configuration for sorting the list [8, 5, 3, 9, 8, 4, 5] is depicted in Fig. 2. Each process in the network is represented graphically as a box with some useful information, such as values of local parameters and process name, inside it. Processes in the network can be linked via channels to form any topographical configuration. The construction of a network involves operations for adding, removing, and editing both processes and channels. The interface allows these operations to be carried out visually and interactively.
Fig. 3. A CSP definition of INSERT(a)
2.2
Defining Process Behaviour
Having defined the network configuration, the next stage is to define the behaviour of each process in the network. This is done using a subset of CSP. The screen shot captured in Fig. 3 shows the CSP definition for the process INSERT(a), which is stored in a library of useful processes. The processes INSERT_i, 1 ≤ i ≤ 7, depicted in the above sorting network are all defined as specific instances of the library process INSERT(a) by appropriately instantiating the value parameter a and renaming the channels left and right.
Fig. 4. Animation of the network
In CSP, outputting a specific value v on a channel c is denoted by the event c!v; inputting a value on channel c and storing it in a local variable x is denoted by the event c?x. The arrow → denotes prefixing an event to a process. The notation P <| b |> Q, where b is a boolean expression and P and Q are processes, is just an infix form of the traditional selection construct if b then P else Q. Note that the prefix and conditional operators associate to the right. The special message eot is used to indicate the end of a stream of messages transmitted on a channel. The process COPY denotes a one-place buffer. For
any list s, the process Prd(s) outputs the values of s in the same order on channel right, followed by the message eot. The process GEN is Prd([]).
Fig. 5. Timing diagram for the network
2.3
Animation of the Network
Having created the network and defined each process, we can now graphically animate the execution of the network by selecting the run button for continuous animation, or the step button to show the new state of each process in the network after one time step. Channels in each process are colour coded to reflect the state of the process as follows: red for unable to communicate, green for ready to communicate, and white when the process has successfully terminated. When the indicators of a channel between two processes are both green, the communication takes place at the next time step. Fig. 4 clearly illustrates that at the next time step, communications can only happen on channels with green (light grey) colour on both ends of the links. These are channels c2, c4, c6, and c8. Fig. 5 illustrates the timing diagram for animating the network. It contains a record of all communications on each channel in the network coupled with the appropriate time stamp. The evolution of each individual process in the network can be dynamically monitored during the animation of the network. An indicator highlights the exact place in the code of the process which will be executed at the next time step. For example, the cursor in Fig. 6 indicates that the process INSERT_2 is willing to output the value 8 at the next time step.
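The firing rule that drives the animation can be paraphrased in a few lines of C (our sketch only; the data structures and names are invented for illustration and are not taken from VisualNets):

    #define NCHAN 9                                    /* e.g. channels c1 .. c9 of the sorting network */

    enum port_state { BLOCKED, READY, TERMINATED };    /* red, green, white in the tool */

    struct channel {
        enum port_state sender;                        /* state of the output end */
        enum port_state receiver;                      /* state of the input end  */
    };

    /* One time step: a communication fires exactly on those channels
       whose two ends are both ready ("green on both ends"). */
    static void step(const struct channel ch[], int fired[])
    {
        int c;
        for (c = 0; c < NCHAN; c++)
            fired[c] = (ch[c].sender == READY && ch[c].receiver == READY);
    }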
Fig. 6. Monitoring control within an individual process during network animation
Fig. 7. The process INS(a, s, t).
Fig. 8. Timing diagram for the execution of the new network.
2.4
Alternative Designs
The process INSERT(a) is just one valid implementation of the function insert(a), which takes a sorted list of values and inserts the value a at the appropriate position so that the extended list is also sorted. Another process INSERT(a) which also correctly implements insert(a) can be defined as depicted in Fig. 7. In this definition, the additional variable s (and t, respectively) is used to accumulate the list of input values which are strictly less than a (and greater than or equal to a, respectively).
INSERT(a) = INS(a, [], [])
The timing diagram of the new network is shown in Fig. 8. The behaviour of the network is completely sequential. Parallelism is only syntactic; each stage in the pipeline needs to terminate completely before the following stage starts. One of the strengths of VisualNets is that it allows the user to alter the network being investigated very quickly, so that variations can be tested and compared interactively. After each change the user can immediately run the network again and view the animation and the timing diagram, as in Fig. 8, until a good design is reached.
3
Conclusions and Future Works
In this paper we have presented a graphical tool for the visualization, simulation, and animation of systems of communicating sequential processes. A brief overview of the functionality of the tool is described through the process of developing a distributed solution to a specific problem. Such a tool is of great educational value in assisting the understanding of concurrency and in illustrating many distributed computing problems and the techniques underlying their solutions. Perhaps the most important aspect of the tool is the ability to visually alter a design, experiment with ideas for overcoming specific problems, and investigate, through animations, the potential consequences of certain design decisions. The tool proved very useful in detecting, through animation, undesirable behaviours such as bottlenecks and deadlock [3]. However, currently VisualNets cannot deal with non-deterministic processes, and our underlying visualisation techniques are not easily scalable. Work is presently in progress on a new version of the tool, rewritten in Sun Microsystems' Java language. The tool is platform-independent and will operate on UNIX or Windows-based systems. The new tool implements a considerably larger set of CSP operators, and can deal with networks that synchronise on events as well as on channel input or output. Internal parallelism within processes is supported, permitting a smaller network to be visualised as a single process and later zoomed to display the detail of the internal communications. The emphasis of the new project is on developing an advanced visualisation and animation tool and integrating it within an environment which allows higher levels of abstraction and capabilities for systematic specification refinement and program transformation.
Acknowledgements The work reported in this paper has evolved over several years and is the result of working very closely with several final year and postgraduate Computer Science students at the University of Reading, UK. In particular, I would like to thank Aiden Devine, Robin Singleton (for the implementation of the current tool) and Mark Green for developing the new Java implementation.
References
1. A. E. Abdallah, Derivation of Parallel Algorithms from Functional Specifications to CSP Processes, in: Bernhard Möller, ed., Mathematics of Program Construction, LNCS 947 (Springer Verlag, 1995) 67–96.
2. A. E. Abdallah, Synthesis of Massively Pipelined Algorithms for List Manipulation, in L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, eds, Proceedings of the European Conference on Parallel Processing, EuroPar'96, LNCS 1024 (Springer Verlag, 1996), pp. 911–920.
3. A. E. Abdallah, Visualization and Animation of Communicating Processes, in N. Namazi and K. Matthews, eds, Proc. of IASTED Int. Conference on Signal and Image Processing, SIP-96, Orlando, Florida, USA (IASTED/ACTA Press, 1996), 357–362.
4. R. S. Bird and P. Wadler, Introduction to Functional Programming (Prentice-Hall, 1988).
5. Rajive L. Bagrodia and Chien-Chung Shen, MIDAS: Integrated Design and Simulation of Distributed Systems, IEEE Transactions on Software Engineering, 17 (10), 1993, pp. 1042–1058.
6. Daniel Y. Chao and David T. Wang, An Interactive Tool for Design, Simulation, Verification and Synthesis of Protocols, Software and Experience, 24 (8), 1994, pp. 747–783.
7. Jim Davies, Specification and Proof in Real-Time CSP (Cambridge University Press, 1993).
8. H. Diab and H. Tabbara, Performance Factors in Parallel Programs, submitted for publication, 1996.
9. Michael T. Heath and Jennifer A. Etheridge, Visualizing the Performance of Parallel Programs, IEEE Software, 8 (5), 1991, pp. 29–39.
10. Michael T. Heath, Visualization of Parallel and Distributed Systems, in A. Zomaya (ed), Parallel and Distributed Computing Handbook (McGraw-Hill, 1996).
11. C. A. R. Hoare, Communicating Sequential Processes (Prentice-Hall, 1985).
12. T. Ludwig, M. Oberhuber, and R. Wismüller, An Open Monitoring System for Parallel and Distributed Programs, in L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, eds, Proceedings of the European Conference on Parallel Processing, EuroPar'96, LNCS 1123 (Springer, 1996) 78–83.
13. Guido Wirtz, A Visual Approach for Developing, Understanding and Analyzing Parallel Programs, in E. P. Glinert, editor, Proc. Int. Symp. on Visual Programming (IEEE CS Press, 1993) 261–266.
A Universal Infrastructure for the Run-time Monitoring of Parallel and Distributed Applications Roland Wismüller, Jörg Trinitis, and Thomas Ludwig Technische Universität München (TUM), Informatik Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM) Arcisstr. 21, D-80333 München {wismuell,trinitis,ludwig}@in.tum.de Abstract. On-line tools for parallel and distributed programs require a facility to observe and possibly manipulate the programs' run-time behavior, a so called monitoring system. Currently, most tools use proprietary monitoring techniques that are incompatible to each other and usually apply only to specific target platforms. The On-line Monitoring Interface Specification (OMIS) is the first specification of a universal interface between different tools and a monitoring system, thus enabling interoperable, portable and uniform tool environments. The paper gives an introduction into the basic concepts of OMIS and presents the design and implementation of an OMIS compliant monitoring system (OCM).
1
Introduction
The development and maintenance of parallel programs is inherently more expensive than that of sequential programs. This is partly due to the lack of widely available, powerful tools supporting test and production phases of those programs, e.g. performance analyzers, debuggers, or load balancers. What these tools have in common is their need for a module that enables them to observe and possibly manipulate the parallel application, a so called monitoring module. Usually every tool includes a monitoring module that is specifically adapted to its needs. Since these modules typically use exclusive low level interfaces to the operating system and possibly the underlying hardware, these layers tend to be incompatible to each other. As a result, it is impossible to connect two or more tools to a running application at the same time. Even worse, if you want to use two different tools, one after the other, you often have to re-compile, re-link, or otherwise re-process the application. In this paper we will describe the design and implementation of a universally usable on-line monitoring infrastructure that is general enough to support a wide variety of tools and their interoperability. We will first describe existing approaches, analyze the requirements for a general monitoring interface, then introduce our approach and finally describe our implementation.
This work is partly funded by Deutsche Forschungsgemeinschaft, Special Research Grant SFB 342, Subproject A1.
2
State of the Art
Existing tools can be classified into interactive vs. automatic, observing vs. manipulating, and on-line vs. off-line tools. Taking a look at existing tools for parallel and distributed programming, one can state that:
– there are more off-line tools than on-line tools,
– most tools are only available on one or very few platforms,
– different on-line tools can typically not be applied concurrently,
– there are no generally available integrated tool environments.
There are several reasons for this. First, developing on-line monitoring systems is complex and expensive. Second, these systems are typically difficult to port from one platform to another. And third, they are usually specifically adapted and restricted to the tools that sit on top of them and require incompatible modifications to the application. Nevertheless, very powerful on-line tools have been developed. Two that are especially interesting concerning their monitoring parts will be mentioned here. In the area of debugging tools, p2d2 by Cheng and Hood [6] is an important representative. Different from most earlier approaches, p2d2 clearly separates the platform specific on-line monitoring system from portable parts of the tool itself. To accomplish this, a client-server approach is used [7]. The p2d2 server contains all platform specific parts, whereas the client consists of the portable user interface. Client and server communicate via a specified protocol. Cheng and Hood encourage platform vendors to implement debugger servers that comply with the p2d2 protocol. Users could then use an existing p2d2 debugger client to debug applications on different machines. Taking a look at performance analysis tools, the ambitious Paradyn project by Miller and Hollingsworth [10] clearly deserves a closer look. Paradyn utilizes the exceptional technique of dynamic instrumentation [5], where code snippets performing performance analysis tasks are dynamically inserted into and removed from the running process’s code. The interface between the tool and the dynamic instrumentation library dyninst is clearly defined and published [3]. On the higher level, Paradyn uses the so called W3 -model to automatically determine performance bottlenecks [4,10]. The W3 -model in particular profits from dyninst because intrusion can be kept low by removing unnecessary instrumentation. The above tools are two of the very few that define an explicit inter-face between the monitoring system and the tool itself. By clearly defining this interface it becomes possible to separate the development of the tool from that of the on-line monitoring system. In addition, porting the tool becomes much easier, thus improving the availability of tools. Unfortunately, p2d2 and Paradyn both target only a single class of tools: p2d2 is limited to debugging, whereas Paradyn/dyninst is concentrated on performance analysis. It is impossible to build new tools or interoperable tools with either of these environments. OMIS, on the other hand, is designed with that primary goal in mind.
3
The OMIS Approach
One key idea in OMIS (On-line Monitoring Interface Specification) [9] is to define a standardized interface between the tools and the on-line monitoring system. This is comparable to p2d2 and Paradyn/dyninst. Our approach however is much more general as we do not concentrate on a single class of tools. OMIS supports a wide variety of on-line tools, including centralized, decentralized, automatic, interactive, observing, and manipulating tools (off-line tools can be implemented by a tiny on-line module writing trace files). The tools we initially had in mind were those in The Tool-set [12], developed at LRR-TUM: VISTOP (program flow visualization), PATOP (performance analysis) [2], DETOP (debugging) [11], CoCheck (checkpointing/process migration), LoMan (load balancing), Codex (controlled deterministic execution), and Viper (computational steering). But we do not only want to support classical stand-alone tools. OMIS is intended to also support interoperable tools (two or more tools applied at the same time), e.g. a performance analyzer covering and measuring the dynamic load balancing and process migration as well. Another example is a debugger interoperating with a checkpointing system and a deterministic execution unit, thus enabling the user to start debugging sessions from checkpoints of long lasting program runs. OMIS will also allow to build uniform tool environments. It will thus become possible to develop portable development environments for parallel and distributed programming. In order to achieve all of the above, the monitoring system has to be programmable. OMIS supports this by specifying so called event/action-relations. An event/action-relation consists of a (possibly empty) event definition and a list of actions, where the event specification defines a condition for the actions to be invoked. Every time the event occurs, the actions are executed (in case of an empty event, they are executed immediately). The OMIS core already provides a rich set of events and actions. In addition to this, there are user defined events. The main interface between the tool and the monitoring system consists of only a single function. To this function, requests from the tool are specified as event/action-relations in a machine independent string representation. In order to minimize intrusion, requests can be quickly disabled and re-enabled when e.g. a measurement is unnecessary for a while. When it is no longer needed, the whole request can be deleted from the monitoring system. Depending on how often the corresponding event occurs and what flags were specified, a single request can result in 0...n replies while it is active. Every tool that connects to an OMIS compliant monitor is able to define its own view of the observed system. This is done by explicitly attaching to the nodes and processes of interest. Different tools can thus have different views, and are only informed about events taking place within their observed system. This is especially useful in non-exclusively used target environments. While not being object oriented in the usual sense, objects form an important part of the interface. OMIS represents processes, threads, nodes, requests etc. as objects, which are identified by so called tokens. To improve scalability, all events and actions can operate on object sets. The objects are automatically lo-
calized within the observed (distributed) system and the requests are distributed according to the objects’ locations (location transparency). OMIS also automatically converts object tokens where appropriate. For example, a process token can be converted into the token of the node where this process is currently located. Although the above concepts are very powerful if wisely applied, there always remain situations where they do not suffice to support some specific environment. To overcome this, OMIS employs another concept: extendibility. Two types of extensions are defined: Tool extensions (e.g. for distributed tools) introduce new actions into the monitoring system, whereas monitor extensions can introduce new events, actions and objects. One monitor extension is, for example, the PVM extension. It introduces message buffers, barrier objects and services (events and actions) related to the PVM programming model. A detailed description of the current version of OMIS can be found in [9] and on the OMIS home page in the WWW1 .
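To make the request format described above concrete, here is a small illustrative event/action-relation of our own, built only from services mentioned elsewhere in this paper (the exact spelling and argument syntax are defined by the OMIS 2.0 document [9], so treat this as a sketch):

    thread_has_started_lib_call([p_1], "pvm_send"): thread_stop([p_1])

Read as an event/action-relation, it asks the monitoring system to stop the threads of the process identified by token p_1 whenever one of them enters pvm_send; in effect, it acts like a breakpoint on a library call.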
4
Design of an OMIS Compliant Monitoring System
Starting from the OMIS specification, we have done a first implementation of an OMIS compliant monitoring system, called OCM. In contrast to OMIS itself, which is not oriented towards specific programming libraries or hardware platforms, the OCM currently targets PVM applications on networks of UNIX workstations. However, we also plan to extend this implementation to MPI and NT workstations. Goals of the OCM project. The implementation of the OCM serves three major goals. First, it will be used as an implementation platform for the ongoing tool developments at LRR-TUM, and also at other institutes. E.g. the MTA SZTAKI Research Institute of the Hungarian Academy of Science will use the OCM to monitor programs designed with the GRADE graphical programming environment[8]. 2 The second goal is to validate the concepts and interfaces specified by OMIS and to gain experience guiding further refinements of the specification. Finally, the OCM also serves as a reference for other implementors that want to develop an OMIS compliant monitoring system. Therefore, the OCM will be made available under the terms of the GNU license. Global structure of OCM. Since the current target platform for the OCM is a distributed system where virtually all of the relevant information can only be retrieved locally on each node, the monitoring system must itself be distributed. Thus, we need one or more local monitoring components on each node. We decided to have one additional process per node, which is automatically started when a tool issues the first service request concerning that node. An alternative would be to completely integrate the monitoring system into the programming 1 2
1. http://wwwbode.informatik.tu-muenchen.de/~omis
2. Funded by the German/Hungarian intergovernmental project GRADE-OMIS, contract UNG-053-96.
libraries used by the monitored application, as it is done with many trace collectors. However, we need an independent process, since an OMIS compliant monitoring system must be able to autonomously react on events, independently of the application’s state. In addition, the monitoring support offered by the UNIX operating system requires the use of a separate process. A principal decision has been to design these monitor processes as independent local servers that do not need any knowledge on global states or the other monitor processes. Thus, each monitor process offers a server interface that is very similar to the OMIS interface, with the exception that it only accepts requests that can be handled completely locally. However, since OMIS allows to issue requests that relate to multiple nodes, we need an additional component in the OCM. This component, called Node Distribution Unit (NDU), has to analyze each request issued by a tool and must split it into pieces that can be processed locally on the affected nodes. E.g. a tool may issue the request :thread stop[p 1,p 2]3 in order to stop all threads in the processes identified by the tokens p 1 and p 2. In this case, the NDU must determine the node executing process p 1 and the node executing p 2. If these nodes are different, the request must be split into two separate requests, i.e. :thread stop[p 1] and :thread stop[p 2], which are sent to the proper nodes. In addition, the NDU must also assemble the partial answers it receives from the local monitor processes into a global reply sent to the tool. In order to keep our first implementation of the OCM manageable, the NDU is currently implemented as a single central process. Thus, a typical scenario of the OCM looks as in Figure 1. The figure also shows that the NDU contains a request parser that transforms the requests specified by the tools into a preprocessed binary data structure that can be handled efficiently by all other parts of the monitoring system. The communication between all of these components is handled by a small communication layer offering non-blocking send and interrupt-driven receive operations. We need interrupt-driven receives (or related techniques like Active Messages) to avoid blocking of the monitor processes. It would result in the inability to timely react on occurrences of other important events. A polling scheme using non-blocking receive operations is not feasible, because it consumes too much CPU time and thus leads to an unacceptable probe effect. In the current version of the OCM, the communication layer is based on PVM. However, we are working on an independent layer based on sockets, which is necessary for the MPI port. As is indicated in Figure 1, tools are linked against a library that implements the procedural interface defined by OMIS and transparently handles the connection to the NDU and the necessary message passing. Structure of the local monitor. The local monitor process operates in an event driven fashion. This means that the process is normally blocked in a system call (either wait or poll, depending on whether we use ptrace or /proc for process 3
The leading colon in the request is the separator between event definition and action list. Since the event definition is empty, the action must be executed immediately.
Fig. 1. Coarse Structure of the OCM
control), which returns when an event in one of the monitored processes has been detected by the operating system. In addition, the call is interrupted when a message arrives. After the monitor process has been unblocked, it analyzes the exact kind of event that has occurred and executes the list of actions that are associated with that event. If the event happens to be the arrival of a service request, the associated action causes the request to be received and analyzed. In the case of an unconditional request (i.e. a request with an empty event definition), it is executed and the results are sent back. When a conditional request is received, the monitor process dynamically inserts or activates the necessary instrumentation of the target system and stores the request's action list, so it can be executed later when the event occurs. Unfortunately, this relatively easy execution scheme is insufficient due to efficiency constraints. For example, assume that a performance analysis tool wants to measure the time process p_1 spends in pvm_send calls. The tool could then issue the following requests:
thread_has_started_lib_call([p_1],"pvm_send"): pa_timer_start(ti_1)
thread_has_ended_lib_call([p_1],"pvm_send"): pa_timer_stop(ti_1)
Thus, a timer is started when pvm_send is called by p_1 (more exactly, when pvm_send is called by any thread in process p_1; since PVM is not multi-threaded, there is only one thread in the process) and is stopped again when the call returns. If event detection and action execution are only performed in the monitor process, we introduce an overhead of at least four context switches for each library call, which renders performance analysis useless. Thus, the OCM must be able both to detect certain events within the context of the triggering application process and to execute simple actions directly in that process, thereby removing the need for context switching. To achieve this, the monitored processes must be linked against instrumented libraries, i.e. extended versions of standard programming libraries that include additional parts of the monitoring system. In order to tell the monitoring parts in these libraries what they shall do, we use a shared memory segment that stores information on which events to monitor and which actions to execute. Note that actions in the application processes should not return results (except in the case of errors), since this would again result in context switches. Rather, they update data stored in the shared memory segment (e.g. the timer value in the above example), which are read by actions executed in the monitor process upon request from a tool. With this extension, the final structure shown in Figure 1 results. Note that the OCM does not require that each monitored application process is linked with an instrumented library. However, if it is not, some events and actions may not be available for this process.
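One conceivable shape of such an instrumented wrapper is sketched below in C (our illustration only: the OCM's real instrumentation and its shared-memory layout are not described at this level of detail here, and the GNU linker's --wrap mechanism is merely one possible way to interpose on pvm_send):

    #include <sys/time.h>

    /* Monitoring data kept in the shared memory segment set up by the local
       monitor process; the layout is invented for this sketch. */
    struct mon_shm {
        int    timer_enabled;      /* set by the monitor while a timer request is active */
        double accumulated_time;   /* later read by the monitor on behalf of the tool    */
    };

    static struct mon_shm *mon;    /* would point into the attached shared segment */

    /* Real PVM routine, resolved when linking with -Wl,--wrap,pvm_send. */
    extern int __real_pvm_send(int tid, int msgtag);

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Wrapper substituted for pvm_send by the instrumented library: the timing
       happens inside the application process, so no context switch to the
       monitor process is needed per call. */
    int __wrap_pvm_send(int tid, int msgtag)
    {
        int enabled = (mon != 0 && mon->timer_enabled);
        double t0 = enabled ? now() : 0.0;
        int rc = __real_pvm_send(tid, msgtag);
        if (enabled)
            mon->accumulated_time += now() - t0;
        return rc;
    }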
5
Project Status and Future Work
A first implementation of the OCM, as it is described in Section 4, has already been finished. It is currently supporting PVM 3.3.x and PVM 3.4 on networks of workstations running Solaris, Linux, and Irix. However, some of the services defined by OMIS still need to be implemented. In order to test the OCM, the Tool-set debugger has been adapted and is already operational. Our current work includes the specification of an interface allowing extensions of OMIS compliant monitoring systems to be built independently of the individual implementation of these monitoring systems. This interface is supported by a stub generator that eases the programming of new events and actions. Future work will focus on adapting the OCM to the MPI implementation mpich, and on porting it to the operating system Windows NT. In addition, we will use the OCM as the basis for providing interoperable tools within The Tool-set project [12]. Finally, as a more long-term goal, we will implement the OCM also for distributed shared memory systems, both based on hardware support (SCI coupled cluster of PCs using a hardware monitoring module [1]) and pure software implementations.
6
Acknowledgments
We would like to thank Manfred Geischeder, Michael Uemminghaus, and Hans-Günter Zeller for their participation in the OCM design. Klaus Aehlig, Alexander Fichtner, Carsten Isert, and Andreas Schmidt are still busy with implementing the OCM. We appreciate their support. Special thanks to Michael Oberhuber and Vaidy Sunderam who have been heavily involved in discussing OMIS, and to all other people that provided feedback.
References
1. G. Acher, H. Hellwagner, W. Karl, and M. Leberecht. A PCI-SCI Bridge for Building a PC-Cluster with Distributed Shared Memory. In Proc. 6th Intl. Workshop on SCI-based High-Performance Low-Cost Computing, pages 1–8, Santa Clara, CA, Sept. 1996.
2. O. Hansen and J. Krammer. A Scalable Performance Analysis Tool for PowerPC Based MPP Systems. In Mirenkov et al., editors, The First Aizu Intl. Symp. on Parallel Algorithms / Architecture Synthesis, pages 78–84, Aizu, Japan, Mar. 1995. IEEE Comp. Soc. Press.
3. J. K. Hollingsworth and B. Buck. DynInstAPI Programmer's Guide Release 1.1. Computer Science Dept., Univ. of Maryland, College Park, MD 20742, May 1998.
4. J. K. Hollingsworth and B. P. Miller. Dynamic Control of Performance Monitoring on Large Scale Parallel Systems. In Intl. Conf. on Supercomputing, Tokyo, July 1993.
5. J. K. Hollingsworth, B. P. Miller, and J. Cargille. Dynamic Program Instrumentation for Scalable Performance Tools. In Scalable High-Performance Computing Conference, pages 841–850, Knoxville, TN, May 1994.
6. R. Hood. The p2d2 Project: Building a Portable Distributed Debugger. In Proc. SPDT'96: SIGMETRICS Symposium on Parallel and Distributed Tools, pages 127–136, Philadelphia, Pennsylvania, USA, May 1996. ACM Press.
7. R. Hood and D. Cheng. Accommodating Heterogeneity in a Debugger – A Client-Server Approach. In Proc. 28th Annual Hawaii Intl. Conf. on System Sciences, Volume II, pages 252–253. IEEE, Jan. 1995.
8. P. Kacsuk, J. Cunha, G. Dózsa, J. Lourenço, T. Antao, and T. Fadgyas. GRADE: A Graphical Development and Debugging Environment for Parallel Programs. Parallel Computing, 22(13):1747–1770, Feb. 1997.
9. T. Ludwig, R. Wismüller, V. Sunderam, and A. Bode. OMIS — On-line Monitoring Interface Specification (Version 2.0), volume 9 of LRR-TUM Research Report Series. Shaker Verlag, Aachen, Germany, 1997. ISBN 3-8265-3035-7.
10. B. P. Miller, J. M. Cargille, R. B. Irvin, K. Kunchithap, M. D. Callaghan, J. K. Hollingsworth, K. L. Karavanic, and T. Newhall. The Paradyn parallel performance measurement tools. IEEE Computer, 11(28), Nov. 1995.
11. M. Oberhuber and R. Wismüller. DETOP – An Interactive Debugger for PowerPC Based Multicomputers. In P. Fritzson and L. Finmo, editors, Parallel Programming and Applications, pages 170–183. IOS Press, Amsterdam, May 1995.
12. R. Wismüller, T. Ludwig, A. Bode, R. Borgeest, S. Lamberts, M. Oberhuber, C. Röder, and G. Stellner. The Tool-set Project: Towards an Integrated Tool Environment for Parallel Programming. In Proc. 2nd Sino-German Workshop on Advanced Parallel Processing Technologies, APPT'97, Koblenz, Germany, Sept. 1997.
Net-dbx: A Java Powered Tool for Interactive Debugging of MPI Programs Across the Internet Neophytos Neophytou and Paraskevas Evripidou Department of Computer Science University of Cyprus P.O. Box 537 CY-1678 Nicosia, Cyprus Tel:+357-2-338705, Fax: 339062 [email protected]
Abstract. This paper describes Net-dbx, a tool that utilizes Java and other WWW tools for the debugging of MPI programs from anywhere in the Internet. Net-dbx is a source level interactive debugger with the full power of gdb augmented with the debug functionality of LAM-MPI. The main effort was on a low overhead but yet powerful graphical interface that would be supported by low bandwidth connections. The portability of the tool is of great importance as well because it enables us to use it on heterogeneous nodes that participate in an MPI multicomputer. Both needs are satisfied a great deal by the use of Internet Browsing tools and the Java programming language. The user of our system simply points his browser to the URL of the Net-dbx page, logs in to the destination system, and starts debugging by interacting with the tool just like any GUI environment. The user has the ability to dynamically select which MPI-processes to view/debug. A working prototype has already been developed and tested successfully.
1
Introduction
This paper presents Net-dbx, a tool that uses WWW capabilities in general and Java applets in particular for portable parallel and distributed debugging across the Internet. Net-dbx is now a working prototype of a full fledged development and debugging tool. It has been tested for the debugging of both Fortran and C MPI programs. Over the last 3-4 years we have seen an immense growth of the Internet and a very rapid development of the tools used for browsing it. The most common form of data traveling in gigabytes all over the net is WWW pages. In addition to information formatted with text, graphics, sound and various gadgets, WWW enables and enhances a new way of accomplishing tasks: teleworking. Although teleworking was introduced earlier, it has been well accepted and enhanced through the use of the web. A lot of tasks, mostly transactions, are done in the browser's screen, enabling us to order and buy things from thousands
of miles away, edit a paper, move files along to machines in different parts of the world. Until recently, all these tasks were performed using CGI scripts 1 . The CGI scripts are constrained in exchanging information only through text-boxes and radio buttons in a standard Web Form . The Java language [1][2] is now giving teleworking a new enhanced form. Java applets allow the user to manipulate graphics with his mouse and keyboard and send information, in real time, anywhere in the Internet. Graphical web applications are now able to use low bandwidth connections, since the interaction with the server does not demand huge amounts of X-Protocol [3] data. This even makes it possible to perform the task of remote program development and debugging from anywhere in the net. Developing and debugging parallel programs has proven to be one of the most hazardous tasks of program development. When running a parallel program, many things can go wrong in many places (processing nodes). It can lead to deadlock situations when there are bugs in the message distribution (in distributed machines), or when access to the common memory is not well controlled. Some processing nodes may crash under some circumstances in which the programmer may or may not be responsible. To effectively debug parallel programs we should know what went wrong in each of these cases [4]. In order to effectively monitor the programs execution, there are two approaches: Post mortem and runtime debugging. In post mortem debugging, during execution, the compiled program, or the environment, keeps track of everything that happens and writes every event in special files called trace files. These trace files are then parsed using special tools to help the user guess when and what went wrong during execution of a buggy program. Runtime tools, on the other side, have access to the programs memory, on each machine, and using operating system calls they can stop or resume program execution on any execution node. These tools can change variables during the execution and break the program flow under certain conditions. The prototype described in this paper is a runtime source-level debugging tool. It enhances the capabilities of the gdb debugger [5],[6] , and the monitoring facilities of LAM [7][8] into a graphical environment. Using Java, we developed a tool that integrates these two programs in a collection of applets in a single Web Page. Our tool acts as an interface to the two existing tools, which provides the user with a graphical interaction environment, and in addition, it optimizes interaction between LAM and gdb.
2
Architecture of Net-dbx
Net-dbx is a client-server package. The server side is the actual MPI-Network, that is, the group of workstations or processors participating in an MPI-Multicomputer. It consists of tools that are to be installed locally on each Node and
1. CGI stands for Common Gateway Interface. It consists of a server program, residing on the HTTP server, which responds to a certain form of input (usually entered in web forms) and produces an output Web page to present results to the user.
can be used for individually debugging the processes running on that Node. These tools include the MPI runtime environment (we currently use the LAM implementation) and a source level runtime debugger (we currently use gdb). On the client side there should be a tool that integrates debugging of each Node and provides the user with a wider view of the MPI program which runs on the whole Multicomputer. The program residing in the client side is an applet written in Java. It is used to integrate the capabilities of the tools which rely on the server side. We can see Net-dbx as an environment consisting of two layers: the lower layer which is the MPI Network and the higher layer as the applet on the Web browser running on the user’s workstation. To achieve effective debugging of an MPI Parallel Program, one must follow the execution flow on every node of the multicomputer network. The status of messages and status of the MPI processes have to be monitored. Net-dbx is designed to relieve the user from the tedious procedure of manual parallel debugging. This procedure would involve him capturing PIDs as a parallel program is started, connecting to each node he wants to debug via telnet, and attaching the running processes to a debugger. There are also several synchronization issues that make this task even more difficult to achieve. In addition, Net-dbx offers a graphical environment as an integrated parallel debugging tool, and it offers means of controlling/monitoring processes individually or in groups. 2.1
Initialization Scheme
As mentioned above, to attach a process for debugging you need its PID. But that is not known until the process starts to run. In order to stall the processes until they are all captured by the debugger, a small piece of code has to be added to the source code of the program: an endless loop, depending on a variable, and two MPI barriers that hold the processes waiting until they are captured. In a Fortran program the synchronization code looks like the following:

      if ( myid .eq. 0 ) then
        dummydebug = 1
        do while ( dummydebug .eq. 1 )
        enddo
      endif
      call MPI_BARRIER( MPI_COMM_WORLD, ierr )
      call MPI_BARRIER( MPI_COMM_WORLD, ierr )

The variable myid represents the process' rank. As soon as all the processes are attached to the debugger, we can proceed to setting a breakpoint on the second MPI_BARRIER line and then setting the dummydebug variable on the root process to 0, so that it will get out of the loop and allow the MPI barriers to unlock the rest of the processes. After that, the processes are ready to be debugged. The dummy code is added by the tool, which uses standard searching techniques to find the main file and then compiles the code transparently to the user.
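Since Net-dbx has also been tested with C MPI programs, the equivalent stall fragment in C would look roughly as follows (our translation of the Fortran above; the function name netdbx_stall is hypothetical and not part of the tool):

    #include <mpi.h>

    void netdbx_stall(int myid)
    {
        if (myid == 0) {
            volatile int dummydebug = 1;    /* the debugger later sets this to 0 */
            while (dummydebug == 1)
                ;                           /* spin until released by the debugger */
        }
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);        /* the breakpoint is set on this line */
    }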
2.2
Telnet Sessions to Processing Nodes
Telnet Sessions are the basic means of communication between the client and the server of the system. For every process to be debugged, the Java applet initiates a telnet connection to the corresponding node. It also has to initiate some extra telnet connections for message and process task monitoring. The I/O communication with this connection is done by exchanging strings, just like a terminal where the user sends input from the keyboard and expects output on the screen. In our case, the Telnet Session must be smart enough to recognize and extract information from the connection's responses. The fact that standard underlying tools are used guarantees the same behavior for all the target platforms. Standard functionality that is expected from the implementation of the Telnet Session's role is to provide abstractions for:
– setting/unsetting breakpoints,
– starting/stopping/continuing execution of the program,
– setting up and reporting on variable watches,
– evaluating/changing the value of expressions,
– providing the source code (to the graphical interface), and
– capturing the Unix PIDs of every active process on the network (for the telnet session used to start the program to be debugged); a possible command-level form of such an interaction is sketched below.
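To give a feel for what such a session exchanges with the server side, a hand-driven equivalent of the attach step might look like the following (illustrative only; host names, paths, PIDs and line numbers are made up, and the exact command sequence used by Net-dbx is not listed in this paper):

    rsh node03 gdb /home/user/myprog 27041   (attach gdb to the MPI process with PID 27041)
    (gdb) break myprog.c:42                  (set a breakpoint)
    (gdb) continue                           (resume the stalled process)
    (gdb) print count                        (inspect a variable)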
In addition, the telnet Session needs to implement an abstraction for the synchronization procedure in the beginning. That is, if it is process 0 the session should wait for every process to attach and then proceed to release (get out of the loop) the stalled program. If it is a process ranked 1..n, then it should just attach to the debugger, set the breakpoints and wait. This can be achieved using a semaphore that will be kept by the object which will ”own” all the telnet Sessions. In programming telnet session’s role, standard Java procedures were imported and used, as well as third-party downloaded code, incorporated in the telnet part of our program. To the existing functionality we added interaction procedures with Unix shell, GNU Debugger (gdb), and with LAM-MPI so that the above abstractions were implemented. One of the major security constraints posed in Java is the rule that applets can have an Internet connection (of any kind-telnet, FTP, HTTP, etc) only with their HTTP server host [9], [10]. This was overcome by having all telnet connections to the server and then rsh to the corresponding nodes. This approach assumes that the server can handle a large number of telnet connections and that the user will be able to login to the server and then rsh to the MPI network nodes. Telnet Sessions need to run in their own thread of execution and thus need to be attached to another object which will create and provide them with that thread. A Telnet Session can be attached either to a graphical DebugWindow, or to a non-graphical global control wrapper. Both of these wrappers acquire control to the process under debugging using the abstractions offered by the telnet session.
2.3
Integration/Coordination
As mentioned above, several telnet sessions are needed in order to debug an MPI program in the framework that we use. All these sessions need a means of coordination. The coordinator should collect the starting data (which process runs on which processing node) from the initial connection and redistribute this information to the other sessions so that they can telnet to the right node and attach the right process to the debugger. A large array holds all the telnet sessions. The size is determined by the number of processes that the program will run. Several pointers of the array will be null, as the user might not want to debug all the running processes. The other indices are pointing to the wrappers of the according telnet sessions that will be debugged. The wrapper of a telnet session can be either a graphical DebugWindow, described in the next subsection, or a non-graphical TelnetSessionWrapper, which only provides a thread of execution to the telnet session. Additionally it provides the methods required for the em coordinator to control the process together with a group of other processes. The capability of dynamically initiating a telnet session to a process not chosen from the beginning is a feature under development. As the user chooses, using the user interface, which processes to visualize, and which are in the group controlled by the coordinator, the wrapper of the affected process will be changed from a TelnetSessionWrapper to a DebugWindow and vice-versa. When LAM MPI starts the execution of a parallel program it reports which machines are the processing nodes, where each MPI process is executed and what is its Unix PID. The interpreted data is placed in an AllConnectionData object, which holds all the necessary startup information that every telnet session needs in order to be initialized. After acquiring all the data needed, the user can decide with the help of the graphical environment which of the processes need to be initiated for debugging. After that, the coordinator creates ConnectionData objects for each of the processes to be started and triggers the start of the initialization process. It acts as a semaphore holder for the purposes of the initial synchronization. Another duty of the coordinator is to keep a copy of each source file downloaded so that every source file is downloaded only once. All the source files have to be acquired using the list command on the debugger residing on the server side. These are kept as string arrays on the client side, since access to the user’s local files is not allowed to a Java applet. A future optimization will be the use of FTP connections to acquire the needed files.
3
Graphical Environment
As mentioned in the introduction, the user interface is provided as an applet housed in a Web page. Third-party packages 2 are specially formed to build a fast and robust user interface. The graphical environment that is shown in the main browser screen (see Figure 1) consists of:
– the Process coordinator window (control panel), which is the global control panel of the system;
– the debugging windows, where all the debugging work is done;
– the interaction console, where the user can interact with the main console of the program to be debugged;
– the message window, where all the pending messages are displayed.

2. We modified and used the TextView component, which was uploaded to the gamelan Java directory [11] by Ted Phelps at DTSC Australia. The release version will use the IFC Java widget set, as it appears to be one of the fastest and most robust 100% Compatibility certified Java toolkits.

Fig. 1. A snapshot of the Net-dbx debugger in action.

The Process coordinator window provides a user interface to the coordinator, which is responsible for getting the login data from the user, starting the debugging procedure and coordinating the several telnet sessions. It provides means for entering all the necessary data to start a debugging session, such as UserID, Password and Program Path, which are given by the user. The user is also enabled to select which processes are going to be active, visualized, or controlled within a global process group. The most important visual object used in this environment is the Debug Window in which the code of the program is shown. In addition to the code, as mentioned before, this object has to implement behaviors such as showing the current line of execution and indicating which lines are set as breakpoints. These behaviors are shown using painted text (red color) for the breakpoints, painted text for the line numbers (blue color), and painted and highlighted text for the current line. The appropriate buttons to set/unset breakpoints, next and continue are also provided. The interaction console is actually a typical telnet application. It is the user interface to a special telnet session used by the coordinator to initialize the debugging procedure. After initialization and capturing of all the needed data, the interaction console is left as a normal telnet window within which the user I/O takes place. The message view is the applet responsible for showing the message status. It displays all the information that is given by the LAM environment when an mpimsg command is issued. All the pending messages are shown in a scrolling List object.
4 Related Work
A survey of all the existing graphical debugging tools would be beyond the scope of this paper. However, we present the tools that appear to be most similar to Net-dbx. For an overview of most of the existing parallel tools, the interested reader can visit the Parallel Tools Consortium home page [12].
4.1 XMPI
XMPI [7] is a graphical environment for visualization and debugging, created by the makers of the LAM-MPI environment at the Ohio Supercomputer Center. It basically features graphical representation of the program's execution and message traffic with respect to what each process is doing. XMPI uses a special trace produced by the LAM-MPI environment during the execution. Although XMPI can be used at runtime, it is actually a post-mortem tool and uses the LAM traces for every function that it provides. XMPI seems to be more of a visualization tool for displaying the program execution, but not for controlling it. It is supported on DEC, HP, SGI, IBM and Sun platforms.
4.2 TotalView
This package is a commercial product of Dolphin Interconnect Solutions [13]. It is a runtime tool and supports source-level debugging of MPI programs, targeted to the MPICH implementation by Argonne National Labs. Its capabilities are similar to those of dbx or gdb, which include changing variables during execution and setting conditional breakpoints. It is a multiprocess graphical environment, offering an individual display for each process, showing source code, execution stack, etc. TotalView is so far implemented on Sun4 with SunOS 4.1.x, Digital Alpha with Digital Unix, and IBM RS/6000 platforms.
4.3 p2d2
Another ongoing project is the p2d2 debugger [14] being developed at the NASA Ames Research Center. It is a parallel debugger based on the client/server paradigm and on an object-oriented framework that uses debugging abstractions for graphical environments. It relies on the existence of a debugging server on each of the processing nodes. The debugging server provides abstractions according to a predefined framework and can be attached to the graphical environment at runtime. Following this architecture, the tool achieves portability and heterogeneity on the client side, whereas it depends on the implementation of the server on the server side.
5 Concluding Remarks and Future Work
In this paper we have presented Net-dbx, a tool that utilizes standard WWW tools and Java for fully interactive parallel and distributed debugging. The working prototype we developed provides proof of concept that we will soon be able to apply teleworking to the development, debugging, and testing of parallel programs for very large machines, a task that, up to now, is mostly confined to the very few supercomputer centers worldwide. Net-dbx provides runtime source-level debugging of multiple processes and message monitoring capabilities. The prototype provides a graphical user interface at minimal overhead. It can be run from anywhere on the Internet, even using
a low bandwidth dialup connection (33 Kbps). Most importantly, Net-dbx is superior to similar products in compatibility and heterogeneity issues. Being a Java applet, it can run with no modifications on virtually any console, requiring only the presence of a Java enabled WWW browser on the client side. The tool in its present implementation is being utilized in the Parallel Processing class at the University of Cyprus. After this alpha testing, it will be ready for beta testing on other sites as well. We are currently working on extending the prototype presented into an integrated MPI-aware environment for program development, debugging and execution. For more information on the Net-dbx debugger's current implementation state, one can visit the Net-dbx home page [15].
References
1. Laura Lemay and Charles L. Perkins, Teach Yourself Java in 21 Days, Sams.net Publishing, 1996
2. The JavaSoft home page, http://java.sun.com, the main Java page at Sun Corporation, widely known as the Java home
3. X Window System, The Open Group, http://www.opengroup.org/tech/desktop/x/, The Open Group's information page on the X Window System
4. Charles E. McDowell and David P. Helmbold, Debugging concurrent programs, ACM Computing Surveys, Vol. 21, No. 4 (Dec. 1989), pp. 593-622
5. Richard M. Stallman and Roland H. Pesch, Debugging with GDB, The GNU Source-Level Debugger, Edition 4.09, for GDB version 4.9
6. GNU Debugger information, http://www.physik.fu-berlin.de/edv docu/documentation/gdb/index.html
7. LAM / MPI Parallel Computing, Ohio Supercomputer Center, http://www.osc.edu/lam.html, the info and documentation page for LAM; it also includes all the documentation on XMPI
8. Ohio Supercomputer Center, The Ohio State University, MPI Primer Development With LAM
9. J. Steven Fritzinger and Marianne Mueller, Java Security White Paper, Sun Microsystems Inc., 1996
10. Joseph A. Bank, Java Security, http://www-swiss.ai.mit.edu/~jbank/javapapaer/javapaper.html
11. The Gamelan home page, http://www.gamelan.com, the most complete and well-known Java code collection on the Internet
12. The Parallel Tools Consortium home page, http://www.ptools.org/
13. Dolphin Toolworks, Introduction to the TotalView Debugger, http://www.dolphinics.com/TINDEX.html
14. Doreen Cheng and Robert Hood, A Portable Debugger for Parallel and Distributed Programs, Supercomputing '94
15. The Net-dbx home page, http://www.cs.ucy.ac.cy/~net-dbx/
Workshop 2+8: Performance Evaluation and Prediction
Allen D. Malony and Rajeev Alur (Co-chairmen)
Measurement and Benchmarking
The core of any performance evaluation strategy lies in the methodologies used for measuring performance characteristics and comparing performance across systems. The first paper in this session describes an efficient performance measurement system created for workstation clusters that can report runtime information to application-level tools. A hardware monitor approach is argued for in the second paper for measuring the memory access performance of a parallel NUMA system. The third paper demonstrates analysis methods for testing self-similarity of workload characterizations of workstations, servers, and the world-wide web. It is hard enough deciding performance metrics for parallel application benchmarking, but making the methodology portable is a real challenge. This is the subject of the fourth paper.
Modeling and Simulation
Parallel systems pose unique challenges for constructing analytical models and simulation studies because of the inter-relationship of performance factors. The first paper deals with the problem of building probabilistic models for CPUs with multilevel memory hierarchies when irregular access patterns are applied. The second paper studies the validity of the h-relation hypothesis of the BSP model and proposes an extension to the model based on an analysis of message passing programs. Simulating large-scale parallel programs for predictive purposes requires close attention to the choice of program factors and the effect of these factors on architectural analysis. This is shown effectively in the third paper for predicting data parallel performance.
Communications Modeling and Evaluation
Scalable parallel systems rely on high-performance communication subsystems, and it is important to relate their characteristics to communication operations in software. The first paper assesses LogP parameters for the IBM SP machine as applied to the MPI library and distinguishes between sender and receiver operations. In contrast, pre-evaluating communications performance at the level of a data parallel language, independent of machine architecture, relies instead
on the aspects of data distribution and communication volume and pattern. The second paper pre-evaluates communication for HPF. Targeting both a communications model and its application to program development, the third paper applies genetic programming to the problem of finding efficient and accurate runtime functions for the performance of parallel algorithms on parallel machines.
Real-Time Systems
The real-time systems session is dedicated to design, implementation, and verification techniques for computerized systems that obey real-time constraints. The subject comprises programming languages for real-time systems, associated compiling or synthesis technology, real-time operating systems and schedulers, and verification of real-time constraints. The first paper addresses the issue of language support for the design of reliable real-time systems, proposing a methodology based on constraint programming. The second paper presents theoretical results concerning the scheduling of periodic processes. In particular, the authors present sufficient conditions, weaker than previously known, for guaranteed schedulability so as to meet given age constraints. The third paper describes how the authors formulated and solved a control problem in the design of an industrial grinding plant to optimize economic criteria using parallel processing.
Configurable Load Measurement in Heterogeneous Workstation Clusters
Christian Röder, Thomas Ludwig, and Arndt Bode
LRR-TUM — Lehrstuhl für Rechnertechnik und Rechnerorganisation
Technische Universität München, Institut für Informatik, D-80290 München
{roeder,ludwig,bode}@informatik.tu-muenchen.de
Abstract. This paper presents the design and implementation of NSR – the Node Status Reporter. The NSR provides a standard mechanism for measurement and access to status information in clusters of heterogeneous workstations. It can be used by any application that relies on static and dynamic information about this execution environment. A key feature of NSR is its flexibility with respect to the measurement requirements of various applications. Configurability aims at reducing the measurement overhead and the influence on the observed cluster.
1 Motivation
During the last years, clusters of time-shared heterogeneous workstations have become a popular execution platform for parallel applications. Parallel programming environments like PVM [5] and MPI [13] have been developed to implement message passing parallel applications. Late development stages and efficient production runs of parallel applications typically require sophisticated on-line tools, e.g. performance analysis tools, program visualization tools, or load management systems. A major drawback of existing tools like e.g. [1,7,9,12] is that all of them implement their own monitoring technique to observe and/or manipulate the execution of a parallel application. Proprietary monitoring solutions have to be redesigned and newly implemented when shifting to a new hardware platform and/or a new programming environment. In 1995, researchers at LRR-TUM and Emory University started the OMIS1 project. It aims at defining a standard interface between tools and monitoring systems for parallel and distributed systems. Using an OMIS compliant monitoring system, tools can efficiently observe and manipulate the execution of a message passing based parallel application. Meanwhile, OMIS has been redesigned and is available in version 2 [8]. At the same time, the LoMan project [11] was started at LRR-TUM. It aims at developing a cost-sensitive system-integrated load management mechanism for parallel applications that execute on clusters of heterogeneous workstations. Combining the goals of these two projects, we designed the NSR [6], a separate module to observe the status of this execution environment.
OMIS: On-line Monitoring Interface Specification.
The goal of NSR is to provide a uniform interface for data measurement and data access in clusters of heterogeneous workstations. Heterogeneity covers two aspects: different performance indices of the execution nodes and different data representations due to different architectures. Although primarily influenced by the measurement requirements of on-line tools, the NSR serves as a powerful and flexible measurement component for any type of application that requires static and/or dynamic information about workstations in a cluster. However, the NSR does not post-process and interpret status information because the semantics of measured values is known by measurement applications only. In the following, we will use measurement application to denote any application that requires status information. The document is organized as follows. The next section presents an overview of related work. In § 3 we propose the reference model of NSR and discuss the steps towards its design. Implementation aspects are presented in § 4. The concept of NSR is evaluated in § 5. Finally, we summarize in § 6 and discuss future development steps.
2 Related Work
Available results taken into consideration when designing and implementing the NSR can be grouped into three different categories:
– Standard commands and public domain tools (e.g. top2 and sysinfo3) to retrieve status information in UNIX environments.
– Load modeling and load measurement by load management systems for parallel applications that execute in clusters of heterogeneous workstations (e.g. load monitoring processes [4,7]; classification of load models [10]).
– The definition of standard interfaces to separately implement components of tools for parallel systems (e.g. OMIS [8], UMA4 [14], SNMP5 [2], and PSCHED [3]).
Typically, developers implement measurement routines for their own purposes on dedicated architectures or filter status information from standard commands or public domain tools. An overview of commands and existing tools is presented in [6]. The implementation of public domain tools is typically divided into three parts. A textual display presents the front-end of the tools. A controlling component connects the front-end with measurement routines that are connected by a tool-defined interface. Hence, porting a tool to a new architecture is limited to implementing a new measurement routine. However, most tools only provide status information from the local workstation. Routines for transferring measured data have to be implemented if information is also needed from remote workstations.
2 top: available at ftp://ftp.groupsys.com/pub/top.
3 sysinfo: available at ftp://ftp.usc.edu/pub/sysinfo.
4 UMA: Universal Measurement Architecture.
5 SNMP: Simple Network Management Protocol.
The second category taken into consideration covers the load modeling requirements of load management systems. A load management system can be modeled as a control loop that tries to manage the load distribution in a parallel system. Developers of load management techniques often adopt measurement routines from the above mentioned public domain tools to retrieve load values. However, it is still not clear which status information is best suited for load modeling in heterogeneous environments. A flexible load measurement component can serve as a basis to implement an adaptive load model [10]. Furthermore, different combinations of status information could be tested to find the values that adequately represent the load of heterogeneous workstations.
Standard interfaces are defined to allow separate development of tool components. Tool components can subsequently cooperate on the basis of well-defined communication protocols. The exchange of one component for another (even from a different developer) is simplified if both are designed taking into account the definitions of such interfaces. We already mentioned the OMIS project. In § 4.2 we will see that OMIS serves as the basis to define the status information provided by NSR. The UMA project addresses data collection and data representation issues for monitoring in heterogeneous environments. Status information is collected independently of vendor defined operating system implementations. The design of UMA includes data collection from various sources like e.g. operating systems, user processes, or even databases with historical profiling data. Although most measurement problems are addressed, the resulting complexity of UMA might currently hamper the development and widespread usage of UMA compliant measurement systems. Furthermore, the concept does not strictly distinguish between data collection and subsequent data processing.
The PSCHED project aims at defining standard interfaces between various components of a resource management system. The major goal is to separate resource management functionality, message passing functionality of parallel programming environments, and scheduling techniques. Monitoring techniques are not considered by PSCHED at all. The PSCHED API is subdivided into two parts: 1) a task management interface defines a standard mechanism for any message passing library to communicate with any resource management system; 2) a parallel scheduler interface defines a standard mechanism for a scheduler to manage execution nodes of a computing system regardless of the resource manager actually used. By using a configurable measurement tool, the implementation of a general-purpose scheduler can be simplified.
3 The Design of NSR
The design of a flexible measurement component has to consider the measurement requirements of various applications. A measurement component serves as an auxiliary module in the context of an application’s global task. Hence, the NSR is designed as a reactive system, i.e., it performs activities on behalf of the supported applications. According to the main goal of NSR to provide a uniform interface for data measurement and data access, we consider three questions:
Fig. 1. The Reference Model of NSR.
1. Which data is needed? Different applications need different types of status information. Status information can be classified into static and dynamic depending on the time-dependent variability of its values. Static and dynamic information is provided for all major resource classes, i.e., operating system, CPU, main memory, secondary storage, and communication interfaces. Interactive usage of workstations in a time-shared environment gains increasing importance when parallel applications are to be executed. Hence, the number of users and their workstation utilization are measured as well.
2. When is the data needed? On-line tools like e.g. load management systems often have a cyclic execution behavior, and measurements are periodically initiated. In this case, the period of time between two successive measurements depends on the intent of the measurement applications. However, measurement applications can have other execution behaviors as well. Then, status information can be measured immediately and exactly once as soon as it is needed.
3. From which set of nodes is the data needed? On-line tools can be classified regarding their implementation as distributed or centralized. In case of a central tool component (e.g. the graphical user interface of a program visualization tool), status information of multiple workstations is typically needed at once. Status information of any workstation is accessible on any other workstation.
Different applications might need the same type of status information, although their global tasks are different. For example, a load management system can consider I/O statistics of parallel applications and map application processes to workstations with high I/O bandwidth. The same status information can be used by a visualization tool to observe the I/O usage statistics in a workstation cluster. The following section presents the reference model of NSR before we list the design goals in detail.
3.1 The Reference Model
The reference model of NSR is illustrated in Fig. 1. The NSR builds a layer between measurement applications and a cluster of heterogeneous workstations. It supports on-line tools for parallel applications like e.g. load management systems, performance analysis tools, or visualization tools. However, any type of
measurement application can utilize the NSR. Applications communicate with the NSR via the Tool/NSR Interface. Using this interface, they can manage and control the behavior of NSR. Recall that the NSR is designed as a reactive system acting on behalf of the measurement applications. The applications are responsible for configuring the NSR with respect to their measurement requirements, i.e., the type of status information and the time at which measurements have to be initiated. The interface provides uniform access to information of any workstation that is observed by NSR. If necessary, the NSR distributes measured values between the workstations with respect to the applications' location. The NSR/System Interface connects the NSR to a cluster of heterogeneous workstations and hides the details of workstation heterogeneity. The main purpose of this interface is to connect the NSR and measurement routines for various architectures. Furthermore, the interface provides a uniform data representation with unity scale. Developers are free to implement their own data measurement and data collection routines.
3.2 Design Goals
The following major objectives have to be pursued during the design of NSR with respect to the above listed measurement requirements:
Portability: The target architecture of NSR is a cluster of heterogeneous workstations. The same set of status information has to be accessible in the same way on every architecture for which an implementation exists. New architectures should be easily integrated with existing implementations.
Uniformity: For each target architecture, the measured values have to be unified with respect to their units of measure (i.e., bytes, bytes/second, etc.). Hence, status information becomes comparable even across heterogeneous workstations.
Scalability: Tools for parallel systems typically need status information from a set of workstations. An increasing number of observed workstations must not slow down these applications.
Flexibility: The NSR provides a wide variety of status information. Measurement applications can determine which subset of information has to be measured on which workstations. The frequency of measurements has to be freely configurable.
Efficiency: The NSR can simultaneously serve as a measurement component for several applications. If possible, concurrent measurements of the same data should be reduced to a single measurement.
3.3 Design Aspects
According to the goals and the requirements we consider the following aspects:
Data measurement: The measurement of a workstation's status information issues system calls to the operating system. It is one of the main tasks of the operating system to update these values and to keep them in internal management tables. System calls are time consuming and slow down the calling application. Additionally, different architectures often require different measurement
routines to access the same values. For portability reasons, data measurement is decoupled from an application's additional work and performed asynchronously to it. The set of data to be measured is freely configurable by a measurement application.
Synchronization: It is likely in time-shared systems that several measurement applications execute concurrently, e.g. several on-line tools for several parallel applications execute on an overlapping set of workstations in the cluster. A severe measurement overhead on multiple observed workstations can result from an increasing frequency of periodically initiated measurements. Multiple measurements of the same status information are therefore converted to a single measurement, reducing the measurement overhead. However, synchronized data measurement demands a mechanism to control the synchronization. We use interrupts to synchronize data measurements. In case of periodic measurements, the period of time between any two measurements is at least as long as the period of time between two interrupts. However, this limits the frequency of measurement initiations and can delay the measurement application. We implemented two concepts to reduce the delays. Firstly, if a measurement application immediately requires status information, then measurements can be performed independently of the interrupt period. Secondly, we use asynchronous operations for data measurement and data access.
Asynchronous Operations: Asynchronous operations are used to hide latency and overlap data measurement and computation. An identifier is immediately returned for these operations. A measurement application can wait for the completion of an operation using this identifier. Measurements are carried out by NSR, while applications can continue their execution.
Data Distribution: Apart from data measurement, the status information is transparently transferred to the location of a measurement application. A data transfer mechanism is implemented within the NSR. Again, data transfer is executed asynchronously while applications can continue their execution.
4 Implementation Aspects
4.1 Client-Server Architecture
The architecture of NSR follows the client-server paradigm. The client functionality is implemented by the NSR library. A process that has linked the NSR library and invokes library functions is called NSR client (or simply client). An NSR client can request service from the NSR server (server) by using these library functions. The NSR server can be seen as a global administration instance that performs data measurement and data distribution on behalf of the clients. It is implemented by a set of replicated NSR server processes that cooperate to serve the client requests. The server processes are intended to execute in the background as other UNIX processes. To clarify our notation we call them NSR daemon processes (or simply daemons). A single NSR daemon process resides on each workstation in the cluster for which an implementation exists.
Fig. 2. The Client-Server Architecture of NSR.
An exemplary scenario of this architecture is illustrated in Fig. 2. One NSR daemon process resides on workstation node i, while a second daemon resides on workstation node j. The client on node i can only communicate with the local NSR daemon process. Applications cannot utilize the NSR server if a local NSR daemon process is not available. The Tool/NSR Interface is implemented on top of this interconnection. The local NSR daemon process controls the execution of a client request. From the viewpoint of a client, its local NSR daemon process is called the NSR master daemon. If necessary, the NSR master daemon forwards a request to all daemons that participate in serving the client's request. The latter daemons are also called NSR slave daemon processes. A daemon is identified by the name of the workstation on which it resides. Note that the attributes master and slave are only used to distinguish the roles each daemon plays in serving a client request. However, the attributes are not reflected in the implementation. An NSR daemon process can be both master and slave because several clients can request service from the NSR server concurrently. The main difference between master and slave daemons is that an NSR master daemon keeps additional information about a requesting client. Slave daemons do not know anything about the client that performed the request. The NSR/System Interface connects the workstation specific measurement routines with the NSR daemon processes. Daemons exchange messages via the communication network. All information needed by the NSR server to serve the client requests is stored by the NSR daemon processes. In contrast, a client is stateless with respect to data measurement. The result of a previously initiated library call does not influence the execution of future library calls. Hence, no information has to be stored between any two calls.
4.2 Status Information
The data structure nsr_info is the container for the status information of workstations and is communicated between the NSR components.

struct nsr_info {
    long              mtag, tick;
    static_hostinfo   stthi;
    dynamic_hostinfo  dynhi;
};
The first entry mtag is a sequence number for the status information. Applications can identify old and new values by comparing their mtag values. The second entry tick is a timestamp for the information. It indicates the interrupt
number at which the measurement was carried out. In case of periodic measurements, a master daemon uses tick to control data assignment if several clients request status information with different periods of time. Information about resources and interactive usage is split into two data structures. Static information is stored in static_hostinfo and describes a workstation, e.g., in terms of the type of operating system, number of CPUs, size of main memory, etc. Benchmark routines are used to determine performance indices of workstation resources. Dynamic information is stored in dynamic_hostinfo and represents the utilization values of workstation resources. For detailed information, we refer to the information service node_get_info in the OMIS document6 [8, pp. 69 ff.]. The definition of this service matches the definition of these two data structures.
4.3 NSR Library Functions
The NSR library functions are split into two major groups. The first group covers administrative functions to manage the execution behavior of the NSR server. The second group includes data access routines to retrieve status information that has been measured by the daemons. Additional functions primarily ease the usage of the NSR. Administrative functions are needed to configure the NSR server. They are available as blocking or non-blocking calls. The following example illustrates the difference between these two types. Initially, a client attaches itself to a set of NSR daemon processes with nsr_iattach(), which is a non-blocking call. The local daemon analyzes the parameters of the request, performs authentication checks for the client, and prepares one entry of its internal client management table. From the client request, the names of the workstations from which status information should be retrieved are extracted. Subsequently, the master daemon forwards the request to all requested daemons, which themselves store the location of the master daemon. As a result, the master daemon returns the identifier waitID for the request. The client can continue with its work. Later on, the client invokes nsr_wait() with waitID to finish the non-blocking call. This call checks whether all requested daemons participate in serving the client's request. The return value to the client is the identifier nsrID, which has to be passed with every subsequent library call. The master daemon uses nsrID to identify the requesting client. In any case, a non-blocking call has to be finished by nsr_wait() before a new call can be invoked by a client. A blocking library call is implemented by its respective non-blocking counterpart immediately followed by nsr_wait(). Three additional administrative functions are provided by the NSR library. Firstly, a client can program each daemon by sending a measurement configuration message. A measurement flag determines the set of status information that should be measured by each daemon. In case of periodic measurements, a time value determines the period of time between any two measurements. A new configuration message reconfigures the execution behavior of all attached daemons. Secondly, the set of observed workstations can be dynamically changed during the client's execution time. Finally, a client detaches itself from the NSR server as soon as it does not need service any more.
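To make the two-phase protocol of the non-blocking administrative calls more concrete, the following sketch shows how a client might attach to a set of daemons and then configure periodic measurements. The exact C signatures are not given in the paper, so the argument lists, the nsr_config() helper, and the flag constants used here are illustrative assumptions rather than the actual NSR API.

#include <stdio.h>

/* Hypothetical NSR client interface -- the names nsr_iattach() and
 * nsr_wait() are taken from the text, but argument lists, the
 * nsr_config() helper, and the flag constants are assumptions. */
typedef int nsr_wait_id;
typedef int nsr_id;

extern nsr_wait_id nsr_iattach(const char **hosts, int nhosts); /* non-blocking attach */
extern nsr_id      nsr_wait(nsr_wait_id waitID);                /* finish a non-blocking call */
extern int         nsr_config(nsr_id nsrID, unsigned long measurement_flags,
                              int period_seconds);              /* 0 = immediate, >0 = periodic */

#define NSR_MEAS_CPU 0x01   /* assumed flag values */
#define NSR_MEAS_MEM 0x02

int attach_and_configure(void)
{
    const char *hosts[] = { "node0", "node1" };
    nsr_wait_id waitID;
    nsr_id      nsrID;

    /* Phase 1: issue the non-blocking attach; the client may overlap
     * its own computation while the daemons handle the request. */
    waitID = nsr_iattach(hosts, 2);

    /* Phase 2: complete the call; nsrID identifies this client in all
     * subsequent library calls. */
    nsrID = nsr_wait(waitID);
    if (nsrID < 0) {
        fprintf(stderr, "NSR attach failed\n");
        return -1;
    }

    /* Program all attached daemons: measure CPU and memory every 10 s. */
    return nsr_config(nsrID, NSR_MEAS_CPU | NSR_MEAS_MEM, 10);
}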
6 http://wwwbode.informatik.tu-muenchen.de/~omis
Fig. 3. Functionality of an NSR Daemon.
The second group of NSR library functions provides access to the measured data. A request for immediate measurement blocks the caller until the status information of the requested workstation is available. In case of periodic measurements, the local daemon stores the status information of every node in a local memory segment. A library call reads the memory location to access the information. A client can read the status information of all observed workstations at once. Additionally, it can request the status information of one specific workstation. The NSR master daemon stores status information independently of the execution behavior of the requesting clients. Again, it is the client's responsibility to periodically read the data.
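A client-side polling loop over the periodically updated data could then look roughly as follows. The nsr_get_info() accessor, its arguments, and the polling interval are hypothetical; the paper only states that a library call reads the locally stored status information and that clients can tell old from new samples via mtag.

#include <unistd.h>

typedef int nsr_id;

/* Reduced version of the nsr_info container from Sect. 4.2; the
 * static/dynamic host info members are omitted here for brevity. */
struct nsr_info {
    long mtag, tick;
    /* static_hostinfo stthi; dynamic_hostinfo dynhi; */
};

/* Hypothetical accessor: copies the locally stored status information
 * of one workstation into *out; returns 0 on success. */
extern int nsr_get_info(nsr_id id, const char *host, struct nsr_info *out);

void poll_host(nsr_id id, const char *host)
{
    struct nsr_info info;
    long last_mtag = -1;

    for (;;) {
        if (nsr_get_info(id, host, &info) == 0 && info.mtag != last_mtag) {
            /* A new sample arrived: mtag changed since the last read. */
            last_mtag = info.mtag;
            /* ... hand the host information over to the tool ... */
        }
        sleep(5); /* poll at least as often as the configured period */
    }
}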
4.4 NSR Daemon Functionality
At startup time, an NSR daemon process measures a default set of status data and multicasts the values to all daemons in the cluster. During this phase, it initializes an internal timer that periodically interrupts the execution of the NSR daemon process at fixed time intervals. The period of time between two successive interrupts is called NsrTick. Any time interval used by the NSR implementation is a multiple of this basic NsrTick. As shown in Fig. 3, three main tasks can be executed by a single daemon: client request handling, data measurement, and data management. A central dispatching routine determines the control flow of the NSR daemon processes. It invokes one of the tasks depending on one of the following asynchronous events: 1) the expiration of NsrTick, 2) the arrival of an NSR client request, or 3) the arrival of another daemon message. In the first case, a default set of status information is updated every n × NsrTick with a predefined number n as long as no client request has to be handled. This concept of alive-messages ensures that all daemons know about all other daemons at any time. Note that a daemon tries to propagate a breakdown-message in case of a failure. If client requests indicate periodic measurements, then NsrTick is used to check for the expiration of a requested period of time. If necessary, the daemon measures exactly that status information requested at
this point of time. The set of data to be measured is calculated as the superset of the status information currently requested by all clients. The arrival of an NSR client request causes the dispatcher to invoke the request handling functionality of the daemon. Depending on the request type, the daemon updates an internal management table that keeps information about its local clients. Request handling also includes the forwarding of a client request to all daemons that participate in serving the client request. A result is sent to the client in response to each request. A daemon can receive different types of messages from other daemons, i.e., alive-messages, breakdown-messages, reply-messages and update-messages. Reply-messages are sent by an NSR slave daemon in response to a forwarded client request. The result of a reply is stored in the client management table. Later on, the results of all slave daemons are checked if the daemon handles the previously mentioned nsr_wait() call. Finally, the update-messages indicate that a daemon sends updated status information in case of periodic measurements. New status information of a workstation invalidates its old values. Therefore, the daemon overwrites old workstation entries in the memory segment. Each daemon in the cluster can perform the same functionality. Typically, several daemons cooperate to serve a client's request. Without going into further details, we state that the NSR server can handle requests from multiple clients. On the one side, an application can be implemented as a parallel program and each process can be a client that observes a subset of workstations. On the other side, several applications from different users can utilize the NSR server concurrently. Additionally, limited error recovery is provided. If necessary, a daemon that recovers from a previous failure is automatically configured to participate in serving the client requests again.
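The dispatching structure described above can be pictured as a single event loop that multiplexes the timer and the two message sources. The sketch below uses select() with a timeout derived from NsrTick; the event handling names, the socket variables, and the way the measurement superset is formed from per-client flags are illustrative assumptions rather than the actual daemon code, which uses an interrupting internal timer.

#include <sys/select.h>
#include <sys/time.h>

#define NSR_TICK_SEC 5                 /* assumed NsrTick of 5 seconds */

extern int client_fd, daemon_fd;       /* TCP socket / multicast socket */
extern unsigned long client_flags[];   /* per-client measurement flags */
extern int nclients;

extern void handle_client_request(int fd);
extern void handle_daemon_message(int fd);
extern void measure_and_update(unsigned long flags);

void nsr_daemon_loop(void)
{
    for (;;) {
        fd_set rfds;
        struct timeval tv = { NSR_TICK_SEC, 0 };  /* wake up every NsrTick */
        int maxfd = (client_fd > daemon_fd ? client_fd : daemon_fd) + 1;

        FD_ZERO(&rfds);
        FD_SET(client_fd, &rfds);
        FD_SET(daemon_fd, &rfds);

        int n = select(maxfd, &rfds, NULL, NULL, &tv);
        if (n == 0) {
            /* NsrTick expired: measure the superset of all requested data. */
            unsigned long flags = 0;
            for (int i = 0; i < nclients; i++)
                flags |= client_flags[i];
            measure_and_update(flags);
        } else if (n > 0) {
            if (FD_ISSET(client_fd, &rfds))
                handle_client_request(client_fd);   /* local NSR client */
            if (FD_ISSET(daemon_fd, &rfds))
                handle_daemon_message(daemon_fd);   /* alive/reply/update msgs */
        }
    }
}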
4.5 Basic Communication Mechanisms
Two types of interaction are distinguished to implement the NSR concept: the client-server interaction and the server-internal interaction, i.e., the interaction between the NSR daemons. Their implementations are based on different communication protocols.
Client-Daemon Communication: The communication between a client and its local NSR daemon process is performed directly or indirectly. Direct communication is used for the administrative functions and involves both the NSR client and the NSR master daemon simultaneously. A TCP socket connection is established over which a client sends its requests and receives the server replies. The asynchronous communication uses a two-phase protocol based on direct communication (see description above). Indirect interaction between a client and the local daemon is used to implement data exchange in case of periodic measurements. In analogy to a mailbox data exchange scheme, the daemon writes status information to a memory mapped file independently of the client's execution state. Similarly, any client located on the workstation can read this data independently of the daemon's status.
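The mailbox-like indirect exchange can be realized with a plain memory mapped file, as sketched below from the client's perspective. The file name, the layout, and the fixed table of per-node nsr_info records are assumptions made for the example; the paper only states that the daemon writes the status information of every node to a memory mapped file that local clients read independently.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_NODES 64                        /* assumed table size */

struct nsr_info { long mtag, tick; /* ... host info ... */ };

/* Map the daemon's status file read-only; returns a pointer to an
 * array of one nsr_info record per observed workstation. */
struct nsr_info *map_status_table(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, MAX_NODES * sizeof(struct nsr_info),
                   PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping stays valid */
    return (p == MAP_FAILED) ? NULL : (struct nsr_info *)p;
}

/* Usage: the client reads entry i whenever it needs node i's data,
 * without any interaction with the daemon. */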
Fig. 4. Measurement duration of top and nsrd (time in seconds per measurement, plotted over the measurement cycle number).
5
Evaluation
We will now evaluate the concept and implementation of NSR. As a point of reference, we compare the public domain software product top with the execution of nsrd, i.e., the NSR daemon process. Two experiments have been performed on a SPARCstation 10 running the SUN Solaris 2.5 operating system. Firstly, we compared the measurement overhead of top and nsrd. Secondly, we compared the average CPU utilization of both tools to determine their impact on the observed workstation. A comparison of the measurement overhead of top and nsrd is shown in Fig. 4. It illustrates the cyclic execution behavior of nsrd. Both tools have been instrumented with timing routines to measure the duration of a single measurement. The NSR daemon process measures its default set of values every n × NsrTick, with NsrTick = 5 (seconds) and n = 12, i.e., every 60 seconds, while top updates its values every m × NsrTick, with m = 3, i.e., every 15 seconds. The set of measured information covers CPU and memory usage statistics. The set of values is approximately the same for both tools. For i mod 12 ≠ 0, the NsrTick just interrupts the execution of nsrd, which subsequently performs some internal management operations. The time values of Fig. 4 indicate that the measurement routines of top and nsrd are similar with
respect to their measurement overhead. The reason for this similarity is that the measurement technique is almost identically implemented in both tools. However, nsrd outperforms top in terms of CPU percentage, which is a measure for the CPU utilization of a process. The CPU percentage is defined as the ratio of the CPU time used recently to the CPU time available in the same period. It can be used to determine the influence on the observed workstation. The following lines show a shortened textual output of top when both tools execute concurrently.

  PID  USERNAME  ...  SIZE   ...  CPU    COMMAND
 3317  roeder    ...  1908K  ...  1.10%  top
 3302  root      ...  2008K  ...  0.59%  nsrd
The above snapshot was taken when both tools were executing at their regular level. It combines both the time each tool needs to measure its dynamic values and the time to further process these data. The top utility prints a list of processes that is ordered by decreasing CPU usage of the processes. The nsrd writes the data to the memory mapped file, converts it into the XDR format, and multicasts it to all other daemons. From this global point of view, nsrd outperforms top by a factor of 2. The overhead of top mainly results from its output functionality. In summary, the NSR introduces less load on the observed system than the top program.
6 Summary and Future Work
We presented the design and implementation of NSR, a general-purpose measurement component to observe the status of a cluster of heterogeneous workstations. The NSR provides transparent access to static and dynamic information for any requesting measurement application. A key feature of NSR is its flexibility with respect to the measurement requirements of various applications. If necessary, applications can control and observe different subsets of workstations independently. From the viewpoint of NSR, several applications can be served concurrently. Currently, we have implemented a prototype of NSR for the SUN Solaris 2.5 and Hewlett Packard HP-UX 10.20 operating systems. In the future we will enhance the implementation of NSR. In particular, measurement routines for further architectures like IRIX 6.2, AIX 3.x, and LINUX 2.0.x will be implemented. Additionally, we will optimize the performance of NSR to reduce the measurement overhead. Regarding performance evaluation, we will evaluate the NSR library calls and the data distribution inside the NSR server. In spring 1998, we will release the NSR software package under the GNU general public license. On some architectures, measurements are performed by reading kernel data structures. In this case, read permission to access the kernel data has to be provided for an NSR daemon process. Security aspects are not violated because the NSR does not manipulate any data. Measurement applications can be implemented without special permission flags. In production mode, an NSR daemon process can
automatically start up during the boot sequence of a workstation’s operating system. The NSR already serves as the load measurement component within the LoMan project [11]. In the future, the NSR will be integrated into an OMIS compliant monitoring system to observe the execution nodes of a parallel application.
References
1. A. Beguelin. Xab: A Tool for Monitoring PVM Programs. In Workshop on Heterogeneous Processing, pp. 92–97, Los Alamitos, CA, Apr. 1993.
2. J.D. Case, M. Fedor, M.L. Schoffstall, and C. Davin. Simple Network Management Protocol (SNMP). Internet Activities Board – RFC 1157, May 1990.
3. D.G. Feitelson, L. Rudolph, U. Schweigelshohn, K.C. Sevcik, and P. Wong. Theory and Practice in Parallel Job Scheduling. In IPPS'97 Workshop on Job Scheduling for Parallel Processing, pp. 1–25, Geneva, Switzerland, Apr. 1997. University of Geneva.
4. K. Geihs and C. Gebauer. Load Monitor LM – Ein CORBA-basiertes Werkzeug zur Lastbestimmung in heterogenen verteilten Systemen. In 9. ITG/GI Fachtagung MMB'97, pp. 1–17. VDE-Verlag, Sep. 1997.
5. G.A. Geist, J.A. Kohl, R.J. Manchek, and P.M. Papadopoulos. New Features of PVM 3.4 and Beyond. In Parallel Virtual Machine — EuroPVM'95: Second European PVM User's Group Meeting, Lyon, France, pp. 1–10, Paris, Sep. 1995. Hermes.
6. M. Hilbig. Design and Implementation of a Node Status Reporter in Networks of Heterogeneous Workstations. Fortgeschrittenenpraktikum at LRR-TUM, Munich, Dec. 1996.
7. D.J. Jackson and C.W. Humphres. A simple yet effective load balancing extension to the PVM software system. Parallel Computing, 22:1647–1660, Jun. 1997.
8. T. Ludwig, R. Wismüller, V. Sunderam, and A. Bode. OMIS — On-line Monitoring Interface Specification. TR. TUM-I9733, SFB-Bericht Nr. 342/22/97 A, Technische Universität München, Munich, Germany, Jul. 1997.
9. B.P. Miller, J.K. Hollingsworth, and M.D. Callaghan. The Paradyn Parallel Performance Tools and PVM. TR., Department of Computer Sciences, University of Wisconsin, 1994.
10. C. Röder. Classification of Load Models. In T. Schnekenburger and G. Stellner (eds.), Dynamic Load Distribution for Parallel Applications, pp. 34–47. Teubner, Germany, 1997.
11. C. Röder and G. Stellner. Design of Load Management for Parallel Applications in Networks of Heterogeneous Workstations. TR. TUM-I9727, SFB 342/18/97 A, SFB 0342, Technische Universität München, 80290 München, Germany, May 1997.
12. C. Scheidler and L. Schäfers. TRAPPER: A Graphical Programming Environment for Industrial High-Performance Applications. In Proc. PARLE'93, Parallel Architectures and Languages Europe, vol. 694 of LNCS, pp. 403–413, Berlin, Jun. 1993. Springer.
13. D.W. Walker. The Design of a Standard Message Passing Interface for Distributed Memory Concurrent Computers. Parallel Computing, 20(4):657–673, Apr. 1994.
14. X/Open Group. Systems Management: Universal Measurement Architecture Guide, vol. G414. X/Open Company Limited, Reading, UK, Mar. 1995.
Exploiting Spatial and Temporal Locality of Accesses: A New Hardware-Based Monitoring Approach for DSM Systems
Robert Hockauf, Wolfgang Karl, Markus Leberecht, Michael Oberhuber, and Michael Wagner
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM)
Institut für Informatik der Technischen Universität München
Arcisstr. 21, D-80290 München, Germany
{karlw,leberech,oberhube}@informatik.tu-muenchen.de
Abstract. The performance of a parallel system with NUMA characteristics depends on the efficient use of local memory accesses. Programming and tool environments for such DSM systems should enable and exploit data locality. In this paper we present an event-driven hybrid monitoring concept for the SMiLE SCI-based PC cluster. The central part of the hardware monitor consists of a content-addressable counter array managing a small working set of the most recently referenced memory regions. We show that this approach makes it possible to provide detailed run-time information which can be exploited by performance evaluation and debugging tools.
1 Introduction
The SMiLE1 project at LRR-TUM investigates high-performance cluster computing based on the Scalable Coherent Interface (SCI) as interconnection technology [5]. Within this project, an SCI-based PC cluster with distributed shared memory (DSM) is being built. In order to be able to set up such a PC cluster we have developed a PCI-SCI adapter targeted to plug into the PCI bus of a PC. Pentium-II PCs equipped with these PCI-SCI adapters are connected into a cluster of computing nodes with NUMA characteristics (non-uniform memory access). The SMiLE PC cluster as well as other SCI-based parallel systems available and accessible at LRR-TUM serve as platforms for the software developments within the SMiLE project, which study how to map parallel programming models efficiently onto the SCI hardware. The performance of such a parallel system with DSM depends on the efficient use of local memory accesses by a parallel program. Although remote memory accesses via hardware-supported DSM deliver high communication performance, they are still an order of magnitude more expensive than local ones. Therefore, programming systems and tools for such parallel systems with NUMA
SMiLE: Shared Memory in a LAN-like Environment
characteristics should enable and exploit data locality. However, these tools require detailed information about the dynamic behaviour of the running system. Monitoring the dynamic behaviour of a compute cluster with hardware-supported DSM like the SMiLE PC cluster is very demanding because communication might occur implicitly on any read or write. This fact implies that monitoring must be very fine-grained too, making it almost impossible to avoid significant probe overhead with software instrumentation. In this paper we present a powerful and flexible event-driven hybrid monitoring system for the SMiLE SCI-based PC cluster with DSM. Our PCI-SCI adapter architecture allows the attachment of a hardware monitor which is able to deliver detailed information about the run-time and communication behaviour to tools for performance evaluation and debugging. In order to be able to record all interesting measurable entities with only limited hardware resources, the monitor exploits the spatial and temporal locality of data and instruction accesses in a similar way as cache memories in high-performance computer systems do. Simulations of our monitoring approach help us to make some initial design decisions. Additionally, we give an overview of the use of the hardware monitor in a hybrid performance evaluation system.
2 The SMiLE PC Cluster Architecture
The Scalable Coherent Interface SCI (IEEE Std 1596-1992) has been chosen as the network fabric for the SMiLE PC cluster. The SCI standard [6] specifies the hardware interconnect and protocols allowing up to 64 K SCI nodes (processors, workstations, PCs, bus bridges, switches) to be connected in a high-speed network. A 64-bit global address space across SCI nodes is defined as well as a set of read, write, and synchronization transactions enabling SCI-based systems to provide hardware-supported DSM with low-latency remote memory accesses. In addition to communication via DSM, SCI also facilitates fast message-passing. SCI nodes are interconnected via point-to-point links in ring-like arrangements or are attached to switches. The logical layer of the SCI specification defines packet-switched communication protocols. An SCI split transaction requires a request packet to be sent from one SCI node to another node with a response packet in reply to it. This enables every SCI node to overlap several transactions and allows latencies of accesses to remote memory to be hidden. The lack of commercially available SCI interface hardware for PCs during the initiation period of the SMiLE project led to the development of our custom PCI-SCI hardware. Its primary goal is to serve as the basis of the SMiLE cluster by bridging the PC's I/O bus with the SCI virtual bus. Fig. 1 shows a high-level block diagram of the PCI-SCI adapter, which is described in detail in [1]. The PCI-SCI adapter is divided into three logical parts: the PCI unit, the Dual-Ported RAM (DPR), and the SCI unit. The PCI unit interfaces to the PCI local bus. A 64 MByte address window on the PCI bus allows processor-memory transactions to be intercepted, which are then
translated into SCI transactions. For the interface to the PCI bus, the PCI9060 PCI bus master chip from PLX Technology is used. Also, packets arriving from the SCI network have to be transformed into PCI operations. The packets to be sent via SCI are buffered within the Dual-Ported RAM. It contains frames for outgoing packets, allowing one outstanding read transaction, 16 outstanding write transactions, 16 outstanding messaging transactions using a special DMA engine for long data transfers between nodes, and one read-modify-write transaction which can be used for synchronization. Additionally, 64 incoming packets can be buffered. The SCI unit interfaces to the SCI network and performs the SCI protocol processing for packets in both directions. This interface is implemented with the LinkController LC-1 from Dolphin Interconnect Solutions, which realizes the physical layer and part of the logical layer of the SCI specification. The SCI unit is connected to the DPR via the B-Link, the non-SCI link side of the LC-1. Control information is passed between the PCI unit and the SCI unit via a handshake bus. Two additional FPGAs, responsible for the PCI-related and B-Link-related processing, complete the design.
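For reference, the frame organization of the Dual-Ported RAM described above can be summarized by a few constants. The enum below merely restates the numbers from the text in code form; it is not taken from the adapter's actual firmware or driver sources.

/* Assumed bookkeeping constants for the DPR packet frames described above. */
enum {
    DPR_READ_FRAMES  = 1,    /* one outstanding read transaction            */
    DPR_WRITE_FRAMES = 16,   /* outstanding write transactions              */
    DPR_MSG_FRAMES   = 16,   /* DMA-based messaging transactions            */
    DPR_LOCK_FRAMES  = 1,    /* read-modify-write frame for synchronization */
    DPR_IN_FRAMES    = 64    /* buffered incoming packets                   */
};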
3 The SMiLE Hardware Monitor Extension Card
3.1 Overview
Fig. 1. The SMiLE PCI-SCI adapter and the monitor card installed in a PC
The fine-grain nature of SCI’s remote memory transactions and their level of integration in the hardware makes it hard to trace and measure them in a manner necessary to do performance analysis and debugging. As mentioned in section 2, the B-Link carries all outgoing and incoming packets to the physical SCI interface, Dolphin’s LinkController LC-1. It is therefore a
central sequentialization point at which all remote memory traffic can be sensed. The SMiLE SCI hardware monitor is thus an additional PCI card attached to the original SCI adapter as shown in Fig. 1. Data that can be gathered from the B-Link includes the transaction command, the target and source node IDs, the node-internal address offset, and the packet data. However, the design is naturally limited to remote memory traffic: a conventional plug-in PCI card does not enable us to monitor internal memory accesses. Nonetheless, remote memory accesses remain the most critical, as mainly read accesses across the SCI network still account for more than one order of magnitude higher latencies than local accesses.
3.2 Working Principle
When memory access latency becomes an increasing problem during execution of machine instruction streams, and speeding up the whole main memory of a computer system appears not to be economically feasible, performance enhancements are usually achieved through the use of cache memories.
Fig. 2. The hardware monitor detects remote accesses. In order to allocate new counters, old counter value/tag pairs are flushed to the main memory ring buffer. Should it overflow, an interrupt is generated
As data accesses and instruction fetches in typical programs exhibit a great amount of temporal and spatial locality, cache memories only attempt to store a small working set of memory pages (cache lines). A prerequisite for this, however, is the use of a content-addressable memory as the covered address space usually dramatically exceeds the cache’s capacity.
The same property holds for remote memory references in SCI-supported distributed shared memory execution of parallel programs: It is highly likely that a successive memory reference will access the same data item as its predecessor or some data close by. A hardware monitor the duty of which is to count memory accesses to specified memory regions can profit from this. The most recently addressed memory references and their associated counter can be stored in a register array with limited size. The detection of address proximity allows to combine counter values, as the accesses to neighboring data items may actually represent a single larger data object. E.g. it might be more interesting to count the accesses to an array rather than the references to its elements. Only recently unused memory references have to be flushed into external buffers, by this reducing the amount of data that has to be transported. Drawn from these principles, the SMiLE hardware monitor uses the concept described in Fig. 2: 1. SCI remote memory accesses are sensed via the B-Link. 2. If the memory reference matches a tag in a counter array, the associated counter is incremented. If no reference is found, a new counter-tag pair is allocated and initialized to 1. 3. If no more space is available in the counter array, a counter-tag pair is flushed to a larger ring buffer in main memory. This buffer is supposed to be emptied by system software in a more or less cyclic fashion. 4. If a buffer overflow occurs nevertheless, a signal is sent to the software process utilizing the monitor in order to force the retrieval of the ring buffer’s data. Apart from a partition of counters working according to this principle (termed dynamic counters), the monitor also contains a set of counters that are statically preprogrammable (thus static counters) and are not covered in this work. This addition to the monitor design was chosen in order to allow the user to include pre-known memory areas that are supposed to be monitored anyway, much like in conventional histogram monitors [10]. 3.3
3.3 The Dynamic Coverage LRU Method
As one wants to reduce the amount of flushing and re-allocation of counter-tag pairs, it makes sense to integrate a strategy of counter coverage adaptation into the replacement mechanism of the counter array. Under the prerequisite that counter tags can name not only single addresses but also contiguous areas of memory whose accesses the counter is to represent, a simple least-recently-used (LRU) replacement algorithm can be adapted to this special task. The maximum address range, however, has to be predefined by the user. The Dynamic Coverage LRU method is shown in Fig. 3.
3.4 The Hardware Design
The resource saving principle of the monitor will hopefully allow us to construct the additional plug-in adapter with a small number of high-density FPGAs.
[Flowchart: wait for an incoming address reference; if it is already present within a tag, count the address event (possibly indicating counter overflow); otherwise, if a free tag can be found, set a new tag; otherwise, if the distance to another tag is small enough, adapt that tag's limits; otherwise, determine the oldest tag and flush its counter-tag pair.]
Fig. 3. The Dynamic Coverage LRU mechanism
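The tag-matching part of this flow can be modelled in software as follows; the range representation, the MAX_RANGE constant and the timestamp-based age are assumptions chosen for illustration, not the actual register layout. The caller is expected to increment the returned counter and update its timestamp:

    #include <stdint.h>

    #define MAX_RANGE 64u      /* user-defined upper bound on a covered range, in bytes */

    typedef struct {
        uint32_t lo, hi;       /* address range covered by this counter's tag           */
        uint32_t count;
        uint32_t last_used;    /* age information for the LRU decision                  */
        int      valid;
    } rtag_t;

    /* Returns the counter that will account for 'addr', following Fig. 3;
     * flush() stands for writing the evicted value/tag pair to the ring buffer. */
    rtag_t *dcl_select(rtag_t *c, int n, uint32_t addr, uint32_t now,
                       void (*flush)(rtag_t *))
    {
        int i, oldest = 0;
        for (i = 0; i < n; i++)                        /* already within a tag?         */
            if (c[i].valid && addr >= c[i].lo && addr <= c[i].hi)
                return &c[i];
        for (i = 0; i < n; i++)                        /* a free tag can be found?      */
            if (!c[i].valid) {
                c[i] = (rtag_t){ addr, addr, 0, now, 1 };
                return &c[i];
            }
        for (i = 0; i < n; i++) {                      /* small enough distance?        */
            uint32_t lo = addr < c[i].lo ? addr : c[i].lo;
            uint32_t hi = addr > c[i].hi ? addr : c[i].hi;
            if (hi - lo < MAX_RANGE) {                 /* adapt the tag limits          */
                c[i].lo = lo;
                c[i].hi = hi;
                return &c[i];
            }
        }
        for (i = 1; i < n; i++)                        /* otherwise evict the oldest    */
            if (c[i].last_used < c[oldest].last_used)
                oldest = i;
        flush(&c[oldest]);
        c[oldest] = (rtag_t){ addr, addr, 0, now, 1 };
        return &c[oldest];
    }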
The same commercial PCI master circuit already used with the SCI bridge will serve again as its PCI interface. The ring buffer will reside in the PC's main memory while its pointer registers will be located on the monitor hardware. Access to the monitor and its registers will be memory mapped in order to speed up accessing the device from user space. As SCI does not restrict the use of read and write transactions to main memory accesses, synchronization and configuration of a whole monitoring system can also be performed via SCI: remote monitor cards can be mapped into global shared memory in the same way that application memory is made accessible to other nodes.
4 Experimental Evaluation and Design Decisions
For initial performance assessments, a software simulator of the proposed hardware monitor served as the basis of a number of experiments. Traces were gathered by executing a number of the SPLASH-2 benchmark programs with the multiprocessor memory system simulator Limes [9]. For this, a shared memory system was assumed with the following properties:
- a local read/write access latency of 1 cycle, assuming these accesses can be regarded as normally cached on a usual system,
- a remote write latency of 20 cycles, corresponding to approx. 100 ns on a 200 MHz system (accounting for host bridge latencies when accessing the PCI bus),
- a remote read latency of 1000 cycles, corresponding to approx. 5 µs on a 200 MHz system [7], and
- a remote lock latency of 1000 cycles, as the processor is stalled while waiting for the result of the remote atomic read-modify-write operation.
The global shared memory in the SPLASH-2 applications was distributed over the nodes in pages of 4 Kbytes in a round-robin fashion. Due to this, the physical memory on a single node no longer represented a contiguous but rather an interleaved area of the global shared memory. The simulated DSM system consisted of four nodes of x86 processors clocked at 200 MHz. Figures 4 and 5 show the memory access histograms of the FFT and the radix sort programs in the SPLASH-2 suite generated by this configuration.
Fig. 4. Left: Memory access histogram for page 40 on node #0 of a 16K points FFT program in the SPLASH-2 suite run on a simulated 4-node SCI configuration
Fig. 5. Right: Memory access histogram for page 144 on node #0 of a 64K keys RADIX program in the SPLASH-2 suite run on a simulated 4-node SCI configuration
Both traces were run through the simulator while the number of dynamic counters was varied. This served to help us find the optimal hardware size of a physical implementation: the constraints are a hardware realization that is as lean as possible but still ensures low loads for the buses and the high-level tools. Figures 6 and 7 display the results of these experiments. The results follow intuition: the larger the number of counters, the fewer flushes to the ring buffer occur. The same holds when the area that a counter can cover is increased: more and more adjacent accesses will be counted in a single register. Naturally, there is an optimum size for the maximum coverable range depending on the program under test: for the FFT, increasing this maximal range to more than 16 bytes yields fewer counter range extensions. The reason for this lies in the FFT's mostly neighboring accesses to complex numbers, reflecting exactly a 16 byte granularity (two double values in C). The RADIX results represent a different case: while for the overall work of the monitor the same holds as for the FFT (a larger number of counters and a bigger covered area account for fewer flushes), the number of counter range extensions only decreases for increased areas when a larger number of counters is working. This can, however, be attributed to the linear access pattern in RADIX:
[Two plots: the total number of counters flushed to the ring buffer (left) and the total number of counter range extensions (right), versus the maximum size of the covered range (4 to 256 bytes), for 1, 4, 8, 16, 32 and 64 counters.]
Fig. 6. Number of flushes to the ring buffer and counter coverage extensions for the FFT
[The corresponding plots for RADIX: total number of counters flushed to the ring buffer and total number of counter range extensions, versus the maximum size of the covered range (4 to 256 bytes), for 1 to 64 counters.]
Fig. 7. Number of flushes to the ring buffer and counter coverage extensions for RADIX
successive reads and writes can always be put into the same counter until the maximum range has been reached. As soon as the counters are able to cover the whole 4K page (e.g. a 256 byte range × 16 counters = 4096 bytes), the number of extensions decreases drastically. A number of 16 or 32 counters appears to be a good compromise in a practical implementation of this monitor. Already at this size the number of accesses to main memory can be reduced significantly, even with relatively small ranges covered by the counters. As the current monitor design always writes a packet of 16 bytes to the external ring buffer, this means that for 16 counters the amount of flushed data sums up to 84 KBytes for the FFT and 60 KBytes for RADIX. Given a simulated runtime of 0.127 sec for the FFT and 1.142 sec for RADIX, this yields a top average data rate of 661 KBytes/s. Roughly the same numbers hold for 32 counters. When the Dynamic Coverage LRU is switched on with a maximum range larger than one word, this rate instantly drops to 280 KBytes/s. With typical PC I/O bandwidths of around 80 MB/s (PCI's nominal bandwidth of 133 MByte/s can only be reached with infinite bursts), monitoring
represents a mere 0.8 % of the system’s bus load and shouldn’t influence program execution too much. Switching off all monitor optimizations (range of 4 byte, only one dynamic counter) converts our hardware into a trace-writing monitor, creating trace data at a maximum rate of 1.06 MB/s. An external ring buffer has to be sized according to these figures and the ability of a possible on-line monitoring software to make use of the sampled data. Derived from rules of thumb, ring buffers in the 1 MByte range should suffice to allow a reasonably relaxed access to flushed data without disturbing a running application. It has to be noted that we are currently restricted to page-sized areas that can be covered by the dynamic counters of the monitor as PCI addresses are physical addresses. This means that consecutive virtual addresses may be located on different pages. One solution to this problem would be to duplicate parts of the processor’s MMU mechanisms tables on the monitoring hardware in order to be able to retranslate PCI into virtual addresses. The obvious issue of hardware complexity has so far excluded this possibility.
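These rates follow directly from the figures quoted above; for the FFT with 16 counters:

    84 KByte / 0.127 s  ~  661 KByte/s        (top average data rate)
    661 KByte/s / 80 MByte/s  ~  0.008, i.e. about 0.8 % of the available bus bandwidth.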
5 Related Work
Parallel programming in environments with distributed memory is supported by a number of tools, and there are strong efforts towards a standardization of the monitoring interface [3]. Tool environments for DSM-oriented systems are less wide-spread. Performance debugging tools have been developed mainly for software-based DSM systems; they analyse traces for data locality [4] [2]. Over the last years, monitoring support has become increasingly available on research as well as on commercial machines. For the CC-NUMA FLASH multiprocessor system the hardware-implemented cache coherence mechanism is complemented by components for the monitoring of fine-grained performance data (number and duration of misses, invalidations etc.) [11]. Modern CPU chips incorporate hardware counters which collect information about data accesses, cache misses, TLB misses etc. For some multiprocessor systems this information is exploited by performance analysis tools [12]. Martonosi et al. propose a multi-dimensional histogram performance monitor for the SHRIMP multiprocessor [10].
6 Conclusion
In the paper we presented the rationale and the architecture of the distributed hardware monitor for the SMiLE DSM PC cluster. In order to be able to use the information provided by these hardware monitors a performance evaluation system needs a software infrastructure consisting of a local monitor library, a communication and managing layer providing a global monitor abstraction.
These items will be covered by an OMIS [8] compliant implementation of a prototypical DSM monitoring infrastructure.
References 1. G. Acher, H. Hellwagner, W. Karl, and M. Leberecht. A PCI-SCI Bridge for Building a PC-Cluster with Distributed Shared Memory. In Proceedings The Sixth International Workshop on SCI-based High-Performance Low-Cost Computing, pages 1–8, Santa Clara, CA, Sept. 1996. SCIzzL. 207 2. D. Badouel, T. Priol, and L. Renambot. SVMview: A Performance Tuning Tool for DSM-Based Parallel Computers. In L. Boug`e, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, EuroPar’96 – Parallel Processing, number 1123 in LNCS, pages 98–105, Lyon, France, Aug. 1996. Springer Verlag. 214 3. A. Bode. Run-Time Oriented Design Tools: A Contribution to the Standardization of the Development Environments for Parallel and Distributed Programs. In F. Hoßfeld, E. Maehle, and E. W. Mayr, editors, Proceedings of the 4th Workshop PASA’96 Parallel Systems & Algorithms, pages 1–12. World Scientific, 1997. 214 4. M. Gerndt. Performance Analysis Environment dor SVM-Fortran Programs. Technical Report IB-9417, Research Centre J¨ ulich (KFA), Central Institute for Applied Mathematics, J¨ ulich, Germany, 1994. 214 5. H. Hellwagner, W. Karl, and M. Leberecht. Enabling a PC Cluster for HighPerformance Computing. SPEEDUP Journal, 11(1), June 1997. 206 6. IEEE Standard for the Scalable Coherent Interface (SCI). IEEE Std 1596-1992, 1993. IEEE 345 East 47th Street, New York, NY 10017-2394, USA. 207 7. M. Leberecht. A Concept for a Multithreaded Scheduling Environment. In F. Hoßfeld, E. Maehle, and E. W. Mayr, editors, Proceedings of the 4th Workshop on PASA’96 Parallel Systems & Algorithms, pages 161–175. World Scientific, 1996. 211 8. T. Ludwig, R. Wism¨ uller, V. Sunderam, and A. Bode. OMIS — On-line Monitoring Interface Specification (Version 2.0). TUM-I9733, SFB-Bericht Nr. 342/22/97 A, Technische Universit¨ at M¨ unchen, Munich, Germany, July 1997. 215 9. D. Magdic. Limes: An Execution-driven Multiprocessor Simulation Tool for the i486+-based PCs. School of Electrical Engineering, Department of Computer Engineering, University of Belgrade, POB 816 11000 Belgrade, Serbia, Yugoslavia, 1997. 211 10. M. Martonosi, D. W. Clark, and M. Mesarina. The SHRIMP Performance Monitor: Design and Applications. In Proceeding 1996 SIGMETRICS Symnposium on Parallel and Distributed Tools (SPDT’96), pages 61–69, Philadelphia, PA, USA, May 1996. ACM. 210, 214 11. M. Martonosi, D. Ofelt, and M. Heinrich. Integrating Performance Monitoring and Communication in Parallel Computers. In SIGMETRICS’96 1996 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1996. 214 12. M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance Analysis Using the MIPS R10000 Performance Counters. In Supercomputing SC’96, 1996. 214
On the Self-Similar Nature of Workstations and WWW Servers Workload Olivier Richard and Franck Cappello LRI, Universit´e Paris-Sud, 91405 Orsay, France {auguste,fci}@lri.fr.
Abstract. This paper presents a workload characterization for workstations, servers and WWW servers. Twelve data sets built from standard UNIX tools and from access.log files are analyzed with three different time scales. We demonstrate that the workload of these resources is statistically self-similar in the periods of irregular activity.
1 Introduction
Resource sharing in a network of workstations and in the Internet is the aim of a large number of research projects. Fundamental to this goal is understanding the initial workload characteristics of each resource. Most of the mechanisms for workload distribution (CONDOR [1] and GLUNIX [2]) try to detect the periods when the user does not use his workstation. Nevertheless, even when the user is using the workstation, the initial workload may stay very low. This under-exploitation provides the potential to gather unexploited resources for more aggressive users. Exploiting it requires a careful study of the initial workload of a workstation during user activity. The workload characterization of WWW servers is more recent. In [3], the self-similarity of WWW server workload is investigated. The automatic distribution of WWW requests over mirror sites is an open issue. We use a global approach to characterize the complexity of the workload. This approach is close to the one used for network traffic analysis [4] [5].
2 Method for the Workload Study
More than a hundred resources are connected in our network of workstations. The machines are used for numerical processing, some heavy simulations and student and researcher work (compilers, mailers, browsers, etc.). The compute-intensive applications saturate the machine and lead to a typically flat plot for the workload. The server workload is typically much more irregular. The workstation workload is a composition of the previous ones: null during the night, sometimes irregular and sometimes saturated during the day. We have also analyzed two WWW servers: our lab WWW server, called LRI, and the Ethernet backbone of the computing research department of the
Virginia University of Technology (USA). The trace for the LRI server starts on October 31, 1995 and ends on August 4, 1997. For the second WWW server, only the external requests to the .cs.vt.edu domain have been recorded, during a 38-day period. We analyze the hits and the transferred bytes per time unit. We choose three time scales of analysis because the event granularities are different in a workstation, a server and a WWW server. The statistical analysis requires about 100.000 events to give results of acceptable quality. This number corresponds to 50 days for the minute time scale (WWW server workload analysis). It represents a daily period for the 1 second time scale (workstation workload analysis). To understand the workstation and server micro-workloads we have used a 5 millisecond time scale (100.000 events represent 8 minutes). The workload measurement of the servers and workstations has been done with the vmstat UNIX command (average percentage of CPU availability). We have used a method based on a snoopy process with low priority to measure the workload at the 5 millisecond resolution. The accesses to the WWW servers have been recorded in the access.log file.
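The 5 ms probe itself is not described in detail in the paper; one possible realisation along these lines (the nice level, the slot length and the calibration scheme are assumptions) is a low-priority process that counts how many iterations of a busy loop fit into each 5 ms slot, the count dropping whenever other work uses the CPU:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define SLOT_US 5000                     /* 5 ms measurement slots                  */

    static long usec_now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000000L + tv.tv_usec;
    }

    int main(void)
    {
        long idle_max = 1, n, t0;
        nice(19);                            /* run only when the CPU would be idle     */
        for (;;) {
            t0 = usec_now();
            for (n = 0; usec_now() - t0 < SLOT_US; n++)
                ;                            /* busy loop, pre-empted by any real load  */
            if (n > idle_max)
                idle_max = n;                /* running maximum serves as calibration   */
            /* fraction of the slot we obtained approximates the CPU availability      */
            printf("%ld %.3f\n", t0, (double)n / (double)idle_max);
        }
    }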
3 Self-similarity
Intuitively, a self-similar signal can be observed graphically as presenting invariant features at different time scales. As in [4], we consider the workload signal as a wide-sense stationary stochastic process (stationary up to order 2). In such a process, the mean (µ) and the variance (σ^2) stay constant over time. Let X(t) be such a stochastic process. X(t) has an auto-covariance function R(τ) and an autocorrelation function ρ(τ) given by R(τ) = E[(X(t) − µ)(X(t + τ) − µ)] and ρ(τ) = R(τ)/R(0) = R(τ)/σ^2. We assume that X(t) satisfies ρ(τ) ~ τ^(−β) L(τ) as τ → ∞ (1), where L(τ) is slowly varying at infinity (examples of such functions are L(τ) = const and L(τ) = log(τ)). Let X^(m) denote the new time series obtained by averaging the original series X in non-overlapping blocks of size m, that is, X^(m)_t = (1/m)(X_{tm−m+1} + X_{tm−m+2} + ... + X_{tm}). Let ρ^(m)(τ) denote the autocorrelation function of X^(m). If the aggregated process X^(m) has the same autocorrelation structure as X, i.e. ρ^(m)(τ) = ρ(τ), the process X is called exactly second-order self-similar with self-similarity parameter H = 1 − β/2. X is called asymptotically second-order self-similar with self-similarity parameter H = 1 − β/2 if ρ^(m)(τ) → ρ(τ) for large m and τ. In the next section we use three methods to test and characterize the self-similarity of a stochastic process: the variance-time plot of the X^(m) processes, the R/S analysis [4] and the Whittle estimator [6]. All methods give an estimate of H (the self-similarity parameter). H larger than 1/2 and lower than 1 suggests the self-similarity of the signal. A rigorous introduction to the self-similar phenomenon can be found in [4].
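As an illustration of the variance-time method: for a self-similar process, Var(X^(m)) decays like m^(−β), so the slope of log Var(X^(m)) against log m gives β and hence H = 1 − β/2. A minimal C sketch (the choice of aggregation levels m = 2, 4, 8, ... is an assumption; any geometric progression can be used) is:

    #include <math.h>
    #include <stddef.h>

    /* Variance of the block means of x[0..n-1] for blocks of size m. */
    static double var_aggregated(const double *x, size_t n, size_t m)
    {
        size_t nb = n / m, i, j;
        double mean = 0.0, var = 0.0, s;
        for (i = 0; i < nb; i++) {
            for (s = 0.0, j = 0; j < m; j++) s += x[i * m + j];
            mean += s / m;
        }
        mean /= nb;
        for (i = 0; i < nb; i++) {
            for (s = 0.0, j = 0; j < m; j++) s += x[i * m + j];
            s = s / m - mean;
            var += s * s;
        }
        return var / nb;
    }

    /* Least-squares slope of log10 Var(X^(m)) against log10 m; the slope is -beta,
     * so the variance-time estimate of the self-similarity parameter is H = 1 + slope/2. */
    double estimate_H(const double *x, size_t n)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0, lm, lv, slope;
        int k, npts = 0;
        for (k = 1; ((size_t)1 << k) <= n / 10; k++) {   /* m = 2, 4, 8, ...            */
            size_t m = (size_t)1 << k;
            lm = log10((double)m);
            lv = log10(var_aggregated(x, n, m));
            sx += lm; sy += lv; sxx += lm * lm; sxy += lm * lv; npts++;
        }
        slope = (npts * sxy - sx * sy) / (npts * sxx - sx * sx);
        return 1.0 + slope / 2.0;
    }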
4 Results
The size of the data sets is about 70.000 measurements. Eight machines have been closely analyzed in the network of workstations. Table 1 gathers the estimates of H obtained with the three methods for the observed machines in the network of
workstations. For all machines, H is larger than 1/2 and lower than 1 and β is between 0 and 1. So, together the 3 methods suggest that self-similar stochastic processes may be used to represent closely these machine workloads. Table 1. H estimates for the CPU workload signals (1 second and 5 ms time scales), variance analysis of the X (m) processes, R/S analysis and Whittle estimates 1 second time scale β(V AR) H(V AR) H(R/S) Sun1 0.41 0.79 0.69 Sun2 0.29 0.85 0.64 Sun3 0.21 0.89 0.57 Sun4 0.48 0.75 0.64 Sun5 Sun6 HP1 0.30 0.84 0.79 HP2 0.37 0.81 0.80
5 ms time scale H(W ) β(V AR) H(V AR) H(R/S) 0.92 0.31 0.84 0.78 0.86 0.49 0.75 0.66 0.99 0.68 0.65 0.89 0.34 0.82 0.84 0.86 0.79 -
H(W ) 0.84 0.91 0.81 -
The estimates of the H parameter for the WWW servers obtained with the various methods are presented in Table 2. The first three rows present the estimates of H and β for three different segments of the LRI trace. The row labelled BR gives the results for the Ethernet backbone of the computer science department (Virginia University). The values of H and β for these traces indicate that both signals (hits per minute and transmitted bytes per minute) might be closely represented by self-similar processes. Table 2. H estimates of hits per second and bytes per second for the WWW servers. The method used for each estimate is indicated between parentheses.
hit09 hit28 hit31 BR
Hits per second β(V AR) H(V AR) H(R/S) 0.50 0.74 0.71 0.62 0.68 0.74 0.64 0.67 0.78 0.43 0.78 0.88
Bytes per second H(W ) β(V AR) H(V AR) H(R/S) 0.86 0.64 0.68 0.71 0.78 0.62 0.68 0.76 0.75 0.75 0.62 0.72 0.63 0.68 0.83
H(W ) 0.74 0.72 0.61 -
5 Analysis of Results and Conclusion
We have shown that self-similar stochastic processes correspond to the workload of these resources (in the periods of irregular activity). A more accurate confidence interval for the estimates of H may be obtained with the Whittle estimator applied to larger data sets, or with other methods such as wavelet-based methods. [7] has recently shown that some network traffic is multi-fractal instead of mono-fractal (H varies with time). Our result about the self-similar nature of the WWW server workload is contrary to the one given in [3]. In their data set there are several defects: zero hits for several hours, or isolated high-volume transfers. Removing these
isolated defects changes the conclusions and suggests that self-similarity is not as rare as those authors have suggested, and that strong perturbations may alter the results of the self-similarity tests. Knowledge about self-similarity in WWW servers is getting wider. In [8], the authors propose an explanation for the nature of client-side WWW traffic. The article [9] provides an interesting result: the superposition of self-similar processes with fractional Gaussian noise again yields a self-similar process. [4] has presented some consequences of traffic self-similarity for communication network engineering. There are drastic implications for modeling the individual source, for the notion of "burstiness" and for congestion management. The methods and results obtained in the field of communication networks may be used as the basis for further work in the workload management area. For example, the estimate of H may be used to generate synthetic traces from the fractional Gaussian noise model [10] or from fractional ARIMA(p, d, q) processes [11]. Such synthetic traces would represent the initial workload of a workstation, a server or a WWW server.
References 1. M. J. Litzkow, M. Livny, and M. W. Mutka. Condor : A hunter of idle workstations. In 8th International Conference on Distributed Computing Systems, pages 104–111, Washington, D.C., USA, June 1988. IEEE Computer Society Press. 216 2. Amin M. Vahdat, Douglas P. Ghormley, and Thomas E. Anderson. Efficient, portable, and robust extension of operating system functionality. Technical Report CSD-94-842, University of California, Berkeley, December 1994. 216 3. Martin Arlitt and Carey L. Williamson. Web server workload characterization: The search for invariants. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer SYstems, volume 24,1 of ACM SIGMETRICS Performance Evaluation Review, pages 126–137, New York, May23–26 1996. ACM Press. 216, 218 4. Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. On the Self-Similar Nature of Ethernet Traffic (Extended Version). IEEE/ACM Transactions on Networking, 1994. 216, 217, 219 5. Vern Paxson and Sally Floyd. Wide-area traffic: the failure of Poisson modeling. ACM SIGCOMM 94, pages 257–268, August 1994. 216 6. Jan Beran. Statistic for Long-Memory Processes. Chapman and Hall, 1994. 217 7. R.H. Riedi and J. L`evy V`ehel. Multifractal properties of tcp traffic: a numerical study. Technical Report 3129, INRIA, 1997. 218 8. Mark Crovella and Azer Bestavros. Self-similarity in World Wide Web traffic: Evidence and causes. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer SYstems, volume 24,1 of ACM SIGMETRICS Performance Evaluation Review, pages 160–169, New York, May23–26 1996. ACM Press. 219 9. K. Krishnan. A new class of performance results for a fractional brownian traffic model. Queueing System, 22:277–285, 1996. 219 10. B. B. Mandelbrot and J. W. Van Ness. Fractional brownian motion, fractional noises and applications. SIAM Review, 10(4):422–437, October 1968. 219 11. J. Granger and R. Joyeux. An introduction to long-memory time series models and fractional differencing. Time Series Analysis, 1, 1980. 219
White-Box Benchmarking Emilio Hern´ andez1 and Tony Hey2 1
Departamento de Computaci´ on, Universidad Sim´ on Bol´ıvar, Apartado 89000, Caracas, Venezuela, [email protected] 2 Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK [email protected]
Abstract. Structural performance analysis of the NAS parallel benchmarks is used to time code sections and specific classes of activity, such as communication or data movements. This technique is called whitebox benchmarking because, similarly to white-box methodologies used in program testing, the programs are not treated as black boxes. The timing methodology is portable, which is indispensable to make comparative benchmarking across different computer systems. A combination of conditional compilation and code instrumentation is used to measure execution time related to different aspects of application performance. This benchmarking methodology is proposed to help understand parallel application behaviour on distributed-memory parallel platforms.
1 Introduction
Computer evaluation methodologies based on multi-layered benchmark suites, like Genesis [1], EuroBen [2] and Parkbench [3], have been proposed. The layered approach is used for characterising the performance of complex benchmarks based on the performance models inferred from the execution of low-level benchmarks. However, it is not straightforward to relate the performance of the low-level benchmarks to the performance of more complex benchmarks. The study of the relationship between the performance of a whole program and the performance of its components may help establish the relevance of using low-level benchmarks to characterise the performance of more complex benchmarks. For this reason we propose the structural performance analysis of complex benchmarks. A benchmarking methodology, called white-box benchmarking, is proposed to help understand parallel application performance. The proposed methodology differs from standard profiling in that it is not procedure oriented. Partial execution times are associated not only with code sections but also with activity classes, such as communication or data movements. These execution times may be compared to the results of simpler benchmarks in order to assess the predictive properties of the latter. The proposed methodology is portable. It relies only on MPI_WTIME (the MPI [4] timing function) and the availability of a source code preprocessor for conditional compilation, for instance a standard C preprocessor.
Timing Method. The proposed timing method is simple enough to be portable. It is based on the combination of two basic techniques: incremental conditional compilation and section timing. Incremental conditional compilation consists of selecting code fragments from the original benchmark to form several kernels of it. A basic kernel of a parallel benchmark can be built by selecting the communication skeleton. A second communication kernel can contain the communication skeleton plus data movements related to communication (e.g. data transfers to communication buffers). By measuring the elapsed time of both kernels, we know the total communication time (the time measured for the first kernel) and the time spent in data movements (the difference between the execution times of the two kernels). The net computation time can be obtained by subtracting the execution time of the second kernel from the execution time of the complete benchmark. Not every program is suitable for this type of code isolation; see [5] for a more detailed discussion. Section timing is used on code fragments that take a relatively long time to execute, for example the subroutines at the highest levels of the call tree. Three executable files may be produced: the communication kernel, the communication plus data movements kernel, and the whole benchmark. Optionally, if information by code section is required, an additional compile-time constant has to be set to include the timing functions that obtain the partial times. The use of MPI_WTIME allows us to achieve code portability. Two out of these three benchmark executions are usually fast because they will not execute "real" computation, but only communication and data movements. Section 2 presents an example of the use of the methodology with the NAS Parallel Benchmarks [6]. In Section 3 we present our conclusions.
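The NAS benchmarks themselves are Fortran codes; the following C fragment merely illustrates the idea, with hypothetical macro names (COMM_ONLY, COMM_AND_MOVES) selecting which kernel is compiled and MPI_Wtime accumulating per-activity section times:

    #include <mpi.h>

    /* Build three executables from the same source:
     *   cc -DCOMM_ONLY      ...   -> communication skeleton only
     *   cc -DCOMM_AND_MOVES ...   -> skeleton plus communication-related data movements
     *   cc                  ...   -> the complete benchmark
     */
    static double t_comm, t_move, t_comp;          /* per-activity section timers        */

    void step(double *u, double *buf, int n, int peer)
    {
        MPI_Status st;
        double t0;

        t0 = MPI_Wtime();                          /* communication skeleton (all kernels) */
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0, MPI_COMM_WORLD, &st);
        t_comm += MPI_Wtime() - t0;

    #ifndef COMM_ONLY
        t0 = MPI_Wtime();                          /* data movements tied to communication */
        for (int i = 0; i < n; i++)
            u[i] = buf[i];
        t_move += MPI_Wtime() - t0;
    #endif

    #if !defined(COMM_ONLY) && !defined(COMM_AND_MOVES)
        t0 = MPI_Wtime();                          /* "real" computation, full benchmark   */
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        t_comp += MPI_Wtime() - t0;
    #endif
    }

Comparing the elapsed times of the three executables, or the per-section timers themselves, then yields the communication, data-movement and net computation times as described above.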
2 Case Study with NAS Parallel Benchmarks
The NAS Parallel Benchmarks [6] are a widely recognized suite of benchmarks derived from important classes of aerophysics applications. In this work we used the application benchmarks (LU, BT and SP) and focused on the communication kernels extracted from these benchmarks. The main visible difference between these communication kernels comes from the fact that LU sends a larger number of short messages, while SP and BT send fewer and longer messages. This means that the LU communication kernel should benefit from low latency interconnect subsystems, while BT and SP communication kernels would execute faster on networks with a greater bandwidth.
2.1 Experiments
Several experiments with the instrumented version of the NAS benchmarks were conducted on a Cray T3D and a IBM SP2. For a description of the hardware and software configurations see [5]. Several experiments were conducted using the white-box methodology described here. These experiments are also described in detail in [5].
[Bar chart: NAS application benchmarks (Class A size); communication time and communication-related data movement time for LU, BT and SP on the T3D and the SP2.]
Fig. 1. Communication time and communication-related data movement time on T3D and SP2 (16 processors).
Figure 1 compares communication overhead in the T3D and the SP2 for LU, BT and SP. The communication kernel of LU runs marginally faster on the SP2 than on the T3D, while the SP and BT communication kernels execute faster on the T3D. The execution of the communication kernels indicates that communication performance is not substantially better on either the SP2 or the T3D. The main difference, in favour of the T3D, is the time spent in data movements related to communication, rather than communication itself. Measurements made with COMMS1 (the Parkbench ping-pong benchmark), slightly modified for transmitting double precision elements, show that the T3D has a startup time equal to 104.359 µsec and a bandwidth equal to 3.709 Mdp/sec, while the SP2 has a startup time equal to 209.458 µsec and a bandwidth equal to 4.210 Mdp/sec, where Mdp means "millions of double precision elements". As mentioned above, the LU communication kernel should run faster on low latency networks, while BT and SP communication kernels would execute faster on networks with a greater bandwidth. The observed behaviour of LU, BT and SP seems to contradict the expected behaviour of these benchmarks, based on the COMMS1 results. Apart from bandwidth and startup time, many other factors may be playing an important role in communication performance, which are not measured by COMMS1. Some of these factors are network contention, the presence of collective communication functions, the fact that messages are sent from different memory locations, etc. In other words, it is clear from these experiments that communication performance may not be easily characterized
by low level communication parameters (latency and bandwidth) obtained from simpler benchmarks.
3 Conclusions
White-box benchmarking is a portable and comparatively effortless performance evaluation methodology. A few executions of each benchmark are necessary to get the information presented in this article, one for the complete benchmark and the rest for the extracted kernels. A benchmark visualisation interface like GBIS [7] may easily be enhanced to incorporate query and visualisation mechanisms to facilitate the presentation of results related to this methodology. Useful information has been extracted from the NAS parallel benchmarks case study using white-box benchmarking. Communication-computation profiles of the NAS parallel benchmarks may easily indicate the balance between communication and computation time. Communication kernels obtained by isolating the communication skeleton of selected applications may give us a better idea about the strength of the communication subsystem. Additionally, some behaviour characteristics can be exposed using white-box benchmarking, like load balance and a basic execution profile. This methodology may be used in benchmark design rather than in application development. Programs specifically developed as benchmarks may incorporate code for partial measurements and kernel selection. A description of the instrumented code sections may also be useful and, consequently, could be provided with the benchmarks. The diagnostic information provided by white-box benchmarking is useful to help understand the variable performance of parallel applications on different platforms.
References 1. C. A. Addison, V.S. Getov, A.J.G. Hey, R.W. Hockney, and I.C. Wolton. The genesis distributed-memory benchmarks. In J. Dongarra and W. Gentzsch, editors, Computer Benchmarks, pages 257–271. North-Holland, 1993. 220 2. A. van der Steen. Status and Direction of the EuroBen Benchmark. Supercomputer, 11(4):4–18, 1995. 220 3. J. Dongarra and T. Hey. The PARKBENCH Benchmark Collection. Supercomputer, 11(2-3):94–114, 1995. 220 4. Message Passing Interface Forum. The message passing interface standard. Technical report, Univeristy of Tennessee, Knoxville, USA, April 1994. 220 5. E. Hern´ andez and T. Hey. White-Box Benchmarking (longer version). Available at http://www.usb.ve/~emilio/WhiteBox.ps, 1998. 221 6. D. Bailey et. al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, USA, March 1994. 221 7. M. Papiani, A.J.G. Hey, and R.W. Hockney. The Graphical Benchmark Information Service. Scientific Programming, 4(4), 1995. 223
Cache Misses Prediction for High Performance Sparse Algorithms Basilio B. Fraguela1 , Ram´on Doallo1 , and Emilio L. Zapata2 1
2
Dept. Electr´ onica e Sistemas. Univ. da Coru˜ na, Campus de Elvi˜ na s/n, A Coru˜ na, 15071 Spain {basilio,doallo}@udc.es Dept. de Arquitectura de Computadores. Univ. de M´ alaga, Complejo Tecnol´ ogico Campus de Teatinos, M´ alaga, 29080 Spain [email protected]
Abstract. Many scientific applications handle compressed sparse matrices. Cache behavior during the execution of codes with irregular access patterns, such as those generated by this type of matrices, has not been widely studied. In this work a probabilistic model for the prediction of the number of misses on a direct mapped cache memory considering sparse matrices with an uniform distribution is presented. As an example of the potential usability of such types of models, and taking into account the state of the art with respect to high performance superscalar and/or superpipelined CPUs with a multilevel memory hierarchy, we have modeled the cache behavior of an optimized sparse matrix-dense matrix product algorithm including blocking at the memory and register levels.
1 Introduction
Nowadays superscalar and/or superpipelined CPUs provide high speed processing, but overall performance is constrained by memory latency, even though a multilevel memory organization is usual. Performance is further reduced in many numerical applications due to the indirect accesses that arise in the processing of sparse matrices, because of their compressed storage format [1]. Several software techniques for improving memory performance have been proposed, such as blocking, loop unrolling, loop interchanging or software pipelining. Navarro et al. [5] have studied some of these techniques for the sparse matrix-dense matrix product algorithm. They have proposed an optimized version of this algorithm as a result of a series of simulations on a DEC Alpha processor. The traditional approach for cache performance evaluation has been the software simulation of the cache effect for every memory access [7]. Other approaches consist in providing performance monitoring tools (built-in counters in current microprocessors) or in designing analytical models.
This work was supported by the Ministry of Education and Science (CICYT) of Spain under project TIC96-1125-C03, Xunta de Galicia under Project XUGA20605B96, and E.U. Brite-Euram Project BE95-1564
D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 224–233, 1998. c Springer-Verlag Berlin Heidelberg 1998
This paper chooses this last approach and addresses the problem of estimating the total number of misses produced in a direct-mapped cache using a probabilistic model. In a previous work, Temam and Jalby [6] model the cache behavior of the sparse matrix-vector product for uniform banded sparse matrices, although they do not consider cross interferences. Sparse linear algebra kernels simpler than the one this paper is devoted to have also been modeled in [2]; the purpose of this work is to demonstrate that it is feasible to extend this type of model to more complex algorithms, such as the one mentioned above. This can enable further improvements to these codes by making it easier to study the effect of such techniques on the memory hierarchy. An extension of the proposed model to K-way associative caches has been introduced in [3]. The remainder of the paper is organized as follows: Section 2 describes the algorithm to be modeled. Basic model parameters and concepts are introduced in Sect. 3, together with a brief explanation, due to space limitations, of the modeling process. In Sect. 4 the model is validated and used to study the cache behavior of the algorithm as a function of the block dimensions and the main cache parameters. Section 5 concludes the paper.
2 Modeled Algorithm
The optimized sparse matrix-dense matrix product code proposed in [5] is shown in Fig. 1. The sparse matrix is stored using the Compressed Row Storage (CRS) format [1]. This format uses three vectors: vector A contains the sparse matrix entries, vector C stores the column of each entry, and vector R indicates in which point of A and C a new row of the sparse matrix starts and permits knowing the number of entries per row. As a result the sparse matrix must be accessed row-wise. The dense matrix is stored in B, while D is the product matrix. The loops are built so that variable I always refers to the rows of the sparse matrix and matrix D, J refers to the columns of matrices B and D, and K traverses the dimension common to the sparse matrix and the dense matrix. Our code is based on a IKJ order of the loops (from the outermost to the innermost loop). This code uses one level blocking at the cache level selecting a block of matrix B with BK rows and BJ columns. The accesses to the block in the inner loop are row-wise, so a copy by rows to a temporal storage WB is desired in order to avoid self interferences and achieve a good exploitation of the spatial locality. The performance has been further improved by applying blocking at the register level. This code transformation consists in the application of strip mining to one or more loops, loop interchanging, full loop unrolling and the elimination of redundant load and store operations. In our case, inner loop J has been completely unrolled and the load and stores for D have been taken out of loop K. This modification requires the existence of BJ registers (d1, d2, . . . , dbj in the figure) to store these values, as Fig. 1 shows. The resulting algorithm has a bidimensional blocking for registers, resulting in fewer loads and stores per arithmetic operation. Besides the number of independent floating point operations in the loop body is increased.
      1   DO J2=1, H, BJ
      2     LIMJ=J2+MIN(BJ, H-J2+1)-1
      3     DO I=1, M+1
      4       R2(I)=R(I)
      5     ENDDO
      6     DO K2=1, N, BK
      7       LIMK=K2+MIN(BK, N-K2+1)
      8       DO J=1, LIMJ-J2+1
      9         DO K=1, LIMK-K2
     10           WB(J,K)=B(K2+K-1,J2+J-1)
     11         ENDDO
     12       ENDDO
     13       DO I=1, M
     14         K=R2(I)
     15         LK=R(I+1)
     16         d1=D(I,J2)
     17         d2=D(I,J2+1)
                ...
     18         dbj=D(I,J2+BJ-1)
     19         DO WHILE (K ...
C                 Lines 19-26: the entries of row I whose column C(K) lies in the
C                 current block of columns are traversed through C(K) and A(K),
C                 accumulating A(K)*WB(*,C(K)-K2+1) into the registers d1, ..., dbj.
     26         ENDDO
     27         D(I,J2)=d1
     28         D(I,J2+1)=d2
                ...
     29         D(I,J2+BJ-1)=dbj
     30         R2(I)=K
     31       ENDDO
     32     ENDDO
     33   ENDDO
→
Fig. 1. Sparse matrix-dense matrix product with IKJ ordering and blocking at the memory and register levels
3
Probabilistic Model
We call intrinsic miss the one that takes place the first time a given memory block is accessed. The accesses to the block from that moment on result in hits unless the block is replaced. This happens in a direct mapped cache if and only if another memory block mapped to the same cache line is accessed. If this block belongs to the same vector as the replaced line it is called a self interference, whereas if it belongs to another vector it is called a cross-interference. The replacement probability grows with the cache area (number of lines) affected by the accesses between two consecutive references to the considered block. This area depends on the memory location of the vectors to be accessed. In most cases, two consecutive references to a given block are separated by accesses to several vectors that cover different areas which are measured as ratios, as they correspond to line replacement probabilities. The total area covered by these accesses in the cache is calculated adding these areas as independent probabilities1 , operation that we express with symbol ∪. This way our probabilistic model does not make any assumptions on the relative location of these vectors in memory. The main parameters our model employs are depicted in Table 1. By word we mean the logical access unit, this is to say, the size of a real or an integer. We have chosen the size of a real, but the model is totally scalable. Integers are considered through the use of parameter r. 1
Given the independent probabilities p1 and p2 of events A1 and A2 respectively, the probability of A1 or A2 is obtained as p1 ∪ p2 = p1 + p2 − p1 p2
Cache Misses Prediction for High Performance Sparse Algorithms
227
Table 1. Notation used Cs Ls Nc M N H BJ BK NBJ NBK Nnz pn r
Cache size in words Line size in words Number of cache lines (Cs /Ls ) Number of rows of the sparse matrix Number of columns of the sparse matrix Number of columns of the dense matrix Block size in the J dimension Block size in the K dimension Number of blocks in the J dimension (H/BJ) Number of blocks in the K dimension (N/BK) Number of entries of the sparse matrix Probability that a position in the sparse Nnz matrix contains an entry M ·N size of an integer size of a real
A main issue of the proposed model we want to point out is the way the cache area affected by the accesses to a given vector is calculated. This area depends on the vector access pattern and the cache parameters. For example, a sequential access to x words loads (x + Ls − 1)/Ls lines on average. The cache area they cover is Ss (x) = min{1, (x + Ls − 1)/Cs }
(1)
In a similar way different expressions or iterative methods have been developed to estimate the average cache area affected by the different types of accesses to a given vector. Combinig these expressions and adding the areas covered by the accesses to the different vectors the average cross interference probability is estimated. Also the self interference probability may be calculated as a function of the vector size, the cache parameters, and the vector access pattern. Miss rate calculation is directly obtained from the previous probabilities. These are the access patterns found in this code: 1. Sequential accesseses on vectors A and C within groups of entries (not positions) located in the same row in a set of BK consecutive columns in lines 19-21. There is a hit in reuse probability in the four loops (with index variables K, I, K2 and J2 from the inner to the outer) with decreasing weight. These accesses are slightly irregular because of the different number of entries in each considered subrow and the offset inside the vectors between the data corresponding to entries in consecutive subrows. The difference between these vectors is that C is always accessed at least once in each iteration of loop 13-31 in line 19, and it is always accessed once more than vector A, as it is used to detect the end of the subrow. 2. Access to matrix B in groups of BJ consecutive subcolumns of BK elements each in line 10. Recall that the access is by columns, and FORTRAN stores
matrices by columns, so it is completely sequential in each subcolumn. There is only a real reuse probability in loop 9-10, as each element is only accessed once. 3. Subrows of BJ consecutive positions of matrix D read in lines 16-80 and written in lines 27-29. The read operation has a hit in reuse probability in the loop indexed by I, and when the cache is large, also in the loop on K2. The write operations will result in hits providing they are not affected by the self interferences in the accesses to a row and the cross interferences due to the accesses in lines 19-26. 4. Vectors R and R2 have almost the same access pattern: they both are sequentially accessed in loop 3-5 and also inside the loop indexed by I (lines 14, 15 and 30: this last line gives the difference between these vectors). In the first case, only an access to the other vector takes place between two consecutive accesses to the same line, while in the second the cross interference probability is much higher. There is a hit in reuse probability in the first access to each line in the two loops when the cache is large with respect to the block size. 5. Matrix WB, which will be studied below, is accessed by rows in loops 8-11 to be filled with the values of the block to process, and by columns -thus, sequentially- in the inner loop, being the column chosen as a function of the column of the considered entry. This last fact makes the selection irregular. When this matrix fits in the cache, there is some hit in reuse probability in the first access to each line due to the previous access in loops 8-11. Conversely, the accesses in line 10 have a hit in reuse probability derived from the accesses in lines 22-24. As an example, and due to space limitations, we shall only explain the way the number of misses on matrix WB has been modeled, as it is usually responsible for most of the misses and it has the more complex access pattern. This matrix contains a transposed copy of the block of matrix B that is being processed, so it has BJ rows and BK columns. We calculate first the number of misses for WB in the inner loop and then the number of misses during the copying process of the block of matrix B. 3.1
Misses on Matrix WB in the Inner Loop
The hit probability for a given line of matrix WB during the processing of the j-th row is calculated as
Phit WB(j) = Σ_{i=1}^{j−1} P (1−P)^{i−1} (1−P)^{i·nl} (1 − Pcross WB(i)) + (1−P)^{j−1} (1−P)^{j·nl} (1 − Pcross WB(j)) Psurv WB    (2)
where nl = max{0, Ss(BJ·BK) − 1} is the average number of lines of WB that compete with another line of this matrix for a given cache line, and P = 1 − (1 − pn)^Av is the probability that a line of matrix WB is accessed during the processing of a subrow of the sparse matrix, Av being the average number of different columns of the matrix that have elements in the same line.
In this way, P(1−P)^{i−1} is the probability that the last access to the considered line took place i iterations earlier in the loop on variable I. Function Pcross WB(j) gives the cross interference probability generated by the accesses to other matrices and vectors during the processing of j subrows of the sparse matrix. Its value is calculated as
Pcross WB(j) = SD(j) ∪ SA(j) ∪ SC(j) ∪ Ss(j·r) ∪ Ss(j·r)    (3)
where the last two terms stand for the area covered by the accesses to R and R2. The areas corresponding to the references to matrix D and vectors A and C are calculated using a deterministic algorithm further developed in [2]. The last addend in expression (2) stands for the probability of the reuse of a line that has not been referenced yet in loop I (probability (1−P)^{j−1}) but that has remained in the cache since the copying process. A hit on this reuse requires excluding both self interferences (probability (1−P)^{j·nl}) and cross interferences (1 − Pcross WB(j)), as well as the line still being in the cache when the copying has just finished (Psurv WB, whose calculation is not included here due to space limitations). The number of misses on matrix WB in the inner loop is calculated by multiplying the average miss probability by the number of accesses to the first element of a line in this loop, as only in the access to the beginning of a line can a miss take place:
Finner WB = (1 − (Σ_{j=1}^{M} Phit WB(j)) / M) · ⌈(BJ + Ls − 1)/Ls⌉ · (Nnz·BK/N) · NBJ · NBK    (4)
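For illustration, the basic quantities of the model can be transcribed almost literally into code. The C sketch below follows Eq. (1), the probability union of footnote 1 and Eq. (2) as written above; the cross-interference term is passed in as a callback and all names are chosen for this example only:

    #include <math.h>

    typedef struct { double Cs, Ls; } cache_t;   /* cache size and line size, in words   */

    /* Eq. (1): cache area covered by a sequential access to x words. */
    static double Ss(const cache_t *c, double x)
    {
        double a = (x + c->Ls - 1.0) / c->Cs;
        return a < 1.0 ? a : 1.0;
    }

    /* Union of independent area/probability values (footnote 1). */
    static double punion(double p1, double p2) { return p1 + p2 - p1 * p2; }

    /* Eq. (2): hit probability for a line of WB while processing row j,
     * given P, nl, the cross-interference function and the survival probability. */
    static double Phit_WB(int j, double P, double nl,
                          double (*Pcross)(int), double Psurv)
    {
        double hit = 0.0;
        for (int i = 1; i <= j - 1; i++)
            hit += P * pow(1.0 - P, i - 1) * pow(1.0 - P, i * nl)
                     * (1.0 - Pcross(i));
        hit += pow(1.0 - P, j - 1) * pow(1.0 - P, j * nl)
                 * (1.0 - Pcross(j)) * Psurv;
        return hit;
    }

Averaging 1 − Phit_WB(j) over the M rows and multiplying by the number of first-of-line accesses, as in Eq. (4), then gives the predicted miss count for WB in the inner loop.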
3.2 Misses on Matrix WB in the Copying
The number of misses during the copying is obtained multiplying the average number of misses per copy process by the number of blocks, NBJ NBK . This value is estimated as
Fcopy WB = Σ_{i=1}^{⌈BJ·BK/Ls⌉} Σ_{j=1}^{BJ} ( A · Ss(1) + ((Ls − A)/BJ) · Pmiss first(i, j) )    (5)
where A = max{0, Ls/BJ − 1} is the average number of accesses after the first one to a given line of matrix WB during an iteration of the loop on line 8. As the equation shows, the miss probability for these accesses is the one associated with an access to a line of B. As for Pmiss first(i, j), it is the probability of a miss for the first access to line i during iteration j of the loop, which is calculated as
Pmiss first(i, j) = Pacc(j−1) (Sself(BJ, BK) ∪ Ss(BK)) + (1 − Pacc(j−1)) ((1 − Phit WB(M)) ∪ Pexp WB(i, j))    (6)
where Pacc(j) is the probability of the line having been accessed in the j previous iterations, which is min{1, (Ls + j − 2)/BJ} except for j = 0, for which the probability is null. The first access to a line results in a miss if it is not in the cache
when the copying begins (probability 1 − Phit WB (M )) or if it has been replaced during the copy of the elements previously accessed, which is Pexp WB (i, j). On the other hand, if the line has been accessed in the previous iteration, only the accesses to BK elements of matrix B located in two consecutive columns (whose area we approach by a completely sequential access) and the accesses to a row of matrix WB can have replaced the considered line. This last probability is calculated using the deterministic algorithm we have mentioned above.
4 Cache Behavior of the Algorithm
The model was validated with simulations on synthetic matrices made by replacing the references to memory by functions that calculate the position to be accessed and feed it to the dineroIII cache simulator, belonging to the WARTS tools [4]. Table 2 shows the model accuracy for some combinations of the input parameters using synthetic matrices. The average error obtained in the trial set was under 5%. Table 2. Predicted and measured misses and deviation of model for optimized sparse matrix-dense matrix product with an uniform entries distribution. Order (M = N ), Nnz and H in thousands, Cs in Kwords and number of misses in millions Order 2 2 2 2 2 2 10 10 10 10 10 10 10 10
Nnz 20 20 20 20 20 20 100 100 100 100 100 100 100 100
pn 0.005 0.005 0.005 0.005 0.005 0.005 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
H 0.2 0.2 0.2 0.2 0.2 0.2 10 10 10 10 10 10 10 10
BJ 25 25 25 25 25 25 20 20 20 20 20 20 20 20
BK 500 500 500 500 500 500 1000 1000 1000 1000 1000 1000 1000 1000
Cs 8 16 32 8 16 32 8 16 32 64 8 16 32 64
predicted measured Ls misses misses 4 1.524 1.516 4 1.162 1.163 4 0.962 0.972 8 0.887 0.872 8 0.667 0.660 8 0.552 0.553 4 646 672 4 608 612 4 549 546 4 490 492 8 401 419 8 373 378 8 321 323 8 287 290
Dev. -0.55% 0.10% 1.08% -1.70% -0.99% 0.17% 4.05% 0.72% -0.45% 0.41% 4.40% 1.35% 0.65% 0.86%
As a first result of our modeling, a linear increase of the number of misses with respect to N and M due to blocking technique is shown. The exception are the accesses to matrix D, as they have an access stride M , which makes them increase noticeably when M is a divisor or multiple of Cs . Figure 2 illustrates the evolution of the number of misses with respect to the block size for a given matrix. We have supposed a value of BJ between 8 and 30, this is, that there
Cache Misses Prediction for High Performance Sparse Algorithms
231
6
6
x 10
x 10
9
12
8 7
8 30 6
25 20
# Misses
# Misses
10
6 5
30 25
4
20
3
4 15
0
5
10
15
20
25
30
35
40
45
5
15
2
10
2
10
1 Bj
50
Bk
Fig. 2. Number of misses on a 16Kw cache with Ls = 8 for a M = N = 5000 sparse matrix with pn = 0.02 and H = 100 as a function of the block size (BK in hundreds)
0
5
10
15
20
25
30
35
40
45
5
Bj
50
Bk
Fig. 3. Number of misses on a 16Kw cache with Ls = 8 for a M = N = 5000 sparse matrix with pn = 0.002 and H = 100 as a function of the block size (BK in hundreds)
are up to 30 registers available for the blocking at the register level. The number of misses grows in the two directions of the axis tending asymptotically to a maximum when BJ × BK > Cs . The minimum number of misses is obtained in a case in which the block fits completely in the cache (BJ = 22, BK = 700). Besides, among all of the possible block sizes the best has been the one with the greatest value of BJ that multiplied by a multiple of 100 (the variation of the BK value in the figure is 100 and that of BJ is 2) gives the greatest value under Cs . The reason is that the greater BJ is, lesser blocks there are in the J direction, reducing the number of accesses. In addition, accesses to matrix WB in the inner loop take place sequentially in the J direction, whose components are stored in consecutive memory positions. As a result exploitation of the spatial locality is improved with the block size increase in this direction. Simulations showed that the actual optimum block was (BJ = 20, BK = 800), with 5.6% less misses, which is a slight difference due to the small errors of the model. Although the evolution of the number of misses usually follows the previous structure, we can obtain important variations depending on several factors. For example, Fig. 3 shows a similar plot for a matrix with lesser pn . Here the optimum block size is (BJ = 24, BK = 2000), which is much greater than Cs (almost three times). This is because pn is much smaller in this matrix, reducing remarkably the replacement probability due to self and cross interferences. In this way much greater blocks may be used, reducing the number of accesses and misses. On the other hand, although there are little variations, the number of misses is stabilized when the block size is similar to or greater than Cs because the interference increase is balanced with the lesser number of blocks to be used. The optimum block size was checked with simulations which indicated the same value for BJ and a value of 2100 for BK, with 0.04% less misses. Finally, in Fig. 4 we study the cache behavior as a function of the cache parameters. Log Cs axis stands for the logarithm base two of Cs measured in Kw. There is an increase of the number of misses when the block size is greater
232
Basilio B. Fraguela et al.
7
x 10 3 2.5
# Misses
2 1.5 1 0
0.5 2 0 1
4 2
3
4
6 5
6
7
8
Log Cs
Log Ls
Fig. 4. Number of misses for a M = N = 5000 sparse matrix with pn = 0.02 and H = 100 using a block BJ = 22, BK = 700 as a function of Cs and Ls than Cs . Besides this value is always much greater for small line sizes (Ls ≤ 8 words). The optimum line size turns out to be 32 when the cache size is greater than or similar to that of the block. This means a balance between the spatial locality exploitation in the access to WB in the inner loop of the algorithm and the self interference probability. As Cs grows, so does the optimum line size due to the reduction in the self and cross interference probabilities.
5 Conclusions
The proposed model may be parameterized and considers matrices with an uniform distribution. It significantly extends the previous models in the literature as it incorporates the three possible types of misses (intrinsic misses, self and cross interferences) in the cache and it takes into account the propagation of data between successive loops of the code (reusability). Besides the modeled code size is noticeably greater than those used in most previous works and their predictions are highly reliable. An effort has been done to structure the model in algorithms and formulae related to typical access patterns so that they can be reused when studying other codes. The time required by our probabilistic model to get the miss predictions is much smaller than that of the simulations, being this difference larger the larger the dimensions and/or sparsity degree of the considered sparse matrix. For example, the time required to generate the trace corresponding to the product of a matrix with M = N = 1500 and 125K entries with a block size (BJ = 10, BK = 500) by a matrix of the same size (almost 246M references) and process it with dineroIII simulating a 16Kw cache with a line size of 8 words was around 20 minutes on a 433 MHz DEC Alpha processor, while our model required between 0.1 and 0.2 seconds. As for the time required to obtain the best block size for a given matrix and cache, the time required to do one simulation for each possible pair (BJ, BK)
in Fig. 2 would be around 54.5 hours on this machine. In any case, as different locations of the vectors in memory generate different numbers of misses, some additional simulations would be needed to get an average for each pair. On the other hand, our model required only 350.5 seconds. As shown in Sect. 4, it is possible to analyze the behavior of the number of misses with respect to the basic characteristics of the cache (its size and its line size), the block dimensions and the features of the matrix. Besides, the behavior of each vector may be studied separately.
h-Relation Models for Current Standard Parallel Platforms
Rodríguez C., Roda J.L., Morales D.G., Almeida F.
Dpto. E.I.O. y Computación, Universidad de La Laguna, Tenerife, Spain
Abstract. This paper studies the validity of the BSP h-relation hypothesis on four current standard parallel platforms. The error introduced by the influence of the number of processors is measured on five communication patterns. We also measure the influence of the communication patterns on the time invested in an h-relation. The asynchronous nature of many current standard message passing programs does not fit easily inside the BSP model, and this has often been criticized as the most serious drawback of BSP. Based on the h-relation hypothesis, we propose an extension to the BSP model valid for standard message passing parallel programs. The use and accuracy of h-relation models on standard message passing programs are illustrated using a parallel algorithm to compute the Discrete Fast Fourier Transform.
1 Introduction
The development of a reasonable abstraction of parallel machines is a formidable challenge. A simple and precise model of parallel computation is necessary to guide the design and analysis of parallel algorithms. Among the plethora of solutions proposed, the PRAM, Networks, LogP and BSP models are the most popular. Section 2 introduces the BSP model. Section 3 presents the four representative platforms used in our study: a 10 Mbit Coaxial Ethernet Local Area Network, a UTP Ethernet LAN, an IBM SP2 and a Silicon ORIGIN 2000. The IBM Scalable POWERparallel SP2 used in the experiments is a distributed memory parallel computer with 44 processors. Processors or nodes are interconnected through a High Performance Switch (HPS). The HPS is a bi-directional multistage interconnection network. The computing nodes are Thin2, each powered by a 66MHz Power2 RISC System/6000 processor. All the algorithms were implemented in PVMe [3], the improved version of the message-passing software PVM [2]. The Origin 2000 system we have used has 64 R10000 processors (196 MHz). Origin systems use distributed shared memory (DSM). To a processor, main memory appears as a single addressable space. A four-port crossbar switch interconnects the processor, the main memory, the router board and the I/O subsystem. The router board interfaces the node with the CrayLink Interconnect fabric. Both the 10 Mb/sec Coaxial LAN and the UTP LAN are composed of Sun Sparc workstations at 70MHz running Solaris 2.4. The UTP LAN uses a FORE-SYSTEMS ES-3810 Ethernet Workgroup Switch that interconnects all
the computers of the LAN. A theoretical 10 Mb/sec point-to-point bandwidth is guaranteed. As for the Origin 2000, the PVM version used in both LANs was 3.3.11. Instead of following the common classical approach of using a library adapted to the proposed model, like Active Messages for LogP [1] or the Oxford BSP library for BSP [5], we tried to check the validity of the models in current standard message passing libraries. The influence of the number of processors on the time spent in an h-relation is measured for the most common communication patterns in section 3. Both the LogP and the BSP models disregard the effects of the pattern. Most algorithms can be built around a small set of communication primitives such as one to all, all to one, all to all, permutations and reductions if an appropriate data layout is used. Is it reasonable to treat everything as the general case? Section 4 measures the impact of communication patterns on the h-relation hypothesis and estimates the BSP-like gaps and latencies for the considered architectures. The asynchronous nature of many PVM/MPI programs does not fit easily inside the BSP model. Often this has been criticized as the most serious drawback of BSP. In section 5 we propose an extension to BSP, the EBSP model, for current standard message passing parallel programs. The use of the BSP and EBSP models in conventional PVM/MPI programs will be illustrated in this section using the Fast Fourier Transform. Conclusions are presented in Section 6.
2 The Bulk Synchronous Parallel Model (BSP)
The BSP model [8] views a parallel machine as a set of p processor-memory pairs, with a global communication network and a mechanism for synchronizing all the processors. A BSP calculation consists of a sequence of supersteps. The cost of a BSP program can be calculated simply by summing the cost of each separate superstep executed by the program; in turn, for each superstep the cost can be decomposed into: (i) local computation; (ii) global exchange of data and (iii) barrier synchronization. The communication pattern performed on a given superstep is called an h-relation if the maximum number of packets a processor communicates in the superstep is h: h = max { in_i @ out_i : i ∈ {0, ..., p − 1} }, where @ is a binary operator, usually the maximum or the sum. Values in_i and out_i respectively denote the number of packets entering and leaving processor i. Both the particular operation @ and the size of the unit packet depend on the particular architecture. The operation @ is closer to the maximum (@ = max) for machines allowing a complete parallel processing of incoming/outgoing messages. The correct interpretation for each case depends on the number of input/output ports of the network interface. The results shown in this paper correspond to taking the sum as the operator @. The two basic BSP parameters that model a parallel machine are: the gap g, which reflects per-processor network bandwidth, and the minimum duration of a superstep L, which reflects the latency to send a packet through the network as
well as the overhead to perform a global synchronization. The foundation of the BSP model lies in the h-relation hypothesis, which states that the communication time spent on an h-relation is given by gh. Let W denote the maximum time spent in local computation by any processor during the superstep. The BSP model assumes that the running time of a superstep is bounded by the formula: Time_Superstep = W + gh + L
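The following C fragment is a minimal sketch, not taken from the paper, of how this bound can be accumulated over the supersteps of a program; the per-superstep work and h-relation sizes used in main are hypothetical, and the g and L constants are the IBM SP2 values that appear later in Table 12.

#include <stdio.h>

/* Sketch: bound the running time of a BSP program by summing
 * W + g*h + L over its supersteps. */
double bsp_time(const double *w, const double *h, int steps, double g, double L)
{
    double t = 0.0;
    for (int s = 0; s < steps; s++)
        t += w[s] + g * h[s] + L;   /* local work + h-relation + barrier */
    return t;
}

int main(void)
{
    /* hypothetical per-superstep work (seconds) and h-relation sizes (bytes) */
    double w[] = { 0.010, 0.020 };
    double h[] = { 26880.0, 107520.0 };
    printf("predicted time: %g s\n", bsp_time(w, h, 2, 3.4454e-8, 8.0835e-5));
    return 0;
}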
3 The Influence of the Number of Processors in the Validity of the h-relation Hypothesis for Five Different Patterns in Four Different Architectures
In the following paragraphs we will study the variation of the h-relation time on the four architectures with the number of processors under five different communication patterns: Exchange (abbreviated E), PingPong (PP), OneToAll (OA), AllToOne (AO) and AllToAll (AA). Five hundred experiments were carried out for each architecture, each pattern and each number of processors. The h-relation size varies between 6720 and 1720320 bytes. In all the tables presented in this work, time is given in seconds. The rest of the section is dedicated to the different communication patterns Π in {E, PP, OA, AO, AA}. For each pattern Π, two tables denoted Π.1 and Π.2 are presented. The sub-column labeled Time in tables Π.1 gives the times for 8 processors using pattern Π. It is observed that the time TΠ(h) spent on an h-relation of size h follows a linear equation TΠ(h) = LΠ + gΠ·h. To obtain the general linear approach to the h-relation time we have computed the least square fit of the average times for the different numbers of intervening processors. Tables Π.2 present the values of LΠ and gΠ for the four architectures. Columns labeled MaxErr in tables Π.1 contain the maximum percentage of error computed according to the formula: MaxErr = 100 ∗ max{ |TimeΠ,i(h) − (LΠ + gΠ·h)| : i ∈ HΠ } / min{ TimeΠ,i(h) : i ∈ HΠ }
where TimeΠ,i(h) denotes the time spent when i processors communicate according to pattern Π. Index i varies in HΠ = {2, 4, 6, 8} processors for the Exchange and PingPong patterns, and i is in HΠ = {4, 6, 8} for the OneToAll, AllToOne and AllToAll patterns. An injection pattern is characterized by the existence of a subset S of processors of the total set of available processors H, such that each processor p of S sends its message to a different processor d(p) of H. We say that an injection pattern is an Exchange when d(d(p)) = p for any processor p in S. An injection pattern is called a PingPong if and only if S ∩ d(S) = ∅. The PingPong pattern gives place to an h = m relation, where m is the message size. For the Exchange, the h-relation is h = 2 ∗ m. Table E.1 shows a factor of almost 3 between the times of the Origin and the IBM SP2. The same factor appears between the UTP LAN and the COA LAN. However, the values in column MaxErr prove that the IBM SP2 is, among the 4 architectures, the most invariant in the number of processors. The technological advance is remarkable, as observed in the low value of MaxErr for the UTP LAN when compared
Table 1. The Exchange Pattern (E.1)

               ORIGIN 2000          IBM SP2              UTP LAN              COA LAN
  h-relation   Time       MaxErr    Time       MaxErr    Time       MaxErr    Time       MaxErr
  6720         0.000080   26.43     0.000310   16.78     0.013278   26.25     0.033285   160.92
  26880        0.000227   17.78     0.000839   7.48      0.046496   14.18     0.132507   133.94
  107520       0.001058   4.71      0.003103   4.04      0.173081   7.57      0.573673   172.24
  430080       0.004230   4.32      0.012738   0.48      0.674435   5.03      2.323600   179.46
  1720320      0.017021   4.25      0.050868   0.69      2.745955   7.02      9.226720   177.65
Table 2. Values of gE and LE (E.2)

         ORIGIN 2000    IBM SP2       UTP LAN        COA LAN
  LE     -1.1759E-05    6.1566E-05    -9.3670E-04    -3.2527E-04
  gE     9.8942E-09     2.9675E-08    1.5923E-06     5.3772E-06
with the 178% of the COA LAN. The negative values of the Exchange latency LE in Table E.2 are due to the unsuitability of taking the byte as the packet size unit. However, this choice was determined by the goal of comparing different architectures. The PingPong times appear in Table PP.1. As for the Exchange, there is an almost perfect invariance of the IBM SP2 communication time in the number of intervening couples. For this pattern h = m; opposite to what occurs for the Exchange, there is no parallelism between inputs and outputs. The time ratio between the IBM SP2 and the Origin diminishes from three to a factor of two. The same decrease is observed for the UTP LAN and COA LAN. There is a decrease both in the time and in the MaxErr columns for the COA LAN architecture. This is due to the smaller number of collisions. The Personalized One to All communication pattern measures the outbound node performance. From Table OA.1 it follows that the OneToAll communication time is better approached by a piecewise linear function but, for the sake of simplicity, we approach its behavior by a single linear function. Although it does not appear in the tables, for small values of h the time slowly grows with the number of processors. This is due to the heavier influence of the latency. For larger values of h, time slightly decreases with the number of processors, due to the increasing parallelism introduced in the communications. This phenomenon is especially noticeable for the COA LAN [7]. Observe the decrease in the g value of the COA LAN to 1.072E−06 from 4.08E−06 for the PingPong pattern. For that reason the OneToAll is the fastest pattern for the COA LAN architecture. The Personalized All to One Communication Pattern measures the inbound node performance. Each processor sends a distinct message of size m to the receiver processor. This experiment was implemented by making the receiver processor call the routine pvm_recv() p − 1 times, while the senders make a pvm_send(). As for the OneToAll, we approach the AllToOne time by a linear function LAO + gAO ∗ m ∗ (p − 1) in the h-relation size h = m ∗ (p − 1).
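The least-square fit used to build the Π.2 tables can be written in a few lines; the sketch below (not the authors' code) fits T(h) = L + g·h to the ORIGIN 2000 Exchange times of Table 1. Since only the 8-processor times are used here, while the paper fits the averages over all processor counts, the result only approximates the values of Table 2.

#include <stdio.h>

/* Ordinary least-squares fit of time = L + g*h (illustrative sketch). */
static void fit(const double *h, const double *t, int n, double *L, double *g)
{
    double sh = 0, st = 0, shh = 0, sht = 0;
    for (int i = 0; i < n; i++) {
        sh += h[i]; st += t[i]; shh += h[i] * h[i]; sht += h[i] * t[i];
    }
    *g = (n * sht - sh * st) / (n * shh - sh * sh);
    *L = (st - *g * sh) / n;
}

int main(void)
{
    double h[] = { 6720, 26880, 107520, 430080, 1720320 };
    double t[] = { 0.000080, 0.000227, 0.001058, 0.004230, 0.017021 };
    double L, g;
    fit(h, t, 5, &L, &g);
    printf("L = %g s   g = %g s/byte\n", L, g);
    return 0;
}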
Table 3. The PingPong Pattern (PP.1)

               ORIGIN 2000          IBM SP2              UTP LAN              COA LAN
  h-relation   Time       MaxErr    Time       MaxErr    Time       MaxErr    Time       MaxErr
  6720         0.000107   54.26     0.000386   8.58      0.015953   20.08     0.027919   107.54
  26880        0.000454   7.53      0.001197   10.87     0.056722   6.97      0.112579   88.11
  107520       0.001945   2.29      0.004811   0.87      0.220280   3.59      0.443787   80.56
  430080       0.008132   2.95      0.018752   0.24      0.850265   2.37      1.773765   80.13
  1720320      0.032797   4.16      0.074292   0.52      3.376577   6.48      7.023510   80.97
Table 4. Values of gPP and LPP (PP.2)

          ORIGIN 2000    IBM SP2       UTP LAN       COA LAN
  LPP     -7.5956E-05    1.5691E-04    6.2546E-03    5.6284E-03
  gPP     1.9071E-08     4.3467E-08    1.9609E-06    4.0818E-06
Table 5. The OneToAll Pattern (OA.1)

               ORIGIN 2000          IBM SP2              UTP LAN              COA LAN
  h-relation   Time       MaxErr    Time       MaxErr    Time       MaxErr    Time       MaxErr
  6720         0.000248   141.77    0.000383   47.26     0.022926   39.91     0.018732   25.57
  26880        0.000354   20.09     0.001022   9.73      0.046733   11.31     0.033738   13.01
  107520       0.000962   37.79     0.003597   4.37      0.170056   17.36     0.118535   10.19
  430080       0.005203   8.60      0.014700   2.46      0.744115   23.66     0.471055   10.21
  1720320      0.020398   9.06      0.059125   2.73      2.818493   17.40     1.846068   7.28
Table 6. Values of gOA and LOA (OA.2)

          ORIGIN 2000    IBM SP2       UTP LAN       COA LAN
  LOA     -3.4072E-05    2.0023E-05    1.0634E-02    7.6290E-03
  gOA     1.1704E-08     3.4320E-08    1.6493E-06    1.0725E-06
Table 7. The AllToOne Pattern (AO.1)

               ORIGIN 2000          IBM SP2              UTP LAN              COA LAN
  h-relation   Time       MaxErr    Time       MaxErr    Time       MaxErr    Time       MaxErr
  6720         0.000232   142.86    0.000449   26.87     0.015997   74.77     0.021609   65.20
  26880        0.000281   37.37     0.001133   14.50     0.027941   12.98     0.031973   13.64
  107520       0.000745   19.97     0.003719   5.63      0.114966   5.81      0.125891   8.10
  430080       0.003445   6.19      0.014139   3.21      0.469320   5.78      0.512352   6.11
  1720320      0.014098   9.78      0.057669   1.40      1.814855   7.00      2.006803   4.93
Table 8. Values of gAO and LAO (AO.2)

          ORIGIN 2000    IBM SP2       UTP LAN       COA LAN
  LAO     4.0212E-05     9.1160E-05    3.5069E-03    6.9769E-03
  gAO     8.1586E-09     3.3386E-08    1.0521E-06    1.1619E-06
Table 9. The AllToAll Pattern (AA.1)

               ORIGIN 2000          IBM SP2              UTP LAN              COA LAN
  h-relation   Time       MaxErr    Time       MaxErr    Time       MaxErr    Time       MaxErr
  6720         0.000411   181.56    0.000694   46.40     0.040892   104.27    0.046908   286.30
  26880        0.000529   76.85     0.001287   20.63     0.059168   12.76     0.221194   120.10
  107520       0.001091   16.81     0.003605   5.68      0.170425   7.96      0.655296   93.53
  430080       0.004102   16.36     0.014036   2.87      0.700719   6.14      2.348122   78.44
  1720320      0.019053   3.70      0.055054   3.39      2.749624   4.26      9.041233   73.94
Except for the COA LAN, for all the architectures the inbound bandwidth gAO has decreased compared with the outbound bandwidth gOA. This turns out to be the fastest pattern for the Origin and UTP LAN architectures. In the Personalized All to All Communication Pattern each processor has to send p − 1 different messages of length m to the other p − 1 processors. Processor i cyclically sends the messages to processors i + 1, i + 2, ..., i − 1. The AllToAll time can be modeled by a linear function LAA + gAA·2m(p − 1). The IBM SP2 achieves its best performance for this pattern. While the performance of the UTP LAN only doubles the COA LAN performance for a pattern free of collisions like the PingPong, it is more than three times faster for this pattern. This is a consequence of the improvement achieved in the bisection bandwidth.
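A rough MPI sketch of this cyclic all-to-all exchange is given below. It is an illustration only, not the benchmark code actually run (which used PVM/PVMe on several of the platforms); the message size and buffers are placeholders.

#include <mpi.h>
#include <stdlib.h>

/* Sketch of the personalized all-to-all pattern: rank i sends a distinct
 * m-byte message to i+1, ..., i-1 (cyclically) and receives p-1 messages. */
int main(int argc, char **argv)
{
    int p, me, k, m = 6720;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    char *sbuf = malloc(m), *rbuf = malloc(m);
    MPI_Barrier(MPI_COMM_WORLD);
    for (k = 1; k < p; k++) {
        int dest = (me + k) % p, src = (me - k + p) % p;
        MPI_Sendrecv(sbuf, m, MPI_CHAR, dest, 0,
                     rbuf, m, MPI_CHAR, src, 0, MPI_COMM_WORLD, &st);
    }
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}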
4 The Influence of the Communication Pattern in the h-relation
BSP states that the actual times spent on these five patterns for the same h-relation have to be similar. The influence of the communication pattern on the time spent in an h-relation is shown in Table 11. To obtain the general linear approach to the h-relation time, we have computed the least square fit of the average times Taverage over the set Π of patterns and the different numbers of processors: Taverage(h) = ΣΠ (Σ_{i∈HΠ} TimeΠ,i(h) / |HΠ|) / |Π|. These values of L and g appear in Table 12. There is a factor of ten between the BSP-PVM value for the IBM SP2, g = 3.44E−08, and the corresponding Oxford BSP library values, g = 35E−8 and L = 4.62E−4, for the same machine
Table 10. Values of gAA and LAA (AA.2)

          ORIGIN 2000    IBM SP2       UTP LAN       COA LAN
  LAA     2.3919E-05     3.9166E-04    1.2658E-02    8.7903E-02
  gAA     1.0984E-08     3.1578E-08    1.5957E-06    5.2131E-06
Table 11. AvErr and MaxErr for all the architectures

               ORIGIN 2000        IBM SP2           UTP LAN           COA LAN
  h-relation   AvErr    MaxErr    AvErr    MaxErr   AvErr    MaxErr   AvErr    MaxErr
  6720         59.82    304.00    20.58    70.37    29.82    141.41   42.59    10.13
  26880        25.75    103.00    8.24     19.34    15.81    20.07    42.30    86.66
  107520       26.11    108.63    12.59    33.01    13.67    5.04     43.15    49.51
  430080       25.97    120.06    10.76    30.38    13.41    19.12    42.96    57.48
  1720320      24.79    126.06    10.81    29.61    13.83    81.88    42.81    228.00
[4]. Columns labeled AvErr and MaxErr respectively show the average and maximum errors, defined as:
AvErr = 100 ∗ (Σ_{i∈Π} |Ti(h) − (g·h + L)| / |Π|) / (Σ_{i∈Π} Ti(h) / |Π|)
MaxErr = 100 ∗ max_{i∈Π} |Ti(h) − (g·h + L)| / min_{j∈Π} Tj(h)
The table proves that the time invested in moving data in the IBM SP2 and the UTP LAN is more independent of the specific communication pattern than in the other two architectures.
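As a worked illustration of these two formulas, the following sketch computes AvErr and MaxErr for a single h-relation size from the five 8-processor IBM SP2 times of the pattern tables and the L and g of Table 12; it is not the authors' code, and it omits the averaging over processor counts performed for Table 11.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* E, PP, OA, AO, AA times for h = 1720320 bytes on the IBM SP2 */
    double t[5] = { 0.050868, 0.074292, 0.059125, 0.057669, 0.055054 };
    double L = 8.0835e-5, g = 3.4454e-8, h = 1720320.0;   /* Table 12 */
    double pred = L + g * h, sumdev = 0, sumt = 0, maxdev = 0, tmin = t[0];
    for (int i = 0; i < 5; i++) {
        double d = fabs(t[i] - pred);
        sumdev += d; sumt += t[i];
        if (d > maxdev) maxdev = d;
        if (t[i] < tmin) tmin = t[i];
    }
    printf("AvErr = %.2f%%   MaxErr = %.2f%%\n",
           100.0 * (sumdev / 5.0) / (sumt / 5.0), 100.0 * maxdev / tmin);
    return 0;
}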
5 Extending the BSP Model to Current Standard Message Passing Libraries: EBSP
The barrier synchronization after each step imposed by BSP does not completely agree with the way PVM and MPI programs are written. Still, PVM and MPI programs can be divided into "message steps" that we will call "M-steps", in roughly the same sense as BSP "supersteps". In an M-step a processor i = 0, ..., p − 1 performs some local computation, sends the data needed by other processors and receives the data it needs for the next M-step. Processors may be in different M-steps at a given time, since no global barrier synchronization is used. However, as in pure BSP, we assume that the total number of M-steps R performed by all the p processors is the same, and communications always occur among processors in adjacent steps k − 1 and k (computation on any processor can be arbitrarily divided to achieve this goal). The time ts,i when processor i finishes its step s is bounded by the "BSP-like time" Ts given by:
T1 = max{w1,i} + g ∗ max{in1,i @ out1,i} + L
Ts = Ts−1 + max{ws,i} + g ∗ max{ins,i @ outs,i} + L    (1)
Table 12. The values of g and L for all the architectures

       ORIGIN 2000    IBM SP2       UTP LAN       COA LAN
  L    -1.5327E-05    8.0835E-05    3.8229E-03    1.3220E-02
  g    1.2192E-08     3.4454E-08    1.5107E-06    2.4318E-06
where i = 0, 1, ..., p − 1 and s = 2, ..., R; @ is in {+, max}; ws,i is the time spent in computation by processor i in M-step s; ins,i and outs,i respectively denote the number of messages received and sent by processor i in step s. The gap and latency values g and L can be computed as proposed in the former paragraph. Thus the "BSP-like time" TR gives us an upper bound approximation to the execution time of a PVM/MPI program. A closer bound to the actual PVM/MPI time ts,i when processor i finishes its s-th M-step is the value Φs,i given by the Extended BSP model (EBSP) we propose here. Let us define, for a given step s and processor i, the set Ωs,i of in-partners of i in step s as Ωs,i = { j : processor j sends a message to processor i in step s } ∪ { i }. The EBSP time of a PVM/MPI program is given by the formulas:
Φ1,i = max{w1,j : j ∈ Ω1,i} + g ∗ h1,i + L,   h1,i = max{in1,j + out1,j : j ∈ Ω1,i},   i = 0, 1, ..., p − 1
Φs,i = max{Φs−1,j + ws,j : j ∈ Ωs,i} + g ∗ hs,i + L,   hs,i = max{ins,j + outs,j : j ∈ Ωs,i},   s = 2, ..., R    (2)
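A small C sketch of how formula (2) can be evaluated is given below; it is an illustration only, under the assumption that the in-partner sets, per-step work and message counts are available as arrays, and it is not part of the model's implementation.

/* Sketch: evaluate the EBSP times Phi[s][i] of formula (2).
 * part[s][i][j] is nonzero iff j belongs to the in-partner set of i in
 * step s (i itself included); w, inp and outp hold per-step work and
 * message counts. The fixed sizes are illustrative placeholders.        */
#define P 4
#define R 3

double Phi[R][P];

void ebsp_times(double w[R][P], double inp[R][P], double outp[R][P],
                int part[R][P][P], double g, double L)
{
    for (int s = 0; s < R; s++)
        for (int i = 0; i < P; i++) {
            double wait = 0.0, h = 0.0;
            for (int j = 0; j < P; j++) {
                if (!part[s][i][j]) continue;
                double arrive = (s > 0 ? Phi[s - 1][j] : 0.0) + w[s][j];
                if (arrive > wait) wait = arrive;
                if (inp[s][j] + outp[s][j] > h) h = inp[s][j] + outp[s][j];
            }
            Phi[s][i] = wait + g * h + L;   /* formula (2) with @ taken as + */
        }
    /* the EBSP program time is then Psi = max over i of Phi[R-1][i] */
}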
The total time of a PVM/MPI program in the EBSP model is given by Ψ = max{ΦR,j : j ∈ {0, ..., p − 1}}, where R is the total number of steps. Instead of a global barrier synchronization, the EBSP model implies a synchronization among partners. Formula (2) becomes the BSP-like time of formula (1) when the family of sets Ωs,i is the whole set of processors {0, ..., p − 1}. (This is the case if the BSP barrier synchronization implies Ωs,i = {0, ..., p − 1} for all i.)

5.1 Example: The Fast Fourier Transform
We will illustrate the BSP and EBSP models using the parallel algorithm to compute the Fast Fourier Transform described in [6]. The total time predicted by the model for this algorithm is:
T_{log(p)+1} = Ψ = D(N/p) + F(N/p)·log(N/p) + Σ_{i=0}^{log(p)−1} ( L + g·(N·2^i/p) + V·N·2^i/p )    (3)
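For illustration, the following sketch evaluates formula (3) numerically; the constants plugged into main come from Tables 12 and 13 for the IBM SP2, but the unit handling (bytes versus floats) is a simplification, so the printed value should not be compared directly with Table 14.

#include <stdio.h>
#include <math.h>

/* Sketch: evaluate the predicted FFT time of formula (3). D, F, V are the
 * machine constants of Table 13; L and g the h-relation parameters of
 * Table 12; N the FFT size and p the number of processors.               */
double fft_model(double N, double p, double D, double F, double V,
                 double L, double g)
{
    double t = D * (N / p) + F * (N / p) * log2(N / p);
    for (int i = 0; (1 << i) < p; i++)            /* i = 0 .. log2(p)-1 */
        t += L + g * (N * (1 << i) / p) + V * (N * (1 << i) / p);
    return t;
}

int main(void)
{
    printf("%g s\n", fft_model(524288, 8, 5.5161e-7, 5.8598e-8, 8.6916e-7,
                               8.0835e-5, 3.4454e-8));
    return 0;
}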
Table 13 contains the values of the three computational constants D, F and V. The good accuracy of the h-relation model for a 524,288-float FFT example is shown in Table 14. Columns labeled MODEL in Table 14 present the time computed according to formula (3). Under the REAL label are the actual times.
Table 13. Division (D), sequential FFT (F) and combinations (V) constants

  Constants      D            F            V
  ORIGIN 2000    2.3928E-07   2.8453E-07   4.8103E-07
  IBM SP2        5.5161E-07   5.8598E-08   8.6916E-07
  UTP-COA LAN    6.0400E-07   1.4400E-06   2.6500E-06
Table 14. FFT real times versus h-relation model time

  Processors            2        4        8
  ORIGIN 2000  REAL     1.61     0.96     0.85
               MODEL    1.55     0.86     0.56
               ERROR    3.68     10.85    34.10
  IBM SP2      REAL     3.36     1.90     1.27
               MODEL    3.21     1.83     1.18
               ERROR    1.59     3.85     6.82
  UTP LAN      REAL     12.36    11.31    11.03
               MODEL    11.00    8.41     7.27
               ERROR    11.01    25.65    34.13
  COA LAN      REAL     11.91    10.80    10.94
               MODEL    12.76    12.01    11.73
               ERROR    7.14     11.17    7.22
Entries in columns ERROR give the error percentage computed by the formula ERROR = 100 ∗ (REAL − MODEL)/REAL. The accuracy of the model is especially good for the IBM SP2 and acceptable for the ORIGIN 2000 and UTP LAN. The error grows with the number of processors. The curious exception is the Coaxial LAN. On this architecture, the time of the model for the first two PingPong communications (i = 0 and i = 1) is over the actual time while it is under the actual time of the third communication (i = 2), producing a compensation that leads to a false accuracy. In fact, this is the only architecture where the model fails its prediction of a decreasing behavior in the actual time (see columns 4 and 8 in row COA LAN).
6 Conclusions
We have studied the validity of the h-relation hypothesis on four representative platforms: a 10 Mbit Coaxial Ethernet Local Area Network, a UTP Ethernet LAN, an IBM SP2 and a Silicon ORIGIN 2000. Current standard message passing libraries have been used instead of specific BSP environments. The maximum and average errors introduced by the influence of the pattern and the number of processors have been measured. The collective computation provided by MPI and the new version of PVM 3.4 makes feasible the use of a methodology compatible with the Bulk Synchronous Programming model.
However, the more relaxed nature of many PVM/MPI programs does not easily fit inside the BSP model. We have proposed a new model, EBSP, that extends the BSP model to current standard message passing parallel programs. The use and accuracy of the BSP and EBSP models have been illustrated using a Fast Fourier Transform example. Acknowledgments: Part of this research has been done at C4 (CESCA-CEPBA) and at IAC in Tenerife.
References 1. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., Eicken, T. LogP: Towards a Realistic Model of Parallel Computation. Proceedings of the 4th ACM SIGPLAN, Sym. Principles and Practice of Parallel Programming. May 1993. 235 2. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Mancheck, R., Sunderam, V. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing. MIT Press. 1994. 234 3. IBM PVMe for AIX User’s Guide and Subroutine Reference Version 2, Release 1. Document number GC23-3884- 00. IBM Corp. 1995. 234 4. Marin, J., Martinez, A. Testing PVM versus BSP Programming, VIII Jornadas de Paralelismo, Sept 1997. pp.153-160. 240 5. Miller, R., Reed, J.L. The Oxford BSP Library Users’ Guide. Technical Report, Programming Research Group, University of Oxford. 1993 235 6. Roda, J., Rodriguez, C., Almeida, F., Morales, D.G. Predicting the Performance of Injection Communication Patterns on PVM. 4th European PVM/MPI Users’ Group Meeting. Springer-Verlag LNCS. pp. 33-40. Cracow. Nov-97. 241 7. Roda, J., Rodriguez, C., Almeida, F., Morales, D.G. Prediction of Parallel Algorithms Performance On Bus-Based Networks using PVM. 6th EuroMicro Workshop on Parallel and Distributed Processing. pp. 57-63. Madrid. Jan-98 237 8. Valiant, L.G. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8): 103-111, August 1990.
Practical Simulation of Large-Scale Parallel Programs and its Performance Analysis of the NAS Parallel Benchmarks
Kazuto Kubota (1), Ken'ichi Itakura (2), Mitsuhisa Sato (1), and Taisuke Boku (2)
(1) Real World Computing Partnership, Tsukuba, Japan
(2) University of Tsukuba, Tsukuba, Japan
Abstract. A simulation technique for very large-scale data parallel programs is proposed. In our simulation method, a data parallel program is divided into computation and communication sections. When the control flow of the parallel program does not depend on the contents of network messages, the computation time on each processor is calculated independently. An instrumentation tool called EXCIT is used to calculate the execution time on the target architecture and generate message traces. The communication time is calculated on the message traces by using a network simulator, which is generated by a network simulator generating system, INSPIRE. With our tool set, the behavior of parallel programs on thousands of processors can be estimated within a practical time span. We demonstrate our method by analyzing the class B problems of the LU and MG programs of the NAS Parallel Benchmarks, with various parameters such as cache size and network bandwidth examined. We found that communication overhead affects the total execution time considerably, while the cache effect is small.
1 Introduction
Recent massively parallel processor (MPP) architectures are scalable to very large numbers of processors. Some huge-scale MPP systems with thousands of processors, such as CP-PACS[1] and ASCI Red[2], have actually been built. In this paper, we describe a practical simulation technique for very large-scale data parallel programs. While the exact simulation of a parallel system is an effective technique to analyze the performance of such large-scale systems when designing new large-scale systems, it is difficult to simulate their behavior in practical time, especially in the case of large-scale parallel systems. In several approaches[3][4], analytical modeling is used to understand the scalability of systems. However, it is difficult to examine the user program's behavior in detail because many factors such as program implementation, compiler optimization and network collision are not taken into account. For systems with only a small number of Processing Units (PU), there is another approach in which the entire system is actually simulated by both an instruction simulator and a network simulator[5]. Due to the limitation of the simulation time and the memory size
on a simulation platform, it is difficult to apply this approach to large-scale parallel programs. Our simulation approach is based on a code instrumentation technique and network simulation. In the first step, we execute the instrumented version of the original parallel program independently to calculate the execution time of the computation part of the parallel program on the target architecture and to get its message trace files. We assume that the control flow of the parallel program does not depend on the contents of network messages. Then, we compute the communication time in the target network using the message trace files. We use the instrumentation tool EXCIT[6] to instrument codes to estimate the computation time on the target processor. INSPIRE[7] is a network simulator generating system which generates the network simulator from a given target network description. Our simulation technique has the following advantages: – Up to thousands of processors can be simulated in a reasonable time. – Real-size programs can be analyzed. – Computation and communication overlap can be estimated precisely. To demonstrate the effectiveness of our technique, we apply it to analyze the class B size MG kernel and the LU application in the NAS Parallel Benchmarks[8]. We analyze the performance of these programs, varying the number of processors up to 1024, some parallel processor parameters such as cache size, and network parameters including network topology, bandwidth, and communication overhead. The simulation results help us to determine which hardware parameters are suitable for parallel programs. This information is also useful in selecting an appropriate platform and number of processors and in program tuning.
2 Simulation Method of Large-Scale Parallel Programs
Our goal in this study is to analyze the performance of a large-scale parallel system. The data parallel program is one of the important models in large-scale scientific applications. We assume that our simulated program is a data parallel program and that the control flow of the parallel program does not depend on the contents of network messages.

2.1 Structure of Data Parallel Program
The typical structure of a data parallel program using four processors is shown in Figure 1. The data parallel program is decomposed into the following sections:
– Computation section: The section in which only computation is executed (Thick lines in PU0). The execution time of the computation section is denoted by T comp.
Fig. 1. Structure of data parallel program.
– Communication section: The section in which computation and communication are executed (Thin lines in PU0). The execution time of the communication section is denoted by Tcomm. The communication section is further divided into two sections. • Point to point communication section: The section which contains send/recv and the computations surrounding send/recv. The start and end points of this section are specified by the user. The execution time of this section (Tpp) consists of computation time (Tppcomp) and communication time (Tppcomm). • Global communication section: The section of global communications such as barrier and gather. The execution time of the global communication section (Tg) consists of computation time (Tgcomp) and communication time (Tgcomm).

2.2 Simulation Steps
Figure 2 illustrates the overview of our simulation system. The simulation steps are as follows: – The original parallel programs (prog#0∼prog#3) are instrumented to generate the instrumented version of codes (prog’#0∼prog’#3, prog”#0). – prog’#0∼prog’#3 are executed independently to generate the message trace. The computation time on the target architecture is also calculated by another instrumented code(prog”#0). – Network simulation on these message trace is performed to compute the communication time. The instrumentation tool, called EXCIT, is used to instrument the original code to compute T comp on the target architecture and generate the message trace into the files. Message passing functions in prog”#0 are replaced by dummy functions. T comp of PU0 on the target architecture is calculated by executing prog”#0. The execution time of the computation section is about the same as the time when that program is executed as a parallel program, because we assume that the control flow of the parallel program does not depend on the contents of messages. In most data parallel programs, the computation time of
Fig. 2. Overview of Data parallel program simulation.
PU0 trace: exec 10040; recv src=1 size=10 tag=3; exec 244; send dest=1 size=1 tag=2; exec 1354; ...
PU1 trace: exec 10323; send dest=0 size=4 tag=3; exec 248; recv src=0 size=10 tag=2; exec 1322; ...

Fig. 3. Message trace file.

each PU program is approximately the same. Therefore, we only examined the Tcomp of the PU0 program. Message traces are generated by executing prog'#0∼prog'#3. Communication functions in prog'#0∼prog'#3 are replaced by trace functions which output communication function information such as source PU, destination PU, tags, etc. Figure 3 shows examples of the message trace files. Each PU has its own trace file. In the examples, "recv" denotes a blocking receive. The source, destination and message tag are recorded, but the message data is not. "exec" gives the number of cycles consumed between communication functions. Computation and communication overlap can be calculated precisely, because communications are simulated with their preceding and succeeding computations. By using a network simulator generated by INSPIRE, Tpp is calculated on the message trace files. Tppcomp is calculated by summing the "exec" entries of each message trace file and taking the maximum result over all files. Tppcomm is
Fig. 4. Execution cycle calculation by EXCIT.

calculated by subtracting Tppcomp from Tpp. The number of global communications is extracted from the message trace files, and Tgcomp and Tgcomm are obtained by referring to the database file. The computation and communication times of the global communications appearing in the parallel program are calculated in advance and stored in the database file. Finally, the computation time (Tallcomp) and the communication time (Tallcomm) of the parallel program are obtained by the following equations:
Tallcomp = Tcomp + Tppcomp + Tgcomp,   Tallcomm = Tppcomm + Tgcomm

2.3 EXCIT and Execution Cycle Counting
EXCIT is an assembler-level instrumentation tool that inserts additional user code into the original program. In our simulation system, the instrumented code calculates the execution cycles of the program at execution time. Figure 4 shows its overview. In EXCIT, the parser translates the xxx.s input file into an intermediate representation. This intermediate representation is then analyzed by the analyzer and information is inserted into the intermediate representation as follows. – The source code is divided into basic blocks. Hook functions that call a basic block counter are inserted at the top of each of the basic blocks. – The load and store instructions are extracted. Hook functions are inserted to call a cache simulator before the load and store instructions. (A minimal sketch of this hook-based counting appears below.) The code generator generates an instrumented code xxx_.s from the instrumented intermediate representation. This code is linked with the basic block counter and the cache simulator, and an object code (a.out) is generated. The instruction analyzer calculates the typical execution cycle of each basic block and generates a table file xxx.bb; the typical execution cycle of each basic block is recorded in this file. The typical execution cycle is the execution cycle of each basic block under the following conditions. – The instruction cache and data cache are hit. – The floating point operation latency is the typical cycle. – No exceptions happen.
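The hook-based counting can be pictured with the toy C sketch below. It is illustrative only: it uses a tiny direct-mapped cache rather than the four-way cache modeled in the experiments, and it is not the code actually generated by EXCIT.

#include <stdio.h>

/* Toy sketch of EXCIT-style hooks: every instrumented basic block adds its
 * typical cycle count, every load/store is run through a small cache and
 * misses add the 5-cycle L1 penalty used in Sect. 3.1.                     */
#define LINES 1024
#define LINE_BYTES 32

static long long cycles, penalty;
static unsigned long tag[LINES];
static int valid[LINES];

void hook_basic_block(int typical_cycles) { cycles += typical_cycles; }

void hook_memref(unsigned long addr)
{
    unsigned long line = addr / LINE_BYTES, set = line % LINES;
    if (!valid[set] || tag[set] != line) {   /* miss */
        valid[set] = 1; tag[set] = line; penalty += 5;
    }
}

int main(void)   /* tiny usage example */
{
    for (unsigned long a = 0; a < 100000; a += 8) {
        hook_basic_block(3);
        hook_memref(a);
    }
    printf("estimated cycles: %lld\n", cycles + penalty);
    return 0;
}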
The typical execution cycle is calculated by simulating the instructions in all basic blocks once. An execution cycle count is obtained by summing the typical execution cycles of the basic blocks and the cache miss penalties. Assume a user specifies a section where the execution cycles are to be calculated. When the program enters the section, the cycle counter is cleared to zero. Every time a basic block is passed, its typical execution cycle is added to the cycle counter. Concurrently, a cache simulation is performed when a load or a store instruction is executed. When the section is finished, the cache miss penalties are added to the counter, and the execution cycle count of the current section is obtained. It should be noted that the execution cycle count changes according to the cache performance. We applied this execution cycle calculation method to the SPECfp92[9] benchmark programs. The error ratio is within 20% when compared with the real execution time. In our experiments, a MicroSparc-II CPU model was used.

2.4 Network Simulator Generating System INSPIRE
INSPIRE generates a clock level network simulator from NDF (Network Description FILE). NDF describes the hardware resources of a network, a network connection and a routing algorithm. Many kinds of network topology descriptions, including 2D mesh and Hyper-cube, are provided as libraries. Message trace files are taken as input into the network simulator and a simulation is performed. The user can specify several network parameters such as bandwidth and overheads in the network simulation.
3 Performance Analysis of the NAS Parallel Benchmarks
In this section, using our simulation technique, we analyze the class B problem size MG kernel and the LU application in the NAS Parallel Benchmark suite (ver. 2.3). In this simulation, we examine the effect of cache size, network topology, communication overhead (µ0), bandwidth (b/w), and the number of PUs.

3.1 Simulation Setup
In the simulation, we used a 100 MHz CPU model with 2 integer pipes and 1 floating-point pipe. It has a four-way set associative, logically mapped L1 cache whose size was varied over 8192, 16384 and 65536 bytes. The L1 miss penalty is 5 cycles. The L2 cache size is 1 MByte and its miss penalty is 20 cycles. The number of PUs was varied over 16, 64, 128, 512 and 1024 in the MG, and 16, 32, 64, 128, 256 in the LU. First, we show the effect of cache size on computation time. The L1 cache sizes used here are 8192, 16384 and 65536 bytes. Next, we examine the effect of network topology on communication time. A complete network (complete), a two dimensional mesh network (2d mesh), a Hyper Crossbar Network (HXB)[1] and a ring network (ring) are compared. In the HXB and the two dimensional mesh network, we choose the nearest cubic shape and the nearest square shape
250
Kazuto Kubota et al. 4.0
70 L1 cache = 8192 Byte L1 cache = 16384 L1 cache = 65536
60
3.5 3.0
50
complete HXB 2d mesh ring
2.5 time(sec)
time(sec)
40 30 20
2.0 1.5 1.0
10
0.5 0
0 10
100 #PU
1000
10
100
1000 #PU
Fig. 5. MG: Cache size effect on com- Fig. 6. MG: Topology vs. communicaputation time (HXB, b/w=100MB/sec, tion time (b/w= 100MB/sec, µ0 = 0). µ0 =0). respectively. Then, we fixed the network topology to the HXB, and examined the effect of the µ0 and the bandwidth. In our network model, when communication functions such as mpi send() are called, the message goes to the network with the µ0 overhead. 3.2 Kernel Benchmark: MG The MG kernel is a program which solves three dimensional discrete Poisson equations by multi-grid method. The class B problem size is 256×256×256. The message pattern of the MG is a hierarchy communication pattern. A hierarchy gird is used in the MG, and adjacent PUs on the hierarchy gird communicate with each other. In Figure 5, cache size doesn’t greatly affects the computation times. In case of small number of PUs, cache miss rate should be high, because the working set size per a PU is large. But actually, when 16 PUs were used, the hit rate of a level one cache was more than 90%. The data using locality seems to be high in the MG. So, we fixed the cache size at 16 Kbytes in the following network simulation. Figure 6 shows the effect of network topology on the communication time. The complete and the HXB networks show almost the same performance. The two dimensional mesh network shows some overhead while the ring network shows very high overhead. Figure 7 shows the effect of bandwidth on the communication time when µ0 is 0. The communication time is large when the number of PUs are small, because the message size is larger when the number of PUs are small, and larger size message takes a long communication time when bandwidth is narrow. When compared with the computation time of the MG, the communication time is less than 7%. Therefore we can conclude that the influence of bandwidth in the MG is small. Figure 8 shows the effect of the µ0 on the communication time. The influence of µ0 appears at any number of PUs. Figure 9 shows the relationship between the efficiency and the number of PUs. We calculated efficiency using the time of 16 PUs. When the number of PUs reaches 1024, and if µ0 is 500 µsec, the efficiency declines to 40%.
Practical Simulation of Large-Scale Parallel Programs
10.0
251
2.0
b/w = 10 MB/sec b/w = 100 MB/sec b/w = infinite
8.0
u0 = 0 usec u0 = 10 usec u0 = 500 usec 1.5
time(sec)
time(sec)
6.0
4.0
1.0
0.5
2.0
0
0
10
100
1000
10
#PU
100 #PU
1000
Fig. 7. MG: Bandwidth vs. communi- Fig. 8. MG: µ0 vs. communication time cation time (HXB, µ0 = 0). (HXB, b/w= infinite).
1.2 1
1200
0.8
1000 time(sec)
efficiency
1400
0.6 0.4
u0 = 0 usec u0 = 10 usec u0 = 500 usec
0.2
L1 cache = 8192byte L1 cache = 16384 L1 cache = 65536
800 600 400 200
0 10
100 #PU
1000
0 10
100 #PU
Fig. 9. MG: µ0 vs. efficiency (HXB, Fig. 10. LU: Cache size effect on computation time (HXB,b/w= 100MB/sec, b/w= infinite). µ0 =0).
252
Kazuto Kubota et al.
100
100
complete HXB 2d mesh ring
80
80
60 time(sec)
time(sec)
60
b/w = 10MB/sec b/w = 100MB/sec b/w = infinite
40
20
40
20
0
0
10
100
10
#PU
100 #PU
Fig. 11. LU: Topology vs. communi- Fig. 12. LU: Bandwidth vs. communication time (b/w= 100MB/sec, µ0 = cation time (HXB, µ0 = 0). 0µsec ).
3.3
Application Benchmark: LU
The LU application is a Navier-Stokes equation solver based on the Symmetric Successive Over-Relaxation (SSOR) method. Here the class B problem size is 102×102×102. The message pattern of the LU is adjacent communication. Similar to the MG, computation time is less dependent on the cache size(Figure 10). In the following network simulation, we fixed the cache size at 16 Kbytes similar to the MG experiment. In Figure 11, the complete network and the HXB shows almost the same performance. However, in contrast to the MG case, the two dimensional mesh network shows high performance, because the LU message pattern is adjacent communication. In the ring network, communication overhead rose when the number of PUs were larger than 128. Figure 12 shows the effect of the bandwidth on the communication time. By the similar reason to the case of the MG, the influence of the bandwidth is small. Figure 13 shows the effect of the µ0 on the communication time when the bandwidth is infinite. The influence of µ0 is large, because the influence of µ0 appears when the number of PUs are not only small but also large. Figure 14 shows the relationship between µ0 and efficiency. When the number of PUs becomes larger than 64, efficiency begins to decrease. For example, when there are 256 PUs and µ0 is 500 µsec , efficiency declines to 50%. 3.4
Scalability Analysis using Simulation Results
Using the simulation results, we can find the balance point between CPU performance and the network performance. Assume 80% efficiency is required. In figure 14, the bandwidth is already infinite. Therefore it is impossible to adjust 80% efficiency by increasing the bandwidth when µ0 is 500 µsec . Therefore in case of 128 PUs, we must decrease the µ0 to less than 232 µsec to achieve 80%
Practical Simulation of Large-Scale Parallel Programs 100
253
1.2
u0 = 0 usec u0 = 10 usec u0 = 500 usec
80
1 0.8 efficiency
time(sec)
60
40
u0 = 232 usec 0.6
u0 = 44.5 usec
0.4
20
u0 = 0 usec u0 = 10 usec u0 = 500 usec
0.2 0
0 10
100 #PU
100
10 #PU
Fig. 13. LU: µ0 vs. communication Fig. 14. LU: µ0 vs. efficiency (HXB, time (HXB, b/w= infinite). b/w= infinite). This graph also shows the balance point of CPU performance and Network performance when efficiency is 80%. This topic is described in section 3.4. efficiency. And for 256 PUs, we should decrease the µ0 to less than 44.5 µsec to achieve the same efficiency. These µ0 are obtained by interpolation.
4
Conclusions
In this paper, we have described a simulation technique for very large-scale data parallel programs. Many factors such as program implementation, complier optimization can be estimated accurately by the simulation based on instrumentation tool. Computation and communication overlap can be calculated precisely, because communications are simulated with its neighboring computations. The class B problems in the NAS Parallel Benchmarks can be examined within a practicable time. From the simulation results, we found that in the MG and the LU, communication overhead can greatly affect the communication time, while cache size doesn’t cost much execution time. We are going on the performance analysis of the rest of the NAS Parallel Benchmarks. We believe that the results will provide an effective guideline for both architect and application users.
References 1. A. Ukawa, “Status of the CP-PACS Project,” International Symposium on Lattice Field Theory, 1994. 244, 249 2. T.G. Mattson, D. Scott, and S.R. Wheat, “A TeraFLOP Supercomputer in 1996: the ASCI TFLOPS System,” Proceedings of the International Parallel Processing Symposium, 1996. 244
254
Kazuto Kubota et al.
3. Maurice Yarrow and Rob Van der Wijngaart, “Communication Improvement for the LU NAS Parallel Benchmark: A Model for Efficient Parallel Relaxation Schemes,” NAS Technical Report NAS-97-032, 1997 244 4. Edward Rothberg, Jaswinder Pal Singh, and Anoop Guputa, “Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors,” Proc. of ISCA’93, pp.14-25, 1993. 244 5. Taisuke Boku, Masahiro Mishima, Ken’ichi Itakura, Hiroshi Nakamura, and Kisaburo Nakazawa, “VIPPES: A performance preevaluation system for parallel processors,” HPCN Europe’96, 1996. 244 6. Kazuto Kubota, Ken’ichi Itakura, Mitsuhisa Sato, and Taisuke Boku, “Accurate performance analysis based on code instrumentation,” IPSJ SIG Report, 97ARC123-12, pp.67-72, 1997(In Japanese). 245 7. Taisuke Boku, Tomoaki Harada, Takashi Sone, Hiroshi Nakamura, and Kisaburo Nakazawa, “INSPIRE: A general purpose network simulator generating system for massively parallel processors,” Proc. of PERMEAN’95, pp.24-33, 1995. 245 8. David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow, “The NAS Parallel Benchmarks 2.0,” NASA Ames Research Center Report, NAS-05-020, 1995. 245 9. http://www.spec.org 249
Assessing LogP Model Parameters for the IBM-SP Iskander Kort and Denis Trystram LMC-IMAG BP 53 Domaine Universitaire 38041, Grenoble Cedex 9,France {kort,trystram}@imag.fr
Abstract. This paper deals with LogP-model parameters assessment for the IBM-SP machine . An extension of LogP has been used. This extension allows the communication of arbitrary sized messages. Moreover, a distinction has been made between the sender and the receiver overheads. We present a methodology to assess the overhead parameter under the MPI communication library. Then, we use a microbenchmark to assess the L and g parameters.
1
Introduction
During the last decade many parallel computers have been devized. Most of them consist of a set of processors which communicate by sending messages via an interconnection network. The major advantage of these computers is scalability, which corresponds to the ability to integrate a large number of processors while sustaining good performance. However, writing efficient programs for such computers is still a hard work. In fact, the performance can decrease drastically due to large communication costs. In order to overcome this difficulty it is essential to use a relatively precise model describing the target computer for program design and analysis. Since the interconnection network is the most important bottleneck in nowadays computers, many researchers have assessed some parameters related to this component, mainly latency and bandwidth. In [12], models for the Intel Paragon and a transputer-based parallel machine have been proposed. In [8] measures of latency and bandwidth of the Thinking Machine CM 5 have been provided. A more general study dealing with a variety of current parallel computers can also be found in [4]. Another important issue in parallel programming is portability. Indeed, a program that runs efficiently on a given computer may show poor performance on another one. In order to get efficient and portable parallel programs one should use generic computational models. Such models describe a parallel computer using a set of parameters. These latters reflect the most pertinent features of current computers and are intended to be instantiated when dealing with a specific machine. Many generic models are presented in the literature such as Phase-PRAM/APRAM/BSP/LogP [1] [7]. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 255–262, 1998. c Springer-Verlag Berlin Heidelberg 1998
256
Iskander Kort and Denis Trystram
This paper deals with the assessment of LogP model parameters for an IBMSP multicomputer [9,10,11]. Previous work showed that this model is suitable for many parallel architectures. In [2] the model parameters were assessed for a CM5 multicomputer. In [3] assessments for the Intel Paragon, the Meiko CS2 and a workstation cluster are provided. In [6], the model parameters were assessed for some local area networks. We present LogP in next section, then we show in Sect.3 how to assess the model parameters.
2
The LogP Model
The LogP model was introduced in [2]. It presents a good trade-off between realism and ease of use. Indeed, it describes a wide range of current parallel architectures with a few simply defined parameters. These parameters are: L: The network latency, that is the time needed to transmit a message of one or a few words. o: The communication overhead , that is the time it takes the sender or the receiver to process a message. g: The gap, that is the minimum delay between two successive sends or two successive receives. In other words, 1/g is the maximum bandwidth per processor. P: The processors’ number. We will rather use an extension of LogP where messages may have arbitrary sizes. So, the model parameters become functions of message size. Moreover, we split the o parameter into two new ones: os and or . The os (resp. or ) parameter is the time needed by the sender (resp. receiver) to manage an outgoing (resp. incoming) message.
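As a minimal sketch of this extension (not code from the paper), the parameters can be represented as functions of the message size and combined into a point-to-point estimate; the linear pieces below are placeholders, the measured equations being derived in Sect. 3.

#include <stdio.h>

/* Sketch of the extended LogP parameters: functions of the message size n
 * (bytes), with the overhead split into a sender part os(n) and a receiver
 * part or(n). The coefficients are placeholders, not measured values.      */
static double os_of(int n) { return 30e-6 + 0.05e-6 * n; }   /* placeholder */
static double or_of(int n) { return 10e-6 + 0.04e-6 * n; }   /* placeholder */
static double L_of(int n)  { (void)n; return 50e-6; }        /* placeholder */

int main(void)
{
    int n = 1840;   /* one 8-packet block, see Sect. 3.1 */
    printf("point-to-point estimate: %g s\n", os_of(n) + L_of(n) + or_of(n));
    return 0;
}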
3
Assessing LogP Parameters
We describe in this section the experiments for assessing the LogP parameters. In these experiments, we used the MPI communication library1 because of its wide use [13][5]. In order to design the right experiments and to be able to interpret the obtained results, the communication subsystem behavior should be studied at first. 3.1
The Communication Subsystem Behavior
We check in this section if the communication overhead is equal to the send primitive delay. This is equivalent to check if the sender processes entirely the outgoing messages within the send primitive or if it does some extra operations 1
More specifically, the IBM implementation was used (MPI-F version 1.43 under the Aix 3.2 operating system).
Assessing LogP Model Parameters for the IBM-SP
257
after the primitive completion. To answer this question, we performed the experiment shown by Fig. 1. In this experiment, a processor P0 sends a message to a processor P1 and then performs some computations. We measure the MPI-Recv delay for various computation loads.
P0
P1 Clock
MPI Send
Compute
MPI Recv
Clock
Fig. 1. The communication subsystem behavior. The experiment shows that the MPI Recv delay does not depend on the computation load when the message size is less than 1840 bytes. However, when the message size is greater than this value, the MPI-Recv delay increases along with the computation load. Therefore, the messages are sent by blocks. Each block is of 1840 bytes, that is eight packets. These blocks are due to the sliding window protocol used in the IBM-SP1-2 to acknowlege packets [9]. Since the communication overhead is greater than the send primitive delay, it is essential to find out a new methodology to assess this parameter. 3.2
Assessing the Overhead
Assessing the os Parameter. As shown in the previous section, the outgoing messages are not entirely processed within the send primitive. So, in order to assess os we should force the sender to finish the message processing. This can be done using the MPI Test primitive. This primitive is intended to check the non blocking communications completion. However, it can be called with dummy arguments to insure communications progress. We define the following methodology to assess os : – Estimate the MPI Test delay when no communication is pending (T estDelay). – Given an n bytes message, estimate the number of MPI Test calls necessary to complete the communication (T estCount). To do so, we perform an experiment similar to the one in Fig.1. However, in this case, P0 calls T estCalls times MPI Test before the computation phase. Each time MPI Test is called, it processes pending communications, whenever this is possible. We repeat the experiment with increasing values of T estCalls. We stop when the MPI Recv delay does not increase any longer with the computation load. – Finally we measure os using the program below. This methodology has led to the results in Fig.2. In this figure, we denote by os (asynchronous) (resp. os (synchronous)) the sender overhead when the
258
Iskander Kort and Denis Trystram
Program 1: os assessment program.

P0 code:
    MPI_Barrier(MPI_COMM_WORLD);
    Os = MPI_Wtime();
    MPI_Bsend(m, n, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
    for (i = 0; i < TestCount; i++)
        MPI_Test(MPI_REQUEST_NULL, &flag, &status);
    Os = MPI_Wtime() - Os - TestCount * TestDelay;

P1 code:
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(m, n, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &status);
MPI Bsend (resp. MPI Ssend) primitive is used. Notice that the MPI Bsend delay is smaller than the overhead except when n is less than 1840 bytes. More specifically, the ratio MPI Bsend-Delay/Overhead is of about 35% for long messages. Moreover, the overhead increases linearly with the message size. In order to get the best fit, we split the measures into five intervals and we established an equation for each interval (see Table 3.2). The comparison of the overheads in the synchronous and asynchronous cases shows that: – when n ≤ 4KB, os (synchronous) > os (asynchronous), this is due to the handshaking operation undertaken by the sender in the synchronous mode. – when n > 4KB, os (asynchronous) > os (synchronous). Moreover, the difference os (asynchronous) − os (synchronous) increases along with the message size. Indeed, MPI Bsend sends the message according to the rendez-vous protocol2 . On the other hand this primitive is intended to be local3 . To cope with these two contradictory properties, the message is buffered at the sender processor which explains the above mentionned diffrence.
Table 1. The sender overhead in the asynchronous and synchronous modes: n is the message size in bytes, delays are expressed in µs.

                       MPI Bsend                 MPI Ssend
  0 ≤ n ≤ 4KB          os = 0.040 * n + 30       os = 0.040 * n + 110
  4KB < n ≤ 16KB       os = 0.081 * n − 90       os = 0.053 * n + 20
  16KB < n ≤ 32KB      os = 0.073 * n + 70       os = 0.054 * n + 175
  32KB < n ≤ 48KB      os = 0.058 * n + 530      os = 0.037 * n + 700
  n > 48KB             os = 0.050 * n + 870      os = 0.034 * n + 890
(2) In this protocol a message is not sent unless a receive request is posted.
(3) A send primitive is said to be local if its completion is not conditioned by the receiver state.
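For convenience, the asynchronous (MPI Bsend) column of Table 1 can be packaged as a small piecewise-linear model (our own wrapper, not from the paper; it assumes 1KB = 1024 bytes and returns microseconds):

/* Piecewise-linear fit of the sender overhead os(n), MPI_Bsend case,
   with the coefficients quoted in Table 1 (n in bytes, result in microseconds). */
double os_bsend_fit(double n) {
    if (n <= 4096.0)  return 0.040 * n + 30.0;
    if (n <= 16384.0) return 0.081 * n - 90.0;
    if (n <= 32768.0) return 0.073 * n + 70.0;
    if (n <= 49152.0) return 0.058 * n + 530.0;
    return 0.050 * n + 870.0;
}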
Assessing the or Parameter. In order to assess the receiver communication overhead, Program 2 is used. In this program both processors first synchronize. Then, processor P0 sends a message m to P1. Processor P1 reads the IBM-SP global clock, posts a receive request and finally calls MPI Test repeatedly until all the packets of m are received (flag becomes true). The loopDelay() function calculates the delay of call iterations when no communication is pending. This delay is given by the equation: delay = 4.48 * call − 10. The obtained results are illustrated in Fig. 2. Parameter or increases linearly along with the message size according to the equations:
    for 0 < n ≤ 16KB,  or = 0.059 * n + 10
    for n > 16KB,      or = 0.035 * n + 420
To end this section, we compare the os and or parameters. In the asynchronous case, os is equal to or when n ≤ 4KB and os > or otherwise. This is due to message buffering at the sender. In the synchronous case, os can be considered to be equal to or when n ≤ 16KB. However, os > or when n > 16KB.
Program 2. or assessment program.
P0 code:
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bsend(m, n, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
P1 code:
    MPI_Barrier(MPI_COMM_WORLD);
    call = 0;
    Or = MPI_Wtime();
    MPI_Irecv(m, n, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &request);
    repeat
        MPI_Test(&request, &flag, &status);
        call++;
    until (flag == true);
    Or = MPI_Wtime() - Or - loopDelay(call);
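As a purely arithmetic illustration of these fits (our own numbers, not additional measurements): for n = 16 KB = 16384 bytes the first equation gives or ≈ 0.059 · 16384 + 10 ≈ 977 µs, while for n = 32 KB = 32768 bytes the second equation gives or ≈ 0.035 · 32768 + 420 ≈ 1567 µs.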
3.3 Assessing the L and g Parameters
The assessment of g and L is more difficult than that of o, so only small messages will be considered in this section. In order to assess g we used the microbenchmark presented in [3]. The microbenchmark measures the delay of issuing M requests (denoted by Delay(M )), as shown in Fig. 3(a). A request is a light-weight asynchronous remote procedure call. Figure 3(b) shows the average request delay (ie Delay(M )/M ) under LogP. Notice that there are three regimes: send-only, transition and steady-state. In the send-only regime, no reply arrives during the
[Fig. 2. Comparison of communication overhead and the send primitives delay: time (microseconds) versus message size n (bytes) for os(asynchronous), the MPI Bsend delay, os(synchronous), the MPI Ssend delay, and or.]
issue phase. So, the average request delay is equal to os. For greater values of M, some replies arrive during the issue phase. This corresponds to the transition regime. Beyond a certain value of M, the capacity limit is reached. This corresponds to the steady state. If the communication bottleneck is the network, then the average request delay equals g. On the other hand, if the communication bottleneck is the processor, then the average request delay equals os + or. To distinguish these two cases, the experiments are done with different values of ∆ (computation time). If the curve limit increases even with small values of ∆ then we are in the second case. Otherwise, the network is the communication bottleneck. We implemented the microbenchmark using MPI immediate communication primitives. Figure 4 shows the microbenchmark signature for a 1000-byte message. Experiments were performed with ∆ = 0 µs and ∆ = 20 µs. Since the two curves converge to different values, we deduce that the overhead is the performance bottleneck in the IBM-SP. The microbenchmark gave the following values: os^micro = 76 µs, or^micro = 73 µs. These values are close to those found when we applied our methodology: os = 72 µs, or = 82 µs. Notice that the transition regime starts for a small value of M (M = 5). This means that the interconnection network latency may be neglected.
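An illustrative MPI sketch of this request/reply microbenchmark (our own reconstruction of Fig. 3(a), not the authors' code; the message sizes, M, ∆ and the spin-wait are assumptions):

#include <mpi.h>
#include <stdio.h>

#define M 500
#define DELTA_US 20.0

static void compute_for(double us) {           /* busy-wait for the given time */
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < us) ;
}

int main(int argc, char **argv) {
    int rank;
    char req[M], rep[M];
    MPI_Request sreq[M], rreq[M];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < M; i++) {          /* issue M requests */
            MPI_Irecv(&rep[i], 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &rreq[i]);
            MPI_Isend(&req[i], 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &sreq[i]);
            compute_for(DELTA_US);             /* compute for Delta time */
        }
        MPI_Waitall(M, sreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(M, rreq, MPI_STATUSES_IGNORE);   /* handle remaining replies */
        printf("Delay(M)/M = %g us\n", (MPI_Wtime() - t0) * 1e6 / M);
    } else if (rank == 1) {
        for (int i = 0; i < M; i++) {          /* answer each request */
            MPI_Recv(&req[0], 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&rep[0], 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}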
4 Concluding Remarks
In this paper we presented a methodology to assess the LogP overhead parameter at the sender and receiver processors. The obtained results show that the sender overhead is larger when asynchronous communications are used. This is due to message buffering.
[Fig. 3. (a) The microbenchmark pseudo-code: Start Timer; repeat M times { Issue Request; Compute for ∆ time }; Stop Timer; handle remaining replies. (b) The microbenchmark signature: average request delay Delay(M)/M versus M, with send-only, transition and steady-state regimes; in the send-only regime the level equals os for ∆ = 0.]
Moreover, the microbenchmark showed that the overhead is the performance bottleneck in the IBM-SP. The microbenchmark gave results which are close to those obtained using our methodology. However, this microbenchmark is not very convenient when one wants to find out the o(n) function. Indeed, in this case a signature must be established for each message size. Also, the microbenchmark underestimates the overhead when message size is greater than 1840 bytes. We are planning to use this model in predicting some communication patterns delays such as broadcast and scatter. The model will also be used in some scheduling algorithms and heuristics for LogP. Acknowledgments The authors are grateful to the anonymous referees for their valuable suggestions.
References 1. C. Boeres. Versatile Communication Cost Modelling for Multicomputer Task Scheduling Heuristics. PhD thesis, University of Edinburgh, 1996. 255 2. D. E. Culler, R. M. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on principles and practice of parallel programming, 1993. 256 3. D. E. Culler, L. T. Liu, R. P. Martin, and C. Yoshikawa. LogP Performance Assessment of Fast Network Interfaces. IEEE Micro, February 1996. 256, 259 4. J. J. Dongarra and T. Dunigan. Message Passing Performance of Various Computers. Technical report, UT-CS-95-299, Computer science department, University of Tennesee,Knoxville, Tenessee, 1995. 255 5. H. Franke, E. Wu, P. Pattnaik, and M. Snir. MPI Programming Environment for IBM SP1/SP2. In International Conference on Distributed Computing Systems IEEE Computer Society Press Los Alamitos California, 1995. 256
[Fig. 4. The microbenchmark signature for a 1000-byte message on the IBM-SP: Delay(M)/M versus M, for the runs with and without computation.]
6. K. K. Keeton, T. E. Anderson, and D. A. Patterson. Logp Quantified: The Case for Low-overhead Local Area Networks. In Hot Interconnects III: A Symposium on High Performance Interconnects, Stanford University, Stanford California, August 10-12 1995. 256 7. Z. Li, Mills P., and J. Reif. Models and Resource Metrics for Parallel and Distributed Computation. In Annual Hawaii International Conference on System Sciences, Parallel Algorithms Software Technology Track, IEEE Press, 1995. 255 8. L. Mengjou, T. Rose, et al. Performance Evaluation of the CM-5 Interconnection Network. Technical report, University of Minnesota, 1993. 255 9. M. Snir et al. The Communication Software and Parallel Environment of the IBM-SP2. IBM systems journal, 34(2), 1995. 256, 257 10. B. Stunkel et al. The SP2 High Performance Switch. IBM systems journal, 34(2), 1995. 256 11. G. Tengwall. SP2 Architecture and Performance. springer verlag, 1994. 256 12. C. Tron. Mod`eles Quantitatifs de Machines Parall`eles: Les r´eseaux d’interconnection. PhD thesis, Institut National Polytechnique de Grenoble, France, 1994. 255 13. University of Tenessee. MPI Forum: A Message Passing Interface Standard, 1994. 256
Communication Pre-evaluation in HPF
Pierre Boulet (1) and Xavier Redon (2)
(1) LIP, École Normale Supérieure de Lyon, 46, allée d'Italie, F-69364 Lyon cedex 07, France
(2) LIFL, Université des Sciences et Technologies de Lille, Bâtiment M3, Cité Scientifique, F-59655 Villeneuve d'Ascq cedex, France
Abstract. Parallel computers are difficult to program efficiently. We believe that a good way to help programmers write efficient programs is to provide them with tools that show them how their programs behave on a parallel computer. Data distribution is the major performance factor of data-parallel programs and so automatic data layout for High Performance Fortran programs has been studied by many researchers recently. The communication volume induced by a data distribution is a good estimator of the efficiency of this data distribution. We present here a symbolic method to compute the communication volume generated by a given data distribution during the program writing phase (before compilation). We stay machine-independent to assure portability. Our goal is to help the programmer understand the data movements its program generates and thus find a good data distribution. Our method is based on parametric polyhedral computations. It can be applied to a large class of regular codes.
1 Introduction
Parallel computing has become the solution of choice for heavy scientific computing programs. Unfortunately parallel computers require considerable knowledge and programming skills to exploit their full potential. A major means to reduce the programming complexity is to use high-level languages. High Performance Fortran (HPF) is such a language. It follows the data-parallel programming paradigm where the computations are directed by the data. Indeed, the computations occur on the processor that “owns” the data being written. So the data distribution is a very important efficiency factor when programming with a data-parallel language. Our goal is to help the programmer find a good data distribution. We want a tool that is executed when the program is being written. Work on such a tool has started at the LIFL with HPF-builder [6], a tool that interactively displays the arrays as they are distributed and aligned by the HPF directives. This tool also allows the user to change the data distribution graphically. We propose here a new step in the development of a tool that helps the programmer understand how data move in his program. Given a data distribution, we are able to compute the volume of the communications generated by a program. We use symbolic computation tools to stay free of the problem size. All
this is done at the language level, thus retaining portability and machine independence. This tool is intended as a mean to write a reasonably efficient program that can be tuned for a particular parallel machine with profiling systems later. The sequel of this paper is organized as follows. In Sect. 2 we briefly review related works, then in Sect. 3 we describe the problem we consider and present its modelization in Sect. 4. We then detail the tools used to solve our problem along with an example in Sect. 5. And we finally conclude in Sect. 6.
2 Related Work
Many researchers [1,9,10,13,14,15] have studied automatic data distribution. Estimating communication costs has been the key factor to determine the quality of a data distribution. Most of the previous works [12,7,11] have studied compile-time estimation of these communication costs. Indeed, they use the fact that most program parameters are known at that time, and in many cases these studies also use machine (and compiler) dependent data. Our work differs from previous work by the techniques used and the stage of program development we focus on: the program writing phase. We also use exact parameterized methods and we stay compiler and machine independent: we work at the language level.
3 Description of the Problem
3.1 General Remarks
The problem we study in this paper is the evaluation of the communication volume in a HPF program before compilation. The goal is to help the programmer understand the communications generated by his program and find a good data distribution (or a more efficient way to code his algorithm). The communications we consider are at the PROCESSORS level: as in HPF, we use an abstract target machine. We stay at the language level, thus allowing the programmer to find a data distribution that is well suited to the problem, and retaining portability. Our aim is not to find the best data distribution for a given machine, but a good one for any machine (and compiler). In the future, if compiler optimization techniques such as overlap areas to vectorize communications are used by most compilers, we could adapt our evaluation techniques to these optimizations. In a first step, though, we remain at the language level to validate our approach. We believe indeed that a good data distribution at the language level should not be a bad one for any compiler, regardless of the compilation techniques used.
3.2 Modelization
We consider that the only communication-generating statements are storage statements; we do not take into account I/O statements. For each of these storage
statements, the surrounding loops define an iteration domain that “shapes” the communication pattern. Given the mathematical tools we use in the following, we have to restrict ourselves to loop bounds that define a polyhedron, that is, loop bounds defined as extrema of affine functions of the surrounding loop indices and parameters. For the same reason, we restrict the array access functions to affine functions. This class of loops contains most of the linear algebra routines. To simplify the following discussion, we make the hypotheses below without loss of generality:
– Scalar values are described as arrays with one element.
– Storage statements read only one array reference that may not be aligned with the result value. Indeed, any more complex statement can be decomposed in a sequence of such statements with temporary storage [3,2].
– All the arrays of the considered storage statement are aligned with respect to the same template T. This restriction makes sense since the distributions of computation-related arrays should be related in some way. Actually most HPF programs contain only one template with all the arrays aligned onto it. The domain of this template T is noted D_T.
Given the previous restrictions, a storage statement S is represented by:

    W_S(φ_WS(I)) = f_S(R_S(φ_RS(I)))    (1)
where the iteration vector I has value in the iteration domain DS defined by the surrounding loops. WS and RS are arrays and φWS and φRS are affine access functions. We call operation an instance S(I) of a statement S for a given value I of the iteration vector. The number of communications generated by a given statement is the number of array elements that need to be communicated for some operation generated by this statement. There is a communication when the array elements being read are not on the same virtual processor than the one the written element is. As array elements can be replicated, we will focus on the template elements.
4 Algebraic Method to Evaluate HPF Communications
As said in the previous section, we evaluate the communication cost in a HPF program by counting the number of communications between template elements.
4.1 Formula for Communication Cost Evaluation
There is a communication between two template elements if they are not distributed onto the same virtual processor and if the computation of a value to be stored in the first template element uses a value stored in the second one. To be more precise, let us consider a storage statement S (as defined in the previous section) and a template T . We also need to define some functions:
– Function O_T^S gives, for an element J of D_T, the set of storage operations from S which compute values to be stored in T(J);
– Function τ gives, for an operation o from S, the set of template subscripts from where o may read its data;
– Last, function π_T is the distribution function which maps the template T on the virtual processors.
The number of template communications generated by statement S is equal to the number of elements in the union of the sets C_J(S):

    C_J(S) = { (J, o) | o ∈ O_T^S(J), π_T(J) ∉ π_T(τ(o)) },   C(S) = ∪_{J ∈ D_T} C_J(S).    (2)
Note that C(S) is a set of couples and not of mere operations. Indeed we may have to count several times the same operation, so we added the template subscript to distinguish different occurrences of the same operation. Computing the set C(S) implies the application of a change of basis on the program loop nests. Indeed, in the original program, the loops are used to enumerate the operations in the program execution order, and the set C(S) enumerates them following the template iteration space.
4.2 Formal Representation of HPF Alignments
To perform this change of basis the only information needed is the HPF ALIGN directives. Basically HPF alignments are a subset of linear alignments to which a replication symbol * has been added. Because of the replication symbol, a HPF alignment cannot be defined by a linear transformation from the array space to the template space. A convenient way to represent a HPF alignment α is to use two linear transformations γ and δ. Transformation δ defines the replication part of the alignment. Let I be a subscript of an array A aligned on a template T using α. The set of the subscripts of T on which the data A(I) is stored is: {J | J ∈ D_T, γ(I) = δ(J)} = δ^{-1}(γ(I)). Let us consider the alignment:
    !HPF$ ALIGN A(i) WITH T(i,*)
The first transformation, from the array space to an intermediate template space, is obtained by removing the replication symbols in the directive: γ : i → i. The second transformation is the projection from the template space to the intermediate template space: δ : (i, j) → i. With this representation of a HPF alignment one can explicit the function τ of Sect. 4.1. Let us consider a statement S as defined in (1) and an alignment τ_RS for array R_S on the template T defined by (γ_RS, δ_RS). In this context, the following holds:

    ∀I ∈ D_S,  τ(S(I)) = τ_RS(φ_RS(I)) = δ_RS^{-1}(γ_RS(φ_RS(I))).
4.3 Enumeration of Operations in the Template Iteration Space
The main problem of enumerating operations in the template space is that there may be several operations corresponding to a template element T(J), i.e. more than one operation may produce a value to be stored in T(J). In fact, the solution lies again in the formal representation of HPF alignments. Let us denote by τ_WS the alignment for array W_S. According to Sect. 4.2, the alignment can be written as τ_WS = δ_WS^{-1} ◦ γ_WS. Therefore, the set O_T^S(J) introduced in the beginning of this section is defined by:

    O_T^S(J) = { S(I) | I ∈ D_S, φ_WS(I) ∈ γ_WS^{-1}(δ_WS(J)) }.
Let us consider the code fragment below:
    !HPF$ ALIGN M(i,j) WITH T(i,*)
          DO i=1,n
            DO j=1,m
              M(i,j) = ...        (s1)
            END DO
          END DO
The alignment of M on T is defined by: γ : (i, j) → i and δ : (i, j) → i. Since the subscript function for M is the identity, the set O_T^{s1} is such that:

    O_T^{s1}((i, j)) = { s1((i, j')) | γ((i, j')) = δ((i, j)) } = { s1((i, j')) | j' ∈ [1, m] }.

This means that a template element T(i,j) owns the full row of rank i of array M. In conclusion, the set C(S) may be computed using the subscript functions in S for W_S and R_S, the alignments of W_S and R_S on the template T and the distribution of T on the virtual processors:

    C(S) = ∪_{J ∈ D_T} { (J, S(I)) | I ∈ Φ_WS^{-1}(J), π_T(J) ∉ π_T(Φ_RS(I)) }    (3)
with the functions Φ_WS and Φ_RS defined as follows:

    Φ_WS = τ_WS ◦ φ_WS = δ_WS^{-1} ◦ γ_WS ◦ φ_WS,
    Φ_RS = τ_RS ◦ φ_RS = δ_RS^{-1} ◦ γ_RS ◦ φ_RS.
4.4 Formal Representation of HPF Distributions
A HPF distribution directive for a template T can be represented using a projection ρ_T and an integer vector κ_T whose size is the dimension of the virtual processor grid P. Projection ρ_T selects the dimensions of T to be distributed on P. The dimension of T projected on the ith dimension of P is distributed according to the pattern CYCLIC((κ_T)_i). We denote by A^min (respectively by A^max) the
vector of lower bounds (respectively the vector of upper bounds) for array A. In this context the distribution π_T can be made explicit:

    π_T(J) = P^min + (ρ_T(J − T^min) ÷ κ_T) % (P^max − P^min + 1).    (4)

In the previous expression, the operator ÷ (respectively the operator %) represents an element-wise integer division (respectively modulo) on vectors, and 1 denotes the vector of ones. One may remark that a BLOCK distribution can be achieved on the ith dimension of P using a relevant (κ_T)_i value:

    (κ_T)_i = ⌈ (T_i^max − T_i^min + 1) / (P_i^max − P_i^min + 1) ⌉.
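As an illustration with made-up sizes (not from the paper): for a one-dimensional template with T_i^min = 1 and T_i^max = 100 distributed BLOCK onto four processors (P_i^min = 1, P_i^max = 4), the formula gives (κ_T)_i = ⌈100/4⌉ = 25, i.e. contiguous blocks of 25 template elements per processor.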
5 Tools for the Evaluation of HPF Communications
In the previous section we have reduced the problem of evaluating HPF communications of a statement S to counting the elements of a set C(S). This section presents the tools used to automatically build C(S) and to count its elements. We illustrate the use of these tools on the following HPF program:

          Program MatInit
    !HPF$ PROCESSORS P(8,8)
    !HPF$ TEMPLATE T(n,m)
    !HPF$ DISTRIBUTE T(CYCLIC,CYCLIC) ONTO P
          real A(n,m), B(n)
    !HPF$ ALIGN A(i,j) WITH T(i,*)
    !HPF$ ALIGN B(i) WITH T(i,1)
          do i=1,n
            do j=1,m
              A(i,j)=B(i)        (S1)
            end do
          end do
          end
We have integrated the different tools into an interactive program called CIPOL. CIPOL provides a Lisp-like textual interface to the tools, and pretty-prints their results.
5.1 Manipulation of Polyhedra
Our main tool is the polyhedral library developed by Wilde [16]. Most of the operations on polyhedra (union, intersection, image, etc.) are implemented in this library, which allows one to define a polyhedron by a set of constraints or a set of rays. For our application, we only need the first definition scheme. The generic sets Φ_WS^{-1}(J) and Φ_RS(I) described in the previous section are computed using the image and pre-image functions of the polyhedral library. The generic sets for our example are:

    Φ_A^{-1}(j1, j2, n, m) = { 1 ≤ j1 ≤ n ∧ i1 = j1 ∧ 1 ≤ i2 ≤ m },
    Φ_B(i1, i2, n, m) = { 1 ≤ i1 ≤ n ∧ j1 = i1 ∧ j2 = 1 }.
The first set means that the ith row of array A is duplicated on each element of the ith row of template T . The second set shows that array B is aligned with the first column of template T .
5.2 Using the PIP Software to Compute Inclusion
The evaluation of the non-inclusion in definition (2) is not easy to implement since it implies that a generic condition must be verified for each value from a given set. It is simpler to verify an inclusion condition: in this case we just have to verify that a condition is verified for one value from a given set. Hence an inclusion condition can be modeled by an integer programming problem and can be solved by a software such as PIP (see [8]). So, in place of counting the number of elements in C(S), we compute the number of elements in the set C̄(S):

    C̄(S) = ∪_{J ∈ D_T} { (J, S(I)) | I ∈ Φ_WS^{-1}(J), π_T(J) ∈ π_T(Φ_RS(I)) }.    (5)

The final result is obtained using the following relation:

    Card(C(S)) = Card( ∪_{J ∈ D_T} { (J, S(I)) | I ∈ Φ_WS^{-1}(J) } ) − Card(C̄(S)).    (6)
The PIP software is able to find the lexicographical minimum of a parametric set of integer vectors defined by a set of linear constraints S(P). It may also take into account linear constraints on the parameters; this other set of constraints C is called the context. We denote by lexmin(C, S(P)) the result computed by PIP. Since the initial set of integer vectors is parametric, PIP does not return a unique vector but a quast (Quasi-Affine Selection Tree). Indeed, the minimum depends on the values of the parameters, hence PIP splits the domain of the parameters into sub-domains on which the minimum can be expressed in a parametric way. If there is no solution for a sub-domain of the parameter space (because for these values of the parameters S(P) is void), PIP denotes by ⊥ the lack of solution. Let us solve the following integer program:

    lexmin(C, S(I, J)),  with  C = { J ∈ D_T ∧ I ∈ Φ_WS^{-1}(J) }  and  S(I, J) = { J' ∈ Φ_RS(I) ∧ π_T(J') = π_T(J) }.    (7)

Consider now the sub-domains of the parameter space for which PIP gives a solution other than ⊥. It is easy to deduce, from what we said about PIP, that the number of elements in these sub-domains is equal to the number of elements in C̄(S). Remember that PIP only deals with linear constraints with respect to the parameters and the variables of the problem. Function π_T involves Euclidean divisions, but there is a well-known method to linearize π_T that may be found in [5]. We just have to introduce three new integer vectors to replace the initial definition (4) of the distribution function by a definition which gives π_T(J) as the solution of the following system (the ∗ operator is an element-wise vector multiplication):

    ρ_T(J − T^min) = N ∗ κ_T + R  ∧  N = Q ∗ (P^max − P^min + 1) + π_T(J) − P^min
    ∧  P^min ≤ π_T(J) ≤ P^max  ∧  N ≥ 0  ∧  Q ≥ 0  ∧  0 ≤ R < κ_T.
One may note that N_i gives the block number for T(J) with respect to the ith dimension. The previous system is effectively linear only if κ_T and P^max − P^min are constant vectors. Computation of a parametric solution in other cases is left for future work, but it is possible to obtain a result by asking the user to provide the values of some key parameters. Integer program (7) may thus be rewritten with only linear constraints. For our program example MatInit, the template is distributed on the processor grid using two CYCLIC patterns, hence the distribution is defined by:

    ρ_T((j1, j2)) = (j1, j2),   κ_T = (1, 1).
The integer problem to solve is lexmin(C(n, m), S(i1, i2, j1, j2)) with

    C(n, m) = { 1 ≤ j1 ≤ n ∧ 1 ≤ j2 ≤ m ∧ i1 = j1 ∧ 1 ≤ i2 ≤ m },

    S(i1, i2, j1, j2) = { j1' = i1 ∧ j2' = 1 ∧ 1 ≤ p1 ≤ 8 ∧ 1 ≤ p2 ≤ 8
                          ∧ j1 − 1 = 8 q1 + p1 − 1 ∧ j2 − 1 = 8 q2 + p2 − 1
                          ∧ j1' − 1 = 8 q1' + p1 − 1 ∧ j2' − 1 = 8 q2' + p2 − 1
                          ∧ q1 ≥ 0 ∧ q2 ≥ 0 ∧ q1' ≥ 0 ∧ q2' ≥ 0 }.

One may note that, in this example, there is no variable representing the remainder in the division by κ_T (as R in Sect. 5.2) since the block sizes are equal to 1. The result of PIP is that there exists a solution not equal to ⊥ in the polyhedron defined by:

    C̄(n, m) = { 1 ≤ j1 ≤ n ∧ 1 ≤ j2 ≤ m ∧ i1 = j1 ∧ 1 ≤ i2 ≤ m ∧ j2 − 1 = 8q ∧ q ≥ 0 }.
The new parameter q is used to express that j2 − 1 must be a multiple of 8.
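For instance (an illustrative instantiation of this constraint, not a result from the paper): with m = 20, the admissible values are j2 ∈ {1, 9, 17}, i.e. only ⌈20/8⌉ = 3 of the 20 template columns lie on the same processor column as column 1, and only those escape a communication.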
5.3 Counting the Elements of the Communication Set
The last stage of our method consists in counting the number of integer vectors in the sub-domains computed by PIP. These parametric sub-domains D(I, J, P) are defined as a function of a parameter J representing the subscript of a general template element T(J), of a parameter I which represents the subscript of an operation storing a value on T(J), and of a vector of program parameters P. We need to compute the number of integer vectors in D(I, J, P) as a function of the program parameters. Fortunately, Loechner and Wilde have extended the polyhedron library to include a function able to count the number of integer vectors in a parametric polyhedron (see [4]). Like PIP, this function splits the parameter space into sub-domains on which the result can be given by a parametric expression. Hence, to apply relation (6) one has to implement an addition on quasts. For the example MatInit the final result (in the context n ≥ 1 and m ≥ 1) is
    Count(C(n, m)) − Count(C̄(n, m)) = n·m · ( 7m/8 − [0, 7/8, 3/4, 5/8, 1/2, 3/8, 1/4, 1/8]_m )

The brackets in the previous expression denote a periodic number: if we denote by v the vector (0, 7/8, 3/4, 5/8, 1/2, 3/8, 1/4, 1/8), the value of the periodic number is v_{m%8}. When the parameter m is a multiple of 8 we have the expected result of 7·n·m²/8 atomic communications at the template level.
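As a quick arithmetic check of this expression (our own numbers, not data from the paper): for n = 4 and m = 12 we have m % 8 = 4 and v_4 = 1/2, so the count is 4 · 12 · (7·12/8 − 1/2) = 48 · 10 = 480, which indeed equals n · m · (m − ⌈m/8⌉) = 4 · 12 · (12 − 2).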
For a detailed description of how the pre-evaluation can be done in an automatic way, take a look at the report available at the URL ftp://ftp.lifl.fr/pub/reports/AS-publi/an98/as-182.ps.gz
6 Conclusion
Data partitioning is a major performance factor in HPF programs. To help the programmer design a good data distribution strategy, we have studied the evaluation of the communication cost of a program during the writing of this program. We have presented here a method to compute the communication volume of a HPF program. This method is based on the polyhedral model. So, we are able to handle loop nests with affine loop bounds and affine array access functions. Our method is parameterized and machine independent. Indeed all is done at the language level. An implementation is done using the polyhedral library and the PIP software. Ongoing work includes extending this method to a larger class of programs and adding compiler optimizations in the model. The last point is quite important since the pure counting of elements exchanged is only one of the factors in the actual communication costs. We will have to recognize special communications patterns as broadcasts which can be implemented more efficiently than general communications. We are also integrating this method in the HPF-builder tool [6].
References 1. Jennifer M. Anderson and Monica S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. ACM Sigplan Notices, 28(6):112–125, June 1993. 264 2. Vincent Bouchitte, Pierre Boulet, Alain Darte, and Yves Robert. Evaluating array expressions on massively parallel machines with communication/computation overlap. International Journal of Supercomputer Applications and High Performance Computing, 9(3):205–219, 1995. 265 3. S. Chatterjee, J. R. Gilbert, R. S. Schreiber, and S.-H. Tseng. Automatic array alignment in data-parallel programs. In ACM Press, editor, Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 16–28, Charleston, South Carolina, January 1993. 265 4. Philippe Clauss, Vincent Loechner, and Doran Wilde. Deriving formulae to count solutions to parameterized linear systems using ehrhart polynomials: Applications to the analysis of nested-loop programs. Technical Report RR 97-05, ICPS, apr 1997. 270 5. Fabien Coelho. Contributions to HPF Compilation. PhD thesis, Ecole des mines de Paris, October 1996. 269 6. Jean-Luc Dekeyser and Christian Lefebvre. Hpf-builder: A visual environment to transform fortran 90 codes to hpf. International Journal of Supercomputing Applications and High Performance Computing, 11(2):95–102, 1997. 263, 271 7. Thomas Fahringer. Compile-time estimation of communication costs for data parallel programs. Journal of Parallel and Distributed Computing, 39(1):46–65, November 1996. 264 8. Paul Feautrier. Parametric integer programming. RAIRO Recherche Op´erationnelle, 22:243–268, September 1988. 269
9. Paul Feautrier. Towards automatic distribution. Parallel Processing Letters, 4(3):233–244, 1994. 264 10. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, College of Engineering, University of Illinois at Urbana-Champaign, September 1992. 264 11. Manish Gupta and Prithviraj Banerjee. Compile-time estimation of communication costs of programs. Journal of Programming Languages, 2(3):191–225, September 1994. 264 12. Ken Kennedy and Ulrich Kremer. Automatic data layout for High Performance Fortran. In Sidney Karin, editor, Proceedings of the 1995 ACM/IEEE Supercomputing Conference, December 3–8, 1995, San Diego Convention Center, San Diego, CA, USA, New York, NY 10036, USA and 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA, 1995. ACM Press and IEEE Computer Society Press. 264 13. Kathleen Knobe, Joan D. Lukas, and Guy L. Steele. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8:102–118, 1990. 264 14. Jingke Li and Marina Chen. Index domain alignment: Minimizing cost of crossreferencing between distributed arrays. In Frontiers 90: The 3rd Symposium on the Frontiers of Massively Parallel Computation, College Park, MD, October 1990. 264 15. S. Wholey. Automatic data mapping for distributed-memory parallel computers. In ACM, editor, Conference proceedings / 1992 International Conference on Supercomputing, July 19–23, 1992, Washington, DC, INTERNATIONAL CONFERENCE ON SUPERCOMPUTING 1992; 6th, pages 25–34, New York, NY 10036, USA, 1992. ACM Press. 264 16. Doran Wilde. A library for doing polyhedral operations. Master’s thesis, Oregon State University, Corvallis, Oregon, dec 1993. 268
Modeling the Communication Behavior of Distributed Memory Machines by Genetic Programming
L. Heinrich-Litan, U. Fissgus, St. Sutter, P. Molitor, and Th. Rauber
Computer Science Institute, Department for Mathematics and Computer Science, Martin-Luther-Universität Halle-Wittenberg, D-06099 Halle (Saale), Germany
Abstract. Due to load imbalance and communication overhead, the runtime behavior of distributed memory machines is very complex. The contribution of this paper is to show that runtime functions predicting the execution time of communication operations can be generated by means of the genetic programming paradigm. The runtime functions generated dominate those presented in the literature to date.
1 Introduction
Distributed memory machines (DMMs) provide large computing power which can be exploited for solving large problems or computing solutions with high accuracy. Nevertheless, DMMs are still not broadly accepted. One of the main reasons is the costly development process for a specific parallel algorithm on a specific DMM. This is due to the fact that parallel algorithms on DMMs may show a complex runtime behavior caused by communication overhead and load imbalance. Thus, there is considerable research effort to model the performance of DMMs. This includes modeling the runtimes of communication operations with parametrized formulas [6,5,4,8,1]. The modeling of the execution time of communication operations can be used in compiler tools, in parallelizing compilers or simply as concise information of the performance behavior for the application programmer. Genetic algorithms (GAs) are stochastic optimization techniques which simulate the natural evolutionary process of beings. They often outperform conventional optimization methods when applied to difficult problems. We refer to [2] for an overview on complex problems in the area of industrial engineering, where GAs have been successfully applied. The genetic programming approach (GP) has been introduced by Koza [7] and is a special form of a genetic algorithm. The contribution of this paper is to show that runtime functions of high quality, which model the execution time of communication operations, can be modeled by the genetic programming approach. To illustrate the effectiveness of the approach we have chosen [8] for comparison. In [8], collective communication
This work has been supported in part by DFG grant Ra 524/5 and Mo 645/5.
operations from the communication libraries PVM and MPI are investigated and compared. Their execution time is modeled by runtime functions found by curve fitting. In this paper we demonstrate the effectiveness of GPs to model the execution time of communication operations on DMMs, by example. Especially we demonstrate that GP generates runtime functions of higher quality than those published in literature till now. The paper is structured as follows. Section 2 gives a brief overview on the investigations made by [8]. Section 3 presents how performance modeling can be attacked by GPs. Experimental results are shown and discussed in Section 4.
2 Runtime Functions Generated by Curve Fitting
First, we give an overview of the communication operations which we use in our experiments. Then we present the model of the communication behavior on the IBM SP2 obtained by curve fitting. The specific execution platform that [8] used for their experiments is an IBM SP2 with 37 nodes and 128 MByte main memory per processor from GMD St. Augustin, Germany, with IBM MPI-F, Version 1.41. They investigate and compare different communication operations. Due to space limitation, we only report on two communication operations.
Single-transfer operation: In MPI, the standard point-to-point communication operations are the blocking MPI Send() and MPI Recv(). For blocking send, the control does not return to the executing process before the send buffer can be reused safely. Some implementations use an intermediate buffer.
Scatter operation: A single-scatter operation is executed by the global operation MPI Scatter() which must be called by all participating processes. As effect, the specified root process sends a part of its own local data to each other process.
In [8], the runtimes of MPI communication operations on the IBM SP2 are modeled by runtime formulas found by curve fitting that depend on various machine parameters including the number of processors, the bandwidth of the interconnecting networks, and the startup times of the corresponding operations. The investigations resulted in the parameterized runtime formulas

    f_scatter^cf(p, b) = 10.28 · 10^−6 + 91.59 · 10^−6 · p + 0.030 · 10^−6 · p · b  [µsec]

for the scatter operation, and

    f_single^cf(b) = 211.8 · 10^−6 + 0.030 · 10^−6 · b  [µsec]

for the single-transfer operation. The parameters b and p are the message size in bytes and the number of processors, respectively.
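For concreteness, the two fitted formulas can be evaluated with a small helper (our own packaging, not from the paper; the coefficients are those quoted above and the unit convention follows the text):

#include <stdio.h>

/* Curve-fitted runtime models from [8]; p: number of processors, b: message size in bytes. */
static double f_scatter_cf(double p, double b) {
    return 10.28e-6 + 91.59e-6 * p + 0.030e-6 * p * b;
}
static double f_single_cf(double b) {
    return 211.8e-6 + 0.030e-6 * b;
}

int main(void) {
    printf("scatter, p=16, b=1024: %g\n", f_scatter_cf(16.0, 1024.0));
    printf("single transfer, b=1024: %g\n", f_single_cf(1024.0));
    return 0;
}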
3 The Genetic Programming Approach
The usual form of genetic algorithm was described by Goldberg [3]. Genetic algorithms are stochastic search techniques based on the mechanism of natural selection and natural genetics. They start with an initial population, i.e., an initial set of random solutions which are represented by chromosomes. The chromosomes evolve through successive iterations, called generations. During each
generation, the solutions represented by the chromosomes of the population are evaluated using some measures of fitness. To create the next generation, new chromosomes, called offsprings, are formed. The chromosomes involved in the generation of a new population are selected according to their fitness values. Fitter chromosomes have higher probabilities of being selected. After several generations, the algorithm converges to the best chromosome which hopefully represents the optimal or suboptimal solution to the problem. GAs whose chromosomes are syntax trees of formulas, i.e., computer programs, are GPs. In the next sections we present the GP we have applied to the problem of finding high-quality runtime functions modeling the execution time of the MPI communication operations. We use terminology given in [7].
3.1 The Chromosomes
In genetic programming, a chromosome is an element of the set of all possible combinations of functions that can be composed recursively from a set F of basic functions and a set T of terminals. Each particular function f ∈ F takes a specified number of arguments, specifying the arity of f. The basic functions we allowed in our application are addition (+) and multiplication (·), both of arity 2, and the operations sqr, sqrt, ln, and exp, all of arity 1. The set T is composed of one or two variables p and b (p specifies the number of processors and b specifies the message size), and the constants 1 to 10, 0.1, and 10^−6. We represent each chromosome by the syntax tree corresponding to the expression. In the following, we identify a chromosome with the runtime function which is represented by that chromosome in order to simplify the wording.
3.2 Fitness Function
Fitness is the driving force of Darwinian natural selection and, likewise, of genetic programming. It may be measured in many different ways. The accuracy of the fitness function with respect to the application is crucial for the quality of the results produced by GP. In the application handled here, experiments have shown that using the average percental error of a chromosome as fitness yields the best solutions. Of course, as the aim is to minimize the average percental error, a chromosome is said to be fit if its average percental error is small. Let Pos be the set of measuring points and m(α) be the measured data at measuring point α; then the percental error error_f(α) of a runtime function f at point α and the average percental error error_f of f are given by

    error_f(α) = |m(α) − f(α)| / m(α) · 100   and   error_f = ( Σ_{α ∈ Pos} error_f(α) ) / |Pos|,

respectively. There are two problems we have to care about when applying this fitness function. The populations can contain chromosomes whose average percental error is greater than the maximal decimal value (which is about 1.79 · 10^308 in GNU g++).
In this case, we have to redefine the error of these chromosomes to be some large value. The other problem is that some operations of the syntax trees are not defined, e.g., ln a with a ≤ 0. The fitness function of our GP replaces these faulty operations by constants and adds a penalty to the error. Chromosomes with faulty operations are not taken as the final result of the GP.
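A small sketch of this fitness measure in code (our own illustration; the arrays and the overflow guard are simplifications of the scheme described above):

#include <math.h>
#include <float.h>

/* Average percental error of a model over n measuring points:
   error_f = (1/n) * sum |m_i - f_i| / m_i * 100, with a crude guard
   against exceeding the maximal double value mentioned in the text. */
double average_percental_error(const double *measured, const double *predicted, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double e = fabs(measured[i] - predicted[i]) / measured[i] * 100.0;
        if (!isfinite(e) || e > DBL_MAX / n)
            e = DBL_MAX / n;          /* redefine a huge error to a large finite value */
        sum += e;
    }
    return sum / n;
}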
3.3 Crossover and Mutation Operators
The crossover operator for GP creates variation in the population by producing new offsprings that consist of parts taken from each parent. It starts with two parental syntax trees selected with error-inverse-proportionate selection, and produces two offspring syntax trees by independently selecting, using a uniform probability distribution, one random point in each parent to be the crossover point for that parent. The offsprings are produced by exchanging the crossover fragments. To direct the algorithm to generate smooth runtime functions we restricted the solution region to syntax trees of depth up to a constant c. The mutation operator introduces random changes in structures randomly selected from the population. It begins by selecting a point within the syntax tree at random. Then, it removes the fragment which is below this mutation point and inserts a randomly generated subtree at that point. The depth of the new syntax tree is also limited by c.
3.4 Initialization and Termination
There are several methods to initialize the GP. At the beginning of our experiments we generated each chromosome of the first population recursively top down. The semantics of a node v of depth less than c is taken from F ∪ T at random. If a node has depth c, the semantics is randomly taken from the set T of terminals. Experiments showed that the results are better if we generate a part of the initial chromosomes by approximation, and the rest of them as described above. These initial approximation functions represent linear dependencies between the measuring points and the measured data. In our experiments, the GP stops after having considered a predefined number of generations.
4 Experimental Results
We begin by quantifying the quality of the runtime functions given in [8]. Then we compare these runtime functions to those generated by GP and discuss the results. We close the section by presenting a runtime formula generated by GP. We evaluated the runtime functions given in [8] (see Section 2) by using the fitness function specified in Section 3.2. The test set of measured data contained communication times for p = 4, 8, 16, 32 processors and message sizes between 4 and 800 bytes with stepsize 20, between 800 and 4.000 with stepsize 400, and between 4.000 and 240.000 with stepsize 4.000.
Table 1. Deviations of the runtime functions. The last seven columns give the percentage of α ∈ Pos with error_f(α) < x.

  Operation        Approach        error_f   x=0.01  x=0.1   x=1    x=10   x=25   x=50   x=100
  scatter          curve fitting    57.36     0.0     0.6     2.6    23.8   49.0   72.5    83.4
  scatter          GP                9.68     0.0     2.3    14.2    62.9   93.7   97.7   100.0
  single-transfer  curve fitting    71.90     0.0     1.2    20.9    47.4   59.4   67.4    73.4
  single-transfer  GP               14.64     0.3     1      17.6    59.6   75.6   93     100.0
To generate runtime functions by GP, we have to set some parameters. By setting the size of each population to 50, the maximum number of generations to 12000, the maximum depth of syntax trees to 15, and the probabilities for applying crossover, mutation, and the copy operation to 80%, 10% and 10% respectively, we obtained runtime functions by GP which dominate by far the runtime functions from [8] obtained by curve fitting. In Table 1 we compare the deviations of the runtime functions found by curve fitting to the deviations of the best runtime functions generated after having run the GP algorithm 3 times, which takes about 6 hours on a SUN SPARCstation Ultra 1 with 128 MByte RAM. The first column specifies the communication operation, the second column the approach used, and the third column the average percental error of the corresponding runtime function. Columns 4 to 10 give the percentage of measuring points α ∈ Pos for which the percental error error_f(α) is less than x, for x = 0.01, 0.1, 1, 10, 25, 50 and 100, respectively. Figure 1 graphically compares the deviation of the runtime function from [8] and the deviation of the runtime function generated by GP for the scatter operation with respect to some set of measured data. The curves differ very much for small values of parameter b, whereby the curve generated by GP dominates by far the curve found by curve fitting. For large parameter values, both runtime functions predict the execution time of the scatter operation roughly alike. Figure 2 shows the comparison of the deviations of the runtime functions found by curve fitting and GP, respectively, for the single-transfer operation (which depends only on parameter b) with respect to several sets of measured data. Figure 2 shows only the cut-out determined by the small message sizes (up to 500 bytes). Let us discuss the results. The disadvantage of curve fitting is that usually only one simple function is selected to model a communication operation for a large range of message sizes. This may lead to poor results in some regions since often different methods are used for different message sizes. GP on the other hand allows for generating complex functions. We close the section by presenting the runtime formula f_scatter^GP predicting the execution time of the scatter operation, which has been generated by GP:

    f_scatter^GP(p, b) = 10^−6 · (b · p · (p^{1/2} − 1)^{1/2})^{1/2} · (b · p · 0.000625 + 0.00005 · b + 0.00375 · p² + 0.003 · p + ln(e^{p/2}) + 49)^{1/2}.

It shows that the runtime formulas generated by GPs are rather complex. In our opinion, this is the only disadvantage of the approach we presented.
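Transcribed into code, the GP-generated formula can be evaluated as follows (our own transcription of the printed expression, not from the paper; note that the ln(e^{p/2}) term simplifies to p/2):

#include <math.h>

/* GP-generated scatter runtime formula as quoted above.
   p: number of processors, b: message size in bytes. */
double f_scatter_gp(double p, double b) {
    double outer = sqrt(b * p * sqrt(sqrt(p) - 1.0));
    double inner = sqrt(b * p * 0.000625 + 0.00005 * b + 0.00375 * p * p
                        + 0.003 * p + 0.5 * p   /* ln(exp(p/2)) = p/2 */
                        + 49.0);
    return 1e-6 * outer * inner;
}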
[Fig. 1 plot: Error [%] versus message size (bytes, logarithmic scale) and number of processors, for the curve-fitting and GP runtime functions.]
Fig. 1. Deviation of the runtime function found by curve fitting and deviation of the runtime function generated by GP for the scatter operation.
[Fig. 2 plot: Error [%] versus message size (bytes, up to 500), for the curve-fitting and GP runtime functions.]
Fig. 2. Deviation of the runtime function obtained by curve fitting and deviation of the runtime function generated by GP for the single-transfer operation.
5 Conclusions
We showed by example that genetic programming provides runtime functions predicting the execution time of communication operations of DMMs which dominate the runtime functions found by curve fitting. The next step of our investigation is to automatically generate different runtime functions for different intervals of the parameters; the intervals themselves have to be determined by genetic programming, too. Using such non-uniform runtime functions seems to be necessary to predict the execution times more accurately, as DMMs use various communication protocols depending on the message size.
References 1. Foschia, R., Rauber, Th., and R¨ unger, G.: Prediction of the Communication Behavior of the Intel Paragon. In Proceedings of the 1997 IEEE MASCOTS Conference, pp.117-124, 1997. 273 2. Gen, M., and Cheng, R.: Genetic Algorithms & Engineering Design. John Wiley & Sons, Inc., New York, 1997. 273 3. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, Reading, MA, 1989. 274 4. Hu, Y., Emerson, D., and Blake, R.: The communication performance of the Cray T3D and its effect on iterative solvers. Parallel Computing, 22:829-844, 1996. 273 5. Hwang, K., Xu, Z., and Arakawa, M.: Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing. IEEE Transactions on Parallel and Distributed Systems, 7(2):522-536, 1996. 273 6. Johnson, L.: Performance Modeling of Distributed Memory Architecture. Journal of Parallel and Distributed Computing, 12:300-312, 1991. 273
7. Koza, J.: Genetic Programming. The MIT Press, 1992. 273, 275 8. Rauber, Th., and R¨ unger, G.: PVM and MPI Communication Operations on the IBM SP2: Modeling and Comparison. In Proceedings of HPCS 97, 1997. 273, 274, 276, 277
Representing and Executing Real-Time Systems
Rafael Ramirez
National University of Singapore, Information Systems and Computer Science Department, Lower Kent Ridge Road, Singapore 119260
[email protected]
Abstract. In this paper, we describe an approach to the representation, specification and implementation of real-time systems. The approach is based on the notion of concurrent object-oriented systems where processes are represented as objects. In our approach, the behaviour of an object (its safety properties and time requirements) is declaratively stated as a set of temporal constraints among events which provides great advantages in writing concurrent real-time systems and manipulating them while preserving correctness. The temporal constraints have a procedural interpretation that allows them to be executed, also concurrently. Concurrency issues and time requirements are separated from the code, minimizing dependency between application functionality and concurrency/timing control.
1 Introduction
Parallel computers and distributed systems are becoming increasingly important. Their impressive computation-to-cost ratios offer considerably higher performance than that possible with sequential machines. Yet there are few commercial applications written for them. The reason is that programming in these environments is substantially more difficult than programming for sequential machines, in respect of both correctness (to achieve correct synchronization) and efficiency (to minimize slow interprocess communication). While in traditional sequential programming the problem is reduced to making sure that the program's final result (if any) is correct and that the program terminates, in concurrent programming it is not necessary to obtain a final result but to ensure that several properties hold during program execution. These properties are classified into safety properties, those that must always be true, and progress (or liveness) properties, those that must eventually be true. Partial correctness and termination are special cases of these two properties. To make things even worse, there exists a wide variety of parallel architectures and a corresponding variety of concurrent programming paradigms. For most problems, it is not possible to envisage a general concurrent algorithm which is well suited to all parallel architectures. Real-time systems are inherently concurrent systems which, in addition to usually requiring synchronisation and communication with both their environment and within their own components, need to execute under timing constraints. We
propose an approach which aims to reduce the inherent complexity of writing concurrent real-time systems. Our approach consists of the following:
1. In order to incorporate the benefits of the declarative approach to concurrent real-time programming, it is necessary that a program describe more than its final result. We propose a language based on classical first-order logic in which all safety properties and time requirements of programs are declaratively stated.
2. Programs are developed in such a way that they are not specific to any particular concurrent programming paradigm. The proposed language provides a framework in which algorithms for a variety of paradigms can be expressed, derived and compared.
3. Applications can be specified as objects, which provides encapsulation and inheritance. The language's object-oriented features produce structured and understandable programs, therefore reducing the number of potential errors.
4. Concurrency issues and timing requirements are separated from the rest of the code, minimizing dependency between application functionality and concurrency control. In addition, concurrency issues and timing requirements can also be independently specified. Thus, it is in general possible to test different synchronisation schemes without modifying the timing requirements and vice versa. Also, a synchronisation/time scheme may be reused by several applications. This provides great advantages in terms of program flexibility, reuse and debugging.
2 Related Work
This work is strongly related to Tempo ([8]) and Tempo++ ([17,16]). Tempo is a declarative concurrent programming language based on first-order logic. It is declarative in the sense that a Tempo program is both an algorithm and a specification of the safety properties of the algorithm. Tempo++ extends Tempo (as presented in [8]) by adding numbers and data structures (and operations and constraints on them), as well as supporting object-oriented programming. Both languages explicitly described processes as partially ordered set of events. Events are executed in the specified order, but their execution times are only implicit. Here, we extend Tempo++ to support real-time by making event execution times explicit as well as allowing the specification of the timing requirements in terms of relations among these execution times. Concurrent logic programming is also an important influence for this work. This is because concurrent logic programming languages (e.g. Parlog [6], KL1 [22]) and their object-oriented extensions (e.g. Polka [4], Vulcan [10]) preserve many of the benefits of the abstract logic programming model, such as the logical reading of programs and the use of logical terms to represent data structures. However, although these languages preserve many benefits of the logic programming model, and their programs explicitly specify their final result, important program properties, namely safety and progress properties, remain implicit. These properties have to be preserved by using control features such as
modes and sequencing, producing programs with little or no declarative reading [17]. In addition, traditional concurrent logic programming languages do not provide support for real-time programming and thus, are not suitable for this kind of applications. Concurrent constraint programming languages [19] and their object-oriented and real-time extensions (e.g. [21,20]) suffer from the same problems as concurrent logic languages. Program safety and progress properties remain implicit. These properties are preserved by checking and imposing value constraints on shared variables in the store. Also, there is no clear separation of programs concurrency control, timing requirements and application functionality. In addition, the concurrent constraint model is most naturally suited for shared memory architectures not being easily adapted to model distributed programming systems. Our work is also related to specification languages for concurrent real-time systems. It is closer to languages based on temporal logic, e.g. Unity [3] and TLA [13], and real-time extensions of temporal logic, e.g. [12], than process algebras [9,14] and real-time extensions of state-transition formalisms [2,5]. Unity and TLA specifications can express the safety and progress properties of concurrent systems but they are not executable. Both formalisms model these systems by sequences of actions that modify a single shared state which might be a bottleneck if the specifications were to be executed. Section 3 introduces the core concepts of our approach to real-time programming, namely events, precedence constraints and real-time, and outlines a methodology for the development of concurrent real-time systems (to make the paper sel-contained there is a small overlap with [8] in Sections 3.1 and 3.3). Section 4 describes how these concepts can be extended by adding practical programming features such as data structures and operations on them as well as supporting object-oriented programming. Finally, Section 5 summarizes our approach and its contributions as well as some areas of future research.
3 Events, Constraints and Real-Time
3.1 Events and Precedence
Many researchers, e.g. [11,15], have proposed methods for reasoning about temporal phenomena using partially ordered sets of events. Our approach to the specification and implementation of real-time systems is based on the same general idea. We propose a language in which processes are explicitly described as partially ordered sets of events. The event ordering relation X < Y, read as “X precedes Y”, is the main primitive predicate in the language (there are only two primitive predicates), its domain is the set of events, and it is defined by the following axioms (the last axiom is actually a corollary of the other three):
∀X∀Y∀Z (X < Y ∧ Y < Z → X < Z)
∀X∀Y (time(Y, eternity) → X < Y)
∀X (X < X → time(X, eternity))
∀X∀Y (Y < X ∧ time(Y, eternity) → time(X, eternity))
The meaning of the predicate time(X, Value) is “event X is executed at time Value” and eternity is interpreted as a time point that is later than all others. Events are atomically executed in the specified order (as soon as their predecessors have been executed) and no event is executed for a variable that has execution time eternity. Long-lived processes (processes comprising a large or infinite set of events) are specified by allowing an event to be associated with one or more other events: its offsprings. The offsprings of an event E are named E + 1, E + 2, etc., and are implicitly preceded by E, i.e. E < E + N, for all N. The first offspring E + 1 of E is referred to by E+. Syntactically, offsprings are allowed in queries and bodies of constraint definitions, but not in their heads. Long-lived processes may not terminate because body events are repeatedly introduced. Interestingly, an infinitely-defined constraint need not cause nontermination: a constraint all of whose arguments have time value eternity will not be expanded. The time value of an event E can be bound to eternity (as a consequence of the axioms defining “<”) by enforcing the constraint E < E. No offsprings of E will be executed. Their time values are known to be bound to eternity since they are implicitly preceded by E and the time value of E is eternity. In the language, disjunction is specified by the disjunction operator ’;’ (which has lower priority than ’,’ but higher than ’←’). The clause H ← Cs1; . . . ; Csn abbreviates the set of clauses H ← Cs1, . . ., H ← Csn. In the absence of disjunction, a query determines a unique set of constraints. An interpreter produces any execution sequence satisfying those constraints; it does not matter which one. With disjunction, a single set of constraints must be chosen from among many possible sets, i.e., a single alternative must be selected from each disjunction. In the language described above, processes are explicitly described as partially ordered sets of events. Their behaviour is specified as logical formulas which define temporal constraints among a set of events. This logical specification of a process has a well defined and understood semantics and allows for the possibility of employing both specification and verification techniques based on formal logic in the development of concurrent systems.
3.2
Real-Time
Time requirements in the language may be specified by using the primitive predicate time(X, Value). This constrains the execution time of event X by forcing X to be executed at time Value. In this way, quantitative temporal requirements (e.g. maximal, minimal and exact distance between events) can be expressed in the language. For instance, the maximal distance between two events E and F may be specified by the constraint max(E, F, N), meaning “event E is followed by event F within N time units”, and defined by
max(E, F, N) ← E < F, time(E, Et), time(F, Ft), Ft ≺ Et + N.
where ≺ is the usual less-than arithmetic relationship among real numbers. Thus, maximal distance between families of events, i.e. events and their offsprings, can be specified by the constraint max*(E, F, N), meaning “occurrences of events E, E+, E++, . . . are respectively followed by occurrences of events F, F+, F++, . . . within N time units”, and defined by
max*(E, F, N) ← max(E, F, N), max*(E+, F+, N).
Minimal and exact distance between two events, as well as other common quantitative temporal requirements, may be similarly specified.
Example 1. Consider an extension to the traditional producer and consumer system where each product cannot be held inside the buffer longer than time t. The system may be represented by the timed Petri net in Figure 1.
[Fig. 1 (timed Petri net): transitions Produce, Send, Receive and Consume; places Ready to produce, Ready to send, Buffer, Ready to receive and Ready to consume; timing interval (0,t).]
Fig. 1. A producer-consumer system with timed buffer.
The producer and consumer components of the system may be respectively specified by P <* S, S <* P+ and R <* C, C <* R+, where P and S represent occurrences of events produce and send and R and C represent occurrences of events receive and consume (P+, S+, R+, C+, . . . represent later occurrences of these events). The behaviour of the buffer may be specified by
tbuf(S, R, T) ← max(S, R, T), tbuf(S+, R+, T); tbuf(S+, R, T).
where T is the longest time the buffer can hold a product.
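To make the meaning of such constraints concrete, the following sketch checks a finite, recorded trace of event execution times against a max*(E, F, N) constraint. It is only an illustration: the trace representation (a dictionary keyed by (event, occurrence) pairs), the Python setting and the function name are assumptions made for this example, not part of Tempo or Tempo++.

ETERNITY = float("inf")

def satisfies_max_star(trace, e, f, n):
    # Check that each recorded occurrence (e, k) is followed by (f, k)
    # within n time units, i.e. time(e_k) <= time(f_k) < time(e_k) + n.
    for (name, k), t_e in trace.items():
        if name != e or t_e == ETERNITY:
            continue
        t_f = trace.get((f, k), ETERNITY)   # a missing occurrence counts as eternity
        if not (t_e <= t_f < t_e + n):
            return False
    return True

# A producer-consumer trace with a buffer bound of 3 time units:
trace = {("send", 0): 1.0, ("receive", 0): 2.5,
         ("send", 1): 4.0, ("receive", 1): 7.5}
print(satisfies_max_star(trace, "send", "receive", 3))   # False: 7.5 >= 4.0 + 3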
3.3 Development of Concurrent Real-Time Systems
In our approach, the specification of the behaviour of the processes in the system is a program, i.e. it is possible to directly execute the specification. Thus, a program P may be transformed into a program that logically implies P (in [7] some transformation rules that can be applied to programs are presented). The derived program is guaranteed to have the same safety properties as the original one, though its progress properties may differ, e.g. one may terminate and the
other not. The program may be incrementally strengthened by introducing timing constraints to specify the system time requirements. Finally, the program can be turned into a concurrent one by grouping constraints into processes. This final step affects neither the safety nor the progress properties of the algorithm, provided that some restrictions are observed.
4
Object-Oriented Programming
Our approach to the specification and implementation of real-time systems is based on an extension to the logic presented in the previous section. The logic is extended by adding data structures and operations on them, by allowing values to be assigned to events for inter-process communication and by supporting object-oriented programming. A detailed discussion of these ideas can be found in [17]. Our language supports object-oriented programming by allowing a class to encapsulate a set of constraints, specified by a constraint query, together with the related constraint definitions, specified by a set of clauses, in such a way that it describes a set of potential run-time objects. The constraint query defines a partial order among a set of events and the timing requirements on their execution times, and the constraint definitions provide meaning to the user-defined constraints in the query. Both the query and the definitions are local to the class. Each of the class run-time objects corresponds to a concurrent process. The name of a class may include variable arguments in order to distinguish different instances of the same class. Events appearing in the constraint query of an object implicitly belong to that object. If an event is shared between several objects (it belongs to two or more objects), it cannot be executed until it has been enabled by all objects that share it. In order to specify the object computation, actions may be associated with events. The representation and manipulation of data as well as the spawning of new objects is handled by these actions. Our current implementation of the language assumes an action to be a definite goal. In order to execute an event, the goal associated with it (if any) has to be solved first. Objects communicate via shared events’ values. Shared events represent communication channels and values assigned to them represent messages. A novel feature of the approach described here is that an object can partially inherit another object, i.e. an object can inherit either another object’s temporal constraints or actions. Thus, inheritance of concurrency issues and inheritance of code are independently supported. This allows an object to have its synchronisation scheme inherited from another object while defining its own code, or vice versa, or even inherit its synchronisation scheme and code from two different objects. Our language appears to add to the proliferation of concurrent programming paradigms: processes (objects) communicate via a new medium, shared events. However, our objective is rather to simplify matters by providing a framework in which algorithms for a variety of concurrent programming paradigms can be expressed, derived, and compared. Among the paradigms we have considered are
synchronous message passing, asynchronous message passing and shared mutable variables.
Execution: Our current implementation uses a constraint set CS containing the constraints still to be satisfied, and a try list TL containing the events that are to be tried but have not yet been executed. The interpreter constantly takes events from TL and checks if they are enabled, i.e. if they are not preceded by any other event (according to CS) and the timing constraints on their execution times are satisfiable, in which case they are executed. The order in which the events are tried is determined by the timing constraints in CS.
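The loop just described can be pictured with a small sketch. The following Python fragment only illustrates the idea, with CS reduced to a set of precedence pairs and TL to a list of event names; the actual implementation is written in Prolog and also handles timing constraints, actions and shared events.

def enabled(event, cs, executed):
    # An event is enabled if every event that must precede it has been executed.
    return all(x in executed for (x, y) in cs if y == event)

def run(tl, cs):
    executed = []
    pending = list(tl)
    while pending:
        progress = False
        for event in list(pending):
            if enabled(event, cs, executed):
                executed.append(event)        # "execute" the event
                pending.remove(event)
                progress = True
        if not progress:                      # nothing enabled: execution is blocked
            break
    return executed

# Example: produce < send < receive < consume
cs = {("produce", "send"), ("send", "receive"), ("receive", "consume")}
print(run(["consume", "receive", "send", "produce"], cs))
# -> ['produce', 'send', 'receive', 'consume']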
5
Conclusions
We have described an approach to the representation, specification and implementation of concurrent real-time systems. The approach is based on the notion of concurrent object-oriented systems where processes are represented as objects. In the approach, each object is explicitly described as a partially ordered set of events and executes its events in the specified order. The partial order is defined by a set of temporal constraints, and object synchronisation and communication are handled by shared events. Object behaviour (safety properties and time requirements) is declaratively stated, which provides great advantages in writing real-time systems and manipulating them while preserving correctness. The specification of the behaviour of an object has a procedural interpretation that allows it to be executed, also concurrently. Our approach can also be used as the basis for a development methodology for concurrent systems. First, the system can be specified in a perspicuous manner, and then this specification may be incrementally strengthened and divided into objects that communicate using the intended target paradigm. An object can partially inherit another object, i.e. an object can inherit either another object’s temporal constraints, actions or action definitions. Thus, inheritance of concurrency issues and inheritance of code are independently supported. Current status. A prototype implementation of the complete language has been written in Prolog, and used to test the code of a number of applications. The discussion of these applications is outside the scope of this paper. Future work. In the language presented, the action associated with an event can, in principle, be specified in any programming language. Thus, different types of languages, such as the imperative languages, should be considered and their interaction with the model investigated. Events are considered atomic. Instead of being atomic, they could be treated as time intervals during which other events can occur ([1] and [11]). Such events can be further decomposed to provide an arbitrary degree of detail. This could be useful in deriving programs from specifications. Object behaviour is specified as logical formulas which define temporal constraints among a set of events. This logical specification of an object has a well defined and understood semantics and we are planning to look carefully into the
possibility of employing both specification and verification techniques based on formal logic in the development of concurrent real-time systems.
References
1. Allen, J.F. 1983. Maintaining knowledge about temporal intervals. Comm. ACM 26, 11, pp. 832-843.
2. Alur, R., and Dill, D.L. 1990. Automata for modeling real-time systems. In ICALP’90: Automata, Languages and Programming, LNCS 443, pp. 322-335. Springer-Verlag.
3. Chandy, K.M. and Misra, J. 1988. Parallel Program Design. Addison-Wesley.
4. Davison, A. 1991. From Parlog to Polka in Two Easy Steps. In PLILP’91: 3rd Int. Symp. on Programming Language Implementation and LP, Springer LNCS 528, pp. 171-182, Passau, Germany, August.
5. Dill, D.L. 1989. Timing assumptions and verification of finite-state concurrent systems. In CAV’89: Automatic Verification Methods for Finite-state Systems, LNCS 407, pp. 197-212, Springer-Verlag.
6. Gregory, S. 1987. Parallel Logic Programming in PARLOG. Addison-Wesley.
7. Gregory, S. 1995. Derivation of concurrent algorithms in Tempo. In LOPSTR’95: Fifth International Workshop on Logic Program Synthesis and Transformation.
8. Gregory, S. and Ramirez, R. 1995. Tempo: a declarative concurrent programming language. Proc. of the ICLP (Tokyo, June), MIT Press, 1995.
9. Hoare, C.A.R. 1985. Communicating Sequential Processes. Prentice Hall.
10. Kahn, K.M., Tribble, D., Miller, M.S., and Bobrow, D.G. 1987. Vulcan: Logical Concurrent Objects. In Research Directions in Object-Oriented Programming, B. Shriver, P. Wegner (eds.), MIT Press.
11. Kowalski, R., and Sergot, M. 1986. A Logic-based Calculus of Events. New Generation Computing, 4, 1, pp. 67-95.
12. Koymans, R. 1990. Specifying real-time properties with metric temporal logic. Real-time Systems, 2, 4, pp. 255-299.
13. Lamport, L. 1994. The temporal logic of actions. ACM Trans. on Programming Languages and Systems, 16, 3, pp. 872-923.
14. Milner, R. 1989. Communication and Concurrency. Prentice Hall.
15. Pratt, V. 1986. Modeling concurrency with partial orders. International Journal of Parallel Programming, 1(15):33-71.
16. Ramirez, R. 1995. Declarative concurrent object-oriented programming in Tempo++. In Proceedings of the ICLP’95 Workshop on Parallel Logic Programming, Japan, T. Chikayama, H. Nakashima and E. Tick (Eds.).
17. Ramirez, R. 1996. A logic-based concurrent object-oriented programming language. PhD thesis, Bristol University.
18. Ramirez, R. 1996. Concurrent object-oriented programming in Tempo++. In Proceedings of the Second Asian Computing Science Conference (Asian’96), Singapore. LNCS 1179, pp. 244-253. Springer-Verlag.
19. Saraswat, V. 1993. Concurrent constraint programming languages. PhD thesis, Carnegie-Mellon University, 1989. Revised version appears as Concurrent constraint programming, MIT Press, 1993.
20. Saraswat, V., et al. 1993. Programming in timed concurrent constraint languages. In Constraint Programming - Proceedings of the 1993 NATO ACM Symposium, pp. 461-410. Springer-Verlag.
21. Smolka, G. 1995. The Oz programming model. Lecture Notes in Computer Science Vol. 1000, Springer-Verlag, pp. 324-343.
22. Ueda, K. and Chikayama, T. 1990. Design of the kernel language for the parallel inference machine. Computer Journal 33, 6, pp. 494-500.
Fixed Priority Scheduling of Age Constraint Processes Lars Lundberg Department of Computer Science, University of Karlskrona/Ronneby, S-372 25 Ronneby, Sweden, [email protected]
Abstract. Real-time systems often consist of a number of independent processes which operate under an age constraint. In such systems, the maximum time from the start of process Li in cycle k to its end in cycle k+1 must not exceed the age constraint Ai for that process. The age constraint can be met by using fixed priority scheduling and periods equal to Ai/2. However, this approach restricts the number of process sets which are schedulable. In this paper, we define a method for obtaining process periods other than Ai/2. The periods are calculated in such a way that the age constraints are met. Our approach is better in the sense that a larger number of process sets can be scheduled compared to using periods equal to Ai/2.
1
Introduction
Real-time systems often consist of a number of independent periodic processes. These processes may handle external activities by monitoring sensors and then producing proper outputs within certain time intervals. A similar example is a process which continuously monitors certain variables in a database. When these variables or sensors change, the system has to produce certain outputs within certain time intervals. These outputs must be calculated from input values which are fresh, i.e. the age of the input value must not exceed certain time limits. The processes in these kinds of systems operate under the age constraint. The age constraint defines a limit on the maximum time from the point in time when a new input value appears to the point in time when the appropriate output is produced. Figure 1 shows a scenario where a value Ei appears shortly after process Li has started its k:th cycle (denoted Li^k). Process Li starts its execution by reading the sensor or variable. Consequently, Ei will not affect the output in cycle k. The output Fi corresponding to value Ei (or fresher) is produced at the end of cycle k+1. The age constraint Ai is defined as the maximum time from the beginning of the process’ execution in cycle k to the end of the process’ execution in cycle k+1. A scheduling scheme can be either static or dynamic. In dynamic schemes the priority of a process is decided at run-time, e.g. the earliest deadline algorithm [3]. In static schemes, processes are assigned a fixed priority, e.g. the rate-monotone
algorithm [4]. Fixed priority scheduling is relatively easy to implement and it requires less overhead than dynamic schemes. Most studies in this area have looked at scenarios where the computation time and the period of a process are known. However, for age constraint processes the period is not known. We have instead defined a maximum time from the beginning of the process’ execution in cycle k to the end of the process’ execution in cycle k+1. We would like to translate this restriction into a period for the process, thus making it possible to use fixed priority schemes. The age constraint is met by specifying a process period T′i = Ai/2 (we will use the notation Ti for other purposes), thus obtaining a set of processes with known periods T′i and computation times Ci. For such scenarios, it is well known that rate-monotone scheduling is optimal, and a number of schedulability tests have been obtained [4]. Specifying a process period T′i = Ai/2 for age constraint processes is, however, an unnecessarily strong restriction, which does not allow the start of Li in cycle k and the end in cycle k+2 to be separated by a time greater than 3T′i, whereas the age constraint allows a separation of up to 2Ai = 4T′i. In this paper we show that, by using rate-monotone scheduling, it is possible to define periods which are better than using periods T′i = Ai/2. Our method is better in the sense that we will be able to schedule a larger number of process sets than using T′i = Ai/2.
[Fig. 1 timeline: two consecutive cycles of Li, each of length Ti and containing a computation of length Ci; the value Ei arrives shortly after the start of cycle k (Li^k), the output Fi is produced at the end of cycle k+1 (Li^(k+1)), and Ai spans from the start of the computation in cycle k to the end of the computation in cycle k+1.]
Fig. 1. The age constraint for process Li .
2
Calculating Process Periods
Consider a set of n processes L = [L1, L2, ..., Ln], with associated age constraints Ai and computation times Ci. The priority of each process is defined by its age constraint Ai. The smaller the value Ai, the higher the priority of Li, i.e. the priority order is the same as for rate-monotone scheduling with periods Ai/2. We assume preemptive scheduling and we order the processes in such a way that Ai ≤ Ai+1.
We start by considering L1. This process has the highest priority, and is thus not interrupted by any other process. We want to select as long periods as possible, thus minimizing processor utilization. Obviously, the period of a process must not exceed Ai − Ci. Since L1 has the highest priority, it is safe to set the period of L1 to A1 − C1. In order to distinguish our periods from the ones which are simply equal to half the age constraint Ai, we denote our periods as Ti, i.e. T1 = A1 − C1. We now consider process L2. The maximum response time R2 of process L2 is defined as the maximum time from the release of L2 in cycle k to the time that L2 completes in the same cycle. If the execution of L2 is not interrupted by any other process, the response time is simply equal to the computation time C2, i.e. R2 = C2. However, the execution of L2 may be interrupted by L1. Process L1 may in fact interfere with as much as ⌈R2/T1⌉·C1 [3]. Consequently, R2 = C2 + ⌈R2/T1⌉·C1. The only unknown value in this equation is R2. The equation is somewhat difficult to solve due to the ceiling function ⌈R2/T1⌉. In general there could be many values of R2 that solve this equation. The smallest such value represents the worst-case response time for process L2. It has been shown that R2 can be obtained from this equation by forming a recurrence relationship. The technique for doing this is shown in [3]. The beginning of L2 in cycle k and the end of L2 in cycle k+1 may be separated by as much as T2 + R2, where T2 is the period that we will assign to process L2. From the age constraint we know that T2 + R2 ≤ A2. In order to minimize processor utilization we would like to select as long a period T2 as possible. Consequently, T2 = A2 − R2. In general, the maximum response time of process i can be obtained from the relation Ri = Ci + Σ_{j=1}^{i−1} ⌈Ri/Tj⌉·Cj [3]. When we know Ri, the cycle time for Li is set to Ti = Ai − Ri. Figure 2 shows a set with three processes L1, L2 and L3, defined by A1 = 8, C1 = 2, A2 = 10, C2 = 2, A3 = 12 and C3 = 2. From these values we obtain the period T1 = A1 − C1 = 8 − 2 = 6. The maximum response time R2 for process L2 is obtained from the relation R2 = C2 + ⌈R2/T1⌉·C1 = 2 + ⌈R2/6⌉·2. The smallest value R2 which solves this equation is 4, i.e. R2 = 4. Consequently, T2 = A2 − R2 = 10 − 4 = 6. The maximum response time R3 for process L3 is obtained from the relation R3 = C3 + ⌈R3/T1⌉·C1 + ⌈R3/T2⌉·C2 = 2 + ⌈R3/6⌉·2 + ⌈R3/6⌉·2. The smallest value R3 which solves this equation is 6, i.e. R3 = 6. Consequently, T3 = A3 − R3 = 12 − 6 = 6. In figure 2, the first release of L1 is done at time 2, the first release of L2 is done at time 1 and the first release of L3 is done at time 0. In the worst-case scenario, process Li may suffer from the maximum response time Ri in two consecutive cycles, i.e. in order to meet the age constraint we know that 2Ri ≤ Ai. Consequently, there is no use in selecting a Ti smaller than Ri. However, as long as we obtain Ti which are longer than or equal to Ri, the age constraint will be met. Therefore, process Li can be scheduled if and only if Ri ≤ Ai/2.
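The calculation described in this section can be summarised in a few lines of code. The following Python sketch only illustrates the equations above (a fixed-point iteration for Ri followed by Ti = Ai − Ri); the function name and the early exit when Ri would exceed Ai/2 are choices made for this example.

import math

def calculate_periods(A, C):
    # A[i], C[i]: age constraint and computation time of process i,
    # processes ordered by increasing A (highest priority first).
    # Returns the periods T[i] = A[i] - R[i], or None if some R[i] > A[i]/2.
    T = []
    for i in range(len(A)):
        r = C[i]                                   # fixed-point iteration for R_i
        while True:
            r_new = C[i] + sum(math.ceil(r / T[j]) * C[j] for j in range(i))
            if r_new > A[i] / 2:                   # R_i would exceed A_i/2: not schedulable
                return None
            if r_new == r:
                break
            r = r_new
        T.append(A[i] - r)                         # T_i = A_i - R_i
    return T

# Example from the paper: A = (8, 10, 12), C = (2, 2, 2) gives T = [6, 6, 6].
print(calculate_periods([8, 10, 12], [2, 2, 2]))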
[Fig. 2 timeline: the schedules of L1, L2 and L3 over time 0–18; S and E mark the start and end of each computation, and each process is shown with its age constraint Ai and the calculated period Ti = 6.]
Fig. 2. A set with three processes L1, L2 and L3, defined by A1 = 8, C1 = 2, A2 = 10, C2 = 2, A3 = 12 and C3 = 2.
Theorem 1. A set of processes which is schedulable using rate-monotone priority assignment and T′i = Ai/2 is also schedulable using our scheme.
Proof. The difference between our scheme and rate-monotone with T′i = Ai/2 is that we use different periods Ti. Since the process set is schedulable using the T′i periods we know that R′i ≤ T′i (1 ≤ i ≤ n), where R′i denotes the maximum response time using the T′i periods. By use of induction, we show that Ri ≤ R′i (1 ≤ i ≤ n).
• R1 = C1 ≤ R′1 = C1
• If Rj ≤ R′j (1 ≤ j ≤ x < n), then T′j = Aj/2 ≤ Aj − Rj = Tj. If T′j ≤ Tj, then Rx+1 = Cx+1 + Σ_{j=1}^{x} ⌈Rx+1/Tj⌉·Cj ≤ R′x+1 = Cx+1 + Σ_{j=1}^{x} ⌈R′x+1/T′j⌉·Cj.
Consequently, Ri ≤ R′i ≤ T′i = Ai/2, i.e. Ri ≤ Ai/2, which means that process Li (1 ≤ i ≤ n) can be scheduled using our scheme.
Consider the processes in figure 2. If we would have used the periods Ai/2, we would have got T′1 = 4, T′2 = 5 and T′3 = 6. This would have resulted in a utilization of C1/T′1 + C2/T′2 + C3/T′3 = 2/4 + 2/5 + 2/6 = 1.23 > 1, i.e. the process set would not have been schedulable. Consequently, our scheme is better than rate-monotone with T′i = Ai/2 in the sense that we are able to schedule a larger number of process sets.
3
Simple Analysis
In the scheme that we propose, the period of a process depends on the priority of the process. The period for a process Li gets shorter if the priority of Li is reduced and vice versa. This property makes our scheme hard to analyze. However, for the limited case when there are two processes, a thorough analysis is possible.
Theorem 2. The priority assignment used in our scheme is optimal for all sets containing two processes.
Proof. Consider two processes L1 and L2, such that A1 ≤ A2. Assume that these two processes are schedulable if the priority of L2 is higher than the priority of L1. We will now show that if this is the case, the two processes are also schedulable if L1 has higher priority than L2. If L1 and L2 are schedulable when the priority of L2 is higher than the priority of L1, then R1 = C1 + ⌈R1/(A2 − C2)⌉·C2 = C1 + kC2 ≤ A1/2 (for some integer k > 0). Consequently, C2 ≤ A1/2 − C1. If we consider the opposite priority assignment we know that the schedulability criterion is that R2 = C2 + ⌈R2/(A1 − C1)⌉·C1 ≤ A2/2. Since C2 ≤ A1/2 − C1, we know that the maximum interference from process L1 on process L2 is C1, i.e. R2 = C2 + C1. Since C2 ≤ A1/2 − C1 and R2 = C2 + C1, we know that R2 ≤ A1/2, and since A1 ≤ A2, we know that R2 ≤ A2/2, thus proving the theorem.
Theorem 3. All sets containing two processes L1 and L2 for which C1/(A1 − C1) + C2/(A2 − C2) is less than 2(√2 − 1) = 0.83 are schedulable using our scheme, and there are process sets containing two processes for which C1/(A1 − C1) + C2/(A2 − C2) = 2(√2 − 1) + e (for any e > 0) which are not schedulable using our scheme.
Proof. We assume that A1 ≤ A2, and that we use our scheme for calculating periods and priorities. We want to find the minimum value of C1/(A1 − C1) + C2/(A2 − C2) such that the process set is not schedulable. We know that as long as C1/(A1 − C1) ≤ 1, process L1 can be scheduled. This is a trivial observation. Process L2 can be scheduled if R2 ≤ A2/2, i.e. we want to minimize C1/(A1 − C1) + C2/(A2 − C2) under the constraint that R2 = C2 + ⌈R2/(A1 − C1)⌉·C1 > A2/2. If ⌈R2/(A1 − C1)⌉ = k (for some integer k > 0), then the minimum for C1/(A1 − C1) + C2/(A2 − C2) is obtained when R2 = k(A1 − C1) = C2 + ⌈R2/(A1 − C1)⌉·C1 = C2 + kC1 => C2 = k(A1 − 2C1). Consequently, we want to find the k which minimizes C1/(A1 − C1) + k(A1 − 2C1)/(A2 − k(A1 − 2C1)). Since C2 = k(A1 − 2C1) ≤ A2/2 we see that A2 − k(A1 − 2C1) > 0, and since 0 ≤ A1 − 2C1, we see that the minimum for C1/(A1 − C1) + k(A1 − 2C1)/(A2 − k(A1 − 2C1)) is obtained for k = 1. Consequently, we want to minimize C1/(A1 − C1) + (A1 − 2C1)/(A2 − (A1 − 2C1)), under the constraint that R2 = A1 − C1 > A2/2. The minimum is obviously obtained when A1 − C1 is as small as possible, i.e. when A1 − C1 = A2/2 + e (for some infinitely small positive number e). Consequently, we want to minimize C1/(A2/2 + e) + (A2/2 + e − C1)/(A2 − (A2/2 + e − C1)). Without loss of generality we assume that A2 = 2. In that case we obtain the following function (disregarding e)
f(C1) = C1 + (1 − C1)/(1 + C1)
From this we obtain the derivative of f:
f′(C1) = 1 + (−(1 + C1) − (1 − C1))/(1 + C1)² = 1 − 2/(1 + C1)²
By setting f′(C1) = 0 we will find the value C1 which minimizes f:
1 − 2/(1 + C1)² = 0 => (1 + C1)² = 2 => 1 + 2C1 + C1² = 2 => C1 = √2 − 1.
From this we see that min f(C1) = f(√2 − 1) = 2(√2 − 1) = 0.83. It is interesting to note that the schedulability bound for C1/(A1 − C1) + C2/(A2 − C2) is the same as the schedulability bound for rate-monotone scheduling and two processes. However, in that case we have C1/T′1 + C2/T′2 = 2C1/A1 + 2C2/A2 = 0.83. At this point we do not know if it is a coincidence that the values are the same or not. It would be interesting to examine the case with three processes and see if the schedulability bound for C1/(A1 − C1) + C2/(A2 − C2) + C3/(A3 − C3) = 3(∛2 − 1) = 0.78, which is the bound for rate-monotone scheduling and three processes.
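As an illustration of how these results can be used, the following Python sketch contrasts the sufficient bound of Theorem 3 with the exact test Ri ≤ Ai/2 from Section 2 for two processes; the function names are illustrative choices, not code from the paper.

import math

def two_process_bound_test(A1, C1, A2, C2):
    # Sufficient test from Theorem 3 (processes ordered so that A1 <= A2).
    return C1 / (A1 - C1) + C2 / (A2 - C2) < 2 * (math.sqrt(2) - 1)

def two_process_exact_test(A1, C1, A2, C2):
    # Exact test: R2 = C2 + ceil(R2/T1)*C1 with T1 = A1 - C1 must be <= A2/2.
    T1 = A1 - C1
    r = C2
    while True:
        r_new = C2 + math.ceil(r / T1) * C1
        if r_new == r or r_new > A2 / 2:
            return r_new <= A2 / 2
        r = r_new

# A1 = 6, C1 = 2, A2 = 8, C2 = 2: the bound (1/2 + 1/3 = 0.83) is exceeded,
# but the exact test still succeeds (R2 = 4 <= A2/2 = 4), showing that the
# bound is sufficient but not necessary.
print(two_process_bound_test(6, 2, 8, 2))   # False
print(two_process_exact_test(6, 2, 8, 2))   # True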
4
Improving the Scheme
In the previous sections we assumed that the interference from a higher priority process Lj affected the maximum response time Rj+x for a process Lj+x according to the formula Rj+x = Cj+x + · · · + ⌈Rj+x/Tj⌉·Cj. However, if Tj+x = kTj (for some integer k > 0), we can adjust the phasing of Lj and Lj+x in such a way that the interference of Lj on Lj+x is limited to (⌈Rj+x/Tj⌉ − 1)·Cj (see figure 3). The phasing is adjusted in such a way that a release of process Lj+x always occurs at exactly the same time as a release of process Lj. Consequently, if Tj+x < kTj ≤ Tj+x + Cj we can extend the period of Lj+x to kTj. In order to distinguish the periods obtained when using the optimized version of the scheme from the ones obtained using the unoptimized version, we denote the optimized period for process Lj as tj. In order to obtain the optimized periods tj, for a set containing n processes, we start with the periods Tj and we then use the following algorithm:
t1 = T1
for y = 2 to n loop
  ty = Ty
  for j = 1 to y − 1 loop
    if ty < k·tj ≤ Ty + Cj (for some integer k > 0) then ty = k·tj
  end loop
  y = y + 1
end loop
The execution of the process set is started by releasing all processes at the same time.
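A direct transcription of this algorithm might look as follows in Python. This is only an illustrative sketch: the function name and the way the multiplier k is searched are choices made for this example, and the extension condition is the one stated above.

def optimize_periods(T, C):
    # T[i]: unoptimized period of process i (T_i = A_i - R_i), C[i]: its
    # computation time; processes are listed in priority order. Each period is
    # extended to a multiple k*t[j] of an earlier period whenever
    # T[y] < k*t[j] <= T[y] + C[j].
    t = [T[0]]                                  # t1 = T1
    for y in range(1, len(T)):
        ty = T[y]
        for j in range(y):                      # j = 1 .. y-1 in the paper's numbering
            k = 1
            while k * t[j] <= T[y] + C[j]:      # try successive multiples of t[j]
                if ty < k * t[j]:
                    ty = k * t[j]
                k += 1
        t.append(ty)
    return t

# Example from Fig. 3: L1 with T1 = 2, C1 = 1; L2 with T2 = 3.9, C2 = 1.3.
# The period of L2 is extended to 2*T1 = 4, as in scenario (b).
print(optimize_periods([2, 3.9], [1, 1.3]))     # -> [2, 4]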
[Fig. 3 timelines: releases of L1 (A1 = 3, T1 = 2) and L2 over time 0–11; in scenario (a) L2 has period T2 = 3.9 and age constraint A2 = 7.2, in scenario (b) the period of L2 is extended to T2 = 4 and the age constraint A2 = 6.3 can be met.]
Fig. 3. In scenario (a) there are two processes L1 and L2 . Process L1 has a period T1 = 2 and a computation time C1 = 1. Process L2 has a period T2 = 3.9 and a computation time C2 = 1.3. In this scenario we are able to meet the age constraints A1 = 3 and A2 = 7.2, i.e. A2 = T2 + R2 = T2 + C2 + 2C1 = 3.9 + 1.3 + 2 = 7.2. In scenario (b) we consider the same processes, with the exception that the period of L2 has been extended, i.e. T2 = 2T1 = 4. The phasing of L1 and L2 has also been adjusted such that a release of L2 always coincides with a release of L1 . In this scenario we are able to meet the age constraints A1 = 3 and A2 = 6.3, i.e. A2 = T2 + R2 = T2 + C2 + C1 = 4 + 1.3 + 1 = 6.3. Consequently, by extending the period of L2 we were able to meet tougher age constraints.
Theorem 4. A set of processes which is schedulable using the unoptimized periods Ti and an arbitrary phasing of processes is also schedulable using the optimized periods ti, provided that we are able to adjust the phasing of the processes.
Proof. Let ri denote the maximum response time for process Li using the periods ti and the optimized phasing. Obviously, Ti ≤ ti. From this we conclude that ri = Ci + Σ_{j=1}^{i−1} ⌈ri/tj⌉·Cj ≤ Ri = Ci + Σ_{j=1}^{i−1} ⌈Ri/Tj⌉·Cj. Consequently, if Ri ≤ Ai/2, then ri ≤ Ai/2.
Figure 4 shows a process set which is schedulable using the improved version of our scheme. This process set would not have been schedulable using the unoptimized version of our scheme. Consequently, the improved version of the scheme is better in the sense that we are able to schedule a larger number of process sets.
[Fig. 4 timeline: the schedules of L1 (A1 = 7, C1 = 1), L2 (A2 = 8, C2 = 2) and L3 (A3 = 9, C3 = 3) over time 0–18; all three processes obtain the optimized period t = 6, and S and E mark the start and end of each computation.]
Fig. 4. Three processes which are schedulable using the improved scheme.
5
Conclusions
In this paper a method for scheduling age constraint processes has been presented. An age constraint process is defined by two values: the maximum time from the beginning of the process’ execution in cycle k to the end of the process’ execution in cycle k+1 (the age constraint), and the maximum computation time in each cycle, i.e. the period of each process is not explicitly defined. We present an algorithm for calculating process periods. Once the periods have been calculated the process set can be executed using preemption and fixed priority scheduling. The priorities are defined by the rate-monotone algorithm. Trivially, the age constraint can be met by using periods equal to half the age constraint. However, the periods obtained from our method are better in the sense that they make it possible to schedule a larger number of process sets
than we would have been able to do if we had used periods equal to half the age constraint. All process sets which are schedulable using periods equal to half the age constraint are also schedulable using our method. A simple analysis of our method shows that the rate-monotone priority assignment algorithm is optimal for all process sets containing two processes, using our process periods. We also show that all process sets with two processes, for which C1/(A1 − C1) + C2/(A2 − C2) is less than 2(√2 − 1) = 0.83, are schedulable using our scheme. There are, however, process sets containing two processes for which C1/(A1 − C1) + C2/(A2 − C2) = 2(√2 − 1) + e (for any e > 0) which are not schedulable using our scheme, i.e. we provide a simple schedulability test for process sets containing two processes. We also define an improved version of our method. The improved version capitalizes on the fact that there is room for optimization when the period of one process is an integer multiple of the period of another process. The improved version of the method is better than the original version in the sense that we are able to schedule a larger number of process sets. All process sets which are schedulable using the original version are also schedulable using the improved version. Previous work on age constraint processes has concentrated on creating cyclic interleavings of processes [1]. Other studies have looked at age constraint processes which communicate [5]. One such scenario is scheduling of age constraint processes in the context of hard real-time database systems [2].
References
1. W. Albrecht and R. Wisser, “Schedulers for Age Constraint Tasks and Their Performance Evaluation”, in Proceedings of the Third International Euro-Par Conference, Passau, Germany, August 1997.
2. N. C. Audsley, A. Burns, M.F. Richardson, and A.J. Wellings, “Absolute and Relative Temporal Constraints in Hard Real-Time Databases”, in Proceedings of the IEEE Euromicro Workshop on Real-Time Systems, Los Alamitos, California, February 1992.
3. A. Burns and A. Wellings, “Real-Time Systems and Programming Languages”, Addison-Wesley, 1996.
4. J.P. Lehoczky, L. Sha, and Y. Ding, “The rate monotone scheduling algorithm: exact characterization and average case behavior”, in Proceedings of the IEEE 10th Real-Time Systems Symposium, December 1989.
5. X. Song and J. Liu, “How Well can Data Temporal Consistency be Maintained”, in Proceedings of the 1992 IEEE Symposium on Computer Aided Control Systems Design, Napa, California, USA, March 1992.
Workshop 03: Scheduling and Load Balancing Susan Flynn Hummel, Graham Riley, and Rizos Sakellariou Co-chairmen
Introduction
Mapping a parallel computation onto a parallel computer system is one of the most important questions for the design of efficient parallel algorithms. In the case of irregular data structures, the problem of initially distributing and then maintaining an even workload as computation progresses becomes very complex. This workshop will discuss the state of the art in this area. Besides the discussion of novel techniques for mapping and scheduling irregular dynamic or static computations onto a processor architecture, a number of open problems are of special interest for this workshop: One is the development of partitioning algorithms used in numerical simulation, taking into account not only the cutsize but also the special characteristics of the application and their impact on the partitioning. Another topic of special relevance for this workshop concerns re-partitioning algorithms for irregular and adaptive finite element computations that minimize the movement of vertices in addition to balancing the load and minimising the cut of the resulting new partition. A third topic is the development of dynamic load balancing algorithms that adapt themselves to the special characteristics of the underlying parallel computer, easing the development of portable applications. Other topics, in this incomplete list of open problems, relate to application-specific graph partitioning, adaptable load balancing algorithms, scheduling algorithms, novel applications of scheduling and load balancing, load balancing on workstation clusters, parallel graph partitioning algorithms, etc.
The Papers
The papers selected for presentation in the workshop were split into four sessions, each of which consists of work covering a wide spectrum of different topics. We would like to sincerely thank the more than 20 referees that assisted us in the reviewing process. Rudolf Berrendorf considers the problem of minimising load imbalance along with communication on distributed shared memory systems; a graph-based technique using data gathered from execution times of different program tasks and memory access behaviour is described. An approach for repartitioning and data remapping for numerical computations with highly irregular workloads on distributed memory machines is presented by Leonid Oliker, Rupak Biswas, and
Harold Gabow. Sajal Das and Azzedine Boukerche experiment with a load balancing scheme for the parallelisation of discrete event simulation applications. The problem of job scheduling on a parallel machine is tackled by Christophe Rapine, Isaac Scherson and Denis Trystram, who analyse an on-line time/space scheduling scheme. The second session starts with a paper by Wolf Zimmermann, Martin Middendorf and Welf Loewe who consider k-linear scheduling algorithms on the LogP machine. Cristina Boeres, Vinod Rebello and David Skillicorn consider also a LogP-type model and investigate the effect of communication overheads. The problem of mesh partitioning and a measure for efficient heuristics is analysed by Ralf Diekmann, Robert Preis, Frank Schlimbach and Chris Walshaw. Konstantinos Antonis, John Garofalakis and Paul Spirakis analyse a strategy for balancing the load on a two-server distributed system. The third session starts with Salvatore Orlando and Raffaele Perego who present a method for load balancing of statically predicted stencil computations in heterogeneous environments. Fabricio Alves Barbosa da Silva, Luis Miguel Campos and Isaac Scherson consider the problem of parallel job scheduling again. A general architectural framework for building parallel schedulers is described by Gerson Cavalheiro, Yves Denneulin and Jean-Louis Roch. A dynamic loop scheduling algorithm which makes use of information from previous executions of the loop is presented by Mark Bull. A divide-and-conquer strategy motivated by problems arising in finite element simulations is analysed by Stefan Bischof, Ralf Ebner and Thomas Erlebach. The fourth session starts with a paper by Dingchao Li, Akira Mizuno, Yuji Iwahori and Naohiro Ishii that proposes an algorithm for scheduling a task graph on an heterogeneous parallel architecture. Mario Dantas considers a scheduling mechanism within an MPI implementation. An approach of using alternate schedules for achieving fault tolerance on a network of workstations is described by Dibyendu Das. An upper bound for the load distribution by a randomized algorithm for embedding bounded degree trees in arbitrary networks is computed by Jaafar Gaber and Bernard Toursel.
Optimizing Load Balance and Communication on Parallel Computers with Distributed Shared Memory Rudolf Berrendorf Central Institute for Applied Mathematics Research Centre J¨ ulich D-52425 J¨ ulich, Germany [email protected]
Abstract. To optimize programs for parallel computers with distributed shared memory two main problems need to be solved: load balance between the processors and minimization of interprocessor communication. This article describes a new technique called data-driven scheduling which can be used on sequentially iterated program regions on parallel computers with a distributed shared memory. During the first execution of the program region, statistical data on execution times of tasks and memory access behaviour are gathered. Based on this data, a special graph is generated to which graph partitioning techniques are applied. The resulting partitioning is stored in a template which is used in subsequent executions of the program region to efficiently schedule the parallel tasks of that region. Data-driven scheduling is integrated into the SVMFortran compiler. Performance results are shown for the Intel Paragon XP/S with the DSM-extension ASVM and for the SGI Origin2000.
1
Introduction
Parallel computers with a global address space share an important abstraction appreciated by programmers as well as compiler writers: the global, linear address space seen by all processors. To build such a computer in a scalable and economical way, such systems usually distribute the memory with the processors. Parallel computers with physically distributed memory but a global address space are termed distributed shared memory machines (DSM). Examples are the SGI Origin2000, the KSR-1, and the Intel Paragon XP/S with ASVM [3]. To implement the global address space on top of a distributed memory, techniques for multi-cache systems are used which distinguish between read and write operations. If processors read from a memory location, the data is copied to the local memory of that processor where it is cached. On a write operation of a processor, this processor gets exclusive ownership of this location and all read copies get invalidated. The unit of coherence (and therefore the unit of communication) is a cache line or a page of the virtual memory system. For this reason, care has to be taken to avoid false sharing (independent data objects are mapped to the same page).
There are two main problems to be solved for parallel computers with a distributed shared memory and a large number of processors: load balancing and minimization of interprocessor communication. Data-driven scheduling is a new approach to solve both problems. The compiler modifies the code such that at run-time data on task times and memory access behaviour of the tasks is gathered. With this data a special graph is generated and partitioned for communication minimization and load balance. The partitioning result is stored in a template and it is used in subsequent executions of the program region to efficiently schedule the parallel tasks to the processors. The paper is organized as follows. After giving an overview of related work in section 2, section 3 gives an introduction to SVM-Fortran. In section 4 the concept of data-driven scheduling is discussed and in section 5 performance results are shown for an application executed on two different machines. Section 6 concludes and gives a short outlook of further work.
2
Related Work
There are a number of techniques known for parallel machines which try to balance the load, or to minimize the interprocessor communication, or both. Dynamic scheduling methods are well known as an attempt to balance the load on parallel computers, usually for parallel computers with a physically shared memory. The most rigid approach is self scheduling [10] where each idle processor requests only one task to be executed. With Factoring [7] each idle processor requests at the beginning of the scheduling process larger chunks of tasks to reduce the synchronization overhead. All of these dynamic scheduling techniques take in some sense a greedy approach and therefore they have problems if the tasks at the end of the scheduling process have significantly larger execution times than tasks scheduled earlier. Another disadvantage is the local scheduling aspect with respect to one parallel loop only and the fact that the data locality aspect is not taken into account. There are several scheduling methods known which have as a main objective the generation of data locality. Execute-on-Home [6] uses the information of a data distribution to execute tasks on that processor to which the accessed data is assigned (i.e. the processor which owns the data). An efficient implementation of the execute-on-home scheduling based on data distributions is often difficult if the data is accessed in an indirect way. In that case, run-time support is necessary. CHAOS/PARTI [13] (and in a similar manner RAPID [4]) is an approach to handle indirect accesses as is common, for example, in sparse matrix problems. In an inspector phase the indices for indirection are examined and a graph is generated which includes through the edges the relationship between the data. Then the graph is partitioned and the data is redistributed according to the partitioning. Many problems in the field of technical computing are modeled by the technique of finite elements. The original (physical) domain is partitioned into discrete elements connected through nodes. A usual approach of mapping such
problems to parallel computers is the partitioning (e.g. [8]) of the resulting grid such that equally sized subgrids are mapped to the processors. For regular grids, partitioning can be done as a geometric partitioning. Non-uniform computation times involved with each node and data that need to be communicated between the processors have to be taken into account in the partitioning step. The whole problem of mapping such a finite element graph onto a parallel computer can be formulated as a graph partitioning problem where the nodes of the graph are the tasks associated with each node of the finite element grid and the edges represent communication demands. There are several software libraries available which can be used to partition such graphs (e.g. Metis [8], Chaco [5], Party [9], JOSTLE [12]). Getting realistic node costs (i.e. task costs) is often difficult for non-uniform task execution times. Also, graph partitioners take no page boundaries into account when they map finite element nodes onto the memory of a DSM-computer and thus they ignore the false sharing problem. [11] discusses a technique to reorder the nodes after the partitioning phase with the aim to minimize communication.
3
SVM-Fortran
SVM-Fortran [2] is a shared memory parallel Fortran77 extension targeted mainly towards data parallel applications on DSM-systems. SVM-Fortran supports coarse-grained functional parallelism where a parallel task itself can be data parallel. A compiler and run-time system is implemented on several parallel machines such as Intel Paragon XP/S with ASVM, SGI Origin2000, and SUN and DEC multiprocessor machines. SVM-Fortran provides standard features of shared memory parallel Fortran languages as well as specific features for DSM-computers. In SVM-Fortran the main concept to generate data locality and to balance the load is the dedicated assignment of parallel work to processors, e.g. the distribution of iterations of a parallel loop to processors. Data locality is not a problem to be solved on the level of individual loops but it is a global problem. SVM-Fortran uses the concept of processor arrangements and templates as a tool to specify scheduling decisions globally via template distributions. Loop iterations are assigned to processors according to the distribution of the appropriate template element. Therefore, in SVM-Fortran templates are used to distribute the work rather than used to distribute the data as it is done in HPF. Different to HPF, it is not necessary for the SVM-Fortran-compiler to know the distribution of a template at compile time.
4
Data-Driven Scheduling
User-directed scheduling, where the user specifies the distribution of work to processors (e.g. specifying a block distribution for a template), makes sense and
is efficient if the user has reliable information on the access behaviour (interprocessor communication) and execution times (load balance) of the parallel tasks in the application. But the prerequisite for this is often a detailed understanding of the program, all data structures, and their content at run-time. If the effort to gain this information is too high or if the interrelations are too complex (e.g. indirect addressing on several levels, unpredictable task execution times), the help of a tool or an automated scheduling is desirable. Data-driven scheduling tries to help the programmer in solving the load balancing problem as well as the communication problem. Due to the way data-driven scheduling works, the new technique can be applied to program regions which are iterated several times (sequentially iterated program regions; see Fig. 1 for an example). This type of program region is common in iterative algorithms (e.g. iterative solvers), nested loops where the outer loop has not enough parallelism, or nested loops where data dependencies prohibit the parallel execution of the outer loop. The basic idea behind data-driven scheduling is to gather run-time data on task execution times and memory access behaviour on the first execution of the program region, use this information to find a good schedule which optimizes communication and load balance, and use the computed schedule in subsequent executions of the program region. For simplicity, we restrict the description to schedule regions with one parallel loop inside, although more complex structures can be handled. Fig. 1 shows a small example program. This program is an artificial program to explain the basic idea. In a real program, the false sharing problem introduced with variable a could be easily solved with the introduction of a second array. In this example, the strategy to optimize for data locality and therefore how to distribute the iterations of the parallel loop to the processors might differ dependent on the content of the index field ix. On the other hand, the execution time for each of the iterations might differ significantly based on the call to work(i) and therefore load balancing might be a challenging task. Data-driven scheduling tries to model both optimization aspects with the help of a weighted graph. The compiler generates two different codes (code-1, code-2) for the program region. Code-1 is executed on the first execution of the program region¹. Besides the normal code generation (e.g. handling of PDOs), the code is further instrumented to gather statistics. At run-time, execution times of tasks and accesses to shared variables (marked with the VARS-attribute in the directive) are stored in parallel in distributed data structures. Reaching the end of the schedule region, the gathered data is processed. A graph is generated by the run-time system where each node of the graph represents a task and the node weights are the task’s execution time. To get a realistic value, overhead times (measurement times, page miss times, etc.) and synchronization times have to be subtracted from the measured execution times. To explain the generation of edges in the graph, have a look at Fig. 2(a), which shows the assignments done in the example program.
¹ The user can reset a schedule region through a directive, in which case code-1 is executed again, for example if the content of an index field has changed.
C --- simple, artificial example to explain the basic idea ---
C --- a memory page shall consist of 2 words
      PARAMETER (n=4)
      REAL,SHARED,ALIGNED a(2,n)          ! shared variable
      INTEGER ix(n)                       ! index field
      DATA ix /1,2,1,2/
C --- declare proc. field, template, and distribute template ---
CSVM$ PROCESSORS:: proc(numproc())
CSVM$ TEMPLATE templ(n)
CSVM$ DISTRIBUTE (BLOCK) ONTO proc:: templ
      ...
C --- sequentially iterated program region ---
      DO iter=1,niter
C --- parallel loop enclosed in schedule region ---
CSVM$ SCHEDULE REGION(ID(1),VARS(a))
CSVM$ PDO (STRATEGY (ON HOME (templ(i))))
      DO i=1,n
        a(1,i) = a(2,ix(i))               ! possible communication
        CALL work(i)                      ! possible load imbalance
      ENDDO
CSVM$ SCHEDULE REGION END
      ENDDO
Fig. 1. (Artificial) Example of a sequentially iterated program region.
(a) Executed assignments: task 1: a(1,1) = a(2,1); task 2: a(1,2) = a(2,2); task 3: a(1,3) = a(2,1); task 4: a(1,4) = a(2,2) (tasks 1 and 3 access page 1, tasks 2 and 4 access page 2).
(b) Generated graph: four task nodes (node cost 10 each) with edges of cost 1 between tasks 1 and 3 and between tasks 2 and 4.
Fig. 2. Memory accesses and resulting graph for the example program.

           task 1   task 2   task 3   task 4   #tasks
page 1      w/r                 r                 2
page 2               w/r                 r        2
page 3                          w                 1
page 4                                   w        1
task time    10       10       10       10
Fig. 3. Page accesses and task times (w=write access, r=read access).
Tasks 1 and 3 access page 1, and tasks 2 and 4 access page 2, either with a read access or with a write access. Fig. 3 shows the corresponding access table as it is generated internally as a distributed sparse data structure at run-time. The most interesting column is labeled #tasks and gives the number of different tasks accessing a page. If this entry is 1, no conflicts exist for this page and therefore no communication between processors is necessary for this page, independent of the schedule of tasks to processors. If two or more tasks access a page (#tasks > 1), of which at least one access is a write access, communication might be necessary due to the multi-cache protocol if these tasks are not assigned to the same processor. This is modeled in the graph with an edge between the two tasks/nodes with edge costs of one page miss. Fig. 2(b) shows the graph corresponding to the access table in Fig. 3 (for this simple example, task times are assumed to be 10 units and a page miss counts 1 unit). The handling of compulsory page misses is difficult as this depends on previous data accesses. If the distribution of pages to processors can be determined at run time (e.g. in ASVM [3]) and if this distribution is similar for all iterations, then these misses can be modeled by artificial nodes in the graph which are pre-assigned to processors. In the next step, this graph is partitioned. We found that multilevel algorithms as found in most partitioning libraries give good results over a large variety of graphs. The partitioning result (i.e. the mapping of task i to processor j) is stored in the template given in the PDO-directive. The data gathering phase is done in parallel, but the graph generation and partitioning phase is done sequentially on one processor. We are investigating the use of parallel partitioners in the future. Code-2 is executed at the second and all subsequent executions of the program region; this code is optimized code without any instrumentation. The iterations of the parallel loop are distributed according to the template distribution, which is the schedule found in the graph partitioning step.
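To make the edge-generation rule concrete, the following Python sketch builds such a weighted task graph from a page-access table. The data layout (a dictionary mapping pages to per-task access modes) and the fixed costs are assumptions made for this illustration and do not reflect the actual run-time data structures of the SVM-Fortran system.

def build_task_graph(page_accesses, task_times, page_miss_cost=1):
    # page_accesses: {page: {task: "r" | "w" | "w/r"}}.
    # Returns node weights (task execution times) and an edge map with weights:
    # an edge is added between every pair of tasks that touch the same page
    # when at least one of them writes to that page.
    nodes = dict(task_times)                      # node weight = task time
    edges = {}
    for page, accesses in page_accesses.items():
        tasks = sorted(accesses)
        if len(tasks) < 2:
            continue                              # #tasks = 1: no conflict, no edge
        if not any("w" in mode for mode in accesses.values()):
            continue                              # read-only sharing: no invalidations
        for i, t1 in enumerate(tasks):
            for t2 in tasks[i + 1:]:
                key = (t1, t2)
                edges[key] = edges.get(key, 0) + page_miss_cost
    return nodes, edges

# Example corresponding to Figs. 2 and 3: tasks 1 and 3 share page 1, tasks 2
# and 4 share page 2, each page written by one of the sharers.
pages = {1: {1: "w/r", 3: "r"}, 2: {2: "w/r", 4: "r"}, 3: {3: "w"}, 4: {4: "w"}}
times = {1: 10, 2: 10, 3: 10, 4: 10}
print(build_task_graph(pages, times))
# -> ({1: 10, 2: 10, 3: 10, 4: 10}, {(1, 3): 1, (2, 4): 1})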
5
Results
We show performance results for the ASEMBL-program which is the assembly of a finite element matrix done in a sequential loop modeling time steps as it is used in many areas of technical computing. The results were obtained on an Intel Paragon XP/S with the extension ASVM [3] implementing SVM in software (embedded in the MACH3 micro kernel) and on a SGI Origin2000 with a fast DSM-implementation in hardware. A multilevel algorithm with Kernighan-Lin refinement of the Metis library [8] was used for partitioning the graph. For the Paragon XP/S, a data set with 900 elements was taken, and for the Origin2000 a data set with 8325 elements was chosen. Due to a search loop in every parallel task, task times differ. The sparse matrix to be assembled is allocated in the shared memory region; the variable accesses are protected by page locks where only one processor can access a page but accesses to different pages can be done in parallel. The bandwidth of the sparse matrix is small,
therefore strategies which partition the iteration space in clusters will generate better data locality. Fig. 4 shows speedup values for the third sequential time step. Because of the relatively small size of the shared array and the large page size on Paragon XP/S (8 KB), the performance improvements are limited to a small number of processors. On both systems, the performance with the Factoring strategy is limited because data locality (and communication) is of no concern with this strategy. Particularly on the Paragon system with high page miss times, this strategy shows no real improvements. Although the shared matrix has a low bandwidth and therefore the Block strategy is favored, differences in task times cause load imbalances with the Block strategy. Data-driven Scheduling (DDS) performs better than the other strategies on both systems.
Fig. 4. Results for ASEMBL: (a) speedup on Paragon XP/S, (b) speedup on SGI Origin2000.
6
Summary
We have introduced data-driven scheduling to optimize interprocessor communication and load balance on parallel computers with a distributed shared memory. With the new technique, statistics on task execution times and data sharing are gathered at run-time. With this data, a special graph is generated and graph partitioning techniques are applied. The resulting partitioning is stored in a template and used in subsequent executions of that program region to efficiently schedule the parallel tasks to the processors. We have compared this method with two loop scheduling methods on two DSM-computers where data-driven scheduling performs better than the other two methods.
Currently, we use deterministic values for all model parameters. To refine our model, we plan to investigate non-deterministic values, e.g. the number of page misses and page miss times.
7
Acknowledgments
Reiner Vogelsang (SGI/Cray) and Oscar Plata (University of Malaga) gave me access to SGI Origin2000 machines. Heinz Bast (Intel SSD) supported me on the Intel Paragon. I would like to thank the developers of the graph partitioning libraries I used in my work, namely: George Karypis (Metis), Chris Walshaw (Jostle), Bruce A. Hendrickson (Chaco), and Robert Preis (Party).
References
1. R. Berrendorf, M. Gerndt. Compiling SVM-Fortran for the Intel Paragon XP/S. Proc. Working Conference on Massively Parallel Programming Models (MPPM'95), pages 52–59, Berlin, October 1995. IEEE Society Press.
2. R. Berrendorf, M. Gerndt. SVM Fortran reference manual version 1.4. Technical Report KFA-ZAM-IB-9510, Research Centre Jülich, April 1995.
3. R. Berrendorf, M. Gerndt, M. Mairandres, S. Zeisset. A programming environment for shared virtual memory on the Intel Paragon supercomputer. In Proc. Intel User Group Meeting, Albuquerque, NM, June 1995. http://www.cs.sandia.gov/ISUG/ps/pesvm.ps.
4. C. Fu, T. Yang. Run-time compilation for parallel sparse matrix computations. In Proc. ACM Int'l Conf. Supercomputing, pages 237–244, 1996.
5. B. Hendrickson, R. Leland. The Chaco user's guide, version 2.0. Technical Report SAND95-2344, Sandia National Lab., Albuquerque, NM, July 1995.
6. High Performance Fortran Forum. High Performance Fortran Language Specification, 2.0 edition, January 1997.
7. S. F. Hummel, E. Schonberg, L. E. Flynn. Factoring - a method for scheduling parallel loops. Comm. ACM, 35(8):90–101, August 1992.
8. G. Karypis, V. Kumar. Analysis of multilevel graph partitioning. Technical Report 95-037, Univ. Minnesota, Department of Computer Science, 1995.
9. R. Preis, R. Dieckmann. The PARTY Partitioning-Library, User Guide, Version 1.1. Univ. Paderborn, September 1996.
10. P. Tang, P.-C. Yew. Processor self-scheduling for multiple nested parallel loops. In Proc. IEEE Int'l Conf. Parallel Processing, pages 528–535, August 1986.
11. K. A. Tomko, S. G. Abraham. Data and program restructuring of irregular applications for cache-coherent multiprocessors. In Proc. ACM Int'l Conf. Supercomputing, pages 214–225, July 1994.
12. C. Walshaw, M. Cross, M.G. Everett, S. Johnson, K. McManus. Partitioning & mapping of unstructured meshes to parallel machine topologies. In Proc. Irregular '95: Parallel Algorithms for Irregularly Structured Problems, volume 980 of LNCS, pages 121–126. Springer, 1995.
13. J. Wu, R. Das, J. Saltz, H. Berryman, S. Hiranandani. Distributed memory compiler design for sparse problems. IEEE Trans. Computers, 44(6), 1995.
Performance Analysis and Portability of the PLUM Load Balancing System

Leonid Oliker (1), Rupak Biswas (2), and Harold N. Gabow (3)

(1) RIACS, NASA Ames Research Center, Moffett Field, CA 94035, USA
(2) MRJ, NASA Ames Research Center, Moffett Field, CA 94035, USA
(3) CS Department, University of Colorado, Boulder, CO 80309, USA
Abstract. The ability to dynamically adapt an unstructured mesh is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult. To address this problem, we have developed PLUM, an automatic portable framework for performing adaptive numerical computations in a message-passing environment. PLUM requires that all data be globally redistributed after each mesh adaption to achieve load balance. We present an algorithm for minimizing this remapping overhead by guaranteeing an optimal processor reassignment. We also show that the data redistribution cost can be significantly reduced by applying our heuristic processor reassignment algorithm to the default mapping of the parallel partitioner. Portability is examined by comparing performance on a SP2, an Origin2000, and a T3E. Results show that PLUM can be successfully ported to different platforms without any code modifications.
1
Introduction
The ability to dynamically adapt an unstructured mesh is a powerful tool for efficiently solving computational problems with evolving physical features. Standard fixed-mesh numerical methods can be made more cost-effective by locally refining and coarsening the mesh to capture these phenomena of interest. Unfortunately, an efficient parallelization of these adaptive methods is rather difficult, primarily due to the load imbalance created by the dynamically-changing nonuniform grid. Nonetheless, it is generally thought that unstructured adaptive-grid techniques will constitute a significant fraction of future high-performance supercomputing. With this goal in mind, we have developed a novel method, called PLUM [7], that dynamically balances processor workloads with a global view when performing adaptive numerical calculations in a parallel message-passing environment. The mesh is first partitioned and mapped among the available processors. Once an acceptable numerical solution is obtained, the mesh adaption procedure [8] is invoked. Mesh edges are targeted for coarsening or refinement based on an error indicator computed from the solution. The old mesh is then coarsened, resulting in a smaller grid. Since edges have already been marked for refinement, the new mesh can be exactly predicted before actually performing the refinement step. Program control is thus passed to the load balancer at this time.
If the current partitions will become load imbalanced after adaption, a repartitioner is used to divide the new mesh into subgrids. The new partitions are then reassigned among the processors in a way that minimizes the cost of data movement. If the remapping cost is compensated by the computational gain that would be achieved with balanced partitions, all necessary data is appropriately redistributed. Otherwise, the new partitioning is discarded. The computational mesh is then refined and the numerical calculation is restarted.
2 Dynamic Load Balancing

2.1 Repartitioning the Initial Mesh Dual Graph
Repeatedly using the dual of the initial computational mesh for dynamic load balancing is one of the key features of PLUM [7]. Each dual graph vertex has a computational weight, wcomp, and a remapping weight, wremap. These weights model the processing workload and the cost of moving the corresponding element from one processor to another. Every dual graph edge also has a weight, wcomm, that models the runtime communication. New computational grids obtained by adaption are represented by modifying these three weights. If the dual graph with a new set of wcomp is deemed unbalanced, the mesh is repartitioned.

2.2 Processor Reassignment
New partitions generated by a partitioner are mapped to processors such that the data redistribution cost is minimized. In general, the number of new partitions is an integer multiple F of the number of processors, and each processor is assigned F partitions. Allowing multiple partitions per processor reduces the volume of data movement but increases the partitioning and reassignment times [7]. We first generate a similarity measure M that indicates how the remapping weights wremap of the new partitions are distributed over the processors. It is represented as a matrix where entry Mij is the sum of the wremap values of all the dual graph vertices in new partition j that already reside on processor i. Various cost functions are usually needed to solve the processor reassignment problem using M for different machine architectures. We present three general metrics: TotalV, MaxV, and MaxSR, which model the remapping cost on most multiprocessor systems. TotalV minimizes the total volume of data moved among all the processors, MaxV minimizes the maximum flow of data to or from any single processor, while MaxSR minimizes the sum of the maximum flow of data to and from any processor. Experimental results [2] have indicated the usefulness of these metrics in predicting the actual remapping costs. A greedy heuristic algorithm to minimize the remapping overhead is also presented. TotalV Metric. The TotalV metric assumes that by reducing network contention and the total number of elements moved, the remapping time will be reduced. In general, each processor cannot be assigned F unique partitions cor-
responding to their F largest weights. To minimize TotalV, each processor i must be assigned F partitions j_i^f, f = 1, 2, ..., F, such that the objective function

F = Σ_{i=1}^{P} Σ_{f=1}^{F} M_{i, j_i^f}
is maximized subject to the constraint j_i^r ≠ j_k^s, for i ≠ k or r ≠ s; i, k = 1, 2, ..., P; r, s = 1, 2, ..., F. We can optimally solve this by mapping it to a network flow optimization problem described as follows. Let G = (V, E) be an undirected graph. G is bipartite if V can be partitioned into two sets A and B such that every edge has one vertex in A and the other vertex in B. A matching is a subset of edges, no two of which share a common vertex. A maximum-cardinality matching is one that contains as many edges as possible. If G has a real-valued cost on each edge, we can consider the problem of finding a maximum-cardinality matching whose total edge cost is maximized. We refer to this as the maximally weighted bipartite graph (MWBG) problem (also known as the assignment problem). When F = 1, optimally solving the TotalV metric trivially reduces to MWBG, where V consists of P processors and P partitions in each set. An edge of weight Mij exists between vertex i of the first set and vertex j of the second set. If F > 1, the processor reassignment problem can be reduced to MWBG by duplicating each processor and all of its incident edges F times. Each set of the bipartite graph then has P × F vertices. After the optimal solution is obtained, the solutions for all F copies of a processor are combined to form a one-to-F mapping between the processors and the partitions. The optimal solution for the TotalV metric and the corresponding processor assignment of an example similarity matrix is shown in Fig. 1(a). The fastest MWBG algorithm can compute a matching in O(|V|^2 log |V| + |V||E|) time [3], or in O(|V|^{1/2} |E| log(|V|C)) time if all edge costs are integers of absolute value at most C [5]. We have implemented the optimal algorithm with a runtime of O(|V|^3). Since M is generally dense, |E| ≈ |V|^2, implying that we should not see a dramatic performance gain from a faster implementation.

MaxV Metric. The metric MaxV, unlike TotalV, considers data redistribution in terms of solving a load imbalance problem, where it is more important to minimize the workload of the most heavily-weighted processor than to minimize the sum of all the loads. During the process of remapping, each processor must pack and unpack send and receive buffers, incur remote-memory latency time, and perform the computational overhead of rebuilding internal and shared data structures. By minimizing max{α × max(ElemsSent), β × max(ElemsRecd)}, where α and β are machine-specific parameters, MaxV attempts to reduce the total remapping time by minimizing the execution time of the most heavily-loaded processor. We can solve this optimally by considering the problem of finding a maximum-cardinality matching whose maximum edge cost is minimum. We refer to this as the bottleneck maximum cardinality matching (BMCM) problem. To find the BMCM of the graph G corresponding to the similarity matrix, we first need to transform M into a new matrix M′. Each entry M′_ij represents
the maximum cost of sending data to or receiving data from processor i and partition j:

M′_ij = max( α Σ_{y=1, y≠j}^{P} M_iy , β Σ_{x=1, x≠i}^{P} M_xj ).

Fig. 1. Various cost metrics of a similarity matrix M for P = 4 and F = 1 using (a) the optimal MWBG, (b) the optimal BMCM, (c) the optimal DBMCM, and (d) our heuristic algorithms. (TotalV, MaxV, MaxSR moved: (a) 525, 275, 485; (b) 640, 245, 475; (c) 570, 255, 465; (d) 550, 260, 470.)
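The transformation just defined is straightforward to compute from the similarity matrix. The sketch below is illustrative only (it is not PLUM code); it evaluates M′ for a dense matrix M with machine-specific weights α and β.

```cpp
#include <algorithm>
#include <vector>

// Build M' for the MaxV metric: M'[i][j] is the larger of the (weighted) data
// processor i would have to send away and the data it would have to receive,
// if partition j were assigned to processor i.
std::vector<std::vector<double>>
maxvTransform(const std::vector<std::vector<double>>& M, double alpha, double beta)
{
    const int P = static_cast<int>(M.size());
    std::vector<double> rowSum(P, 0.0), colSum(P, 0.0);
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) { rowSum[i] += M[i][j]; colSum[j] += M[i][j]; }

    std::vector<std::vector<double>> Mp(P, std::vector<double>(P));
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            double send = alpha * (rowSum[i] - M[i][j]);  // sum over y != j of M[i][y]
            double recv = beta  * (colSum[j] - M[i][j]);  // sum over x != i of M[x][j]
            Mp[i][j] = std::max(send, recv);
        }
    return Mp;
}
```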
Currently, our framework for the MaxV metric is restricted to F = 1. We have implemented the BMCM algorithm of Bhat [1] which combines a maximum cardinality matching algorithm with a binary search, and runs in O(|V |1/2 |E| log |V |). The fastest known BMCM algorithm, proposed by Gabow and Tarjan [4], has a runtime of O((|V | log |V |)1/2 |E|). The new processor assignment for the similarity matrix in Fig. 1 using this approach with α = β = 1 is shown in Fig. 1(b). Notice that the total number of elements moved in Fig. 1(b) is larger than the corresponding value in Fig. 1(a); however, the maximum number of elements moved is smaller. MaxSR Metric. Our third metric, MaxSR, is similar to MaxV in the sense that the overhead of the bottleneck processor is minimized during the remapping phase. MaxSR differs, however, in that it minimizes the sum of the heaviest data flow from any processorand to any processor, expressed as α×max(ElemsSent) + β×max(ElemsRecd) . We refer to this as the double bottleneck maximum cardinality matching (DBMCM) problem. The MaxSR formulation allows us to capture the computational overhead of packing and unpacking data, when these two phases are separated by a barrier synchronization. Additionally, the MaxSR metric may also approximate the many-to-many communication pattern of our remapping phase. Since a processor can either be sending or receiving data, the overhead of these two phases should be modeled as a sum of costs. We have developed an algorithm for computing the minimum MaxSR of the graph G corresponding to our similarity matrix. We first transform M to a new matrix M . Each entry Mij contains a pair of values (Send,Receive) correspond-
ing to the total cost of sending and receiving data, when partition j is mapped to processor i:

M′_ij = ( S_ij , R_ij ),  where  S_ij = α Σ_{y=1, y≠j}^{P} M_iy  and  R_ij = β Σ_{x=1, x≠i}^{P} M_xj .
Currently, our algorithm for the MaxSR metric is restricted to F = 1. Let σ1 , σ2 , . . . , σk be the distinct Send values appearing in M , sorted in increasing order. Thus, σi < σi+1 and k ≤ P 2 . Form the bipartite graph Gi = (V, Ei ), where V consists of processor vertices u = 1, 2, . . . , P and partition vertices v = 1, 2, . . . , P , and Ei contains edge (u, v) if Suv ≤ σi ; furthermore, edge (u, v) has weight Ruv if it is in Ei . For small values of i, graph Gi may not have a perfect matching. Let imin be the smallest index such that Gimin has a perfect matching. Obviously, Gi has a perfect matching for all i ≥ imin . Solving the BMCM problem of Gi gives a matching that minimizes the maximum Receive edge weight. It gives a matching with MaxSR value at most σi + MaxV(Gi ). Defining MaxSR(i) =
MaxSR(i) = min_{imin ≤ j ≤ i} ( σ_j + MaxV(G_j) ),
it is easy to see that MaxSR(k) equals the correct value of MaxSR. Thus, our algorithm computes MaxSR by solving k BMCM problems on the graphs Gi and computing the minimum value MaxSR(k). However, we can prematurely terminate the algorithm if there exists an imax such that σimax +1 ≥ MaxSR(imax), since it is then guaranteed that the MaxSR solution is MaxSR(imax). Our implementation has a runtime of O(|V |1/2 |E|2 log |V |) since the BMCM algorithm is called |E| times in the worst case; however, it can be decreased to O(|E|2 ). The following is a brief sketch of this more efficient implementation. Suppose we have constructed a matching M that solves the BMCM problem of Gi for i ≥ imin . We solve the BMCM problem of Gi+1 as follows. Initialize a working graph G to be Gi+1 with all edges of weight greater than MaxV(Gi ) deleted. Take the matching M on G, and delete all unmatched edges of weight MaxV(Gi ). Choose an edge (u, v) of maximum weight in M. Remove edge (u, v) from M and G, and search for an augmenting path from u to v in G. If no such path exists, we know that MaxV(Gi ) =MaxV(Gi+1 ). If an augmenting path is found, repeat this procedure by choosing a new edge (u , v ) of maximum weight in the matching and searching for an augmenting path. After some number of repetitions of this procedure, the maximum weight of a matched edge will have decreased to the desired value MaxV(Gi+1 ). At this point our algorithm to solve the BMCM problem of Gi+1 will stop, since no augmenting path will be found. To see that this algorithm runs in O(|E|2 ), note that each search for an augmenting path uses time O(|E|) and that there are O(|E|) such searches. A successful search for an augmenting path for edge (u, v) permanently eliminates it from all future graphs, so there are at most |E| successful searches. Furthermore, there are at most |E| unsuccessful searches, one for each value of i. The new processor assignment for the similarity matrix in Fig. 1 using the DBMCM algorithm with α = β = 1 is shown in Fig. 1(c). Notice that the MaxSR
solution is minimized; however, the number of TotalV elements moved is larger than the corresponding value in Fig. 1(a), and more MaxV elements are moved than in Fig. 1(b). Also note that the optimal similarity matrix solution for MaxSR is provably no more than twice that of MaxV. Heuristic Algorithm. We have developed a heuristic greedy algorithm that gives a suboptimal solution to the TotalV metric in O(|E|) steps [7]. All partitions are initially flagged as unassigned and each processor has a counter set to F that indicates the remaining number of partitions it needs. The non-zero entries of the similarity matrix M are then sorted in descending order. Starting from the largest entry, partitions are assigned to processors that have less than F partitions until done. If necessary, the zero entries in M are also used. Oliker and Biswas [7] proved that a processor assignment obtained using the heuristic algorithm can never result in a data movement cost that is more than twice that of the optimal TotalV assignment. In addition, experimental results in Sec. 3.1 demonstrate that our heuristic quickly finds high quality solutions for all three metrics. Applying this heuristic algorithm to the similarity matrix in Fig. 1 generates the new processor assignment shown in Fig. 1(d). 2.3 Remapping Cost Model Once the reassignment problem is solved, a model is needed to quickly predict the expected redistribution cost for a given architecture. Our redistribution algorithm consists of three major steps: first, the data objects moving out of a partition are stripped out and placed in a buffer; next, a collective communication distributes the data to its destination; and finally, the received data is integrated into each partition and the boundary information is consistently updated. This remapping procedure closely follows the superstep model of BSP [9]. The expected time for the redistribution procedure on bandwidth-rich systems can be expressed as γ × MaxSR + O, where MaxSR = max(ElemsSent) + max(ElemsRecd), γ is the total computation and communication cost to process each redistributed element, and O is the sum of all constant overheads [7]. This formulation demonstrates the need to model the MaxSR metric when performing processor reassignment. By minimizing MaxSR, we can guarantee a reduction in the computational overhead of our remapping algorithm. To compute γ and O, a simple least squares fit through several data points for various redistribution patterns and their corresponding runtimes can be used. This procedure needs to be performed only once for each architecture, and the values of γ and O can then be used in actual computations to estimate the redistribution cost.
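As a sketch of the calibration step mentioned above (ours, not PLUM's actual code), γ and O can be obtained by an ordinary least-squares fit of measured redistribution times against their MaxSR values; the function name and the use of a plain linear fit are assumptions consistent with the text.

```cpp
#include <utility>
#include <vector>

// Fit time = gamma * MaxSR + overhead by ordinary least squares.
// samples: pairs of (MaxSR value, measured redistribution time).
// Assumes at least two samples with distinct MaxSR values.
std::pair<double, double>
fitRemapModel(const std::vector<std::pair<double, double>>& samples)
{
    const double n = static_cast<double>(samples.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto& s : samples) {
        sx  += s.first;            sy  += s.second;
        sxx += s.first * s.first;  sxy += s.first * s.second;
    }
    double gamma    = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double overhead = (sy - gamma * sx) / n;
    return {gamma, overhead};   // predicted cost: gamma * MaxSR + overhead
}
```

The fit is done once per architecture, after which the predicted cost gamma * MaxSR + overhead can be compared against the expected computational gain, as described above.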
3
Experimental Results
The 3D TAG parallel mesh adaption procedure [8] and the PLUM global load balancing strategy [7] have been implemented in C and C++, with the parallel activities in MPI for portability. All experiments were performed on the widenode SP2 at NASA Ames, the Origin2000 at NCSA, and the T3E at NASA Goddard, without any machine-specific optimizations.
Performance Analysis and Portability of the PLUM Load Balancing System
313
The computational mesh used in this paper is one used to simulate an acoustics wind-tunnel experiment of a UH-1H helicopter rotor blade [7]. Three different cases are studied, with varying fractions of the domain targeted for refinement based on an error indicator calculated directly from the flow solution. The strategies, called Real 1, Real 2, and Real 3, subdivided 5%, 33%, and 60% of the 78,343 edges of the initial mesh. This increased the number of mesh elements from 60,968 to 82,489, 201,780, and 321,841, respectively.

3.1 Comparison of Reassignment Algorithms

Table 1 presents a comparison of our five different processor reassignment algorithms in terms of the reassignment time (in secs) and the amount of data movement. Results are shown for the Real 2 strategy on the SP2 with F = 1. The PMeTiS [6] case does not require any explicit processor reassignment since we choose the default partition-to-processor mapping given by the partitioner. The poor performance for all three metrics is expected since PMeTiS is a global partitioner that does not attempt to minimize the remapping overhead. Previous work [2] compared the performance of PMeTiS with other partitioners.

Table 1. Comparison of reassignment algorithms for Real 2 on the SP2 with F = 1

            P = 32                                         P = 64
Algthm.     TotalV   MaxV   MaxSR   Reass. Time            TotalV   MaxV   MaxSR   Reass. Time
PMeTiS      58297    5067   7467    0.0000                 67439    2667   4452    0.0000
MWBG        34738    4410   5822    0.0177                 38059    2261   3142    0.0650
BMCM        49611    4410   5944    0.0323                 52837    2261   3282    0.1327
DBMCM       50270    4414   5733    0.0921                 54896    2261   3121    1.2515
Heuristic   35032    4410   5809    0.0017                 38283    2261   3123    0.0088
The execution times of the other four algorithms increase with the number of processors because of the growth in the size of the similarity matrix; however, the heuristic time for 64 processors is still very small and acceptable. The total volume of data movement is obviously the smallest for the MWBG algorithm since it optimally solves for the TotalV metric. In the optimal BMCM method, the maximum of the number of elements sent or received is explicitly minimized, but all the other algorithms give almost identical results for the MaxV metric. In our helicopter rotor experiment, only a few localized regions of the domain incur a dramatic increase in the number of grid points between refinement levels. These newly-refined regions must shift a large number of elements onto other processors in order to achieve a balanced load distribution. Therefore, a similar MaxV solution should be obtained by any reasonable reassignment algorithm. The DBMCM algorithm optimally reduces MaxSR, but achieves no more than a 5% improvement over the other algorithms. Nonetheless, since we believe that the MaxSR metric can closely approximate the remapping cost on many architectures, computing its optimal solution can provide useful information. Notice
that the minimum TotalV increases slightly as P grows from 32 to 64, while MaxSR is dramatically reduced by over 45%. This trend continues as the number of processors increases, and indicates that PLUM will remain viable on a large number of processors, since the per processor workload decreases as P increases. Finally, observe that the heuristic algorithm does an excellent job in minimizing all three cost metrics, in a trivial amount of time. Although theoretical bounds have only been established for the TotalV metric, empirical evidence indicates that the heuristic algorithm closely approximates both MaxV and MaxSR. Similar results were obtained for the other edge-marking strategies. 3.2 Portability Analysis The top three plots in Fig. 2 illustrate parallel speedup for the three edgemarking strategies on the SP2, Origin2000, and T3E. Two sets of results are presented for each machine: one when data remapping is performed after mesh refinement, and the other when remapping is done before refinement. The speedup numbers are almost identical on all three machines. The Real 3 case shows the best speedup values because it is the most computation intensive. Remapping data before refinement has the largest relative effect for Real 1, because it has the smallest refinement region and predictively load balancing the refined mesh returns the biggest benefit. The best results are for Real 3 with remapping before refinement, showing an efficiency greater than 87% on 32 processors.
Fig. 2. Refinement speedup (top) and remapping time (bottom) within PLUM on the SP2, Origin2000, and T3E, when data is redistributed after or before mesh refinement.

To compare the performance on the SP2, Origin2000, and T3E more critically, one needs to look at the actual times rather than the speedup values. Table 2 shows how the execution time (in secs) is spent during the refinement and subsequent load balancing phases for the Real 2 case when data is remapped
Table 2. Anatomy of execution times for Real 2 on the Origin2000, SP2, and T3E

        Adaption Time               Remapping Time              Partitioning Time
P       O2000    SP2      T3E       O2000    SP2      T3E       O2000    SP2      T3E
2       5.261    12.06    3.455     3.005    3.440    2.648     0.628    0.815    0.701
4       2.880    6.734    1.956     3.005    3.440    1.501     0.584    0.537    0.477
8       1.470    3.434    1.034     2.963    3.321    1.449     0.522    0.424    0.359
16      0.794    1.846    0.568     2.346    2.173    0.880     0.396    0.377    0.301
32      0.458    1.061    0.333     0.491    1.338    0.592     0.389    0.429    0.302
64               0.550    0.188              0.890    0.778              0.574    0.425
128                       0.121                       1.894                       0.599
before the subdivision phase. The processor reassignment times are not presented since they are negligible compared to other times, as is evident from Table 1. Notice that the T3E adaption times are consistently more than 1.4 times faster than the Origin2000 and three times faster than the SP2. One reason for this performance difference is the disparity in the clock speeds of the three machines. Another reason is that the mesh adaption code does not use the floating-point units on the SP2, thereby adversely affecting its overall performance. The bottom three plots in Fig. 2 show the remapping time for each of the three cases on the SP2, Origin2000, and T3E. In almost every case, a significant reduction in remapping time is observed when the adapted mesh is load balanced by performing data movement prior to refinement. This is because the mesh grows in size only after the data has been redistributed. In general, the remapping times also decrease as the number of processors is increased. This is because even though the total volume of data movement increases with the number of processors, there are actually more processors to share the work. The remapping times when data is moved before mesh refinement are reproduced for the Real 2 case in Table 2 since the exact values are difficult to read off the log-scale. Perhaps the most remarkable feature of these results is the peculiar behavior of the T3E when P ≥ 64. When using up to 32 processors, the remapping performance of the T3E is very similar to that of the SP2 and Origin2000. It closely follows the redistribution cost model given in Sec. 2.3, and achieves a significant runtime improvement when remapping is performed prior to refinement. However, for 64 and 128 processors, the remapping overhead on the T3E begins to increase and violates our cost model. The runtime difference when data is remapped before and after refinement is dramatically diminished; in fact, all the remapping times begin to converge to a single value! This indicates that the remapping time is no longer affected only by the volume of data redistributed but also by the interprocessor communication pattern. One way of potentially improving these results is to take advantage of the T3E’s ability to efficiently perform one-sided communication. Another surprising result is the dramatic reduction in remapping times when using 32 processors on the Origin2000. This is probably because network contention with other jobs is essentially removed when using the entire machine. When using up to 16 processors, the remapping times on the SP2 and the Ori-
gin2000 are comparable, while the T3E is about twice as fast. Recall that the remapping phase within PLUM consists of both communication and computation. Since the results in Table 2 indicate that computation is faster on the Origin2000, it is reasonable to infer that bulk communication is faster on the SP2. These results generally demonstrate that our methodology within PLUM is effective in significantly reducing the data remapping time and improving the parallel performance of mesh refinement. Table 2 also presents the PMeTiS partitioning times for Real 2 on all three systems; the results for Real 1 and Real 3 are almost identical because the time to repartition mostly depends on the initial problem size. There is, however, some dependence on the number of processors used. When there are too few processors, repartitioning takes more time because each processor has a bigger share of the total work. When there are too many processors, an increase in the communication cost slows down the repartitioner. Table 2 demonstrates that PMeTiS is fast enough to be effectively used within our framework, and that PLUM can be successfully ported to different platforms without any code modifications.
4
Conclusions
In this paper, we verified the effectiveness of our PLUM load balancer for adaptive unstructured meshes on a helicopter acoustics problem. We developed three generic metrics to model the remapping cost on most multiprocessor systems. Optimal solutions for these metrics, as well as a heuristic approach were implemented. We showed that the data redistribution overhead can be significantly reduced by applying our heuristic processor reassignment algorithm to the default mapping given by the global partitioner. Portability was demonstrated by presenting results on the three vastly different architectures of the SP2, Origin2000, and T3E, without the need for any code modifications. Results showed that, in general, PLUM will remain viable on large numbers of processors. However, our redistribution cost model was violated on the T3E when 64 or more processors were used. Future research will address the improvement of these results, and the development of a more comprehensive remapping cost model.
References
1. Bhat, K.: An O(n^2.5 log^2 n) time algorithm for the bottleneck assignment problems. AT&T Bell Laboratories Unpublished Report (1984)
2. Biswas, R., Oliker, L.: Experiments with repartitioning and load balancing adaptive meshes. NASA Ames Research Center Technical Report NAS-97-021 (1997)
3. Fredman, M., Tarjan, R.: Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM 34 (1987) 596–615
4. Gabow, H., Tarjan, R.: Algorithms for two bottleneck optimization problems. J. of Alg. 9 (1988) 411–417
5. Gabow, H., Tarjan, R.: Faster scaling algorithms for network problems. SIAM J. on Comput. 18 (1989) 1013–1036
6. Karypis, G., Kumar, V.: Parallel multilevel k-way partitioning scheme for irregular graphs. University of Minnesota Technical Report 96-036 (1996)
7. Oliker, L., Biswas, R.: PLUM: Parallel load balancing for adaptive unstructured meshes. NASA Ames Research Center Technical Report NAS-97-020 (1997)
8. Oliker, L., Biswas, R., Strawn, R.: Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2. Springer-Verlag LNCS 1117 (1996) 35–47
9. Valiant, L.: A bridging model for parallel computation. Comm. ACM 33 (1990) 103–111
Experimental Studies in Load Balancing

Azzedine Boukerche and Sajal K. Das

Department of Computer Sciences, University of North Texas, Denton, TX, USA
Abstract. This paper takes an experimental approach to the load balancing problem for parallel simulation applications. In particular, it focuses upon a conservative synchronization protocol, making use of an optimized version of the Chandy-Misra null message method, and proposes a dynamic load balancing algorithm which assumes no compile-time knowledge about the workload parameters. The proposed scheme is also implemented on an Intel Paragon A4 multicomputer, and the performance results for several simulation models are reported.
1
Introduction
A considerable number of research projects on load balancing in parallel and distributed systems have been carried out in the literature due to the potential performance gain from this service. However, these algorithms are not suitable for parallel simulation, since the synchronization constraints exacerbate the dependencies between the LPs. In this paper, we consider the load balancing problem associated with the conservative synchronization protocol that makes use of an optimized version of the Chandy-Misra null messages protocol [3]. Our primary goal is to minimize the synchronization overhead in conservative parallel simulation, and significantly reduce the total execution time of the simulation by evenly distributing the workload among processors [1,2].
2
Load Balancing Algorithm
Before the load balancing algorithm can determine the new assignment of processes, it must receive information from the processes. We propose to implement the load balancing facility as two kinds of processes: load balancer and process migration processes. A load balancer makes decisions on when to move which process to where, while a migration process carries out the decisions made by the load balancer to move processes between processors. To prevent bottlenecks and reduce the message traffic of a fully distributed approach, we propose a Centralized approach (CL) and a Multi-Level (ML) hierarchical scheme, where processors are grouped and the work loads of processors are balanced hierarchically through multiple levels. At the present time,
Part of this research is supported by Texas Advanced Technology Program grant TATP-003594031
we settle on a Two-Level scheme. In level 1, we use a centralized approach that makes use of a (Global) Load Balancing Facility (GLBF) and the global state consisting of a process/processor mapping table. The motivation for this is that many contemporary parallel computers are usually equipped with a front-end host processor (e.g., hypercube multiprocessor machines). The load balancer (GLBF) periodically sends a message <Request Load> to a specific processor, called first_proc, within each cluster, requesting the average load of each processor within each group. In level 2, the processors are partitioned into clusters, the processors within each group are structured as a virtual ring, and the rings operate in parallel to collect the work loads of the processors. A virtual ring is designed to be traversed by a token which originates at a particular processor, called first_proc, passes through intermediate processors, and ends its traversal at a preidentified processor called last_proc. Each of the rings has its own circulating token, so that information (i.e., work loads) gathering within the rings is concurrent. As the token travels through the processors of the ring, it accumulates the information (the load of each processor and the ids of the processors with the highest/lowest loads), so that when it arrives at last_proc, information has been gathered from all processors of the ring. When all processors have responded, the load balancer (GLBF) computes the new load balance. Once the load balancer makes the decision on when to move which LP to where, it sends a MigrateRequest to the migration process, which in turn initiates the migration mechanism. Then, the migration process sends the (heavily overloaded) selected process to the most lightly loaded neighbor processor. The computational load in our parallel simulation model consists of executing the null and the real messages. We wish to distribute the null messages and the real messages among all available processors. We define the (normalized) load at each processor Pr_k as Load_k = F(R^k_avg, N^k_avg) = α R^k_avg / R_avg + (1 − α) N^k_avg / N_avg, where R^k_avg and N^k_avg are respectively the average CPU-queue lengths for real messages and null messages at processor Pr_k, and α and (1 − α) are respectively the relative weights of the corresponding parameters. The value of α was determined empirically.
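The load index is then a one-line computation per processor. In the sketch below (ours, not the authors' code) the reference values used for normalization (R_avg and N_avg in the formula) are passed in explicitly; their exact definition, e.g. system-wide averages, is an assumption, since the paper only gives the formula.

```cpp
// Normalized load of processor k: Load_k = alpha * R_k/R_avg + (1 - alpha) * N_k/N_avg.
// realQ, nullQ     : average CPU-queue lengths for real and null messages on this processor.
// realRef, nullRef : reference queue lengths used for normalization (assumed nonzero).
double processorLoad(double realQ, double nullQ,
                     double realRef, double nullRef, double alpha)
{
    return alpha * (realQ / realRef) + (1.0 - alpha) * (nullQ / nullRef);
}
```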
3
Simulation Experiments
The experiments have been conducted on a 72-node Intel Paragon, and we examined a general distributed communication model (DCM), see Fig. 1, modeling a national network consisting of four regions where the subnetworks are a toroid, a circular loop, a fully connected, and a pipeline network. These regions are connected through four centrally located delays. A node in the network is represented by a process. Final message destinations are not known when a message is created. Hence, no routing algorithm is simulated. Instead, one third of the arriving messages are forwarded to the neighboring nodes. A uniform distribution is employed to select which neighbor receives a message. Consequently, the volume of messages between any two nodes is inversely proportional to their hop distance. Messages may flow between any two nodes and typically several paths may be
Fig. 1. Distributed Communication Model

Nodes receive messages at varying rates that are dependent on traffic patterns. Since there are numerous deadlocks in this model, it provides a stress test for any conservative synchronization mechanism. In fact, deadlocks occur so frequently and null messages are generated so often that the load balancing strategy is almost a necessity in this model. Various simulation conditions were created by mimicking the daily load fluctuation found in a large communication network operating across various time zones in the nation [1]. In the pipeline region, for instance, we arranged the subregion into stages, and all processes in the same stage use the same normal distribution with a standard deviation of 25%. The time required to execute a message is significantly larger than the time to raise an event message in the next stage (message passing delay). The experimental data was obtained by averaging several trial runs. The execution time (in seconds) as a function of the number of processors is presented in the form of graphs. We also report the synchronization overhead in the form of the null message ratio (NMR), which is defined as the number of null messages processed by the simulation using the Chandy-Misra null-message approach divided by the number of real messages processed. The performance of our dynamic load balancing algorithms was compared with a static partitioning. Figure 2 depicts the results obtained for the DCM model. We observe that the load balancing algorithm improves the running time. In other words, the running time for both network models, i.e., Fully Connected and Distributed Communication, decreases as we increase the number of processors. The results in Fig. 2 indicate a reduction of 50% in the running time using the CL dynamic load balancing strategy compared with the static one when the number of processors is increased from 32 to 64, and about a 55-60% reduction when we use the ML dynamic load balancing algorithm. Figure 3 displays NMR as a function of the number of processors employed in the network models. We observe a significant reduction of null messages for all load balancing schemes. The results show that NMR increases as the number of processors increases for both population levels. For instance, if we confine ourselves to fewer than 4 processors, we see approximately a 30-35% reduction of
Fig. 2. Run Time Vs. Nbr of Processors

Fig. 3. NMR Vs. Nbr of Processors
the synchronization overhead using the CL dynamic load balancing algorithm over the static one. Increasing the number of processors from 4 to 16, we observe about a 35-40% reduction of NMR. We also observe a 45% reduction using the CL strategy over the static one when 64 processors are used, and a reduction of more than 50% using the ML scheme over the static one. These results indicate that the multi-level load balancing strategy significantly reduces the synchronization overhead when compared to a centralized method. In other words, careful dynamic load balancing improves the performance of a conservative parallel simulation.
4
Conclusion and Future Work
We have described a dynamic load balancing algorithm based upon a notion of CPU-queue length utilization, in which process migration takes place between physical processors. The synchronization protocol makes use of the Chandy-Misra null message approach. Our results indicate that careful load balancing can be used to further improve the performance of Chandy-Misra's approach. We note that a Multi-Level approach seems to be a promising solution for further reducing the execution time of the simulation. An improvement between 30% and 50% in the event throughput was also noted, which resulted in a significant reduction in the running time of the simulation models.
References
1. Boukerche, A., Das, S. K.: Dynamic Load Balancing Strategies For Parallel Simulations. TR-98, University of North Texas.
2. Boukerche, A., and Tropper, C., "A Static Partitioning and Mapping Algorithm for Conservative Parallel Simulations", IEEE/ACM PADS'94, 1994, 164–172.
3. Misra, J., "Distributed Discrete-Event Simulation", ACM Computing Surveys, Vol. 18, No. 1, 1986, 39–65.
On-Line Scheduling of Parallelizable Jobs

Christophe Rapine (1), Isaac D. Scherson (2), and Denis Trystram (1)

(1) LMC – IMAG, Domaine Universitaire, BP 53, 38041 GRENOBLE cedex 9, France
(2) University of California, Irvine, Department of Information and Computer Science, Irvine, CA 92717-3425, U.S.A.
Abstract. We consider the problem of efficiently executing a set of parallel jobs on a parallel machine by effectively scheduling the jobs on the computer’s resources. This problem is one of optimization of resource utilization by parallel computing programs and/or the management of multi-users requests on a distributed system. We assume that each job is parallelizable and can be executed on any number of processors. Various on-line scheduling strategies of time/space sharing are presented here. The goal is to assign jobs to processors in space and time such that the total execution time is optimized.
1 Introduction

1.1 Description of the Computational Model
A model of computation is a description of a computer together with its program execution rules. The interest of a model is to provide a high level and abstract vision of the real computer world such that the design and the theoretical development and analysis of algorithms and of their complexity are possible. The most commonly adopted parallel computer model is the PRAM model [4]. A PRAM architecture is a synchronous system consisting of a global shared-memory and an unbounded number of processors. Each processor owns a small local memory and is able to execute any basic instruction in one unit of time. After each basic computational step a synchronization is done and each processor can access a global variable (for reading or writing) in one time unit. This model provides a powerful basis for the theoretical analysis of algorithms. It allows in particular the classification of problems in term of their intrinsic parallelism. Basically, the parallel characterization of a problem P is based on the order of magnitude of the time T∞(n) of the best known algorithm to solve an instance of size n. T∞(n) provides a lower bound for the time to solve P in parallel. However, in addition to answer the question how fast the problem can be solved?, one important characteristic is how efficient is the parallelization?, i.e. what is the order of magnitude of the number of processors P∞(n) needed to achieve T∞(n). This is an essential parameter to deal with in order to implement an algorithm on an actual parallel system with limited resources. This is best represented by the ratio µ∞ between the work of the PRAM algorithm, W∞(n) = T∞(n).P∞(n), and the total number of instructions in the algorithm Wtot. In this paper we consider a physically-distributed logically-shared memory machine composed of m
identical processors linked by an interconnection network. In the light of modern fast processor technology, compared to the PRAM model, communications and synchronizations become the most expensive operations and must be considered as important as computations.
1.2
Model of Jobs
For reasons of efficiency, algorithms must be considered at a high level of granularity, i.e. elementary operations are to be grouped into tasks, linked by precedence constraints to form programs. A general approach leads to writing a parallel algorithm for v virtual processors with m ≤ v ≤ P∞. The execution of the program needs a scheduling policy a, static and/or dynamic, which directly influences the execution time Tm of the program. We define the work of the application on m processors as the space-time product Wm = m × Tm. Ideally speaking, one may hope to obtain a parallel time equal to Tseq/m, where Tseq denotes the sequential execution time of the program. This would imply a conservation of the work, Wm = Wtot. Such a linear speed-up is hard to reach even with an optimal scheduling policy, due mainly to communication delays and the intrinsic non-parallelism of the algorithm (even in the PRAM model we may have µ∞ > 1). We introduce the ratio µm to represent the "penalty" of a parallel execution with respect to a sequential one and define it as:

µm = Wm / Wtot = m·Tm / Tseq

1.3 Discussion on Inefficiency
The Inefficiency factor (µm) is an experimental parameter which reflects the quality of the parallel execution, even though it can be computed theoretically for an application and a scheduling. Let us remark that the parameter µm depends theoretically on n and on the algorithm itself. Practically speaking, most existing implementations of large industrial and/or academic scientific codes show a small constant upper bound. However, we can assume some general properties about the inefficiency that seem realistic. First, it is reasonable to assume that the work Wp is an increasing function of p. Hence inefficiency is also an increasing function of p: µp ≤ µp+1 for any p. Moreover, we can also assume that the parallel time Tp is a non-increasing function of p, at least for "reasonable" numbers of processors. Hence, (p + 1)Tp+1 ≤ (1 + 1/p) p Tp ⇒ µp+1 ≤ (1 + 1/p) µp. Developing the inequality, for any numbers of processors p, q ≥ 1 we get µp+q ≤ ((p + q)/p) µp. In other words, µp/p is a non-increasing function of p. As a consequence, and as µ1 is 1 by definition, µp is bounded by p, which intuitively means that in the worst case the execution of the application on p processors leads in fact to a sequential execution, i.e. the application is not parallel.
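Spelling out the step from the single-step inequality to the bound for arbitrary q (this short derivation is ours, using only the inequality just stated, applied q times and telescoped):

```latex
\mu_{p+q}
  \;\le\; \prod_{j=p}^{p+q-1}\Bigl(1+\tfrac{1}{j}\Bigr)\,\mu_{p}
  \;=\;   \prod_{j=p}^{p+q-1}\frac{j+1}{j}\,\mu_{p}
  \;=\;   \frac{p+q}{p}\,\mu_{p}.
```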
1.4 Competitivity
The factor µp is a performance characteristic of the parallel execution of a program. It can be easily computed a posteriori after an execution, in the same way as speedup or efficiency. A more theoretical and precise performance guarantee is the competitive ratio: an algorithm is said to be ρ(m)-competitive if, for any instance I to be scheduled on m processors, Tm(I) ≤ ρ(m)·T*m(I), where Tm(I) denotes the time of the execution produced by the algorithm and T*m(I) is the optimal time on m processors. Our goal is, given multiprocessor tasks individually characterized by an inefficiency µp when executed on p processors, to determine the competitive ratio for the whole system.
2
Scheduling Independent Jobs
We define an application as a set of jobs linked by precedence constraints. Most parallel programming paradigms assume that a parallel program produces independent subproblems, to be treated concurrently. Consider the following problem: Given an m-processor parallel machine and a set of N independent parallelizable jobs (Ji)1,N whose durations are not known until they terminate, what competitive ratio can we guarantee for the execution? We assume that each job Ji has a certain inefficiency µ^i_p when executed on p processors and that it can be preempted. The question can be restated as: Given a set of inefficiencies, what is the competitive ratio we may hope for from an on-line scheduling strategy? In the following, we denote by T*(I) the optimal execution time and by Ta(I) that of the scheduling strategy a. The competitive ratio ρ*_a of a is defined as the maximum over all instances I of the fraction Ta(I)/T*(I). Another interesting performance guarantee is the comparison between Ta(I) and the optimal execution time T*_S(I) when the parallelization of the jobs is not allowed, i.e. jobs are computed one at a time on single processors. We denote by ρ^seq_a the maximum over I of Ta(I)/T*_S(I). This competitive ratio will highlight the (possible) gain from allowing the multiprocessing of jobs compared to the classical approach we present below where jobs are purely sequential.

2.1 Two Extreme Strategies: Graham and Gang Scheduling
The scheduling of multiprocessor jobs has recently become a matter of much research and study. In off-line scheduling theory, several articles have been published by J. Błażewicz et al. [2,1], considering multiprocessor tasks but with each one requiring a fixed number of processors for execution. If we look at the field of on-line scheduling, the most important contributions focused on the case of one-processor tasks. In this classical approach any task requires only one processor for execution. The most famous algorithm is due to Graham [3] and consists simply in a greedy assignment of the tasks to the processors such that a processor cannot be idle if there is a remaining unscheduled job. Its competitive ratio is 2 − 1/m, which is the best possible ratio for any deterministic scheduling strategy
when the computation costs of the tasks are not known till their completion. At the other end of the spectrum, the Gang scheduling analyzed recently by Scherson [5] assumes multiprocessor jobs and schedules them by allocating the whole machine, m processors, to every task. For the Happiness function defined by Scherson, the Gang strategy is proved to be the best one for preemptive multiprocessor job scheduling. Certainly, if we are concerned with the competitive ratio, the best strategy will be a compromise between allowing one processor per job (Graham) and the whole machine to any job (Gang). In particular we hope that multiprocessing of the jobs may improve the competitive ratio (2 − 1/m) of Graham. In the following, let µp = max_{i=1..N} µ^i_p.

Lemma 1. The competitive ratio of Gang scheduling is equal to µm.

Proof. Consider any instance I. Let ai be the work of job Ji, and Wtot the total amount of work of the program. We have T_Gang = Σ_{i=1}^{N} (µ^i_m/m) ai ≤ (µm/m) Wtot. Noticing that Wtot/m is always a lower bound of T*, it follows that ρ*_Gang ≤ µm. Consider now a particular instance consisting of m identical jobs of work a. Gang scheduling produces an execution time of µm·a, while assigning one processor per job produces a schedule of length exactly a. As this schedule is a one-processor-job schedule, it proves that ρ^seq_Gang ≥ µm, which gives the result. ✷

Lemma 2. For Graham scheduling, ρ^seq_Graham = 2 − 1/m and ρ*_Graham ≥ m/µm.

Proof. Let us consider the scheduling of a single job of size a. Any one-processor scheduling delivers an execution time of a, while the optimal one is reached by giving the whole machine to the job. Thus T* = (µm/m)·a and so ρ*_Graham ≥ m/µm. ✷
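For reference, Graham's greedy rule is easy to state in code. The sketch below is a textbook formulation (not taken from [3]): whenever a processor becomes free it starts the next unscheduled job, so durations only need to be observed at completion; here they are passed in solely to compute the resulting makespan.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Greedy (Graham) list scheduling of sequential jobs on m processors:
// each job is started on the processor that becomes idle first.
double grahamMakespan(const std::vector<double>& durations, int m)
{
    std::priority_queue<double, std::vector<double>, std::greater<double>> idleAt;
    for (int p = 0; p < m; ++p) idleAt.push(0.0);

    double makespan = 0.0;
    for (double d : durations) {
        double start = idleAt.top(); idleAt.pop();   // earliest available processor
        double finish = start + d;
        idleAt.push(finish);
        makespan = std::max(makespan, finish);
    }
    return makespan;
}
```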
3
SCHEduling Multiprocessors Efficiently: SCHEME
A Graham scheduling breaks into two phases: in the first one all processors are busy, getting an efficiency of one. Inefficiency happens when less than m tasks remain. Our idea is to preempt these tasks and to schedule them with a Gang policy in order to avoid idle time.

Fig. 1. Principle of a SCHEME scheduling
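A compact sketch of the SCHEME makespan implied by Fig. 1 (our illustration, not the authors' implementation): a greedy phase runs while at least m jobs are unfinished, and the leftovers are preempted and finished in Gang mode, each charged µm times its remaining work over m processors. Job works are passed in only so the makespan can be computed offline.

```cpp
#include <queue>
#include <utility>
#include <vector>

// SCHEME makespan on m processors for independent jobs with total works[i],
// assuming a Gang-phase inefficiency factor mu_m >= 1.
double schemeMakespan(const std::vector<double>& works, int m, double mu_m)
{
    const int n = static_cast<int>(works.size());
    if (n < m) {                       // fewer than m jobs: pure Gang phase from time 0
        double gang = 0.0;
        for (double w : works) gang += mu_m * w / m;
        return gang;
    }
    // Min-heap of (finish time, job id) for the jobs currently running in the greedy phase.
    using Ev = std::pair<double, int>;
    std::priority_queue<Ev, std::vector<Ev>, std::greater<Ev>> running;
    int next = 0;
    for (; next < m; ++next) running.push({works[next], next});

    double t = 0.0;
    while (next < n) {                 // each completed job is replaced by a waiting one
        Ev done = running.top(); running.pop();
        t = done.first;
        running.push({t + works[next], next});
        ++next;
    }
    // Exactly m jobs are still running; the next completion leaves m-1 (< m) of them,
    // which is the instant t where SCHEME switches to the Gang phase.
    Ev done = running.top(); running.pop();
    t = done.first;

    double gang = 0.0;                 // Gang phase over the m-1 preempted jobs
    while (!running.empty()) {
        gang += mu_m * (running.top().first - t) / m;   // leftover work = finish - t
        running.pop();
    }
    return t + gang;
}
```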
3.1 Analysis of the Sequential Competitivity
Let’s consider first time t in the Graham scheduling when less than m jobs are not completed. Let J1 , . . . , Jk , k ≤ m−1, be these unfinished jobs, and denote by a1 , . . . , ak their remaining amount of work. Let W = m.t be the work executed i=k + µmm between time 0 and t. We have the majoration TSC (m) ≤ W i=1 ai . m i=k Introducing the total amount of work of the jobs, Wtot = W + i=1 ai , and the maximum amount of work remaining for a job, a = max{a1 , . . . , ak }, it follows i=k that TSC (m) ≤ Wtot /m + µmm−1 i=1 ai ≤ Wtot /m + m−1 m (µm − 1)a To obtain an expression of sequential competitivity , we need to minorate the one-processor optimal time. A lower bound is always Wtot /m. As we consider a one-processor job scheduling, TS∗ is greater than the sequential time of any task. Hence we have TS∗ > a. We get the following lemma: Lemma 3. SCHEME has a sequential competitivity of (1 −
1 m )µm
+
1 m
Proof. It is sufficient to prove that the bound is tight. Consider the following instance composed of m − 1 jobs of size m and m jobs of size 1. The optimal schedule allocates one processor per task of size m and assigns all the small jobs to the last processor. It realizes an execution time of m = W_tot/m. A possible execution of the SCHEME scheduling allocates first one small job per processor and then uses the Gang strategy for the m − 1 large jobs. This leads to an execution time T_SC(m) = 1 + (m − 1)·µ_m, which realizes the bound. ✷

3.2 Analysis of the Competitive Ratio
We determine here the competitive ratio of the SCHEME scheduling compared to an optimal multiprocessor job scheduling. Consider an instance of the problem on m processors, and let S* be an optimal schedule. We decompose S* into temporal area slices (A*_i)_i, each slice corresponding to the work computed between times t_{i−1} and t_i, defined recursively as follows: t_0 = 0, and t_{i+1} is the maximal date such that in the time interval [t_i, t_{i+1}[ no job is preempted or completed. Notice that in the slice A*_i either all the jobs are computed on one processor, in which case the slice is said to be “one-processored”, or at least one job is executed on p ≥ 2 processors. Denote by A_i the area needed in the SCHEME schedule to compute the instructions of A*_i. Notice that any piece of job in the SCHEME scheduling is processed either on one or on m processors.
• Let A*_i be a one-processored slice. Denote by G_i the amount of instructions of A*_i executed on m processors in A_i and by W_i the amount of instructions of A*_i executed on one processor in A_i. By definition A*_i = G_i + W_i. At least one of the jobs is entirely computed on one processor in the SCHEME schedule, hence W_i ≥ A*_i/m. In A_i, the area to compute the instructions of G_i is increased by a factor µ_m. Thus A_i = µ_m·G_i + W_i = µ_m·A*_i − (µ_m − 1)·W_i ≤ µ_m·A*_i − (µ_m − 1)·A*_i/m ≤ ((1 − 1/m)·µ_m + 1/m)·A*_i. It follows that A_i ≤ ρ^seq_SC(m)·A*_i.
• Let A*_i be a multi-processored slice. There exists a job J computed on p ≥ 2 processors in the slice. Let a be its amount of instructions and denote by
B = A*_i − µ_p·a the remaining area less J. In the SCHEME schedule, the area A_i needed to execute the instructions of A*_i is at most µ_m·(a + B) = µ_m·(A*_i − (µ_p − 1)·a). Moreover A*_i ≤ (µ_p/p)·a·m. Hence: A_i ≤ µ_m·(1 − p·(µ_p − 1)/(µ_p·m))·A*_i. The maximum ratio is obtained by minimizing the term p·(µ_p − 1)/µ_p. Noticing that p/µ_p and µ_p are increasing functions of p, the maximal value is reached for p = 2. Hence: A_i ≤ (1 − 2/m + 2/(m·µ_2))·µ_m·A*_i. We call this ratio factor ρ^par_SC(m). For any slice A*_i of S*, the corresponding area in the SCHEME schedule is thus bounded by max{ρ^seq_SC(m), ρ^par_SC(m)}·A*_i. Noticing that the SCHEME schedule contains no idle time, we get the following lemma:

Lemma 4. The competitive ratio of SCHEME scheduling is equal to ρ*_SC(m) = (1 − 1/m)·µ_m + (1/m)·max{1, µ_m·(2 − µ_2)/µ_2}. (Indeed, since ρ^par_SC(m) = (1 − 1/m)·µ_m + (µ_m/m)·(2 − µ_2)/µ_2, this expression is exactly max{ρ^seq_SC(m), ρ^par_SC(m)}.)
Proof. To prove that the ratio is tight, consider the instance composed of m − 2 jobs of size 1 and one job of size 2/µ_2. A possible schedule, allocating one processor per unit job and two processors for the last job, completes at time 1. The SCHEME scheduling leads to an execution time of (µ_m/m)·((m − 2) + 2/µ_2). ✷
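As a quick numerical sanity check of this tight instance (not from the paper; the values of µ_2, µ_m and m below are arbitrary assumptions), the SCHEME-to-optimal ratio can be compared with the expression of Lemma 4:

```python
# Hypothetical inefficiency factors and machine size, chosen only for illustration.
mu2, mum, m = 1.1, 1.5, 8

# Tight instance of Lemma 4: m-2 unit jobs plus one job of work 2/mu2.
opt = 1.0                                   # one processor per unit job, two for the last job
scheme = (mum / m) * ((m - 2) + 2.0 / mu2)  # fewer than m jobs, so SCHEME gangs them all

rho_seq = (1 - 1.0 / m) * mum + 1.0 / m
rho_par = (1 - 1.0 / m) * mum + (mum / m) * (2 - mu2) / mu2
print(scheme / opt)            # ~1.4659
print(max(rho_seq, rho_par))   # ~1.4659: the ratio of Lemma 4 is attained
```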
4 Concluding Remarks
We are currently experimenting with the mixed strategy SCHEME, which has been shown to be (theoretically) promising. Other more sophisticated scheduling strategies are also under investigation.
References

1. J. Błażewicz, P. Dell'Olmo, M. Drozdowski, and M.G. Speranza. Scheduling multiprocessor tasks on three dedicated processors. Information Processing Letters, 41:275–280, April 1992.
2. J. Błażewicz, M. Drabowski, and J. Węglarz. Scheduling multiprocessor tasks to minimize schedule length. IEEE Transactions on Computers, 35(5):389–393, 1986.
3. R.L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, March 1969.
4. R.M. Karp. On the computational complexity of combinatorial problems. Networks, 5:45–68, 1975.
5. I. D. Scherson, R. Subramanian, V. L. M. Reis, and L. M. Campos. Scheduling computationally intensive data parallel programs. In Placement dynamique et répartition de charge : application aux systèmes parallèles et répartis (École Française de Parallélisme, Réseaux et Systèmes), pages 107–129. Inria, July 1996.
On Optimal k-linear Scheduling of Tree-like Task Graphs for LogP-Machines

Wolf Zimmermann¹, Martin Middendorf², and Welf Löwe¹
¹ Institut für Programmstrukturen und Datenorganisation, Universität Karlsruhe, 76128 Karlsruhe, Germany, {zimmer,loewe}@ipd.info.uni-karlsruhe.de
² Institut für Angewandte Informatik und Formale Beschreibungsverfahren, Universität Karlsruhe, 76128 Karlsruhe, Germany, [email protected]
Abstract. A k-linear schedule may map up to k directed paths of a task graph onto one processor. We consider k-linear scheduling algorithms for the communication cost model of the LogP-machine, i.e. without assumption on processor bounds. The main result of this paper is that optimal k-linear schedules of trees and tree-like task graphs G with n tasks can be computed in time O(n^(k+2) log n) and O(n^(k+3) log n), respectively, if o ≥ g. These schedules satisfy a capacity constraint, i.e., there are at most L/g messages in transit from any or to any processor at any time.
1 Introduction
The LogP-machine [3,4] assumes a cost model reflecting latency of point-to-point communication in the network, overhead of communication on the processors themselves, and the network bandwidth. These communication costs are modeled with the parameters Latency, overhead, and gap. The gap is the inverse bandwidth of the network per processor. In addition to L, o, and g, parameter P describes the number of processors. Furthermore, the network has finite capacity, i.e. there are at most L/g messages in transit from any or to any processor at any time. If a processor would attempt to send a message that exceeds this capacity, then it stalls until the message can be sent without violation of the capacity constraint. The LogP-parameters have been determined for several parallel computers [3,4,6,8]. These works confirmed the LogP-based runtime predictions. Much research has been done in recent years on scheduling task graphs with communication delay [9,10,11,14,21,23,25], i.e. o = g = 0. In the following, we discuss those works assuming an unrestricted number of processors. Papadimitriou and Yannakakis [21] have shown that given a task graph G with unit computation times for each task, communication latency L ≥ 1, and integer Tmax, it is NP-complete to decide whether G can be scheduled in time ≤ Tmax. Their results hold no matter whether redundant computations of tasks are allowed or not. Finding optimal time schedules remains NP-complete for DAGs obtained by the concatenation of a join and a fork (if redundant computations
are allowed) and for fine grained trees (no matter if redundant computations are allowed or not). A task graph is fine grained if the granularity γ, a constant closely related to the ratio of computation and communication, is < 1 [22]. Gerasoulis and Yang [9] find schedules for task graphs guaranteeing the factor 1 + 1/γ of the optimal time if recomputation is not allowed. For some special classes of task graphs, such as join, fork, coarse grained (inverse) trees, an optimal schedule can be computed in polynomial time [2,9,15,25]. If recomputation is allowed some coarse grained DAGs can also be scheduled optimally in polynomial time [1,5,15]. The problem of finding optimal schedules having a small amount of redundant computations has been addressed in [23]. Recently, there are some works investigating scheduling task graphs for the LogP-machine without limitations on the number of processors [12,16,17,18,20,24,26] and with limitations on the number of processors [19,7]. Most of these works discuss approximation algorithms. In particular, [18,26] generalizes the result of Gerasoulis and Yang [9] to LogP-machines using linear schedules. [20] shows that optimal linear schedules for trees and tree-like task graphs can be computed in polynomial time if the cost of each task is at least g − o. However, linear schedules for trees often have bad performance. Unfortunately, if we drop the linearity, even scheduling trees becomes NP-complete [20]. [24] shows that the computation of a schedule of length at most B is NP-complete even for fork and join trees and o = g. They discuss several approximation algorithms for fork and join trees. [12] discusses an optimal scheduling algorithm for some special cases of fork graphs. In this paper, we generalize the linearity constraint and allow instead to map up to k directed paths on one processor. We show that it is possible to compute optimal k-linear schedules on trees and tree-like task graphs in polynomial time if o = g. In contrast to most other works on scheduling task graphs for the LogP-machine, we take the capacity constraint into account. Section 2 gives the basic definitions used in this paper. Section 3 discusses some normalization properties for k-linear schedules on trees. Section 4 presents the scheduling algorithm with an example that shows the improvement of k-linearity w.r.t. linear schedules.
2 Definitions
A task graph is a directed acyclic graph G = (V, E, τ), where the vertices are sequential tasks, the edges are data dependencies between them, and τ_v is the computation cost of task v ∈ V. The set of direct predecessors PRED_v of a task v in G is defined by PRED_v = {u ∈ V | (u, v) ∈ E}. The set of direct successors SUCC_v is defined analogously. Leaves are tasks without predecessors, roots are tasks without successors. A task v is an ancestor (descendant) of a task u iff there is a path from v to u (u to v). The set of ancestors (descendants) of u is denoted by ANC_u (DESC_u). A set of tasks U ⊆ V is independent iff for all u, v ∈ U with u ≠ v, neither u ∈ ANC_v nor v ∈ ANC_u. A set of tasks U ⊆ V is k-independent iff there is an independent subset W ⊆ U of size k. G is an inverse tree iff |SUCC_v| ≤ 1 for all v and there is exactly one root. In this case, SUCC_v
denotes the unique successor of non-root tasks v. G is tree-like iff for each root v, the subgraph of G induced by the ancestors of v is a tree. A schedule for G specifies for each processor Pi and each time step t the operation performed by processor Pi at time t, provided there starts one. The operations are the execution of a task v ∈ V (denoted by the task), sending a message containing the result of a task v to another processor P (denoted by send (v, P )), and receiving a message containing the result of a task1 v from processor P (denoted by recv (v, P )). A schedule is therefore a partial mapping s : P × N → OP, where P denotes an infinite set of processors and OP denotes the set of operations. s must satisfy some restrictions: (i) Two operations on the same processor must not overlap in time. The computation of a task v requires time τv . Sending from v or receiving a message sent by v requires time o. (ii) If a processor computes a task v ∈ V then any task w ∈ PRED v must be computed or received before on the same processor. (iii) If a processor sends a message of a task, it must be computed or received before on the same processor. (iv) Between two consecutive send (receive) operations there must be at least time g, where v is the message sent earlier (received later). (v) For any send operation there is a corresponding receive operation, and vice versa. Between the end of a send operation and the beginning of the corresponding receive operation must be at least time L. (vi) Any v ∈ V is computed on some processor. (vii) The capacity constraints are satisfied by s. A trace of s induced by processor P is the sequence of operations performed by P together with their starting time. tasks(P ) denotes the set of tasks computed by processor P . A schedule s for G is k-linear, k ∈ N, iff tasks(P ) is not (k + 1) -independent for every processor P , i.e. at most k directed paths of G are computed by P . The execution time TIME (s) of a schedule s is the time when the last operation is finished. A sub-schedule sv induced by a task v computes just the ancestors of v and performs all send and receive operations of the ancestors except v on the same processors at the same time as s, i.e.
s_v(P, t) =
  s(P, t)     if s(P, t) ∈ ANC_v,
              or s(P, t) = recv(w, P′) for a processor P′ and v ≠ w ∈ ANC_v,
              or s(P, t) = send(w, P′) for a processor P′ and v ≠ w ∈ ANC_v;
  undefined   otherwise.
It is easy to see that s_v is a schedule of the subgraph of G induced by ANC_v. A schedule s for an inverse tree G = (V, E, τ) is non-redundant if each task of G is computed exactly once, for each v ∈ V there exists at most one send operation, and there exists no send operation if v is computed on the same processor as its successor.

¹ For simplicity, we say sending and receiving the task v, respectively.
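To make constraints (iv) and (v) above concrete, here is a small illustrative checker (the flat tuple encoding, the operation times, and the interpretation of “between two operations” as the distance between their start times are our own assumptions, not the paper's):

```python
L, o, g = 2, 1, 1   # illustrative LogP values with o >= g, as assumed in the paper

# An assumed flat encoding of the communication part of a schedule:
# (processor, start time, kind, task), where kind is "send" or "recv".
ops = [
    ("P0", 3, "send", "v"),
    ("P1", 6, "recv", "v"),
    ("P0", 4, "send", "w"),
    ("P2", 7, "recv", "w"),
]

def check_latency(ops, L, o):
    """Constraint (v): the matching receive starts at least L after the send ends."""
    send_end = {task: t + o for (p, t, kind, task) in ops if kind == "send"}
    return all(t >= send_end[task] + L
               for (p, t, kind, task) in ops if kind == "recv")

def check_gap(ops, g):
    """Constraint (iv): at least time g between consecutive send (resp. receive)
    operations on the same processor (measured here between their start times)."""
    ok = True
    for proc in {p for (p, t, kind, task) in ops}:
        for kind in ("send", "recv"):
            times = sorted(t for (p, t, k, task) in ops if p == proc and k == kind)
            ok = ok and all(b - a >= g for a, b in zip(times, times[1:]))
    return ok

print(check_latency(ops, L, o))  # True: 6 >= (3+1)+2 and 7 >= (4+1)+2
print(check_gap(ops, g))         # True: the two sends on P0 start 1 apart
```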
3 Properties of k-linear Schedules
In this section we prove some general properties of k-linear schedules on trees. These properties are used to restrict the search space for optimal schedules.

Lemma 1. Let G = (V, E, τ) be an inverse tree. For every schedule s for G there exists a non-redundant schedule s′ with TIME(s′) ≤ TIME(s). If s is k-linear, then so is s′.

Proof: Let s be a schedule for G. First we construct a sub-schedule of s where each node is computed only once. Assume that a task is computed more than once in s. If the root is computed more than once, just omit all but one computation of the root. Otherwise let v be computed more than once such that SUCC_v is computed only once. Let SUCC_v be computed on processor P. Then there exists a computation of v on a processor P_0 and a (possibly empty) sequence send(v, P_1), recv(v, P_0), send(v, P_2), . . . , send(v, P), recv(v, P_k) of corresponding send and receive operations of v such that v is received on P before the computation of SUCC_v starts. Now, all other computations of v are omitted, the above sequence of send and receive operations is replaced by send(v, P), recv(v, P_0), and all other operations sending or receiving v are removed, provided P_0 ≠ P. If P_0 = P, then all operations sending or receiving v are removed. This process is repeated until each task is computed exactly once. It is clear that the obtained schedule s′ is non-redundant, TIME(s′) ≤ TIME(s), and the above transformations preserve k-linearity.

Lemma 2. Let G = (V, E, τ) be an inverse tree. For every non-redundant k-linear schedule s for G there is a non-redundant k-linear schedule s′ satisfying TIME(s′) ≤ TIME(s) such that for every processor P, the subgraph of G induced by tasks(P) is an inverse tree with at most k leaves.

Proof: Suppose there is a processor P such that the subgraph G_P of G induced by tasks(P) is not an inverse tree. Since every subgraph of G is a forest consisting of inverse trees, G_P must have more than one connected component. Since s is k-linear, each connected component contains at most k leaves. Let s′ be the schedule obtained from s by computing each of these connected components on separate (new) processors, and updating accordingly the operations that send tasks to processor P. It is easy to see that this transformation neither affects the execution time nor the k-linearity. The claim follows by induction.

Corollary 1. Let G = (V, E, τ) be an inverse tree. Then there exists an optimal k-linear schedule s such that for every v ∈ V the sub-schedule s_v induced by v is an optimal k-linear schedule for G_v which is non-redundant and for each processor P the subgraph induced by the tasks tasks(P) is an inverse tree with at most k leaves.

Proof: Let s be an optimal k-linear schedule for G that is non-redundant. If for a v ∈ V the sub-schedule s_v is not optimal, replace it by an optimal k-linear sub-schedule that is non-redundant. Thus, for obtaining optimal k-linear schedules, it is sufficient to consider non-redundant schedules that satisfy Corollary 1.
4 The Scheduling Algorithm
In this section, we present the algorithm computing optimal k-linear schedules for task graphs, if o = g. We first reduce the scheduling problem to a special case of the 1-machine scheduling problem with precedence constraints and release dates, i.e. a task graph G = (V, E, τ) is to be scheduled on one processor where for every task v there is a time r_v (release date) such that v cannot be scheduled before r_v:

Lemma 3. Let G = (V, E, τ) be an inverse tree. Let s be a schedule for G that satisfies the property of Corollary 1. For a v ∈ V consider the sub-schedule s_v and let P be the processor computing v. Let tr be the trace of P in s_v where receive operations are replaced by computations of length o computing the received tasks, and let G′_v be obtained from the subgraph of G_v induced by tasks(P) ∪ S, where S is the set of tasks received by P in s_v. Then tr is an optimal 1-machine schedule for G′_v with release dates TIME(s_u) + L + o for u ∈ S.

Proof: No u ∈ S can be received before time t_u = TIME(s_u), because send(u, P) cannot be scheduled earlier than t_u. Since every reception of a task costs time o, tr is a 1-machine schedule for G′_v with release dates t_u + L + o for u ∈ S. Since s is optimal, tr is also optimal.

Algorithm 1
 (1) for v ∈ V according to a topological order of G do
 (2)   t_v := ∞; s_v := ∅;                          – – try to compute an optimal k-linear schedule
 (3)   for every independent U ⊆ ANC_v, 0 < |U| ≤ k do
 (4)     s′_v := optimal_trace(G, v, U, t)
 (5)     if TIME(s′_v) < t_v then                    – – a better schedule is found
 (6)       t_v := TIME(s′_v);
 (7)       s_v := s′_v ∪ ⋃_{u ∈ PRED(U)} ( s_u ∪ {s_v(P_u, t′) = send(u, P_v)} ),
             where s′_v(P_v, t′ + L + o) = recv(u, P_u)
 (8)     end; – – if
 (9)   end; – – for
(10) end; – – for
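The following Python sketch (ours, not the authors' implementation) illustrates the enumeration structure of Algorithm 1 on a toy inverse tree: tasks are processed in topological order, every independent ancestor set U of size at most k is tried as the set of leaves of the local subtree, and the best candidate is kept. For simplicity the optimal_trace step is replaced by a deliberately naive feasible trace — all receptions first, ordered by release date, then the local computations — so the computed values are upper bounds rather than the optimal k-linear times.

```python
from itertools import combinations

# Toy inverse tree (not the paper's example): succ[v] is the unique successor
# (None for the root), tau[v] the computation cost.  L, o and k are assumptions.
succ = {3: 1, 4: 1, 5: 2, 6: 2, 1: 0, 2: 0, 0: None}
tau = {v: 1 for v in succ}
L, o, k = 2, 1, 2

pred = {v: [u for u in succ if succ[u] == v] for v in succ}

def ancestors(v):
    """All proper ancestors of v (tasks from which a path leads to v)."""
    result, stack = set(), list(pred[v])
    while stack:
        u = stack.pop()
        if u not in result:
            result.add(u)
            stack.extend(pred[u])
    return result

def independent(U):
    """No task of U is an ancestor of another task of U."""
    return all(u not in ancestors(w) for u in U for w in U if u != w)

def local_tasks(v, U):
    """Tasks on the paths from the leaves U up to v (computed on v's processor)."""
    tasks = {v}
    for u in U:
        while u not in tasks:
            tasks.add(u)
            u = succ[u]
    return tasks

t = {}   # t[v]: length of the best candidate schedule found for the subtree rooted at v
for v in sorted(succ, key=lambda x: len(ancestors(x))):    # a topological order
    anc = sorted(ancestors(v))
    if not anc:                      # leaf: nothing to receive
        t[v] = tau[v]
        continue
    best = float("inf")
    for size in range(1, k + 1):
        for U in combinations(anc, size):
            if not independent(set(U)):
                continue
            local = local_tasks(v, set(U))
            received = {w for u in local for w in pred[u] if w not in local}
            # Naive feasible trace: receptions first (release date t[w] + L + o,
            # each costing o), then the local computations.
            finish = 0
            for w in sorted(received, key=lambda w: t[w]):
                finish = max(finish, t[w] + L + o) + o
            finish += sum(tau[u] for u in local)
            best = min(best, finish)
    t[v] = best

print(t[0])   # 9 for this toy tree: a feasible, not necessarily optimal, 2-linear length
```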
Algorithm 1 computes (in topological order) for each task v ∈ V an optimal schedule. In the iteration step, for every possible set R for the subtree rooted at v as defined in Lemma 3, the optimal trace for the processor computing v is derived (function optimal_trace). According to Lemma 3, one of these schedules is optimal for the subtree rooted at v. The algorithm uses the fact that R is uniquely defined by its leaves.

Theorem 1 (Correctness of Algorithm 1). Let G = (V, E, τ) be an inverse tree with root r. The schedule s_r computed by Algorithm 1 is an optimal k-linear schedule for G.
Proof: We prove by induction on |ANC_v| that for every v ∈ V, s_v is an optimal k-linear schedule. If |ANC_v| = 0, then v is a leaf and the claim is trivial. If v is not a leaf, the claim follows directly by the induction hypothesis and Lemma 3. By line (7) every send operation is scheduled L + o time steps earlier than the corresponding receive operation. Hence, a processor can perform at most L/o receive operations for every time interval of length L. Since the corresponding send operations start at the latest possible time, the capacity constraint is satisfied.

Theorem 2. Let G = (V, E, τ) be a tree. Algorithm 1 computes in time O(|V|^(k+2) log |V|) an optimal k-linear schedule for G.

Proof: For every v ∈ V, loop (3)–(8) is executed at most O(|ANC_v|^k) times, because the number of subsets of size at most k is O(n^k). Thus, if optimal_trace can be implemented by a polynomial algorithm, Algorithm 1 is polynomial. It is known that a simple greedy algorithm with time complexity O(|V| log |V|) computes an optimal schedule for a 1-machine scheduling problem [13], i.e. optimal_trace can be executed in time O(|ANC_v| · log |ANC_v|). Hence, the time complexity of loop (3)–(8) is O(|ANC_v|^(k+1) log |ANC_v|) by the above discussion. Summing over all v ∈ V yields the claimed time complexity of Algorithm 1.

Example: (Execution of Algorithm 1) We compute optimal 1- and 2-linear schedules for the inverse tree G = (V, E, τ) shown in Fig. 1(a) (where τ_v = 1 for all v ∈ V) with the parameters L = 2 and o = 1 (cf. Fig. 1(b) and (c)). Fig. 1(d) shows an optimal schedule. The schedules are visualized by Gantt charts. The x-axis corresponds to time, the y-axis to the processors. We place a box at (t, i) if processor P_i starts an operation at time t. The horizontal length of the box corresponds to the cost (i.e. τ_v if v is computed and o if it is a send or receive operation).
[Figure omitted: the task graph and Gantt charts of the corresponding schedules.]
Fig. 1. A Task Graph with optimal 1- and 2-linear Schedules: (a) a task graph, (b) an optimal 1-linear schedule, (c) an optimal 2-linear schedule, (d) an optimal schedule.
Table 1. Execution Times of Optimal k-linear Schedules s_v

               v ∈ {7, . . . , 14}   v ∈ {3, . . . , 6}   v ∈ {1, 2}   v = 0
  1-linear         t_v = 1              t_v = 6           t_v = 11     t_v = 16
  2-linear         t_v = 1              t_v = 3           t_v = 8      t_v = 12
  3-linear         t_v = 1              t_v = 3           t_v = 7      t_v = 11
  8-linear         t_v = 1              t_v = 3           t_v = 7      t_v = 11
Send and receive operations are drawn as dark boxes. White boxes denote the computation of a task. Table 1 shows the execution times of the optimal k-linear schedules for the tasks v ∈ V. All tasks on the same level of the tree have the same schedules up to renaming. The table also shows that optimal schedules for G have execution time 11. The optimal schedule for tasks 7–14 is obviously the computation of the task. The optimal 1-linear schedule for task 3 computes tasks 7 and 3 on one processor and task 8 on another processor. The optimal 2-linear schedule computes the tasks 7, 8 and 3 sequentially. For 1-linear schedules, an optimal trace for task 1 and task 0 starts at task 7 (time 11 and 16, respectively). For 2-linear schedules, an optimal trace for task 1 starts with task 7 (time 8), and an optimal trace for task 0 starts with tasks 7 and 2 (time 12). Obviously, an optimal n-linear schedule for an inverse tree G is optimal among all schedules. Thus, Algorithm 1 can be used to compute an optimal schedule. Although this problem is NP-hard and the algorithm is exponential, many subsets of V can be ignored, because they are not independent. As our example shows, using symmetry may reduce the number of subsets of V even further. E.g. if all tasks of the task graph in Fig. 1(a) have equal weights when they are on the same level, then only 66 subsets instead of 2^15 need to be examined. Although there are still 2^O(|V|) possibilities to be examined, the constant in the exponent is much smaller than 1, which makes Algorithm 1 more feasible than a general brute-force search. The generalization of Algorithm 1 to tree-like task graphs is straightforward: Let G = (V, E, τ) be a tree-like task graph and O be the set of its roots. Then an optimal schedule for G is obtained by scheduling separately the subgraph induced by ANC_o (which is a tree by definition of tree-like) for every o ∈ O according to Algorithm 1. Hence, Theorem 2 directly implies

Corollary 2. Let G = (V, E, τ) be a tree-like task graph. Then, an optimal k-linear schedule for G can be computed in time O(|V|^(k+3) log |V|).
5 Conclusion
We have shown that optimal k-linear schedules for trees and tree-like task-graphs can be computed in polynomial time on the LogP-machine with an unbounded number of processors, if o = g and k = O(1). For every task graph G that is an inverse tree, Algorithm 1 constructs optimal k-linear schedules for the subtrees Gv of G rooted at v, where the tasks are processed from the leaves to the root.
The main step is to construct an optimal k-linear schedule for G_v, when for all proper ancestors u of v the optimal k-linear schedules for G_u are already known. The subgraph of G_v induced by the tasks computed by the processor computing v is always a tree with at most k leaves. Algorithm 1 is polynomial because there is only a polynomial number of independent sets over the ancestors of v of size at most k, and for a given independent set an optimal trace can be computed in polynomial time. Thus, it is possible to consider all possibilities in polynomial time. The generalization of the approach for g > o is not straightforward. The principal approach remains the same, but computing the optimal traces is more difficult. For this, more normalization properties are required. [20] discusses some of those for 1-linear schedules. However, it is not clear whether these apply to k-linear schedules in general, and whether other normalization properties are required. [18,26] show that for every task graph G, 1-linear schedules guaranteeing the factor 1 + 1/γ of the optimal time can be computed in polynomial time, where γ is the granularity of G. Obviously, with increasing k the execution time of optimal k-linear schedules does not increase. In fact, the example in Section 4 shows that it may decrease drastically (in our example by 25%). This raises the question whether it is possible to compute k-linear schedules with better performance guarantees for increasing k.
References

1. F.D. Anger, J. Hwang, and Y. Chow. Scheduling with sufficient loosely coupled processors. Journal of Parallel and Distributed Computing, 9:87–92, 1990.
2. P. Chretienne. A polynomial algorithm to optimally schedule tasks over an ideal distributed system under tree-like precedence constraints. European Journal of Operations Research, 2:225–230, 1989.
3. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 93), pp. 1–12, 1993. Published in: SIGPLAN Notices (28) 7.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78–85, 1996.
5. S. Darbha and D.P. Agrawal. SBDS: A task duplication based optimal algorithm. In Scalable High Performance Conference, 1994.
6. B. Di Martino and G. Ianello. Parallelization of non-simultaneous iterative methods for systems of linear equations. In Parallel Processing: CONPAR 94 – VAPP VI, volume 854 of Lecture Notes in Computer Science, pp. 253–264. Springer, 1994.
7. J. Eisenbiegler, W. Löwe, and W. Zimmermann. Optimizing parallel programs on machines with expensive communication. In Europar '96 Parallel Processing Vol. 2, volume 1124 of Lecture Notes in Computer Science, pp. 602–610. Springer, 1996.
8. Jörn Eisenbiegler, Welf Löwe, and Andreas Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC'97), pp. 149–155. IEEE Computer Society Press, 1997.
9. A. Gerasoulis and T. Yang. On the granularity and clustering of directed acyclic task graphs. IEEE Transactions on Parallel and Distributed Systems, 4:686–701, June 1993.
10. J.A. Hoogreven, J.K. Lenstra, and B. Veltmann. Three, four, five, six or the complexity of scheduling with communication delays. Operations Research Letters, 16:129–137, 1994.
11. H. Jung, L. M. Kirousis, and P. Spirakis. Lower bounds and efficient algorithms for multiprocessor scheduling of directed acyclic graphs with communication delays. Information and Computation, 105:94–104, 1993.
12. I. Kort and D. Trystram. Scheduling fork graphs under LogP with an unbounded number of processors. Submitted to Europar '98, 1998.
13. E.L. Lawler. Optimal sequencing of a single machine subject to precedence constraints. Management Science, 19:544–546, 1973.
14. J.K. Lenstra, M. Veldhorst, and B. Veltmann. The complexity of scheduling trees with communication delays. Journal of Algorithms, 20:157–173, 1996.
15. W. Löwe and W. Zimmermann. On finding optimal clusterings in task graphs. In N. Mirenkov, editor, Parallel Algorithms/Architecture Synthesis pAs'95, pp. 241–247. IEEE, 1995.
16. W. Löwe and W. Zimmermann. Programming data-parallel – executing process parallel. In P. Fritzson and L. Finmo, editors, Parallel Programming and Applications, pp. 50–64. IOS Press, 1995.
17. W. Löwe and W. Zimmermann. Upper time bounds for executing PRAM-programs on the LogP-machine. In M. Wolfe, editor, Proceedings of the 9th ACM International Conference on Supercomputing, pp. 41–50. ACM, 1995.
18. Welf Löwe, Wolf Zimmermann, and Jörn Eisenbiegler. On linear schedules for task graphs for generalized LogP-machines. In Europar'97: Parallel Processing, volume 1300 of Lecture Notes in Computer Science, pp. 895–904, 1997.
19. W. Löwe, J. Eisenbiegler, and W. Zimmermann. Optimizing parallel programs on machines with fast communication. In 9th International Conference on Parallel and Distributed Computing Systems, pp. 100–103, 1996.
20. M. Middendorf, W. Löwe, and W. Zimmermann. Scheduling inverse trees under the communication model of the LogP-machine. Accepted for publication in Theoretical Computer Science.
21. C.H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322–328, 1990.
22. C. Picouleau. New complexity results on the uet-uct scheduling algorithm. In Proc. Summer School on Scheduling Theory and its Applications, pp. 187–201, 1992.
23. J. Siddhiwala and L.-F. Cha. Path-based task replication for scheduling with communication cost. In Proceedings of the International Conference on Parallel Processing, volume II, pp. 186–190, 1995.
24. J. Verriet. Scheduling tree-structured programs in the LogP-model. Technical Report UU-CS-1997-18, Dept. of Computer Science, Utrecht University, 1997.
25. T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994.
26. W. Zimmermann and W. Löwe. An approach to machine-independent parallel programming. In Parallel Processing: CONPAR 94 – VAPP VI, volume 854 of Lecture Notes in Computer Science, pp. 277–288. Springer, 1994.
Static Scheduling Using Task Replication for LogP and BSP Models

Cristina Boeres¹, Vinod E. F. Rebello¹, and David B. Skillicorn²
¹ Departamento de Ciência da Computação, Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil, {boeres,vefr}@pgcc.uff.br
² Computing and Information Science, Queen's University, Kingston, Canada, [email protected]
Abstract. This paper investigates the effect of communication overhead on the scheduling problem. We present a scheduling algorithm, based on LogP-type models, for allocating task graphs to networks of processors. The makespans of schedules produced by our multi-stage scheduling approach (MSA) are compared with other well-known scheduling heuristics. The results indicate that new classes of scheduling heuristics are required to generate efficient schedules for realistic abstractions of today's parallel computers. The scheduling strategy of MSA can also be used to generate BSP-structured programs from more abstract representations. The performance of such programs is compared with “conventional” versions.
1 Introduction
Programmers face daunting problems when attempting to achieve efficient execution of parallel programs on parallel computers. There is a considerable range in communication performance on the platforms currently in use. More importantly, the sources of performance penalties are varied, so that the best implementation of a program on two different architectures may differ widely in its form. Understandably, programmers wish to build programs at a level of abstraction where these variations can be ignored, and have the program automatically transformed to an appropriate form for each target. One common abstraction is a directed acyclic graph whose nodes denote tasks, and whose edges denote data dependencies (and hence communication, if the tasks are mapped to distinct processors). There are two related issues in transforming this DAG: clustering the tasks that will share each processor, and arranging the communication actions of the program to optimally use the communication substrate of the target architecture. Clustering tasks reduces communication costs, but may also reduce the available parallelism. The objective, therefore, is to find the optimal task granularity with respect to the communication characteristics of the target machine. In task clustering without replication, tasks are partitioned into disjoint sets or clusters and exactly one copy of each task is scheduled. In task clustering with
duplication, a task may have several copies belonging to different clusters, each of which is scheduled independently. In architectures with high communication costs, this often produces schedules with smaller total execution times [12]. The task scheduling (or clustering) problem, with or without task duplication, has been well studied and is known to be NP-hard [15]. A more complicated issue is the choice of a model of the target’s communication system. The standard model in the scheduling community is the delay model, where the sole architectural parameter for the communication system is the latency, that is the transit time for each message [15]. A scheduling technique based on latency is unlikely to produce good schedules on most of today’s popular parallel computers because it assumes a fully-connected network. In practice, latency depends heavily on congestion in the shared communication paths. Furthermore, the dominant cost of communication in today’s architectures is that of crossing the network boundary, a cost that arises partly from the protocol stack and partly from interaction with other messages leaving and entering the processor at about the same time. In practice, the latter dominates the communication cost [10] and cannot be modelled as a latency. The delay model also assumes that communication and computation can be completely overlapped. This may not be possible because: the CPU may have to manage, or at least initiate communication (even when the details are handled by a network interface card); the CPU may have to copy data from user space to kernel space; and it may be desirable to postpone some communication to allow messages to be packed together, or routed in a congestion-free way [10]. These architectural properties have motivated new parallel programming models such as LogP [6] and BSP [16], both of which have accurate cost models. Given a program, it is possible to accurately predict its runtime given a few program and architecture parameters. However, both models require writing programs in a slightly awkward way, and so it is useful to consider using a scheduling technique to map more general programs into the required structured form. The LogP model [6] is an MIMD message-passing model with four architectural parameters: the latency, L; the overhead, o, the CPU penalty incurred by a communication action; the gap, g, a lower bound on the interval between two successive communication actions, designed to prevent overrun of the communication capacity of the network (and hence related to the inverse of the network’s bisection bandwidth); and the number of processors, P . Because the parameter g depends on the target architecture, it is hard to write LogP programs in a general and architecture-independent way. Nevertheless, LogP -based predictions of runtimes have been confirmed in practice [6,7]. BSP is similar, except that it takes a less prescriptive view of communication actions. Programs are divided into supersteps which consist of three sequential phases: local computation, communication between processors, and a barrier synchronisation, after which communicated data becomes visible on destination processors. The architecture parameters are p, the number of processors, g the inverse of the effective network bandwidth, and l the time taken for a barrier
synchronisation. The g parameter is significantly different from that of the LogP model despite their superficial similarity. BSP does not limit the rate at which data are inserted into the network. Instead, g is measured by inserting traffic to uniformly-random destinations as fast as possible and measuring how long it takes to be delivered in the steady state (note that this includes an accounting for latency). In a superstep where each processor spends time w_i in computation, and sends and receives some quantity of data h_i, the BSP cost model charges max_i w_i + (max_i h_i)·g + l for the superstep. The units of g are instruction execution times per word, and of l are instruction execution times, so that the sum is sensible. This cost model has been shown to be extremely accurate [8,9]. Note that the second term accounts for the fact that the cost of communication is dominated by fan-in and fan-out, while the inability of many architectures to overlap computation and communication is taken into account by summing the first two terms, rather than taking their maximum. To the best of our knowledge, few, if any, scheduling algorithms exist for communication models which consider such parameters as those discussed above. Löwe et al. [13] proposed a clustering algorithm with task replication for the LogP model, but it only operates on restricted types of algorithms. ETF [11] and DSC [17], which do not use task replication, and PY [15] and PLW [14], which do, are well-known and effective scheduling algorithms based on the delay model. When applied to parallel computers, delay model approaches implicitly assume two properties: the ability to multicast, i.e. to send a number of different messages to different destinations simultaneously; and that the communication overhead and task computation on a processor can completely overlap, i.e. communication does not require computation time. Neither assumption is particularly realistic, as discussed earlier. In this paper, we briefly present a versatile multi-stage scheduling approach (MSA) for general DAGs on LogP-type machines (with bounded numbers of processors) which can also generate BSP-structured schedules. MSA considers the following parameters, adopted by the Cost and Latency Augmented DAG (claud) model [5] (developed independently of LogP). In the claud model, a DAG G represents the application and P is a set of m identical processors. The overhead associated with the sending and receiving of a message is denoted by λ_s and λ_r, respectively, and the communication latency or delay by τ. The claud model very naturally models LogP, given the following parameter relationships: o = (λ_s + λ_r)/2, g = λ_s and L = τ (LogP only considers fully connected networks), P = m, and BSP with h = λ_s + λ_r. (Note the delay model is an instance of these models).
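As a small illustration of the superstep charge just described (the numbers and the function name below are ours, not the paper's):

```python
def bsp_superstep_cost(w, h, g, l):
    """BSP charge for one superstep: max computation + max fan-in/fan-out * g + l.

    w: per-processor computation times (instruction units)
    h: per-processor number of words sent plus received
    g: inverse effective bandwidth (instruction times per word)
    l: barrier synchronisation cost (instruction times)
    """
    return max(w) + max(h) * g + l

# Hypothetical superstep on four processors.
w = [100, 80, 120, 90]
h = [10, 40, 25, 5]
print(bsp_superstep_cost(w, h, g=4.0, l=200.0))   # 120 + 40*4.0 + 200 = 480.0
```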
2 A Multi-stage Scheduling Approach
The Multi-stage Scheduling Approach (MSA) currently consists of two main stages: the first transforms the input program DAG into one whose
characteristics (e.g. granularity, message lengths) are better suited to the target architecture; and the second stage maps this resulting DAG onto the target architecture. Each stage consists of two algorithms which are described in greater detail in [1,3]. A parallel application is represented by a directed acyclic graph (DAG) G = (V, E, ε, ω), where V is the set of tasks, E is the precedence relation among them, ε(v_i) is the execution cost of task v_i ∈ V and ω(v_i, v_j) is the weight associated with the edge (v_i, v_j) ∈ E representing the amount of data units transmitted from v_i to v_j. Presently, MSA only considers input DAGs whose vertices have unit execution costs and whose edges have unit weight. For the duration of each communication overhead (known here as a communication event), the processor is effectively blocked, unable to execute other tasks in V or even to send or receive other messages. Therefore, these sending and receiving overheads can also be viewed as “tasks” executed by processors. In parallel architectures with high overheads, the number of communication events can be reduced by bundling many small messages into fewer, larger ones. This can be achieved by clustering together tasks with common ancestors or successors and restricting the necessary communication events of the cluster to take place only before and after the execution of all of the tasks belonging to that cluster. This allows messages for the same destination processor, from various tasks within the cluster, to be combined and sent as one. In contrast, other clustering techniques (for example [17]) allow communication to occur as soon as each task within the cluster finishes its execution. This leads to a subtle difference between these clustering approaches: the latter only reduces the total communication cost through the elimination of messages, by allocating the source and destination tasks to the same processor; the former tries to minimise the cost of sending the remaining messages as well.

The First Stage. This stage attempts to transform the input DAG G into one which will have better communication behaviour on the target architecture. The formation of clusters, each containing a task v_i of G and replicated copies of some of its ancestors, minimises communication in three ways: no cost is incurred for communication within the cluster; the bundling of messages to a common destination reduces the number of times that overheads are incurred; and the execution of the replicated tasks within the cluster can hide some of the communication cost between v_i and ancestors outside the cluster. The degree of replication, determined by characteristics of both the input DAG and the target architecture, is represented by a single parameter γ, the clustering factor. Determining the best value of γ, i.e. the value that leads to the smallest makespan, is discussed elsewhere [1,2]. This clustering process is based on the replication principle (also adopted by PY [15]) and on the latest-finish time of a task, i.e. the latest time that a cluster can finish and still have the schedule achieve the minimum makespan assuming all communication costs are γ. The result of this replication algorithm is a set of n clusters of tasks (n being the number of tasks in G). Every cluster contains γ + 1 tasks except those in the first band which may have fewer. Furthermore, the latest finish time of every
cluster is a multiple of γ + 1. Therefore, groups of parallel clusters exist in non-overlapping bands or computation intervals. Later, gaps known as communication intervals will be created between adjacent bands, in which only communication event tasks will be scheduled. For target architectures where the overheads of communication are much higher than the delay, minimising the in- and out-degrees of the clusters is essential for effective scheduling. When overheads are negligible and communication cost is dominated by the delay, it is better to minimise the length of messages. A second algorithm is now applied to specify the appropriate dependencies between required clusters taking into consideration the communication characteristics of the target architecture. The problem of reducing the communication costs of the schedule has now been simplified to iteratively considering the communication costs between at most three adjacent computation intervals [1]. For each computation band, this involves finding the combination of immediate ancestor clusters (containing the required immediate ancestor tasks of all tasks in the current band) which will incur the smallest communication cost. The exact solution is computationally expensive and instead a heuristic approach has been proposed [1,3].

The Second Stage. This stage performs a cluster merging step to map each cluster's distinct virtual processor to a physical one, minimising communication costs by applying Brent's Principle [4]. Communication is not permitted during execution of a cluster. When a relatively small overhead cost is associated with sending and receiving events, the bundling of messages can actually increase the makespan. In such cases, the edges of the final DAG can be unbundled and the clusters opened to allow messages to be sent after each task is executed. Once the supertasks have been allocated to the physical processors, excess copies of tasks allocated to the same processor are removed and the final schedule is defined for the new DAG based on the parameters λ_s, λ_r and τ. The required communication event tasks can now be allocated and the start times of all tasks estimated.
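To illustrate why bundling pays off when the per-message overheads dominate (a toy calculation of ours, not an example from the paper): viewing each sending and receiving overhead as a task, combining k messages to the same destination into one replaces k pairs of overhead tasks by a single pair.

```python
def overhead_time(num_messages, lambda_s, lambda_r):
    """Total processor time spent on communication-event 'tasks'
    (sending plus receiving overheads) for num_messages messages."""
    return num_messages * (lambda_s + lambda_r)

lambda_s, lambda_r = 5, 5   # hypothetical high per-message overheads
print(overhead_time(8, lambda_s, lambda_r))   # 80: eight separate messages
print(overhead_time(1, lambda_s, lambda_r))   # 10: the same results bundled into one message
```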
3 The Structure of BSP Programs
To write efficient BSP programs a good strategy is to balance the computation across the processes; balance the communication fan-in and fan-out; and minimise the number of supersteps, since this determines the number of times l is incurred. This problem is similar to the problem of clustering for task scheduling in the sense that the number of supersteps can be reduced by increasing the amount of local computation carried out in a process. However, this also reduces the amount of parallelism and thus, just as the scheduling problem tries to balance parallelism against communication, here parallelism has to be balanced against communication and synchronisation. From the previous discussion of MSA, it is clear that there are similarities in the structure of efficient programs for both BSP and LogP -based models. BSP’s
notion of local computation is most accurately captured by the cluster definition of MSA since communication only occurs after all of the tasks within the cluster have been executed. The cost of the superstep computation, wi , in BSP is equivalent to the computation band of constant cost γ + 1 in MSA. Since each cluster in the band has the same size, balanced computation is achieved through task replication – a novel feature for BSP programs. The communication and synchronisation costs are represented by the communication interval in MSA. The exploitation of these similarities suggests the following. An approach similar to that adopted by MSA can be used to generate efficient BSP versions of MIMD programs automatically. The efficient execution of programs depends on their schedule which, in turn, is influenced by architectural parameters. Any change of target architecture is likely to require a change to the program structure i.e. the superstep size. BSP emphasises architecture-independence and portability of programs across architectures. If MSA is able to generate effective program schedules whose costs are known to programmers, many of the benefits of the BSP approach can be maintained in a more general programming setting. Further, if MSA generates schedules with good execution times for general-purpose programs on general parallel machines then this implies that the superstep discipline is not too restrictive. This might be expected to increase the use and acceptance of BSP, particularly for irregular problems where superstep structure is sometimes seen as overly restrictive.
4 The Influence of the Overheads on Task Scheduling
The impact of LogP-type parameters on the scheduling of tasks can be analysed by comparing the makespans generated by various scheduling approaches for a suite of (in this case, unit-execution-task, unit-data-communication) DAGs over a range of values for communication overheads and latency. Table 1 presents three scheduling approaches with a selection of DAGs which include: an irregular DAG based on a molecular physics application (Ir41); two random graphs with 80 and 186 tasks (Ra80, Ra186); a binary in- and out-tree of 511 tasks (It511, Ot511); and a diamond DAG (Di400). The application of MSA to a larger variety of DAGs with differing sizes and characteristics (including: coarse- and fine-grain graphs; regular DAGs such as binary trees and diamond DAGs; and irregular and random graphs) is presented in [1,2,3]. Table 1 also presents the makespans of MSA schedules generated under the BSP model in order to investigate the additional costs incurred by the superstep discipline. For the LogP-type architectures, MSA is compared with two recently published scheduling algorithms (DSC [17] and PLW [14]), both of which have been proved to generate schedules within a factor of two of the optimal under the delay model. While the delay-model-based PLW ignores the overhead parameters (o, or λ_s and λ_r), DSC is in fact based on an extension of this model (the linear model) and treats the overhead as an additional latency parameter.
One objective of this paper is to show that for architectures with LogP-type characteristics, even scheduling strategies which simply add the overhead parameter to the latency do not perform well. Furthermore, in practice, the execution time of the schedule produced can be much worse than predicted. The actual execution times (ATs) of the schedules produced by PLW, DSC and MSA were measured on a discrete event driven simulation of a parallel machine. The main reason for adopting a simulator is to facilitate the study of the relationship between the various parameters and the effects of hardware and operating system techniques to improve message-passing performance.

Table 1. A comparison of predicted and simulated makespans for the DSC, PLW and MSA scheduling heuristics. λ_s and λ_r are the sending and receiving overheads, respectively, and τ is the latency. The number of processors required by the schedule is represented by Prs, l is the hardware cost of barrier synchronisation and γ the clustering factor.

[Table data omitted: for each DAG (Ir41, Ra80, Ra186, It511, Ot511, Di400) and each combination of λ_s, λ_r and τ, the table lists the number of processors (Prs), the predicted makespan (PT) and the simulated makespan (AT) for DSC, PLW and MSA, as well as for MSA under BSP with l = 0 and with l = 2.]
Under the delay model (where λ_s = λ_r = 0), the performance of both DSC and PLW for the DAG Ir41 progressively worsens in comparison with MSA as the DAG effectively becomes more fine-grained due to the increasing latency value. Considering the other graphs under this model, one can argue that the improvements in the makespans for PLW over DSC are due mostly to the benefits of task replication, which very much depend on the graph structure (e.g. compare Ra80 and Ot511). Although MSA also employs replication, it generally utilises fewer processors to greater effect. As expected under this model, the three approaches predict the execution times of their schedules quite accurately. Under LogP-type models (i.e. when the overheads are nonzero), DSC and PLW perform poorly on two counts: the ATs of the generated makespans in many cases are worse than the single-processor execution time; and the predictions are not very accurate. In the case of DSC, this is partly because the overhead cost is added to the latency, thus specifying that communication events can be overlapped with task computation. PLW produces the same schedule under both the LogP and delay models, and thus even poorer predictions (PTs), because it ignores overheads altogether. While adopting the linear model for PLW may improve matters with respect to the ATs, both PLW and DSC still apply the fundamental assumptions of the delay model which no longer hold on LogP-type machines; e.g. because of the multicasting assumption they tend to create clusters with large out-degrees. The predicted makespans (PTs) of the schedules produced by MSA are, on the whole, quite reasonable although there is scope for improvement. Compared to the predictions for the BSP model, those for the LogP model are in general less accurate. The reason is that it is difficult to predict the communication behaviour since, for example, the order in which messages are sent from a cluster is not defined, or at least not under the control of the schedule. This effect, hidden by the barrier synchronisations, is more pronounced when strict banding is relaxed. Under BSP, a barrier synchronisation occurs after each pair of computation and communication intervals. In the simulation, the parameter l only models the “system” cost of signalling that all processors have reached the barrier. For different graphs, the reason why the ATs of the LogP schedules are in some cases much better than those of the BSP ones (especially under delay model conditions) is that MSA considered it better to uncluster the supertasks. For the BSP model this is not possible, so MSA attempts to counter this effect by usually increasing γ (compare the values used by MSA with those of the two MSA_BSP schedules for the graph Ir41). A larger γ may sometimes lead to fewer bands with fewer but larger clusters of smaller in- or out-degree. In the entries where γ and the number of processors are the same for the MSA and MSA_BSP schedules, the difference in ATs represents the total cost of the barriers. On the whole, it seems that the MSA scheduling approach does manage to produce reasonable DAG transformations (and thus schedules) for the BSP model, particularly under LogP model conditions. Note also that the ATs of the BSP schedules are better than those of DSC and PLW when the delay is large and particularly when the overheads are nonzero.
5 Conclusions
A scheduling strategy has been proposed to produce good schedules for a range of parallel computers, including those in which the overheads due to sending and receiving are relatively high compared to the latency. When these communication events (the sending and receiving overheads) are considered, communication is not totally overlapped with the computation of tasks, as assumed in a number of scheduling heuristics. In the case where overheads are significant, task replication is exploited to decrease the number of communication events. In an attempt to decrease these further, messages are bundled into fewer, larger ones. MSA effectively transforms the structure of the input DAG into one which is more amenable to allocation on the target architecture. For realistic models (LogP and BSP), initial results indicate that MSA generates better schedules than some well-known conventional delay-model-based approaches. This suggests that a new class of scheduling heuristics is required to generate efficient schedules for today's parallel computers. The definition of computational and communication intervals facilitates both a more precise prediction of the number of communication events, especially when the bundling of messages is necessary, and the exploitation of parallelism. Note that the main objective when defining such intervals is not to specify synchronisation points but rather to easily identify potential parallel clusters for efficient allocation onto physical processors and for the scheduling of communication event tasks. In the final LogP schedule, these clearly defined intervals may no longer exist since unnecessary idle times will be removed. For the BSP model, however, these intervals are maintained in the schedule. Under this model, although barrier synchronisation is an additional cost to be incurred by programs, MSA is able to reduce its impact by appropriately determining the sizes and the number of supersteps. Given that, for LogP-type machines, the makespans of BSP-structured schedules are not significantly worse than conventional ones and that their predictions are relatively accurate, it seems that the superstep discipline is not too restrictive and that an approach to generating BSP-structured programs using a scheduling strategy such as MSA could be advantageous to programmers.
Acknowledgements The authors would like to thank Tao Yang and Jing-Chiou Liou for providing the DSC and PLW algorithms, respectively, and for their help and advice.
References 1. C. Boeres. Versatile Communication Cost Modelling for Multicomputer Task Scheduling Heuristics. PhD thesis, Department of Computer Science, University of Edinburgh, May 1997. 340, 341, 342
2. C. Boeres and V. E. F. Rebello. Versatile task scheduling of binary trees for realistic machines. In M. Griebl C. Lengauer and S. Gorlatch, editors, Proceedings of the 3rd International Euro-Par Conference on Parallel Processing (Euro-Par’97), LNCS 1300, pages 913–921, Passau, Germany, August 1997. Springer-Verlag. 340, 342 3. C. Boeres and V. E. F. Rebello. A versatile cost modelling approach for multicomputer task scheduling. Accepted for publication in Parallel Computing, 1998. 340, 341, 342 4. R. P. Brent. The parallel evaluation of general arithmetic expressions. J. ACM, 21:201–206, 1974. 341 5. G. Chochia, C. Boeres, M. Norman, and P. Thanisch. Analysis of multicomputer schedules in a cost and latency model of communication. In Proceedings of the 3rd Workshop on Abstract Machine Models for Parallel and Distributed Computing, Leeds, UK., April 1996. IOS press. 339 6. D. Culler et al. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993. 338 7. B. Di Martino and G. Ianello. Parallelization on non-simultaneous iterative methods for systems of linear equations. In Parallel Processing (CONPAR’94-VAPP VI), LNCS 854, pages 254–264. Springer-Verlag, 1994. 338 8. J. M. D. Hill, P. I. Crumpton, and D. A. Burgess. The theory, practice, and a tool for BSP performance prediction. In Proceedings of the 2nd International Euro-Par Conference on Parallel Processing (Euro-Par’96), volume 1124 of LNCS, pages 697–705. Springer-Verlag, August 1996. 339 9. J. M. D. Hill, S. Donaldson, and D. B. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Oxford University Computing Laboratory, October 1997. 339 10. J. M. D. Hill and D. B. Skillicorn. Lessons learned from implementing BSP. In High-Performance Computing and Networks, Springer Lecture Notes in Computer Science Vol. 1225, pages 762–771, April 1997. 338 11. J-J. Hwang, Y-C. Chow, F. D. Anger, and C-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal of Computing, 18(2):244–257, 1989. 339 12. H. Jung, L. Kirousis, and P. Spirakis. Lower bounds and efficient algorithms for multiprocessor scheduling of DAGs with communication delays. In Proc. ACM Symposium on Parallel Algorithms and Architectures, pages 254–264, 1989. 338 13. W. Lowe, W. Zimmermann, and J. Eisenbiergler. Optimization of parallel programs on machines with expensive communication. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Proceedings of the 2nd International EuroPar Conference on Parallel Processing (Euro-Par’96), LNCS 1124, pages 602–610, Lyon, France, August 1996. Springer-Verlag. 339 14. M. A. Palis, J-C. Liou, and D. S. L. Wei. Task clustering and scheduling for distributed memory parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 7(1):46–55, January 1996. 339, 342 15. C. H. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal of Computing, 19:322–328, 1990. 338, 339, 340 16. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249–274, 1997. 338
17. T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994. 339, 340, 342
Aspect Ratio for Mesh Partitioning Ralf Diekmann1 , Robert Preis1 , Frank Schlimbach2 , and Chris Walshaw2 1
University of Paderborn, Germany, {diek@,preis@hni.}uni-paderborn.de 2 University of Greenwich, London, UK, {sf634,C.Walshaw}@gre.ac.uk Abstract. This paper deals with the measure of Aspect Ratio for mesh partitioning and gives hints why, for certain solvers, the Aspect Ratio of partitions plays an important role. We define and rate different kinds of Aspect Ratio, present a new center-based partitioning method which optimizes this measure implicitly and rate several existing partitioning methods and tools under the criterion of Aspect Ratio.
1 Introduction
Almost all numerical scientific simulation codes belong to the class of data-parallel applications: their parallelization executes the same kind of operations on every processor but on different parts of the data. This requires the partitioning of the mesh into equal-sized subdomains as a preprocessing step. Together with the additional constraint of minimizing the number of cut edges (i.e. minimizing the total interface length in the case of FE-mesh partitioning), the problem is NP-complete. Thus, a number of graph (mesh) partitioning heuristics have been developed in the past (e.g. [8]), and used in practical applications. Most of the existing graph partitioning algorithms optimize the balance of subdomains plus the number of cut edges, the cut size. This is sufficient for many applications, because an equal balance is necessary for a good utilization of the processors of a parallel system and the cut size indicates the amount of data to be transferred between different steps of the algorithms. The efficiency of parallel versions of global iterative methods like (relaxed) Jacobi, Conjugate Gradient (CG) or Multigrid is (mainly) determined by these two measures. But if the decomposition is used to construct pre-conditioners (DD-PCG) [1,2], cut and balance may no longer be the only efficiency-determining factors. The shape of subdomains heavily influences the quality of pre-conditioning and, thus, the overall execution time [10]. First attempts at optimizing the Aspect Ratio of subdomains weigh elements depending on their distance from a subdomain's center (e.g. [5]) and include this weight into the cost function of local iterative search algorithms like the Kernighan-Lin heuristic [6]. In the following section we will define the Aspect Ratio of subdomains and give hints why it can improve the overall execution time of domain decomposition methods. Section 3 introduces a new mesh partitioning heuristic which implicitly optimizes the Aspect Ratio, and Section 4 finishes with experimental results. Further information on this topic can be found in [4].
2 Aspect Ratios
An example showing that cut size is not always the right measure in mesh partitioning can be found in Fig. 1. The sample mesh is partitioned into two parts with different ARs and different cuts. A Poisson problem with homogeneous Dirichlet-0 boundary conditions is solved with DD-PCG. It can be observed that the number of iterations is determined by the Aspect Ratio and that the cut size (number of neighboring elements) would be the wrong measure.
Fig. 1. Aspect Ratio vs. Cut Size. (The two partitions of the sample mesh have cut = 2, AR = 1.56, #Its = 8 and cut = 4, AR = 1.00, #Its = 5.)
Fig. 2. Different definitions of Aspect Ratio AR: Lmax/Lmin, Ro²/Ri², Ro²/A and C²/(16A).
Possible definitions of AR can be found in Fig. 2. The first two are motivated by common measures in triangular mesh generation, where the quality of triangles is expressed either as Lmax/Lmin (longest to shortest boundary edge) or as Ro²/Ri² (area of the smallest circle containing the domain to area of the largest inscribed circle). The definition AR = Ro²/Ri² expresses the fact that circles are perfect shapes. Unfortunately, the circles are quite expensive to find for arbitrary polygons: using Voronoi diagrams, both can be determined in O(2n log n) steps, where n is the number of nodes of the polygon. AR = Ro²/A (A is the area of the domain) is another measure favoring circle-like shapes. It still requires the determination of the smallest outer circle but turns out to be better in practice. We can go a further step and approximate Ro by the length C of the boundary of the domain (which can be determined fast and updated incrementally in O(1)). The definition AR = C²/(16A) additionally assumes that squares are perfect domains. Circles offer a better circumference/area ratio but force neighboring domains to become non-convex (Fig. 3). For a sub-domain with area A and circumference C, AR = C²/(16A) is the ratio between the area of a square with circumference C and A. For irregular meshes and partitions, the first definition AR = Lmax/Lmin does not express the shape properly. Fig. 4 shows an example: P1 is perfectly shaped, but as the boundary towards P4 is very short, Lmax/Lmin is large. The circle-based measures usually fail to rate complex boundaries like "zig-zags" or inscribed corners. Fig. 5 shows examples which each have the same AR but are very different in shape.
Fig. 3. Circles as perfect shapes?
Fig. 4. Problems of AR := Lmax/Lmin.
Fig. 5. Problems of AR := Ro²/Ri².
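Two of these measures, Lmax/Lmin and C²/(16A), can be evaluated directly from the boundary of a subdomain. The following Python sketch is purely illustrative (the polygon representation and the function name are ours, not part of any of the tools discussed); it computes both measures for a subdomain given as a simple polygon:

    import math

    def aspect_ratios(poly):
        # Illustrative only: computes Lmax/Lmin and C^2/(16A) for a simple,
        # non-self-intersecting polygon given as an ordered list of (x, y) vertices.
        n = len(poly)
        twice_area = 0.0          # shoelace formula accumulates 2*A
        edge_lengths = []
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            twice_area += x1 * y2 - x2 * y1
            edge_lengths.append(math.hypot(x2 - x1, y2 - y1))
        A = abs(twice_area) / 2.0                            # area of the subdomain
        C = sum(edge_lengths)                                # length of its boundary
        ar_edges = max(edge_lengths) / min(edge_lengths)     # Lmax / Lmin
        ar_square = C * C / (16.0 * A)                       # C^2 / (16A); equals 1 for a square
        return ar_edges, ar_square

    # A unit square yields (1.0, 1.0); the 4 x 1 rectangle below yields (4.0, 1.5625).
    print(aspect_ratios([(0, 0), (4, 0), (4, 1), (0, 1)]))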
3 Mesh Partitioning
The task of mesh partitioning is to divide the set of elements of a mesh into a given number of parts. The usual optimization criteria are the balance and the cut, which is the number of crossing edges of the element graph (the element graph expresses the data-dependencies between elements in the mesh). The calculation of a balanced partition with minimum cut is NP-complete. We will consider methods included in the tools JOSTLE [11], METIS [7] and PARTY [9]. The coordinate partitioning (COO) method cuts the mesh by a straight line perpendicular to the axis of the longest elongation such that both parts have an equal number of elements, resulting in a stripe-wise partition. If used recursively (COO R), it usually results in a more box-wise partition. There are many examples of greedy methods, in which one element after another is added to a subdomain. We will consider the variations in which the next element to be added is either chosen in a breadth-first manner (BFS) or as the one which increases the cut least of all (CFS). The new method Bubble (BUB) is designed to create compact domains, and the idea is to grow subdomains simultaneously from different seed elements. A partition is represented by a set of (initially random) seeds, one for each part. Starting from the seeds, the parts are grown in a breadth-first manner until all elements are assigned to a subdomain. Each subdomain then computes its center based on the graph distance measure, i.e. it determines the element which is "nearest" to all others in the subdomain. These center elements are taken as new seeds and the process is started again. The iteration stops if the seeds do not move any more, or if an already evaluated configuration is visited again. In general, the method stops after few iterations and produces connected and very compact parts, i.e. the shape of the parts is naturally smooth. As a drawback, the parts do not have to contain the same number of elements. Therefore we integrated a global load balancing step based on the diffusion method [3] as post-processing, trying to optimize either the cut (BUB+CUT) or the Aspect Ratio (BUB+AR).
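A minimal Python sketch of this Bubble iteration is given below. It is only an illustration of the idea: the element graph is assumed to be an adjacency dictionary, the "center" of a part is taken here to be the element of smallest eccentricity within the part, and the diffusion-based balancing post-processing is omitted.

    import random
    from collections import deque

    def bubble_partition(adj, k, max_iter=20):
        # Sketch of the Bubble iteration on an element graph given as an
        # adjacency dict {element: [neighbouring elements]}.
        seeds = random.sample(list(adj), k)
        part = {}
        for _ in range(max_iter):
            # Grow all k parts simultaneously, breadth-first, from the seeds.
            part = {s: i for i, s in enumerate(seeds)}
            frontier = deque(seeds)
            while frontier:
                v = frontier.popleft()
                for w in adj[v]:
                    if w not in part:
                        part[w] = part[v]
                        frontier.append(w)
            # Take as new seed of each part the element "nearest" to all others
            # (here: the element of smallest eccentricity within the part).
            new_seeds = [part_center([v for v in adj if part.get(v) == i], adj)
                         for i in range(k)]
            if new_seeds == seeds:          # the seeds stopped moving
                break
            seeds = new_seeds
        return part

    def part_center(members, adj):
        inside = set(members)
        def eccentricity(src):
            dist = {src: 0}
            queue = deque([src])
            while queue:
                v = queue.popleft()
                for w in adj[v]:
                    if w in inside and w not in dist:
                        dist[w] = dist[v] + 1
                        queue.append(w)
            return max(dist.values())
        return min(members, key=eccentricity)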
350
Ralf Diekmann et al.
We also implemented the meta-heuristic Simulated Annealing (SA) for the shape optimization and used AR = C²/(16A) as the optimization function. If the parameters are chosen carefully, SA is able to find very good solutions. Unfortunately, it also requires a very large number of steps.
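A bare-bones version of such an annealer is sketched below, purely as an illustration: the cost function is supplied by the caller (for instance the sum of C²/(16A) over all subdomains), a move reassigns one element to a part owned by one of its neighbours, and the linear cooling schedule, the move selection and the omission of balance handling are our own simplifying assumptions rather than the tuned settings used in the paper.

    import math
    import random

    def sa_shape_optimization(elements, adj, init_part, cost, steps=100000, temp0=1.0):
        # Toy Simulated Annealing loop: `cost` maps a partition (dict element -> part)
        # to a real number; smaller is better.  Incremental cost updates are omitted,
        # so every step re-evaluates the cost from scratch.
        part = dict(init_part)
        current = cost(part)
        best, best_cost = dict(part), current
        for step in range(steps):
            temp = temp0 * (1.0 - step / steps) + 1e-12      # linear cooling
            v = random.choice(elements)
            candidates = {part[w] for w in adj[v]} - {part[v]}
            if not candidates:
                continue
            old = part[v]
            part[v] = random.choice(list(candidates))        # move v to a neighbouring part
            new = cost(part)
            if new <= current or random.random() < math.exp((current - new) / temp):
                current = new                                # accept (possibly uphill) move
                if new < best_cost:
                    best, best_cost = dict(part), new
            else:
                part[v] = old                                # reject the move and undo it
        return best, best_cost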
4 Results
We compare the described approaches with respect to the number of global iterations, the cut size and the Aspect Ratio in Figs. 6 and 7. Methods 1-7 are included in PARTY and 8-11 are the default settings of the tools. The partitions are listed by their method numbers in order of increasing number of iterations.
Fig. 6. Results of example turm (531 elements) into 8 parts.
Fig. 7. Results of example cooler (749 elements) into 8 parts.
(Both figures plot, for each of the methods 1:COO, 2:COO R, 3:BFS, 4:CFS, 5:BUB, 6:BUB+CUT, 7:BUB+AR, 8:PARTY, 9:JOSTLE, 10:KMETIS, 11:PMETIS and 12:SA, the number of global iterations, the cut, and the Aspect Ratios Lmax/Lmin, Ro/Ri, Ro/A and C²/(16A).)
The Aspect Ratios Lmax/Lmin and Ro²/Ri² do not follow the iteration counts, whereas the values of Ro²/A and C²/(16A) roughly follow the number of iterations. A low cut also leads to a fairly low number of iterations. The coordinate and greedy methods usually result in very high iteration numbers, whereas the bubble variations result in adequate ones. Very low iteration numbers can be observed for the tools and Simulated Annealing.
References 1. S. Blazy, W. Borchers, and U. Dralle. Paralleliziation methods for a characteristic’s pressure correction scheme. In Flow Simulation with High-Perf. Comp. II, 1995. 347 2. J.H. Bramble, J.E. Pasciac, and A.H. Schatz. The construction of preconditioners for elliptic problems by substructering i.+ii. Math. Comp., 47+49, 1986+87. 347 3. G. Cybenko. Load balancing for distributed memory multiprocessors. J. Par. Distr. Comp., 7:279–301, 1989. 349 4. R. Diekmann, F. Schlimbach, and C. Walshaw. Quality balancing for parallel adaptive fem. In IRREGULAR, LNCS. Springer, 1998. 347 5. N. Chrisochoides et.al. Automatic load balanced partitioning strategies for pde computations. In Int. Conf. on Supercomp., pages 99–107, 1989. 347
6. C. Farhat, N. Maman, and G. Brown. Mesh partitioning for implicit computations via iterative domain decomposition. Int. J. Num. Meth. Engrg., 38:989–1000, 1995. 347 7. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report 95-035, University of Minnesota, 1995. 349 8. B.W. Kernighan and S. Lin. An effective heuristic procedure for partitioning graphs. The Bell Systems Technical Journal, pages 291–308, Feb 1970. 347 9. R. Preis and R. Diekmann. The party partitioning-library, user guide, version 1.1. Technical Report TR-RSFB-96-024, Universit¨ at–GH Paderborn, Sep 1996. 349 10. D. Vanderstraeten, R. Keunings, and C. Farhat. Beyond conventional mesh partitioning algorithms... In SIAM Conf. on Par. Proc., pages 611–614, 1995. 347 11. C. Walshaw, M. Cross, and M.G. Everett. A localised algorithm for optimising unstructured mesh partitions. Int. J. Supercomputer Appl., 9(4):280–295, 1995. 349
A Competitive Symmetrical Transfer Policy for Load Sharing Konstantinos Antonis, John Garofalakis, and Paul Spirakis 1
2
Computer Technology Institute, P.O. Box 1122, 26110 Patras, Greece Phone: (+30)61-273496, Fax: (+30)61-222086 {antonis,garofala,spirakis}@cti.gr University of Patras, Department of Computer Engineering and Informatics 26500 Rion, Patras, Greece. Abstract. Load Sharing is a policy used to improve the performance of distributed systems by transferring workload from heavily loaded nodes to lightly loaded ones in the system. In this paper, we propose a dynamic and symmetrical technique for a two-server system, called DifferenceInitiated (DI), in which transferring decisions are based on the difference between the populations of the two servers. In order to measure the performance of this policy, we apply the SSP analytical approximation technique proposed in [3]. Finally, we compare the theoretically derived results of the DI technique with two of the most commonly used dynamic techniques: the Sender-Initiated (SI), and the Receiver-Initiated (RI) which were simulated.
1 Introduction
Transferring of jobs between two nodes lies at the heart of any distributed load sharing algorithm. Especially in the cases of dynamic or adaptive load sharing techniques (in which scheduling decisions are based on the current state of the system), the transferring of jobs from a heavily loaded node to a lightly loaded node requires in most cases the mutual knowledge of the current state of the two nodes (server and receiver). Such dynamic load sharing techniques are the nearest neighbor approach, the sender-initiated, the receiver-initiated, etc. In this work, we propose a symmetrical transfer policy for load sharing between two nodes, that uses a threshold on the difference of the populations of the respective queues in order to make decisions for remote processing. We evaluate the performance of this policy by applying the State Space Partition (SSP) supplementary queueing approximation technique [3], and also simulations. In the simulation model that we developed, we incorporated two of the most widely used dynamic load sharing policies, the sender-initiated (SI) and receiver-initiated (RI) [1], in order to compare their performance with our policy, which we call difference-initiated (DI).
This work was partially supported by ESPRIT LTR Project no. 20244 - ALCOM-IT basic research program of the E.U. and the Operational Program for Research and Development 2 No. 3.3 453 sponsored by the General Secretariat for Research and Technology.
2 The Difference-Initiated Technique
2.1 The Model
We consider two serving queues as the processing nodes, which are subject to the following simple load sharing mechanism: when the difference of their queue lengths is greater than or equal to a threshold c, then jobs are transferred from the more heavily loaded queue to the more lightly loaded queue at a constant rate β (i.e. the transfer period is exponential of mean length 1/β). The above algorithm behaves symmetrically, either as sender-initiated or receiver-initiated, depending on whether the difference between the two queue lengths exceeds or is under the corresponding threshold c. A server probes for the queue length of the other server every time a job arrives to or departs from its own queue.
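Before turning to the analytical treatment, the model is easy to explore numerically. The toy continuous-time simulation below is our own illustration (not the SSP technique used in the paper); the parameter names mirror those introduced later in Section 3, and it estimates the mean response time via Little's law.

    import random

    def simulate_di(lam1, lam2, mu1, mu2, beta, c, horizon=200000.0):
        # Toy simulation of the two-queue DI model: Poisson arrivals (lam1, lam2),
        # exponential services (mu1, mu2) and, while the queue lengths differ by
        # at least c, exponential transfers at rate beta from the longer queue to
        # the shorter one (for c >= 1 the source queue is never empty).
        t, n1, n2, area = 0.0, 0, 0, 0.0
        while t < horizon:
            rates = [lam1, lam2,
                     mu1 if n1 > 0 else 0.0,          # service at queue 1
                     mu2 if n2 > 0 else 0.0,          # service at queue 2
                     beta if n1 - n2 >= c else 0.0,   # transfer queue 1 -> queue 2
                     beta if n2 - n1 >= c else 0.0]   # transfer queue 2 -> queue 1
            dt = random.expovariate(sum(rates))
            area += (n1 + n2) * dt                    # time-integral of the population
            t += dt
            event = random.choices(range(6), weights=rates)[0]
            if event == 0:   n1 += 1
            elif event == 1: n2 += 1
            elif event == 2: n1 -= 1
            elif event == 3: n2 -= 1
            elif event == 4: n1, n2 = n1 - 1, n2 + 1
            else:            n1, n2 = n1 + 1, n2 - 1
        return (area / t) / (lam1 + lam2)             # mean response time (Little's law)

    print(simulate_di(lam1=1.0, lam2=1.0, mu1=2.0, mu2=2.0, beta=20.0, c=2))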
2.2 The State Space Partition (SSP) Technique
It is easy to observe that the behaviour of our system, as time progresses, balances between 3 elementary queueing systems, QS1, QS2, and QS3, when |n1 − n2| < c, n1 − n2 ≥ c, and n2 − n1 ≥ c, respectively. The technique computes the steady-state probability P(n1, n2) of finding n1 and n2 customers at queue1 and queue2, respectively. The approximation technique conjectures that:

P(n1, n2) = A P1(n1, n2) + B P2(n1, n2) + C P3(n1, n2)    (1)

where P1(n1, n2), P2(n1, n2), P3(n1, n2) denote the steady-state probabilities of finding n1 and n2 customers at queue1 and queue2 for the three elementary, easily solvable systems above, with no constraints on (n1 − n2), and
– A = Prob{Ω1}, where Ω1 = {(n1, n2) : |n1 − n2| < c},
– B = Prob{Ω2}, where Ω2 = {(n1, n2) : n1 − n2 ≥ c},
– C = Prob{Ω3}, where Ω3 = {(n1, n2) : n2 − n1 ≥ c}.
In applying equation (1) we are confronted with the problem of estimating A, B, and C. So, we use the following iteration technique:
Iteration Algorithm
– Solve QS1, QS2, QS3 by any existing technique for Product Form Queueing Networks (Mean Value Analysis, LBANC, etc.) and get Pi(n1, n2), i = 1, 2, 3.
– Assume initial values for A, B, C, namely A(0), B(0), C(0). Here we select: A(0) = P1(Ω1), B(0) = P1(Ω2), C(0) = P1(Ω3).
– i = 1
– (∗) Compute the i-th iteration value of P(n1, n2) by equation (1) using A(i−1), B(i−1), C(i−1). Call it P(i)(n1, n2).
– Find new values A(i), B(i), C(i) by using the following equation:

A(i) = A(i−1) P1(Ω1) + B(i−1) P2(Ω1) + C(i−1) P3(Ω1)    (2)

Similarly for B(i) and C(i), summing over Ω2 and Ω3, respectively.
– i = i + 1
– If a convergence criterion for A, B, C is not satisfied, goto (∗).
– Use P(i+1)(n1, n2) as an approximation to P(n1, n2).
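Assuming the three elementary systems have already been solved and their state spaces truncated to a finite set of (n1, n2) pairs, the iteration above amounts to a small fixed-point computation on the three weights. The Python sketch below is our own illustration; the data layout (dictionaries from states to probabilities, sets of states for the regions) is a hypothetical choice.

    def ssp_iteration(P1, P2, P3, regions, tol=1e-9, max_iter=1000):
        # P1, P2, P3 map states (n1, n2) to the steady-state probabilities of the
        # three elementary systems; `regions` is the triple (Omega1, Omega2, Omega3)
        # of sets of states defining the three parts of the state space.
        def mass(P, region):                 # probability mass of a region under P
            return sum(P[s] for s in region if s in P)

        O1, O2, O3 = regions
        A, B, C = mass(P1, O1), mass(P1, O2), mass(P1, O3)     # initial values
        for _ in range(max_iter):
            A_new = A * mass(P1, O1) + B * mass(P2, O1) + C * mass(P3, O1)
            B_new = A * mass(P1, O2) + B * mass(P2, O2) + C * mass(P3, O2)
            C_new = A * mass(P1, O3) + B * mass(P2, O3) + C * mass(P3, O3)
            converged = max(abs(A_new - A), abs(B_new - B), abs(C_new - C)) < tol
            A, B, C = A_new, B_new, C_new
            if converged:
                break
        # Combine the elementary solutions as in equation (1).
        P = {s: A * P1[s] + B * P2[s] + C * P3[s] for s in P1}
        return A, B, C, P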
3 Simulation Results and Comparisons
Three different transfer policies for load sharing between two FIFO servers have been analysed or simulated and compared: the difference-initiated (DI), the sender-initiated (SI), and the receiver-initiated (RI) technique. The analytical model of DI was described in Section 2, while the models for the other two simulated techniques can be found in [1]. We mention here that only one job can be in transmission at any time in each direction. We assume Poisson arrival times, and exponential service and transfer times, for all the above policies. The parameters used are the following: Ts = the threshold adopted for SI, Tr = the threshold adopted for RI, c = the threshold adopted for DI, λ1, λ2 = the arrival rates for the two servers, µ1, µ2 = the service rates for the two servers, β = the transfer rate. In order to have comparable systems corresponding to the three policies, we choose to have Ts = c and Tr = 0 in all cases. We compare the three transfer policies above with respect to the average response time for a job completion (the total time spent in the two-server system by a single job). We have taken analytical (SSP method) or simulation results for the above performance measures for DI, including homogeneous and heterogeneous arrival times and variable β, for heavier and lighter workloads. The same results for SI and RI were taken from our simulation model.
3.1 Comparative Results for Homogeneous Arrival Streams
First, we consider the lightly loaded conditions. Our results have shown that RI has the worst average response time. On the other hand, DI behaves better, especially for large β (which means a low transfer cost, see Fig. 1). As the arrival rate grows, the differences between the performance measures of the three techniques decrease, but DI continues to behave a little better. The results for SI and RI are likely to be almost identical under heavily loaded conditions. We think that under even heavier conditions, RI should present better results compared to SI, confirming the results of [2]. DI performs better than the other two because of its symmetrical behaviour.
3.2 Comparative Results for Heterogeneous Arrival Streams
Considering the extremal case of Fig. 2, we can see that DI continues to perform better than the other two techniques. DI balances the load between the two servers very well, and its results are very close to the results of the SI policy. On the other hand, RI presents very poor results because of its nature, since it cannot move a large number of jobs from the heavier to the lighter server. The explanation is that the other two policies can move jobs even when the lightly loaded server is not idle. According to the definition of RI and the threshold used in this case, a server first has to be idle in order to receive a job from another server.
Fig. 1. Comparative results for the average response times, plotted against the arrival rate λ (homogeneous arrival times, β = 20, Ts = 2, Tr = 0, c = 2, µ1 = µ2 = 2).
Fig. 2. Comparative results for the average response time, plotted against β (heterogeneous arrival times, Ts = 2, Tr = 0, c = 2, λ1 = 1.5, λ2 = 0.5, µ1 = µ2 = 2).
References 1. S. P. Dandamudi, “The Effect of Scheduling Discipline on Sender-Initiated and Receiver-Initiated Adaptive Load Sharing in Homogeneous Distributed Systems”, Technical Report, School of Computer Science, Carleton University, TR-95-25 1995. 352, 354 2. D. L. Eager, E. D. Lazowska, and J. Zahorjan, “A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing”, Performance Evaluation, Vol. 6, March 1986, pp. 53-68. 354 3. J. Garofalakis, and P. Spirakis, “State Space Partition Modeling . A New Approximation Technique”, Computer Technology Institute, Technical Report, TR. 88.09.56, Patras, 1988. 352
Scheduling Data–Parallel Computations on Heterogeneous and Time–Shared Environments Salvatore Orlando1 and Raffaele Perego2 1
Dip. di Matematica Appl. ed Informatica, Università Ca' Foscari di Venezia, Italy 2 Istituto CNUCE, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy Abstract. This paper addresses the problem of load balancing data-parallel computations on heterogeneous and time-shared parallel computing environments, where load imbalance may be introduced by the different capacities of processors populating a computer, or by the sharing of the same computational resources among several users. To solve this problem we propose a run-time support for parallel loops based upon a hybrid (static + dynamic) scheduling strategy. The main features of our technique are the absence of centralization and synchronization points, the prefetching of work toward slower processors, and the overlapping of communication latencies with useful computation.
1 Introduction
It is widely held that distributed and parallel computing disciplines are converging. Advances in network technology have in fact strongly increased the network bandwidth of state-of-the-art distributed systems. A gap still exists with respect to communication latencies, but this difference too is now much less dramatic. Furthermore, most Massively Parallel Systems (MPSs) are now built around the same off-the-shelf superscalar processors that are used in high performance workstations, so that fine-grain parallelism is now exploited intra-processor rather than inter-processor. The convergence of parallel and distributed computing disciplines is also more evident if we consider the programming models and environments which dominate current parallel programming: MPI, PVM and High Performance Fortran (HPF) are now available for both MPSs and NOWs. This trend has a direct impact on the research carried out on the two converging disciplines. For example, parallel computing research has to deal with problems introduced by heterogeneity in parallel systems. This is a typical feature of distributed systems, but nowadays heterogeneous NOWs are increasingly used as parallel computing resources [2], while MPSs may be populated with different off-the-shelf processors (e.g. an SGI/Cray T3E system at the time of this writing may be populated with 300 or 450 MHz DEC Alpha processors). Furthermore, some MPSs, which were normally used as batch machines in space sharing, may now be concurrently used by different users as time-shared multiprogrammed environments (e.g. an IBM SP2 system can be configured in a way that allows users to specify whether the processors must be acquired as shared or dedicated resources). This paper focuses on one of the main problems programmers thus have to deal with on both parallel and distributed systems:
the problem of "system" load imbalance due to heterogeneity and time sharing of resources. Here we restrict our view to data-parallel computations expressed by means of parallel loops. In previous works we considered imbalances introduced by non-uniform data-parallel computations to be run on homogeneous, distributed memory, MIMD parallel systems [11,9,10]. We devised a novel compiling technique for parallel loops and a related run-time support (SUPPLE). This paper shows how SUPPLE can also be utilized to implement loops in all the cases where load imbalance is not a characteristic of the user code, but is caused by variations in the capacities of processing nodes. Note that much other research has been conducted in the field of run-time supports and compilation methods for irregular problems [13,12,7]. In our opinion, besides SUPPLE, many of these techniques can also be adopted when load imbalance derives from the use of a time-shared or heterogeneous parallel system. These techniques should also be compared with those specifically devised to face load imbalance in NOW environments [4,8]. SUPPLE is based upon a hybrid scheduling strategy, which dynamically adjusts the workloads in the presence of variations of processor capacities. The main features of SUPPLE are the efficient implementation of regular stencil communications, the hybrid (static + dynamic) scheduling of chunks of iterations, and the exploitation of aggressive chunk prefetching to reduce waiting times by overlapping communication with useful computation. We report performance results of many experiments carried out on an SGI/Cray T3E and an IBM SP2 system. The synthetic benchmarks used for the experiments allowed us to model different situations by varying a few important parameters such as the computational grain and the capacity of each processor. The results obtained suggest that, in the absence of a priori knowledge about the relative capacities of the processors that will actually execute the program, the hybrid strategy adopted in SUPPLE yields very good performance. The paper is organized as follows. Section 2 presents the synthetic benchmarks and the machines used for the experiments. Section 3 describes our run-time support and its load balancing strategy. The experimental results are reported and discussed in Section 4, and, finally, the conclusions are drawn in Section 5.
2 Benchmarks
We adopted a very simple benchmark program that resembles a very common pattern of parallelism (e.g. solvers for differential equations, simulations of physical phenomena, and image processing applications). The pattern is data-parallel and "regular", and thus considered easy to implement on homogeneous and dedicated parallel systems. In the benchmark a bidimensional array is updated iteratively on the basis of the old values of its elements, while the array data references are modeled by a five-point stencil. The simple HPF code illustrating the benchmark is shown below.

          REAL A(N1,N2), B(N1,N2)
    !HPF$ TEMPLATE D(N1,N2)
    !HPF$ DISTRIBUTE D(BLOCK,BLOCK)
    !HPF$ ALIGN A(i,j), B(i,j) WITH D(i,j)
          .........
          DO k= 1,N_ITER
    !HPF$ INDEPENDENT
          FORALL (i= 2:N1-1, j= 2:N2-1)
            B(i,j) = Comp(A(i,j), A(i-1,j), A(i,j-1), A(i+1,j), A(i,j+1))
          END FORALL
          A = B
          END DO
          .........
Note the BLOCK distribution to exploit data locality by reducing off-local-memory data references. The actual computation performed by the benchmark above is hidden by the function Comp(). We thus prepared several versions of the same benchmark where Comp() was replaced with a dummy computation characterized by known and fixed costs. Moreover, since it is important to observe the performance of our loop support when we increase the number of processors, for each different grid P of processors we modified the dimensions of the data set to keep the size of the block of data allocated to each processor constant. Finally, another feature that we changed during the tests is N_ITER, the number of iterations of the external sequential loop. This was done to simulate the behavior of real applications such as solvers of differential equations, which require the same parallel loops to be executed many times, and image filtering applications, which usually perform the update of the input image in just one step.
3 The SUPPLE Approach
SUPPLE (SUPport for Parallel Loop Execution) is a portable run-time support for parallel loops [9,11,10]. It is written in C with calls to the MPI library. The main innovative feature of SUPPLE [9] is its ability to allow data and computation to be dynamically migrated, without losing the ability to exploit all the static optimizations that can be adopted to efficiently implement stencil data references. Stencil implementation is straightforward, due to the regularity of the blocking data layout adopted. For each array involved, SUPPLE allocates to each processor enough memory to host the block partition, logically subdivided into an inner and a perimeter region, and a surrounding ghost region. The ghost region is used to buffer the parts of the perimeter regions of the adjacent partitions that are owned by neighboring processors and are accessed through non-local references. The inner region contains data elements that can be computed without fetching external data, while to compute data elements belonging to the perimeter region these external data have to be waited for. Loop iterations are statically assigned to each processor according to the owner computes rule, but, to overlap communications with useful computations, iterations are reordered [3]: the iterations that assign data items belonging to the inner region (which refer to local data only) are scheduled between the asynchronous sending of the perimeter region to neighboring processors and the receiving of the corresponding data into the ghost region. This static scheduling may be changed at run-time by migrating iterations, but, in order to avoid introducing irregularities in the implementation of stencil data references, only iterations updating the inner region can be migrated. We group these iterations into chunks of fixed size g, by statically tiling the inner region. SUPPLE migrates chunks and associated data tiles instead of single iterations.
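The per-processor data layout can be pictured with a short sketch. The Python code below is only an illustration (not SUPPLE's actual C data structures): it marks, in local block coordinates, which points belong to the perimeter region and tiles the remaining inner points into chunks of g iterations. For simplicity the chunks here are runs of consecutive points in row-major order, whereas SUPPLE tiles the inner region into rectangular data tiles.

    def split_block(rows, cols, halo, g):
        # Illustrative split of a processor's block (local coordinates
        # 0..rows-1 x 0..cols-1): the frame of width `halo` is the perimeter
        # region (its values are needed by the neighbours); the remaining
        # points form the inner region, tiled into chunks of g iterations.
        perimeter, inner = [], []
        for i in range(rows):
            for j in range(cols):
                on_frame = (i < halo or j < halo or
                            i >= rows - halo or j >= cols - halo)
                (perimeter if on_frame else inner).append((i, j))
        chunks = [inner[k:k + g] for k in range(0, len(inner), g)]
        return perimeter, chunks

    # An 8 x 8 block with a one-cell halo and g = 9: 28 perimeter points and
    # four 9-point chunks covering the 36 inner points.
    perimeter, chunks = split_block(8, 8, 1, 9)
    print(len(perimeter), [len(c) for c in chunks])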
At the beginning, to reduce overheads, each processor statically executes its chunks, which are stored in a queue Q, hereafter the local queue. Once a processor understands that its local queue is becoming empty, it autonomously decides to start the dynamic part of its scheduling policy. It tries to balance the processor workloads by asking overloaded partners to migrate both chunks and the corresponding data tiles. Note that, due to stencil data references, to allow the remote execution of a chunk, the associated tile must be accompanied by a surrounding area, whose size depends on the specific stencil features. Migrated chunks and data tiles are stored by each receiving processor in a queue RQ, called the remote queue. Since load balancing is started by underloaded processors, our technique can be classified as receiver initiated [6,14]. In the following we detail our hybrid scheduling algorithm. During the initial static phase, each processor only executes local chunks in Q and measures their computational cost. Note that, since the possible load imbalance may only derive from different "speeds" of the processors involved, chunks that will possibly be migrated and stored in RQ will be considered as having the same cost as the ones stored in Q. Thus, on the basis of the knowledge of the chunk costs, each processor estimates its current load by simply inspecting the size of its queues Q and RQ. When the estimated local load becomes lower than a machine-dependent Threshold, each processor autonomously starts the dynamic part of the scheduling technique and starts asking other processors for remote chunks. Correspondingly, a processor pj, which receives a migration request from a processor pi, will grant the request by moving some of its workload to pi only if its current load is higher than Threshold. To reduce the overheads which might derive from requests for remote chunks which cannot be served, each processor, when its current load becomes lower than Threshold, broadcasts a so-called termination message. Therefore the round-robin strategy used by underloaded processors to choose a partner to be asked for further work skips terminated processors. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate number of chunks to be migrated. To this end, SUPPLE uses a modified Factoring scheme [5], which is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared-memory multiprocessors. Finally, the policies exploited by SUPPLE to manage data coherence and termination detection are also fully distributed and asynchronous. A full/empty-like technique [1] is used to asynchronously manage the coherence of migrated data tiles. When processor pi sends a chunk b to pj, it sets a flag marking the data tiles associated with b as invalid. The next time pi needs to access the same tiles, pi checks the flag and, if the flag is still set, waits for the updated data tiles from node pj. As far as termination detection is concerned, the role of a processor in the parallel loop execution finishes when it has already received a termination message from all the other processors, and both its queues Q and RQ are empty. In summary, unlike other proposals [12,4], the dynamic scheduling policy of SUPPLE is fully distributed and based upon local knowledge about the local workload, and thus there is no need to synchronize the processors in order to exchange updated information about the global workload. Moreover, SUPPLE
may also be employed for applications composed of a single parallel loop, such as filters for image processing. Unlike other proposals [13,8], it does not exploit past knowledge about the workload distribution at previous loop iterations, since dynamic scheduling decisions are asynchronously taken concurrently with useful computation.
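The receiver-initiated part of this hybrid policy can be summarised by a small decision function. The sketch below is only a schematic rendering of the behaviour described above: real SUPPLE exchanges request/grant messages between processors, whereas here the partners' states are inspected directly, and chunks_to_grant is a simplified stand-in for the modified Factoring scheme, not its exact definition.

    def chunks_to_grant(remaining_chunks, nprocs):
        # Simplified factoring-style rule: each grant hands over a fraction of
        # the chunks the overloaded processor still owns, so that successive
        # requests receive smaller and smaller batches.
        return max(1, remaining_chunks // (2 * nprocs))

    def scheduling_step(my_load, threshold, partners, nprocs):
        # Schematic receiver-initiated step: `my_load` is the estimated cost of
        # the chunks still queued locally; `partners` maps every other processor
        # to the state known about it.
        if my_load >= threshold:
            return None                        # still in the static phase
        for p, state in partners.items():      # scan partners, skip terminated ones
            if state["terminated"] or state["queued_chunks"] == 0:
                continue
            if state["load"] > threshold:      # a partner grants only if overloaded
                n = chunks_to_grant(state["queued_chunks"], nprocs)
                state["queued_chunks"] -= n
                return (p, n)                  # receive n chunks (and their tiles)
        return None

    partners = {1: {"terminated": False, "queued_chunks": 40, "load": 5.0},
                2: {"terminated": True,  "queued_chunks": 0,  "load": 0.0}}
    print(scheduling_step(my_load=0.01, threshold=0.02, partners=partners, nprocs=4))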
4 Experimental Results
All experiments were conducted on an IBM SP2 system and an SGI/Cray T3E. Note that both machines might be heterogeneous, since both can be equipped with processors of different capacities. The T3E can in fact be populated with DEC Alpha processors of different generations1, while IBM enables choices from three distinct types of nodes (high, wide or thin), where the differences are in the number of per-node processors, the type of processors, the clock rates, the type and size of caches, and the size of the main memory. However, we used the SP2 (a 16-node system equipped with 66 MHz POWER 2 wide processors) as a homogeneous time-shared environment. To simulate load imbalance we simply launched some compute-bound processes on a subset of nodes. On the other hand, we used the T3E (a system composed of 64 300-MHz DEC Alpha 21164 processors) as a space-shared heterogeneous system. Since all the nodes of our system are identical, we had to simulate the presence of processors of different speeds by introducing an extra cost in the computation performed by those processors considered "slow". Thus, if the granularity of Comp() is µ µsec (including the time to read and write the data) on the "fast" processors, and F is the factor of slowdown of the "slow" ones, the granularity of Comp() on the "slow" processors is (F · µ) µsec. To prove the effectiveness of SUPPLE, we compared each SUPPLE implementation of the benchmark with an equivalent static and optimized implementation, like the one exploited by a very efficient HPF compiler. We present several curves, all plotting an execution time ratio (ETR), i.e. the ratio of the time taken by the static scheduling version of the benchmark over the time taken by the SUPPLE hybrid scheduling version. Hence, a ratio greater than one corresponds to an improvement in the total execution time obtained by adopting SUPPLE with respect to the static version. Each curve is relative to a distinct granularity µ of Comp(), and plots the ETRs as a function of the number of processors employed. The size of the data set has been modified according to the number of processors to keep the size of the sub-block statically assigned to each processor constant.
4.1 Time-shared Environments
First of all, we show the results obtained on the SP2. Due to the difficulty of running probing experiments, because of the unpredictability of the workloads as well as the exclusive use of some resources (in particular, resources such as the SP2 high-performance switch and/or the processors themselves) by other users, the results in this case are not as exhaustive as those reported for the tests run on the T3E. As previously mentioned, on the SP2 we ran our experiments after introducing a synthetic load on a subset of the nodes used. In particular, we launched 4 compute-bound processes on 50% of the processing nodes employed, while the loads were close to zero on the rest of the nodes. This corresponds to "slow" processors characterized by a value of F (factor of slowdown) equal to 5. The two plots in Figure 1 show the ETRs obtained for various µ and numbers of processors. The size of the sub-block owned by each processor was kept constant and equal to 512 × 512, while N_ITER was equal to 5. As regards the SUPPLE parameters, the Threshold value was set to 0.02 msec, and the chunk size g to 32 iterations.

1 The Pittsburgh Supercomputing Center has a T3E system with 512 application processors, of which half run at 300 MHz and half at 450 MHz.

Fig. 1. SP2 results: ETR results for experiments exploiting (a) the Ethernet, and (b) the high-performance switch (synthetic load of 4 processes on 50% of the processors; curves for µ between 3.6 and 27.0 µsec).

Figure 1.(a) shows the SP2 results obtained by using the Ethernet network and the IP protocol. Note that, in this case as in the following ones, the performance improvement obtained with SUPPLE increases with the size of the grain µ. For µ = 27 µsec and 4 processors, we obtained the best result: the SUPPLE implementation is about 100% faster than the static counterpart. For smaller values of µ, due to the overheads of migrating data, the load imbalance paid by the static version of the benchmark is not large enough to justify the adoption of a dynamic scheduling technique. Figure 1.(b) shows, on the other hand, the results obtained by running the parallel benchmarks in time-sharing on the SP2 by exploiting the US high-performance switch. Due to the better communication framework, in this case the ETR is favorable to SUPPLE even for smaller values of µ.

4.2 Heterogeneous Environments

All the experiments regarding heterogeneous environments were conducted on the T3E. Heterogeneity was simulated, so that we were able to perform many tests, also experimenting with different factors F of slowdown.

Iterative benchmark. Figure 2 reports the results for the iterative benchmark, where the external sequential loop is repeated 20 times (N_ITER = 20).
Fig. 2. T3E results: ETR results for various values of F and µ.

For all the tests we kept the size of the sub-block assigned to each processor fixed (512 × 512). Figures 2.(a) and 2.(b) are relative to an environment where only 25% of all the processors are "slow". The factors F of slowdown of these processors are 2 and 4, respectively. Figures 2.(c) and 2.(d) refer to an environment where half of the processors employed are "slow". Each curve in these figures plots the ETR for a benchmark characterized by a distinct µ. With respect to the SP2 tests, in this case we were able to reduce µ (µ now ranges between 0.3 and 2.2 µsec), always obtaining an encouraging performance with our SUPPLE support. Note that such grains are actually very small: for µ equal to 0.3 µsec, about 85% of the total time is spent on memory accesses (to compute addresses and access the data elements covering the five-point stencil), and only 15% on arithmetic operations. We believe that the encouraging results obtained are due to the smaller overheads and latencies of the T3E network, as well as to the absence of time-sharing of the processors, which speeds up the responsiveness of "slow" processors to requests coming from "fast" ones. The reductions in granularity made it possible to enlarge the chunk size g (g = 128) without losing the effectiveness of the dynamic scheduling algorithm. The Threshold ranged between 0.02 and 0.06 msec, where larger values were used for larger µ. Looking at Figure 2, we can see that better results (SUPPLE execution times up to 3 times lower than those obtained with static scheduling) were obtained
when the "slow" processors were only 25% of the total number. The reason for this behavior is clear: in this case, we have more "fast" processors to which the extra workload previously assigned to the "slow" processors can be dynamically distributed. The execution times obtained with the static implementation are, on the other hand, almost independent of the percentage of "slow" processors. In fact, even if only one of the processors were "slow", its execution time would dominate the overall completion time. We also tested SUPPLE on a homogeneous system (i.e. a balanced one) in order to evaluate its overhead w.r.t. a static implementation, which, in this case, is optimal. The overhead is almost constant for data sets of the same size subdivided into a given number of chunks, but its influence becomes larger for smaller granularities because of the shorter execution time and the limited possibility of hiding communication latencies with computations. Thus, for µ = 2.2 µsec the two execution times are almost comparable, while for µ = 0.3 µsec the static version of the benchmark becomes 60% faster than SUPPLE. We verified that the overhead introduced by SUPPLE is due to some undesired migration of chunks, and to the dynamic scheduling technique, which entails polling the network interface to check for incoming messages (even if these messages do not actually arrive).
Fig. 3. Work time and overheads for a static (a) and a SUPPLE (b) version of the benchmark (per-processor bars showing work time, scheduling overhead, and stencil + idle time).

Finally, we instrumented the static and the SUPPLE versions of a specific benchmark to evaluate the ratio between the execution time spent on useful computations and on dynamic scheduling overheads. The features of the benchmark used in this case are the following: a 2048 × 2048 data set distributed over a grid of 4 × 4 T3E processors, and an external sequential loop iterated 20 times. The simulated unbalanced environment was characterized by 25% of "slow" processors, with a slowdown factor of 4. Figure 3.(a) shows the results obtained by running the static implementation of this benchmark. It is worth noting the work time on the "slow" processors, which is 4 times the work time on the "fast" ones. The black portions of the bars show idle times on "fast" processors while waiting for border data. These idle times are thus due to the communications implementing stencil data references, where corresponding send/receive operations on fast/slow processors are not synchronized due to their different capacities. Figure 3.(b), on the other hand, shows the SUPPLE results on the same benchmark. Note the redistribution of workloads from "slow" to "fast" processors.
Thanks to dynamic load balancing, idle times disappeared, but we have larger scheduling overheads due to the communications used to dynamically migrate chunks. These overheads are clearly larger on "slow" processors, which spend a substantial part of their execution time on giving away work and on receiving the results of migrated iterations.
2
µ = 0.3 µsec µ = 0.6 µsec µ = 1.1 µsec µ = 2.2 µsec
4 3.5 Exec. time ratio
Exec. time ratio
Cray T3E: SUPPLE w.r.t. static scheduling - F=4 on 25% procs (1 Iter.) 4.5
µ = 0.3 µsec µ = 0.6 µsec µ = 1.1 µsec µ = 2.2 µsec
1.5
1
3 2.5 2 1.5 1
0.5
0.5 0
0 0
10
20
30 40 Number of processors
(a)
50
60
70
0
10
20
30 40 Number of processors
50
60
70
(b)
Fig. 4. T3E results: ETR results obtained running a single iteration benchmark for various values of F and µ
Single iteration benchmark. As explained above, one of the advantages of SUPPLE is that it can also be used for balancing parallel loops that have to be executed only once. In this case some overheads cannot be overlapped, and at the end of the loop execution "slow" processors have to wait for the results of iterations executed remotely. Figure 4 shows the encouraging ETRs obtained by our SUPPLE implementation w.r.t. the static one. Note that all the results plotted in the figure refer to unbalanced environments where only 25% of the processors are "slow". Moreover, due to the larger data sets used for these tests, the ETRs are in some cases even more favorable to SUPPLE than those for the iterative benchmarks.
5 Conclusions
We have discussed the implementation on heterogeneous and/or time-shared parallel machines of regular and uniform parallel loop kernels, with statically predictable stencil data references. We have assumed that no information about the capacities of the processors involved is available until run-time, and that, in time-shared environments, these capacities may change during run-time. To implement the kernel benchmark we employed SUPPLE, a run-time support that we had previously introduced to compile non-uniform loops. The tests were conducted on an SGI/Cray T3E and an IBM SP2. We compared the SUPPLE results with a static implementation of the benchmark, where data and computations are evenly distributed to the various processing nodes. The SP2, a parallel system that can be used as a time-shared NOW, was loaded with artificial compute-bound processes before running the tests. On the other
hand, we needed to simulate a heterogeneous SGI/Cray T3E, i.e. a machine whose nodes may be equipped with different off-the-shelf processors and/or memory organization. The performance results were very encouraging. On the SP2, where half of the processors were loaded with 4 compute-bound processes, the SUPPLE version of the benchmark was at most 100% faster than the static one. On the T3E, depending on the number of "slow" processors, on the number of processors employed and on the granularity of the loop iterations, the SUPPLE version reached performance improvements ranging between 20% and 270%. Further work has to be done to compare our solution with other dynamic scheduling strategies proposed elsewhere. More exhaustive experiments with different benchmarks and dynamic variations of the system loads are also required to fully evaluate the proposal. However, we believe that hybrid strategies like the one adopted by SUPPLE can be profitably exploited in many cases where locality exploitation and load balancing must be addressed at the same time. Moreover, our strategy can be easily integrated into the compilation model of a high-level data-parallel language.
References 1. R. Alverson et al. The Tera computer system. In Proc. of the 1990 ACM Int. Conf. on Supercomputing, pages 1–6, 1990. 359 2. A. Anurag, G. Edjlali, and J. Saltz. The Utility of Exploiting Idle Workstations for Parallel Computation. In Proc. of the 1997 ACM SIGMETRICS, July 1997. 356 3. S. Hiranandani, K. Kennedy, and C. Tseng. Evaluating Compiler Optimizations for Fortran D. J. of Parallel and Distr. Comp., 21(1):27–45, April 1994. 358 4. S. Flynn Hummel, J. Schmidt, R. N. Uma, and J. Wein. Load-Sharing in Heterogeneous Systems via Weighted Factoring. In Proc. of the 8th SPAA, July 1997. 357, 359 5. S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. of the ACM, 35(8):90–101, Aug. 1992. 359 6. V. Kumar, A.Y. Grama, and N. Rao Vempaty. Scalable Load Balancing Techniques for Parallel Computers. J. of Parallel and Distr. Comp., 22:60–79, 1994. 359 7. J. Liu and V. A. Saletore. Self-Scheduling on Distributed-Memory Machines. In Proc. of Supercomputing ’93, pages 814–823, 1993. 357 8. M. Colajanni M. Cermele and G. Necci. Dynamic Load Balancing of Distributed SPMD Computations with Explicit Message-Passing. In Proc. of the IEEE Workshop on Heterogeneous Computing, pages 2–16, 1997. 357, 360 9. S. Orlando and R. Perego. SUPPLE: an Efficient Run–Time Support for Non– Uniform Parallel Loops. Technical Report TR-17/96, Dipartimento di Mat. Appl. ed Informatica, Universit` a di Venezia, Dec. 1996. To appear on J. of System Architecture. 357, 358 10. S. Orlando and R. Perego. A Comparison of Implementation Strategies for Non– Uniform Data–Parallel Computations. Technical Report TR-9/97, Dipartimento di Mat. Appl. ed Informatica, Universit` a di Venezia, April 1997. Under revision for publication on the J. of Parallel and Distr. Comp. 357, 358
11. S. Orlando and R. Perego. A Support for Non–Uniform Parallel Loops and its Application to a Flame Simulation Code. In Proc. of the 4th Int. Symposium, IRREGULAR ’97, pages 186–197, Paderborn, Germany, June 1997. LNCS 1253, Spinger-Verlag. 357, 358 12. O. Plata and F. F. Rivera. Combining static and dynamic scheduling on distributed–memory multiprocessors. In Proc. of the 1994 ACM Int. Conf. on Supercomputing, pages 186–195, 1994. 357, 359 13. J. Saltz et al. Runtime and Language Support for Compiling Adaptive Irregular Programs on Distributed Memory Machines. Software Practice and Experience, 25(6):597–621, June 1995. 357, 360 14. M.H. Willebeek-LeMair and A.P. Reeves. Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Trans. on Parallel and Distr. Systems, 4(9):979–993, Sept. 1993. 359
A Lower Bound for Dynamic Scheduling of Data Parallel Programs Fabricio Alves Barbosa da Silva2 , Luis Miguel Campos1 , and Isaac D. Scherson1,2 1
Information and Comp. Science, University of California, Irvine, CA 92697 U.S.A. {isaac,lcampos}@ics.uci.edu 2 Université Pierre et Marie Curie, Laboratoire ASIM, LIP6, Paris, France [email protected]
Abstract. Instruction Balanced Time Slicing (IBTS) allows multiple parallel jobs to be scheduled in a manner akin to the well-known gang scheduling scheme in parallel computers. IBTS, however, allows time slices to change dynamically and, under dynamically changing workload conditions, is a good non-clairvoyant scheduling technique when the parallel computer is time sliced one job at a time. IBTS-parallel is proposed here as a dynamic scheduling paradigm which improves on IBTS by also allowing dynamically changing space sharing of the computer's processors. IBTS-parallel is also non-clairvoyant and it is characterized under the competitive ratio metric. A lower bound on its performance is also derived.
1 Introduction
A solution to the dynamic parallel job scheduling problem is proposed together with its complexity analysis. The problem is one of defining how to share, in an optimal manner, a parallel machine among several parallel jobs. A job is defined as a set of data parallel threads. One important characteristic that makes the problem dynamic, as defined in [3], is the possibility of arrival of new jobs at arbitrary times. In the static case, the set of jobs to be executed is already defined when scheduling decisions are made, and arbitrary job arrivals are not permitted. In this paper we propose a new scheduling algorithm dubbed IBTS-Parallel, which is derived from the IBTS algorithm. IBTS stands for instruction balanced time slicing, and was originally defined in [6]. IBTS is a non-clairvoyant [3] scheduling algorithm designed to optimize the competitive ratio (CR) metric, also defined in [6][4].
Professor Alain Greiner is gratefully acknowledged. Supported by Capes, Brazilian Government, grant number 1897/95-11. Supported in part by the Irvine Research Unit in Advanced Computing and NASA under grant #NAG5-3692. Supported in part by the Irvine Research Unit in Advanced Computing and NASA under grant #NAG5-3692.
In addition to the theoretical analysis of IBTS-Parallel, we present an experimental analysis performed using a general purpose event driven simulator developed by our research group.
2 Previous Work
Various classifications for the parallel job scheduling problem have been suggested in the literature. The classification used here is based on the way in which computing resources are shared: temporal sharing, also known as time slicing or preemption; and space slicing, also called partitioning. These two classifications are orthogonal, and may lead to a taxonomy based on the possible options. Fig. 1 shows the scheduling policies adopted by commercial and research systems, and was borrowed from [2]. It is worth noting that there is no consensus at all on which of the scheduling policies shown in Fig. 1 is best. The problem is that the assumptions leading to and justifying the different schemes are usually quite different, which makes it difficult to compare the different solutions.

Fig. 1. Scheduling policies followed in current commercial and research systems (a table classifying systems by time slicing, with independent PEs using local or global queues, gang scheduling, or no time slicing, and by space slicing, either none, flexible, or structured; the entries include Mach, Paragon, Meiko, KSR, Tera, Cray T3D/T3E, CM-5, IBM SP2, SGI, MasPar, and others)

We address the scheduling problem by using the same assumptions (i.e. model and metrics) described by Subramaniam and Scherson in [6][4]. In [6], Subramaniam and Scherson studied a scheduling policy named Instruction Balanced Time Slicing (IBTS), which performs well under the competitive ratio metric [4]. Each job is allocated on all P processing elements (PEs) of the machine for a finite quantum (time slice). All time quanta devoted to the various jobs are equal as measured in machine instructions (that is, all jobs are preempted after an equal number of machine instructions). Note that time quanta may vary in absolute value while still being equal when measured in number of instructions. A good analysis of gang scheduling, which is similar to IBTS but imposes an invariant constant time slice, can be found in [5]. In [4] the programming model used is the data parallel (V-RAM) model [1]. The scheduling problem was defined as an optimization problem, and the
competitive ratio was the metric used as the objective function. The competitive ratio (CR) was defined as a function of the non-clairvoyance of the scheduler, and is based on the happiness concept [4]. The happiness metric attempts to capture the satisfaction of a job as a function of the scheduling decisions made by a scheduler. The CR is then the ratio between the happiness achieved by a knowledgeable malicious adversary and that achieved by a partially ignorant scheduler. IBTS is hence a non-clairvoyant algorithm, and the malicious adversary always has more information than the scheduler itself. It is this hidden information that is used by the adversary to keep the happiness as low as possible. By minimizing the CR the scheduler approaches the result that would be obtained by an adversary with global knowledge.

CR = max_{y'} [ Happiness(x, y') / Happiness(x, y) ]        (1)
In equation (1), x represents the input to the algorithm, y is the result obtained by the scheduler and y' is the result obtained by the adversary. Also:

Happiness(x, y) = ( ∫_0^T min_j ℘_j(τ) dτ ) / T        (2)
where ℘_j is the power delivered to job j, and T is the time at which the last job completes. In analogy with physics terminology, the power delivered to a job is defined as:

℘ = W / Γ        (3)

Γ is the running time of a job as a function of the number of statements and the mean completion time of each statement (the mean completion time is computed according to the type of the statement: local statements, remote access statements, etc., and is machine-architecture dependent). W is the work, or the processor-time product in the ideal world, i.e. it is the product of the ideal number of processors for execution of the job and the running time assuming all statements are local statements. ℘ is valid for the single-job case. For multiple jobs we have:

W_j = ∫_0^{T_j} ℘_j(τ) dτ        (4)
Another related definition is the inefficiency of running the job on a machine:

η = P·Γ / W ,   η > 1        (5)

Γ is the running time of a job, W is the work and P is the number of processors actually allocated to the job. The most important result contained in [4][6] is that, of all non-clairvoyant schedulers, instruction-balanced time-slicing has the least CR, with the corresponding proof. As a consequence of the proof it was verified that the least
possible CR is equal to ηmax , which is the maximum inefficiency among all jobs that are running in a machine at a given moment.
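As a small numerical illustration of these definitions (the job data and the C helper below are hypothetical and not taken from the paper), the inefficiency of each job and the bound η_max can be computed directly from P, Γ and W:

#include <stdio.h>

/* Hypothetical job record: P = processors actually allocated,
 * gamma = running time Γ on the machine, w = ideal work W.        */
struct job { int P; double gamma; double w; };

/* Inefficiency η = P·Γ/W of a single job, as in Equation (5). */
static double inefficiency(const struct job *j) {
    return (j->P * j->gamma) / j->w;
}

int main(void) {
    /* Made-up jobs, used only to exercise the formula. */
    struct job jobs[] = { {16, 10.0, 120.0}, {16, 4.0, 60.0}, {8, 6.0, 40.0} };
    int n = sizeof jobs / sizeof jobs[0];
    double eta_max = 1.0;
    for (int i = 0; i < n; i++) {
        double eta = inefficiency(&jobs[i]);
        if (eta > eta_max) eta_max = eta;
    }
    /* By the result of [4][6], eta_max is the least competitive ratio
     * achievable by a non-clairvoyant scheduler, and IBTS attains it.  */
    printf("eta_max = %.3f\n", eta_max);
    return 0;
}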
3 IBTS-parallel
In IBTS, each job is allocated to the whole machine and executes a predefined number of instructions before being preempted. However, not all jobs necessarily use all processors of the machine at all times. In order to further optimize space sharing, parallel scheduling policies should allocate multiple jobs simultaneously. We use the fact that IBTS is the non-clairvoyant algorithm with the least CR to propose a modified version of IBTS that also permits a better spatial allocation of the machine.

Let us start by considering a machine with N processors in a MIMD architectural model as described in [4]. In our modified version of IBTS the machine is shared by more than one job at any given time. Jobs are preempted after all threads, running in parallel, execute a fixed number I of instructions. The resulting scheduling algorithm is dubbed IBTS-parallel. Many different spatial scheduling strategies are possible with IBTS-parallel (first-fit, best-fit, etc.). IBTS-parallel has at least the same performance as IBTS under the CR metric, as stated in the lemma below.

Lemma 1. The CR of IBTS-parallel is ≤ η_max.

Proof. It follows from [4][6] by verifying that the two properties of IBTS are valid for IBTS-parallel:
1. At any time, an equal amount of power is supplied to all running jobs.
2. The job that finishes last incurs the maximum amount of work.

Using the lemma above, we can state the main result of this section.

Theorem 1. η_max^{IBTS-par} ≤ η_max^{IBTS}

In other words, IBTS represents the worst case of IBTS-parallel.

Proof. As we are running multiple jobs in parallel, the execution time of each job will necessarily be smaller than if we run one job after another, as is the case in IBTS since one job allocates all processing elements of the machine. From the definition of inefficiency:

η = P·Γ / W        (6)

P and W do not change from IBTS to IBTS-parallel. The time Γ is smaller or equal for all jobs under IBTS-parallel as compared to pure IBTS. So, all jobs have smaller or equal inefficiencies under IBTS-parallel as compared to pure IBTS, which makes the maximum inefficiency in IBTS-parallel smaller than or equal to the corresponding quantity in IBTS.
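A minimal sketch of the space-sharing step of IBTS-parallel, under the assumption that the simple first-fit strategy mentioned above is used (the data layout, bounds and names are ours, not the paper's): the eligible threads are packed onto the processors slot by slot, and every placed thread executes the same fixed number I of instructions per slot before being preempted.

#define N_PROC 1024            /* processors in the simulated machine        */
#define MAX_SLOTS 64           /* upper bound on time slices per cycle       */

/* Greedy first-fit construction of one scheduling cycle: eligible threads,
 * identified here only by an index, are packed slot by slot; each placed
 * thread runs for the same fixed number I of instructions per slot, and the
 * whole cycle is repeated until the workload changes.                       */
static int build_cycle(int n_threads, int schedule[MAX_SLOTS][N_PROC]) {
    int slots = 0, next = 0;
    while (next < n_threads && slots < MAX_SLOTS) {
        for (int p = 0; p < N_PROC; p++)
            schedule[slots][p] = (next < n_threads) ? next++ : -1; /* -1 = idle */
        slots++;
    }
    return slots;              /* number of slots in the repeated cycle       */
}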
4 Simulation and Verification
To verify the results above, we used a general purpose event driven simulator, developed by our research group for studying a variety of related problems (e.g., dynamic scheduling, load balancing, etc.). The simulator accepts two different formats for describing jobs. The first is a fully qualified DAG. The second is a set of parameters used to describe the job characteristics, such as the computation/communication ratio. When the second form is used, the actual communication type, timing and pattern are left unspecified and it is up to the simulator to convert this user specification into a DAG, using probabilistic distributions, provided by the user, for each of the parameters. Other parameters include the spawning factor for each thread, a thread life span, synchronization pattern, degree of parallelism (maximum number of threads that can be executed at any given time), depth of critical path, etc. Even though probabilistic distributions are used to generate the DAG, the DAG itself behaves in a completely deterministic way. Once the input is in the form of a DAG, and the module responsible for implementing a particular scheduling heuristic is plugged into the simulator, several experiments can be performed using the same input by changing some of the parameters of the simulation, such as the number of processing elements available or the topology of the network, among others. The outputs can be recorded in a variety of formats for later visualization.

For this study we grouped parallel jobs in classes where each class represents a particular degree of parallelism (maximum number of threads that can be executed at any given time). We divided the workload into ten different classes with each class containing 50 different jobs. The arrival time of a job is described by a Poisson random variable with an average rate of two job arrivals per time slice. The actual job selection is done in a round robin fashion by picking one job per class. This way we guarantee the interleaving of heavily parallel jobs with shorter ones. We distinguish the classes of computation and of communication instructions in the various threads that compose a job. A communication forces the thread to be suspended until the communication is concluded. If the communication is concluded during the currently assigned time-slice the thread resumes execution. All threads are preempted only after executing 100 instructions (the duration of a time slice), with the caveat that any thread that executed one or more communication instructions will be preempted at the end of the time-slice regardless of how many instructions it was able to execute in the current time-slice. We used a factor of 0.001 communications per computation instruction.

The practical implementation of IBTS-parallel was based on a greedy algorithm applied at the beginning of each workload change. During the workload change interval, called a cycle, the workload is obviously assumed constant. Thus, the eligible threads of queued jobs are allocated to processors using the first-fit strategy for each time slice. Clearly, after all eligible threads are scheduled on a processor for some time slice (slot), the temporal sequence is repeated
periodically until a workload change again occurs. We considered in the simulations a machine with 1024 processors. Preliminary results are shown in Table 1. They have been normalized so that the total execution time of IBTS-parallel is 100%.

Table 1. Preliminary experimental results

                    IBTS                        IBTS-parallel
      Total Running    Total Idle      Total Running    Total Idle
        Time (%)        Time (%)         Time (%)        Time (%)
         123.6            41.9             100              28.2
It is important to dissect the value obtained for idle time. This value is the result of the following three factors:
1. Communications
2. Absence of ready threads
3. Lack of job virtualization
The first is a natural consequence of threads communicating among themselves. The second reflects the fact that jobs arrive and finish at random times and at any given instant there might not be any job ready to be scheduled. The last is a result of not allowing individual threads of the same job to be scheduled at different time-slices. This factor is, we believe, the chief reason behind the high idle time measured and should be greatly reduced by extending IBTS-parallel with job virtualization [4] techniques.
References
1. Blelloch, G. E.: Vector Models for Data-Parallel Computing. MIT Press, Cambridge, MA (1990)
2. Feitelson, D.: Job Scheduling in Multiprogrammed Parallel Systems. IBM Research Report RC 19970, Second Revision (1997)
3. Motwani, R., Phillips, S., Torng, E.: Non-clairvoyant scheduling. Theoretical Computer Science 130(1):17-47 (1994)
4. Scherson, I. D., Subramaniam, R., Reis, V. L. M., Campos, L. M.: Scheduling Computationally Intensive Data Parallel Programs. École Placement Dynamique et Répartition de Charge (1996) 39-61
5. Squillante, M. S., Wang, F., Papaefthymiou, M.: An Analysis of Gang Scheduling for Multiprogrammed Parallel Computing Environments. Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (1996) 89-98
6. Subramaniam, R.: A Framework for Parallel Job Scheduling. PhD Thesis, University of California at Irvine (1995)
A General Modular Specification for Distributed Schedulers

Gerson G. H. Cavalheiro1, Yves Denneulin2, and Jean-Louis Roch1

1 Projet Apache, Laboratoire de Modélisation et Calcul, BP 53, F-38041 Grenoble Cedex 9, France
2 Laboratoire d'Informatique Fondamentale de Lille, Cité Scientifique, F-59655 Villeneuve d'Ascq Cedex, France
Abstract. In parallel computing, performance is related both to algorithmic design choices at the application level and to the scheduling strategy. Concerning dynamic scheduling, general classifications have been proposed. They outline two fundamental units, related to control and information. In this paper, we propose a generic modular specification, based not on two but on four components. They and the interactions between them are precisely described. This specification has been used to implement various scheduling algorithms in two different parallel programming environments: PM2 (Espace) and Athapascan (Apache).
1 Introduction
From the various general classifications of scheduling systems for parallel architectures [2,6], it is possible to extract two fundamental elements: a control unit to compute the distribution of work in a parallel machine and an information unit to evaluate the load of the machine. These units can implement different load balancing mechanisms, privileging either a (class of) machine architecture or a (class of) application structure. A dynamic scheduling system (DSS for short) is particularly interesting in case of unpredictable behavior at execution time. Irregular applications and networks of workstations (NOW) are typical cases where the execution behavior is unknown a priori. Usually a DSS for a tuple {application, machine} is specified by an entry in a scheduling classification, and is implemented as a black box: it is difficult to tune it or reuse it for another application or machine without totally or partially rewriting its code. Some works in the literature discuss this problem and present possible solutions, like [4], where scheduling services are grouped in families and presented as function prototypes. In this paper we discuss the gap between the scheduling classifications and the implementation of DSSs. We present a modular structure for general scheduling
CAPES-COFECUB Brazilian scholarship. Supported by CNRS, INPG, INRIA and UJF.
systems. This general architecture, presented as a framework, aims at avoiding the introduction of additional constraints to implement the different classes of scheduling. This framework introduces two new functional units: a work creation unit and an executor unit and specifies an interface between the components.
2 General Organization of a Dynamic Scheduling System
In this work a DSS is implemented at the software level to control the execution of an application on a parallel architecture. It was conceived to be independent from both application and hardware and to support various classes of scheduling algorithms. The DSS runs on a parallel machine composed of several computing nodes, each one having a local memory and at least one real processor. There is also support for communications among them.

The notion of job. Units of work are produced at the application level; from them, jobs are built and inserted in a graph of precedence (DAG), denoted G. G evolves dynamically; it represents at each instant of execution the current state of the application jobs. Each node in G represents a job; each edge, a precedence constraint between two jobs. So, a job is the basic element to be scheduled; it corresponds to a finite sequence of instructions, and has properties to allow its manipulation by the scheduler. At a given instant of time, a job is in one of the following states: waiting, the job has precedence constraints in G; ready, the job can be executed; preempted, the execution was interrupted; running, the job's instructions are executed; or completed, the precedence constraints in G imposed by the job are relaxed.

Correctness of a scheduling algorithm. The scheduler manages the execution of jobs on the machine resources. For this, any DSS has to respect the following properties, which define its semantics: (P1) for the application: only jobs in the ready state can be put in the running state by the DSS; (P2) for the computing resources: if some resources of the machine (nodes) are idle while there are jobs in the ready state, at least one job will be put by the DSS in the running state after a finite delay. This delay, which may sometimes be bounded, defines the liveness of the DSS. A DSS verifying those properties is said to be correct.
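The job life cycle described above can be captured by a small state type; the sketch below uses names of our own choosing and also encodes property (P1): only a ready job may be moved to the running state.

/* Possible states of a job in the DSS, following the description above. */
enum job_state { WAITING, READY, PREEMPTED, RUNNING, COMPLETED };

struct dss_job {
    enum job_state state;
    /* ... precedence information, instruction sequence, ...             */
};

/* Property (P1): only jobs in the ready state may be started by the DSS. */
static int start_job(struct dss_job *j) {
    if (j->state != READY)
        return -1;            /* refused: would violate the DSS semantics */
    j->state = RUNNING;
    return 0;
}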
3 A Modular Architecture for a Parallel Scheduling
The DSS is implemented as a framework based on four modules (Fig. 1): a Job Builder, to construct and manage a graph of jobs; a Scheduling Policy, where the scheduling algorithm is implemented; a Load Level, to evaluate the load of the machine; and an Executor, which implements a pool of threads used to execute the jobs. Each module defines an interface to its services. An access to one of them represents a read or a write operation, both accomplished in a nonblocking fashion. A read operation is a request for data; it allows a module to
get data handled by another one. Write operations also provide resources for data exchange, but in the opposite direction: a module decides to send data to another module (e.g. a data or status change). As a reaction to the use of a service, an internal activity can be triggered and executed concurrently with the caller before the service ends. A short description of each module follows.
Fig. 1. Internal communication scheme between modules of the framework (a diagram showing the Job Builder, Scheduling Policy, Load Level and Executor units connected by the read, update and status-change arrows A to H referred to below).

Job Builder unit. Its role is to build jobs from the tasks defined by the application, eventually giving access to their dependencies. So, it manages the evolution of G. This module is attached to the application to receive and handle the task creation requests. The Job Builder unit must be informed when a job terminates in order to update the states of the jobs in G. In Fig. 1, arrow G shows the Executor sending a job termination event to the Job Builder.

Scheduling Policy unit. This is the core of the scheduling framework; it takes decisions about the placement of the jobs. In order to know how the application evolves, the Scheduling Policy receives from the Job Builder every change in G, e.g. when a job is created (arrow A in Fig. 1) or when its state changes (arrow B). To place the jobs so as to respect the load balancing scheme, the Scheduling Policy may consider the load on the parallel machine (arrow D). Notice that after taking the decisions, the load information must be updated (arrow D'). The Scheduling Policy must also react to two signals: an irregular load distribution detected by the Load Level (arrow C) and the requests for jobs to execute (arrow E) from the Executor. It can also create a new virtual processor (arrow F) or destroy one (arrow F').

Executor unit. The Executor is implemented over a support that provides fine-grained activities; it furnishes virtual processors, called threads, to process the execution of the application jobs. Each thread executes an infinite loop of two actions: 1) send a request (arrow E) for a job to the Scheduling Policy; and 2) execute sequentially the job's instructions. The load information is updated before and after the execution of each job (arrow H).

Load Level unit. The Load Level unit collects the load information in the computing system. These data are updated after the requests for write operations from the Scheduling Policy and Executor units. If an irregular distribution of the load in the machine is detected, this unit sends a message to the Scheduling Policy unit (arrow C). The definition of the load and the notion of irregular distribution depend on the Scheduling Policy unit.
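A compact sketch of the Executor loop and its interactions (arrows E, G and H); the function names are invented for illustration and do not correspond to the actual Athapascan-GS or GTLB interfaces.

/* Hypothetical interface of the other units, as seen by one Executor thread. */
struct job;
struct job *scheduling_policy_request(void);      /* arrow E: ask for a job    */
void        load_level_update(int delta);         /* arrow H: load accounting  */
void        job_builder_notify_completion(struct job *j); /* arrow G           */
void        execute_sequentially(struct job *j);  /* run the job's instructions */

/* Each virtual processor (thread) of the Executor runs this infinite loop. */
static void executor_thread(void) {
    for (;;) {
        struct job *j = scheduling_policy_request();  /* non-blocking request  */
        if (!j) continue;                             /* nothing ready: retry  */
        load_level_update(+1);                        /* before execution      */
        execute_sequentially(j);
        load_level_update(-1);                        /* after execution       */
        job_builder_notify_completion(j);             /* relax constraints in G */
    }
}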
4 Implemented Environments
This general specification has been used to build Athapascan-GS and GTLB, two DSSs for two distinct parallel programming environments. Athapascan-GS [5] was implemented on top of Athapascan-0 [1] for the Athapascan-1 macro dataflow language (INRIA Apache project, LMC laboratory) to support static and dynamic scheduling algorithms. Athapascan-GS can choose, from the set of ready jobs, the one that will be triggered in order to optimize a specific index of performance, like memory space, execution time or communication. The Generic Threads Load Balancer (GTLB) [3] defines a generic scheduler for highly irregular tree-search applications that belong to the Branch&Bound family; it was implemented on top of PM2 (Espace project, LIFL laboratory) to support schedulers based on mapping algorithms with migration. In both implementations, this design provides reusability in the development of specific scheduling algorithms.
5 Conclusion
In this paper, we have proposed a specification design to implement dynamic scheduling systems. This work completes classical classifications, which analyze only two units of scheduling algorithms, dedicated respectively to the control and to the load information. We claim that such a distinction is not precise enough; the major argument is that it does not take into account the relationships with the application (which produces tasks) and the execution (which consumes tasks). Two new modules are proposed: one dedicated to the work creation and another to the execution support; a protocol that specifies their interactions was also presented. Two examples of DSSs implementing this general structure were presented: Athapascan-GS and GTLB, both supporting various scheduling algorithms.
References
1. J. Briat, I. Ginzburg, M. Pasin and B. Plateau. Athapascan Runtime: Efficiency for Irregular Problems. In Proc. of the 3rd Euro-Par Conference, Passau, Aug. 1997.
2. T. L. Casavant and J. G. Kuhl. A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems. IEEE Trans. Soft. Eng. 14(2):141-154, Feb. 1988.
3. Y. Denneulin. Conception et ordonnancement des applications hautement irrégulières dans un contexte de parallélisme à grain fin. PhD Thesis, Université de Lille, Jan. 1998.
4. C. Jacqmot. Load Management in Distributed Computing Systems: Towards Adaptative Strategies. PhD Thesis, DII, Université Catholique de Louvain, Louvain-la-Neuve, Jan. 1996.
5. J.-L. Roch et al. Athapascan-1. Apache Project, Grenoble, http://www-apache.imag.fr, Oct. 1997.
6. M. H. Willebeek-LeMair and A. P. Reeves. Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Trans. Par. and Dist. Syst. 4(9):979-993, Sept. 1993.
Feedback Guided Dynamic Loop Scheduling: Algorithms and Experiments

J. Mark Bull

Centre for Novel Computing, Dept. of Computer Science, University of Manchester, M13 9PL, U.K. [email protected]

Abstract. Dynamic loop scheduling algorithms can suffer from overheads due to synchronisation, loss of locality and small iteration counts. We observe that timing information from previous executions of the loop can be utilised to reduce these overheads. We introduce two new algorithms for dynamic loop scheduling which implement this type of feedback guidance, and report experimental results on a distributed shared memory architecture. Under appropriate circumstances, these algorithms are observed to give significant performance gains over existing loop scheduling techniques.
1 Introduction
Minimising load imbalance is a key activity in producing efficient implementations of applications on parallel architectures. Since loops are the most significant source of parallelism in many applications, the scheduling of loop iterations to processors can be an important factor in determining performance. Most of the existing techniques for dynamic loop scheduling on shared memory machines are variants of, or are derived from, guided self-scheduling (GSS) [6]. These techniques share some common characteristics—they assume no prior knowledge of the workload associated with the iterations of the loop, and they all divide the loop iterations into a set of chunks, where there are more chunks than processors. The key observations that motivate the new algorithms presented here are the following:
– Some prior knowledge of the workload associated with the iterations of the loop may be available, particularly if we can assume that the workload has not changed too much since the previous execution of the loop. This is often the case in simulation of physical systems where the parallel loop is over a spatial domain and is executed at every time-step.
– Dividing the loop iterations into more chunks than processors can hurt performance, as it may incur overheads associated with additional synchronisation, loss of both inter- and intra-processor locality, and reduced efficiency of loop unrolling or pipelining.
Our algorithms are designed to utilise knowledge about the workload derived from the measured execution time of previous occurrences of the loop with the aim of reducing the incurred overheads.
2 Feedback-Guided Scheduling Algorithms

2.1 Self-Scheduling Algorithms
The simplest scheduling algorithm of all is block or static scheduling, which divides the loop iterations among the processors as equally as possible. No attempt is made to dynamically balance the load, but overheads are kept to a minimum: no synchronisation is required and the chunk size is as large as possible. Of dynamic algorithms, the simplest is self-scheduling [8], in which a central queue of iterations is maintained. As soon as a processor finishes an iteration, it removes the next available iteration from the queue and executes it. Chunk self-scheduling [3] allows each processor to remove a fixed number k of iterations from the queue instead of one. This reduces the overheads at the risk of poorer load balance. Guided self-scheduling (GSS) [6] attempts to overcome the difficulty of choosing a chunk size by dynamically varying the chunk size as the execution of the loop progresses, starting with large chunks and making them progressively smaller. There is also the possibility that the load may not be well balanced—for example if the loop has N iterations and the first N/p iterations contain more than 1/pth of the workload. A number of algorithms have been proposed which are essentially similar to GSS, and differ mainly in the manner in which the chunk size is decreased as the algorithm progresses. These include adaptive guided self-scheduling [1], factoring [2], tapering [4] and trapezoid self-scheduling [9]. Affinity scheduling [5] uses per-processor work queues in order to reduce contention between processors. Each processor initially has on its queue the iterations it would have been assigned under block scheduling. Another benefit of affinity scheduling is that it allows temporal locality of data across multiple executions of the parallel loop to be exploited. Variants of affinity scheduling to reduce synchronisation costs have been proposed in [7] and [10].
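For concreteness, a minimal sketch of how the chunk size of guided self-scheduling shrinks as iterations are consumed (a standard formulation of GSS, not code taken from the paper): each request grabs roughly a 1/p fraction of the remaining iterations.

/* Guided self-scheduling: the next chunk is ceil(remaining / p).
 * In a real implementation 'remaining' would live behind a lock or
 * an atomic counter shared by the p processors.                      */
static int gss_next_chunk(int *remaining, int p) {
    if (*remaining == 0) return 0;
    int chunk = (*remaining + p - 1) / p;   /* ceil(remaining / p)     */
    *remaining -= chunk;
    return chunk;
}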
2.2 New Algorithms
In order to exploit timing information, we require access to a fast, accurate hardware clock. Since our objective is to keep the chunk size as large as possible, we only permit measurement of the execution time of whole chunks, and not of iterations within a chunk. The first algorithm we describe is a feedback guided version of block scheduling. We divide the loop iterations into p chunks, not necessarily of equal size, where p is the number of processors. On each execution of the parallel loop we measure the total execution time for the chunk of iterations assigned to each processor. Dividing this time by the number of iterations on each processor gives us a mean load per iteration for that chunk, and hence a piecewise constant approximation to the true workload. We then choose new bounds for each processor to use for the next execution of the loop, based on an equipartition of the area under this piecewise constant function. By keeping a running total of the areas under each constant piece, this can be achieved in O(p) time. This could
be reduced to O(log2 (p)), using parallel prefix operations, but this would only be worthwhile for large values of p. The second algorithm is a feedback guided version of affinity scheduling. On each execution of the parallel loop we measure the total execution time for the chunk of iterations initially assigned to and subsequently executed by each processor. We derive a mean load per iteration, and hence a piecewise constant approximation to the true workload in the same way as for the feedback guided block algorithm. Equipartitioning this approximation to the workload gives us the initial assignment of iterations for the next execution of the loop. This is similar in spirit to the dynamic affinity algorithm of [7], which initialises each processor with the same number of iterations as it executed on the previous execution of the loop.
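A sketch of the repartitioning step of feedback guided block scheduling described above; the half-open chunk convention and the O(n) scan are our own simplifications (the paper's version achieves the repartition in O(p) time).

/* Feedback-guided block scheduling: chunk q of the previous execution covered
 * iterations [lo[q], hi[q]) and took time t[q]; derive new boundaries that
 * equipartition the piecewise-constant workload approximation.  Chunks are
 * assumed contiguous, i.e. lo[q+1] == hi[q].                                 */
static void fg_block_repartition(int p, const int lo[], const int hi[],
                                 const double t[], int new_lo[], int new_hi[])
{
    int n = hi[p - 1];                         /* iterations are lo[0] .. n-1 */
    double total = 0.0;
    for (int q = 0; q < p; q++) total += t[q];
    double target = total / p;                 /* ideal area per processor    */

    int q = 0, r = 0;
    double acc = 0.0;
    new_lo[0] = lo[0];
    for (int i = lo[0]; i < n; i++) {
        while (i >= hi[q]) q++;                /* old chunk containing i      */
        acc += t[q] / (double)(hi[q] - lo[q]); /* measured load of iteration i */
        if (r < p - 1 && acc >= (r + 1) * target) {
            new_hi[r++] = i + 1;               /* cut after iteration i       */
            new_lo[r] = i + 1;
        }
    }
    while (r < p) {                            /* close any remaining chunks  */
        new_hi[r++] = n;
        if (r < p) new_lo[r] = n;
    }
}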
3 Experimental Evaluation

3.1 Test Problems
To test our algorithms we use a workload consisting of a Gaussian distribution whose midpoint oscillates sinusoidally, defined at the ith iteration and the tth loop execution by

w_i^t = exp( -( (i - (N/2 + N sin(2πt/τ)/4)) / (N/8) )^2 )

where N is the number of iterations, and τ is the period of the oscillation. By varying τ we can control how rapidly the load distribution changes from one execution of the loop to the next—a large value gives a nearly static load, a small value gives a rapidly varying one. We execute the parallel loop 1000 times with values of the period τ varying between 10 and 1000. We use three different loop bodies. Loop Body 1 simply causes the processor to spin for a time D·w_i^t, where D is a fixed (wall clock) time period:

for (i=0; i<N; i++){
  /* busy-wait (spin) for a wall clock time of D * w[i] */
}
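A direct transcription of this workload into C (the function name and signature are assumptions of ours; the paper itself only shows calls such as load(i)):

#include <math.h>

/* Gaussian workload whose midpoint oscillates with period tau (see the
 * formula above); i is the iteration index, t the loop execution number. */
static double workload(int i, int t, int N, double tau) {
    const double PI = 3.14159265358979323846;
    double mid = N / 2.0 + (N / 4.0) * sin(2.0 * PI * t / tau);
    double x = (i - mid) / (N / 8.0);
    return exp(-x * x);
}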
For small chunk sizes, more than one processor at a time may be updating the same cache line of a, resulting in a significant number of coherency misses in a cache coherent system. In Loop Body 3 each iteration updates elements in a column of a two-dimensional array using the corresponding element of a second array and its four nearest neighbours, where the number of elements updated is proportional to w_i^t:

for (i=1; i<=N; i++){
  khi = N * load(i);
  for (k=1; k<=khi; k++){
    /* update element k of column i of the two-dimensional array using the
       corresponding element of the second array and its four nearest
       neighbours */
  }
}
The four test problems are:
1. Loop body 1, N = 1000, D = 10^-4 seconds.
2. Loop body 1, N = 1000, D = 10^-3 seconds.
3. Loop body 2, N = 10000.
4. Loop body 3, N = 1000.
3.2 Results and Discussion
We ran each of the four test problems on 16 processors of a Silicon Graphics Origin 2000 system (with 195 MHz R10000 processors), using the following scheduling algorithms: block, feedback guided (FG) block, affinity, feedback guided affinity, simple self-scheduling (SS), guided self-scheduling (GSS) and trapezoid self-scheduling (TSS), recording the execution time Tp. We also ran the block scheduling algorithm on one processor to give a sequential reference time Ts. The computational cost of feedback guidance is low: each call to the hardware clock requires approximately 0.35µs and the computation of the new partition for 16 processors approximately 45µs. Figures 1 to 4 show the efficiency (computed as Ts/(p·Tp)) as a function of the period τ of the oscillating workload.

Figs. 1-4. Efficiency against workload period τ for Problems 1-4, with curves for Block, FG Block, Affinity, FG Affinity, SS, GSS and TSS (Fig. 1: Problem 1, Fig. 2: Problem 2, Fig. 3: Problem 3, Fig. 4: Problem 4).

For Problem 1 the workload on each iteration is sufficiently small for the overheads of synchronised access to work queues to dominate the efficiency of the self-scheduling algorithms. The result of this is that simple self-scheduling (which requires O(N) accesses to the central queue) and affinity scheduling (which requires O(p log(N/p^2)) accesses to each of the p local queues) perform no better than block scheduling. GSS, with O(p log N) accesses, and TSS, with 4p accesses to the central queue, are more efficient. Adding feedback guidance to affinity scheduling does nothing to reduce the total number of queue accesses, even though there are fewer accesses to queues of other processors, and the additional overhead of timing each chunk of iterations is incurred. FG block
scheduling avoids all synchronisation costs, and for sufficiently slowly changing workload (τ ≥ 100) it is the most efficient of the algorithms tested.

For Problem 2 the workload on each iteration is larger, and synchronisation overhead is less important. SS and affinity scheduling successfully balance the load, whereas GSS and TSS (with larger chunks) do not. Feedback guidance does not improve affinity scheduling, but again for sufficiently slowly changing workload (τ ≥ 200), FG block scheduling shows the best performance.

For Problem 3, false-sharing induced communication is the most significant overhead. Both SS and GSS produce many small chunks, with consecutive chunks likely to be executed on different processors. Affinity scheduling also produces many small chunks, but there is a higher likelihood of consecutive chunks being executed on the same processor. TSS produces only 4p chunks, and thus it performs well even though it may not successfully balance the load. Apart from the fastest moving workload (τ = 10), FG affinity scheduling is more efficient than affinity scheduling as it further increases the likelihood of consecutive chunks being executed on the same processor. FG block scheduling is again poor for rapidly evolving workload, as it does not balance the load well, but as τ increases, the advantage of having only p chunks becomes significant. For these cases, it is by far the most efficient algorithm.

For Problem 4, both synchronisation and communication overheads are important. None of the central queue algorithms perform well, as they do not
exploit data affinity between executions of the parallel loop. For rapidly changing workload (τ ≤ 20), affinity scheduling is the most efficient, but as the workload becomes more predictable, FG affinity scheduling gives the best performance. For small τ, FG block scheduling is very poor, but as τ increases, its performance approaches that of feedback guided affinity scheduling.
4 Conclusions
We have described two algorithms which make use of feedback information to reduce the overheads associated with scheduling a parallel loop in a shared variable programming model. Our experiments show that under certain conditions, when there is sufficient correlation between the workload on successive executions of the parallel loop, these new algorithms will outperform traditional loop self-scheduling methods, sometimes by a considerable margin.
References
1. Eager, D. L. and Zahorjan, J. (1992) Adaptive Guided Self-Scheduling. Technical Report 92-01-01, Department of Computer Science and Engineering, University of Washington.
2. Hummel, S. F., Schonberg, E. and Flynn, L. E. (1992) Factoring: A Practical and Robust Method for Scheduling Parallel Loops. Communications of the ACM, vol. 35, no. 8, pp. 90-101.
3. Kruskal, C. P. and Weiss, A. (1985) Allocating Independent Subtasks on Parallel Processors. IEEE Trans. on Software Engineering, vol. 11, no. 10, pp. 1001-1016.
4. Lucco, S. (1992) A Dynamic Scheduling Method for Irregular Parallel Programs. In Proceedings of ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992, pp. 200-211.
5. Markatos, E. P. and LeBlanc, T. J. (1994) Using Processor Affinity in Loop Scheduling on Shared Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 4, pp. 379-400.
6. Polychronopoulos, C. D. and Kuck, D. J. (1987) Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers, C-36(12), pp. 1425-1439.
7. Subramaniam, S. and Eager, D. L. (1994) Affinity Scheduling of Unbalanced Workloads. In Proceedings of Supercomputing '94, IEEE Comp. Soc. Press, pp. 214-226.
8. Tang, P. and Yew, P.-C. (1986) Processor Self-Scheduling for Multiple Nested Parallel Loops. In Proceedings of 1986 Int. Conf. on Parallel Processing, pp. 528-535, St. Charles, IL.
9. Tzen, T. H. and Ni, L. M. (1993) Trapezoid Self-Scheduling Scheme for Parallel Computers. IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98.
10. Yan, Y., Jin, C. and Zhang, X. (1997) Adaptively Scheduling Parallel Loops in Distributed Shared-Memory Systems. IEEE Trans. on Par. and Dist. Systems, vol. 8, no. 1, pp. 70-81.
Load Balancing for Problems with Good Bisectors and Applications in Finite Element Simulations

Stefan Bischof, Ralf Ebner, and Thomas Erlebach

Institut für Informatik, Technische Universität München, D-80290 München, Germany
{bischof,ebner,erlebach}@in.tum.de
http://www{mayr,zenger}.in.tum.de/
Abstract. This paper studies load balancing issues for classes of problems with certain bisection properties. A class of problems has α-bisectors if every problem in the class can be subdivided into two subproblems whose weight (i.e. workload) is not smaller than an α-fraction of the original problem. It is shown that the maximum weight of a subproblem produced by Algorithm HF, which partitions a given problem into N subproblems by always subdividing the problem with maximum weight, is at most a factor of (1/α)·(1−α)^(1/α−2) greater than the theoretical optimum (uniform partition). This bound is proved to be asymptotically tight. Two strategies to use Algorithm HF for load balancing distributed hierarchical finite element simulations are presented. For this purpose, a certain class of weighted binary trees representing the load of such applications is shown to have 1/4-bisectors. The maximum resulting load is at most a factor of 9/4 larger than in a perfectly uniform distribution in this case.
1 Introduction
Load balancing is one of the major research issues in the context of parallel computing. Irregular problems are often difficult to tackle in parallel because they tend to overload some processors while leaving other processors nearly idle. For these applications it is very important to find methods for obtaining a balanced distribution of load. Usually, the load is created by processes that are part of a (parallel) application program. During the run of the application, these processes perform certain calculations independently but have to communicate intermediate results or other data using message passing. One is usually interested in achieving a balanced load distribution in order to minimize the execution time of the application or to maximize system throughput. Load balancing problems have been studied for a huge variety of models, and many different solutions regarding strategies and implementation mechanisms have been proposed. A good overview of recent work can be obtained from [5], for example. [6] reviews ongoing research on dynamic load balancing,
emphasizing the presentation of models and strategies within the framework of general classification schemes. In this paper we study load balancing for a very general class of problems. The only assumption we make is that all problems in the class have a certain bisection property. Such classes of problems arise, for example, in the context of distributed hierarchical finite element simulations. We show how our general results can be applied to numerical applications in several ways.
2 Using Bisectors for Load Balancing
In many applications a computational problem cannot be divided into many small problems as required for an efficient parallel solution directly. Instead, a strategy similar to divide and conquer is used repeatedly to divide problems into smaller subproblems. We refer to the division of a problem into two smaller subproblems as bisection. Assuming a weight function w that measures the resource demand, a problem p cannot always be bisected into two subproblems p1 and p2 of equal weight w(p)/2. For many classes of problems, however, there is a bisection method that guarantees that the weights of the two obtained subproblems do not differ too much. The following definition captures this concept more precisely.

Definition 1. Let 0 < α ≤ 1/2. A class P of problems with weight function w : P → R+ has α-bisectors if every problem p ∈ P can be efficiently divided into two problems p1 ∈ P and p2 ∈ P with w(p1) + w(p2) = w(p) and w(p1), w(p2) ∈ [αw(p); (1−α)w(p)].

Note that this definition requires for the sake of simplicity that all problems in P can be bisected, whereas in practice this is not the case for problems whose weight is below a certain threshold. We assume, however, that the problem to be divided among the processors is big enough to allow further bisections until the number of subproblems is equal to the number of processors. This is a reasonable assumption for most relevant parallel applications.

Figure 1 shows Algorithm HF, which receives a problem p and a number N of processors as input and divides p into N subproblems by repeated application of α-bisectors to the heaviest remaining subproblem. A perfectly balanced load distribution on N processors would be achieved if a problem p of weight w(p) was divided into N subproblems of weight exactly w(p)/N each. The following theorem gives a worst-case bound on the ratio between the maximum weight among the N subproblems produced by Algorithm HF and this ideal weight w(p)/N. All proofs are omitted in this paper due to space constraints but can be found in [1].

Theorem 2. Let P be a class of problems with weight function w : P → R+ that has α-bisectors. Given a problem p ∈ P and a positive integer N, Algorithm HF uses N − 1 bisections to partition p into N subproblems p1, ..., pN such that

max_{1≤i≤N} w(pi) ≤ (w(p)/N) · (1/α) · (1−α)^(1/α−2).
Input: problem p, positive integer N
begin
  P ← {p};
  while |P| < N do
  begin
    q ← a problem in P with maximum weight;
    bisect q into q1 and q2;
    P ← (P ∪ {q1, q2}) \ {q};
  end;
  output P;
end.
Fig. 1. Algorithm HF (Heaviest Problem First)

It is not difficult to see that the bound given in Theorem 2 is asymptotically tight. This means that for sufficiently large N there exists a sequence of bisections such that the maximum load generated by Algorithm HF is arbitrarily close to this bound. For this purpose, the original problem is first split into a number of subproblems p1, ..., pn of equal weight. Each of p1, ..., pn−1 is then split into 1/α subproblems using a sequence of 1/α − 1 exact α-bisections, whereas pn is split similarly but without doing the final bisection step. The construction outlined above suggests that the bound from Theorem 2 might not be tight for small values of N, however. Indeed, we can prove that for all α ≤ 1/5 and for all N ≤ 1/α, Algorithm HF partitions a problem from a class of problems with α-bisectors into N subproblems such that the weight of the heaviest subproblem is at most w(p)(1−α)^(N−1). Hence, the ratio between the maximum weight and the ideal average weight is bounded by N(1−α)^(N−1), which is smaller than (1/α)(1−α)^(1/α−2) for the range where this bound applies.
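A small C sketch of Algorithm HF over problems represented only by their weights (the bisect callback and the array representation are assumptions of ours; a linear scan replaces a priority queue for brevity):

#include <stddef.h>

/* 'bisect' must split a weight w into two parts w1 + w2 = w with
 * w1, w2 in [alpha*w, (1-alpha)*w], i.e. it realises an alpha-bisector.  */
typedef void (*bisect_fn)(double w, double *w1, double *w2);

/* Algorithm HF: repeatedly bisect the currently heaviest problem until
 * exactly N subproblems exist; out[] must have room for N weights.      */
static void algorithm_hf(double w_root, int N, bisect_fn bisect, double out[]) {
    out[0] = w_root;
    int count = 1;
    while (count < N) {
        int heaviest = 0;
        for (int i = 1; i < count; i++)
            if (out[i] > out[heaviest]) heaviest = i;
        double w1, w2;
        bisect(out[heaviest], &w1, &w2);
        out[heaviest] = w1;
        out[count++] = w2;
    }
}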
3 Application of Algorithm HF to Distributed Finite Element Simulations
In this section, we present the application of Algorithm HF for load balancing in the field of distributed numerical simulations with the finite element (FE) method [2]. The FE method is used in statics analysis, for example, to calculate the response of objects under external load. In this paper, we consider a short cantilever under plane stress conditions, a problem from plane elasticity. The quadratic cantilever is uniformly loaded on its upper side. Its left side is fixed as shown in Figure 2, top left. The definition of the object’s shape, its structural properties, and the boundary conditions lead to a system of elliptic partial differential equations. We substructure the physical domain of the cantilever recursively. With the method of [4], a tree data structure is built reflecting the hierarchy of the substructured domain (see Figure 2). This data structure contains a system of linear
equations Su = f for each tree node. Its stiffness matrix S determines the unknown displacement values in the vector u at the black points shown in Figure 2 (right), dependent on the external and reaction forces f.

Fig. 2. The example application (top left: the short cantilever under plane stress with uniform load f = 0.01, E = 1000, ν = 1/3; recursive substructuring of the physical domain and accumulation of the equation systems of the tree nodes).

In the leaves, the system of equations is constructed by standard FE discretization and the so-called virtual displacement method. In an internal node, the system of linear equations is assembled out of the equation systems of its children. For the solution of all those equation systems, we use an iterative solver which traverses the tree several times, promoting displacements in top-down direction and reaction forces in bottom-up direction. Since the adaptive structure of the tree is not known a priori, a good load balancing strategy is essential before the parallel execution of the solving iterations.

We assign a load value ℓ(p) = C_b·n_b(p) + C_s·n_s(p) to each tree node p, with n_b(p) black points on the border without boundary conditions, and n_s(p) black points (not ghost points) belonging to the separator line of node p. The load value ℓ(p) models the computing time of node p, where the constants C_s and C_b are independent of node p, and C_s ≈ 6·C_b. These values can be easily collected during the tree construction phase. In order to distribute the work load among the processors, we need to know the computing time of whole subtrees of the FE tree. So for each node p, we accumulate the load in the subtree with root p to get the weight value w(p):
w(p) = ℓ(p)                       if p is a leaf
w(p) = ℓ(p) + w(c1) + w(c2)       if p is the root or an internal node
where c1 and c2 are the children of p.
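A sketch of this bottom-up accumulation (the node layout and the constants are assumptions of ours, with Cs set to 6·Cb as suggested above):

/* Node of the FE tree; nb = black points on the border without boundary
 * conditions, ns = black points on the separator line of the node.       */
struct fe_node {
    int nb, ns;
    double load;                 /* l(p) = Cb*nb + Cs*ns                   */
    double weight;               /* w(p) = accumulated load of the subtree */
    struct fe_node *child[2];    /* NULL pointers for leaves               */
};

static const double CB = 1.0, CS = 6.0;   /* Cs is roughly 6*Cb            */

/* Post-order traversal computing l(p) and w(p) for every node.           */
static double accumulate_weight(struct fe_node *p) {
    p->load = CB * p->nb + CS * p->ns;
    p->weight = p->load;
    for (int k = 0; k < 2; k++)
        if (p->child[k])
            p->weight += accumulate_weight(p->child[k]);
    return p->weight;
}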
For a bisection step according to Algorithm HF, we remove the root node p from the tree of weight values. This yields two subtrees rooted at the children c1 and c2 of p, and the weight of the root node p is ignored for the remainder of the load balancing phase. The results of Theorem 2 are well approximated because ℓ(p) (work load for the one-dimensional separator) is negligible compared to w(p) (work load for the two-dimensional domain) in our application with commensurately large FE trees. Algorithm HF chops N subtrees off the FE tree, each of which can be traversed in parallel by the iterative solver. These N subtrees contain the main part of the solving work and may be distributed over the available N processors, whereas the upper N − 1 tree nodes cannot exploit the whole number of processors anyway. Another strategy divides the entire given FE tree into N partitions of approximately equal size by cutting edges (and not root nodes of subtrees). For this purpose, we set w(p) = ℓ(p) in each tree node. This application of Algorithm HF takes into account that the main memory resources of the processors become the limiting factor if very high accuracy of the simulation is required.
4 Weighted Trees with Good Bisectors
Let T be the set of all rooted binary trees with node weights ℓ(v) satisfying:
(1) ℓ(p) ≤ ℓ(c1) + ℓ(c2) for nodes p with two children c1 and c2
(2) ℓ(p) ≥ ℓ(c) if c is a child of p
The weight of a tree T = (V, E) in T is defined as w(T) = Σ_{v∈V} ℓ(v). This class T of binary trees models the load of applications in hierarchical finite element simulations, as discussed in Section 3. Recall that in these applications the domain of the computation is repeatedly subdivided into smaller subdomains. The structure of the domains and subdomains yields a binary tree in which every node has either two children or is a leaf. The resource demands (CPU and main memory) of the nodes in this FE tree are such that the resource demand at a node is at most as large as the sum of the resource demands of its two children. Note that (1) and (2) ensure that the two subtrees obtained from a tree in T by removing a single edge are also members of T. The following theorem shows that trees from the class T can be 1/4-bisected by removal of a single edge unless the weight of the tree is concentrated in the root.

Theorem 3. Let T = (V, E) be a tree in T, and let r be its root. If ℓ(r) ≤ (3/4)·w(T), then there is an edge e ∈ E such that the removal of e partitions T into subtrees T1 and T2 with w(T1), w(T2) ∈ [(1/4)·w(T); (3/4)·w(T)].

According to Theorem 2, a problem p from a class of problems that has 1/4-bisectors can always be subdivided into N subproblems p1, ..., pN such that max_{1≤i≤N} w(pi) ≤ (w(p)/N) · (9/4). The following corollary gives a condition on trees in T that ensures that they can be subdivided into N subproblems using 1/4-bisectors.
Corollary 4. Let T = (V, E) be a tree in T, and let r be its root. Let N be a positive integer. If w(T) ≥ (4/3)·(N − 1)·ℓ(r), Algorithm HF partitions T into N subtrees by cutting exactly N − 1 edges such that the maximum weight of the resulting subtrees is at most (9/4) · w(T)/N.

Note that an optimal min-max k-partition of a weighted tree (i.e., a partition with minimum weight of the heaviest component after removing k edges) can be computed in linear time [3]. These algorithms are preferable to our approach using Algorithm HF in the case of trees that are to be subdivided by removing a minimum number of edges. Since the heaviest subtree in the optimal solution does obviously not have a greater weight than the maximum generated by Algorithm HF, the bound from Corollary 4 still applies.
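One straightforward, if not optimal, way to find such a bisecting edge, assuming subtree weights have already been accumulated as in Section 3: scan the edges and return one whose lower subtree weight lies in [w(T)/4, 3w(T)/4]. Theorem 3 guarantees that the scan succeeds whenever ℓ(r) ≤ (3/4)·w(T). The tree layout below is hypothetical.

struct wtree { double weight; struct wtree *child[2]; };

/* Return a node c (not the root) whose subtree weight lies in
 * [total/4, 3*total/4]; cutting the edge above c is then a 1/4-bisection.
 * Returns NULL if no such edge exists in the scanned subtree.            */
static struct wtree *find_quarter_bisector(struct wtree *v, double total) {
    if (!v) return NULL;
    for (int k = 0; k < 2; k++) {
        struct wtree *c = v->child[k];
        if (!c) continue;
        if (c->weight >= total / 4.0 && c->weight <= 3.0 * total / 4.0)
            return c;                        /* cut the edge (v, c)        */
        struct wtree *found = find_quarter_bisector(c, total);
        if (found) return found;
    }
    return NULL;
}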
5 Conclusion
The existence of α-bisectors for a class of problems was shown to allow good load balancing for a surprisingly large range of values of α. The maximum load achieved by Algorithm HF is at most a factor of (1/α)·(1−α)^(1/α−2) larger than the theoretical optimum (uniform distribution). Load balancing for distributed hierarchical FE simulations was discussed, and two strategies for applying Algorithm HF were presented. The first strategy tries to make the best use of the available parallelism, but requires that the nodes of the FE tree representing the load of the computation have good separators. The second strategy tries to partition the entire FE tree into subtrees with approximately equal load. For this purpose, it was proved that a certain class of weighted trees, which include FE trees, has 1/4-bisectors. Here, the trees are bisected by removing a single edge. Partitioning the trees by removing a minimum number of edges ensures that only a minimum number of communication channels of the application must be realized by network connections. Our results provide worst-case bounds on the maximum load achieved for applications with good bisectors in general and for distributed hierarchical finite element simulations in particular. For the latter application, we showed that the maximum resulting load is at most a factor of 9/4 larger than in a perfectly uniform distribution. At present we are implementing the load balancing methods proposed in this paper and integrating them into the existing finite element simulations software. We expect considerable improvements as compared to the static (compile-time) processor allocation currently in use. See [1] for first results.
References
1. Stefan Bischof, Ralf Ebner, and Thomas Erlebach. Load Balancing for Problems with Good Bisectors, and Applications in Finite Element Simulations: Worst-case Analysis and Practical Results. SFB-Bericht 342/05/98 A, SFB 342, Institut für Informatik, Technische Universität München, 1998. http://wwwmayr.in.tum.de/berichte/1998/TUM-I9811.ps.gz
2. Dietrich Braess. Finite Elemente. Springer, Berlin, 1997. 2. überarbeitete Auflage.
3. Greg N. Frederickson. Optimal Algorithms for Tree Partitioning. In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms SODA '91, pages 168-177, New York, 1991. ACM Press.
4. Reiner Hüttl. Ein iteratives Lösungsverfahren bei der Finite-Element-Methode unter Verwendung von rekursiver Substrukturierung und hierarchischen Basen. PhD thesis, Institut für Informatik, Technische Universität München, 1996.
5. Behrooz A. Shirazi, Ali R. Hurson, and Krishna M. Kavi. Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, 1995.
6. Thomas Schnekenburger and Georg Stellner, editors. Dynamic Load Distribution for Parallel Applications. TEUBNER-TEXTE zur Informatik. Teubner Verlag, Stuttgart, 1997.
An Efficient Strategy for Task Duplication in Multiport Message-Passing Systems

Dingchao Li1, Yuji Iwahori1, Tatsuya Hayashi2, and Naohiro Ishii3

1 Educational Center for Information Processing
2 Department of Electrical and Computer Engineering
3 Department of Intelligence and Computer Science
Nagoya Institute of Technology
[email protected]
Abstract. Task scheduling is an important problem in compilers for parallel machines. In this paper, we present a duplication strategy for task scheduling in multiport message-passing systems. Through a performance gain analysis, we establish a condition under which duplicating a parent task of a task is beneficial. We also show that, by incorporating this strategy into two well-known priority-based scheduling algorithms, significant reductions in the execution time can be achieved.
1 Introduction
Scheduling is a sequential optimization technique used to exploit the parallelism inherent in programs. In this paper, we consider the problem of scheduling the tasks of a given program in a multiport message-passing system with the aim of minimizing the overall execution time of the program. Researchers have investigated two versions of this problem, depending on whether or not task duplication (or recomputation) is allowed. In general, for the same program, task scheduling with duplication produces a schedule with a smaller makespan (i.e., total execution time) than when task duplication is not allowed. Some duplication based scheduling algorithms have been introduced in [2,4,5,7]. However, all of these algorithms assume the availability of an unlimited number of processors. If the number of processors available is less than the number of processors they require, there could be a problem. In this paper, we aim to develop an efficient and practical algorithm with task duplication for scheduling tasks onto a bounded number of available processors. In this area, the work described in [8] is close to ours. That algorithm determines the tasks on the critical paths of a given program before scheduling and tries to select such tasks for duplication in every scheduling step. However, a critical path may be shortened and new critical paths may be generated during the scheduling process, because assigning a task and a parent task of the task to a single processor makes the communication overhead between them become zero. Therefore, our algorithm does not attempt to identify whether a task is critical. Instead, it dynamically analyzes the increase in the program execution time that may be caused by duplicating a parent task of a task, to guide the duplication decision. We will show
An Efficient Strategy for Task Duplication
391
that incorporating this strategy with some known scheduling algorithms such as the mapping heuristic [9] and the dynamic level method [10] can improve their performance without introducing additional computational complexity. The remainder of the paper is organized as follows: system and program models are described in Section 2 and in Section 3, we present a duplication strategy for task scheduling. The performance evaluation of the strategy is discussed in Section 4. Finally, Section 5 gives some concluding remarks.
2
System and Program Models
We consider a system consisting of m identical programmable processors Pi (i = 1 · · · m), each with a private or local memory and fully connected by a multiport interconnection network. In this system, the tasks of a parallel program mapped to different processors communicate solely by message-passing. In every communication step, a processor can send distinct messages to other processors and simultaneously receive multiple messages from these processors. We assume that communication can be overlapped by computation and the communication time can be neglected if two tasks are mapped to a processor. A parallel program is modeled as a weighted directed acyclic graph, or task graph. Let G = (Γ, A, µ, c) denote a task graph, where Γ is a finite set of tasks Ti (i = 1, · · · , n), A is a set of directed arcs representing dependence constraints among tasks, µ is an execution time function whose value µ(Ti ) (time units) is the execution time of task Ti , and c is a communication cost function whose value c(Ti , Tj ) denotes the amount of communication from Ti to Tj . The arc (Ti , Tj ) from Ti to Tj asserts that task Tj cannot start execution until the message from Ti arrives at the processor executing Tj . If (Ti , Tj ) ∈ A, Ti is said to be a parent task of Tj and Tj is said to be a child task of Ti . In task graph G, tasks without parent tasks are known as entry tasks and tasks without child tasks are known as exit tasks. We assume, without loss of generality, that G has exactly one entry task and exactly one exit task. For task Ti , let est(Ti ) and lst(Ti ) denote the earliest start and completion times, respectively. The computation of the earliest start and completion times can be determined in a top-down fashion, starting with the entry task and terminating at the exit task. Mathematically, est(Ti ) =
min
max
(Tj ,Ti )∈A (Tk ,Ti )∈A,k=j
{ect(Tj ), ect(Tk ) + c(Tk , Ti )},
ect(Ti ) = est(Ti ) + µ(Ti ).
(1) (2)
Similarly, let lst(Ti ) and lct(Ti ) denote the latest start and completion times of Ti , respectively. The latest start and completion times of Ti can be given by lct(Ti ) =
min {lst(Tj ) − c(Ti , Tj )},
(Ti ,Tj )∈A
lst(Ti ) = lct(Ti ) − µ(Ti ).
(3) (4)
Note that the latest start time of Ti indicates how long the start of Ti can be delayed without increasing the minimum execution time of the graph.
392
3
Dingchao Li et al.
Scheduling with Duplication
For machine and program models described above, this section develops a duplication based algorithm in an attempt to find a one-to-one mapping from each task to its assigned processor such that the program execution time is minimized. Our algorithm consists of two parts. The first part decides which ready task should be assigned to an idle processor. This part is heuristic in nature and we can make the choice by applying one of the existing strategies such as HLF(Highest Level First), ETF(Earliest Task First), LP(Longest Path), LPT(Longest Processing Time), and CP(Critical Path). Given the task in the first part, the second part decides whether a parent of the task should be duplicated.
Pi
Pj
Pi
Pk
Pj
Pk
Tu
Tu
Tv T v
Tv cm mt( Tu,Tw) mt( Tv ,Tw)
Tw (a)
t
Tw
(b)
Fig. 1. The task duplication strategy. (a) The schedule before duplication. (b) The schedule after duplication.
Before discussing our task duplication strategy, we introduce cm to represent the current time of the event clock, P (Tu ) to represent the processor that processes task Tu and st(P (Tu ), Tu ) to represent the time when Tu starts execution on P (Tu ). We also assume that processor Pk is idle at cm and task Tw is selected for scheduling, as shown in Fig. 1. The start time at which task Tw can run on processor Pk in case of no task duplication is thus given by stnocopy (Pk , Tw ) =
max
(Tu ,Tw )∈A,P (Tu )=Pk
{cm, st(P (Tu ), Tu ) + µ(Tu ) + c(Tu , Tw )}.
(5)
Consider the fact that if the start time of a task is later than the latest start time of the task, then the overall completion time of the whole graph will be
An Efficient Strategy for Task Duplication
393
increased. Thus, we can define the increase incurred due to the execution of task Tw on processor Pk as follows. δnocopy (Pk , Tw ) = stnocopy (Pk , Tw ) − lst(Tw ).
(6)
Obviously, if δnocopy (Pk , Tw ) ≤ 0, we should stop duplicating the parent tasks of Tw and assign Tw onto processor Pk , since duplicating its parent tasks is unable to shorten the schedule length even if its start time is reduced. Assume now that δnocopy (Pk , Tw ) > 0. To minimize the overall execution time, we need to consider task duplication for reducing this increase as much as possible. Let us use mt(Tu , Tw ) to denote the time when the message for task Tw from Tu arrives at processor P (Tw ); i.e., mt(Tu , Tw ) = st(P (Tu ), Tu ) + µ(Tu ) + c(Tu , Tw ) if P (Tu ) = P (Tw ). From Fig. 1, it is clear that we should select a task Tv , as the candidate for duplication, such that mt(Tv , Tw ) =
max
(Tu ,Tw )∈A,P (Tu )=P (Tw )
{st(P (Tu ), Tu ) + µ(Tu ) + c(Tu , Tw )}.
Further let it(Pk ) denote the time when processor Pk becomes idle. Thus, the time at which Tv can start execution on Pk can be calculated below. st(Pk , Tv ) =
max
(Ts ,Tv )∈A,P (Ts )=Pk
{it(Pk ), st(P (Ts ), Ts ) + µ(Ts ) + c(Ts , Tv )}.
(7)
Consequently, in this case, the time at which task Tw can start execution on processor Pk is that stcopy (Pk , Tw ) =
max
(Tu ,Tw )∈A,P (Tu )=Pk ,u=v
{cm, st(P (Tu ), Tu ) + µ(Tu ) + c(Tu , Tw ), st(Pk , Tv ) + µ(Tv )}.
(8)
This produces an increase as follows. δcopy (Pk , Tw ) = stcopy (Pk , Tw ) − lst(Tw ).
(9)
Comparing δcopy (Pk , Tw ) with δnocopy (Pk , Tw ), we can finally establish the following condition: duplicating a parent task of Tw to the current processor should only occur if δcopy (Pk , Tw ) < δnocopy (Pk , Tw ).
(10)
The above duplication strategy clearly takes O(p) times in the worst case, where p is the maximum number of parent tasks of every task in G. Therefore, incorporating this strategy into some known scheduling algorithms such as the mapping heuristic [9] and the dynamic level method [10] does not introduce additional time complexity, since these algorithms generally require O(nm) times for selecting a ready task and an idle processor for assignment, where n is the number of tasks and m is the number of processors.
394
4
Dingchao Li et al.
Experimental Study
This section presents an experimental study of the effects of the task duplication strategy described in Section 3. By incorporating the strategy into the mapping heuristic (MH) and the dynamic level method (DL) without global clock heuristic, we implemented two scheduling algorithms on a Sun workstation using the programming language C. Whereas we compared the performance of each enhanced algorithm with that of the original algorithm under various parameter settings, due to space limitations, here we show only some of the salient results. In our experiments, we generated 350 random task graphs and scheduled them onto a machine consisting of eight identical processors, fully interconnected through full-duplex interprocessor links. The task graphs generated randomly ranged in size from 40 to 160 nodes with increments of 20. Each graph was constructed by generating a directed acyclic graph and removing all transitive arcs. For each vertex, the execution time and the number of the child tasks were picked from an uniform distribution whose parameters are input data. The communication time of each arc was determined alike. The communication-tocomputation (C/C) ratio value was settled to be 5.0, where the C/C ratio of a task graph is defined as the average communication time per arc divided by the average execution time per task. The performance analysis was based on the average schedule length obtained by each algorithm. Tables 1 and 2 show the experimental results, where Impr. represents the percentage performance improvement of each enhanced algorithm compared with that of the original algorithm. It can be seen that the performance of each algorithm with duplication is consistently better than the algorithm without duplication. The performance improvement compared with the original mapping heuristic varies from 4.01% to 8.17%, and the performance improvement compared with the original dynamic level algorithm varies from 3.42% to 6.90%. These results confirm our expectation that the proposed duplication strategy is efficient and practical.
Table 1. Results from the mapping heuristics with and without duplication. No. of graphs 50 50 50 50 50 50 50
No. of Avg. sche. leng. Avg. sche. leng. tasks MH without dup. MH with dup. 40 649.58 609.64 60 860.88 826.34 80 1093.38 1023.16 100 1310.74 1231.68 120 1515.46 1401.70 140 1711.06 1583.58 160 1962.92 1802.51
Impr. (%) 6.15 4.01 6.42 6.03 7.51 7.45 8.17
An Efficient Strategy for Task Duplication
395
Table 2. Results from the dynamic level algorithms with and without duplication. No. of graphs 50 50 50 50 50 50 50
5
No. of Avg. sche. leng. Avg. sche. leng. tasks DL without dup. DL with dup. 40 602.60 572.41 60 788.61 761.66 80 1037.65 993.82 100 1226.73 1172.94 120 1454.92 1367.88 140 1643.25 1536.92 160 1891.74 1761.24
Impr. (%) 5.01 3.42 4.22 4.38 5.98 6.47 6.90
Conclusions
We have presented a novel strategy for task duplication in multiport messagepassing systems. The strategy is based on the analysis of the increase over the total program execution time, created possibly by duplicating a parent task of a task. Experimental results have shown that incorporating it into two wellknown scheduling algorithms either produces the same assignments as the original algorithms, or it produces better assignments. In addition, the costs, both in implementation effort and compile time, are very low. Future work includes a more complete investigation of the impact of varying the communication-tocomputation ratio. Experimental studies on various multiprocessor platforms are also needed.
Acknowledgment This work was supported in part by the Ministry of Education, Science and Culture under Grant No. 09780263, and by a Grant from the Artificial Intelligence Research Promotion Foundation under contract number 9AI252-9. We would like to thank the anonymous referees for their valuable comments.
References 1. B. Kruatrachue and T.G. Lewis, Grain Size Determination for Parallel Processing, IEEE Software, pp. 23-32, Jan., 1988. 2. J.Y. Colin and P. Chritienne, C.P.M. Scheduling with Small Communication Delays and Task Duplication, Operations Research, vol. 39, no. 4, pp. 680-684, July, 1991. 390 3. Y.C.Chung and S.Ranka, Application and Performance Analysis of a CompileTime Optimization Approach for List Scheduling Algorithms on DistributedMemory Multiprocessors, Proc. of Supercomputing’92, pp. 512-521, 1992.
396
Dingchao Li et al.
4. H.B. Chen, B. Shirazi, K. Kavi, and A.R. Hurson, Static Scheduling Using Linear Clustering with Task Duplication, Proc. of ISCA International Conference on Parallel and Distributed Computing and Systems, pp. 285-290, 1993. 390 5. J. Siddhiwala and L.F. Chao, Path-Based Task Replication for Scheduling with Communication Costs, Proc. of the 1995 International Conference on Parallel Processing, vol. II, pp. 186-190, 1995. 390 6. M. A. Palis, J. Liou and D. S. L. Wei, Task Clustering and Scheduling for Distributed Memory Parallel Architectures, IEEE Trans. on Parallel and Distributed Systems, vol. 7, no. 1, pp. 46-55, Jan. 1996. 7. S. Darbha and D.P. Agrawal, Optimal Scheduling Algorithm for DistributedMemory Machines, IEEE Trans. on Parallel and Distributed Systems, vol. 9, no. 1, pp. 87-95, Jan. 1998. 390 8. K.K. Kwok and I. Ahmad, Exploiting Duplication to Minimize the Execution Times of Parallel Programs on Message-Passing Systems, Proc. of the sixth IEEE Symposium on Parallel and Distributed Processing, pp. 426-433, 1994. 390 9. H.El-Rewini and T.G.Lewis, ”Scheduling Parallel Program Tasks onto Arbitrary Target Machines,” J. Parallel and Distributed Computing 9, pp. 138-153, 1990. 391, 393 10. G.C.Sin and E.A.Lee, ”A Compile-Time Scheduling Heuristic for InterconnectionConstrained Heterogeneous Processor Architectures,” IEEE Trans. on Parallel and Distributed Syst., vol. 4, no. 2, pp. 175-187, 1993. 391, 393
Evaluation of Process Migration for Parallel Heterogeneous Workstation Clusters M.A.R. Dantas Computer Science Department, University of Brasilia (UnB), Brasilia, 70910-900, Brazil [email protected]
Abstract. It is recognised that the availability of a large number of workstations connected through a network can represent an attractive option for many organisations to provide to application programmers an alternative environment for high-performance computing. The original specification of the Message-Passing Interface (MPI) standard was not designed as a comprehensive parallel programming environment and some researchers agree that the standard should be preserved as simple and clean as possible. Nevertheless, a software environment such as MPI should have somehow a scheduling mechanism for the effective submission of parallel applications on network of workstations. This paper presents the performance results and benefits of an alternative lightweight approach called Selective - MPI (S-MPI), which was designed to enhance the efficiency of the scheduling of applications on parallel workstation cluster environments.
1
Introduction
The standard message-passing interface known as MPI (Message Passing Interface) is a specification which was originally designed for writing applications and libraries for distributed memory environments. The advantages of a standard message-passing interface are portability, ease-of-use and providing a clearly defined set of routines to the vendors that they can design their implementations efficiently, or provide hardware and low-level system support.Computation on workstation clusters when using the MPI software environment might result in low performance, because there is no elaborate scheduling mechanism to efficiently submit processes to these environments. Consequently, although a cluster could have available resources (e.g. lightly-loaded workstations and memory), these parameters are not completely considered and the parallel execution might be inefficient.This is a potential problem interfering with the performance of parallel applications, which could be efficiently processed by other machines of the cluster. In this paper we consider the mpich [2] as a case study implementation of MPI. In addition, we are using ordinary Unix tools therefore concepts employed D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 397–400, 1998. c Springer-Verlag Berlin Heidelberg 1998
398
M.A.R. Dantas
in the proposed system can be extended for similar environments. An approach called Selective - MPI (S-MPI) [1], is an alternative design to provide MPI users with the same attractive features of the existing environment and covers some of the functions of a high-level task scheduler. Experimental results of demonstrator applications indicate that the proposed solution enhances successfully the performance of applications with a significant reduction on the elapsed-time. The present paper is organised as follows. In section 2, we outline some details of the design philosophy of the S-MPI. Experimental results are presented in section 3. Finally, section 4 presents conclusions.
2
S-MPI Architecture
Figure 1, illustrates the architecture of the S-MPI environment compared to the mpich implementation. Difference between the two environments are mainly the layer where the user interface sits and the introduction of a more elaborate scheduling mechanism.
USER INTERFACE SELECTIVE LOAD BALANCING
PROCGROUP FILE
OR
MACHINES.XXXX FILE
JOB SCHEDULER
TRANSCEND
FUNCTIONS
(TRANSparent Code gENeration Dispatch) RESOURCE MATCHING
USER INTERFACE
CLUSTER CONFIGURATION
CLUSTER CONFIGURATION
MPI-APPLICATION EXECUTION
MPI-APPLICATION EXECUTION
RESULTS
RESULTS
S-MPI
MPI
Fig. 1. Architecture differences between S-MPI and mpich.
The S-MPI approach covers functions of a high-level task scheduler monitoring workstations workload, automatically compiling (or recompiling) the parallel code, matching processes requirements with resources available and finally selecting dynamically an appropriate cluster configuration to execute parallel applications efficiently.
Evaluation of Process Migration
3
399
Experimental Results
Configurations considered in this paper use a distributed parallel configuration composed of heterogeneous clusters of SUNs and SGIs (Silicon Graphics) workstations, in both cases using the mpich implementation. These workstations were private machines which could have idle cycles (i.e. considering a specific threshold of the average CPU queue length load index of the workstations) and memory available. Operating systems and hardware characteristics of these workstations are illustrated by table 1. Table 1. General charateristics of the SUN and SGI clusters. CLUSTERS Machine Architecture Clock(MHz) Data/Instruction Cache (KB) Memory(MB) Operating System MPI version
SUN SUN4u - Ultra
143-167 64 32-250 SOLARIS 5.5.1
mpich 1.1
SGI IP22 133-250 32-1024 32-250 IRIX 6.2
mpich 1.1
We have implemented two demonstrator applications with different ratio between computation and communication, The first application, was a gaussian algorithm to solve a system of equations. The second algorithm implemented was a matrix multiplication, which is also an essential component in many applications. Our algorithm is implemented using a master-slave interaction. The improved performance trend on the S-MPI environment executing the matrix multiplication is presented in figure 2. This figure demonstrates that the S-MPI has a better performance than the MPI configurations. The SUN-MPI and SGI-MPI environments represent the use of the specific machines defined in the static machines.xxx files. The SUN/SGI-MPI configuration shows results of the use of the procgroup file. This static file was formed with the address of heterogeneous machines from both SUN and SGI clusters. Finally, the results of SUN/SGI-SMPI illustrated the use of the automatic cluster configuration approach of the S-MPI. Figure 3 shows the enhanced performance of the S-MPI approach compared to the MPI for solving a system of 1000 equations. This figure shows a downward trend of the elapsed-time for the S-MPI environment in comparison to the conventional SUN and SGI MPI environments.
4
Conclusions
In this paper we have evaluated the performance results and benefits of a lightweight model called Selective - MPI (S-MPI). The system co-exists
400
M.A.R. Dantas
ELAPSED-TIME (SEC)
800 n = 768(SUN-MPI) n = 768(SGI-MPI) n = 768(SUN/SGI-MPI) n = 768(SUN/SGI-SMPI)
700 600 500 400 300 200 100 0 2
3 4 5 6 7 NUMBER OF WORKSTATIONS
8
ELAPSED-TIME (SEC)
Fig. 2. Elapsed-time of the matrix multiplication on MPI and S-MPI environments. 200 180 160 140 120 100 80 60 n = 1000 (SUN-MPI) n = 1000 (SGI-MPI) 40 n = 1000 (SUN/SGI-SMPI) 20 0 2 3 4 5 6 7 NUMBER OF WORKSTATIONS
8
Fig. 3. Elapsed-time of the Gaussian algorithm on MPI and S-MPI environments. cleanly with the existing concepts of the MPI and preserves the original characteristics of the software environment. In addition, S-MPI provides application programmers with a more transparent and enhanced environment to submit MPI parallel applications without layers of additional complexities for heterogeneous workstation clusters. Moreover, the concept clearly has advantages providing users with the same friendly interface and ensuring that all available resources will be evaluated before configuring a cluster to run MPI applications.
References 1. M.A.R. Dantas and E.J. Zaluska. Efficient Scheduling of MPI Applications on Network of Workstations. Accepted Paper on Future Generation Computer Systems Journal, 1997. 398 2. William Gropp Patrick Bridges, Nathan Doss. Users’ Guide to mpich, a Portable Implementation of MPI. Argonne National Laboratory http://www.mcs.anl.gov, 1994. 397
Using Alternative Schedules for Fault Tolerance in Parallel Programs on a Network of Workstations Dibyendu Das Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, India [email protected]
1
Introduction
In this work a new method has been devised whereby a parallel application running on a cluster of networked workstations ( say, using PVM ) can adapt to machine faults or machine reclamation [1] without any program intervention in a transparent manner. Failures are considered fail-stop in nature rather than Byzantine ones [4]. Thus, when machines fail, they are effectively shut-out from the network dying an untimely death. Such machines do not contribute anything further to a computation process. For simplicity both failures as well as reclamation are referred to as faults henceforth. Alternative schedules are used for implementing this adaptive fault tolerance mechanism. For parallel programs representable by static weighted task graphs, this framework builds a number of schedules, statically at compile time, instead of a single one. Each such schedule has a different machine requirement i.e. schedules are built for 1 machine, 2 machines and so on till a maximum of M machines. The main aim of computing alternative schedules is to have possible schedules to one of which the system can fall back upon for fault recovery during run-time. For example, if a parallel program starts off with a 3-machine schedule and one of them faults during run-time the program adapts by switching to a 2-machine schedule computed beforehand. However, the entire program is not restarted from scratch. Recomputation of specific tasks and resending of messages if required are taken care of.
2
Alternative Schedule Generation Using Clustering with Cloning
Schedules in this work are generated using the concept of task clustering with cloning [2], [5] as cloning of tasks on multiple machines have a definite advantage as far as adapting to faults is concerned. The Duplication Scheduling Heuristic by Kruatrachue et al. [2], [3] has been used to generate alternative schedules with differing machine requirements like 1,2,3, . . ..
The author acknowledges financial assisstance in the form of K. S. Krishan Fellowship
D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 401–404, 1998. c Springer-Verlag Berlin Heidelberg 1998
402
3
Dibyendu Das
The Schedule Switching Strategy
The major problem at run-time is to facilitate a schedule switch once some machine/s is/are thrown out of the cluster due to fault. As the run-time system may need a new schedule to be invoked at a fault juncture, such a schedule should already exist on the machines. In addition, it is not known at compile time, which cluster of the new schedule will be invoked on which machine. Once the set of completed tasks on each live machine is known the schedule switch to an alternative schedule should take place in such a manner that the resultant program completes as quickly as possible. Also, the switching algorithm itself should be simple and fast. 3.1
Algorithm SchedSwitch
The algorithm for schedule switching is based on minimal weighted bipartite graph matching. The main idea is to shift from a currently executing thread ( cluster of a schedule executing on a machine ) to a virtual cluster of an alternative schedule such that the overall scheduling time is reduced as much as possible. In order to do this a complete weighted bipartite graph G = (V, E) is constructed where V = Mlive ∪ VCl . Mlive is the set of live machines (except the faulty ones) and VCl is the set of clusters of an alternative schedule to which a schedule switch can take place. The set of edges E = (v1 , v2 ), v1 ∈ Mlive , v2 ∈ VCl . Each edge is weighted, the weight being computed via the following strategy: The weight of the matching between cluster i in the pre-computed schedule with machine j is as follows. The not done tasks of cluster i are tried to be scheduled in the machine j as early as possible. While constructing this schedule, some of the tasks in cluster i may start late as compared to their start time in the original alternative schedule. The maximum of these delays among the tasks in cluster i is computed, and the edge e is annotated with this value. A minimum weighted bipartite matching algorithm is applied to G thus created to determine the best mapping of clusters to machines. 3.2
Static and Run-time Strategies for Schedule Switching
One of the major requirements for an adaptive program - be it for fault adaptation or machine reclamation, is that the adaptive mechanism must be reasonably fast. Keeping this in mind two strategies have been designed for this framework STSR ( Static Time Switch Resolution ) and RTSR ( Run Time Switch Resolution ). STSR is a static time approach whereby the possible schedule switches are computed statically given certain fault configurations - usually single machine faults. RTSR is a run-time approach whereby the schedule switching algorithm is applied during adaptation. This is a more general approach compared to STSR. STSR is a costly approach even if applied for only single machine failures as it is not known at what point a machine may fail and hence all possibilities are enumerated and the possible schedule switches computed. The complexity of the STSR strategy is O(Cµ V Cµ ((V )2 + Cµ2 logCµ )), where the maximum number of
Using Alternative Schedules for Fault Tolerance
403
clusters formed by a schedule S in the set of schedules CS generated is denoted as Cµ . Also a maximum of O(V ) nodes are present in the task graph. Though this method is compute-intensive it allows a very fast run-time adaptation. In addition, this algorithm parallelizable. This can be used to reduce its running time. A modified strategy called STSRmod has also been developed. In this modified strategy, even when Cµ is high all possible schedule switches need not be considered. This is based on the observation that we get identical machine to cluster matchings over certain intervals of time as the weights of the matchings change only on some discrete events, such as the completion of a task or the start of a new task on a machine. The complexity of STSRmod reduces to O(Cµ (V )3 ). In RTSR, the switching algorithm is applied at run-time instead of compile time. Hence, adaptation time is longer than in STSR. However, RTSR has a greater flexibility being able to handle a wider range of fault configurations. The complexity of RTSR is O((Cµ2 )logCµ + (V )2 ).
4
Experiment
The simulation experiment uses a parallel application structured as a FFT graph each node being of weight 2 units and each edge of 25 units. The bar graphs shown in Figure 1 are the scheduling times for the number of machines provided. Here, the number of maximum machines in use is 10 while the lowest is 2. The figure depicts how the total scheduling time Tsw ( including faults ) changes as machines fail one by one. The three plots shown are the plots of Tsw for machine failures with inter-machine failure time set to 10, 20 and 40 time units respectively. Tsw includes the retransmission and recomputation costs. The lowest plot shows the change in Tsw with inter-machine failure of 10 units, the middle one shows the change for inter-machine failure of 20 units and the topmost one for 40 units. The upper two plots do not extend in entirety because the program gets over ( i.e. finishes ) before the later failures can take place. The lowest Tsw plot for P = 10 is the value of the ultimate scheduling time if the 10-machine is initially run and a fault occurs after 10 time units forcing a switch to a 9-machine schedule. With the new 9-machine schedule if the program runs for another 10 units before faulting, the Tsw using the 8-machine schedule is plotted next. The other plots have similar connotations but with different inter-machine failure times.
5
Conclusion
In this work we have described the use of computing alternative schedules as a new framework, which can be used easily to adapt to fluctuations which may either cripple the running of a program or substantially reduce its performance on a network of workstations. This strategy, hitherto unused, can be effectively employed to compute a set of good schedules and used later at run-time without the overhead of generating a good schedule at run-time which uses fewer
404
Dibyendu Das
Scheduling Time 140
Inter-failure time = 40 units Inter-failure time = 20 units Inter-failure time = 10 units
120
100
80
60
40
20
P = 10
P=9
P=8
P=7 P=6 No. of Machines in use
P=4
P=2
Fig. 1. Performance for multiple single machine faults for a FFT task graph
machines. Thus, our strategy is a hybrid off-line approach coupled with on-line support for effective fault tolerance or machine eviction on a machine cluster without user intervention.
References 1. A. Acharya, G. Edjlali, and J. Saltz. The utility of exploiting idle workstations for parallel computation. In Proceedings of the ACM Sigmetrics Int’l Conf. on Measurement and Modeling of Computer Systems ’97, Seattle, June 1997. 401 2. B. Kruatrachue and T. Lewis. Grain size determination for parallel programs. IEEE Software, pages 23–32, Jan 1988. 401 3. H. El-Rewini, T. Lewis, and H. H. Ali. Task Scheduling for Parallel and Distributed Systems. Englewood Cliffs, NJ: Prentice-Hall, 1994. 401 4. G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 1994. 401 5. M. A. Palis, J. Liou, and D. S. L. Wei. Task Clustering and Scheduling for Distributed Memory Parallel Architectures. IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 1, pages 46–55, Jan 1996. 401
Dynamic and Randomized Load Distribution in Arbitrary Networks J.Gaber and B.Toursel L.I.F.L., Universit´e des Sciences et Technologies de Lille1 59655 Villeneuve d’Ascq cedex -France{gaber,toursel}@lifl.lifl.fr
Abstract. We present the analysis of a randomized load distribution algorithm that dynamically embed arbitrary trees in a distributed network with an arbitrary topology. We model a load distribution algorithm by an associeted Markov chain and we show that the randomized load distribution algorithm spreads any M-node tree in a distributed network +δ(ε)), where δ(ε) depends on the converwith N vertices with load O( M N gence rate of the associated Markov chain. Experimental results obtained by the implementation of these load distribution algorithms, to validate our framework, on the two-dimensional mesh of the massively parallel computer MasPar MP-1 with 16,384 processors, and on a network of workstations are also given.
1
Introduction
We model the load distribution problem that dynamically maintains evolving an arbitrary tree in a distributed network as an on-line embedding problem. These algorithms are dynamic in the sense that the tree may starts as one node and grows by dynamically spawning children. The nodes are incrementally embedded as they are spawned. Bhatt and Cai present in [1,2] a randomized algorithm that dynamically embeds an arbitrary M -node binary tree on a N processor binary hypercube with dilation O(log log N ) and load O(M/N + 1). Leighton and al. present in [3,4] two randomized algorithms for the hypercube and the butterfly. The first algorithm achieves, with high probability, a load O(M/N + log N ) and respectively dilation 1 for the hypercube and dilation 2 for the butterfly. The second algorithm achieves, for the hypecube, a dilation O(1) and load O(M/N + 1) with high probability. Kequin Li presents in [5] an optimal randomized algorithm that achieves dilation 1 and load O(M/N ) in linear arryas and rings. The algorithm concerns a model of random trees called the reproduction tree model [5]. The load distribution algorithm that we describe here work for every arbitrary tree (binary or not) on a general network. The analysis use mathematical tools derived from both the Markov chain theory and the numerical analysis of matrix iterative schemes. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 405–409, 1998. c Springer-Verlag Berlin Heidelberg 1998
406
2
J.Gaber and B.Toursel
The Problem
Consider the following needed notations. Let Pj the jth vertex of the host graph and k a given integer. We define, for each vertex Pj , the set V(Pj ) = {Pn(j,1) , Pn(j,2) , ..., Pn(j,k) }. This set defines a logical neighbourhood for Pj . n(j, δ) refers to the δ-neighbour of Pj . The terms logical neighbour denotes that an element of V(Pj ) is not necessarily closed to Pj in the distributede network. Consider the behavior of the following mapping algorithm. At any instant in time, any task allocated to some vertex u that does not have k children can spawn a new child task. The newly spawned children must be placed on vertices with satisfying the following conditions as proposed by the paradigm of Bhatt and Cai: (1) without foreknow how the tree will grow in the future, and (2) without accessing any global information and (3) once a task is placed on a particular vertex, it cannot be reallocated subsequently. Hence, the process migration is disallowed and the placement decision must be implemented within the network in a distributed manner, and locally without any global information. The mapping algorithm that we will describe is randomized and operates as follows. The children of any parent node v initially in a vertex Pj are randomly and uniformly placed on distinct vertices of the set V(Pj ). The probability that a vertex in V(Pj ) is chosen is 1/k. At the start, the root is placed on an initial vertex which can be fixed or randomly choosen. The randomness is crucial to the success of the algorithm as it will be seen in its analysis.
3
Analysis
As was mentionned, as each node is spawned as a child of a node v initially in Pj , it is assigned a random vertex in the set V(Pj ). The probability that any particular vertex is chosen is 1/k. As our process has the property that the choice of a vertex destination at any step depends only on the father’s node, i.e., depends only on the state − 1, we have thus a Markov’s chain whose state space is the set of the N vertices. We construct its transition probability matrix A = (aij )0≤i,j≤N where 1 if Pj ∈ V(Pi ) aij = k 0 otherwise A node distribution can be considered to be a probability distribution and the mapping of the newly spawned nodes is the computation of a one-step transition probabilities. Let p be an N vector where any entry p (i) is the proportion of objects in state i at time . We denote by p0 the initial probability distribution. At the state , we have p = p0 A , ∀ ≥ 1 We know that if the matrix A is regular, then the Markov chain has stationary transition probabilities since lim→∞ A exists. This means that for large values of , the probability of being in state i after transitions is approximatively equal to N1 (the ith entry
Dynamic and Randomized Load Distribution in Arbitrary Networks
407
of p) no matter what the initial state was. In other words, as a consequence of the asymptotic behavior of a such Markov process, the distribution converges to uniform distribution which allocates the same amount of nodes to each vertex for very large M . Let p be the probability distribution at step . We have p = p−1 A et p = lim p = →∞
1 1 1 N , N , ..., N
Let c(v) be the number of outcomes where vertex v is selected. We consider that the Markov chain reaches the stationary phase beginning from t(ε) (e.g., we choose the smallest t such that each entries of pt is equal to 1/N + ε where ε goes to 0). Denote by k () the number of spawned nodes at any step . We have M = ¯ k () , with ¯ ∈ [t, ∞[. =0 ¯
We have c(v) =
p (v)k () which become [6]
=0
we obtain the vector α =
t(ε)
(p (v) − p (v))k () +
=0
t(ε)
α(v)
M N
(p − p)k () . We need now to bound α. As stated by
=0
Cybenko in [7], if γ is subdominant modulus of A, then we have p 2 ≤ γ 2 p0 − p 2 and thus p − p ≤ γ p0 − p and p0 A − p
α =
t(ε)
k
=0
α ≤ p0 − p
()
(p − p) ≤
t(ε)
t(ε) =0
k
()
p − p ≤
t(ε)
k () γ p0 − p thus
=0
k γ
=0
Note that we have use the fact that the number of spawned nodes k () at any (kγ)t(ε)+1 − 1 step is at most k . We obtain α ≤ p0 − p kγ − 1 entry. Denote by δ(ε) = α ∞ Recall that . ∞ of a vector denotes its maximum the maximum entry of α. Since p0 − p = 1 − N1 , − N1 , ... , we have p0 − p ∞ = 1 − N1 and 1 1 − (kγ)t(ε)+1 (1) δ(ε) ≤ (1 − ) N 1 − kγ In view of (1), for small γ, we obtain a small value δ(ε). Recall that γ ∈]−1, 1]. For γ = k1 + a. We compute the Taylor series expansion of δ(ε) at k1 . This reveals that δ(ε) ≤ t(ε) + 1, where t(ε) is the rate of convergence of A (note that t(ε) is bounded). With the above analysis we obtain the following theorem.
408
J.Gaber and B.Toursel
Theorem 1. Given a randomized mapping algorithm of a process graph that is an arbitrary dynamic k-ary tree on any arbitrary topology. If the mapping’s stochastic matrix associated with the logical neighbourhood is regular then the number of nodes mapped on a single vertex of the network is O( M N + δ(ε)), where t(ε)+1 − 1 (kγ) , γ is the subdominant eigenvalue of the mapping’s δ(ε) = (1 − N1 ) kγ − 1 stochastic matrix, t(ε) the number of steps necessary to converge within ε, M is the number of nodes and N the number of vertices of the network.
4
Experimentations
We ran a parallel implementation of an algorithm wherein an arbitrary tree grows during the course of the computation (as in branch-and-bound serach, divideand-conquer or game tree evaluation), on a 128 × 128 mesh. The following table gives the experimental results in terms of the maximum load obtained when the randomized algorithm is (resp. is not) used to balance load (each line shows the result in terms of the maximum load obtained when we embed an arbitrary tree. The first column gives the total nodes number of this tree). Number load without load with ideal of nodes randomized randomized load algorithm algorithm 3123 93.00 1.17 1 55791 219.00 4.59 3.41 65135 4.00 4.00 3.98 99325 16.00 6.82 6.06 204383 64.00 13.43 12.47 731923 64.00 45.86 44.67 1640123 256.00 102.59 100.11 We ran, on a network of 8 stations with a multithreaded environement, a parallel implementation of an algorithm wherein an arbitrary binary tree grow during the course of the computation. The following table gives the effective load of each station obtained with and without the randomized algorithm. The value ToTal denotes the total nodes number of the embeded arbitrary tree.
5
Conclusion
Theorem 1 establishes that for a given mapping function, a simple randomized load distribution algorithm as described in section 2 maintains dynamically evolving an arbitrary tree on a general distributed network with a load O( M N + δ(ε)), where δ(ε) depends on the mapping function. This implies that we can easily compare mapping functions just by computing eigenvalues of the associated adjancy matrix.
Dynamic and Randomized Load Distribution in Arbitrary Networks
N=8 Num´ero load without load with Station randomized randomized algorithm algorithm 0 4171 16106 4012 16164 1 7302 17050 2 7289 16012 3 3605 16627 4 3412 16597 5 51722 20394 6 56659 19222 7 ToTal 138172.00
409
ideal load ToTal/N 17271.50 17271.50 17271.50 17271.50 17271.50 17271.50 17271.50 17271.50
Acknowledgment We are grateful to F.T.Leighton for the helpful comments and K.Li for helpful discussions. We are also grateful to F.Chung[8], S.Bhatt and G.Cybenko for providing us a helpful papers and suggestions.
References 1. S.N. Bhatt and J.-Y. Cai. Talk a walk, grow a tree. in Proc. 29th Annual IEEE Symposuim on Foundations of Computer, Science, IEEE CS, Washington, DC, pp. 469-478, 1988. 405 2. S.Bhatt and J-Y.Cai. Taking random walks to grow trees in hypercubes. Journal of the ACM, 40(3):741–764, July 1993. 405 3. F.T. Leighton, M.J. Newman, A.G. Ranade, and E.J. Schwabe. Dynamic tree embeddings in butterflies and hypercubes. Siam Journal on computing, 21(4):639– 654, August 1992. 405 4. F.T. Leighton. Introduction to parallel algorithms and architectures. Morgan Kauffmann Publishers., 1992. Traduit en Fran¸cais par P.Fraignaud et E.Fleury, International Thomson publishing France, 1995. 405 5. K. Li. Analysis of randomized load distribution for reproduction trees in linear arrays and rings. Proc. 11th Annual International Symposium on High Performance Computers, Winnipeg, Manitoba, Canada (10-12), July, 1997. 405 6. J. Gaber. Plongement et manipulations d’arbres dans les architectures distribu´ees. Th`ese LIFL, Janvier 1998. 407 7. G. Cybenko. Dynamic load balancing for distributed memory architecture. Journal of Parallel and Distributed Computing, 7:279–301, 1989. 407 8. Fan.R.K. Chung. Spectral Graph Theory. AMS., 1997. 409
Workshop 04 Automatic Parallelization and High-Performance Compilers Jean-Fran¸cois Collard Co-chairmen
Presentation This workshop deals with all topics concerning automatic parallelization techniques and the construction of parallel programs using high performance compilers. Topics of interest include the traditional fields of compiler technology, but also the interplay between compiler technology and communication libraries or run-time support. Of the 16 papers submitted to this workshop, 7 were accepted as regular papers, 2 as short papers, and 7 were rejected.
Organization The first session focuses on data placement and data access. In “Data Distribution at Run-Time: Re-Using Execution Plans,” Beckmann and Kelly show how data placement optimization techniques can be made efficiently available in runtime systems by a mixed compile- and run-time technique. On the contrary, the approach by Kandemir et al. in “Enhancing Spatial Locality using Data Layout Optimizations” to improve cache performance in uni- and multi-processor systems is purely static. They propose an array restructuring framework based on a combination of hyper-plane theory and reuse vectors. However, when data structures are very irregular, such as meshes, the compiler alone can in general extract very little information. In “Parallelization of Unstructured Mesh Computations Using Data Structure Formalization,” Koppler introduces a small description language for mesh structures which allows him to propose a special-purpose parallelizer for the class of applications he tackles. It is worth noticing the wide spectrum of techniques, ranging from completely static methods to purely runtime ones, explored in this field. This definitely illustrates the difficulty of the problem, and the three papers mentioned above make significant contributions asserted by real-life case studies. The second session starts with “Parallel Constant Propagation,” where Knoop presents an extension to parallel programs of a classical optimization of sequential programs: constant propagation. Another classical sequential optimization, extended to parallel programs, is redundancy elimination [2]. It is well known, however, that redundancies can be an asset in the parallel setting. Eisenbiegler takes benefit of this property in his paper “Optimization of SIMD D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 411–412, 1998. c Springer-Verlag Berlin Heidelberg 1998
412
Jean-Fran¸cois Collard
Programs with Redundant Computations,” and reports very encouraging execution time improvements. Finally, in “Exploiting Coarse Grain Parallelism from FORTRAN by Mapping it to IF1,” Lachanas and Evripidou describe the parallelization of Fortran programs through conversion to single assignment. This work is also interesting for its smart use of two separately available tools: the front-end of Parafrase 2 and the back-end of the SISAL compiler. In the third session, Feautrier presents in “A Parallelization Framework for Recursive Tree Programs” a novel framework to analyze dependencies in programs with recursive data. It is most noteworthy that a topically related paper has recently been published elsewhere [3], illustrating that the analysis of programs with recursive structures currently is a matter of great interest. How to extend the static scheduling techniques crafted by the author [1] to this framework is an exciting issue. Like scheduling, tiling is a well-known technique to express the parallelism in programs at compile-time. In their paper “Optimal Orthogonal Tiling,” Andonov, Rajopadhye and Yanev bring a new analytical solution to the problem of determining the tile size that minimizes the total execution time. A mixed compile- and run-time technique is addressed in the last paper. In “Enhancing the Performance of Autoscheduling in Distributed Shared Memory Multiprocessors,” Nikolopoulos, Polychronopoulos and Papatheodoro present a technique to enhance the performance of autoscheduling, a parallel program compilation and execution model that combines automatic extraction of parallelism, dynamic scheduling of parallel tasks, and dynamic program adaptability to the machine resources.
Conclusion When trying to gain retrospect on the papers presented in this workshop, one may notice that the borderline between compile-time and run-time is getting blurred in data dependence techniques, data placement, and the exploitation of parallelism. In other words, intensive research is being conducted to benefit from the best of both worlds so as to cope with real-size applications. This orientation is very encouraging and we can hope the end-user will soon benefit from the nice work proposed by the papers of this workshop.
References 1. P. Feautrier. Some efficient solutions to the affine scheduling problem, part I, one dimensional time. Int. J. of Parallel Programming, 21(5):313–348, October 1992. 412 2. J. Knoop, O. R¨ uthing, and B. Steffen. Optimal code motion: Theory and practice. ACM Transactions on Programming Languages and Systems, TOPLAS, 16:1117– 1155, 1994. 411 3. Y. A. Liu. Dependence analysis for recursive data. In Int. Conf. on Computer Languages, pages 206–215, Chicago, Illinois, May 1998. IEEE. 412
Data Distribution at Run-Time: Re-Using Execution Plans Olav Beckmann and Paul H. J. Kelly Department of Computing, Imperial College 180 Queen’s Gate, London SW7 2BZ, U.K. {ob3,phjk}@doc.ic.ac.uk Abstract. This paper shows how data placement optimisation techniques which are normally only found in optimising compilers can be made available efficiently in run-time systems. We study the example of a delayed evaluation, self-optimising (DESO) numerical library for a distributed-memory multicomputer. Delayed evaluation allows us to capture the control-flow of a user program from within the library at run-time, and to construct an optimised execution plan by propagating data placement constraints backwards through the DAG representing the computation to be performed. In loops, essentially identical DAGs are likely to recur. The main concern of this paper is recognising opportunities where an execution plan can be re-used. We have adapted both conventional parallelising compiler techniques and hardware dynamic branch prediction techniques in order to ensure that our run-time optimisations need not perform any more work than a parallelising compiler would have to do unless there is a prospect of better performance.
1
Introduction
Parallel libraries have two major advantages as a parallel programming model. Firstly, they are convenient because user programs simply call library operators which hide all aspects of parallelism internally. Secondly, there is ample evidence [6] that compiled programming models do not yet get close to purposebuilt libraries in terms of performance. It is therefore often worthwhile to invest in a highly optimised, machine-specific library of common numerical subroutines. Hitherto, the disadvantage with such libraries has been that opportunities for optimisation across library calls have been missed. In this paper, we study a delayed evaluation, self-optimising (DESO) vectormatrix library for a distributed-memory multicomputer. The idea of DESO is to delay actual execution of function calls for as long as possible. Evaluation is forced by the need to access array elements1 . Delayed evaluation then provides the opportunity at run-time to construct an execution plan which minimises redistribution by propagating data placement constraints backwards through the DAG representing the computation to be performed. We will study a detailed example in Section 2. 1
The most common reasons for accessing array elements are output and conditional tests. We will refer to statements where this happens as force points.
D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 413–421, 1998. c Springer-Verlag Berlin Heidelberg 1998
414
Olav Beckmann and Paul H. J. Kelly
Key issues. The main challenge in optimising at run-time is that the optimiser itself has to be very efficient. We achieve this by – Working from aggregate loop nests, which have been optimised in isolation and which are not re-optimised at run-time. It is precisely the point of the library approach that we have invested in an implementation of selected operators which have been pre-optimised offline. – Using a purely mathematical formulation for data distributions, which allows us to calculate, rather than search for optimal placements. We will not expand on this aspect of our methodology here. – Re-using execution plans for previously optimised DAGs. A value-numbering scheme is used to capture cases where this may be possible. The value numbers are used to index a cache of optimisation results, and we use a technique adapted from hardware dynamic branch prediction for deciding whether to further optimise DAGs we have previously encountered. Context and Structure of this Paper. This paper builds on related work in the field of automatic data placement [8], run-time parallelisation [4,9,10] and conventional compiler and architecture technology [1,5]. In our earlier paper [3] we described the basic idea of a lazy, self-optimising parallel vector-matrix library. In this current paper, we extend that work by presenting the techniques we use to avoid re-optimisation of previously encountered problems. Below, we begin in Section 2 by discussing two alternative strategies for run-time optimisation. Following that, Section 3 presents our techniques for avoiding re-optimisation where appropriate. Section 4 shows performance results and Section 5 concludes.
2
Issues in Run-Time Optimisation
This section discusses and compares two different basic strategies for performing run-time optimisation. We will refer to them as “Forward Propagation Only” and “Forward And Backward Propagation”. We use the conjugate gradient iterative algorithm for solving linear systems Ax − b = 0 to illustrate both strategies. The pseudocode for this algorithm can be found in Figure 2. We use the following terminology: n is the number of operator calls in a sequence, a the maximum arity of operators, m is the maximum number of different methods per operator. If we work with a fixed set of data placements, s is the number of different placements, and in DAGs, d refers to the degree of the shared node (see [8]) with maximum degree. Forward Propagation Only. This is the only strategy open to us if we perform run-time optimisation of a sequence of library operators under strict evaluation. The strategy is illustrated in Figure 1: we optimise the placement of each node based on information purely about its ancestors. The total optimisation time for a sequence of operator calls under this strategy has linear complexity in the number of operators. However, as we have illustrated in Figure 1, the price we pay for using such an algorithm is that it may give a significantly suboptimal answer. This problem is present even for trees, but it is much worse for DAGs since shared nodes are not taken into account properly.
Run-Time Interprocedural Data Placement Optimisation
0
Initialised Data
1p=r
A
r
x
A
r
b
x
p=r
A
r
p=r 2 q = A.p
x
p=r
p=r 3 q = A.p γ = q.p
r
p=r q = A.p γ = q.p 6 ρ = r.r α = γρ x = αp + x
r
x
p=r q = A.p
ρ = r.r
γ = q.p α=ρ/γ x = αp + x
q = A.p
A
A
415
x
p=r q = A.p γ = q.p
We now need to transpose either q or p in order to calculate γ. Suppose we choose p.
We now require p to be aligned with x, but we have just transposed it, so we have to transpose back. We have made optimisation very easy by deciding the placement of each new node generated based simply on the placement of its immediate ancestor nodes. However, this can result in sub-optimal placements.
Fig. 1. Run-time optimisation of the first iteration of the CG algorithm (see Figure 2) with Forward Propagation Only. Forward And Backward Propagation. Delayed evaluation gives us the opportunity to propagate placement constraint information backwards through a DAG since we accumulate a full DAG before we begin to optimise. This type of optimisation is much more complex than Forward Propagation Only. Mace [8] has shown it to be NP-complete for general DAGs, but presents algorithms with complexity O((m + s2 )n) for trees and with complexity O(n × sd+1 ) for a restricted class of DAG. The point to note here, though, is that Forward and Backward Propagation does give us the opportunity to find the optimal solution to a problem, provided we are prepared to spend the time required.
3
Re-Using Execution Plans
The previous section has shown how delayed evaluation gives us the opportunity to derive optimal execution plans, but potentially at not insignificant cost. In real programs, essentially identical DAGs often recur. In such situations, our delayed evaluation, run-time approach is set to suffer a significant performance disadvantage over compile-time techniques unless we can reuse the results of previous optimisations we have performed. This section shows how we can ensure that our optimiser does not have to carry out any more work than an optimising compiler would have to do, unless
416
Olav Beckmann and Paul H. J. Kelly
r (0) = b − Ax(0) for i = 1, . . . , imax T ρi−1 = r (i−1) r (i−1) if i=1 p(1) = r (0) else βi−1 = ρi−1 /ρi−2 p(i) = r (i−1) + β(i−1) p(i−1) endif q (i) = Ap(i) T αi = ρi−1 /p(i) q (i) x(i) = x(i−1) + αi p(i) r (i) = r (i−1) − αi q (i) check convergence end
A
r
x
p=q q = A.p
ρ = r.r
γ = q.p α=ρ/γ x = αp + x
Fig. 2. Left: Pseudocode for the conjugate gradient iterative algorithm. Right: Optimisation with Forward And Backward Propagation: we take account of the use of p in the update of x in deciding the correct placement for the earlier calculation of γ. there is the prospect of better performance than the compiler could deliver. We discuss first the problem of how to recognise a DAG, i.e. optimisation problem, which we have encountered before and then the issue of whether to optimise further, or, re-optimise, such a DAG. Recognising Opportunities for Reuse. The full optimisation problem is a large structure. To avoid having to traverse it for comparing with previously encountered DAGs, we derive a hashed “value number” [1,5] for each node. – Our value numbers have to encode data placements and placement constraints, not actual data values. For nodes which are already evaluated, we simply apply a hash function to the placement descriptor of that node. For nodes which are not yet evaluated, we have to apply a hash function to the placement constraints on that node. – The key observation is that by seeking to take account of all placement constraints on a node, we are in danger of deriving an algorithm for calculating value numbers which has the same O-complexity as Forward and Backward Propagation optimisation algorithms: each node in a DAG can potentially exert a placement constraint over every other node. – Our algorithm for calculating value numbers is therefore based on Forward Propagation Only: we calculate value numbers for unevaluated nodes by applying a hash function to the placement constraints deriving from their immediate ancestors.
Run-Time Interprocedural Data Placement Optimisation
417
– Since we do not store the “full DAG” information, we cannot easily detect hash conflicts. We return to this point shortly. When to Re-Use and When to Optimise. Because our value numbers are calculated on Forward Propagation Only information, we have to address the problem of how to handle those cases where nodes which have identical value numbers are used in a different context later on; in other words, how to to avoid the drawbacks of Forward Propagation Only optimisation. This is a branch-prediction problem, and we use a technique adapted from hardware dynamic branch prediction (see [7]) for predicting heuristically whether identical value numbers will result in identical future use of the corresponding node and hence identical optimisation problems. Caching Execution Plans. Value numbers and ‘dynamic branch prediction’ together provide us with a fairly reliable mechanism for recognising the fact that we have encountered a node in the same context before. Assuming that we optimised the placement of that node when we first encountered it, our task is then simply to re-use the placement which the optimiser derived. We do this by using a “cache” of optimised placements, which is indexed by value numbers. Each cache entry has a valid-tag which is set by our branch prediction mechanism. Competitive Optimisation. As we showed in Section 2, full optimisation based on Forward And Backward Propagation can be very expensive. Each time we invoke the optimiser on a DAG, we therefore only spend a limited time optimising that DAG. For a DAG which we encounter only once, this means that we only spend very little time trying to eliminate the worst redistributions. For DAGs which recur, our strategy is to gradually improve the execution plan used until our optimisation algorithm can find no further improvements. The final point we need to address is how to handle hash conflicts. The result of a hash conflict on a value number will be that we use a sub-optimal placement for a node which we had previously optimised. In order to detect this, our system has been instrumented to record the communication cost of executing a DAG under an optimised execution plan. An increase in this cost on a “re-use” of that plan indicates a hash conflict. Summary. We use the full optimisation information, i.e. Forward and Backward Propagation, to optimise. We obtain access to this information by delayed evaluation. We use a scheme based on Forward Propagation Only, with linear complexity in program length, to ensure that we re-use the results of previous optimisations.
4 Performance
In this Section, we show performance figures for our library on the Fujitsu AP3000 multicomputer here at Imperial College. As a benchmark we used the
Table 1. Time in milliseconds for 20 iterations of Conjugate Gradient, with a convergence test every 10 iterations, on the AP3000, with 300MHz UltraSparc-2 nodes (average figures, outlying points were omitted). N denotes timings without any optimisation, O timings with optimisation but no caching, and C timings with optimisation and caching of optimisation results. Memory shows time spent in malloc() and free(), Overhead the cost of maintaining data distribution descriptors and suspended library calls at run-time. Speedup is the speedup due to optimisation and execution plan re-use.

       P     N   Comp.  Memory  Overh.  Comms.   Opt.   Total  Speedup
  N    1   256   51.14    0.81    3.97    0.00   0.00   55.93     ·
  O    1   256   51.22    0.81    3.88    0.00   3.59   59.50    0.94
  C    1   256   51.13    0.79    3.25    0.00   0.30   55.47    1.01
  N    4   512   51.30    1.03    3.47   30.51   0.00   86.30     ·
  O    4   512   51.79    0.86    3.06   22.35   3.73   81.79    1.06
  C    4   512   51.69    0.94    2.45   21.94   0.31   77.32    1.12
  N    9   768   51.72    1.12    3.46   34.78   0.00   91.10     ·
  O    9   768   51.96    0.91    3.08   26.06   3.74   85.75    1.06
  C    9   768   51.90    0.98    2.46   25.91   0.31   81.55    1.12
  N   16  1024   52.05    1.03    3.48   45.37   0.00  101.93     ·
  O   16  1024   52.29    0.88    3.16   35.99   3.82   96.14    1.06
  C   16  1024   52.13    0.95    2.49   35.65   0.31   91.53    1.11
conjugate gradient iterative algorithm. The pseudo-code for the algorithm and the source code when implemented using our library were shown in Figure 2 and the timing data are in Table 1.
– Our optimiser avoids two out of three vector transpose operations per iteration. This cannot be seen from the data in Table 1; it was determined analytically and by tracing communication.
– Optimisation achieves a reduction in communication time of between 20% and 30%. We do not achieve more because a significant proportion of the communication in this algorithm is due to reduce-operations which are unaffected by our current optimisations.
– Run-time overhead and optimisation time are independent of the amount of parallelism used. We suspect that the slight difference is due to cache effects.
– The reason why run-time overhead is reduced by optimisation is that performing fewer redistributions also results in spending less time inspecting data placement descriptors. Caching achieves a further reduction in overhead; this is because it is cheaper to read placement descriptors from cache than to generate them by function calls.
– Without caching of optimisation results, we achieve an overall speedup of around 6%. On platforms which have less powerful processors than the 300MHz UltraSparc-2 nodes we use here, the cost of optimising afresh each time can easily outweigh the benefit of reduced communication.
– With caching of optimisation results, the time we spend optimising is negligible, and we achieve overall speedups of around 12%.
5 Conclusions
We have presented a technique for interprocedural data placement optimisation which exploits run-time control-flow information and is applicable in contexts where the calling program cannot be analysed statically. We present preliminary experimental evidence that the benefits can easily outweigh the run-time costs.

Related Work. There is a huge body of work on data mappings for regular problems, [2] is but one example. Our work relies on this in producing optimised implementations for library operators. However, the problem we seek to address in this paper is different — how to perform interprocedural optimisation over a sequence of such pre-optimised operators. Saltz et al. [10] address the basic problem of how to parallelise loops where the dependence structure is not known statically. Loops are translated into an inspector loop which determines the dependencies at run-time and constructs a schedule, and an executor loop which carries out the calculations planned by the inspector. Saltz et al. discuss the possibility of reusing a previously constructed schedule, but rely on user annotations for doing so. Ponnusamy et al. [9] propose a simple conservative model which avoids the user having to indicate to the compiler when a schedule may be reused. Benkner et al. [4] describe the reuse of parallel schedules via explicit directives in HPF+: REUSE directives for indicating that the schedule computed for a certain loop can be reused and SCHEDULE variables which allow a schedule to be saved and reused in other code sections. Value numbering schemes were pioneered by Ershov [5], who proposed the use of “convention numbers” for denoting the results of computations and avoiding having to recompute them. More recent work on this subject has been done, e.g., by Alpern et al. [1].

Run-Time vs. Compile-Time Optimisation. The particular example studied in this paper would have been amenable to compile-time analysis. We refer back to Section 1 and the arguments of convenience and efficiency we outlined there for why we parallelise this application via a parallel library. Using a library then forces us to optimise at run-time. Further, we have shown that a run-time optimiser need not actually perform any more work than a compiler. To compare the quality of run-time and compile-time schedules, consider the loop below, assuming that there are no force-points inside the loop and that the loop is encountered a number of times, evaluation being forced after the loop each time.

    for(i = 0; i < N; ++i) { if ...
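The loop body is cut off in the text. Purely as an illustration, a completed version might look like the following C sketch, in which Vector, cond(), vec_scale(), vec_add() and force() are invented placeholders rather than the library's real interface; each iteration contributes one of two different placement constraints to the DAG, and evaluation is forced only after the loop.

    typedef struct Vector Vector;        /* opaque handle to a delayed operand   */
    extern Vector *x, *y;
    extern double alpha;
    extern int  cond(int i);             /* data-dependent predicate             */
    extern void vec_scale(Vector *v, double a);
    extern void vec_add(Vector *v, const Vector *w);
    extern void force(Vector *v);        /* force-point: evaluate the DAG        */

    void iterate(int N)
    {
        for (int i = 0; i < N; ++i) {
            if (cond(i))
                vec_scale(x, alpha);     /* one branch, one kind of constraint   */
            else
                vec_add(x, y);           /* the other branch, a different one    */
        }
        force(x);                        /* the optimiser sees the DAG for the
                                            control path actually taken          */
    }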
This loop can potentially have 2^N control paths. A compile-time optimiser would have to find one compromise execution plan for all invocations of the loop. With our approach, we optimise the actual DAG which is generated on each occasion. If the number of different DAGs is high, compile-time methods would probably have the edge over ours, since we cannot reuse execution plans. If, however, the number of different DAGs is small, our execution plans for the actual DAGs will be superior to compile-time compromise solutions, and by reusing them, we limit the time spent optimising.

Future work. The most exciting next step is to store cached execution plans persistently, so that they can be reused subsequently for this or similar applications. Although we can derive some benefit from exploiting run-time control-flow information, we have the opportunity to make run-time optimisation decisions based on run-time properties of data; we plan to extend this work to address sparse matrices shortly. The run-time system has to make on-the-fly data placement decisions. An intriguing question raised by this work is to compare this with an optimal off-line schedule.
Acknowledgements

This work was partially supported by the EPSRC, under the Futurespace and CRAMP projects (refs. GR/J 87015 and GR/J 99117). We thank the Imperial College Parallel Computing Centre for the use of their AP3000 machine.
References

1. B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equalities of variables in programs. In 15th Annual ACM Symposium on Principles of Programming Languages, pages 1–11, San Diego, California, Jan. 1988.
2. J. M. Anderson, S. P. Amarasinghe, and M. S. Lam. Data and computation transformations for multiprocessors. SIGPLAN Notices, 30(8):166–178, Aug. 1995.
3. O. Beckmann and P. H. J. Kelly. Runtime interprocedural data placement optimisation for lazy parallel libraries (extended abstract). In Lengauer et al., editor, Proceedings of Euro-Par ’97, Passau, Germany, number 1300 in LNCS, pages 306–309. Springer Verlag, Aug. 1997.
4. S. Benkner, P. Mehrotra, J. V. Rosendale, and H. Zima. High-level management of communication schedules in HPF-like languages. Technical Report TR-97-46, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681, USA, Sept. 1997.
5. A. P. Ershov. On programming of arithmetic operations. Communications of the ACM, 1(8):3–6, 1958. Three figures from this article are in CACM 1(9):16.
6. W. D. Gropp. Performance driven programming models. In MPPM’97, Proceedings of the 3rd International Working Conference on Massively Parallel Programming Models, London, U.K., Nov. 1997. To appear.
7. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, San Mateo, California, 1st edition, 1990.
8. M. E. Mace. Storage Patterns in Parallel Processing. Kluwer, 1987.
9. R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In Proceedings of Supercomputing ’93: Portland, Oregon, November 15–19, 1993, pages 361–370, New York, NY 10036, USA, Nov. 1993. ACM Press.
10. J. H. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603–612, May 1991.
Enhancing Spatial Locality via Data Layout Optimizations

M. Kandemir (1), A. Choudhary (2), J. Ramanujam (3), N. Shenoy (2), and P. Banerjee (2)

(1) EECS Dept., Syracuse University, Syracuse, NY 13244, [email protected]
(2) ECE Dept., Northwestern University, Evanston, IL 60208, {choudhar,nagaraj,banerjee}@ece.nwu.edu
(3) ECE Dept., Louisiana State University, Baton Rouge, LA 70803, [email protected]
Abstract. This paper aims to improve locality of references by suitably choosing array layouts. We use a new definition of spatial reuse vectors that takes into account memory layout of arrays. This capability creates two opportunities. First, it allows us to develop an array restructuring framework based on a combination of hyperplane theory and reuse vectors. Second, it allows us to observe the effect of different array layout optimizations on spatial reuse vectors. Since the iteration space based locality optimizations also change the spatial reuse vectors, our approach allows us to compare the iteration-space based and data-space based approaches in terms of their effects on spatial reuse vectors. We illustrate the effectiveness of our technique using an example from the BLAS library on the SGI Origin distributed shared-memory machine.
1 Introduction
In most computer systems, exploiting locality of reference is key to high levels of performance. It is well-known that caching of data whether it is private or shared improves memory latency as well as processor utilization. In fact, increasing the cache hit rates is one of the most important factors in reducing the average memory latency. Although cache hit rates in uniprocessors can be improved by optimizing the organization of the cache such as carefully selecting the cache size, line size and associativity, there are still tasks to be done from software. The programmers and compiler writers often attempt to modify the access patterns of a program so that the majority of accesses are made to the nearby memory. Several efforts have been aimed at iteration space transformations and scheduling techniques to improve locality [10,9,13]; these techniques improve data locality indirectly as a result of modifying the iteration space traversal order. In this paper, we focus on an alternative approach to the data locality optimization problem. Unlike traditional compiler techniques, we focus directly on the data space, and attempt to transform data layouts so that better locality is obtained. We present a data restructuring technique which can improve cache performance in uniprocessors as well as multiprocessor systems. Our technique is based on the concept of reuse vectors [9,13] and uses linear algebra techniques to help a compiler translate potential reuse in a given program into locality. In our
technique, we use data transformations expressed by linear full-rank transformation matrices. We also compare data transformations with linear loop transformations that can be expressed by non-singular transformation matrices. We are particularly interested in optimizing spatial locality, that in turn causes better utilization of the long cache lines which represents the current trend in cache architecture. This paper is organized as follows. In the next section, we outline the background and motivations for our work. Section 3 presents a memory layout representation based on hyperplane theory from linear algebra. In Section 4, we give a method to compute a reuse vector, for a given layout expressed by hyperplanes. We also illustrate how this new definition of reuse vector can be employed to guide data layout transformations. In Section 5, we present our preliminary results on the syr2k example from the BLAS library. In Section 6 we review the related work and conclude the paper in Section 7.
2 Background
We represent each iteration of a loop nest of depth n by a loop iteration vector I = (i1, i2, ..., in) where ik is the value of the kth loop from the outermost. Each reference to an m-dimensional array U in such a loop nest is assumed to be an affine function of the iteration represented by I, i.e., I is mapped onto data element A_U I + o_U. Here A_U is an m × n matrix called the access matrix and the m-vector o_U is called the offset vector [13,9]. Consider the following example.

Example 1:
    do i = li, ui
      do j = lj, uj
        U(j+1,i-1) = V(i-1,j+1) + W(j-1,i+j+1) + X(j) + Y(i)
      enddo
    enddo

Here, for array U, A_U = [0 1; 1 0] and o_U = (1, -1)^T. Two references to an array are said to belong to a uniformly generated reference (UGR) set if they have the same access matrix [3]. We say that reuse exists during the execution of a loop nest if the same data element or nearby data elements (e.g., the elements stored consecutively in a column for a column-major array) are accessed more than once. Given a loop nest, it is important to know where data reuse exists. A simple way of detecting the loops which carry reuse is to derive a set of vectors called reuse vectors [13,9]. The objective here is first to use vectors to represent directions in which reuse occurs, and then to transform the loop nest such that these vectors will have some desired properties, i.e., forms. Consider a reference with an access matrix A_U and offset vector o_U. Two iterations I and J reference the same data element if A_U I + o_U = A_U J + o_U ⇒ A_U (J − I) = 0 ⇒ A_U r_Ut = 0 ⇒ r_Ut ∈ Ker{A_U} where r_Ut = J − I. In this case we say that reference A_U exhibits temporal reuse; that is, the same
data element is used by more than one iteration; rU t is termed as temporal reuse vector and Ker{AU } is referred as temporal reuse vector space [13]. As an example, consider the reference X(j) in Example 1. The access matrix of this reference is (0, 1). Therefore, rX t ∈ Ker{(0, 1)}; that is, rX t ∈ span{(1, 0)T }. This means that there is a temporal reuse carried by the i loop. That is, for a fixed j, successive iterations of the i loop access the same element of array X. For concreteness, we say that (1, 0)T is the temporal reuse vector for that reference. In general, the loop corresponding to the first non-zero element in a reuse vector is said to carry the associated reuse. It should be noted that all references that belong to a single UGR set have the same access matrix; therefore, they have the same temporal reuse vector. The reuse vector for the reference Y (i) on the other hand is (0, 1)T . Intuitively, a reuse vector such as (0, 1)T means that the successive iterations of the innermost loop will access the same data element. In a sense, this is a desired reuse vector form; because, the reuse is exploited in the innermost loop where it is most useful. On the other hand, a reuse vector such as (1, 0)T implies that the same data item is accessed in the successive iterations of the outer loop. If the trip count of the innermost loop is very large then it might be the case that between two uses the data item is flushed from cache. In that case we say that reuse has not converted into locality. The main objective of the compiler-based data reuse analysis can be cast as the problem of transforming programs such that as many reuses as possible will be converted into locality. In our terms, this means to transform programs such that the resulting codes will have desired reuse vector forms. In addition to temporal reuse, spatial reuse is very important in scientific codes. Assuming a column-major memory layout for arrays, a spatial reuse occurs when the same column of an array is accessed more than once. Consider Example 1 again. Assuming column-major memory layouts for all arrays, we note that successive iterations of j loop will access elements in the same column of array U . In that case, we say that array U has spatial reuse with respect to j loop. Similarly, successive values of i loop access elements of the same column of array V ; that is, array V has spatial reuse with respect to loop i. It should be noted that spatial reuse is defined with respect to a given memory layout, which is column-major in our case. The spatial reuse exhibited by array W however is slightly more complicated to analyze. For this array, elements of a given column c are accessed if and only if i + j = c − 1. In mathematical terms, in order for two iteration vectors I and J access elements residing in the same column, they should satisfy the condition AUs I = AUs J where AUs is the matrix AU (for an array U ) with the first row deleted [13]. From this condition, since AUs represents a linear mapping, AUs (J − I) = 0 ⇒ J − I ∈ Ker{AUs } ⇒ rU s ∈ Ker{AUs } where rU s = J − I. In this case, rU s is termed as the spatial reuse vector and Ker{AUs } is referred as spatial reuse vector space [13]. When there is no confusion, we use rU instead of rU s for the sake of clarity. The loop which corresponds to the first non-zero element of the generator vector of Ker{AUs } is said to carry spatial reuse. In Example 1, for array U , the spatial reuse is carried by j loop. In this
example, there exists a spatial reuse vector for each reference and these are rU = (0, 1)T , rV = (1, 0)T , and rW = (1, −1)T . Recall that all these reuse vectors are with respect to column-major memory layout. It is possible that a reference may have more than one spatial reuse vector. In that case, these individual reuse vectors constitute a matrix called spatial reuse matrix, which will be denoted in this paper by RU for an array U or for a reference to an array U when there is no confusion. Individual spatial reuse matrices from individual references to possibly different arrays constitute a combined reuse matrix [9]. The order of the vectors in the combined reuse matrix might be important. Like Li [9], we assume that they are prioritized from left to right according to the frequency of occurrence. An important attribute of a reuse vector is its height which is defined as the number of dimensions from the first non-zero entry to the last entry. One way of increasing cache locality is to introduce more leading zeros in the reuse vector, called height reduction [9]. In this context, the best possible spatial reuse vector is r = (0, 0, · · · , 0, 1)T meaning that the spatial reuse will be carried by the innermost loop in the nest. We emphasize the fact that temporal reuse can only be manipulated by iteration space (loop) transformations. Data space transformations cannot directly affect the temporal reuse exhibited by a reference. Since, in this paper we are interested only in data layout transformations, we consider only spatial reuse vectors. In addition, we only focus on self-spatial reuses [13].
3 Hyperplanes: An Abstraction for Array Layouts
This section reviews our work [6] on representing memory layouts of multidimensional arrays using hyperplanes. In an m-dimensional data space, a hyperplane can be defined as a set of tuples {(a1, a2, ..., am) | g1 a1 + g2 a2 + ... + gm am = c} where g1, g2, ..., gm are rational numbers called hyperplane coefficients and c is a rational number called hyperplane constant [12]. For convenience, we use a row vector g^T = (g1, g2, ..., gm) to denote a hyperplane family (with different values of c) whereas g corresponds to the column vector representation. In a two-dimensional data space, we can think of a hyperplane family as parallel lines for a fixed coefficient set and different values of c. An important property of the hyperplanes is that two data points (in our case, array elements) d1 and d2 belong to the same hyperplane g if

    g^T d1 = g^T d2.    (1)
For example, a hyperplane vector such as (0, 1)T indicates that two array elements belong to the same hyperplane as long as they have the same value for the column index (i.e. the second dimension); the value for the row index does not matter. We note that a hyperplane family can be used to partially define memory layout of an array. For example, in two-dimensional array case, a hyperplane family represented by (0, 1) indicates that the array in question is divided into
columns such that each column corresponds to a hyperplane with a different c value (hyperplane constant). The data elements that have the same c value (that is, the elements which make up a column) are stored in memory consecutively (ordered by their column indices). We assume in a broader sense that two data elements which lie along an hyperplane with a specific c value have spatial locality. Another way of saying this is that two elements d1 and d2 have spatial locality if they satisfy equation (1). Notice that this definition of spatial locality is coarse and does not hold in array boundaries; but it is suitable for our purposes. In higher dimensions, we need more than one hyperplane. We refer the interested reader to [6] for an in-depth discussion of hyperplane based layout representation. In the remainder of this paper, for the sake of simplicity we mainly focus on two-dimensional arrays. Our results easily extend to higher dimensional arrays.
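To make the membership test of equation (1) concrete, the following small C check (our own toy example, not from the paper) uses the hyperplane family g = (0, 1) of a two-dimensional array: two elements lie on the same hyperplane, i.e. in the same column, exactly when their second (column) indices agree.

    #include <stdio.h>

    /* returns 1 iff d1 and d2 lie on the same hyperplane of the family g */
    static int same_hyperplane(const int g[2], const int d1[2], const int d2[2])
    {
        return g[0]*d1[0] + g[1]*d1[1] == g[0]*d2[0] + g[1]*d2[1];
    }

    int main(void)
    {
        int g[2] = {0, 1};                 /* hyperplane family (0,1): columns   */
        int a[2] = {3, 5}, b[2] = {7, 5}, c[2] = {3, 6};
        printf("%d %d\n", same_hyperplane(g, a, b),   /* 1: same column          */
                          same_hyperplane(g, a, c));  /* 0: different columns    */
        return 0;
    }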
4 Reuse Vectors under Different Layouts

4.1 Determining Spatial Reuse Vectors
The definition of the spatial reuse vector presented earlier is based on the assumption that the memory layout for all arrays is column-major. In this section, we give a definition of spatial reuse vector under a given memory layout. We start with the following theorem. The reader is referred to [7] for the proofs of all the theorems in this paper.

Theorem 1. Let g represent a memory layout for an array U, and r_U the spatial reuse vector associated with a reference represented by the access matrix A_U. Then, the following equality holds between g, r_U and A_U:

    g^T A_U r_U = 0    (2)

If the memory layout of an array is represented by a matrix L (as in three- or higher dimensional cases), then equation (2) should be satisfied for each row of L. This theorem gives us the relation between memory layouts and spatial reuse vectors and is very important. Notice that for a given g^T, in general, from g^T A_U r_U = 0, the spatial reuse vector r_U can always be found as

    r_U ∈ Ker{g^T A_U}    (3)
Let us consider the following example assuming row-major memory layout for all arrays:

Example 2:
    do i = li, ui
      do j = lj, uj
        do k = lk, uk
          U(i+k,j+k) = V(i+j,i+k,j+k) + W(k,j,i)
        enddo
      enddo
    enddo
The access matrices for this loop nest are A_U = [1 0 1; 0 1 1], A_V = [1 1 0; 1 0 1; 0 1 1], and A_W = [0 0 1; 0 1 0; 1 0 0]. The row-major layout is represented by g^T = (1, 0) for 2-dimensional arrays and by L = [1 0 0; 0 1 0] for 3-dimensional arrays. Using relation (3), we find the following reuse matrices: R_U = [0 1; 1 0; 0 -1], R_V = (1, -1, -1)^T, and R_W = (1, 0, 0)^T. Most of the previous research has concentrated on optimizing r_U given a fixed memory layout. For instance, if an array is stored in column-major order in memory, i.e., g^T = (0, 1), from equation (2) g^T A_U r_U = 0 ⇒ A_Us r_U = 0; this is the definition of spatial reuse vector used previously [9,13].
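These reuse matrices can be checked mechanically. The short C program below (our own verification aid, not part of the paper) evaluates g^T A_U r for both columns of R_U and prints zero in each case, confirming relation (2) for array U; arrays V and W can be checked in the same way with the rows of L in place of g^T.

    #include <stdio.h>

    int main(void)
    {
        int g[2]      = {1, 0};                      /* row-major, 2-D array    */
        int A_U[2][3] = { {1, 0, 1}, {0, 1, 1} };    /* access matrix of U      */
        int R_U[3][2] = { {0, 1}, {1, 0}, {0, -1} }; /* candidate reuse vectors */

        for (int c = 0; c < 2; c++) {                /* each column r of R_U    */
            int v = 0;
            for (int j = 0; j < 3; j++)              /* v = (g^T A_U) r         */
                for (int r = 0; r < 2; r++)
                    v += g[r] * A_U[r][j] * R_U[j][c];
            printf("g^T A_U r = %d\n", v);           /* prints 0 both times     */
        }
        return 0;
    }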
4.2 Reproducing the Effects of Iteration Space Transformations
In this subsection, we show how to reproduce the effect of a linear loop transformation by using linear data transformations instead. Li [9] shows that an iteration space transformation T transforms a reuse vector r_U into r_U' = T r_U. In this paper, we treat temporal locality as a special case of spatial locality. That is, accessing the same element can be considered accessing the same column (for a column-major layout).

Theorem 2. Given a ‘single’ loop nest, if a reference is optimized for spatial locality using an iteration space transformation matrix T, the same effect can be obtained by a corresponding data transformation matrix M.

Consider the following matrix-multiplication example assuming column-major layout:

Example 3:
    do i = li, ui
      do j = lj, uj
        do k = lk, uk
          C(i,j) = C(i,j) + A(i,k) * B(k,j)
        enddo
      enddo
    enddo

The spatial reuse matrices are R_C = [1 0; 0 0; 0 1], R_A = [1 0; 0 1; 0 0], and R_B = [1 0; 0 0; 0 1]. By taking into account the frequency of occurrence of each vector,
we can write a combined reuse matrix as R = [1 0 0; 0 0 1; 0 1 0]. Here the leftmost column has the highest priority whereas the rightmost column has the lowest. In this case, Li’s algorithm [9] finds the following matrix to optimize the locality: T = [0 1 0; 0 0 1; 1 0 0]. The new reuse matrix is R' = T R = [0 0 1; 0 1 0; 1 0 0]. Notice that the reuse vector of highest priority is optimized very well, i.e., its height is reduced to 1. The transformed program is as follows:

    do u = lu, uu
      do v = lv, uv
        do w = lw, uw
          C(w,u) = C(w,u) + A(w,v) * B(v,u)
        enddo
      enddo
    enddo

In the optimized program, the references C(w, u) and A(w, v) have spatial locality in the innermost loop (w-loop) whereas the reference B(v, u) has temporal locality in the innermost loop. Since, in this paper, we treat temporal locality in a loop as a special case of spatial locality, we conclude that in this program all three references have spatial locality in the innermost loop. Next we show how to obtain the effect of this transformation with data layout transformations. Notice that, after the transformation T, the final spatial reuse matrices for individual references are R_C' = [0 0; 0 1; 1 0], R_A' = [0 1; 0 0; 1 0], and R_B' = [0 0; 0 1; 1 0]. Now, from each reuse matrix, we choose the most frequently used reuse vector (considering all reuse vectors in all reuse matrices), and use it as a target reuse vector for the associated array. In this example, the target reuse vector happens to be (0, 0, 1)^T for all references.
For array C: g^T A_C r_C = 0 ⇒ (g1, g2) [1 0 0; 0 1 0] (0, 0, 1)^T = 0 ⇒ (g1, g2)^T ∈ Ker{(0, 0)}

For array A: g^T A_A r_A = 0 ⇒ (g1, g2) [1 0 0; 0 0 1] (0, 0, 1)^T = 0 ⇒ (g1, g2)^T ∈ Ker{(0, 1)}

For array B: g^T A_B r_B = 0 ⇒ (g1, g2) [0 0 1; 0 1 0] (0, 0, 1)^T = 0 ⇒ (g1, g2)^T ∈ Ker{(1, 0)}
Thus, the array C can have any layout (say row-major), the array A should be row-major, and the array B should be column-major. Notice that if we assign
these layouts, then we have spatial locality for all the references with respect to the innermost loop, a result that has also been obtained by the loop transformation T given earlier. From the previous discussion, we can conclude that given a single loop nest where each array has one UGR set [3], it is always possible to obtain the locality effect of a linear loop transformation by corresponding linear data transformations.
4.3 Optimizing for the Best Locality
So far, we have shown that under certain conditions, it is possible to reproduce the effect of a loop transformation by data transformations. By studying the effect of a loop transformation on a spatial reuse vector, we can find a corresponding data transformation which can generate the same effect. Now, we go one step further, and prove a stronger result.

Theorem 3. Given a loop nest where each array has one UGR set, then it is always possible to optimize the array layouts such that each reference will have spatial locality with respect to the innermost loop.

Consider the following example with a column-major layout for all arrays.

Example 4:
    do i = li, ui
      do j = lj, uj
        do k = lk, uk
          U(i,j,k) = U(i-1,j,k+1) + U(i,j,k-1) + V(j+1,k+1,i-1)
        enddo
      enddo
    enddo

Ignoring the temporal reuses, the spatial reuse matrices for individual arrays are R_U = (1, 0, 0)^T and R_V = (0, 1, 0)^T. Therefore, the combined reuse matrix is R = [1 0; 0 1; 0 0]. Notice that the first column occurs more frequently (three times to be exact). The best iteration space transformation from the locality point of view (using Li’s approach [9]) is T = [0 0 1; 0 1 0; 1 0 0], which would generate R' = T R = [0 0; 0 1; 1 0]. Unfortunately, due to the data dependence involving array U, this transformation is not legal. In search for a second best transformation, we have two options:

For the first option, T1 = [0 1 0; 1 0 0; 0 0 1], which gives R' = T1 R = [0 1; 1 0; 0 0]. In the transformed nest, the spatial locality for array U is exploited in the second
    for each array U ∈ U do
        compute the access matrix A_U representing U
        i = n
        while (p(i) = 1)
            i = i − 1
        end while
        r = e_i
        compute N = Ker{(A_U r)^T}
        use the elements of the basis set of N as rows of the layout matrix for U
    end for each
Fig. 1. Algorithm for optimizing spatial locality.
innermost loop (v-loop); and the spatial locality for array V is exploited in the outermost loop (u-loop).

For the second option, T2 = [1 0 0; 0 0 1; 0 1 0], which gives R' = T2 R = [1 0; 0 0; 0 1]. In the transformed nest, the spatial locality for array U is exploited in the outermost loop (u-loop); and the spatial locality for array V is exploited in the innermost loop (w-loop). We note that while T1 improves the spatial locality of U slightly, T2 improves the spatial locality for V. It is easy to see that data layout transformations can easily reproduce the effects of T1 and T2 (see [7]). However, neither T1 nor T2 (nor their data transformation counterparts) is able to optimize the spatial locality for both arrays in the innermost loop. The reason for the failure of the iteration space transformations is the data dependences which prevent the most desired loop permutation. We now show how a different data layout transformation can optimize the same nest. We use the best possible spatial reuse vector as the desired (or target) vector for both the arrays. In other words, r_U = r_V = (0, 0, 1)^T. For array U, g^T A_U r_U = 0 ⇒ (g1, g2, g3) [1 0 0; 0 1 0; 0 0 1] (0, 0, 1)^T = 0; therefore, (g1, g2, g3)^T ∈ Ker{(0, 0, 1)}, meaning that the array U should have a memory layout such that the last dimension is the fastest changing dimension. The row-major layout is such a memory layout, with the layout matrix L_U = [1 0 0; 0 1 0]. For array V, g^T A_V r_V = 0 ⇒ (g1, g2, g3) [0 1 0; 0 0 1; 1 0 0] (0, 0, 1)^T = 0; thus, (g1, g2, g3)^T ∈ Ker{(0, 1, 0)}, meaning that the memory layout of array V must be such that the second dimension is the fastest changing dimension. An example layout matrix which satisfies this is L_V = [1 0 0; 0 0 1].
In this case, for all the references the spatial reuse will be exploited in the innermost loop. This example clearly shows that the data transformations may be effective in some cases where the iteration space transformations fail.

[Figure 2: bar charts of execution time (sec.) for syr2k on 1, 4 and 8 processors, comparing the eight permutation-based layout combinations (c-c-c through r-r-r), the layout-optimized version (dopt) and the loop-optimized version (lopt).]

Fig. 2. Execution times (in sec.) of syr2k with different layouts on SGI Origin.
4.4 Optimizing for Shared-Memory Multiprocessors
In a shared-memory multiprocessor case, we should take into account the issues related to parallelism as well. As a rule, to prevent a common form of false sharing, a loop which carries spatial reuse should not be parallelized [9]. False sharing occurs when two processors access the same coherence unit (at least one of them writes) without sharing a data element [4]. We assume that the parallelism decisions are made by a previous pass in the compilation process, and the information for a loop nest is available to our algorithm in the form of a vector p, where p(i) (the ith element) is one if loop i is parallelized and zero otherwise. Our memory layout determination algorithm is given in Figure 1. In the figure, U is the set of all arrays referenced in the nest. The symbol n denotes the number of loops in the nest, and e_i is a vector with all zero entries except for the ith entry, which is 1. The algorithm assumes that there is no conflict (in terms of layout requirements) between references to a particular array. If there is a conflict, then a conflict resolution scheme should be applied [6]. For each array, a subset of all possible spatial reuse vectors starting with the best possible vector is tried. The sequence of trials corresponds to e_n, ..., e_2, e_1. A vector is rejected as target spatial reuse vector if the associated loop is parallelized. Otherwise e_i with the largest i is selected as target reuse vector. Once a spatial reuse vector is chosen, the remainder of the algorithm involves only computation of a null space and a set of basis vectors for it. Since optimizing compilers for shared-memory multiprocessors are quite successful in parallelizing the outermost loops, the algorithm terminates quickly; and the spatial locality in the innermost loops is exploited without incurring severe false sharing.
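A compact C rendering of the algorithm of Figure 1 is sketched below for the special case used throughout the paper, where the candidate target reuse vectors are the canonical vectors e_i. The names and the fixed-size integer matrices are our own simplifications, not the authors' implementation. Because r = e_i, the product A_U r is simply column i of A_U, and the rows of the layout matrix form an integer basis of the kernel of that single row vector.

    #include <stdio.h>
    #define MAXDIM 4

    /* Fill 'layout' with an integer basis of Ker{w^T}, where w = column 'col'
       of the access matrix A (m rows).  Returns the number of basis rows.      */
    static int kernel_of_column(const int A[MAXDIM][MAXDIM], int m, int col,
                                int layout[MAXDIM][MAXDIM])
    {
        int w[MAXDIM], pivot = -1, rows = 0;
        for (int r = 0; r < m; r++) {
            w[r] = A[r][col];
            if (pivot < 0 && w[r] != 0) pivot = r;
        }
        for (int j = 0; j < m; j++) {
            if (j == pivot) continue;
            for (int k = 0; k < m; k++) layout[rows][k] = 0;
            if (pivot < 0) layout[rows][j] = 1;   /* w = 0: every direction works */
            else {                                /* basis vector w[pivot]*e_j - w[j]*e_pivot */
                layout[rows][j]     = w[pivot];
                layout[rows][pivot] = -w[j];
            }
            rows++;
        }
        return rows;
    }

    int main(void)
    {
        /* array V of Example 4: V(j+1,k+1,i-1) in an (i,j,k) nest               */
        int A_V[MAXDIM][MAXDIM] = { {0,1,0}, {0,0,1}, {1,0,0} };
        int p[3] = {1, 0, 0};        /* assume only the outermost loop is parallel */
        int n = 3, m = 3;

        int i = n - 1;               /* try e_n, e_{n-1}, ... skipping parallel loops */
        while (i > 0 && p[i]) i--;

        int layout[MAXDIM][MAXDIM];
        int rows = kernel_of_column(A_V, m, i, layout);
        for (int r = 0; r < rows; r++) {          /* prints 1 0 0 and 0 0 1,     */
            for (int k = 0; k < m; k++)           /* i.e. the rows of L_V above  */
                printf("%2d ", layout[r][k]);
            printf("\n");
        }
        return 0;
    }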
5 Experimental Results
We conducted experiments on an SGI Origin, which is a distributed shared-memory machine that uses 195MHz R10000 processors, and has 32KB L1 data
cache and 4MB L2 unified cache. We used the syr2k code (in C) from BLAS to show that our data layout transformation framework can handle complicated access patterns. Assuming a column-major memory layout, all the arrays have poor locality. The version that is optimized by loop transformations (from [9], assuming column-major layouts) optimizes spatial locality for all the arrays, but has two drawbacks: First, the loop bounds and array subscript expressions are very complicated, and second, the temporal locality for array C in the original program has been converted to spatial locality. Our framework, on the other hand, decides suitable memory layouts. Figure 2 shows the performances for eight possible (permutation-based) layouts on an SGI Origin using 1024 × 1024 double matrices. The legend x-y-z means that the memory layouts for C, A and B are x, y and z respectively, where c means column-major and r means row-major. The layout optimized version –obtained by our framework– (marked dopt) and the loop optimized version (marked lopt) are also shown as the last two bars. Notice that our framework optimizes the performance substantially, and no dimension re-indexing (permutation-based layout transformation) can obtain this performance.
6 Related Work
The compiler work in optimizing loop nests for locality can be divided into two groups: (1) iteration space based optimizations and (2) data layout optimizations. In the first group, Wolf and Lam [13] present definitions of different types of reuses and propose an algorithm to maximize locality by evaluating a subset of legal loop transformations. In contrast, Li [9] uses the concept of ‘reuse distance’. His algorithm constructs a combined reuse matrix and tries to reduce its ‘height’ using techniques from linear algebra. McKinley et al. [10] offer a unified optimization technique consisting of loop permutation, loop fusion and loop distribution. Our work on locality is different from those mentioned. First, we focus on data space transformations instead of iteration space transformations. Second, we use a new definition of spatial reuse vector whereas the reuse vectors used in [9] and [13] are oriented for a uniform memory layout. In the second group, O’Boyle and Knijnenburg [11] focus on restructuring the code given a data transformation matrix and show its usefulness in optimizing spatial locality; in contrast, we concentrate more on the problem of determining suitable layouts by taking into account false sharing as well. Anderson et al. [1] propose a data transformation technique—comprised of permutations and stripmining—that restructures the data in the shared memory space such that the data for each processor are stored in nearby locations. In contrast, we focus on a larger space of data transformations. Cierniak and Li [2] and Kandemir et al. [5] propose optimization techniques that combine loop and data transformations in a unified framework, but restrict the transformation space. Even in the restricted space, they perform sort of exhaustive search. Jeremiassen and Eggers [4] use data transformations to reduce false sharing. Our framework also con-
siders reducing a common form of false sharing as well. Leung and Zahorjan [8] present an array restructuring framework to optimize locality in uniprocessors. Our work differs from theirs in several points: (1) our technique is based on explicit representation of memory layouts which is central to a unified loop and data transformation framework such as ours; (2) our technique finds optimal memory layouts in a single step rather than first determining a transformation matrix and then refining it for minimizing memory space using Fourier-Motzkin elimination; and (3) we explicitly take false sharing into account for multiprocessors.
7 Conclusions
In this paper, we have presented a definition of reuse vector under a given memory layout. For this, we have represented the memory layout of a multidimensional array using hyperplane families. Then we have presented a relation between layout representation and spatial reuse vector. We have shown that this relation can be exploited at least in two ways. For a given layout and reference, we can find the spatial reuse vector or for a given desired spatial reuse vector, we can determine the target layout. This second usage allows us to develop a data layout reorganization framework based on the existing compiler technology. Given a loop nest, our framework can optimize the memory layouts of arrays by considering the best possible reuse vectors.
References

1. J. Anderson, S. Amarasinghe, and M. Lam. Data and computation transformations for multiprocessors. Proc. 5th SIGPLAN Symp. Prin. & Prac. Para. Prog., Jul. 1995.
2. M. Cierniak and W. Li. Unifying data and control transformations for distributed shared memory machines. Proc. SIGPLAN ’95 Conf. Prog. Lang. Des. & Impl., Jun. 1995.
3. D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformations. J. Para. & Dist. Comp., 5:587–616, 1988.
4. T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. Proc. 5th SIGPLAN Symp. Prin. & Prac. Para. Prog., Jul. 1995.
5. M. Kandemir, J. Ramanujam, and A. Choudhary. Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines. Proc. 1997 International Conference on Parallel Architectures and Compilation Techniques, pp. 236–247, Nov. 1997.
6. M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A data layout optimization technique based on hyperplanes. TR CPDC-TR-97-04, Northwestern Univ.
7. M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Optimizing spatial locality in loop nests using linear algebra. Proc. 7th Workshop Compilers for Parallel Computers, 1998.
8. S.-T. Leung and J. Zahorjan. Optimizing data locality by array restructuring. Technical Report TR 95-09-01, CSE Dept., University of Washington, Sep. 1995.
9. W. Li. Compiling for NUMA parallel machines. Ph.D. dissertation, Cornell University, 1993.
10. K. McKinley, S. Carr, and C. W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 1996.
11. M. O'Boyle and P. Knijnenburg. Non-singular data transformations: Definition, validity, applications. Proc. 6th Workshop on Compilers for Parallel Computers, pp. 287–297, 1996.
12. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Para. & Dist. Sys., 2(4):472–482, Oct. 1991.
13. M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. ACM SIGPLAN 91 Conf. Programming Language Design and Implementation, pp. 30–44, June 1991.
Parallelization of Unstructured Mesh Computations Using Data Structure Formalization

Rainer Koppler

Department of Computer Graphics and Parallel Processing (GUP), Johannes Kepler University, A-4040 Linz, Austria/Europe
[email protected]
Abstract. This paper introduces a concept for semi-automatic parallelization of unstructured mesh computations called data structure formalization. Unlike existing concepts it does not expect knowledge about parallelism but just enough knowledge about the application semantics such that a formal description of the data structure implementation can be given. The parallelization tool Parlamat uses this description for deduction of additional information about arrays and loops such that more efficient parallelization can be achieved than with general tools. We give a brief overview of our data structure modelling language and first experiences with Parlamat’s capabilities by means of the translation of some real-size applications from Fortran 77 to HPF.
1 Introduction
Unstructured meshes are used in a large number of scientific codes, for example, in computational fluid dynamics and structural analysis codes. Unfortunately, these codes are hard to parallelize efficiently. Automatic parallelization cannot satisfy today’s expectations due to restrictions of compile-time techniques. On the other hand, languages such as High Performance Fortran (HPF) [9] require considerable skill and efforts from programmers, in particular with real-size unstructured mesh codes. In this paper we narrow the gap between the above two approaches using a novel concept called data structure formalization. Unlike existing semi-automatic approaches our concept does not confront users with parallelism but just expects knowledge about the sequential application such that the user can provide a brief and formal description of the mesh data structure. We show that such a description provides enough information in order to let a software tool perform automatic array distribution and detect parallelism in loops containing indirect array accesses. We implemented a tool with such capabilities that utilizes the results for translation of the Fortran 77 code to HPF. The tool’s back-end automatically inserts directives for array and loop distribution according to known HPF parallelization techniques. First experiences with the application of our concept to real-size codes gave encouraging results and justify development of special-purpose parallelization systems for particular application classes.
2 Parallelization of Mesh Computations
The main reason why today’s parallelizing compilers fail to produce efficient results is the complexity of unstructured mesh implementations. For example, finite element meshes consist of several arrays, where some are used as index arrays and others store application data. First, complexity inhibits automatic distribution of arrays in a block or cyclic fashion because each mesh partition yields a set of irregular array distributions. Run-time partitioning techniques [18] address this problem but they are expensive and require parallel loops to be known. Secondly, automatic detection of parallel loops in unstructured mesh codes is unfeasible using known techniques. Such codes are dominated by indirect array accesses, which prevent application of compile-time dependence tests. Consequently, most parallelizers leave loops containing indirect array accesses sequential. Some approaches allow speculative parallel execution [2] but require additional run-time analysis, which is costly in terms of run time and storage overhead. Therefore, researchers became aware of the fact that knowledge from the programmer must be specified to the compiler in order to achieve efficient parallelization. They designed extensions to sequential languages, where HPF is the most popular result. Further, several parallelization strategies for unstructured mesh codes using the family of HPF languages have been developed [5,16]. However, these strategies still keep programmers far away from automatic parallelization. It has hardly been considered that they have been applied only to small and simple kernels and manual application to large-scale codes is timeconsuming and error-prone [8]. Further, in most cases the HPF programmer and the code author are not the same person. The semi-automatic concept we describe in this paper provides a basis for automatic application of parallelization strategies for unstructured mesh codes. In our tool we use a parallelization strategy that makes use of facilities provided by HPF. We adopted basic ideas from existing work on this subject and defined a general framework for automatic application of these ideas and our own additions. Our strategy consists of code parallelization and mesh data preparation. Details on the two steps are beyond the scope of this paper and can be found elsewhere. In the first step directives for array and loop distribution are inserted [5,16]. The second step consists of mesh partitioning and renumbering [16,13].
3 Data Structure Formalization
As mentioned earlier automated compiler techniques cannot parallelize unstructured mesh computations efficiently due to lack of information that is accessible to a compiler. An obvious solution to this problem is specification of missing information to the compiler in a systematic way, for example, specification of the mesh data structure. McManus [13] considers systematic description of data structures impractical because in theory there are unlimited possibilities of implementing an unstructured mesh.
Although in theory there may be an unlimited variety of mesh implementations, in practice programmers prefer implementation variants that provide a good compromise between low storage overhead and simplicity and efficiency of algorithms. A comprehensive collection of different unstructured mesh codes showed us that in fact there exists one particular variant that is preferably used by Fortran 77 programmers. The main reasons for the dominance of this variant are legibility and efficiency of algorithms that are built up on it. A mesh implementation that follows this variant is shown in Fig. 1.
[Figure 1 shows a small triangular mesh with seven vertices and six triangles, together with its array implementation. The element array holds the three vertex numbers of each triangle, column by column:

    INTEGER elts (3, 6)
      element:   1  2  3  4  5  6
                 1  1  2  4  4  3
                 4  2  3  5  6  4
                 5  4  4  6  7  7

    REAL coords (2, 7)   ! x and y coordinates of vertices 1..7 ]
Fig. 1. Implementation of a triangular mesh
Thereupon we generalized the implemention variant for application to arbitrary mesh structures that can be embedded into the Entity-Relationship (ER) diagram shown in Fig. 2a. Further we developed a tiny modelling language that provides description of mesh structures and mapping of a mesh structure to variables of the application code. Whenever the data structure of an application can be described through this language, semantic knowledge about it can be specified to the compiler in machine-readable form. A remarkable property of our modelling language is that it does not confront the user with parallelism but just expects enough knowledge about the application semantics such that a ”formal documentation” of parts of the sequential code can be given. Hence, the language is accessible to a broad range of users, even to those who are inexperienced in parallelization or who do not want to get in touch with parallelism. The rest of this section gives a brief overview of our data structure modelling language. First we describe some basic concepts and terms, which emerged during generalization of the implementation variant mentioned above. Then we show applications of the language by means of different mesh structures taken from real applications and benchmarks. A mesh consists of a number of entities – so-called objects, which belong to different object types, for example, vertex, edge, triangle, or tetrahedron. Mesh applications define a hierarchy between object types, that is, objects may consist of several subobjects with lower dimensionality. For example, a hexahedron can be viewed as a composition of eight vertices, twelve edges, or six quadratic faces. Each application defines a particular hierarchy, which reflects a subdiagram of the ER diagram shown in Fig. 2a. Mesh applications associate object types with specific properties – so-called attributes. For example, fluid dynamics applications usually associate vertices
with density, velocity, and energy [14], or a solid mechanics application may associate hexahedrons with normal stresses and shear stresses [12]. We call application-specific properties data attributes. Besides those, each object type that refers to one or several subobjects has a structure attribute. For example, the structure attribute of the triangle type may be a triplet of vertex numbers. The modelling language covers two major issues: (1) definition of applicationspecific object types with their attributes, which yields a subdiagram of the ER diagram shown in Fig. 2a, and (2) mapping of attributes to array variables of the application code. Application-specific object types are defined through derivation of pre-defined base types, which are enclosed by rectangles in Fig. 2a. A definition specifies the name of the derived type, a base type, possibly a subobject type, and a list of data attributes. Figure 3a shows a simplified mesh specification taken from a fluid dynamics application and Fig. 2b depicts its corresponding ER diagram. The pre-defined type Vertex provides the pre-defined data attributes x and y for the description of vertex coordinates.
[Figure 2(a): the universal ER diagram with the pre-defined base types Mesh, Elem3D, Elem2D, Edge, Vertex (x, y) and Boundary, connected by "consists of" and "refers to" relations. Figure 2(b): the application-specific diagram, in which NSTri (area) inherits from Elem2D and NSVtx (x, y, density, ...) inherits from Vertex.]
Fig. 2. (a) Universal ER diagram (b) Application-specific ER diagram
Mapping of attributes to variables of the application code assumes a mesh implementation as it is shown in Fig. 1. Each attribute of object type T is implemented by an array, which is indexed by numbers of T objects. Figure 3b shows how attributes of the mesh specified in Fig. 3a can be mapped to the arrays given in Fig. 1.
(a) object NSVtx is Vertex with density, velocity, energy.
    object NSTri is Elem2D(NSVtx) with area.

(b) NSVtx(i).x -> coords(1,i).
    NSVtx(i).y -> coords(2,i).
    NSVtx(i).density -> d(i).
    ...
    NSTri(i) -> elts(:, i).
    NSTri(i).area -> a(i).
Fig. 3. Specification of a triangular mesh: (a) structure (b) attribute mapping
Some applications implement special treatment of objects that lie on the boundary of a mesh. Such objects may require additional attributes and may be collected in additional object lists. Our modelling language allows definition of boundary object types through derivation of the pre-defined base type Boundary. Figure 4a shows an example of a boundary vertex type. The array bndlst stores indices of all NSVtx objects that belong to the boundary vertex type SlippingVtx. Finally the language also provides support for applications that use hybrid meshes. The number of subobjects of a hybrid object type is not constant but must be specified for each object separately, for example, using an additional array or array section. Figure 4b shows the definition and mapping of a hybrid object type we found in a Poisson solver. The code uses mixed meshes consisting of tetrahedra and hexahedra.
(a) object SlippingVtx is Boundary(NSVtx) with info.

    SlippingVtx(i)      -> bndlst(i).   ! bndlst contains NSVtx indices
    SlippingVtx(i).info -> bndinf(i).

(b) object FEMSolid is Elem3D(FEMNode).
    object FEMNode is Vertex.

    FEMSolid(i).# -> elts(1,i).    ! 1st value = number of FEMNode indices
    FEMSolid(i)   -> elts(2:9,i).  ! room for up to eight FEMNode indices
Fig. 4. (a) Boundary object type (b) Hybrid object type
In this paper we describe our modelling language using simple examples. However, we have also studied mesh implementation strategies designed for multigrid methods [7] and dynamic mesh refinement [10] and observed that these strategies do not necessitate changes to the current language design.
4 The Parallelization System
Usually today’s parallelizers are applicable to every program written in a given language and their analysis capabilities are restricted to the language level. By way of contrast, very few parallelizers have been developed that focus on special classes of applications, for example, unstructured mesh computations. Such systems receive more knowledge about the application than general-purpose parallelizers, thus they require almost no user intervention and can produce more efficient results. Parlamat (Parallelization Tool for Mesh Applications) is a special-purpose parallelizer, which uses our modelling language for specification of knowledge about the implementation of mesh data structures. It consists of a compiler that translates Fortran 77 codes to HPF and a tool for automatic mesh data preparation. The compiler uses the specification for deduction of knowledge for array distribution and dependence analysis and for recognition of particular programming idioms.
4.1 Additional Array Semantics
After Parlamat has carried out basic control and data flow analysis, it performs a special analysis that determines additional semantics for arrays of the application code. It examines the use of arrays in loops over mesh objects, in particular, array indexing and assignments to INTEGER arrays, and extends symbol table information by the following properties:
– Index type: Arrays that are indexed using a loop variable receive the object type associated with the loop variable as index type. Other arrays, for example, coords in Fig. 3b, receive this type using the mesh specification.
– Contents type: Index arrays that store object indices, which can be detected through analysis of assignments, receive the type of these objects as contents type. Other arrays, for example, elts in Fig. 3b, receive this type using the mesh specification.
– Uniqueness: Index arrays that do not contain repeated values are called unique. This property is set using knowledge from the mesh specification or through analysis of index array initializations in the application code. For example, elts, which is the structure array of an Elem2D type, is not unique by definition, which can be seen from Fig. 1.
The first two properties play an important role for array distribution. Arrays that show the same index type become aligned to each other. The distribution of object types is determined by the mesh partition. The third property is used by the dependence test. Before special analysis starts, the mesh specification is evaluated in order to determine the above properties for arrays occurring in the specification. After this, the analysis proceeds in an interprocedural manner, where properties are propagated from actual to dummy arguments and vice versa.
4.2 Dependence Analysis
Dependence analysis [1] is essential in order to classify loops as sequential or parallel. For each array that is written at least once in a loop, a dependence test checks if two iterations refer to the same array element. Parlamat performs dependence analysis for loops over mesh objects. Analysis consists of a flow-sensitive scalar test, which detects privatizable scalars and scalar reductions, and a flow-insensitive test for index-typed arrays. Direct array accesses, as they appear in unstructured mesh codes, can be handled sufficiently by conventional tests. On the contrary, indirect array accesses must be passed on to a special test. The core of our test is a procedure that checks if two indirect accesses may refer to the same array element. The procedure performs a pairwise comparison of the dependence strings of both accesses. Loosely speaking, a dependence string describes the index chain of an indirect access as a sequence of contents-typed accesses that end up with the loop variable. Figure 5 shows two loops containing simple indirect accesses and comparisons of their dependence strings.
(a)  DO i = 1, eltnum                    (b)  DO i = 1, bndvtxnum
       v1 = elts(1, i)                          v = bndlst(i)    ! see Fig. 4a
       v2 = elts(2, i)                    S:    d(v) = ...
S1:    d(v1) = ...
S2:    ... = d(v2)

      [(d(v1); elts(1, i); i)]_i1              [(d(v); bndlst(i); i)]_i1
      [(d(v2); elts(2, i); i)]_i2              [(d(v); bndlst(i); i)]_i2

Fig. 5. Loop-carried dependences: (a) present, (b) absent
The notation [x]_i stands for “value of expression x in iteration i”. The procedure compares strings from the right to the left. In order to detect loop-carried dependences it assumes that i1 ≠ i2. Hence, in Fig. 5a [v1]_i1 = [v2]_i2 is true iff elts(1, i1) = elts(2, i2). At this point the procedure requires information about the uniqueness of elts. As mentioned above, elts is not unique by definition. Hence, the condition can be satisfied and the accesses d(v1) and d(v2) will overlap; for example, with respect to the mesh shown in Fig. 1 this occurs for i1 = 3 and i2 = 2. Unless both accesses appear in reduction operations, the true dependence S1 δ^t S2 disallows parallelization. Figure 5b shows how Parlamat can disprove dependence using uniqueness. The array bndlst is the structure array of a boundary object type and is unique by definition, as a list of boundary objects does not contain duplicates. Hence, the loop can be parallelized. With respect to the code from which the loop was taken, this is an important result, because the loop is enclosed by a time step loop and thus executed many times.
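The uniqueness-based core of the test can be pictured with a few lines of code; the representation of dependence strings and all names below are ours, not Parlamat's, and the comparison is deliberately conservative.

# Sketch of a uniqueness-based test for two indirect accesses in iterations
# i1 != i2.  A dependence string is modelled as a list of index-array accesses
# ending (implicitly) with the loop variable, e.g. [("elts", 1)] for elts(1, i).

def may_overlap(string1, string2, unique_arrays):
    if len(string1) != len(string2):
        return True                        # incomparable chains: stay conservative
    for (arr1, pos1), (arr2, pos2) in zip(string1, string2):
        if arr1 != arr2:
            return True                    # different index arrays: assume overlap
        if arr1 in unique_arrays and pos1 == pos2:
            # a unique index array yields different values in different
            # iterations, so the accessed elements must differ
            return False
    return True                            # dependence could not be disproved

print(may_overlap([("elts", 1)], [("elts", 2)], {"bndlst"}))      # True  (Fig. 5a)
print(may_overlap([("bndlst", 0)], [("bndlst", 0)], {"bndlst"}))  # False (Fig. 5b)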
5 Experiences
We consider special-purpose systems such as Parlamat primarily useful for parallelization of real-size application codes, where they help to save much work. On the other hand, even a special-purpose solution should be applicable to a broad range of applications. With this goal in mind, we based the development of our solution on a set of various unstructured mesh codes, which comprises real-size codes as well as benchmarks. In this paper we report on experiences with three codes taken from this set:
1. The benchmark euler [11] is the kernel of a 2D fluid dynamics computation. We mention it here because at the moment its HPF version is the only example of a parallel unstructured mesh code that is accepted by one of today's HPF compilers.
2. The structural mechanics package FEM3D [12] solves linear elasticity problems in 3D using meshes consisting of linear brick elements and an element-wise PCG method.
3. The fluid dynamics package NSC2KE [14], which has been used in several examples throughout this paper, solves the Euler and Navier-Stokes equations in 2D with consideration of turbulence. It uses meshes consisting of triangles.
We regard the codes FEM3D and NSC2KE as real-size applications for two reasons. First, they are self-contained packages that solve real-world problems. Secondly, they consist of 1100 and 3000 lines of code, respectively. Table 1 gives a brief characterization of the three codes. The lengths of the data structure specifications show that specifications – even if they belong to large applications – usually fit into one screen page.
Table 1. Characterization of three unstructured mesh applications

                data structure specification               DO loops
application     object  boundary      attribute  length    in         sequential       total
                types   object types  mappings   (lines)   parallel   I/O    others
euler           2       –             7          11        5 (3)      2      1         8 (3)
FEM3D           2       –             10         14        25 (5)     6      29        60 (5)
NSC2KE          3       6             38         56        66 (17)    15     22        103 (21)
The table also reports on parallelization of DO loops. It shows how many loops of each application can be marked as parallel through safe use of the INDEPENDENT directive, sometimes after minor automatic restructuring. Some of the remaining loops cannot be parallelized because they contain I/O statements that refer to one and the same file. Numbers given in parentheses refer to loops that are executed many times, for example, within a time step loop. The statistics indicate that the knowledge accessible to Parlamat helps to achieve a high number of parallel loops for all three applications. Furthermore, nearly all of the frequently executed loops can be parallelized, which should lead to good parallel speed-ups.

Our experiences with the HPF versions of the above three codes are very limited due to several restrictions of today's HPF compilers. In order to handle arbitrary parallel loops containing indirect array accesses, an HPF compiler must support the inspector-executor technique [15]. Our attempts to compile the HPF version of euler using ADAPTOR 4.0 [4] and pghpf 2.2 [17] gave evidence that these compilers do not support this technique. Thereupon we switched to a prerelease of the HPF+ compiler vfc [3]. The HPF+ version could be compiled with vfc and executed on a Meiko CS-2 without problems. On the other hand, the HPF+ versions of FEM3D and NSC2KE were not accepted by vfc due to preliminary compiler restrictions. In spite of problems with HPF compilers we could verify the correctness of the transformations applied to FEM3D and NSC2KE. Due to the nature of the language, HPF programs can also be compiled with a Fortran 9X compiler and executed on a sequential computer. We tested Parlamat's code transformation and mesh data preparation facilities thoroughly on a single-processor workstation through comparison of the results produced by the HPF version and the
original version. The differences between the results are hardly noticeable and stem from the fact that renumbering of mesh objects changes the order of arithmetic operations with real numbers, which causes different rounding.
6 Related Work
McManus [13] and Hascoet [8] investigated automation concepts for another parallelization strategy for unstructured mesh computations. Unlike the HPF strategy described here, they focus on execution of the original code as node code on a distributed-memory computer after some changes such as communication insertion or adjustment of array and loop bounds have been made. It is obvious that this strategy is less portable than HPF. McManus considers some steps of his five-step strategy automatable within the framework of the CAPTools environment. However, such a general-purpose system requires extensive knowledge from the user in order to achieve efficient results. Hascoet has designed a special-purpose tool that provides semi-automatic identification of program locations where communication operations must be inserted. The tool requests information including the loops to be parallelized, the association of loops with mesh entities, and a so-called overlap automaton.
7 Conclusion
In this paper we introduced a concept for semi-automatic parallelization of unstructured mesh computations. The core of this concept is a modelling language that allows users to describe the implementation of mesh data structures in a formal way. The language is simple and general enough to cover a broad variety of data structure implementations. We also showed that data structure specifications provide the potential for automating a strategy that translates Fortran 77 codes to HPF. In particular, we outlined their usefulness for efficient loop parallelization by means of enhanced dependence testing. Finally, we reported on application of our concept to a set of unstructured mesh codes, with two real-size applications among them. Our experiences show that data structure formalization yields significant savings of work and provides enough knowledge to transform applications with static meshes into highly parallel HPF equivalents. Hence, they justify and encourage the development of special-purpose parallelizers such as Parlamat. We believe that the potential of data structure formalization for code parallelization is far from being fully exploited. The current version of Parlamat utilizes semantic knowledge only for the most obvious transformations. In particular, applications using dynamic mesh refinement provide several opportunities for exploiting semantic knowledge for the recognition of coarse-grain parallelism. We regard the development of respective HPF parallelization strategies and concepts for automatic translation as interesting challenges for the future.
Acknowledgement. The author thanks Siegfried Benkner for access to the HPF+ compiler vfc.
References

1. Banerjee, U.: Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Boston (1988) 440
2. Blume, W., Eigenmann, R., Hoeflinger, J., Padua, D., Petersen, P., Rauchwerger, L., Tu, P.: Automatic Detection of Parallelism. IEEE Parallel and Distributed Technology 2 (1994) 37–47 436
3. Benkner, S., Pantano, M., Sanjari, K., Sipkova, V., Velkow, B., Wender, B.: VFC Compilation System for HPF+ – Release Notes 0.91. University of Vienna (1997) 442
4. Brandes, T.: ADAPTOR Programmer's Guide Version 4.0. GMD Technical Report, German National Research Center for Information Technology (1996) 442
5. Chapman, B., Zima, H., Mehrotra, P.: Extending HPF for Advanced Data-Parallel Applications. IEEE Parallel and Distributed Technology 2 (1994) 59–70 436
6. Cross, M., Ierotheou, C., Johnson, S., Leggett, P.: CAPTools – semiautomatic parallelisation of mesh based computational mechanics codes. Proc. HPCN '94 2 (1994) 241–246
7. Das, R., Mavriplis, D., Saltz, J., Gupta, S., Ponnusamy, R.: The Design and Implementation of a Parallel Unstructured Euler Solver Using Software Primitives. AIAA Journal 32 (1994) 489–496 439
8. Hascoet, L.: Automatic Placement of Communications in Mesh-Partitioning Parallelization. ACM SIGPLAN Notices 32 (1997) 136–144 436, 443
9. High-Performance Fortran Forum: High Performance Fortran Language Specification – Version 2.0. Technical Report, Rice University, TX (1997) 435
10. Kallinderis, Y., Vijayan, P.: Adaptive Refinement-Coarsening Scheme for Three-Dimensional Unstructured Meshes. AIAA Journal 31 (1993) 1140–1447 439
11. Mavriplis, D.J.: Three-Dimensional Multigrid for the Euler Equations. AIAA Paper 91-1549CP, American Institute of Aeronautics and Astronautics (1991) 824–831 441
12. Mathur, K.: Unstructured Three Dimensional Finite Element Simulations on Data Parallel Architectures. In: Mehrotra, P., Saltz, J., Voigt, R. (eds.): Unstructured Scientific Computation on Scalable Multiprocessors. MIT Press (1992) 65–79 438, 441
13. McManus, K.: A Strategy for Mapping Unstructured Mesh Computational Mechanics Programs onto Distributed Memory Parallel Architectures. Ph.D. thesis, University of Greenwich (1996) 436, 443
14. Mohammadi, B.: Fluid Dynamics Computations with NSC2KE – A User Guide. Technical Report RT-0164, INRIA (1994) 438, 441
15. Mirchandaney, R., Saltz, J., Smith, R., Nicol, D., Crowley, K.: Principles of runtime support for parallel processors. Proc. 1988 ACM International Conference on Supercomputing (1988) 140–152 442
16. Ponnusamy, R., Hwang, Y.-S., Das, R., Saltz, J., Choudhary, A., Fox, G.: Supporting Irregular Distributions Using Data-Parallel Languages. IEEE Parallel and Distributed Technology 3 (1995) 12–14 436
17. The Portland Group, Inc.: pghpf User's Guide. Wilsonville, OR (1992) 442
18. Ponnusamy, R., Saltz, J., Das, R., Koelbel, Ch., Choudhary, A.: Embedding Data Mappers with Distributed Memory Machine Compilers. ACM SIGPLAN Notices 28 (1993) 52–55 436
Parallel Constant Propagation

Jens Knoop

Universität Passau, D-94030 Passau, Germany
phone: ++49-851-509-3090, fax: ++49-...-3092
[email protected]
Abstract. Constant propagation (CP) is a powerful, practically relevant optimization of sequential programs. However, systematic adaptions to the parallel setting are still missing. In fact, because of the computational complexity paraphrased by the catch-phrase “state explosion problem”, the successful transfer of sequential techniques is currently restricted to bitvector-based optimizations, which because of their structural simplicity can be enhanced to parallel programs at almost no costs on the implementation and computation side. CP, however, is beyond this class. Here, we show how to enhance the framework underlying the transfer of bitvector problems, obtaining the basis for developing a powerful algorithm for parallel constant propagation (PCP). This algorithm can be implemented as easily and as efficiently as its sequential counterpart for simple constants computed by state-of-the-art sequential optimizers.
1 Motivation
In comparison to automatic parallelization (cf. [15,16]), optimization of parallel programs has so far attracted only little attention. An important reason may be that straightforward adaptions of sequential techniques typically fail (cf. [10]), and that the costs of rigorous, correct adaptions are unacceptably high because of the combinatorial explosion of the number of interleavings manifesting the possible executions of a parallel program. However, the example of unidirectional bitvector (UBv) problems (cf. [2]) shows that there are exceptions: UBv-problems can be solved as easily and as efficiently as for sequential programs (cf. [9]). This opened the way for the successful transfer of the classical bitvector-based optimizations to the parallel setting at almost no costs on the implementation and computation side. In [6] and [8] this has been demonstrated for partial dead-code elimination (cf. [7]) and partial redundancy elimination (cf. [11]). Constant propagation (CP) (cf. [4,3,12]), however, the application considered in this article, is beyond bitvector-based optimization. CP is a powerful and widely used optimization of sequential programs. It improves performance by replacing terms that can be determined at compile time to yield a unique constant value at run-time by that value. Enhancing the framework of [9], we develop in this article an algorithm for parallel constant propagation (PCP). Like the bitvector-based optimizations, it can be implemented as easily and as efficiently as its sequential counterpart for simple constants (SCs) (cf. [4,3,12]), which are computed by state-of-the-art optimizers of sequential programs.
The power of this algorithm is demonstrated in Figure 1. It detects that y has the value 15 before entering the parallel statement, and that this value is maintained by it. Similarly, it detects that b, c, and e have the constant values 3, 5, and 2 before and after leaving the parallel assignment. Moreover, it detects that independently of the interleaving sequence taken at run-time d and x are assigned the constant values 3 and 17 inside the parallel assignment. Together this allows our algorithm not only to detect the constancy of the right-hand side terms at the edges 48, 56, 57, 60, and 61 after the parallel statement, but also the constancy of the right-hand side terms at the edges 22, 37, and 45 inside the parallel statement. We do not know of any other algorithm achieving this.
Fig. 1. The power of parallel constant propagation.
Importantly, this result is obtained without any performance penalty in comparison to sequential CP. Fundamental for achieving this is to conceptually decompose PCP to behave differently on sequential and parallel program parts. Intuitively, on sequential program parts our PCP-algorithm coincides with its sequential counterpart for SCs, on parallel program parts it computes a safe
approximation of SCs, which is tailored with respect to side-conditions allowing us, as for unidirectional bitvector problems, to capture the phenomena of interference and synchronization of parallel components completely without having to consider interleavings at all. Summarizing, the central contributions of this article are as follows: (1) extending the framework of [9] to a specific class of non-bitvector problems; (2) developing on this basis a PCP-algorithm, which works for reducible and irreducible control flow, and is as efficient and can as easily be implemented as its sequential counterpart for SCs. An extended presentation can be found in [5].
2 Preliminaries
This section sketches our parallel setup, which has been presented in detail in [9]. We consider parallel imperative programs with shared memory and interleaving semantics. Parallelism is syntactically expressed by means of a par statement. As usual, we assume that there are neither jumps leading into a component of a parallel statement from outside nor vice versa. As shown in Figure 1, we represent a parallel program by an edge-labelled parallel flow graph G with node set N and edge set E. Moreover, we consider terms t ∈ T, which are inductively built from variables v ∈ V, constants c ∈ C, and operators op ∈ Op of arity r ≥ 1. Their semantics is induced by an interpretation I = (D′ ∪ {⊥, ⊤}, I0), where D′ denotes a non-empty data domain, ⊥ and ⊤ two new data not in D′, and I0 a function mapping every constant c ∈ C to a datum I0(c) ∈ D′, and every r-ary operator op ∈ Op to a total, strict function I0(op) : D^r → D, where D =df D′ ∪ {⊥, ⊤} (i.e., I0(op)(d1, ..., dr) = ⊥ whenever there is a j, 1 ≤ j ≤ r, with dj = ⊥). Σ = { σ | σ : V → D } denotes the set of states, and σ⊥ the distinct start state assigning ⊥ to all variables v ∈ V (this reflects that we do not assume anything about the context of a program being optimized). The semantics of terms t ∈ T is then given by the evaluation function E : T → (Σ → D). It is inductively defined by: ∀ t ∈ T ∀ σ ∈ Σ.

  E(t)(σ) =df  σ(x)                                  if t = x ∈ V
               I0(c)                                 if t = c ∈ C
               I0(op)(E(t1)(σ), ..., E(tr)(σ))       if t = op(t1, ..., tr)

In the following we assume D′ ⊆ T, i.e., the set of data D′ is identified with the set of constants C. We introduce the state transformation function θι : Σ → Σ, ι ≡ x := t, where θι(σ)(y) is defined by E(t)(σ), if y = x, and by σ(y) otherwise. Denoting the set of all parallel program paths, i.e., the set of interleaving sequences, from the unique start node s to a program point n by PP[s, n], the set of states Σn being possible at n is given by Σn =df { θp(σ⊥) | p ∈ PP[s, n] }. Here, θp denotes the straightforward extension of θ to parallel paths. Based on Σn we can now define the set Cn of all terms yielding a unique constant value at program point n ∈ N at run-time:

  Cn =df ⋃ { Cn^d | d ∈ D′ }
with Cn^d =df { t ∈ T | ∀ σ ∈ Σn. E(t)(σ) = d } for all d ∈ D′. Unfortunately, even for sequential programs Cn is in general not decidable (cf. [12]). Simple constants recalled next are an efficiently decidable subset of Cn, which is computed by state-of-the-art optimizers of sequential programs.
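For illustration, the evaluation function E over the flat domain D can be written down directly; the following lines are our sketch (not taken from the paper), and the treatment of ⊤ operands is an assumption of this sketch.

# Our sketch of the strict evaluation function E over D = D' ∪ {BOTTOM, TOP}.
# Terms are ("var", x), ("const", c) or ("op", f, t1, ..., tr); a state maps
# variable names to elements of D.

BOTTOM, TOP = "bottom", "top"

def evaluate(term, state):
    kind = term[0]
    if kind == "var":
        return state[term[1]]
    if kind == "const":
        return term[1]
    args = [evaluate(t, state) for t in term[2:]]
    if BOTTOM in args:                     # strictness in BOTTOM, as required
        return BOTTOM
    if TOP in args:                        # assumption made for this sketch
        return TOP
    return term[1](*args)

state = {"x": 2, "y": 3, "z": BOTTOM}
print(evaluate(("op", lambda a, b: a + b, ("var", "x"), ("var", "y")), state))  # 5
print(evaluate(("op", lambda a, b: a * b, ("var", "x"), ("var", "z")), state))  # bottom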
3 Simple Constants
Intuitively, a term is a (sequential) simple constant (SC) (cf. [4]) if it is a program constant or all its operands are simple constants (in Figure 2, a, b, c and d are SCs, whereas e, though it is a constant of value 5, and f, which in fact is not constant, are not). Data flow analysis provides the means for computing SCs.
Fig. 2. Simple constants.
Data Flow Analysis. In essence, data flow analysis (DFA) provides information about the program states which may occur at a specific program point at run-time. Theoretically well-founded are DFAs based on abstract interpretation (cf. [1]). Usually, the abstract semantics, which is tailored for a specific problem, is specified by a local semantic functional [[ ]] : E → (L → L) giving abstract meaning to every program statement (here: every edge e ∈ E of a sequential or parallel flow graph G) in terms of a transformation function on a complete lattice (L, ⊓, ⊑, ⊥, ⊤). Its elements express the DFA-information of interest. Local semantic functionals can easily be extended to capture finite paths. This is the key to the so-called meet over all paths (MOP) approach. It yields the intuitively desired solution of a DFA-problem as it directly mimics possible program executions: it “meets” all information belonging to a program path leading from s to a program point n ∈ N.

The MOP-Solution: ∀ l0 ∈ L. MOP_l0(n) = ⊓ { [[ p ]](l0) | p ∈ P[s, n] }

Unfortunately, this is in general not effective, which leads us to the maximal fixed point (MFP) approach. Intuitively, it approximates the greatest solution
of a system of equations imposing consistency constraints on an annotation of the program with DFA-information with respect to a start information l0 ∈ L:

  mfp(n) =  l0                                              if n = s
            ⊓ { [[ (m, n) ]](mfp(m)) | (m, n) ∈ E }         otherwise

The greatest solution of this equation system, denoted by mfp_l0, can effectively be computed if the semantic functions [[ e ]], e ∈ E, are monotonic and L is of finite height. The solution of the MFP-approach is then defined as follows:

The MFP-Solution: ∀ l0 ∈ L ∀ n ∈ N. MFP_l0(n) =df mfp_l0(n)

Central is now the following theorem relating both solutions. It gives sufficient conditions for the correctness and even precision of the MFP-solution (cf. [3,4]):

Theorem 1 (Safety and Coincidence).
1. Safety: MFP(n) ⊑ MOP(n), if all [[ e ]], e ∈ E, are monotonic.
2. Coincidence: MFP(n) = MOP(n), if all [[ e ]], e ∈ E, are distributive.

Computing Simple Constants. Considering D a flat lattice as illustrated in Figure 3(a), the set of states Σ together with the pointwise ordering is a complete lattice, too. The computation of SCs relies then on the local semantic functional [[ ]]^sc : E → (Σ → Σ) defined by [[ e ]]^sc =df θe for all e ∈ E, with respect to the start state σ⊥. Because the number of variables occurring in a program is finite, we have (cf. [3]):

Lemma 1. Σ is of finite height, and all functions [[ e ]]^sc, e ∈ E, are monotonic.

Hence, the MFP-solution MFP^sc_σ⊥ of the SC-problem can effectively be computed, inducing the formal definition of (sequential) simple constants:

  Cn^sc =df Cn^sc,mfp =df { t ∈ T | ∃ d ∈ D′. E(t)(MFP^sc_σ⊥(n)) = d }

Defining dually the set Cn^sc,mop =df { t ∈ T | ∃ d ∈ D′. E(t)(MOP^sc_σ⊥(n)) = d } induced by the MOP^sc_σ⊥-solution, we obtain by means of Theorem 1(1) (cf. [3]):

Theorem 2 (SC-Correctness). ∀ n ∈ N. Cn ⊇ Cn^sc,mop ⊇ Cn^sc,mfp = Cn^sc.
Fig. 3. a), b) Flat (data) lattices. c) Extended lattice.
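To make the MFP-computation of simple constants concrete, here is a small worklist-style sketch of our own; it follows the standard Kildall-style formulation with UNDEF ('not reached yet') and NAC ('no unique constant'), the latter corresponding to the role of ⊥ in the start state σ⊥, and is not the paper's implementation.

# Our sketch of computing simple constants by fixed point iteration.
# edges is a list of (src, dst, stmt); stmt = (target, rhs) with rhs being an
# int constant, a variable name, or ("+", operand, operand).

UNDEF, NAC = "undef", "nac"

def meet_val(a, b):
    if a == UNDEF: return b
    if b == UNDEF: return a
    return a if a == b else NAC

def meet(s1, s2):
    return {v: meet_val(s1[v], s2[v]) for v in s1}

def transfer(stmt, state):
    tgt, rhs = stmt
    new = dict(state)
    if isinstance(rhs, int):
        new[tgt] = rhs
    elif isinstance(rhs, str):
        new[tgt] = state[rhs]
    else:
        a, b = (state[x] if isinstance(x, str) else x for x in rhs[1:])
        new[tgt] = a + b if isinstance(a, int) and isinstance(b, int) else NAC
    return new

def nodes(edges):
    return {n for e in edges for n in e[:2]}

def mfp(edges, entry, variables, entry_state):
    states = {n: {v: UNDEF for v in variables} for n in nodes(edges)}
    states[entry] = entry_state
    changed = True
    while changed:
        changed = False
        for src, dst, stmt in edges:
            out = transfer(stmt, states[src])
            new = meet(states[dst], out)
            if new != states[dst]:
                states[dst], changed = new, True
    return states

# x := 2; y := x + 3  ->  y is the simple constant 5 afterwards
edges = [(1, 2, ("x", 2)), (2, 3, ("y", ("+", "x", 3)))]
print(mfp(edges, 1, ["x", "y"], {"x": NAC, "y": NAC})[3])   # {'x': 2, 'y': 5}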
4 Parallel Constant Propagation
In a parallel program, the validity of the SC-property for terms occurring in a parallel statement depends mutually on the validity of this property for other terms; verifying it cannot be separated from the interleaving sequences. The costs of investigating them explicitly, however, are prohibitive because of the well-known state explosion problem. We therefore introduce a safe approximation of SCs, called strong constants (StCs). Like the solution of a unidirectional bitvector (UBv) problem, they can be computed without having to consider interleavings at all. The computation of StCs is based on the local semantic functional [[ ]]^stc : E → (Σ → Σ) defined by [[ e ]]^stc(σ) =df θ^stc_e(σ) for all e ∈ E and σ ∈ Σ, where θ^stc_e denotes the state transformation function of Section 2, in which, however, the evaluation function E is replaced by the simpler E^stc, where E^stc(t)(σ) is defined by I0(c), if t = c ∈ C, and ⊥ otherwise. The set of strong constants at n, denoted by Cn^stc, is then defined analogously to the set of SCs at n. Denoting by F_D =df { Cst_d | d ∈ D } ∪ { Id_D } the set of constant functions Cst_d on D, d ∈ D, enlarged by the identity Id_D on D, we obtain:

Lemma 2. ∀ e ∈ E ∀ v ∈ V. [[ e ]]^stc|_v ∈ F_D \ {Cst_⊤}, where |_v denotes the restriction of [[ e ]]^stc to v.

Intuitively, Lemma 2 means that (1) all variables are “decoupled”: each variable in the domain of a state σ ∈ Σ can be considered like an (independent) slot of a bitvector, and (2) the transformation function relevant for a variable is either a constant function or the identity on D. This is quite similar to UBv-problems. There, for every bit (slot) of the bitvector the set of data is given by the flat lattice B of Boolean truth values as illustrated in Figure 3(b), and the transformation functions relevant for a slot are the two constant functions and the identity on B. However, there is also an important difference: F_D is not closed under pointwise meet. Consider the pointwise meet of Id_D and Cst_d for some d ∈ D; it is given by the “peak”-function p_d with p_d(x) = d, if x = d, and ⊥ otherwise. These two facts, together with an extension allowing us to mimic the effect of pointwise meet in a setting with only constant functions and the identity, are the key for extending the framework of [9], which focuses on the scenario of bitvector problems, to the scenario of StC-like problems.
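The closure problem behind Lemma 2 is easy to reproduce; the following lines are our illustration of F_D and of the peak function arising from a pointwise meet, with the flat meet on D written out explicitly.

# F_D = {Cst_d | d in D} ∪ {Id_D}; the pointwise meet of Id_D and Cst_d is the
# peak function p_d, which is no longer a member of F_D.

BOTTOM, TOP = "bottom", "top"

def flat_meet(a, b):              # meet in the flat data lattice D
    if a == TOP: return b
    if b == TOP: return a
    return a if a == b else BOTTOM

def Cst(d):                       # constant transfer function
    return lambda x: d

def Id(x):                        # identity transfer function
    return x

def pointwise_meet(f, g):
    return lambda x: flat_meet(f(x), g(x))

p3 = pointwise_meet(Id, Cst(3))   # behaves like the peak function p_3
print(p3(3))                      # 3:      p_d(x) = d  if x = d
print(p3(7))                      # bottom: p_d(x) = ⊥  otherwise
print(p3(BOTTOM))                 # bottom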
4.1 Parallel Data Flow Analysis
As in the sequential case, the abstract semantics of a parallel program is specified by a local semantic functional [[ ]] : E → (L → L) giving meaning to its statements in terms of functions on an (arbitrary) complete lattice L. In analogy to their sequential counterparts, they can be extended to capture finite parallel paths. Hence, the definition of the parallel variant of the MOP-approach, the PMOP-approach, and its solution is obvious:

The PMOP-Solution: ∀ l0 ∈ L. PMOP_l0(n) = ⊓ { [[ p ]](l0) | p ∈ PP[s, n] }
The point of this section now is to show that, as for UBv-problems, also for the structurally more complex StC-like problems the PMOP-solution can efficiently be computed by means of a fixed point computation.

StC-like problems. StC-like problems are characterized, as UBv-problems, by (the simplicity of) their local semantic functional [[ ]] : E → (ID → ID). It specifies the effect of an edge e on a single component of the problem under consideration (considering StCs, for each variable v ∈ V), where ID is a flat lattice of some underlying (finite) set enlarged by two new elements ⊥ and ⊤ (cf. Figure 3(a)), and where every function [[ e ]], e ∈ E, is an element of F_ID =df { Cst_d | d ∈ ID } ∪ { Id_ID }. As shown before, F_ID is not closed under pointwise meet. The possibly resulting peak-functions, however, are here essential in order to always be able to model the effect when control flows together. Fortunately, this can be mimicked in a setting where all semantic functions are constant ones (or the identity). To this end we extend ID to ID_X by inserting a second layer as shown in Figure 3(c), considering then the set of functions F_IDX =df (F_ID \ {Id_ID}) ∪ {Id_IDX} ∪ { Cst_dp | d ∈ ID }. The functions Cst_dp, d ∈ ID, are constant functions like their respective counterparts Cst_d; however, they play the role of the peak-functions p_d, and in fact, after the termination of the analysis, they will be interpreted as peak-functions. Intuitively, considering StCs and synchronization, the “doubling” of functions allows us to distinguish between components of a parallel statement assigning a variable under consideration a unique constant value not modified afterwards along all paths (Cst_d), and those doing this along some paths only (Cst_dp). While for the former components the variable is a total StC (or shorter: an StC) after their termination (as the property holds for all paths), it is a partial StC for the latter ones (as the property holds for some paths only). Keeping this separately in our data domain gives us the handle for treating synchronization and interference precisely. E.g., in the program of Figure 1, d is a total StC for the parallel component starting with edge 15, and a partial one for the component starting with edge 35. Hence, d is an StC of value 3 after termination of the complete parallel statement, as none of the parallel “relatives” destroys this property established by the left-most component: it is a partial StC of value 3, too, for the right-most component, and it is transparent for the middle one. On the other hand, d would not be an StC after the termination of the parallel statement if, say, d were a partial StC of value 5 of the right-most component. Obviously, all functions of F_IDX are distributive. Moreover, together with the orderings ⊑_seq and ⊑_par displayed in Figure 4, they form two complete lattices with least element Cst_⊥ and greatest element Cst_⊤, and least element Cst_⊥ and greatest element Id_IDX, respectively. Moreover, F_IDX is closed under function composition (for both orderings). Intuitively, the “meet” operation with respect to ⊑_par models the merge of information in “parallel” join nodes, i.e., end nodes of parallel statements, which requires that all their parallel components have terminated. Conversely, the “meet” operation modelling the merge of information at points where sequential control flows together, which actually would be given by the pointwise (indicated below by the index “pw”) meet-operation on
functions of F_ID, is here equivalently modelled by the “meet” operation with respect to ⊑_seq.

Fig. 4. The function lattices for “sequential” and “parallel join”.

In the following we drop this index; hence, ⊓ and ⊑ expand to ⊓_seq and ⊑_seq, respectively. Recalling Lemma 2 and identifying the functions Cst_dp, d ∈ ID, with the peak-functions p_d, we obtain for every sequential flow graph G and start information out of F:

Lemma 3. ∀ n ∈ N. MOP_pw(n) = MOP(n) = MFP(n)

Lemma 3 together with Lemma 4, whose variant for bitvector problems was considered in [9], are the key for the efficient computation of the effects of “interference” and “synchronization” for StC-like problems.

Lemma 4 (Main-Lemma). Let f_i : ID_X → ID_X, 1 ≤ i ≤ q, q ∈ IN, be functions of F_IDX. Then: ∃ k ∈ {1, ..., q}. f_q ∘ ... ∘ f_2 ∘ f_1 = f_k ∧ ∀ j ∈ {k + 1, ..., q}. f_j = Id_IDX.

Interference. As for UBv-problems, also for StC-like problems the relevance of Main Lemma 4 is that each possible interference at a program point n is due to a single statement in a parallel component, i.e., due to a statement whose execution can be interleaved with the statement of n's incoming edge(s), denoted as n's interleaving predecessors IntPred(n) ⊆ E. This is implied by the fact that for each e ∈ IntPred(n), there is a parallel path leading to n whose last step requires the execution of e. Together with the obvious existence of a path to n that does not require the execution of any statement of IntPred(n), this implies that the only effect of interference is “destruction”. In the bitvector situation considered in [9], this observation is boiled down to a predicate NonDestruct inducing the constant function “true” or “false”, indicating the presence or absence of an interleaving predecessor destroying the property under consideration. Similarly, we here define for every node n ∈ N the function

  InterferenceEff(n) =df ⊓ { [[ e ]] | e ∈ IntPred(n) ∧ [[ e ]] ≠ Id_ID }

These (precomputable) constant functions InterferenceEff(n), n ∈ N, suffice for modelling interference.
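Main Lemma 4 is what makes interference cheap to model; the following small piece of our own code represents the functions of F_IDX symbolically and shows that composing a whole sequence collapses to its last non-identity member.

# Symbolic sketch of functions from F_IDX: ID is the identity, ("cst", d) and
# ("cstp", d) stand for the total and partial constant functions Cst_d, Cst_dp.

ID = ("id",)

def compose(fs):
    """Effect of applying f1, ..., fq in this order: by Main Lemma 4 it equals
    the last function in the sequence that is not the identity."""
    result = ID
    for f in fs:
        if f != ID:
            result = f
    return result

print(compose([("cst", 3), ID, ("cstp", 5), ID, ID]))   # ('cstp', 5)
print(compose([ID, ID]))                                # ('id',)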
Synchronization. In order to leave a parallel statement, all parallel components are required to terminate. As for UBv-problems, the information required to model this effect can be hierarchically computed by an algorithm which only considers purely sequential programs. The central idea coincides with that of interprocedural DFA (cf. [14]): one needs to compute the effect of complete subgraphs, in this case of complete parallel components. This information is computed in an “innermost” fashion and then propagated to the next surrounding parallel statement. In essence, this three-step procedure A is a hierarchical adaptation of the functional version of the MFP-approach to the parallel setting. Here we only consider the second step, realizing the synchronization at end nodes of parallel statements, in more detail. This step can essentially be reduced to the case of parallel statements G with purely sequential components G1, ..., Gk. Thus, the global semantics [[[ Gi ]]]* of the component graphs Gi can be computed as in the sequential case. Afterwards, the global semantics [[[ G ]]]* of G is given by:

  [[[ G ]]]* = ⊓_par { [[[ Gi ]]]* | i ∈ {1, ..., k} }        (Synchronization)

Central for proving the correctness of this step, i.e., that [[[ G ]]]* coincides with the PMOP-solution, is again the fact that a single statement is responsible for the entire effect of a path. Thus, it is already given by the projection of this path onto the parallel component containing the vital statement. This is exploited in the synchronization step above. Formally, it can be proved by Main Lemma 4 together with Lemma 2 and Lemma 3, and the identification of the functions Cst_dp with the peak-functions p_d. The correctness of the complete hierarchical process then follows by a hierarchical coincidence theorem generalizing the one of [9]. Subsequently, the information on StCs on parallel program parts can be fed into the analysis for SCs on sequential program parts. In spirit, this follows the lines of [9]; however, the process is refined here in order to take the specific meaning of the peak-functions represented by Cst_dp into account.
4.2 The Application: Parallel Constant Propagation
We now present our algorithm for PCP. It is based on the semantic functional [[ ]]^pcp : E → (Σ → Σ), where [[ e ]]^pcp is defined by [[ e ]]^stc if e belongs to a parallel statement, and by [[ e ]]^sc otherwise. Thus, for sequential program parts [[ ]]^pcp coincides with the functional for SCs, for parallel program parts with the functional for StCs. The computation of parallel constants then proceeds essentially in three steps: (1) hierarchically computing (cf. procedure A) the semantics of parallel statements according to Section 4.1; (2) computing the data flow information valid at program points of sequential program parts; (3) propagating the information valid at entry nodes of parallel statements into their components. The first step has been considered in the previous section. The second step proceeds almost as in the sequential case because at this level parallel statements are considered “super-edges”, whose semantics ([[[ ]]]*) has been computed in the first step. The third step, finally, can be organized similarly to the corresponding step of the algorithm of [9]. In fact, the complete three-step procedure evolves as
an adaptation of this algorithm. Denoting the final annotation computed by the PCP-algorithm by Ann^pcp : N → Σ, the set of constants detected, called parallel simple constants (PSCs), is given by

  Cn^psc =df { t ∈ T | ∃ d ∈ D′. E(t)(Ann^pcp(n)) = d }

The correctness of this approach, whose power has been demonstrated in the example of Figure 1, is a consequence of Theorem 3:

Theorem 3 (PSC-Correctness). ∀ n ∈ N. Cn ⊇ Cn^psc.

Aggressive PCP. PCP can detect the constancy of complex terms inside parallel statements, as e.g. of y + b at edge 45 in Figure 1. In general, though not in this example, this is the source of second-order effects (cf. [13]). They can be captured by incrementally reanalyzing affected program parts, starting with the enclosing parallel statement. In effect, this leads to an even more powerful, aggressive variant of PCP. The same holds for extensions to subscripted variables: we concentrated here on scalar variables; however, along the lines of related sequential algorithms, our approach can be extended to subscripted variables, too.
5 Conclusions
Extending the framework of [9], we developed a PCP-algorithm, which can be implemented as easily and as efficiently as its sequential counterpart for SCs. The key for achieving this was to decompose the algorithm to behave differently on sequential and parallel program parts. On sequential parts this allows us to be as precise as the algorithm for SCs; on parallel parts it allows us to capture the phenomena of interference and synchronization without having to consider any interleaving. Together with the successful earlier transfer of bitvector-based optimizations to the parallel setting, complemented here by the successful transfer of CP, a problem beyond this class, we hope that compiler writers are encouraged to integrate classical sequential optimizations into parallel compilers.
References

1. P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conf. Rec. 4th Symp. Principles of Prog. Lang. (POPL'77), pages 238–252. ACM, NY, 1977. 448
2. M. S. Hecht. Flow Analysis of Computer Programs. Elsevier, North-Holland, 1977. 445
3. J. B. Kam and J. D. Ullman. Monotone data flow analysis frameworks. Acta Informatica, 7:309–317, 1977. 445, 449
4. G. A. Kildall. A unified approach to global program optimization. In Conf. Rec. 1st Symp. Principles of Prog. Lang. (POPL'73), pages 194–206. ACM, NY, 1973. 445, 448, 449
5. J. Knoop. Constant propagation in explicitly parallel programs. Technical report, Fak. f. Math. u. Inf., Univ. Passau, Germany, 1998. 447
6. J. Knoop. Eliminating partially dead code in explicitly parallel programs. TCS, 196(1-2):365–393, 1998. (Special issue devoted to Euro-Par'96). 445
7. J. Knoop, O. Rüthing, and B. Steffen. Partial dead code elimination. In Proc. ACM SIGPLAN Conf. on Prog. Lang. Design and Impl. (PLDI'94), volume 29,6 of ACM SIGPLAN Not., pages 147–158, 1994. 445
8. J. Knoop, B. Steffen, and J. Vollmer. Code motion for parallel programs. In Proc. of the Poster Session of the 6th Int. Conf. on Compiler Construction (CC'96), pages 81–88. TR LiTH-IDA-R-96-12, 1996. 445
9. J. Knoop, B. Steffen, and J. Vollmer. Parallelism for free: Efficient and optimal bitvector analyses for parallel programs. ACM Trans. Prog. Lang. Syst., 18(3):268–299, 1996. 445, 447, 450, 452, 453, 454
10. S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proc. Int. Conf. on Parallel Processing, Volume II, pages 105–113, 1990. 445
11. E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Comm. ACM, 22(2):96–103, 1979. 445
12. J. H. Reif and R. Lewis. Symbolic evaluation and the global value graph. In Conf. Rec. 4th Symp. Principles of Prog. Lang. (POPL'77), pages 104–118. ACM, NY, 1977. 445, 448
13. B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Conf. Rec. 15th Symp. Principles of Prog. Lang. (POPL'88), pages 2–27. ACM, NY, 1988. 454
14. M. Sharir and A. Pnueli. Two approaches to interprocedural data flow analysis. In S. S. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 7, pages 189–233. Prentice Hall, Englewood Cliffs, NJ, 1981. 453
15. M. Wolfe. High performance compilers for parallel computing. Addison-Wesley, NY, 1996. 445
16. H. Zima and B. Chapman. Supercompilers for parallel and vector computers. Addison-Wesley, NY, 1991. 445
Optimization of SIMD Programs with Redundant Computations

Jörn Eisenbiegler

Institut für Programmstrukturen und Datenorganisation, Universität Karlsruhe,
76128 Karlsruhe, Germany
[email protected]
Abstract. This article introduces a method for using redundant computations in automatic compilation and optimization of SIMD programs for distributed memory machines. This method is based on a generalized definition of parameterized data distributions, which allows an efficient and flexible use of redundancies. An example demonstrates the benefits of this optimization method in practice.
1 Introduction
Redundant computations, i.e. computations that are done on more than one processor, can improve the performance of parallel programs on distributed memory machines [2]. In the examples given in Section 5, the execution time is improved by nearly one magnitude. Optimization methods (e.g. [5,7]) which use redundant computations have two major disadvantages: first, the task graph of the program has to be investigated node per node; second, the communication costs have to be estimated without knowledge about the final data distribution. To avoid these disadvantages, other methods as for example proposed in [3,4,8] use parameterized data distributions. But the data distributions given there do not allow redundant computations in a more flexible way than “all or nothing”. Therefore, this article introduces a scheme for defining parameterized data distributions which allows the efficient and flexible use of redundant computations. Based on these definitions, this article describes a method for optimizing SIMD programs for distributed memory systems. This method is divided into three steps. As in [1], the first step uses heuristics (as described in [3,4,8]) to propose locally good data distributions. However, these distributions are not used on the level of loops but on the level of parallel assignments. In the second step, additional data distributions (which lead to redundant computations) are calculated which connect subsequent parallel assignments in a way such that no communication is needed between them. The last step selects a globally optimal proposition for each parallel assignment. In contrast to [1], an efficient algorithm is described instead of 0–1 integer programming. The benefits of this optimization method are shown by the example of numerically solving a differential equation.
2 Parameterized Data Distributions
In order to define parameterized data distributions, a definition for data distributions is required. Generally, a data element can exist on any number of processors and a processor can hold any number of data elements. Therefore, distributions have to be defined as relations between elements and processors:

Definition 1 (data distribution). A data distribution is a set D ⊆ N0 × N0 of tuples. (p, i) ∈ D iff element i is valid on processor p, i.e. element i has the correct value on processor p.

As an example, assume a machine with P = 4 processors and an array a with 15 elements. The set
D = { (0,0), (0,1), (0,2), (0,10), (0,11), (0,12), (0,13), (0,14), (1,1), (1,2), (1,3), (1,4), (1,5), (1,13), (1,14), (2,4), (2,5), (2,6), (2,7), (2,8), (3,7), (3,8), (3,9), (3,10), (3,11) }
is a data distribution where the elements 1, 2, 4, 5, 7, 8, 10, 11, 13, 14 are valid on exactly two out of four processors. For practical purposes in a compiler or for hand optimization of non-trivial programs, such data structures are too large. Therefore, parameterized data distributions are used. Each combination of parameters represents a certain data distribution. The correlation between the parameters and the data distributions they represent is the interpretation, i.e. an interpretation of a parameterized data distribution maps parameters to a data distribution as defined in Definition 1:

Definition 2 (parameterized data distribution). A parameterized data distribution space is a pair (S, I), where S is an arbitrary set and I : S → 2^(N0×N0) is the interpretation function. Every P ∈ S is a parameterized data distribution, i.e. (p, i) ∈ I(P) iff element i is valid on processor p.

Usually, the set S (the set of parameters) is a set of tuples whose elements describe the number of processors, block length etc. for each array dimension. In order to carry on with the example, regard the data distribution space (N0^3, I) with interpretation I:

  I(b, a, r) := { (p, i) | a + kbP + pb − r ≤ i < a + kbP + (p + 1)b + r, k ∈ Z }

The parameter b is the block size, the parameter a the alignment, and the parameter r describes redundancy. In this data distribution space, the data distribution D described above is parameterized with (3, −1, 1), because I(3, −1, 1) = D. Note that the correlation between processors and data elements is still relational, even if the interpretation I of the parameters is a function. Therefore Definition 2 is a generalization of former definitions (e.g. [3,8]).

The subset relation defines a semi-ordering for data distributions. This semi-ordering can be transferred to parameterized data distributions:

Definition 3 (semi-ordering). Let (S, I) be a parameterized data distribution space, P1, P2 ∈ S: P1 ≤ P2 iff I(P1) ⊆ I(P2).

Theorem 1. Let (S, I) be a parameterized data distribution space, P1, P2 ∈ S. There are no communication costs for changing the distribution of an array from P1 to P2 iff P1 ≥ P2.
Proof. Let P1 ≥ P2. Then every data element which is valid on a processor in P2 is also valid there in P1, since I(P1) ⊇ I(P2). Therefore, no data elements have to be transferred and no communication costs occur. If P1 ≱ P2, there is at least one data element valid on a processor in I(P2) which is not valid there in I(P1). This data element has to be transferred, generating communication costs.

The data distribution state of a program is described by the data distribution of each array in the program. This state is defined by a data distribution function, which maps arrays to parameterized data distributions.

Definition 4 (data distribution functions). Let ID be the name space of a program and (S, I) be a parameterized data distribution space. Then, a function f : ID → S is called a data distribution function. Let f1 and f2 be two data distribution functions. f1 ≤ f2 iff ∀ x ∈ ID : f1(x) ≤ f2(x).
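The interpretation I(b, a, r) can be evaluated mechanically. The following short script — our illustration, restricted to the 15-element array and P = 4 processors of the example above — reproduces the distribution D for the parameters (3, −1, 1).

# Check that I(3, -1, 1) equals the example distribution D (P = 4, 15 elements).

P, ELEMS = 4, range(15)

def interp(b, a, r):
    dist = set()
    for p in range(P):
        for i in ELEMS:
            # (p, i) in I(b, a, r) iff a + k*b*P + p*b - r <= i < a + k*b*P + (p+1)*b + r
            # for some integer k; a small k-range suffices for 15 elements.
            if any(a + k*b*P + p*b - r <= i < a + k*b*P + (p + 1)*b + r
                   for k in range(-2, 3)):
                dist.add((p, i))
    return dist

D = {(0,0),(0,1),(0,2),(0,10),(0,11),(0,12),(0,13),(0,14),
     (1,1),(1,2),(1,3),(1,4),(1,5),(1,13),(1,14),
     (2,4),(2,5),(2,6),(2,7),(2,8),
     (3,7),(3,8),(3,9),(3,10),(3,11)}

print(interp(3, -1, 1) == D)    # True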
3 Parallel Array Assignments
With the definitions given in Section 2, it is possible to distinguish between the input and output data distribution of a parallel assignment. As an example, regard the data distribution space described above and the parallel assignment

  par i in [1, n − 1]
    x[i] := x[i − 1] + x[i] + x[i + 1];

If the output data distribution of x is to be (3, −1, 0), the input data distribution for x must be at least (3, −1, 1). This distinction is not possible with former definitions of parameterized data distributions, since it is not possible to describe the input data distribution there. In order to optimize data distributions for arrays, the execution of an SIMD program can now be seen as a sequence of array assignments. Every assignment has an input and an output data distribution function. Between two assignments, the data distribution state has to be changed from the output data distribution of the first to a data distribution which is larger than or equal to the input data distribution of the second assignment. Note that there is no difference between the remapping of data and getting the necessary input for an assignment. In this article, the overlapping of communication and computation is not examined. However, this might be used for further optimization.
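The stencil example can be checked in the same spirit: if every processor computes exactly the interior elements it owns under (3, −1, 0), the elements it reads are covered by what it holds under (3, −1, 1). The script below is again our own illustration, restricted to the 15-element setting used before.

# With output distribution (3, -1, 0), computing x[i] := x[i-1] + x[i] + x[i+1]
# for the interior elements of the 15-element example array needs no elements
# beyond those provided by the input distribution (3, -1, 1).

P, N = 4, 15

def holds(b, a, r, p):
    return {i for i in range(N)
            if any(a + k*b*P + p*b - r <= i < a + k*b*P + (p + 1)*b + r
                   for k in range(-2, 3))}

for p in range(P):
    computed = holds(3, -1, 0, p) & set(range(1, N - 1))   # elements written by p
    needed = {j for i in computed for j in (i - 1, i, i + 1)}
    assert needed <= holds(3, -1, 1, p)
print("input (3, -1, 1) suffices for output (3, -1, 0)")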
4 Optimization

4.1 Proposition
In the first step of the optimization, heuristics propose locally good data distributions for each parallel assignment. This paper does not go into detail about searching locally good data distributions, since this problem is well investigated (e.g. [3,8]). More than one data distribution may be proposed for one parallel assignment, using more than one proposition method, in order to exploit the benefits of different heuristics.
The methods mentioned above only compute the output data distributions for the array written in the assignment. Therefore, the corresponding input and output data distribution functions have to be calculated. If there are no data elements of an array needed in a data distribution function, this array is declared as undefined. As the result of this step, each parallel assignment is annotated with several proposals of pairs of input and output data distribution functions.
4.2 Redundancy
In the second step of optimization, additional propositions are introduced which connect subsequent parallel assignments in a way such that no communication is necessary. Let a1 and a2 be two subsequent parallel assignments and (i1, o1) and (i2, o2) pairs of input and output distribution functions proposed for a1 and a2, respectively. Let n be the array written in a1, and let d1 ∪ d2 denote a parameterized data distribution which is larger than d1 and d2. First, (i1, o1) is set to the pair of distribution functions corresponding to the proposal i2(n) ∪ o2(n) for n in a1 (as above). Hence, no communication is needed to redistribute n from o1(n) to i2(n). Second, to avoid communication for other arrays, (i1*, o1*) is set to

  i1*(x) = { i1(x) ∪ i2(x)  if x ≠ n ;  i1(x)  if x = n }
  o1*(x) = { o1(x) ∪ i2(x)  if x ≠ n ;  o1(x)  if x = n }

As a result of this calculation, o1* is larger than i2, and therefore no communication is needed to redistribute from o1* to i2. However, in most cases some data elements have to be calculated redundantly on more than one processor for a1. The method can be applied to all combinations of proposals for a1 and a2. In particular, it can be applied to cost tuples for a2 which were themselves proposed this way. This leads to (potential) sequences in the program where no communication is needed. At the end of this step, again all parallel assignments are annotated with several pairs of input and output data distribution functions.
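A compact sketch of this combination step, with distribution functions modelled as dictionaries from array names to (b, a, r) parameters; the ∪ used here simply takes the larger redundancy under the assumption of equal block size and alignment, which is one possible upper bound and not the paper's general construction.

# Sketch of the redundancy step: derive (i1*, o1*) from (i1, o1) and i2 so that
# no communication is needed between the two assignments.

def union(d1, d2):
    if d1 is None: return d2
    if d2 is None: return d1
    (b, a, r1), (_, _, r2) = d1, d2        # assumption: equal block size/alignment
    return (b, a, max(r1, r2))

def combine(i1, o1, i2, written):
    arrays = set(i1) | set(o1) | set(i2)
    i1s = {x: (i1.get(x) if x == written else union(i1.get(x), i2.get(x)))
           for x in arrays}
    o1s = {x: (o1.get(x) if x == written else union(o1.get(x), i2.get(x)))
           for x in arrays}
    return i1s, o1s

# a1 writes x (its proposal already adapted to i2(x)); a2 additionally reads y.
i1, o1 = {"x": (3, -1, 2)}, {"x": (3, -1, 1)}
i2 = {"x": (3, -1, 1), "y": (3, 0, 0)}
print(combine(i1, o1, i2, written="x"))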
4.3 Configuration
The configuration step tries to select the globally optimal data distribution for each parallel assignment from the local proposals. If the program is not oblivious (see [5]), it cannot be guaranteed that the optimal selection is found, due to the lack of compile-time knowledge. In order to find an optimal solution for arbitrary programs, the configuration method by Moldenhauer [6] is adopted. The proposals for the parallel assignments are converted to cost tuples of the form (i, c, o), where i and o are the proposed input and output distribution functions and c is the cost for computing this parallel assignment with the proposed distributions. To find an optimal solution for the whole program, these sets of cost tuples are combined to sets of cost tuples for larger program blocks. Let a1 and a2 be two subsequent program blocks and M1 and M2 be the sets of cost tuples proposed or calculated for these blocks. If a1 is executed
with (i1, c1, o1) and a2 is executed with (i2, c2, o2), then a1; a2 is executed with (i, c, o) = (i1, c1, o1) ∘t (i2, c2, o2), where

  i(x) = i1(x)   if i1(x) or o1(x) is defined
         i2(x)   else, if i2(x) is defined

  c = c1 + c2 + redistribution costs(o1, i2)

  o(x) = o2(x)   if o2(x) is defined
         o1(x)   else

For further combination with other program blocks, the cost tuples with minimal costs for all combinations of input and output data distribution functions which can occur using the proposals are calculated, i.e. the set of cost tuples M := M1 ∘ M2 for the sequence a1; a2 is

  M := M1 ∘ M2 := { (i, c, o) | (i, c, o) is cost-minimal in { (i, c*, o) = (i1, c1, o1) ∘t (i2, c2, o2) | (ik, ck, ok) ∈ Mk } }

Rules for other control constructs can be given analogously and are omitted in this article. Using these rules, the set of cost tuples for the whole program can be calculated. In the set for the whole program, the element with minimal costs is selected. For oblivious programs it represents the optimal selection from the proposals. For non-oblivious programs, heuristics have to be used for the combination of the sets. To obtain the optimal proposals for each parallel assignment, the origin of the tuples has to be remembered during the combination step. This configuration problem is not in APX, i.e. no constant-factor approximation can be found. Accordingly, the algorithm described above is exponential. But in contrast to the solution proposed in [1], it is only exponential in the number of array variables simultaneously alive. Since this number is small in real applications, the method is feasible.
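The combination of cost tuples can be prototyped in a few lines; the cost model below is a mere stand-in, and keeping only cost-minimal tuples per (input, output) pair corresponds to the definition of M1 ∘ M2 above.

# Sketch of combining cost tuples (i, c, o) of two subsequent program blocks.

def redistribution_cost(o1, i2):
    # placeholder cost model: one unit per array whose required input
    # distribution is not already provided by o1
    return sum(1 for x, d in i2.items() if o1.get(x) != d)

def combine_tuple(t1, t2):
    (i1, c1, o1), (i2, c2, o2) = t1, t2
    i = {x: (i1.get(x) if (x in i1 or x in o1) else i2.get(x))
         for x in set(i1) | set(o1) | set(i2)}
    i = {x: d for x, d in i.items() if d is not None}
    o = {**o1, **o2}                                   # o2 wins where defined
    return (i, c1 + c2 + redistribution_cost(o1, i2), o)

def combine_sets(M1, M2):
    best = {}
    for t1 in M1:
        for t2 in M2:
            i, c, o = combine_tuple(t1, t2)
            key = (tuple(sorted(i.items())), tuple(sorted(o.items())))
            if key not in best or c < best[key][1]:
                best[key] = (i, c, o)
    return list(best.values())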
[Plot: execution time T[s] over problem size n for the block-wise distributed version and the optimized program.]
Fig. 1. Solving a differential equation
5 Practical Results
The optimization method described above is implemented in a prototype compiler. This section shows the optimization results compared to the results that can be achieved if no redundancy is used and data distribution is not done for parallel assignments but for loops only. All measurements were done on an 8-processor Parsytec PowerXplorer. Figure 1 shows the runtimes for solving a differential equation in n points. If no redundant computations are allowed, a blockwise distribution is optimal. Therefore the optimized program is compared to a version using blockwise distribution. Especially for small input sizes, the optimized program is nearly one magnitude faster than the blockwise distributed version.
6 Conclusion and Further Work
This article introduces a formal definition for parameterized data distributions that allow redundancies. This makes it possible to transfer the concept of redundant computations to the world of parameterized data distributions such that compilers can use redundant computations efficiently. Based on these definitions, a method for optimizing SIMD programs on distributed systems is given. This method allows combining several local optimization methods for computing local proposals. Furthermore, redundant computations are generated to save communication. For oblivious programs the method selects the optimal proposals for each array assignment. The practical results show a large gain in performance. In the future, different data distribution spaces, i.e. parameters and their interpretations, have to be examined to select suitable ones. Additionally, optimal redistribution schemes for these data distribution spaces should be selected.
References

1. R. Bixby, K. Kennedy, and U. Kramer. Automatic data layout using 0–1 integer programming. In M. Cosnard, G. R. Gao, and G. M. Silberman, editors, Parallel Architecture and Compilation Techniques (A–50). IFIP, Elsevier Science B.V. (North-Holland), 1994. 456, 460
2. J. Eisenbiegler, W. Löwe, and A. Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC'97), pages 149–155. IEEE Computer Society Press, March 1997. 456
3. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1992. 456, 457, 458
4. P. Z. Lee. Efficient algorithms for data distribution on distributed memory parallel computers. IEEE Transactions on Parallel and Distributed Systems, 8(8):825–839, August 1997. 456
5. W. Löwe, W. Zimmermann, and J. Eisenbiegler. Optimization of parallel programs with expensive communication. In EUROPAR '96, Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 602–610, 1996. 456, 459
6. H. Moldenhauer. Kostenbasierte Konfigurierung für Programme und SW-Architekturen. Logos Verlag, Berlin, 1998. Dissertation, Universität Karlsruhe. 459
7. C. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal on Computing, 19(2):322–328, 1990. 456
8. M. Philippsen. Optimierungstechniken zur Übersetzung paralleler Programmiersprachen. Number 292 in VDI Fortschritt-Berichte, Reihe 10: Informatik. VDI-Verlag GmbH, Düsseldorf, 1994. Dissertation, Universität Karlsruhe. 456, 457, 458
Exploiting Coarse Grain Parallelism from FORTRAN by Mapping it to IF1

Adrianos Lachanas and Paraskevas Evripidou

Department of Computer Science, University of Cyprus
P.O. Box 537, CY-1687 Nicosia, Cyprus
Tel: +357-2-338705, FAX: -339062
[email protected]

Abstract. FORTRAN, a classical imperative language, is mapped into IF1, a machine-independent dataflow graph description language with Single Assignment Semantics (SAS). Parafrase 2 (P2) is used as the front-end of our system. It parses the source code, generates an intermediate representation and performs several types of analysis. Our system extends the internal representation of P2 with two data structures: the Variable to Edge Translation Table and the Function to Graph Translation Table. It then proceeds to map the internal representation of P2 into IF1 code. The generated IF1 is then processed by the back-end of the Optimizing SISAL Compiler that generates parallel executables on multiple target platforms. We have tested the correctness and the performance of our system with several benchmarks. The results show that even on a single processor there is no performance degradation from the translation to SAS. Furthermore, it is shown that on a range of processors, reasonable speedup can be achieved.
1 Introduction
The translation of imperative languages to Single Assignment Semantics was thought to be a hard, if not an impossible task. In our work we tackled this problem by mapping Fortran to a Single Assignment Intermediate Form, IF1. Mapping Fortran to SAS is a worthwhile task by itself. However, our work has also targeted the automatic and efficient extraction of parallelism. In this paper we present the compiler named Venus that transforms a classical imperative language (Fortran, C) into an inherently parallel applicative format (IF1). The Venus system uses a multi-language precompiler (Parafrase 2 [5]) and a multi-platform post compiler (Optimizing SISAL Compiler [4]) in order to simplify the translation process and to provide multi-platform support [1]. The P2 intermediate representation of the source code is processed by Venus to generate a dataflow description of the source program. The IF1 dataflow graph description is then passed to the last stage, the Optimizing SISAL Compiler (OSC), which performs optimizations and produces parallel executables. Since IF1 is machine independent and OSC has been implemented successfully on many major parallel computers, the final stage can generate executables for a wide range of computer platforms [4].
Our system maps the entire Fortran 77 semantics into IF1. The only construct that is not yet supported is I/O. We have tested our system with a variety of benchmarks. Our analysis has shown that there is no performance degradation from the mapping to SAS. Furthermore, the SAS allows for more efficient exploitation of parallelism. In this paper we concentrate on the description of the mechanism for mapping coarse grain Fortran constructs into SAS.
2 Compilation Environment
The Target Language (IF1): IF1 is a machine independent dataflow graph description language that adheres to single assignment semantics. It was developed as the intermediate form for the compilation of the SISAL language. As a language, IF1 is based on hierarchical acyclic graphs with four basic components: nodes, edges, types, and graph boundaries. The IF1 Graph Component is an important tool in IF1 for procedural abstraction. Graph boundaries are used to encapsulate related operations. In the case of compound nodes, the graph serves to separate the operations of the various functional parts. Graph boundaries can also delimit functions. The IF1 Edge Component describes the flow of data between nodes and graphs, representing explicit data dependencies. The IF1 Node Component represents operations in the IF1 dataflow graph. Nodes are referenced using their label, which must be unique within the graph. All nodes are functions, and have both input and output ports. The front-end, Parafrase 2 [5], is a source-to-source multilingual restructuring compiler that translates Fortran to Cedar Fortran. P2 invokes different passes to transform the program to an internal representation in the form of a parse tree, symbol table, call graph and dependency graph [5]. P2 does dependency analysis, variable alias analysis, and loop parallelization analysis. Venus starts by traversing the parse tree of P2. The basic modules of the parse tree are the function and the subroutine. Each such module contains a parameter list, a declarations list and a linked list of statements/expressions. For each module we generate an FGTT and update it with the parameter and declarations lists. Then we proceed to generate a new scope for the VETT. We then continue to update the VETT with all variables in the declaration list. For each statement in the statement/expression list we generate a new IF1 node and its corresponding edges and update the VETT accordingly. The back-end, the Optimizing SISAL Compiler [2], is used to produce executable code from the IF1 programs. OSC was chosen for this project since it has been successfully implemented for performance machines like Sequent, SGI, HP, Cray, Convex and most UNIX based workstations [2].
3 Fine and Medium Grain Parallelism
The problem when generating code on the basic (expression) level is the single assignment restriction. This is a concern when Fortran variables are translated into IF1 edges [1]. In Fortran a variable is a memory location that can have many
values stored in it over time, whereas IF1 edges carry single values. An IF1 edge is a single assignment. Our system handles this problem on the basic block level using a structure called the Variable to Edge Translating Table (VETT) [3]. The VETT allows the translator to identify the node and port whose output represents the current value of a particular variable. When a variable is referenced in an expression, the VETT is accessed to obtain the current node and port number of the variable. The VETT fields do not need to change, since the value of the variable is not changed. When a variable is referenced on the left hand side (LHS) of an assignment, the node and port entry in the VETT will change to represent the new node and port that carries the new value. An UPDATED flag is set in the VETT entry to show that the variable has changed value within the scope. A variable write consists only of updating the VETT entries (SN, SP) of that variable. In IF1, arrays can be accessed sequentially or in parallel. Whether an array is accessed sequentially or in parallel is decided by Parafrase 2. During a sequential LHS access, the dereferencing nodes that expose the location of the old element, the nodes that provide the value for the new element, and the reconstruction nodes that combine the deconstructed pieces of the array together with the new element value build a new array. The mechanism used in a parallel RHS access is similar to that used in sequential array access. Multiple values of the array elements are generated and are forwarded to IF1 nodes within the compound node. Parallel LHS access is done using a loop reduction node. Values for the array elements are generated by the parallel loop iterations and are gathered into a new array.
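As a rough illustration of the VETT mechanism described above (a sketch only; the field and function names are invented here and are not the Venus data structures), a VETT entry need only remember which node and port currently carry a variable's value, so a read consults the entry and only a write changes it:

    /* Minimal sketch of a VETT entry; field names are illustrative. */
    typedef struct {
        char name[32];   /* Fortran variable name                   */
        int  sn, sp;     /* source node and port carrying its value */
        int  updated;    /* set when the variable is assigned       */
    } VettEntry;

    /* A read just consults (sn, sp); only a write changes the entry. */
    static void vett_note_write(VettEntry *e, int new_node, int new_port)
    {
        e->sn = new_node;
        e->sp = new_port;
        e->updated = 1;
    }

    static void vett_note_read(const VettEntry *e, int *src_node, int *src_port)
    {
        *src_node = e->sn;   /* edge source for the referencing IF1 node */
        *src_port = e->sp;
    }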
4 Coarse Grain Parallelism
Fortran limits the amount of parallelism we can exploit because of its parameter interface. In Fortran, all input parameters of a module are passed by reference. This allows the body of the function to access the input parameter and change its value.
4.1 Mapping of Functions and Subroutines into IF1 Graphs
Fortran modules are interrelated parts of code that have common functionality. Translating modules into IF1 graphs requires additional information about the module, such as the type of the module (function or subroutine), the input parameters, the output parameters and the source/destination node and port of each parameter. The translator handles this problem using a structure named the Function to Graph Translating Table (FGTT). The FGTT is generated in two passes. During the first pass the translator builds the FGTT and stores all the information about a module, such as the name of the module, the type of module (function or subroutine) and the common variables that the module has access to (see Figure 1). In IF1 there are no global variables. The translator removes all common statements and replaces
The example used in Figure 1 is the subroutine

    subroutine sample(pphd, ppld, pfdh)
    integer pphd, ppld, pfdh
    pphd = ppld * pfdh
    call foo(pphd)
    ppld = goo(pphd, pfdh)
    return
    end

Its FGTT entry records Funct ID sample, Type Subroutine, the input parameter list pphd (ID 1), ppld (ID 2), pfdh (ID 3), the output parameter list pphd (ID 1), ppld (ID 2), and the function calls list: foo with parameter pphd, and goo with parameters pphd and pfdh.

Fig. 1. FGTT functionality at coarse grain parallelism
The two Fortran modules drawn in Figure 2 are

    real function foo(b)
    real b
    real a
    a = b * b * 2
    b = b + 2 * b
    foo = a
    end

    subroutine goo(a, b)
    real a, b
    a = a + a
    b = b + b
    end

Fig. 2. (a) The IF1 graph of the Fortran function foo; output port 1 is reserved for the function return value. (b) The IF1 graph of the Fortran subroutine goo.
them with input and output parameters. The values from a Fortran module are returned explicitly or implicitly. The function return values are exported at the first port of the IF1 graph (Figure 2a), whereas the other values are returned on greater ports. In the first pass, the parameters that are assigned in the function body are flagged as updated. In the second pass the FGTT is processed. A function's parameter can also be changed by a function call that has that parameter as input. If the parameter is changed by the function call, it should be returned by the IF1 graph. For each function, the translator accesses the function call list with its input parameters, and detects which parameters are changed by each function call. The parameters returned by a module are only those that have been changed in the function body. This reduces the data dependencies in the dataflow model, because when a function parameter does not change, its value can be fetched by the nodes ascending the function call node. Fine and medium grain parallelism in the function body is translated with the aid of the VETT as described in Section 3. To create output edges from the graph, the translator traverses the FGTT output parameter list. For each output parameter the translator finds its corresponding VETT entry and creates an output edge whose source node/port values are the corresponding VETT entry's source node/port field values (SN, SP).
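The final step of this pass can be pictured with a small C sketch (illustrative only; the structure layout and names below are assumptions, not the Venus implementation): one output edge is emitted per returned parameter, taking its source from the VETT entry.

    /* Sketch (hypothetical layout): an FGTT output-parameter list and the
     * creation of graph output edges from the corresponding VETT entries. */
    #define MAX_PARAMS 16

    typedef struct { int sn, sp; int updated; } VettEntry;

    typedef struct {
        int out_params[MAX_PARAMS];  /* indices of parameters returned by the graph */
        int n_out;
    } Fgtt;

    typedef struct { int src_node, src_port, dst_port; } If1Edge;

    /* Output ports are assigned consecutively; for a function, port 1 is
     * reserved for the return value, so first_port would start at 2. */
    static int make_output_edges(const Fgtt *f, const VettEntry *vett,
                                 If1Edge *edges, int first_port)
    {
        for (int i = 0; i < f->n_out; i++) {
            const VettEntry *e = &vett[f->out_params[i]];
            edges[i].src_node = e->sn;
            edges[i].src_port = e->sp;
            edges[i].dst_port = first_port + i;
        }
        return f->n_out;
    }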
4.2 Structured Control Flow: Nested IFs
Structured control flow of Fortran is implemented using the IF1 compound node “Select”. The Select node consists of at least three subgraphs: a selector subgraph to determine which code segment to execute, and at least two alternative branch subgraphs (corresponding to the THEN and ELSE blocks), each containing one code branch.
The two cases illustrated in Figure 3 are

    FORTRAN code (a):
    if (mytest .EQ. 0) then
      a = b + c
    else
      b = a + c
    end if

    FORTRAN code (b):
    if (mytest .EQ. 0) then
      a = b + c
      b = 2 * b
    end if

In case (a), after scanning the THEN part only a has its "then" flag set in the VETT, and after scanning the ELSE part only b has its "else" flag set; c keeps both flags false. In case (b), where the ELSE statement is missing, a and b have their "then" flags set after the THEN part, and no "else" flag is ever set. In both cases the figure also shows the resulting Select graph with its Test subgraph.

Fig. 3. Select node handling: (a) else statement is present, (b) else statement is missing.
A compound node constitutes a new IF1 scope. The code within the Fortran IF statement is scanned twice by Venus. During the first pass, the translator builds IF1 code for each alternative branch, and updates the “then” or “else” flag of VETT (see Figure 3 a, b) of variables being updated. For example, if a variable is updated on the THEN branch, then its “then” flag in VETT is set to 1 (True). For variables updated on ELSE branch, their “else” flag at VETT is set to 1. At second pass the translator rebuilds the IF1 code for each branch and generates output edges. A variable within compound node scope that has the “then” or “else” flag in VETT set to 1, is returned by the compound node. Variables that were assigned as updated have their VETT entries changed in order to reflect the new values exported by the compound node.
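A minimal sketch of the step that closes a Select scope could look as follows (the structure and names are invented for exposition and are not the Venus code); a variable updated on either branch is exported by the compound node and its VETT entry is redirected to one of the node's output ports:

    /* Sketch: deciding which variables a Select compound node must return. */
    typedef struct {
        int sn, sp;
        int then_flag, else_flag;
    } VettEntry;

    static int close_select_scope(VettEntry *vett, int n_vars,
                                  int select_node, int first_out_port)
    {
        int port = first_out_port;
        for (int i = 0; i < n_vars; i++) {
            if (vett[i].then_flag || vett[i].else_flag) {
                vett[i].sn = select_node;  /* value now comes from the Select node */
                vett[i].sp = port++;
                vett[i].then_flag = vett[i].else_flag = 0;
            }
        }
        return port - first_out_port;      /* number of values exported */
    }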
5 Performance Analysis
Our benchmark suite consists of a number of Livermore Loops and several benchmarks from weather applications. In Table 1 we present the execution time on a single processor of the Livermore Loops written in Fortran, Sisal and IF1 generated from the Venus translator. As shown in the table the loops execute in time similar to their Sisal equivalent and Fortran. This shows that there is no performance degradation when we map Fortran to SAS. Furthermore the results
show that Venus can produce code from Fortran that is as efficient as what OSC produces from SISAL. The results obtained from a multiprocessor Sequent Symmetry S81 and an HP Convex system are summarized in Table 2. (12L and 20L are collections of 12 and 20 Livermore Loop subroutines. A, E, F, G correspond to the vtadv, cloudy, hzadv and depos modules from the Eta-dust code for regional weather prediction and dust accumulation; Eta-dust is a product of the National Meteorological Center and the University of Belgrade. B, C, D correspond to the modules simple, ricard and yale, which are benchmarks taken from weather prediction Fortran codes.) The results are summarized in terms of the speedup gained on a number of processors (ranging from 2 to 6) over 1 processor. They show that in all cases Venus was able to produce IF1 code with respectable speedups.
Table 1. Execution time (in seconds) of Livermore Loops on a Sun OS/5 for 300, 600 and 1000 iterations.

    Loop #   Fortran to IF1             SISAL                      Fortran
             300     600     1000       300     600     1000       300     600     1000
    1        72.3    118.5   179.9      70.1    115.3   162.8      70.4    112.3   177.5
    2        30.1    42.2    58.7       30.9    42.1    60.4       30.4    40.8    55.7
    3        57.3    96.6    149.1      60.2    92.9    144.7      54.8    93.2    148.1
    4        369.7   738.0   1228.2     550.7   742.1   1220.7     361.2   735.4   1199.9
    5        54.6    83.6    122.2      55.7    83.9    130.4      51.7    83.1    118.3
    6        26.2    35.4    57.3       26.5    33.3    59.1       24.9    32.5    58.7
    7        53.8    71.7    95.9       54.4    70.2    99.3       50.9    70.5    99.6
    8        119.7   134.5   154.7      123.4   154.2   162.1      102.4   127.9   145.2
    9        42.2    74.5    117.5      42.3    74.9    121.4      38.7    72.9    114.7
    10       32.6    51.4    74.3       31.4    52.3    72.6       30.5    50.4    72.8
    11       50.9    83.6    127.5      52.3    82.6    125.3      47.4    85.7    131.4
    12       33.2    52.1    77.3       33.7    53.7    74.2       32.5    52.6    74.6
    12L      784.8   731.4
    20L      1746.1  1672.8

6 Conclusions
In this paper we presented Venus, a system that maps Fortran programs into Single Assignment Semantics with special emphasis on the extraction of parallelism. Venus is able to process Fortran 77 with the exception of the I/O constructs. We have developed the necessary framework for the mapping and developed two special data structures for implementing the mapping. The overall system is robust and has been thoroughly tested both on a uniprocessor and on
Table 2. Speedup gained from Fortran-to-IF1 programs over a range of processors.

           Livermore Loop #
    CPUs   2    6    7    8    10   14   15   16   17   19   20   21   12L  20L
    2      2.1  1.8  2.1  1.9  1.6  2.0  1.8  1.3  1.4  1.9  1.2  1.9  1.8  1.6
    3      3.0  2.1  2.9  2.6  2.3  2.9  2.7  1.6  1.8  2.7  1.4  3.1  2.7  2.5
    4      4.0  2.5  3.9  3.7  3.2  4.0  3.5  2.1  2.2  3.4  1.5  3.8  3.4  3.2
    5      4.2  2.8  4.9  4.2  3.9  4.9  4.7  2.6  2.2  4.0  1.7  4.9  4.0  3.7
    6      5.9  3.2  5.9  4.7  4.5  5.7  5.3  4.2  3.5  4.6  2.1  5.7  4.7  4.3

           Weather Benchmarks
    CPUs   A    B    C    D    E    F    G
    2      1.4  1.7  1.6  1.7  2.0  1.5  2.4
    3      1.7  2.5  2.3  2.3  2.6  1.9  3.0
    4      2.2  3.1  3.2  2.9  3.4  2.2  4.1
a multiprocessor. The results have shown that there is no performance penalty when we map from Fortran to SAS. Furthermore we can exploit the same level of parallelism as we do from equivalent programs written in a functional language. This paper has presented the case for combining the benefits of imperative syntax with functional semantics by transforming structured sequential Fortran code to the single assignment intermediate format, IF1. Furthermore, this project couples OSC with Fortran 77. Thus, it has the potential of exposing a much larger segment of the scientific community to portable parallel programming with OSC.
References 1. Ackerman, W. B., “Dataflow languages.” In S. S. Thakkar, editor, Selected Reprints on Dataflow and Reduction Architectures, Computer Society Press, Washington D.C., 1987. 463, 464 2. Cann, C. D., “The Optimizing SISAL Compiler.” Lawrence Livermore National Laboratory, Livermore, California. 464 3. Evripidou P. and Robert, J. B., “Extracting Parallelism in FORTRAN by Translation to a Single Assignment Intermediate Form.” Proceedings of the 8 th IEEE International Parallel Processing Symposium, April 1994. 465 4. Feo, J. T., and Cann, D. C. “A report on the Sisal language project.” Journal of Parallel and Distributed Computing, 10, (1990) 349-366. 463 5. Polychronopoulos, C. D., “Parafrase 2: an environment for parallelizing, partitioning, synchronizing and scheduling programs on multiprocessors.” Proceedings of the 1989 International Conference on Parallel Processing, 1989. 463, 464 6. Polychronopoulos, C. D., and Girkar, M. B., “Parafrase-2 Programmer’s Manual”, February 1995. 7. Skedzielewski, S, “SISAL.” In B. K. Szymanski, editor, Parallel Functional Languages and Compilers, Addison-Wesley, Menlo Park, CA, 1991. 8. Skedzielewski, S, Glauert, J, “IF1: An intermediate form for Applicative languages.” Technical Report TR M-170, University of California - Lawrence Livermore National Laboratory, August 1985.
A Parallelization Framework for Recursive Tree Programs Paul Feautrier Laboratoire PRiSM, Université de Versailles St-Quentin 45 Avenue des Etats-Unis 78035 VERSAILLES CEDEX FRANCE [email protected]
Abstract. The automatic parallelization of “regular” programs has encountered a fair amount of success due to the use of the polytope model. However, since most programs are not regular, or are regular only in parts, there is a need for a parallelization theory for other kinds of programs. This paper explores the suggestion that some “irregular” programs are in fact regular on other data and control structures. An example of this phenomenon is the set of recursive tree programs, which have a well defined parallelization model and dependence test.
1 A Model for Recursive Tree Programs
The polytope model [7,3] has been found to be a powerful tool for the parallelization of array programs. This model applies to programs that use only DO loops and arrays with affine subscripts. The relevant entities of such programs (iteration space, data space, execution order, dependences) can be modeled as Z-polytopes, i.e. as sets of integral points belonging to bounded polyhedra. Finding parallelism depends on our ability to answer questions about the associated Z-polytopes, for which task one can use well known results from mathematics and operations research. The aim of this paper is to investigate whether there exist other program models for which one can devise a similar automatic parallelization theory. The answer is yes, and I give as an example the recursive tree programs, which are defined in sections 1.4 and 1.5. The relevant parallelization framework is presented in section 2. In the conclusion, I point to unsolved problems for the recursive tree model, and suggest a search for other examples of parallelization frameworks.
1.1 An Assessment of the Polytope Model
The main lesson of the polytope model is that the suitable level of abstraction for discussing parallelization is the operation, i.e. an execution of a statement.
In the case of DO loop programs, operations are created by loop iterations. Operations can be named by giving the values of the surrounding loop counters, arranged as an iteration vector. These values must be within the loop bounds. If these bounds are affine forms in the surrounding loop counters and constant structure parameters, then the iteration vector scans the integer points of a polytope, hence the name of the model. To achieve parallelization, one has to find subsets of independent operations. Two operations are dependent if they access the same memory cell, one at least of the two accesses being a write. This definition is useful only if operations can be related to the memory cells they access. When the data structures are arrays, this is possible if subscripts are affine functions of iteration vectors. The dependence condition translates into a system of linear equations and inequations, whose unknowns are the surrounding loop counters. There is a dependence iff this system has a solution in integers. These observations can be summarized as a set of requirements for a parallelization framework:

1. We must be able to describe, in finite terms, the set of operations of a program. This set will be called the control domain in what follows. The control domain must be ordered.
2. Similarly, we must be able to describe a data structure as a set of locations, and a function from locations to values.
3. We must be able to associate sets of locations to operations through the use of address functions.

The aim of this paper is to apply these prescriptions to the design of a parallelization framework for recursive tree programs.
1.2 Related Work
This section follows the discussion in [5]. The analysis of programs with dynamic data structures has been carried mainly in the context of pointer languages like C. The first step is the identification of the type of data structures in the program, i.e. the classification of the pointer graph. The main types are trees (including lists), DAG and general graphs. This can be done by static analysis at compile time [4], or by asking the programmer for the information. This paper uses the second solution, and the data structures are restricted to trees. The next step is to collect information on the possible values of pointers. This is done in a static way in the following sense: the sets of possible pointer values are associated not to an operation but to a statement. These sets will be called regions here, by analogy to the array regions [9] in the polytope model. Regions are usually represented as path expressions, which are regular expressions on the names of structure attributes [8]. Now, a necessary (but not sufficient) condition for two statements to be in dependence is that two of their respective regions intersect. It is easy to see that this method incurs a loss of information which may forsake parallelisation in
important cases. The main contribution of this paper is to improve the precision of the analysis for a restricted category of recursive tree programs.
1.3 Basic Concepts and Notations
The main tool in this paper is the elementary theory of finite state automata and rational transductions. A more detailed treatment can be found in Berstel's book [1]. In this paper, the basic alphabet is IN (the set of non-negative integers), ε is the zero-length word, and the point (.) denotes concatenation. In practical applications, the alphabet is always some finite subset of IN. A finite state automaton (fsa) is defined in the usual way by states and labelled transitions. One obtains a word of the language generated (or accepted) by an automaton by concatenating the labels on a path from an initial state to a terminal state. A rational transduction is a relation on IN* × IN* which is defined by a generalized sequential automaton (gsa): an fsa whose edges are labelled by input and output words. Each time an edge is traversed, its input word is removed at the beginning of the input, and its output word is added at the end of the output. Gsa are also known as Moore machines. The family of rational transductions is closed under inversion (simply reverse the elements of each digram), concatenation and composition [1]. A regular language can also be represented as a regular expression: an expression built from the letters and ε by the operations of concatenation (.), union (+) and Kleene star. This is also true for gsa, with letters replaced by digrams. The domain of a rational transduction is a regular language whose fsa is obtained by deleting the second letter of each edge label. There is a similar construction for the range of a rational transduction. From one fsa c, one may generate many others by changing the set of initial or terminal states. c(s; ) is deduced from c by using s as the unique initial state. c(; t) has t as its unique terminal state. In c(s; t), both the initial and terminal states have been changed. Since rational transductions are defined by automata, similar operations can be defined for them.
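As a small concrete illustration (a sketch under assumed names, not taken from the paper), a gsa edge can simply carry a pair of words over the alphabet, and inverting the transduction amounts to swapping the two words of every edge label:

    /* Sketch: one possible in-memory representation of a gsa edge. */
    #define MAX_WORD 8

    typedef struct {
        int from, to;                 /* source and target states        */
        int in[MAX_WORD],  in_len;    /* input word consumed by the edge */
        int out[MAX_WORD], out_len;   /* output word produced by the edge*/
    } GsaEdge;

    /* Inverting a transduction reverses the elements of each digram. */
    static void invert_edge(GsaEdge *e)
    {
        GsaEdge t = *e;
        for (int i = 0; i < t.out_len; i++) e->in[i]  = t.out[i];
        for (int i = 0; i < t.in_len;  i++) e->out[i] = t.in[i];
        e->in_len  = t.out_len;
        e->out_len = t.in_len;
    }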
1.4 The Control Domain of Recursive Programs
The following example is written in a C-like language which will be explained presently:

    BOOLEAN tree leaf;
    int tree value;

    void main(void)
    {
    5 :  sum([]);
    }

    void sum(address I)
    {
    1 :  if(! leaf[I]) {
    2 :    sum(I.1);
    3 :    sum(I.2);
    4 :    value[I] = value[I.1] + value[I.2];
         }
    }
value is a tree whose nodes hold integers, and leaf is a Boolean tree. The problem is to sum up all integers on the leaves, the final result being found at the root of the tree. Addresses will be discussed in Sect. 1.5. Let us consider statement 4. Execution of this statement results from a succession of recursive calls to sum, the statement itself being executed as a part of the body of the last call. Naming these operations is achieved by recording where each call to sum comes from: either line 2 or 3. The required call strings can be generated by a control automaton whose states are the functions names and the basic statements of the program. The state associated to main is initial and the states associated to basic statements are terminal. There is a transition from state p to state q if p is a function and q is a statement occuring in p body. The transition is labelled by the label of q in the body of p.
Fig. 1. The control automaton of sum: the initial state main has a transition labelled 5 to the state sum; sum has transitions labelled 2 and 3 back to itself (the recursive calls) and transitions labelled 1 and 4 to the terminal states associated with statements 1 and 4.
As the example shows, if the statements in each function body are labeled by ascending numbers, the execution order is exactly lexicographic ordering on the call strings.
1.5 Addressing in Trees
Remark first that most tree algorithms in the literature are expressed recursively. Observe also that in the polytope model, the same mathematical object is used as the control space and the set of locations of the data space. Hence, it seems quite natural to use trees as the prefered data structure for recursive programs. In a tree, let us number the edges coming out of a node from left to right by consecutive integers. The name of node n is then simply the string of edge numbers which are encountered on the unique path from the root of the tree to n. The name of the root is the zero-length string, . This scheme dates back at least to Dewey Decimal Notation as used by librarians the world over. The set of locations of a tree structure is thus IN∗ , and a tree object is a partial function from IN∗ to some set of values, as for instance the integers, the floating point numbers or the characters.
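As a concrete illustration of this naming scheme (a sketch only; the node layout and function name below are invented for exposition), a tree location can be resolved by following its Dewey string one edge number at a time, the zero-length string naming the root:

    #include <stddef.h>

    /* Sketch: a tree node addressed by a Dewey string of child numbers. */
    typedef struct Node {
        int value;
        int n_children;
        struct Node **child;   /* child[0] is edge 1, child[1] is edge 2, ... */
    } Node;

    /* Follow the address a[0..len-1] from the root; len == 0 names the
     * root itself. Returns NULL if the path leaves the tree. */
    static Node *locate(Node *root, const int *a, size_t len)
    {
        Node *n = root;
        for (size_t i = 0; i < len && n != NULL; i++) {
            if (a[i] < 1 || a[i] > n->n_children)
                return NULL;
            n = n->child[a[i] - 1];
        }
        return n;
    }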
Address functions map operations to locations, i.e. integer strings to integer strings. The natural choice for them is the family of rational transductions [1]. Consider again the above example. Notice the global declarations for the trees value and leaf. address is the type of integer strings. In line 4, such addresses are used to access value. The second address, for instance, is built by postfixing the integer 1 to the value of the address variable I. This variable is initialized to ε at line 5 of main. If the call at line 2 (resp. 3) of sum is executed, then a 1 (resp. 2) is postfixed to I. The outcome of this discussion is that at entry into function sum, I comes either from line 2 or 3 or 5, hence the regular equation I = ⟨5, ε⟩ + I.⟨2, 1⟩ + I.⟨3, 2⟩, whose solution is the regular expression I = ⟨5, ε⟩.(⟨2, 1⟩ + ⟨3, 2⟩)*. Similarly, the second address in line 4 is given by the rational transduction ⟨5, ε⟩.(⟨2, 1⟩ + ⟨3, 2⟩)*.⟨4, 1⟩. I conjecture that the reasoning that has been used to find the above address functions can be automated, but the details have not been worked out yet. It seems probable that this analysis will succeed only if the operations on addresses are suitably limited. Here, the only allowed operator is postfixing by a word of IN*. This observation leads to the definition of a toy language, T, which is similar to C, with the following restrictions:

– No pointers are allowed. They are replaced by addresses.
– The only data structures are scalars (integers, floats and so on) and trees thereof. Trees are always global variables. Addresses can only be used as local variables or function parameters. No function may return an address.
– The only control structures are the conditionals and the function calls, possibly recursive. No loops or goto are allowed.
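For illustration, the address type and its single allowed operation could be represented as below; this is only a sketch under assumed names, since T is specified informally in the text:

    #include <string.h>

    /* Sketch of the only address operation allowed in T: postfixing a word
     * of IN* to an address, stored here as an int array with a length. */
    #define MAX_ADDR 64

    typedef struct {
        int letters[MAX_ADDR];
        int len;            /* len == 0 is the zero-length word (the root) */
    } Address;

    /* a := a.w; e.g. postfixing the single letter 1 builds I.1 as in statement 2. */
    static int postfix(Address *a, const int *w, int wlen)
    {
        if (a->len + wlen > MAX_ADDR)
            return -1;      /* out of room in this fixed-size sketch */
        memcpy(a->letters + a->len, w, (size_t)wlen * sizeof *w);
        a->len += wlen;
        return 0;
    }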
2 Dependence Analysis of T

2.1 Parallelization Model
When parallelizing static control programs, one has first to decide the shape of the parallel version. One usually distinguishes between control parallelism, where operations executed in parallel are instances of different statements, and data parallelism, where parallelism is found among iterations of the same statement. In recursive programs, repetition of a statement is obtained by enclosing it in the body of a recursive function, as for example in the program of Fig. 2.

    void foo(x)
    {
      S;
      if (p) foo(y);
    }

Fig. 2. Sequential version

    void foo(x)
    {
      {^
        S;
        if (p) foo(y);
      ^}
    }

Fig. 3. Parallel version

Suppose it is possible to decide that the operations associated to S and all operations generated by the call to foo are independent. The parallel version in Fig. 3 (where {^ ... ^} is used as the parallel version of { ... } [6]) is equivalent to the sequential original. The degree of parallelism of this program
is of the order of the number of recursive calls to foo, which probably depends on the data set size. This is data parallelism expressed as control parallelism.

A possible formalization is the following. Let us consider a function foo and the statements {S_1, ..., S_n} of its body. The statements are numbered in textual order, and i is the label of S_i. Tests in conditional statements are to be considered as elementary, and must be numbered as they occur in the program text. Let us construct a synthetic dependence graph (SDG) for foo. The vertices of the SDG are the statements of foo. There is a dependence edge from S_i to S_j, i < j, iff there exist three iteration words u, v, w such that:

– u is an iteration of foo.
– Both u.i.v and u.j.w are iterations of some terminal statements S_k and S_l.
– u.i.v and u.j.w are in dependence.

Observe that u is an iteration word for foo, hence belongs to the language generated by c(; foo). Similarly, v is an iteration of S_k relative to an iteration of S_i, hence belongs to c(S_i; S_k), and w ∈ c(S_j; S_l). As a consequence, the pair ⟨u.i.v, u.j.w⟩ belongs to the following rational transduction:

h = c(; foo)^= . ⟨i, j⟩ . c(S_i; S_k) . c(S_j; S_l)^(-1),

in which, if a is an automaton, then a^= is the transduction obtained by setting each output word equal to the corresponding input word. This formula also uses an automaton as a transduction whose output words have zero length; similarly, the inverse of an automaton is used as a transduction whose input words have zero length. S_i and S_j may also access local scalar variables, for which the dependence calculation is trivial. Besides, one must remember to add control dependences, from the test of each conditional to all statements in its branches. Lastly, dependences between statements belonging to opposite arms of a conditional are to be omitted. Once the SDG is computed, a parallel program can be constructed in several well known ways. Here, the program is put in series/parallel form by topological sorting of the SDG. As above, I will use the C-EARTH version of the fork ... join construct, {^ ... ^}. The run time exploitation of this kind of parallelism is a well known problem [6].
2.2 The Dependence Test
Computing dependences for the body of function foo involves two distinct algorithms. The first (or outermost) one enumerates all pairs of references which are to be checked for dependence. This is a purely combinatorial algorithm, of polynomial complexity, which can be easily reconstructed by the reader. The inner algorithm has to decide whether there exist three strings x, y, w such that ⟨x, y⟩ ∈ h, ⟨x, w⟩ ∈ f and ⟨y, w⟩ ∈ g, where the first term expresses the fact that x and y are iterations of S_k and S_l which are generated by one and the same call to foo, the second and third ones expressing the fact that both x and y access location w. f and g are the address transductions of S_k and S_l. The first step is to eliminate w, giving ⟨x, y⟩ ∈ k = g^(-1) ∘ f. k is a rational transduction by the Elgot and Mezei theorem [2]. Hence, the pair ⟨x, y⟩ belongs to the intersection of the two transductions h and k. Deciding whether h ∩ k is empty is clearly equivalent to deciding whether θ ∩ = is empty, where = is the equality relation and θ = k^(-1) ∘ h.

Deciding whether the intersection of two transductions is empty is a well known undecidable problem [1]. It is possible however to solve it by a semi-algorithm, which is best presented as a (one person) game. A position in the game is a triple ⟨u, v, p⟩ where u and v are words and p is a state of θ. The initial position is ⟨ε, ε, p_0⟩, where p_0 is the initial state of θ. A position is a win if u = v = ε and if p is terminal. A move in the game consists in selecting a transition from p to q in θ with label ⟨x, y⟩. The outcome is a new position ⟨u', v', q⟩ where u' and v' are obtained from u.x and v.y by deleting their common prefix. A position is a loss if u and v begin by distinct letters: in such a case, no amount of postfixing can complete u and v to equal strings. There remain positions in which either u or v or both are ε. Suppose u = ε. Then, for success, v must be the prefix of a string in the domain of θ when starting from p. This can be tested easily, and, if the check fails, then the position again is a loss. The situation is symmetrical if v = ε. This game may have three outcomes: if a win can be reached, then by restoring the deleted common prefixes, one reconstructs a word u such that ⟨u, u⟩ ∈ θ, hence a solution to the dependence problem. If all possible moves have been explored without reaching a win, then the problem has no solution. Lastly, the game can continue for ever. One possibility is to put an upper bound on the number of moves. If this bound is reached, one decides that, in the absence of a proof to the contrary, a dependence exists. The following algorithm explores the game tree in breadth-first fashion.

Algorithm D.
1. Set D = ∅ and L = {⟨ε, ε, p_0⟩}, where p_0 is the initial node of θ.
2. If L = ∅, stop. There is no dependence.
3. Extract the leftmost element ⟨u, v, p⟩ of L.
4. If ⟨u, v, p⟩ ∈ D, restart at step 2.
5. If u = v = ε and if p is terminal, stop. There is a dependence.
6. If both u ≠ ε and v ≠ ε, the position is a loss. Restart at step 2.
7. If u = ε and if v is not a prefix of a word in the domain of θ(p; ), restart at step 2.
8. If v = ε and if u is not a prefix of a word in the range of θ(p; ), restart at step 2.
9. Add ⟨u, v, p⟩ to D. Construct all the positions which can be reached in one move from ⟨u, v, p⟩ and add them on the right of L. Restart at step 2.

Since the exploration is breadth-first, it is easy to prove that if there is a dependence, then the algorithm will find it. This algorithm has been implemented as a stand-alone program in Objective Caml. The user has to supply the results of the analysis of the input program, including the control automaton, the address transductions, and the list of statements with their accesses. The program then computes the SDG. All examples in this paper have been processed by this pilot implementation. As far as my experience goes, the case where algorithm D does not terminate has never been encountered.
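To make the move mechanics concrete, here is a small C sketch of a single move of the game (an illustration only, not the Objective Caml implementation mentioned above; buffer sizes and names are arbitrary, and words are held as character strings rather than integer strings):

    #include <stdio.h>
    #include <string.h>

    enum { OK = 0, LOSS = 1 };

    /* Given a position (u, v) and a transition labelled <x, y>, append the
     * labels, delete the common prefix, and report a loss when the residues
     * start with distinct letters (step 6 above). */
    static int play_move(char *u, char *v, const char *x, const char *y)
    {
        char bu[64], bv[64];
        size_t i = 0;

        snprintf(bu, sizeof bu, "%s%s", u, x);   /* u.x */
        snprintf(bv, sizeof bv, "%s%s", v, y);   /* v.y */
        while (bu[i] && bv[i] && bu[i] == bv[i])
            i++;                                  /* longest common prefix */
        if (bu[i] && bv[i])
            return LOSS;                          /* distinct first letters */
        strcpy(u, bu + i);                        /* keep only the residues */
        strcpy(v, bv + i);
        return OK;
    }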
2.3 sum Revisited
Consider the problem of parallelizing the body of sum. There are already control dependences from statement 1 to 2, 3, and 4. The crucial point is to prove that there are no dependences from 2 to 3. One has to test one output dependence from value[I] to itself, two flow dependences from value[I] to value[I.1] and value[I.2], and two symmetrical anti-dependences. Let us consider for instance the problem of the flow dependence from value[I] to value[I.1]. The transduction θ begins in the following way: θ = (⟨2, 2⟩ + ⟨3, 3⟩)*.⟨3, 2⟩ ... . Algorithm D finds that there is no way of crossing the ⟨3, 2⟩ edge without generating distinct strings. Hence, there is no dependence. On the other hand, algorithm D readily finds a dependence from 2 to 4. All in all, the SDG of sum is given by Fig. 4, to which corresponds the parallel program in Fig. 5 — a typical case of parallel divide-and-conquer.
3 Conclusion and Future Work
I have presented here a new framework in which to analyze recursive tree programs. The main differences between the present method and the more usual pointer alias analysis are: – Data structures are restricted to trees, while in alias analysis, one has to determine the shape of the structures. This is a weakness of the present approach.
Fig. 4. The SDG of sum: control dependence edges from statement 1 to statements 2, 3 and 4, and data dependence edges from 2 to 4 and from 3 to 4; there is no edge between 2 and 3.

    void sum(address I)
    {
    1 :  if(! leaf[I]) {
         {^
    2 :    sum(I.1);
    3 :    sum(I.2)
         ^};
    4 :    value[I] = value[I.1] + value[I.2];
         }
    }

Fig. 5. The parallel version of sum
– In T , the operations on addresses are limited to postfixing, which, translated in the language of pointers, correspond to the usual pointer chasing. – The analysis is operation oriented, meaning that addresses are associated to operations, not to statements. This allows to get more precise results when computing dependences. Pointer alias analysis can be transcribed in the present formalism in the following way. Observe that, in the notations of Sect. 2.2, the iteration word x belongs to Domain(h), which is a regular language, hence w belongs to f ( Domain(h)), which is also regular. This is the region associated to Sk . Similarly, w is in the region g( Range(h)). If the intersection of these two regions is empty, there is no dependence. When compared to the present approach, the advantage of alias analysis is that the regions can be computed or approximated without restrictions on the source program. It is easy to prove that when f ( Domain(h)) ∩ g( Range(h)) = ∅ then is empty. It is a simple matter to test whether this is the case. The present implementation reports the number of cases which can be decided by testing for the emptiness of , and the number of cases where Algorithm D has to be used. In a T implementation of the merge sort algorithm, there were 208 dependence tests. Of these, 24 were found to be actual dependences, 34 where solved by region intersection, and 150 required the use of algorithm D. While extrapolating from this example alone would be jumping at conclusions, it gives at least an indication of the relative power of the region intersection and of algorithm D. Incidentaly, the SDG of merge sort was found to be of the same shape as the SDG of sum, thus leading to another example of divide-and-conquer parallelism. The T language as it stands clearly needs a sequential compiler, and a tool for the automatic construction of address relations. Some of the petty restrictions of Sect. 1.5 can probably be removed without endangering dependence analysis. For instance, having trees of structures or structure of trees poses no difficulty. Allowing trees and subtrees as arguments to functions would pose the usual
aliasing problems. A most useful extension would be to allow trees of arrays, as found for instance in some versions of the adaptive multigrid method. How is T to be used? Is it to be another programming language, or is it better used as an intermediate representation when parallezing pointer programs as in C or ML or Java? The latter choice would raise the question of translating C (or a subset of C) to T , i.e. translating pointer operations to address operations. Another problem is that T is static with respect to the underlying set of locations. It is not possible, for instance, to insert a cell in a list, or to graft a subtree to a tree. Is there a way of allowing that kind of operations? Lastly, trees are only a subset of the data structures one encounter in practice. I envision two ways of dealing, e.g., with DAGs and cyclic graphs. Adding new address operators, for instance a prefix operator: π(a1 . . . . .an ) = (a1 . . . . .an−1 ) allows one to handle doubly linked lists and trees with an upward pointer. The other possibility is to use other mathematical structures as a substrate. Finitely presented monoids or groups come immediately to mind, but there might be others.
References 1. Jean Berstel. Transductions and Context-Free Languages. Teubner, Stuttgart, 1979. 472, 474, 476 2. C.C. Elgot and J.E. Mezei. On relations defined by generalized finite automata. IBM J. of Research and Development, 47–68, 1965. 476 3. Paul Feautrier. Automatic parallelization in the polytope model. In Guy-Ren´e Perrin and Alain Darte, editors, The Data-Parallel Programming Model, pages 79– 103, Springer, 1996. 470 4. Rakesh Ghiya and Laurie J. Hendren. Is it a tree, a dag or a cyclic graph? In PoPL’96, pages 1–15, ACM, 1996. 471 5. Joseph Hummel, Laurie J. Hendren, and Alexandru Nicolau. A general dependence test for dynamic, pointer based data structures. In PLDI’94, pages 218–229, ACM Sigplan, 1994. 471 6. Laurie J. Hendren, Xinan Tang, Yingchun Zhu, Shereen Ghobrial, Guang R. Gao, Xun Xue, Haiying Cai, and Pierre Ouellet. Compiling C for the EARTH multithreaded architecture. Int. J. of Parallel Programming, 25(4):305–338, August 1997. 474, 475 7. Christian Lengauer. Loop parallelization in the polytope model. In Eike Best, editor, CONCUR’93, LNCS 715, pages 398–416, Springer-Verlag, 1993. 470 8. James R. Larus and Paul N. Hilfinger. Detecting conflicts between structure accesses. In PLDI’88, pages 31–34, ACM Sigplan, 1988. 471 9. R´emi Triolet, Fran¸cois Irigoin, and Paul Feautrier. Automatic parallelization of FORTRAN programs in the presence of procedure calls. In Bernard Robinet and R. Wilhelm, editors, ESOP 1986, LNCS 213, Springer-Verlag, 1986. 471
Optimal Orthogonal Tiling Rumen Andonov1 , Sanjay Rajopadhye2, and Nicola Yanev3 1
LIMAV, University of Valenciennes, France. [email protected] 2 Irisa, Rennes, France. [email protected] 3 University of Sofia, Bulgaria. [email protected]
Abstract. Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to orthogonal tiling—uniform dependency programs with (hyper) parallelepiped shaped iteration domains which can be tiled with hyperplanes parallel to the domain boundaries. Our formulation includes many machine and program models used in the literature, notably the bsp programming model. We resolve the optimization problem analytically, yielding a closed form solution.
1 Introduction
Tiling the iteration space [17,7,14] is a common method for improving the performance of loop programs on distributed memory machines. It may be used as a technique in parallelizing compilers as well as in performance tuning of parallel codes by hand (see also [6,13,10,15]). A tile in the iteration space is a (hyper) parallelepiped shaped collection of iterations to be executed as a single unit, with the communication and synchronization being done only once per tile. Typically, communication are performed by send/receive calls and also serve as synchronization points. The code for the body contains no communication calls. The tiling problem can be broadly defined as the problem of choosing the tile parameters (notably the shape and size) in an optimal manner. It may be decomposed into two subproblems: tile shape optimization [5], and tile size optimization [2,6,9,13] (some authors also attempt to resolve both problems under some simplifying assumptions [14,15]). In this paper, we address the tile size problem, which, for a given tile shape, seeks to choose the size (length along each dimension of the hyper parallelepiped) so as to minimize the total execution time. We assume that the dependencies are uniform, the iteration space (domain) is an n-dimensional hyper-rectangle, and the tile boundaries are parallel to the domain boundary (this is called orthogonal tiling). A sufficient condition for this that in all dependence vectors, all non-zero terms have the same sign. Whenever orthogonal tiling is possible, it leads to the simplest form of code; indeed most compilers do not even implement any other tiling strategy. We show here that when orthogonal tiling is possible the tile sizing problem is easily solvable even in its most general, n-dimensional case. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 480–490, 1998. c Springer-Verlag Berlin Heidelberg 1998
Our approach is based on the two step model proposed by Andonov and Rajopadhye for the 2-dimensional case in [2], and which was later extended to 3 dimensions in [3]. In this model we first abstract each tile by two simple parameters: tile period, P and inter-tile latency, L. We then “instantiate” the abstract model by accounting for specific architectural and program features, and analytically solve the resulting non-linear optimization problem, yielding the desired tile size. In this paper we first extend and generalize our model to the case where an n-dimensional iteration space is implemented on a k-dimensional (hyper) toroid (for any 1 ≤ k ≤ n − 1). We also consider a more general form of the specific functions for L and P. These functions are general enough to not only include a wide variety of machine and program models, but also be used with the bsp model [11,16], which is gaining wide acceptance as a well founded theoretical model for developing architecture independent parallel programs. Being an extension of previous results, the emphasis of this paper is on the new points in the mathematical framework and on the relations of our model to others proposed in the literature. More details about the definitions, notations and their interpretations, as well about some practical aspects of the experimental validations are available elsewhere [1,2,3]. The remainder of this paper is organized as follows. In the following section we develop the model and formulate the abstract optimization problem. In Section 3 we instantiate the model for specific machine and program parameters, and show how our model subsumes most of those used in the literature. In Section 4 we resolve the problem for the simpler HKT model which assumes that the communication cost is constant, independent of the message volume (and also a sub-case of the bsp model, where the network bandwidth is very high). Next, in Section 5 we resolve the more general optimization problem. We present our conclusions in Section 6.
2 Abstract Model Building
We first develop an analytical performance model for the running time of the tiled program. We introduce the notation required as we go along. The original iteration space is an N_1 × N_2 × ... × N_n hyper-rectangle, and it has (at least n linearly independent) dependency vectors, d_1, ..., d_m. The nonzero elements of the dependency vectors all have the same sign (say positive). Hence, orthogonal tiling is possible (i.e., it does not induce any cyclic dependencies between the tiles). Let the tiles be x_1 × x_2 × ... × x_n hyper-rectangles, and let q_i = N_i / x_i be the number of tiles in the i-th dimension. The tile graph is the graph where each node represents a tile and each arc represents a dependency between tiles; it can be modeled by a uniform recurrence equation [8] over a q_1 × q_2 × ... × q_n hyper-rectangle. It is well known that if the x_i's are large compared to the elements of the dependency vectors (in general this implies that the feasible value of each x_i is bounded from below by some constant; for the sake of clarity, we assume that this bound is 1), then the dependencies between the tiles are unit vectors (or binary linear combinations thereof, which can be neglected for analysis purposes without any loss of generality). A tile can be identified by an index vector z = [z_1, ..., z_n]. Two tiles, z and z', are said to be successive if z' − z is a unit vector. A simple analysis [7] shows that the earliest "time instant" (counting a tile execution as one time unit) at which tile z can be executed is t_z = z_1 + ... + z_n.

We map the tiles to p_1 × p_2 × ... × p_k processors, arranged in a k-dimensional hyper-toroid (for k < n). This is just an abstract machine architecture: our machine model is independent of the topology, since we will later assume that communication time is independent of distance and/or contention. The mapping is by projection onto the subspace spanned by k of the n canonical axes. To visualize this mapping, first consider the case k = n − 1, where tiles are allocated to processors by projection along (without loss of generality) the n-th dimension, i.e., tile z is executed by processor [z_1, ..., z_{n−1}]. This yields a (virtual) array of q_1 × q_2 × ... × q_{n−1} processors, each one executing all the tiles in the n-th dimension. In general p_i ≤ q_i, so this array is emulated by our p_1 × p_2 × ... × p_{n−1} hyper-toroid by using multiple passes (there are q_i/p_i passes in the i-th dimension). The case k < n − 1 is modeled by simply letting p_{k+1} = ... = p_{n−1} = 1.

The period, 𝒫, of a tile is defined as the time between executions of corresponding instructions in two successive tiles (of the same pass) mapped to the same processor. The latency, L, between tiles is defined to be the time between executions of corresponding instructions in two successive tiles mapped to different processors. Depending on the volume of the data transmitted and the nature of the program dependencies, the latency may be different for different dimensions of the hyper-toroid, and we use L_i (for i = 1 ... k) to denote the latency in the i-th dimension. We assume that by the time the first processor, 1 = [1 ... 1], finishes computing its macro column, at least one of the "last" processors (i.e., one of [1, ..., p_i, ..., 1], for 1 ≤ i ≤ k) has finished its first tile, so that processor 1 can immediately start another pass. There is no loss of generality in this assumption: were it not true, the processors would be idle between passes, and this would lead to a sub-optimal solution (this has been proved for 2 and 3 dimensions [2,3] and the arguments carry over).

Let W = Π_{i=1}^{n} N_i denote the total computation volume, v = Π_{i=1}^{n} x_i the tile volume, and P = Π_{i=1}^{k} p_i the total number of processors. The total number of tiles is Π_{i=1}^{n} q_i = W/v. Let p̃_i denote p_i − 1, P̃_k = Σ_{i=1}^{k} p̃_i, and v_max = (Π_{i=1}^{k} N_i)/P.

Let us first analyze a single pass. Each processor must execute q_n tiles, and this takes q_n 𝒫 time. However, the last processor (i.e., the one with coordinates [p_1, p_2, ..., p_k]) cannot start immediately because of the dependencies of the tile graph. Indeed, processor [p_1, 1, ..., 1] can only start at time (p_1 − 1)L_1, processor [p_1, p_2, ..., 1] at time (p_1 − 1)L_1 + (p_2 − 1)L_2, and hence processor [p_1, p_2, ..., p_k] can only start its first pass at time Σ_{i=1}^{k} p̃_i L_i. There are W/v tiles, of which P q_n are executed in each pass, and hence there are W/(v P q_n) passes. Because there is no idle time between passes, the last pass starts at time (W/(P v q_n) − 1) 𝒫 q_n. Thus, the last processor starts executing its last macro column at time instant (W/(P v q_n) − 1) 𝒫 q_n + Σ_{i=1}^{k} p̃_i L_i. It takes another 𝒫 q_n time, and hence the total running time and the corresponding optimization problem are as follows.

T(x_1, ..., x_n) = (W/(P v)) 𝒫 + Σ_{i=1}^{k} p̃_i L_i        (1)
3
Ni pi
for i = 1 . . . k, and ui = Ni for k < i ≤ n.
Machine and Program Specific Model
We now “instantiate” Prob. 1 for a specific program and machine architecture. The code executed for a tile is the standard loop: repeat receive(v1); receive(v2), ...,receive(vk) ; compute(body); send(v1); send(v2), ..., send(vk) ; end where vi denotes the message transmitted in the i-th dimension. We will now determine P and Li . Our development is based on Andonov & Rajopadhye [1], and uses standard assumptions about low level behavior of the architecture and program [4]. The sole difference is that each tile now makes k systems calls to send (and receive) messages. A tile depends directly on its n neighbors in each dimension. The volume of data transfer along the i-th dimension is proportional to the (hyper) surface of the tile in that dimension, i.e., j=i xj = v/xi . In the first k dimensions, this corresponds to an inter-processor communication, whereas in the dimensions k + 1 . . . n, the transfer is achieved through local memory. Hence the period of a tile can be written as follows.
P = k(βr + βs ) +
k 1 2τc v x i=1 i
+ αv
(3)
Here βs (resp. βr ) is the overhead of the send (resp. receive) system call, τc is the time (per byte) to copy from user to system memory, α is the computation time for a single instance of the loop body (we neglect the overhead of setting up the loop for each tile, as well as cache effects). Similarly, if 1/τt is the network bandwidth, the latency is as follows (see [2] for details).
484
Rumen Andonov et al.
v − kβr xi
k 1 v = kβs + 2τc v + αv + τt x xi i=1 i
Li = P + τt
(4)
Note that the βr ’s are subtracted because the receive occurs on a different processor, and when the sender and receiver are properly synchronized, the calls are overlapped. Substituting in Eqn. (1) and simplifying, we obtain
k v k + 2τc P Tk (x) = αv P
k k k 1 pi 2τc W 1 + τt v + + x x P i=1 xi i=1 i i=1 i
+ k(βr + βs ) 3.1
W k + αW + kβs P Pv P
(5)
Simplifying Assumptions and Particular Cases
The model of (5) is very general. One may want to specialize it for a number of reasons—say rendering the final optimization problem more tractable, or modeling a certain class of architectures or computations. It turns out that many of these simply consist of choosing the parameters appropriately in the above function. The HKT model, first used by Hiranandani et al. [6], corresponds to setting βr = τc = τt = 0 (a slightly more general version consists of letting βr be nonzero). This model assumes that communication cost is independent of the message size, but is dominated by the startup time(s) βs (and βr ). At first sight this may seem an oversimplification. However, in addition to making the mathematical problem more tractable, it is not far from the truth, as corroborated by other authors [12,13]. Indeed, experimental as well as analytic evidence [1] shows that on machines such as the Intel Paragon the more accurate models yield no observable difference in the predictions. With τc = τt = 0, we obtain the HKT cost function: αW W k + k(βr + βs ) + (αv + kβs )P (6) P Pv The BSP model [11,16] has been proposed as a formal model for developing architecture independent parallel programs. It is a bridge between the PRAM model which is general but somewhat unrealistic, and machine-specific models which lead to lack of portability and predictability of performance. Essentially, the computation is described in terms of a sequence of “super-steps” executed in parallel by all the processors. A super-step consists of (i) some local computation, (ii) some communication events launched during the super-step, and (iii) a synchronization which ensures that the communication events are completed. The time for a super-step is the sum of the times for each of the above activities. Tk (x) =
Optimal Orthogonal Tiling
485
This is very similar to our tile model: indeed, if we simply set τt = βr = 0 and βs = βk (β is the BSP synchronization cost) we obtain the running time of the program under the BSP model. With this simplification, we obtain the BSP cost function as follows. k βW αW W 1 + + (αv + β)Pk + 2τc Pk v + Tk (x) = P Pv P i=1 xi
(7)
In the BSP model, the communication startup cost is replaced by the synchronization cost. However, in our general cost function (5) we incur the startup cost k times. As a result, if we take a particular case of the BSP model where the network bandwidth is extremely high (the communication time is dominated by the synchronization cost), then this high bandwidth BSP cost function is given as follows (it is similar but not identical to the HKT model). βW αW k + + (αv + β)P (8) P Pv With some other simple modifications, our cost function can also model the overlap of communication and computation, which is often used as a performance tuning strategy. This is not detailed here due to space constraints. Tk (x) =
4
Solution for HKT and High Bandwidth BSP Models
In this section, we will focus only on the simple models (6, 8). Our main results are that the optimal tile volume and the dimension of optimal virtual architecture can be determined as a closed form solution. These results serve two important purposes, in spite of the apparent simplicity of the model. First, they are valid for a number of current machines where the communication latency and network bandwidth are both relatively high (such as the Intel Paragon, and a number of similar machines as well as networks of workstations). Second, they give a good indication of our solution method for the more general results. We first solve the problem for the HKT model, and then show how the high bandwidth BSP model follows almost directly.

4.1 Optimal Tile Volume
It is easy to see that the running time (6) depends only on the tile volume and is a strictly convex function of the form Tk(v) = A/v + Bv + C, which attains its optimal value of C + 2√(AB) at v = √(A/B). By substituting we obtain
v* = √( k(βr + βs)·W / (α·P·Pk) )        (9)

Tk* = αW/P + kβs·Pk + 2·√( k·α·(βr + βs)·W·Pk / P )        (10)
The optimal solution will be as given above if v* is a feasible tile volume. Now, observe that each xi is bounded from above by Ni/pi, and hence v ≤ W/P. Since W grows much faster than P, we have 1 ≤ v* ≤ W/P asymptotically. Hence we have the following result.
Theorem 1. The optimal tile volume and the corresponding running time for the HKT model are given by (9)-(10).
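Read as a recipe, Theorem 1 says: with A = k(βr + βs)W/P, B = αPk and C = αW/P + kβs·Pk, the optimum is v* = √(A/B) and Tk* = C + 2√(AB). The sketch below is purely illustrative (the clamping of v to the feasible range [1, W/P] follows the remark preceding the theorem).

from math import sqrt

def hkt_optimum(alpha, beta_s, beta_r, k, W, P, Pk):
    # T_k(v) = A/v + B*v + C with the HKT coefficients.
    A = k * (beta_r + beta_s) * W / P
    B = alpha * Pk
    C = alpha * W / P + k * beta_s * Pk
    v_opt = sqrt(A / B)                       # Eq. (9)
    v_opt = min(max(v_opt, 1.0), W / P)       # keep the volume feasible: 1 <= v <= W/P
    T_opt = A / v_opt + B * v_opt + C         # equals C + 2*sqrt(A*B) when unclamped, Eq. (10)
    return v_opt, T_opt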
4.2
Optimal Architecture
So far, we have assumed that k is a fixed constant, as are each of the pi's. In practice, we typically have P processors, the values of each pi are not specified, and neither is k, and we now solve this problem. Note that the dominant term in (10) is αW/P, the "ideal" parallel time with no overhead. The ideal architecture will seek to minimize the overhead, whose dominant term is O(√(W/P)). From (10) it can be easily deduced that for a given k and P, the optimal architecture is the one that minimizes Pk, i.e., the torus with pi = P^(1/k). Substituting this in (10), the coefficient of the overhead term is proportional to the square root of the function f(k, P) = (k/2)·P^(1/k) − k/2. Thus Tk* is minimized for the value of k that minimizes f(k, P), i.e., k* = 0.625 ln P. Since f(k, P) is monotonically decreasing up to k* (for each P) and monotonically increasing thereafter, we have the following result.
Corollary 1. If n ≤ k* the optimal architecture is a balanced n − 1 dimensional torus, and for n > k* it is either a balanced ⌊k*⌋ or a ⌈k*⌉ dimensional torus, depending respectively on whether f(⌊k*⌋, P) is smaller than f(⌈k*⌉, P) or not.
It is thus clear that as the number of processors grows, the optimal architecture tends towards an n − 1 dimensional hyper-torus. However, k* grows logarithmically with P, and for a limited number of processors (most practical cases), the optimal may not be n − 1 dimensional. Indeed, the sensitivity of k* with respect to the number of processors P is illustrated by the following: for up to 25 processors, the optimal architecture is a linear array, from 25–130 it is 2-dimensional, for 130–650 processors it is 3-dimensional, etc.
The extension of these results to the high bandwidth BSP model (Eqn 8) is straightforward, and indeed the mathematical treatment is a little simpler. The solution is the same as that given by (9)-(10), but with k, βr and βs respectively equal to 1, 0 and β. The sole subtle difference is that the function f(k, P) is k·P^(1/k) − k, and hence k* = +∞, i.e., f(k, P) is monotonically decreasing. Hence, the optimal architecture is always a balanced n − 1 dimensional torus. Finally, note that the optimization of the processor architecture yields a second order improvement; it does not affect the dominant term αW/P.
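The rule in Corollary 1 can be mechanized directly: compute k* = 0.625 ln P, and if more dimensions are available than k*, compare f at the two neighbouring integers. The sketch below is only an illustration of that rule, using the HKT-model f given above; it makes no attempt to reproduce the processor-count thresholds quoted in the text.

from math import log, floor, ceil

def f(k, P):
    # f(k, P) = (k/2) * P**(1/k) - k/2, the overhead coefficient (HKT model).
    return 0.5 * k * P ** (1.0 / k) - 0.5 * k

def optimal_dimension(n, P):
    k_star = 0.625 * log(P)
    if n <= k_star:
        return n - 1                            # balanced (n-1)-dimensional torus
    lo = max(1, floor(k_star))
    hi = min(n - 1, ceil(k_star))               # never use more than n-1 dimensions
    return lo if f(lo, P) <= f(hi, P) else hi

def balanced_torus(k, P):
    # balanced k-dimensional torus: p_i = P**(1/k) in each dimension
    return [P ** (1.0 / k)] * k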
5
Solution for the BSP Cost Function
We now solve our optimization problem for the BSP cost function (7) in the feasible space specified by (2). We will show that our problem can be decomposed into two special cases. The first case is very similar to the HKT model, but the second one is more complicated, for which we first solve the corresponding unconstrained optimization problem and then determine where the constrained solution lies. Finally, we show that the second solution is globally optimal asymptotically.
Lemma 1. The minimum of (7) over conv(R) is attained at either
xi = Ni/pi, for i = 1 . . . k,    or    xi = 1, for i = k + 1 . . . n.
Proof. Let v be an arbitrary feasible volume and let there exist two indices l, m with l ≤ k, k + 1 ≤ m ≤ n such that xl < Nl/pl and xm > 1. Then by increasing xl, decreasing xm and keeping their product constant, the function (7) strictly decreases and we obtain the needed.
Based on this, we have to look for the solution in the two regions of R corresponding to the above two conditions. Let R1 = R ∩ {xi = Ni/pi for i = 1 . . . k}, and R2 = R ∩ {xi = 1 for i = k + 1 . . . n}.

5.1 Case I
The cost function in region R1 can be simplified to

Tk(x) = (α + 2τc·Ñk)·W/P + βW/(Pv) + (α + 2τc·Ñk)·Pk·v + β·Pk        (11)

where Ñk = Σ_{i=1}^{k} pi/Ni. We can now use the same reasoning as in Section 4, and obtain the optimal volume and running time as follows:

v* = √( βW / (P·ω) )    and    Tk* = (α + 2τc·Ñk)·W/P + β·Pk + 2·√( βW·ω / P )        (12)

where ω = (α + 2τc·Ñk)·Pk. As before, appropriate values for xk+1 . . . xn in R1 that yield the optimal volume are all equivalent. Observe that in (12) we have a factor 2τc·Ñk in the dominant term, and this turns out to be important as we shall see later.

5.2 Case II
Let us now consider region R2. Here, v = Π_{i=1}^{k} xi, but the cost function remains the same as (7), except that we have only k variables to solve for. We obtain the solution in two steps.
Unconstrained Optimization. We first solve the problem in the entire positive orthant, without any constraints. This can be formulated as follows.
Prob. 2: Minimize (7) in the feasible space R+^k = { [x1 . . . xk]^T | xi ≥ 0 }.
Let Hv be the hyperboloid defined by { x ∈ R+^k | Π_{i=1}^{k} xi = v }. Observe that the set of families Hv is a partition of R+^k, i.e., R+^k = ∪v Hv, and hence

min_{R+^k} Tk(x) = min_v min_{Hv} Tk(x).

Thus, we first minimize (7) for a given tile volume v (i.e., over a given hyperboloid Hv) and then choose the volume. Now, observe that for a fixed v, (7) is of the form A + B·Σi 1/xi, whose minimum is attained at x ∈ Hv, x1 = x2 = . . . = xk. Hence, for a given tile volume v, the optimal tile shape is (hyper) cubic with xi = v^(1/k) for each i = 1 . . . k. Thus, Σ_{i=1}^{k} 1/xi = k/v^(1/k), and we can define

f(v) = min_{Hv} T(x) = A/v + B·v + 2kτc·Pk·v^(1−1/k) + kD·v^(−1/k) + αW/P + β·Pk        (13)

where A = βW/P, B = α·Pk, and D = 2τc·W/P. Now we have to determine the optimal tile volume by minimizing f(v) in the feasible space, 1 ≤ v ≤ vmax. It can easily be shown that f(v) is a unimodal function and attains its minimum at the root of f'(v) = 0. Unfortunately this is not easily obtained in closed form, but it is quite well approximated by the root of the function h(v) = −A/v² + B + 2τc·(k − 1)·Pk − D/v², whose zero is at √( (A + D) / (2τc·(k − 1)·Pk + B) ). Thus, we obtain the following (approximate) optimal solution of the unconstrained problem.
v* ≈ √( γW/P )    and    Tk* ≈ αW/P + 2kτc·(1/γ)^(1/(2k))·(W/P)^((2k−1)/(2k)) + O(√(W/P))        (14)
k ). Of course we could always determine where γ = (β+2τc )/((2τc (k−1)+α)P an exact solution if needed, but we have found the approximate solution to be reasonable in practice. Moreover, as we shall see later, it illustrates some interesting points. Also recall that in the problem as we have resolved so far, we assume that k and the pi ’s are fixed, as also the choice of which of the Ni ’s to map to the processor space. Constrained Optimization We now address the question of the restrictions on the optimal solution imposed by the feasibility constraints (2) namely 1 ≤ xi ≤ ui . Note that the unconstrained (global) optimal solution is on the intersection , where v is the of the line defined by the vector 1, and the hyperboloid H v optimal tile volume (14). If this intersection is outside the feasible space (2) we will need to solve the constrained optimization problem. We have observed that the optimal running time is extremely sensitive to the tile volume, and much less dependent on the particular values of xi (indeed this is predicted by the HKT model). We have the following cases
Case A: v* > vmax. In this case, the optimal volume hyperboloid does not intersect the feasible space, and according to the properties of the function f(v) the optimal tile size is given by xi = Ni/pi, for i = 1 . . . k. However, note that since v* = O(√(W/P)) while vmax = O(W/P), this case is unlikely.
Case B: v* ≤ vmax. Now, Hv* has a non-empty intersection with (2) and so our heuristic is to choose a solution by moving one of the xi's such that we move within the feasible region.

5.3 Where is the Global Optimum?
The optimal solutions for the two cases are given, respectively, by (12) for region R1 and by (14) for region R2. Most authors [6,13,12] have only considered region R1, either implicitly by not posing the problem in full generality, or by erroneously claiming that the solution is always in R1. This is incorrect because the final solution will always asymptotically be in R2, due to the additional factor 2τc·Ñk in the dominant term in (12) as compared to (14). The following simple example illustrates the prediction of Lemma 1: for n = 3, k = 2, p1 = p2 = 4, α = β = τc = 10^-6, N1 = 50, N2 = N3 = 1000, the optimal time in R1 is 1.8 while that in R2 is 1.5. More illustrative instances with real data can be found in [3].
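The comparison behind this observation can be scripted directly from (12) and (14). In the sketch below the architecture term Pk and the quantity Ñk are passed in explicitly (their definitions accompany Eqs. (5) and (11)); lower-order terms of (14) are dropped, and the sketch is meant as an illustration of which region wins, not as a reproduction of the numbers quoted above.

from math import sqrt

def T_region1(alpha, beta, tau_c, W, P, Pk, Ntilde_k):
    # Eq. (12): optimum over R1 (x_i = N_i/p_i for i = 1..k).
    omega = (alpha + 2 * tau_c * Ntilde_k) * Pk
    return (alpha + 2 * tau_c * Ntilde_k) * W / P + beta * Pk + 2 * sqrt(beta * W * omega / P)

def T_region2(alpha, beta, tau_c, k, W, P, Pk):
    # Eq. (14): approximate optimum over R2 (x_i = 1 for i = k+1..n), O(sqrt(W/P)) terms omitted.
    gamma = (beta + 2 * tau_c) / ((2 * tau_c * (k - 1) + alpha) * Pk)
    return alpha * W / P + 2 * k * tau_c * (1.0 / gamma) ** (1.0 / (2 * k)) * (W / P) ** ((2.0 * k - 1) / (2 * k))

def better_region(alpha, beta, tau_c, k, W, P, Pk, Ntilde_k):
    t1 = T_region1(alpha, beta, tau_c, W, P, Pk, Ntilde_k)
    t2 = T_region2(alpha, beta, tau_c, k, W, P, Pk)
    return ("R1", t1) if t1 < t2 else ("R2", t2)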
6
Conclusions
We addressed the problem of finding the tile size that minimizes the running time of SPMD programs. We formulated a discrete non-linear optimization problem using first an abstract model and then a specific machine model. The resulting cost function is general enough to subsume most of those in the literature, including the BSP model. We then analytically solved the resulting discrete non-linear optimization problem, yielding the desired solution. There are a number of open questions. The first one is the direct extension to the non-orthogonal case (when the tile boundaries cannot be parallel to the domain boundaries). We have addressed this elsewhere (for the 2-dimensional case) and formulated a non-linear optimization problem [2], but a closed form solution is not available. Finally, experimental validation on a number of target machines is the subject of our ongoing work.
References 1. R. Andonov, H. Bourzoufi, and S. Rajopadhye. Two-dimensional orthogonal tiling: from theory to practice. In International Conference on High Performance Computing, pages 225–231, Tiruvananthapuram, India, December 1996. IEEE. 481, 483, 484 2. R. Andonov and S. Rajopadhye. Optimal orthogonal tiling of 2-d iterations. Journal of Parallel and Distributed Computing, 45(2):159–165, September 1997. 480, 481, 482, 483, 489
3. R. Andonov, N. Yanev, and H. Bourzoufi. Three-dimensional orthogonal tile sizing problem: Mathematical programming approach. In ASAP 97: International Conference on Application-Specific Systems, Architectures and Processors, pages 209–218, Zurich, Switzerland, July 1997. IEEE, Computer Society Press. 481, 482, 489 4. S. Bokhari. Communication overheads on the Intel iPSC-860 Hypercube. Technical Report Interim Report 10, NASA ICASE, May 1990. 483 5. P. Boulet, A. Darte, T. Risset, and Y. Robert. (pen)-ultimate tiling? INTEGRATION, the VLSI journal, 17:33–51, Nov? 1994. 480 6. S. Hiranandani, K. Kennedy, and C-W. Tseng. Evaluating compiler optimizations for Fortran D. Journal Of Parallel and Distributed Computing, 21:27–45, 1994. 480, 484, 489 7. F. Irigoin and R. Triolet. Supernode partitioning. In 15th ACM Symposium on Principles of Programming Languages, pages 319–328. ACM, Jan 1988. 480, 482 8. R. M. Karp, R. E. Miller, and S. V. Winograd. The organization of computations for uniform recurrence equations. JACM, 14(3):563–590, July 1967. 481 9. C-T. King, W-H. Chou, and L. Ni. Pipelined data-parallel algorithms: Part I– concept and modelling. IEEE Transactions on Parallel and Distributed Systems, 1(4):470–485, October 1990. 480 10. C-T. King, W-H. Chou, and L. Ni. Pipelined data-parallel algorithms: Part II– design. IEEE Transactions on Parallel and Distributed Systems, 1(4):486–499, October 1990. 480 11. W. F. McColl. Scalable computing. In J. van Leeuwen, editor, Computer Science Today: Recent Trends and Developments, volume 1000, pages 46–61. Springer Verlag, 1995. 481, 484 12. H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjsutment in compiling general DOACROSS loop nests. In International Conference on Supercomputing, pages 270–279, Barcelona, Spain, July 1995. ACM. 484, 489 13. D. Palermo, E. Su, J. Chandy, and P. Banerjee. Communication optimizations used in the PARADIGM compiler for distributed memory multicomputers. In International Conference on Parallel Processing, pages xx–yy, St. Charles, IL, August 1994. IEEE. 480, 484, 489 14. J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for non shared-memory machines. In Supercomputing 91, pages 111–120, 1991. 480 15. R. Schreiber and J. Dongarra. Automatic blocking of nested loops. Technical Report 90.38, RIACS, NASA Ames Research Center, Aug 1990. 480 16. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990. 481, 484 17. M. J. Wolfe. Iteration space tiling for memory hierarchies. Parallel Processing for Scientific Computing (SIAM), pages 357–361, 1987. 480
Enhancing the Performance of Autoscheduling with Locality-Based Partitioning in Distributed Shared Memory Multiprocessors Dimitrios S. Nikolopoulos, Eleftherios D. Polychronopoulos, and Theodore S. Papatheodorou High Performance Computing Architectures Laboratory Department of Computer Engineering and Informatics University of Patras Rio 26500, Patras, Greece {dsn,edp,tsp}@hpclab.ceid.upatras.gr Abstract. Autoscheduling is a parallel program compilation and execution model that combines uniquely three features: Automatic extraction of loop and functional parallelism at any level of granularity, dynamic scheduling of parallel tasks, and dynamic program adaptability on multiprogrammed shared memory multiprocessors. This paper presents a technique that enhances the performance of autoscheduling in Distributed Shared Memory (DSM) multiprocessors, targetting mainly at medium and large scale systems, where poor data locality and excessive communication impose performance bottlenecks. Our technique partitions the application Hierarchical Task Graph and maps the derived partitions to clusters of processors in the DSM architecture. Autoscheduling is then applied separately for each partition to enhance data locality and reduce communication costs. Our experimental results show that partitioning achieves remarkable performance improvements compared to a standard autoscheduling environment and a commercial parallelizing compiler.
1
Introduction
Distributed Shared Memory (DSM) multiprocessors have evolved as a powerful and viable platform for parallel computing. Unfortunately, automatic parallelization for these systems becomes often an onerous duty, requiring numerous manual optimizations and substantial effort from the programmer to sustain high performance [3]. The development of programming models and compilation techniques that exploit the advantages and overcome the architectural bottlenecks of DSM systems remains a challenge. Among several parallel compilation and execution models proposed in the literature, autoscheduling [13] is one of the most promising approaches. Autoscheduling provides a compilation framework which merges the control flow and data flow models to efficiently exploit multiple levels of loop and functional parallelism. The compilation framework is coupled with an execution environment
This work was supported by the European Commission under ESPRIT IV project No. 21907 (NANOS)
that schedules parallel programs dynamically on a variable number of processors, by controlling the granularity of parallel tasks at runtime. Coordination between the compiler and the execution environment achieves effective integration of multiprocessing with multiprogramming. In this paper, we identify sources of performance degradation for an autoscheduling environment running on a DSM multiprocessor. Performance bottlenecks arise as a consequence of poor caching performance and excessive communication through the interconnection network. We present a technique that attempts to resolve these problems by augmenting autoscheduling with partitioning and clustering. We partition the application task graph in subgraphs with independent dataflows and map the derived subgraphs to processor clusters. The topology of the clusters conforms to the system architecture. Partitioning and clustering are performed dynamically at runtime. After mapping partitions to clusters, autoscheduling is applied separately to each partition to extract and schedule the parallelism. Using application and synthetic benchmarks, we demonstrate that partitioning reduces significantly the execution time of autoscheduled programs on a 32-processor SGI Origin2000. For the same set of benchmarks, partitioned autoscheduling outperforms the native SGI loop parallelizer, even in situations where standard autoscheduling fails to do so. The rest of this paper is organized as follows: Section 2 provides some background on autoscheduling and discuss its performance limitations. Section 3 presents our technique along with some implementation issues. Section 4 describes our evaluation framework and Section 5 provides detailed experimental results. We summarize our conclusions in Section 6.
2
Autoscheduling and Performance Limitations
An autoscheduling environment consists of a parallelizing compiler, a multithreading runtime library and an enhanced operating system scheduler. The compiler applies control and data dependence analysis to produce an intermediate program representation called the Hierarchical Task Graph (HTG) [5]. Programs are represented with a hierarchical structure consisting of simple and compound nodes. Simple nodes correspond to basic blocks of sequential code. Compound nodes represent tasks amenable to parallelization, such as parallel loops and parallel sections. The levels of the hierarchy are determined by the nesting of loops and correspond to different levels of exploitable parallelism. Nodes are connected with arcs representing execution precedences. A boolean expression with a set of firing conditions is associated with each node. When this expression evaluates to TRUE, all data and control dependences of the node are satisfied and the corresponding task is enabled for execution. The compiler generates parallel code by means of a user-level runtime system which creates and schedules lightweight execution flows called nano-threads [13]. The granularity of the generated parallelism is controlled dynamically at runtime. A communication mechanism between the runtime system and the operat-
ing system scheduler keeps parallel programs informed of processor availability and enables dynamic program adaptability to varying execution conditions [14]. Each compound node in the HTG is annotated as a candidate source of parallelism. The runtime system decomposes all the compound nodes which contain enough parallelism to utilize the processors available at a given execution instance. If several compound nodes coexist at the same level of the hierarchy, parallelism can be extracted from all of them simultaneously. In that case, nanothreads originating from different compound nodes are executed in an overlapped manner across the processors. As an example, consider the HTG of the multiplication of two matrices Y, Z with complex entries, calculated as: X = Y Z = (A + jB)(C + jD) = (AC − BD) + j(BC + AD)
(1)
The computation consists of four independent matrix by matrix multiplications, followed by an addition and a subtraction. Each of these tasks is a compound node with inherent loop parallelism. If a sufficient number of processors is available at runtime, autoscheduling may execute the matrix multiplications simultaneously and each matrix multiplication in parallel to exploit two levels of parallelism. Overlapped execution of nano-threads originating from different compound nodes can have undesirable effects on a DSM multiprocessor. This is particularly true when each compound node individually uses all the available processors. Three performance implications may arise in this case: Increase of conflict and capacity cache misses, heavy traffic in the interconnection network, and higher runtime overhead. Bursts of conflict and capacity misses occur when the working sets of concurrently executing compound nodes do not fit in the caches, thus forcing the caches to alternate frequently between different working sets. This happens as processors are switched between nanothreads originating from different compound nodes. Overlapped execution of compound nodes stresses the interconnection network, due to excessive data communication. The runtime system ovehead is higher, since multiple processors are forced to access heavily shared ready queues and memory pools, in order to create and schedule nano-threads. This side-effect exacerbates contention and communication traffic due to remote memory accesses [11].
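The two levels of parallelism in this example are easy to see in code: the four real matrix products are mutually independent tasks (functional parallelism), and each product is itself a data-parallel loop nest. The short NumPy sketch below is purely illustrative and is not the benchmark code used later in the paper.

import numpy as np

def complex_matmul(A, B, C, D):
    # Four independent matrix products, each of which is itself a parallel loop nest.
    AC, BD, BC, AD = A @ C, B @ D, B @ C, A @ D
    return AC - BD, BC + AD          # real and imaginary parts of X = (A + jB)(C + jD)

A, B, C, D = (np.random.rand(256, 256) for _ in range(4))
X_re, X_im = complex_matmul(A, B, C, D)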
3
Performance Enhancements
Our method to alleviate the problems presented in Sect. 2 while preserving the semantics of autoscheduling, is to decompose the program HTG into partitions with independent data flows and the processor space of the program into clusters, with a one-to-one mapping between HTG partitions and processor clusters. After partitioning and clustering, autoscheduling is applied separately for each partition. Parallelism from each partition is extracted to utilize the processors of the cluster to which the partition is mapped. Processor clusters serve as memory
locality domains for the partitions. This means that the working set of a partition is stored in physical memory that lies within the corresponding cluster, in order to reduce memory latency during the execution of the partition. The communication required for a partition is performed within the cluster, thus putting less pressure in the interconnection network and avoiding long distance memory accesses. Furthermore, data dependent partitions use the same processors to preserve locality. If a partition P is data dependent on a partition Q, our mechanism executes P either in a superset, or a subset of the cluster to which P is mapped.
The relative size of the clusters is determined with the following procedure: Compound nodes that can be simultaneously executed are detected in the HTG and grouped in sets. The computational requirements of each compound node are estimated with execution profiling and then weighed with the aggregate computational requirements of the compound nodes that belong to the same set. Processors are distributed proportionally to each compound node according to its weighed computational requirements. The actual size of the clusters is set at runtime, and depends on their precomputed relative size and processor availability. Let P be the number of processors available to the application at an execution instance, n the number of concurrently executing compound nodes at that instance and ri, i = 1 . . . n, the computational requirement of each compound node, expressed in CPU time. Then the processor set is partitioned into n clusters ci, i = 1 . . . n, and each cluster receives pci processors where:

pci = max{ 1, (ri / Σ_{i=1}^{n} ri) · P }        (2)
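A direct transcription of Eq. (2): given the profiled requirement ri of each concurrently executing compound node and the P processors currently available, each cluster gets a proportional share, never less than one processor. The sketch below is illustrative; rounding to whole processors is an added detail not specified by the equation.

def cluster_sizes(r, P):
    # Eq. (2): proportional processor allocation with a minimum of one per cluster.
    total = sum(r)
    return [max(1, round(ri / total * P)) for ri in r]

# Example: four equally expensive compound nodes on 16 available processors.
print(cluster_sizes([1.0, 1.0, 1.0, 1.0], 16))   # [4, 4, 4, 4]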
If the number of available processors does not exceed the degree of functional parallelism exposed by the application —expressed as the number of concurrently executing compound nodes—, then our technique degenerates to standard autoscheduling. Otherwise, autoscheduling is applied within the bounds of each cluster by extracting the data and/or functional parallelism of the associated HTG partition. The clusters can be dynamically reconfigured at runtime. Load imbalances between clusters are not considered. We rather rely on the proportional allocation of processors to clusters to handle these cases. However, load imbalances within the clusters across parallel loops or parallel sections are handled by the user-level scheduler which employs affinity scheduling and work stealing [15]. The topology of the clusters is kept consistent with the underlying system architecture. For DSM architectures, clusters are built incrementally using the Symmetric Multiprocessing (SMP) nodes as the basic building block. The interconnection network is exploited to form clusters with SMP nodes which are physically close to each other. The clusters can shrink or grow at runtime, either by splitting, or by joining directly connected clusters. The entire mechanism is implemented at user-level. Partitions are detected from the HTG, and communicated to the runtime system through directives. Static information about the relative processor requirements of each partition is also communicated to the runtime system. The actual cluster sizes are computed
at runtime using Eq. 2. The infrastructure for mapping the partitions to processors is provided by a hierarchy of ready queues [2]. The hierarchy corresponds to a tree with a global system queue at the root, per-processor local queues at the leaves and a number of intermediate levels of ready queues. Mapping of partitions to clusters is performed with Distributed Hierarchical Control (DHC) [4]. When a compound node is assigned to a cluster of size p, the corresponding nano-thread is mapped to a ready queue at the root of a subtree that has at least p processors at the leaves. The processor that creates the nano-thread searches the hierarchy up until it finds a queue from which a tree of the appropriate size is emanated. Nested nano-threads created by the outermost nano-thread are mapped to the queues that reside at the leaves of the subtree. Processor clusters are implicitly configured using DHC, without invoking the operating system to construct the memory locality domains. As an example, Fig. 1.(a) illustrates the task graph of complex matrix multiplication. Following our scheme, the processor set that executes the computation is initially partitioned in four clusters of equal size, and a matrix multiplication is executed separately in each cluster. Within a cluster, the matrix multiplication is autoscheduled on a number of processors equal to the size of the cluster. When the subtraction and addition tasks are activated, pairs of neighbouring clusters are joined to form two clusters of doubled size and execute the tasks. Figures 1.(b) and 1.(c) demonstrate executions of the task graph under standard autoscheduling and our scheme respectively. Circles indicate nano-threads, while their fill patterns indicate compound nodes from which the nano-threads originate. Under standard autoscheduling, the execution sequence of nano-threads from the concurrently executing compound nodes is determined from the serialization of processor accesses to the ready queues and may change between subsequent executions of the program. With partitioning, compound nodes are still executed concurrently, but the parallelism extracted from each partition is scheduled in isolation within a cluster. Figure 2 illustrates a hierarchy of ready queues for an 8-processor sytsem and the scheduling strategy for the complex matrix multiplication example.
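The queue-selection rule just described amounts to walking up a complete tree of ready queues until the subtree rooted at the current queue spans at least p leaf processors. The toy sketch below assumes a complete binary hierarchy and is only an illustration; the actual runtime structures differ.

def target_queue_level(home_processor, p, num_processors):
    # Walk up a complete binary hierarchy of ready queues, starting from the local
    # (leaf) queue of the creating processor, until the subtree rooted at the current
    # queue has at least p leaf processors.
    level, leaves = 0, 1
    while leaves < p and leaves < num_processors:
        level += 1
        leaves *= 2
    # Queues at `level` each cover `leaves` consecutive processors.
    return level, (home_processor // leaves) * leaves   # (level, first processor of subtree)

print(target_queue_level(home_processor=5, p=4, num_processors=8))   # (2, 4)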
Fig. 1. Complex matrix multiplication task graph and execution with standard and partitioned autoscheduling. ×, +, − indicate matrix-by-matrix operations.
Fig. 2. Hierarchy of ready queues and enqueuing of complex matrix multiplication tasks.
4
Evaluation Framework
We evaluate our technique with a number of application and synthetic benchmarks. The application benchmarks presented here were used in a previous study of autoscheduling [9]. This gives us the opportunity for a quantitative analysis of our results in direct comparison with existing work in the field. The purpose of the synthetic benchmarks is to demonstrate the performance implications of autoscheduling and the ability of our method to overcome them. The application benchmarks include the complex matrix multiplication kernel, the kernel of a Fourier-Chebyschev spectral computational fluid dynamics code [7], and an industrial molecular dynamics code [16]. All benchmarks have two levels of exploitable parallelism, namely a constant degree of functional parallelism at the outer lelel and data parallelism at the inner level. The complex matrix multiplication and the cfd kernels use matrices of size 256×256. The molecular dynamics kernel consists of 14 parallel functions, each one executing a parallel loop with 1000 iterations. More details on the structure of these benchmarks, as well as their performance under autoscheduling on a bus-based multiprocessor can be found in [9]. Our first synthetic benchmark consists of 8 parallel dot products. Each dot product is computed with two vectors of 64 Kilobytes of double precision elements. A detailed description of the benchmark can be found in [11]. Two levels of functional and data parallelism are exploitable, similarly to the application benchmarks. The dataflow of this benchmark favours processor clustering to maintain locality. The benchmark experiences increased conflict cache misses due to the alignment of data in memory, and a low ratio of computation to overhead because of the fine thread granularity. The second synthetic benchmark consists of four parallel point LU factorizations, followed by two parallel additions of the factorized matrices. The matrix size used is 256 × 256. Autoscheduling is expected to lead to conflicts between elements of different matrices in the caches, and increased remote cache invalidations due to the all to all communication pattern of the computation. Partitioning is expected to save cache misses and localize communication.
Table 1. Maximum speedups attained for the benchmarks with the three parallelization schemes.

Benchmark               Loop Parallel    Autoscheduled    Partitioned Autosch.
                        Smax     p       Smax     p       Smax     p
synthetic dot product   1.42     4       1.82     4        3.52     8
synthetic LU            7.06    16       7.34    16       15.72    32
cmm kernel              7.14    16       8.51    16       10.08    16
cfd kernel              5.59    12       6.22    12        7.31    12
md kernel               1.13    14       1.03    14        1.34    28
We performed our experiments on a SGI Origin2000 [6], equipped with 32 MIPS R1000 processors. We implemented partitioning and clustering using NthLib [8]. NthLib is a multithreading runtime library designed to support autoscheduling. The library is currently under development in the context of the ESPRIT project NANOS [10]. NthLib provides lightweight threads, created and scheduled entirely at user-level. The runtime system translates HTG representations into multithreading code, while attempting to preserve desirable properties such as data locality and load balancing. At the same time, NthLib communicates with the operating system to enable application adaptivity to the runtime conditions. Automatic parallelization was performed with HPF-like directives and the Parafrase-2 compiler [1,12]. Three versions of each benchmark were used in the experiments: (a) a loop parallel version produced by the native SGI parallelizing compiler and exploiting a single level of parallelism from loops; (b) a autoscheduled version, which exploits multiple levels of loop and functional parallelism; and (c) a partitioned autoscheduled version which uses autoscheduling enhanced with partitioning and clustering. All benchmarks were compiled with the -O3 optimization flag and executed in a dedicated environment.
5
Performance Results
In this section, we present detailed results from our experiments. Table 1 shows the maximum speedups attained for the benchmarks presented in Sect. 4 with the three parallelization schemes. The table also shows the number of processors on which the maximum speedup is attained in each case. The obtained speedups are justified as follows: For the synthetic dot product benchmark, poor speedup is a consequence of increased runtime overhead and cache conflicts. This benchmark scales better only if the problem size and thread granularity are significantly increased. The molecular dynamics kernel exhibits poor speedups due to poor data locality and high irregularity in the native code. Our results for this code agree with those reported in [9]. The other three codes exhibit good speedups, with respect to the structure of the computations, the platform architecture and the problem size. The primary conclusion drawn from Table 1 is that in all cases, the maximum attainable speedup is obtained from the partitioned autoscheduled versions. The
partitioned autoscheduled versions of the synthetic benchmarks and the molecular dynamics kernel scale up to a higher number of processors than both the autoscheduled and the loop parallel versions. The autoscheduled versions have always better speedups than the loop parallel versions with the exception of the molecular dynamics kernel. These results are consistent with the results in [9] and strengthen the argument that autoscheduling exploits unstructured parallelism more effectively. All benchmarks achieve better speedups and some of them do scale up to 32 processors if the problem sizes and thread granularities are adequately increased. However, we insisted in using small problem sizes to stress the ability of the runtime system to handle fine-grain parallelism, a prerequisite for the applicability of autoscheduling. We believe that parallel sections and loops with such a fine granularity are frequently met in scientific codes, and the question whether parallelizing these code fragments or not is worth the effort remains often dangling. Autoscheduling is a practical scheme for exploiting this kind of parallelism from scientific applications and our experiments intend to demonstrate this feature. A more thorough study of the codes as well as architecture-specific optimizations that could improve the speedups on our experimental platform are out of the scope of this paper. Figure 3 illustrates the normalized execution times of the benchmarks when executed on a variable number of processors. Normalization is performed against the sequential execution time excluding the parallelization overhead. The conclusions concerning the relative performance of the three parallelization methods, agree with those derived from Table 1. The partitioned autoscheduled versions exhibit good behaviour even when executed on 32 processors, despite the fact that most benchmarks do not scale up to this point. Compared to loop parallelization and standard autoscheduling, our technique improves speedup by a factor of two and 36% respectively for the complex matrix multiplication. The corresponding numbers for the CFD kernel are 55% and 45%. For the synthetic dot product benchmark, partitioned autoscheduling is the only mechanism that achieves any speedup on 32 processors. Two benchmarks that exhibit interesting behaviour are the synthetic LU benchmark and the molecular dynamics kernel. In the former, autoscheduling gives only marginal improvements compared to loop parallelization, while in the latter, autoscheduling results in performance degradation. In both cases, partitioning overcomes these limitations. The speedup of the synthetic LU benchmark is almost doubled with partitioning. The molecular dynamics kernel enjoys a noticeable speedup improvement of 19%. In order to gain a better insight into our results, we performed experiments that measured memory performance under the three parallelization schemes. We used the hardware counters of MIPS R10000 and the perfex utility to obtain various performance statistics. Figure 4 presents a summary of the results from these experiments. The charts show the total number of cache misses and remote invalidations for each benchmark when executed on 32 processors. The results indicate clearly that partitioned autoscheduling improves memory performance
by reducing cache misses and interconnection network traffic. Cache misses and remote invalidations are on average halved with partitioning and clustering.

Fig. 3. Normalized execution times of the benchmarks
6
Conclusions
This paper demonstrated that although autoscheduling exploits potentially more parallelism, it is still amenable to enhancements in order to achieve high performance in DSM multiprocessors. We presented a simple and easy to implement technique which takes advantage of data and communication locality by applying partitioning and clustering of autoscheduled programs. The results proved that autoscheduling remains the appropriate choice for exploiting multilevel parallelism when the mechanisms that extract and schedule the parallelism are aware of the target architecture. Several issues of partitioned autoscheduling need further investigation. Topics for further research might be the automation of the clustering procedure using program analysis in the compiler, and the evaluation of partitioned autoscheduling under multiprogramming conditions. Our current work is oriented towards these directions.
Acknowledgements We would like to thank Constantine Polychronopoulos for his support and valuable comments, the European Center for Parallelism in Barcelona (CEPBA) for providing us access to their Origin2000, all our partners in the NANOS project and George Tsolis for his help in improving the appearance of the paper.
Fig. 4. Memory performance of the benchmarks
References 1. E. Ayguad´e, X. Martorell, J. Labarta, M. Gonz` alez, and N. Navarro, Exploiting Parallelism through Directives on the Nano-Threads Programming Model, Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing, Minneapolis, Minnesota, August 1997. 497 2. S. Dandamundi, Reducing Run Queue Contention in Shared Memory Multiprocessors, IEEE Computer, Vol. 30(3), pp. 82–89, March 1997. 495 3. R. Eigenmann, J. Hoeflinger and D. Padua, On the Automatic Parallelization of Perfect Benchmarks, IEEE Transactions on Parallel and Distributed Systems, Vol. 9(1), pp. 5–23, January 1998. 491 4. D. Feitelson and L. Rudolph, Distributed Hierarchical Control for Parallel Processing, IEEE Computer, Vol. 23(5), pp. 65–79, May 1990. 495 5. M. Girkar and C. Polychronopoulos, Automatic Extraction of Functional Parallelism from Ordinary Programs, IEEE Transactions on Parallel and Distributed Systems, vol. 3(2), pp. 166–178, March 1992. 492 6. J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, Proceedings of the 24th Annual International Symposium on Computer Architecture, pp. 241–251, Denver, Colorado, June 1997. 497 7. S. Lyons, T. Hanratty and J. MacLaughlin, Large-scale Computer Simulation of Fully Developed Channel Flow with Heat Transfer, International Journal of Numerical Methods for Fluids, vol. 13, pp. 999–1028, 1991. 496 8. X. Martorell, J. Labarta, N. Navarro and E. Ayguad´e, A Library Implementation of the Nano-Threads Programming Model, Proceedings of Euro-Par’96, pp. 644–649, Lyon, France, August 1996. 497 9. J. Moreira, On the Implementation and Effectiveness of Autoscheduling for SharedMemory Multiprocessors, PhD Thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1995. 496, 497, 498
10. NANOS Consortium, Nano-Threads Programming Model Specification, ESPRIT Project No. 21907 (NANOS), Deliverable M1.D1, July 1997, also available at http://www.ac.upc.es/NANOS 497 11. D. Nikolopoulos, E. Polychronopoulos and T. Papatheodorou, Efficient Runtime Thread Management for the Nano-Threads Programming Model, Proceedings of the IPPS/SPDP’98 Workshop on Runtime Systems for Parallel Programming, LNCS vol. 1388, pp. 183–194, Orlando, Florida, March/April 1998. 493, 496 12. C. Polychronopoulos, M. Girkar, M. Haghighat, C. Lee, B. Leung and D. Schouten, Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing and Scheduling Programs, International Journal of High Speed Computing, Vol. 1 (1), 1989. 497 13. C. Polychronopoulos, N. Bitar and S. Kleiman, Nano-Threads: A User-Level Threads Architecture, CSRD Technical Report 1297, University of Illinois at Urbana-Champaign, 1993. 491, 492 14. E. Polychronopoulos, X. Martorell, D. Nikolopoulos, J. Labarta, T. Papatheodorou and N. Navarro, Kernel-Level Scheduling for the Nano-Threads Programming Model, Proceedings of the 12th ACM International Conference on Supercomputing, Melbourne, Australia, July 1998. 493 15. E. Polychronopoulos and T. Papatheodorou, Dynamic Bisectioning Scheduling for Scalable Shared-Memory Multiprocessors, Technical Report LHPCA-010697, University of Patras, June 1997. 494 16. B. Quentrec and C. Brot, New Method for Searching for Neighbors in Molecular Dynamics Computations, Journal of Computational Physics, Vol. 13, pp. 430–432, 1975. 496
Workshop 05+15
Distributed Systems and Databases
Lionel Brunie and Ernst Mayr
Co-chairmen
Distributed Systems The 15 papers in this workshop have been selected from the submissions to two areas: five from parallel and distributed databases, and another ten from distributed systems and algorithms. Parallel and distributed database systems are critical for many application domains such as high-performance transaction systems, data warehousing and interoperable information systems. A number of solutions are commercially available, but many open problems remain. Future database systems must support flexible and adaptive approaches for data allocation, load balancing and parallel query processing, both at the DML and at the transaction program level. Distributed databases have to meet new challenges by the Internet and by the integration into workflow management systems. New application areas such as decision support, data mining, text handling imply new requirements with respect to functionality and performance. Interprocessor communication has become fast enough that distributed systems can be used to solve highly parallel problems in addition to more traditional distributed problems (such as client-server applications). These distributed systems range from a local area network of homogeneous workstations to coordinated heterogenous workstations and supercomputers. Algorithmic and architectural solutions to problems from the fields of distributed and parallel processing (as well as new solutions) can often be applied or adapted to these kinds of systems. Typical examples are implementations of shared memory abstractions on top of message-passing systems, scheduling parallel applications on distributed heterogeneous systems, mechanism and abstractions for fault tolerance, and algorithms to provide elementary system functions and services. The first session on Distributed Systems deals with networks of processors, connected via various type of networks, like hypercubic networks or tree networks. The talks are concerned with issues of load balancing, with synchronization (mutual exclusion) between neighboring processors, and with fault propagation. The second session also addresses several areas. The first pair of talks discusses efficient protocols for inexpensive networks of workstations, the following paper presents some refinements of causality relations between non-atomic poset events in a distributed environment, and the last talk presents a new proposal to communicate channels over channels, an idea interesting for instance for mobile computing systems. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 503–504, 1998. c Springer-Verlag Berlin Heidelberg 1998
In the third session on Distributed Systems, some practical solutions to efficiency problems are presented. The first talk describes an SCI-based implementation of shared-memory and the operating system support for it, while the next paper presents a simple and efficient solution for garbage collection in an environment with task migration. Finally, a proposal based on active messages for improving communication efficiency in local networks of workstations gets discussed. We would like to express our sincere appreciation to the other members of the Program Committee, Prof. Paul Spirakis, Prof. Friedemann Mattern, and Dr. Marios Mavronicolas, for their invaluable help in the entire selection process. We would also like to thank the numerous reviewers for their time and effort.
Parallel and Distributed Databases The distributed and parallel databases session is composed of 5 papers. 2 papers concern parallel databases while 3 papers address distributed databases topics. Though object oriented databases are widely accepted as a very promising paradigm, only a few papers have addressed the parallelization of object oriented queries. The first paper is one of them. Authors propose new algorithms for the paralllel processing of collection intersection join queries, i.e. queries which check for whether there is an intersection between collection attributes (e.g. sets, lists, arrays...). Performance evaluation of these algorithms is very promising. With the huge development of the Internet, information retrieval (IR) has received a new attention. In this framework, symmetric multiprocessors, by exploiting the parallelism implicit in the queries, allow processing tremendous loads. The paper by Lu et al. investigate how to balance hardware and software resources to exploit a symmetric multiprocessor devoted to IR. Last papers explore three of the main issues in distributed database management systems : update protocols, query optimization and concurrency control. The paper by Pedone et al. proposes to use atomic broadcast primitives to ensure query termination in replicated databases. Experiments demonstrate that this approach shows a better throughput and response time than traditional atomic commitment methods. F. Najjar and Y. Slimani give a novel insight on distributed query optimization based on semi-joins. Using dynamic programming techniques, the optimization heuristics they propose allow exploiting all the parallelism inherent to the query. Finally, Boukerche and al. investigate how to adapt virtual time techniques, classicaly used in distributed systems and VLSI design, to concurrency control problems in distributed databases. They show this new approach outperforms multiversion timestamp techniques as the number of conflicts or transaction size increases. In summary, this session proposes a very exciting panel of papers which give new insights on some of the hottest issues in parallel and distributed databases.
Collection-Intersect Join Algorithms for Parallel Object-Oriented Database Systems
David Taniar (1) and J. Wenny Rahayu (2)
(1) Monash University - GSCIT, Churchill, Vic 3842, Australia, [email protected]
(2) La Trobe University, Dept. of Computer Sc. & Comp. Eng., Bundoora, Vic 3083, Australia, [email protected]
Abstract. One of the differences between relational and object-oriented databases (OODB) is that attributes in OODB can of a collection type (e.g. sets, lists, arrays, bags) as well as a simple type (e.g. integer, string). Consequently, explicit join queries in OODB may be based on collection attributes. One form of collection join queries in OODB is collectionintersect join queries, where the joins are based on collection attributes and the queries check for whether there is an intersection between the two join collection attributes We propose two algorithms for parallel processing of collection-intersect join queries. The first one is based on sortmerge, and the second is based on hash. We also present two data partitioning methods (i.e. simple replication and ”divide and partial broadcast”) used in conjunction with the parallel collection-intersect join algorithms. The parallel sort-merge algorithm can only make use of the divide and partial broadcast data partitioning, whereas the parallel hash algorithm may have a choice which of the two data partitioning to use.
1
Introduction
In Object-Oriented Databases (OODB), although path expression between classes may exist, it is sometimes necessary to perform an explicit join between two or more classes due to the absence of pointer connections or the need for value matching between objects. Furthermore, since objects are not in a normal form, an attribute of a class may have a collection as a domain. Collection attributes are often mistakenly considered merely as set-valued attributes. As a matter of fact, set is just one type of collections. There are other types of collection. The Object Database Standard ODMG [1] defines different kinds of collections: particularly set, list/array, and bag. Consequently, object-oriented join queries may also be based on attributes of any collection type. Such join queries are called collection join queries [9]. Our previous work reported in [9] classified three different types of collection join queries, namely: collection-equi join, collection-intersect join, and sub-collection join. In this paper, we would like to focus on collection-intersect join queries. We are particularly interested in formulating parallel algorithms for processing such queries. The algorithms are non-trivial to
parallel object-oriented database systems, since most conventional join algorithms (e.g. hybrid hash join, sort-merge join) deal with single-valued attributes and hence most of the time they are not suitable to handle collection join queries. Collection-intersect join queries are queries that join two classes based on an attribute of a collection type. The join predicates check for whether there is an intersection between the two collection join attributes. An intersect predicate can be written by applying an intersection between the two sets and comparing the intersection result with an empty set. It is normally in a form of (attr1 intersect attr2) != set(nil). Attributes attr1 and attr2 are of type set. If one or both of them are of type bag, they must be converted to sets. Suppose the attribute editor-in-chief of class Journal and the attribute program-chair of class Proceedings are of type sets of Person. An example of a collection-intersect join is to retrieve pairs of Journal and Proceedings, where the program-chairs of a conference are intersect with the editors-in-chief of a journal. The query expressed in OQL (Object Query Language) [1] can be written as follows:
Select A, B
From A in Journal, B in Proceedings
Where (A.editor-in-chief intersect B.program-chair) != set(nil)

As can be clearly seen, the intersection join predicates involve the creation of intermediate results through an intersect operator. The result of the join predicate cannot be determined without the presence of the intermediate collection result. This predicate processing is certainly not efficient. In a collection-intersect join query, the original subset predicate has to produce an intermediate set, before it can be compared with an empty set. This process checks for the smaller set twice: one for an intersection, the other for an equality comparison. Therefore, optimization algorithms for efficient processing of such queries are critical if one wants to improve query processing performance. In this paper, we present two parallel join algorithms for collection-intersection join queries. The primary intention is to solve the inefficiency imposed by the original join predicates.
An interest in parallel OODB among the database community has been growing rapidly, following the popularity of multiprocessor servers and the maturity of OODB. The merging of parallel technology and OODB has shown promising results [3], [5], [7], [11]. However, most research done in this area concentrated on path expression queries with pointer chasing. Explicit join processing exploiting collection attributes has not been given much attention.
The rest of this paper is organized as follows. Section 2 explains two data partitioning methods for parallel collection-intersect join algorithms. Section 3 describes a proposed parallel join algorithm for collection-intersect join queries based on a sort-merge technique. Section 4 introduces another algorithm, which is based on a hash technique. Finally, Section 5 draws the conclusions and explains the future work.
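The inefficiency discussed above (materializing the full intersection only to test it for emptiness) can be contrasted with a short-circuit existence test. The Python sketch below is purely illustrative and is not one of the algorithms proposed later in this paper.

def naive_intersect_predicate(attr1, attr2):
    # Mirrors the OQL predicate: build the full intersection, then compare with the empty set.
    return len(set(attr1) & set(attr2)) != 0

def short_circuit_intersect_predicate(attr1, attr2):
    # Existence test: stop as soon as one common element is found.
    smaller, larger = (attr1, attr2) if len(attr1) <= len(attr2) else (attr2, attr1)
    larger = set(larger)
    return any(e in larger for e in smaller)

print(naive_intersect_predicate([250, 75], [123, 210]))          # False
print(short_circuit_intersect_predicate([250, 75], [75, 40]))    # True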
2
Data Partitioning
Parallel join algorithms are normally decomposed into two steps: data partitioning and local join. Data partitioning creates parallelism, as it divides the data to multiple processors, so that the join can then be performed locally in each processor without interfering with others. For collection-intersect join queries, it is not possible to have non-overlapping partitions, due to the nature of collections, which may overlap. Hence, some data needs to be replicated. Two non-disjoint partitioning methods are proposed. The first is a simple replication based on the value of the element in each collection. The second is a variant of Divide and Broadcast [4], called "Divide and Partial Broadcast".

2.1 Simple Replication
Using a simple replication technique, each element in a collection is treated as a single unit, and is totally independent of other elements within the same collection. Based on the value of an element in a collection, the object is placed into a particular processor. Depending on the number of elements in a collection, an object that owns a collection may be placed into several processors. If an object has already been placed at a particular processor because of one element, and another element of the same collection maps to the same processor, no further replication is necessary. As a running example, consider the data shown in Figure 1. Suppose class A and class B are Journal and Proceedings, respectively. Both classes contain a few objects shown by their OIDs (e.g., objects a to i are Journal objects and objects p to w are Proceedings objects). The join attributes are editor-in-chief of Journal and program-chair of Proceedings, both of type collection of Person. The OIDs of the persons in these attributes are shown in brackets. For example, a(250, 75) denotes a Journal object with OID a whose editors are the Persons with OIDs 250 and 75. Figure 2 shows an example of the simple replication technique. The elements printed in bold are the basis for the placement of the objects. For example, object a(250, 75) at processor 1 denotes the placement of object a at processor 1 because of element 75 in its collection; object a(250, 75) at processor 3 is a copy of object a placed there because of the first element (i.e., element 250). It is clear that object a is replicated to processors 1 and 3. On the other hand, object i(80, 70) is not replicated, since both elements place the object at the same processor, that is processor 1.
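The following Python sketch shows one way such value-based placement with replication could be realised, assuming the same range assignment as the running example (ranges 0-99, 100-199 and 200-299 for processors 1, 2 and 3). It is an illustration only, not the authors' implementation.

RANGES = [(0, 99), (100, 199), (200, 299)]    # element ranges owned by processors 1, 2, 3

def processor_of(element):
    for proc, (low, high) in enumerate(RANGES, start=1):
        if low <= element <= high:
            return proc
    raise ValueError("element outside all ranges")

def simple_replication(objects):
    # objects: dict mapping OID -> collection of element values
    placement = {proc: [] for proc in range(1, len(RANGES) + 1)}
    for oid, collection in objects.items():
        for proc in sorted({processor_of(e) for e in collection}):
            placement[proc].append(oid)      # one copy per distinct target processor
    return placement

journals = {"a": [250, 75], "i": [80, 70]}   # sample objects from Figure 1
print(simple_replication(journals))          # a goes to processors 1 and 3, i only to processor 1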
2.2 Divide and Partial Broadcast
The Divide and Partial Broadcast algorithm, shown in Figure 3, proceeds in two steps. The first step is a divide step, and the second step is a partial broadcast step. We divide class B and partially broadcast class A. The divide step is explained as follows. Divide class B into n partitions. Each partition of class B is placed at a separate processor (e.g. partition B1 to processor 1, partition
Fig. 1. Sample data
Class A (Journal), editor-in-chief OIDs in brackets: a(250, 75), b(210, 123), c(125, 181), d(4, 237), e(289, 290), f(150, 50, 250), g(270), h(190, 189, 170), i(80, 70)
Class B (Proceedings), program-chair OIDs in brackets: p(123, 210), q(237), r(50, 40), s(125, 180), t(50, 60), u(3, 1, 2), v(100, 102, 270), w(80, 70)
Fig. 2. Simple replication
Processor 1 (range 0-99). Class A: a(250, 75), d(4, 237), f(150, 50, 250), i(80, 70); Class B: r(50, 40), t(50, 60), u(3, 1, 2), w(80, 70)
Processor 2 (range 100-199). Class A: b(210, 123), c(125, 181), f(150, 50, 250), h(190, 189, 170); Class B: p(123, 210), s(125, 180), v(100, 102, 270)
Processor 3 (range 200-299). Class A: a(250, 75), b(210, 123), d(4, 237), e(289, 290), f(150, 50, 250), g(270); Class B: p(123, 210), q(237)
B2 to processor 2, etc). Partitions are created based on the largest element of each collection. For example, object p(123, 210), the first object in class B, is partitioned based on element 210, as 210 is the largest element in its collection. Object p is then placed in a partition according to the partition ranges. For example, if the first partition covers largest elements 0 to 99, the second 100 to 199, and the third 200 to 299, then object p is placed in partition B3, and consequently at processor 3. This is repeated for all objects of class B.
Procedure DividePartialBroadcast
Begin
  Step 1 (divide):
    1. Divide class B based on the largest element in each collection.
    2. For each partition of B (i = 1, 2, ..., n)
         Place partition Bi to processor i
       End For
  Step 2 (partial broadcast):
    3. Divide class A based on the smallest element in each collection.
    4. For each partition of A (i = 1, 2, ..., n)
         Broadcast partition Ai to processor i to n
       End For
End Procedure
Fig. 3. Divide and Partial Broadcast Algorithm
The partial broadcast step can be described as follows. First, partition class A based on the smallest element of each collection. Clearly, this partitioning method is exactly the opposite of that in the divide step. Then, for each partition Ai where i = 1 to n, broadcast partition Ai to processors i to n. This broadcasting technique is said to be partial, since the number of processors receiving a partition decreases as the partition number increases. For example, partition A1 is replicated to all processors, partition A2 is broadcast to processors 2 to n only, and so on. Regarding the load of each processor, the load of the last processor may be the heaviest, as it receives a full copy of class A and a portion of class B. The load decreases toward the lower-numbered processors, which receive fewer partitions of class A (e.g., processor 1). Load balancing can be achieved by applying the same algorithm to each partition but with the roles of A and B reversed; that is, divide A and partially broadcast B. It is beyond the scope of this paper to evaluate the Divide and Partial Broadcast partitioning method; this is reserved for future work, and some preliminary results have been reported in [10].
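A compact Python illustration of the two steps described above follows. It assumes a partition_of function mapping an element value to a partition number 1..n (for instance, the range assignment of the running example), and it is a sketch of the procedure in Figure 3 rather than the authors' code.

def divide_and_partial_broadcast(class_a, class_b, n, partition_of):
    # class_a, class_b: dicts OID -> collection; partition_of maps an element to 1..n
    placement = {proc: {"A": [], "B": []} for proc in range(1, n + 1)}
    # Step 1 (divide): each B object goes to the partition of its largest element.
    for oid, coll in class_b.items():
        placement[partition_of(max(coll))]["B"].append(oid)
    # Step 2 (partial broadcast): partition A on the smallest element and
    # broadcast partition Ai to processors i..n.
    for oid, coll in class_a.items():
        for proc in range(partition_of(min(coll)), n + 1):
            placement[proc]["A"].append(oid)
    return placement

Any pair of objects whose collections intersect still meets at some processor: if they share an element x, the smallest element of the A collection is at most x and x is at most the largest element of the B collection, so (with partition ranges ordered by value) the A object is broadcast at least as far as the processor that received the B object.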
3 Parallel SORT-MERGE Join Algorithm
The partitioning strategy for the parallel sort-merge collection-intersect join algorithm is based on the Divide and Partial Broadcast technique. Divide and Partial Broadcast is attractive for collection joins because, given the nature of collections, disjoint partitions without replication are often not achievable. After data partitioning is completed, each processor has its own data. The join operation can then be done independently. The overall query result is the union of the results from each processor. In the local joining process, each collection is sorted. Sorting is done within collections, not among collections. After the sorting process is completed, the merging process starts. We use a nested loop structure to compare the two collections from the two operand objects. Figure 4 shows the parallel sort-merge join algorithm for collection-intersect join queries.
Program Parallel-Sort-Merge-Collection-Intersect-Join
Begin
  Step 1 (data partitioning):
    Call DividePartialBroadcast
  Step 2 (local joining): In each processor
    a. Sort phase
       For each object a(c1) and b(c2) of class A and B, respectively
         Sort collection c1 and c2
       End For
    b. Merge phase
       For each object a(c1) of class A
         For each object b(c2) of class B
           Merge collection c1 and c2
           If TRUE Then
             Concatenate objects a and b into query result
           End If
         End For
       End For
End Program
Fig. 4. Parallel Sort-Merge Collection-Intersect Join Algorithm
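Since the merge phase only needs to decide whether two sorted collections share at least one element, it can stop at the first match. The Python sketch below illustrates such a merge test and the surrounding nested loop; it is an illustration of the technique, not the paper's implementation.

def sorted_collections_intersect(c1, c2):
    # c1 and c2 are sorted lists; return True as soon as a common element is found.
    i = j = 0
    while i < len(c1) and j < len(c2):
        if c1[i] == c2[j]:
            return True
        if c1[i] < c2[j]:
            i += 1
        else:
            j += 1
    return False

def local_sort_merge_join(class_a, class_b):
    # class_a, class_b: dicts OID -> collection, already partitioned to this processor.
    sorted_a = {oid: sorted(c) for oid, c in class_a.items()}
    sorted_b = {oid: sorted(c) for oid, c in class_b.items()}
    return [(a, b) for a, ca in sorted_a.items()
                   for b, cb in sorted_b.items()
                   if sorted_collections_intersect(ca, cb)]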
4 Parallel HASH Join Algorithm
In this section we introduce a parallel join algorithm based on a hash method for collection-intersect join queries. Like the Parallel Sort-Merge algorithm, the Parallel Hash algorithm is divided into data partitioning and local join phases. However, unlike the parallel sort-merge algorithm, data partitioning
for the parallel hash algorithm is available in two forms: Divide and Partial Broadcast and Simple Replication. Once data partitioning is complete, each processor has its own data, and hence the local join process can proceed. The local join process itself is divided into two steps: hash and probe. The hashing is carried out on one class, whereas the probing is performed on the other class. The hashing step runs through all elements of each collection of the first class and inserts them into a hash table. The probing step is done in a similar way, but is applied to the other class. Figure 5 shows the pseudo-code of the parallel hash join algorithm for collection-intersect join queries.
Program Parallel-Hash-Collection-Intersect-Join
Begin
  Step 1 (data partitioning):
    Divide and Partial Broadcast version: Call DivideAndPartialBroadcast partitioning
    Simple Replication version: Call SimpleReplication partitioning
  Step 2 (local joining): In each processor
    a. Hash
       For each object a(c1) of class A
         Hash collection c1 to a hash table
       End For
    b. Probe
       For each object b(c2) of class B
         Hash and probe collection c2 into the hash table
         If there is any match Then
           Concatenate obj b and the matched obj a into query result
         End If
       End For
End Program
Fig. 5. Parallel Hash Collection-Intersect Join Algorithm
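A minimal Python sketch of the local hash-and-probe step follows: the hash table maps each element value to the A objects whose collections contain it, so probing a B collection yields the matching A objects directly. It is an illustration of the technique only, not the authors' implementation.

from collections import defaultdict

def local_hash_join(class_a, class_b):
    # class_a, class_b: dicts OID -> collection, already partitioned to this processor.
    # Hash step: index every element of every A collection.
    table = defaultdict(set)
    for a_oid, coll in class_a.items():
        for element in coll:
            table[element].add(a_oid)
    # Probe step: each element of a B collection is probed against the table.
    result = set()
    for b_oid, coll in class_b.items():
        for element in coll:
            for a_oid in table.get(element, ()):
                result.add((a_oid, b_oid))
    return result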
5 Conclusions and Future Work
The need for join algorithms especially designed for collection-intersect join queries is clear, as collection-intersect join predicates normally require intermediate results to be generated before the final predicate results can be determined. This is certainly not optimal. In this paper, we present two algorithms especially designed for collection-intersect join queries in object-oriented databases, namely the Parallel Sort-Merge and Parallel Hash algorithms. Data partitioning
methods, which create parallelism, are based on either simple replication or "divide and partial broadcast". Parallel Sort-Merge can make use of only the divide and partial broadcast method, whereas Parallel Hash has a choice between the two partitioning methods. Once a data partitioning method is applied, the local join is carried out by either a sort-merge operator (in the case of the parallel sort-merge algorithm) or a hash function (in the case of parallel hash). The local join process is therefore quite straightforward, as each processor performs sequential sort-merge or hash operations. Hence, the critical element is the data partitioning method. Our future work includes evaluating the two data partitioning methods (and possibly other partitioning methods) for parallel collection-intersect join algorithms, since data partitioning plays an important role in the overall efficiency of parallel collection join algorithms.
References
1. Cattell, R.G.G. (ed.), The Object Database Standard: ODMG-93, Release 1.1, Morgan Kaufmann, 1994.
2. DeWitt, D.J. and Gray, J., "Parallel Database Systems: The Future of High Performance Database Systems", Communications of the ACM, 35(6), pp. 85-98, 1992.
3. Kim, K-C., "Parallelism in Object-Oriented Query Processing", Proceedings of the Sixth International Conference on Data Engineering, pp. 209-217, 1990.
4. Leung, C.H.C. and Ghogomu, H.T., "A High-Performance Parallel Database Architecture", Proceedings of the Seventh ACM International Conference on Supercomputing, pp. 377-386, Tokyo, 1993.
5. Leung, C.H.C. and Taniar, D., "Parallel Query Processing in Object-Oriented Database Systems", Australian Computer Science Communications, 17(2), pp. 119-131, 1995.
6. Mishra, P. and Eich, M.H., "Join Processing in Relational Databases", ACM Computing Surveys, 24(1), pp. 63-113, March 1992.
7. Taniar, D. and Rahayu, W., "Parallelization and Object-Orientation: A Database Processing Point of View", Proceedings of the Twenty-Fourth International Conference on Technology of Object-Oriented Languages and Systems TOOLS ASIA'97, Beijing, China, pp. 301-310, 1997.
8. Taniar, D. and Rahayu, J.W., "Parallel Collection-Equi Join Algorithms for Object-Oriented Databases", Proceedings of the International Database Engineering and Applications Symposium IDEAS'98, IEEE Computer Society Press, Cardiff, UK, July 1998.
9. Taniar, D. and Rahayu, J.W., "A Taxonomy for Object-Oriented Queries", a book chapter in Current Trends in Database Technology, Idea Group Publishing, in press (1998).
10. Taniar, D. and Rahayu, J.W., "Divide and Partial Broadcast Method for Parallel Collection Join Queries", High-Performance Computing and Networking, P. Sloot et al. (eds.), Lecture Notes in Computer Science 1401, Springer-Verlag, pp. 937-939, 1998.
11. Thakore, A.K. and Su, S.Y.W., "Performance Analysis of Parallel Object-Oriented Query Processing Algorithms", Distributed and Parallel Databases, 2, pp. 59-100, 1994.
Exploiting Atomic Broadcast in Replicated Databases Fernando Pedone, Rachid Guerraoui, and André Schiper Département d'Informatique Ecole Polytechnique Fédérale de Lausanne 1015 Lausanne, Switzerland
Abstract. Database replication protocols have historically been built on top of distributed database systems, and have consequently been designed and implemented using distributed transactional mechanisms, such as atomic commitment. We argue in this paper that this approach is not always adequate to efficiently support database replication and that more suitable alternatives, such as atomic broadcast primitives, should be employed instead. More precisely, we show in this paper that fully replicated database systems, based on the deferred update replication model, have better throughput and response time if implemented with an atomic broadcast termination protocol than if implemented with atomic commitment.
1 Introduction
In the database context, replication techniques based on the deferred update model have received increasing attention in the past years [6]. According to the deferred update model, transactions are processed locally at one server (i.e., one replica manager) and, at commit time, are forwarded for certification to the other servers (i.e., the other replica managers). Deferred update replication techniques offer many advantages over immediate update techniques, which synchronise every transaction operation across all servers. Among these advantages, one may cite: (a) better performance, by gathering and propagating multiple updates together, and localising the execution at a single, possibly nearby, server (thus reducing the number of messages in the network), (b) more flexibility, by propagating the updates at a convenient time (e.g., during the next dial-up connection), (c) better support for fault tolerance, by simplifying server recovery (i.e., missing updates may be requested from other servers), and (d) lower deadlock rate, by eliminating distributed deadlocks [6]. Nevertheless, deferred update replication techniques have two limitations. Firstly, the termination protocol used to propagate the transactions to other servers for certification is usually an atomic commitment protocol (e.g., a 2PC algorithm), whose cost directly impacts transaction response time. Secondly, the
Research supported by the EPFL-ETHZ DRAGON project and OFES under contract number 95.0830, as part of the ESPRIT BROADCAST-WG (number 22455). On leave from Colégio Técnico Industrial, University of Rio Grande, Brazil
certification procedure that is usually performed at transaction termination time consists in aborting all conflicting transactions.1 Such a certification procedure typically leads to a high abort rate if conflicts are frequent, and hence impacts transaction throughput. In the context of client-server distributed systems, most replication schemes (that guarantee strong replica consistency) are based on atomic broadcast communication [3,9]. An atomic broadcast communication primitive makes it possible to send messages to a set of processes, with the guarantee that the processes agree on the set of messages delivered, and on the order according to which the messages are delivered [7]. With this guarantee, consistency is trivially ensured if every operation on a replicated server is distributed to all replicas using atomic broadcast. Although several authors have mentioned the possibility of using atomic broadcast to support replication schemes in a database context (e.g., [3,11]), little work has been done in that direction. In this paper, we show that atomic broadcast can successfully be used to improve the performance of the database deferred update replication technique. In particular, we show that, for different resilience scenarios, the deferred update replication technique based on atomic broadcast provides better transaction throughput and response time than a similar scheme based on atomic commitment. The paper is organised as follows. In Section 2, we present the replicated database model, and we recall the principle of the deferred update replication technique. In Section 3, we describe a variation of this replicated technique based on atomic broadcast. In Section 4, we compare the transaction throughput and response time of deferred update replication based on atomic broadcast with the throughput and response time of deferred update replication based on classical atomic commitment. Section 5 summarises the main contributions of the paper and discusses some research perspectives.
2 The Deferred Update Technique
2.1 Replicated Database Model
We consider a replicated database system composed of a set of processes Σ = {p1, p2, . . . , pn}, each one executing on a different processor, without shared memory or access to a common clock. Each process has a replica of the database and plays the role of a replica manager. Processes may fail by crashing, and can recover after a crash. We assume that processes do not behave maliciously (i.e., we exclude Byzantine failures). If a process p is able to execute requests at a certain time τ (i.e., p did not fail, or p has failed but recovered), we say that the
1 Note that we do not consider here replication protocols that allow replica divergence and require reconciliations. We focus on replication schemes that guarantee one-copy serializability [2].
process p is up at time τ. Otherwise the process p is said to be down at time τ. We say that a process p is correct if there is a time after which p is forever up.2 Processes execute transactions, which are sequences of read and write operations followed by a commit or abort operation. Transactions are submitted by client processes executing in any processor (with or without a replica of the database). Our correctness criterion for transaction execution is one-copy serializability (1SR) [2]. Committed transactions are represented as Ti, Tj and Tk. The set Πpτ contains all committed transactions in process p at time τ. Non-committed transactions are represented as tm and tn. When a transaction tm commits, it is represented as Ti, where i indicates the serial order of tm relative to the already committed transactions in the database.
2.2 The Deferred Update Principle
In the deferred update technique, a transaction passes through some well-defined states. It starts in the executing state, when its read and write operations are locally executed by the process where it was initiated. When the transaction requests the commit, it passes to the committing state and is sent to the other processes. When received by a process, the transaction is also in the committing state. A transaction in a process remains in the committing state until its fate is known by the process (i.e., commit or abort). The executing and committing states are transitory states, whereas the commit and abort states are final states. To summarise the principle of the deferred update replication technique, we consider a particular transaction tm executing in some process pi. Hereafter, the readset (RS) and writeset (WS) are the sets of data items that a transaction reads and writes, respectively, during its execution.
1. Transaction tm is initiated and executed at process pi. Read and write operations request the appropriate locks, which are locally granted according to the strict two phase locking rule. During this phase, transaction tm is in the executing state.
2. When transaction tm requests the commit, tm passes to the committing state. Its read locks are released (the write locks are maintained by pi until tm reaches a final state), and process pi then triggers the termination protocol for tm: the updates performed by tm (e.g., the redo log records), as well as its readset and writeset, are sent to all the other processes.
3. As part of the termination protocol, tm is certified by every process. The certification test guarantees one-copy serializability. The outcome of this test is the commit or abort of tm, according to concurrent transactions that
2 The notion of forever up is a theoretical assumption to guarantee that correct processes do useful computation. This assumption prevents cases where processes fail and recover successively without being up enough time to make the system evolve. Forever up means, for example, from the beginning until the end of a termination protocol.
executed in other processes. We will come back to this issue in Sections 3 and 4.
4. If tm passes the certification test, all its write locks are requested and its updates are applied to the database. Hereafter, transaction tm is represented as Ti. Transactions in the execution state whose locks conflict with Ti's write locks are aborted for Ti's sake.
During the time when a process p is down, p may miss some transactions by not participating in their termination protocol. However, as soon as process p is up again, p catches up with another process that has seen all transactions in the system. This recovery procedure depends on the implementation of the termination protocol (see [10] for a detailed discussion).
3 A Replication Scheme Based on Atomic Broadcast
3.1 Atomic Broadcast
An atomic broadcast primitive makes it possible to send messages to a set of processes, with the guarantee that all processes agree on the set of messages delivered and the order according to which the messages are delivered [7]. More precisely, atomic broadcast ensures that (i) if some process delivers message m then every correct process delivers m (uniform atomicity); (ii) no two processes deliver any two messages in different orders (order); and (iii) if a process broadcasts message m and does not fail, then every correct process eventually delivers m (termination). It is important to notice that the properties of atomic broadcast are defined in terms of message delivery and not in terms of message reception. Typically, a process first receives a message, then performs some computation to guarantee the atomic broadcast properties, and then delivers the message. The notion of delivery captures the concept of irrevocability (i.e., a process must not forget that it has delivered a message). In the following, we use the expression delivering a transaction tm to mean delivering the message that contains transaction tm.
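One simple way to obtain the order property in a failure-free setting is a fixed sequencer that stamps every message with a global sequence number. The toy Python sketch below illustrates only this idea; it provides neither uniform atomicity nor fault tolerance, so it is not the protocol assumed in this paper, and the class and method names are invented for the illustration.

import queue, threading

class SequencerBroadcast:
    # Toy total-order broadcast: a single sequencer assigns a global order.
    def __init__(self, n_processes):
        self.inboxes = [queue.Queue() for _ in range(n_processes)]
        self.pending = queue.Queue()
        threading.Thread(target=self._sequence, daemon=True).start()

    def broadcast(self, message):
        self.pending.put(message)            # any process may broadcast

    def _sequence(self):
        seqno = 0
        while True:
            message = self.pending.get()     # the sequencer imposes a single order
            for inbox in self.inboxes:
                inbox.put((seqno, message))
            seqno += 1

    def deliver(self, process_id):
        return self.inboxes[process_id].get()   # delivered in sequence order

ab = SequencerBroadcast(n_processes=3)
ab.broadcast("t1"); ab.broadcast("t2")
print([ab.deliver(0), ab.deliver(0)])        # every process sees t1 before t2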
3.2 A Termination Protocol Based on Atomic Broadcast
We now describe the termination protocol for the deferred update replication technique based on atomic broadcast. On delivering a committing transaction, each process certifies the transaction and, in case of success, commits it. Once the transaction is delivered and certified successfully, it passes to the commit state and its writes are processed. In order for a process to certify a committing transaction tm, it has to know which transactions conflict with tm. The notion of conflict is defined by the precedes relation between transactions and the operations issued by the transactions. A transaction Tj precedes another transaction tm, denoted Tj → tm, if Tj committed before tm started its execution. The relation Tj ↛ tm (not Tj → tm) means that Tj committed after tm started its execution. Based on these definitions, we say that a transaction Tj conflicts with tm if (1) Tj ↛ tm and (2) tm and Tj have conflicting operations.3 More precisely, process pj performs the following steps after delivering tm.
means that Tj committed after tm started its execution. Based on these definitions, we say that a transaction Tj conflicts with tm if (1) Tj → tm and (2) tm and Tj have conflicting operations.3 More precisely, process pj performs the following steps after delivering tm . 1. Certification test. Process pj aborts tm if there is any committed transaction Tj that conflicts with tm and that updated data items read by tm . 2. Commitment. If tm is not aborted, it passes to the commit state (hereafter tm will be expressed as the committed transaction Ti , where i represents the sequential order of transaction tm , and Πpτj = Πpτj ∪ {Ti }); process pj tries to grant all Ti ’s write locks, and processes its updates. There are two cases to consider. (a) There is a transaction tn in execution at pj whose read or write locks conflict with Ti ’s write locks. In this case tn is aborted by pj and, therefore, all tn ’s read and write locks are released. (b) There is a transaction tn , that is executed locally at pj and requested the commit, but is delivered after Ti . If tn executed locally at pj , it has all write locks on the data items it updated. If tn commits, its writes will supersede Ti ’s (i.e., the ones that overlap) and, in this case, Ti need neither request these write locks nor process the updates over the database. This is similar to Thomas’ Write Rule [12]. If tn is later aborted (i.e., it does not pass the certification test), the database should be restored to a state without tn , for instance, by applying Ti ’s redo log entries to the database. Given a committing transaction tm , the set of transactions {Tj | Tj → tm } (see certification test above) is determined as follows. Given a process p, let lastp (τ ) represent the last delivered and committed transaction in p at local time τ . A transaction tm that starts its execution in process p at time τ0 has to be checked at certification time τp against the transactions {Tlastp (τ0 )+1 , . . . , Tlastp (τp ) }. The information lastp (τ0 ) is broadcast together with the writes of Ti , and lastp (τp ) is determined by each process p at certification time (see [10] for more details).
4 Atomic Broadcast vs. Atomic Commit
In this section we describe the deferred update technique based on atomic commit and compare it with the deferred update technique based on atomic broadcast using two criteria: transaction throughput and transaction response time.
4.1 The Termination Protocol Based on Atomic Commit
With an atomic commit implementation of the deferred update termination protocol, no total order knowledge is available for the processes and a conflict has to be resolved by aborting all conflicting transactions. We describe below the termination protocol based on atomic commit.
3 Two operations conflict if they are issued by different transactions, access the same data item and at least one of them is a write.
1. Certification test. A process can make a decision on the commit or abort of a transaction tm when it knows that all the other processes also know about tm. This requires an interaction among processes that ensures that when a transaction is certified, all committing transactions are taken into account. On certifying transaction tm, process pj decides for its abort if there is a transaction tn that does not precede tm (tn ↛ tm) and with which tm has conflicting operations.
2. Commitment. If tm is not aborted, it passes to the commit state and pj tries to grant all its write locks and processes its updates. Transactions that are being locally executed in pj and have read or write locks on data items updated by tm are aborted.
In order to certify and commit transactions, the system has to detect transactions that executed concurrently in different processes. For example, in [1] this is done by associating vector clock timestamps with local events (e.g., execution or commit of a transaction). These timestamps have the property of being ordered if the events are causally related. Each time a process communicates with another, it sends its vector clock.
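As an illustration of the concurrency detection mentioned above, the sketch below compares two vector clocks: two events are concurrent exactly when neither vector dominates the other. This is a generic vector-clock test written for illustration, not the mechanism of [1].

def happened_before(vc_a, vc_b):
    # vc_a, vc_b: lists of per-process counters of equal length.
    return all(a <= b for a, b in zip(vc_a, vc_b)) and any(a < b for a, b in zip(vc_a, vc_b))

def concurrent(vc_a, vc_b):
    # Neither event causally precedes the other.
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)

print(concurrent([2, 0, 1], [1, 3, 0]))   # True: the two events executed concurrently
print(concurrent([1, 0, 0], [2, 1, 0]))   # False: the first causally precedes the second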
4.2 Transaction Throughput and Response Time
The values presented in Figure 1 were obtained by means of a simulation involving a database with 10000 data items replicated in 5 database processes. Each transaction has a readset and a writeset with 10 different data items and each data item has the same probability of being accessed. These experiments consider concurrent committing transactions (i.e., transactions that have already requested the commit). The results show that total order can indeed produce better results. In the following we compare implementations of atomic broadcast to implementations of atomic commitment. Due to lack of space, we do not describe how atomic broadcast can be implemented (see [10] for a detailed discussion). The measures shown in Figure 2 were obtained with Sparc 20 workstations (running Solaris 2.3), an Ethernet network using the TCP protocol, transactions (messages and logs) of 1024 bytes, and failure free runs (the most frequent ones in practice). The measures convey an average time to deliver a transaction message. These results show that atomic broadcast protocols have a better performance than atomic commit, when compared under the same degree of resilience.
5 Concluding Remarks
In a recent paper, Cheriton and Skeen expressed their scepticism about the adequacy of atomic broadcast to support replicated databases [4]. The reasons raised were (1) atomic broadcast replication schemes consider individual operations, whereas in databases, operations are gathered inside transactions, (2) atomic broadcast does not guarantee uniform atomicity (e.g., a server
Fig. 1. Throughput: committed transactions (%) versus transactions/s, comparing Atomic Broadcast and Atomic Commit.
Fig. 2. Response time: time (ms) versus number of nodes, comparing Three Phase Commit, Atomic Broadcast (Alg. 1), Two Phase Commit and Atomic Broadcast (Alg. 2).
could deliver a message and crash, with the other servers not even receiving it), whereas uniform atomicity is fundamental in distributed databases (all processes, even those that have later crashed, must agree to commit or not a transaction), and (3) atomic broadcast is usually considered in a crash-stop process model (i.e., a process that crashes never recovers), whereas in databases processes are supposed to recover after a crash. In this paper, we have considered an atomic broadcast primitive in a crash-recovery model that guarantees uniformity, i.e., if one process delivers a transaction, all the others also deliver it. We have shown how that primitive can efficiently be used to propagate transaction control information in a deferred update model of replication. The choice for that replication model was not casual, as some recent research has shown that immediate update models are inviable due to the nature of the applications or the side effects of synchronisation [6]. Indirectly, this paper points out the fact that existing database replication protocols are built on top of distributed database systems, using mechanisms that were developed to deal with distributed information, but not necessarily designed with replication in mind. A flagrant example of this is the atomic commitment mechanism, which we have shown can be favourably replaced by an atomic broadcast primitive, providing better throughput and response time. Our performance figures help demystify a common (mis)belief that total order atomic broadcast primitives are too expensive, when compared to traditional transactional primitives, and so inappropriate for high performance systems. As already stated in [13], we also believe that replication can be successfully applied for performance, and this can be achieved without sacrificing consistency (e.g., as in [5]) or making semantic assumptions about the transactions (e.g., as in [6,8]). It is important however to notice that the replication scheme based on atomic broadcast presented in this paper is based on two important assumptions: (1) processes certify and commit transactions sequentially, and (2) the database is fully replicated. The first assumption can be bypassed since concurrency between transactions that come from the Certifier is allowed if the transactions have disjoint write sets. The second assumption about a fully replicated
database is reasonable in traditional closed information systems, but is inappropriate for an open system with a large number of nodes or a large database. In such systems, replication can only be partial (i.e., processes store only subsets of the database), and the extent to which atomic broadcast can be useful in this context is an open issue.
References
1. D. Agrawal, A. El Abbadi, and R. Steinke. Epidemic algorithms in replicated databases. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Tucson, Arizona, 12-15 May 1997.
2. P. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
3. K. P. Birman and R. van Renesse. Reliable Distributed Computing with the ISIS Toolkit. IEEE Press, 1994.
4. D. Cheriton and D. Skeen. Understanding the Limitations of Causally and Totally Ordered Communication. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, Asheville, North Carolina, December 1993.
5. D. J. Delmolino. Strategies and techniques for using Oracle 7 replication. Technical report, Oracle Corporation, 1995.
6. J. N. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996.
7. V. Hadzilacos and S. Toueg. Distributed Systems, 2ed, chapter 3, Fault-Tolerant Broadcasts and Related Problems. Addison Wesley, 1993.
8. H. V. Jagadish, I. S. Mumick, and M. Rabinovich. Scalable versioning in distributed databases with commuting updates. In Proceedings of the Thirteenth International Conference on Data Engineering, April 7-11, 1997, Birmingham, U.K., pages 520-531. IEEE Computer Society Press, April 1997.
9. M. F. Kaashoek and A. S. Tanenbaum. Group communication in the Amoeba distributed operating system. In 11th International Conference on Distributed Computing Systems, pages 222-230, Washington, D.C., USA, May 1991. IEEE Computer Society Press.
10. F. Pedone, R. Guerraoui, and A. Schiper. Exploiting atomic broadcast in replicated databases. Technical Report No 98/258, Swiss Federal Institute of Technology, Lausanne, Switzerland, 1998. Available at http://lsewww.epfl.ch/pedone.
11. A. Schiper and M. Raynal. From group communication to transactions in distributed systems. Communications of the ACM, 39(4):84-87, April 1996.
12. R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. on Database Systems, 4(2):180-209, June 1979.
13. P. Triantafillou. High availability is not enough. In J.-F. Paris and H. G. Molina, editors, Proceedings of the Second Workshop on the Management of Replicated Data, pages 40-43, Monterey, California, November 1992. IEEE Computer Society Press.
The Hardware/Software Balancing Act for Information Retrieval on Symmetric Multiprocessors Zhihong Lu, Kathryn S. McKinley, and Brendon Cahoon Department of Computer Science University of Massachusetts Amherst, MA 01003 {zlu,mckinley,cahoon}@cs.umass.edu
Abstract. Web search engines, such as AltaVista and Infoseek, handle tremendous loads by exploiting the parallelism implicit in their tasks and using symmetric multiprocessors to support their services. The web searching problem that they solve is a special case of the more general information retrieval (IR) problem of locating documents relevant to the information need of users. In this paper, we investigate how to exploit a symmetric multiprocessor to build high performance IR servers. Although the problem can be solved by throwing lots of CPU and disk resources at it, the important questions are how much of which hardware and what software structure is needed to effectively exploit hardware resources. We have found, to our surprise, that in some cases adding hardware degrades performance rather than improves it. We show that multiple threads are needed to fully utilize hardware resources. Our investigation is based on InQuery, a state-of-the-art full-text information retrieval engine.
1 Introduction
As information explodes across the Web and elsewhere, people increasingly depend on search engines to help them find information. Web searching is a special case of the more general information retrieval (IR) problem of locating documents relevant to the information need of users. In this paper, we investigate how to balance hardware and software resources to exploit a symmetric multiprocessor (SMP) architecture to build high performance IR servers. Our IR server is based on InQuery [2,3], a state-of-the-art full-text information retrieval engine that is widely used in Web search engines, large libraries, companies, and government, for example by Infoseek, the Library of Congress, the White House, West Publishing, and Lotus [5]. Our work is novel because it investigates a real, proven effective system under a variety of realistic workloads and hardware configurations on an SMP architecture. Previous research either investigates IR systems on massively parallel processing (MPP) architectures, investigates only a subset of the system on an SMP architecture (such as the disk system), or compares the cost factors of the SMP architecture with other architectures. (See [6]
for a more thorough comparison with the related work). Our results provide insights for building high performance IR servers for searching the Web and other environments using a symmetric multiprocessor. The remainder of this paper is organized as follows. The next section describes the implementation of our parallel IR server and simulator. Section 3 presents results that demonstrate the system scalability and hardware/software balancing with respect to multiple threads, CPUs, and disks. Section 4 summarizes our results and concludes.
2 A Parallel Information Retrieval Server
This section describes the implementation of our parallel IR server and simulator. We begin with a brief description of the InQuery retrieval engine [2,3,5]. We next present the features we model and summarize our validation of the simulator against the multithreaded implementation.
InQuery Retrieval Engine
InQuery is one of the most powerful and advanced full-text information retrieval engines in commercial or government use today [5]. It uses an inference network model, which applies Bayesian inference networks to represent documents and queries, and views information retrieval as an inference or evidential reasoning process [2,3]. The inference networks are implemented as inverted files. In this paper, we use “collection” to refer to a set of documents, and “database” to refer to an indexed collection. The InQuery server supports a range of IR commands such as query, document, and relevance feedback. The three basic IR commands we model are: query, summary, and document. A query command requests documents that match a set of terms. A query response consists of a list of top ranked document identifiers. A summary command consists of a set of document identifiers. A summary response includes the document titles and the first few sentences of the documents. A document command requests a document using its document identifier. The response includes the complete text of the document.
The Parallel IR Server
To investigate the balance between hardware and software in an IR system on a symmetric multiprocessor, we implemented a parallel IR server using InQuery as the retrieval engine. The parallel IR server exploits parallelism as follows: (1) it executes multiple IR commands in parallel by multithreading; and (2) it executes one command against multiple partitions of a collection in parallel. To expedite our investigation of possible system configurations, characteristics of IR collections, and the basic IR system performance, we implement a simulator with numerous system parameters, such as the number of CPUs, threads, disks, collection size, and query characteristics. Table 1 presents all the parameters and the values we use in our experiments. The simulation model is driven by empirical timing measurements from the actual system. For queries, summaries, and documents, we measure CPU and
Table 1. Experimental Parameters
Parameter                                                      Abbreviation   Values
Query Number                                                   QN             1000
Terms per Query (average 2; shifted neg. binomial dist. (n,p,s))  TPQ         (4, 0.8, 1)
Query Term Frequency (dist. from queries)                      QTF            Observed Distribution
Client Arrival Pattern (Poisson process, requests per minute)  λ              6 30 60 90 120 150 180 240 300
Collection Size (GB)                                           CSIZE          1 2 4 8 16
Disk Number                                                    DISK           1 2 4 8 16
CPU Number                                                     CPU            1 2 3 4
Thread Number                                                  TH             1 2 4 8 16 32
disk usage for each operation, but do not measure the memory and cache effects. We model the collection and queries by obtaining document and term statistics from test collections and real query sets (See [1,6] for more details.) We validate our simulator against the multithreaded implementation. The simulator reports response times that are 4.5% slower than the actual system on the average (See [6] for more details).
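To give an idea of the kind of model involved, the following toy queueing sketch generates Poisson arrivals and serves them with a fixed pool of identical worker threads. It is only a hedged illustration: a single exponential service-time parameter stands in for the empirical per-command CPU and disk timings that drive the real simulator, and it is not the authors' simulator.

import heapq, random

def simulate(arrival_rate_per_min, n_threads, mean_service_s, n_queries=1000, seed=0):
    # Toy M/M/c-style model: Poisson arrivals, FIFO queueing, identical threads.
    # Returns the mean response time in seconds.
    rng = random.Random(seed)
    mean_interarrival = 60.0 / arrival_rate_per_min
    free_at = [0.0] * n_threads                  # time at which each thread becomes free
    heapq.heapify(free_at)
    clock = total_response = 0.0
    for _ in range(n_queries):
        clock += rng.expovariate(1.0 / mean_interarrival)   # Poisson arrival process
        start = max(clock, free_at[0])                      # wait for the earliest free thread
        finish = start + rng.expovariate(1.0 / mean_service_s)
        heapq.heapreplace(free_at, finish)
        total_response += finish - clock
    return total_response / n_queries

print(simulate(arrival_rate_per_min=120, n_threads=8, mean_service_s=2.0))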
3 Experiments and Results
This section explores how software and hardware configurations affect system scalability with respect to multiple threads, CPUs, and disks. We start with a base system that consists of one thread, one CPU, and one disk. This system is disk bound. We improve the performance of our IR server through better software (multithreading) and with additional hardware (CPUs and disks). We demonstrate the system scalability using two sets of experiments. In the first set of experiments, we explore the effects of threading on system scalability. In the second set of experiments, we explore system scalability by increasing the collection size from 1 GB to 16 GB. When multiple disks exist, we use a round-robin strategy to distribute the collection and its index over the disks. We assume the client arrival rate is a Poisson process. Each client issues a query and waits for the response. For each query, the server performs two operations: query evaluation and retrieving the corresponding summaries. Since users typically enter short queries, we experiment with a query set that consists of 1000 short queries, with an average of 2 terms per query, that mimic those found in the query set downloaded from the Web server for searching the 103rd Congressional Record [4]. All experiments measure response time, CPU and disk utilization, and determine the largest arrival rate at which the system supports a response time under 10 seconds. We chose 10 seconds arbitrarily as our cutoff point for a reasonable response time.
3.1 Threading
This section examines how the software structure, i.e., the number of threads, affects system scalability. Figure 1 illustrates how the average response time and
Fig. 1. Performance as the number of threads increases. Panels (a)-(d) plot the average response time (seconds) against the average arrival rate (requests per minute) for 1 to 32 threads, with QN = 1000, TPQ = 2, QTF = Obs.; the four hardware configurations shown are CSIZE/CPU/DISK = 1 GB/1/1, 1 GB/1/4, 1 GB/2/4 and 4 GB/2/4. Panel (e), hardware utilization at some data points of interest:
Configuration (a), λ = 120: CPU 34.4% (TH=2), 36.2% (TH=4); DISK 92.7% (TH=2), 97.3% (TH=4)
Configuration (b), λ = 150: CPU 72.6% (TH=4), 89.2% (TH=8); DISK 51.1% (TH=4), 62.8% (TH=8)
Configuration (c), λ = 180: CPU 55.1% (TH=8), 56.7% (TH=16); DISK 77.4% (TH=8), 79.4% (TH=16)
Configuration (d), λ = 150: CPU 73.3% (TH=8), 85.2% (TH=16); DISK 68.0% (TH=8), 79.3% (TH=16)
resource utilization change as the number of threads increases, with varying numbers of CPUs and disks. In all the configurations, the average response time improves significantly as the number of threads increases until either the disk or the CPU is overutilized (see Figure 1 (a) and (b)). Using too few threads limits the system's ability to achieve its peak performance. For example, in configuration (c), using 4 threads only supports 120 requests per minute for a response time under 10 seconds, while using 16 threads supports more than 180 requests per minute under the same hardware configuration. When either the CPU or the disk is a bottleneck, the system needs fewer threads to reach its peak performance. When CPUs and disks are well balanced (configurations (c) and (d)), the necessary number of threads is influenced more by the number of disks than by the collection size. In both configuration (c) and configuration (d), the system achieves its peak performance using 16 threads. Additional threads do not bring further improvement.
3.2 Increasing the Collection Size
This section examines system scalability and hardware balancing as the collection size increases from 1 GB to 16 GB. In order to examine different hardware configurations, we consider two disk configurations as the collection size increases: fixing the number of disks, and adding disks. We vary the number of CPUs in each disk configuration. Figure 2 illustrates the average response time and resource utilization when the collection is distributed over 16 disks, which means each disk stores a database for 1/16 of the collection. Partitioning the collection over 16 disks illustrates when the system is CPU bound (see Figure 2(c)). Although performance degrades as the collection size increases, the degradation is closely related to the CPU utilization. With 1 CPU, where the CPU is overutilized for 1 GB and 60 requests per minute, increasing the collection size from 1 GB to 16 GB decreases the largest arrival rate at which the system supports a response time under 10 seconds by a factor of 10 (see Figure 2(a)). With 4 CPUs, where the CPUs are overutilized for 1 GB and 180 requests per minute, the performance degrades much more gracefully (see Figure 2(b)). Increasing the collection size from 1 GB to 16 GB only decreases the largest arrival rate for a response time under 10 seconds by a factor of 3 for 4 CPUs. Figure 3 illustrates the average response time and the resource utilization when the number of disks varies with the collection size and each disk stores a database for 1 GB of data. The system produces better response times than for 1 GB when handling 2 GB with 1 CPU and up to 8 GB with 4 CPUs. A single CPU system thus handles a 2 GB collection faster than a 1 GB collection, and a 4 CPU system handles a 2, 4, or 8 GB collection faster than a 1 GB collection in our configuration. The performance improves because the work related to retrieving summaries is distributed over the disks such that each disk handles less work, relieving the disk bottleneck. By examining the utilization of the CPU and disks in Figure 3(c), we see that the performance improves until the CPUs are overutilized. In the example of the single CPU system, the CPU is overutilized for a 4 GB collection. For a 2 GB collection distributed over 2 disks, the system handles 27.8% more requests than for a 1 GB collection on 1 disk. By comparing Figure 2(c) and Figure 3(c), we find that the CPU utilization is more closely related to the number of disks than to the collection size. We also find that adding disks degrades system performance when CPUs and disks are not well balanced. For example, for an 8 GB collection, partitioning over 8 disks using 4 CPUs results in 67.5% CPU utilization (see Figure 3(c)), while partitioning over 16 disks results in 88.0% CPU utilization (see Figure 2(c)) due to the additional overhead of accessing each disk. In this configuration, a system with 16 disks performs worse than one with 8 disks because the CPUs are a bottleneck.
4 Conclusion
In this paper, we investigate building a parallel information retrieval server using a symmetric multiprocessor to improve the system performance. Our results show that we need more than one thread to fully utilize hardware resources
Fig. 2. Average response time and resource utilization for a collection distributed over 16 disks (QN = 1000, TPQ = 2, QTF = Obs., CSIZE and CPU varied, DISK = 16, TH = 16). Panels (a) CPU = 1 and (b) CPU = 4 plot the average response time (seconds) against the average arrival rate (requests per minute) for collections of 1, 2, 4, 8 and 16 GB. Panel (c), resource utilization when λ = 120 (size/16 GB per disk), for 1, 2, 4, 8 and 16 GB respectively:
1 CPU: CPU 96.6%, 97.4%, 98.1%, 98.6%, 99.0%; DISK 21.6%, 18.4%, 14.7%, 11.9%, 9.8%
4 CPUs: CPU 38.9%, 49.7%, 68.0%, 88.0%, 93.5%; DISK 34.9%, 37.8%, 41.1%, 42.6%, 37.4%
Fig. 3. Average response time and resource utilization when the number of disks varies with the size of the collection (QN = 1000, TPQ = 2, QTF = Obs., CSIZE, CPU and DISK varied, TH = 16). Panels (a) CPU = 1 and (b) CPU = 4 plot the average response time (seconds) against the average arrival rate (requests per minute) for collections of 1, 2, 4, 8 and 16 GB. Panel (c), resource utilization when λ = 120 (1 GB per disk), for 1, 2, 4, 8 and 16 GB respectively:
1 CPU: CPU 36.2%, 75.3%, 94.3%, 98.1%, 99.0%; DISK 97.4%, 82.2%, 43.5%, 20.6%, 9.8%
4 CPUs: CPU 9.0%, 19.0%, 35.7%, 67.5%, 93.5%; DISK 97.4%, 82.9%, 66.7%, 57.1%, 37.4%
(4 to 16 threads for the configurations we explored). We also show that adding hardware components can improve the performance, but these components must be well balanced since the IR workload performs significant amounts of both I/O and CPU processing. Our results show that we can search more data with no loss in performance in many instances. Although performance eventually degrades as the collection size increases, the performance degrades very gracefully if we keep the hardware utilization balanced. Our results also show that the performance of our system is more strongly related to the number of disks than to the collection size.
Acknowledgment This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623. Kathryn S. McKinley is supported by an NSF CAREER award CCR-9624209. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor. We thank Bruce Croft, Jamie Callan, James Allan, and Chris Small for their support of this work. We also thank all the developers of InQuery without whose efforts this work would not have been possible.
References
1. B. Cahoon and K. S. McKinley. Performance evaluation of a distributed architecture for information retrieval. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 110-118, Zurich, Switzerland, August 1996.
2. J. P. Callan, W. B. Croft, and J. Broglio. TREC and TIPSTER experiments with INQUERY. Information Processing & Management, 31(3):327-343, 1995.
3. J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert System Applications, Valencia, Spain, September 1992.
4. W. B. Croft, R. Cook, and D. Wilder. Providing government information on the internet: Experiences with THOMAS. In The Second International Conference on the Theory and Practice of Digital Libraries, Austin, TX, June 1995.
5. InQuery. http://ciir.cs.umass.edu/info/highlights.html.
6. Zhihong Lu, Kathryn S. McKinley, and Brendon Cahoon. The hardware/software balancing act for information retrieval on symmetric multiprocessors. Technical Report TR98-25, University of Massachusetts, Amherst, 1998.
The Enhancement of Semijoin Strategies in Distributed Query Optimization F. Najjar and Y. Slimani Dept. Informatique - Faculté des Sciences de Tunis Campus Universitaire - 1060 Tunis, Tunisie [email protected]
Abstract. We investigate the problem of optimizing distributed queries by using semijoins in order to minimize the amount of data communication between sites. The problem is reduced to that of finding an optimal semijoin sequence that locally fully reduces the relations referenced in a general query graph before processing the join operations.
1 Introduction
The optimization of general queries in a distributed database system is an important and challenging research issue. The problem is to determine a sequence of database operations which process the query while minimizing some predetermined cost function. Join is a frequently used database operation. It is also the most expensive, specifically in a distributed database system; it may involve large communication costs when the relations are located at different sites. Hence, instead of performing joins in one step, semijoins [1] are performed first to reduce the size of the relations so as to minimize the data transmission cost for processing queries [2]. In the next step, joins are performed on the reduced relations. The join of two relations R and S on an attribute A is denoted by (R ✶A S), while the semijoin from R to S on an attribute A is denoted by S ∝A R. Thus, S ∝A R is defined as follows: (i) project R on the join attribute A (i.e. R(A)); (ii) ship R(A) to the site containing S; (iii) join S with R(A). The transmission cost of sending S to the site containing R for the join R ✶A S can thus be reduced. There are two main methods to process a join operation between two relations. One is called the nondistributed join, where a join is performed between two unfragmented relations. The other is called the distributed join, where the join operation is performed between the fragments of relations. As pointed out in [5], the problem of query processing has been proved to be NP-hard. This fact justifies the necessity of resorting to heuristics. The remainder of this paper is organized as follows: preliminaries are given in Section 2. Section 3 defines the main characteristics of two semijoin-based query optimization heuristics; then, we present and discuss the join query optimization in a fragmented database. Finally, Section 5 concludes the paper.
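Referring back to the definition of S ∝A R above, a minimal Python sketch of the three steps follows; representing relations as lists of dictionaries is an assumption made only for the illustration.

def semijoin(s, r, attribute):
    # Compute S ∝A R: keep the tuples of S whose value on `attribute`
    # appears in the projection of R on that attribute.
    r_projection = {tuple_r[attribute] for tuple_r in r}   # (i) project R on A
    # (ii) only r_projection would be shipped to S's site;
    # (iii) join S with the projection, i.e. filter S.
    return [tuple_s for tuple_s in s if tuple_s[attribute] in r_projection]

R = [{"A": 1, "x": "r1"}, {"A": 3, "x": "r2"}]
S = [{"A": 1, "y": "s1"}, {"A": 2, "y": "s2"}, {"A": 3, "y": "s3"}]
print(semijoin(S, R, "A"))   # only the S tuples with A in {1, 3} survive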
2 Preliminaries
A join query graph can be denoted by a graph G = (V, E), where V is the set of relations and E is the set of edges. An edge (Ri, Rj) ∈ E exists if there is a join predicate on some attribute of Ri and Rj. Without loss of generality, only cyclic query graphs are considered. In addition, all attributes are renamed in such a way that two join attributes have the same name if and only if they have a join predicate between them. The relations referenced in the query are assumed to be located at different sites. The query problem is simplified to the estimation of the data statistics and the optimization of the transmission order, so that the total data transmission is minimized. We denote by |S| the cardinality of a relation S. Let wA be the width of an attribute A and wRi be the width of a tuple in Ri. The size of the total amount of data in Ri can then be denoted by ||Ri|| = wRi |Ri|. For notational simplicity, we use |A| to denote the extant domain of the attribute A. Ri(A) denotes the set of distinct values for the attribute A appearing in Ri. For each semijoin Rj ∝A Ri, a selectivity factor ρiA = |Ri(A)| / |A| is used to predict the reduction effect. After the execution of Rj ∝A Ri, the size of Rj becomes ρiA ||Rj||. Moreover, it is important to verify that a semijoin Rj ∝A Ri is profitable, i.e. that the cost incurred by this semijoin, wA |Ri(A)|, is less than the benefit of the reduction, computed in terms of avoided future transmission cost, ||Rj|| - ρiA ||Rj||. The profit is set to be (benefit - cost).
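A small sketch of these estimates follows: it evaluates the selectivity, cost and benefit of one candidate semijoin Rj ∝A Ri from catalogue statistics, taking the benefit as the reduction in ||Rj|| as defined above. The statistics used in the call are hypothetical.

def semijoin_profit(card_rj, width_rj, distinct_ri_a, width_a, domain_a):
    # Estimate the profit of the semijoin Rj ∝A Ri from catalogue statistics.
    selectivity = distinct_ri_a / domain_a          # rho_iA = |Ri(A)| / |A|
    cost = width_a * distinct_ri_a                  # shipping the projection Ri(A)
    size_rj = width_rj * card_rj                    # ||Rj||
    benefit = size_rj - selectivity * size_rj       # avoided future transmission
    return benefit - cost

# Hypothetical statistics: |Rj| = 3440 tuples of width 8, |Ri(A)| = 400, wA = 4, |A| = 1000.
print(semijoin_profit(card_rj=3440, width_rj=8, distinct_ri_a=400, width_a=4, domain_a=1000))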
3 Nondistributed Join Method
In this section, we propose two heuristics. The first, one-phase Parallel SemiJoins (1-PSJ), determines a set of parallel semijoins. The second, the Hybrid A∗ heuristic (HA∗), finds a general sequence of semijoins, which is a combination of parallel and sequential semijoins.
3.1 1-PSJ
We say that Ri is fully locally reduced if it has been reduced by every feasible semijoin Ri ∝A Rj. We denote by RDi = {j | Ri ∝ Rj is profitable} the set of index reducers of the relation Ri. Our objective is to find the set of the most locally profitable semijoins (called applicable semijoins), APi ⊆ RDi, such that the overall profit is maximized and, consequently, the total transmission cost (TCi) of Ri is minimized. Furthermore, removing a non-profitable semijoin may increase the total profit and reduce the extra costs incurred by semijoins. Since all applicable semijoins are executed simultaneously, local optimality (with respect to Ri) can be attained. Finally, in order to reduce each relation in the query, we apply a divide-and-conquer algorithm. The total cost (TC) is minimized if all transmission costs (TCi) are minimized simultaneously. The details of this algorithm are given in [4].
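A minimal sketch of how the applicable set APi might be assembled from the reducer set RDi is shown below: keep each candidate whose estimated profit is positive. This ignores the interactions between semijoins that the full algorithm in [4] accounts for, and the data layout is hypothetical.

  /* Hypothetical candidate reducer of Ri: the index j of Rj and its estimated profit. */
  typedef struct {
      int    j;
      double profit;   /* benefit - cost, estimated from the database profile */
  } Reducer;

  /* Build AP_i from RD_i: under an additive profit model, every profitable
   * semijoin contributes positively, so keeping exactly those maximizes the
   * overall profit of the parallel reduction step. Returns |AP_i|. */
  static int select_applicable(const Reducer *rd, int n_rd, Reducer *ap)
  {
      int n_ap = 0;
      for (int k = 0; k < n_rd; k++)
          if (rd[k].profit > 0.0)
              ap[n_ap++] = rd[k];
      return n_ap;
  }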
3.2 Hybrid A∗
The well-known A∗ algorithm can be used to determine a sequence of semijoin reducers [6] for distributed query processing. The key issue of the A∗ algorithm is to derive a heuristic function which can intelligently guide the search for a sequence of semijoins. In the A∗ algorithm, the search is controlled by a heuristic function f with two components: the cost of reaching p from the initial node (the original query graph with its corresponding profile), and the cost of reaching the goal node from p. Accordingly, f(p) = g(p) + h(p), where g(p) estimates the minimum cost of the trajectory from the initial state to p, and h(p) estimates the minimum cost from p to the goal state. The node p chosen for expansion (i.e., the one whose immediate successors will be generated) is the one which has the smallest value f(p) among all generated nodes that have not been expanded so far. In order to derive a general sequence of semijoins, for a node p that is an immediate successor of q, g(p) = g(q) + Σ_{j∈APi} cost(Ri ∝ Rj) + ||R′i||, where R′i denotes the resulting relation after performing the applicable semijoins on the original relation Ri. The function h is defined as the sum of the sizes of the remaining relations, such that the effect of the total reduction (with respect to neighboring relations) gives the best estimation: h(p) = Σ_k ( Σ_j cost(Rk ∝ Rj) + ||R′k|| ), where the outer sum ranges over the relations Rk that are not yet reduced.
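The node-selection step implied by f(p) = g(p) + h(p) can be written down directly; the node representation below is an assumption made only for illustration, not the data structure used by HA∗.

  #include <float.h>

  /* Hypothetical search node: a partially reduced query state. */
  typedef struct {
      double g;        /* cost accumulated from the initial state            */
      double h;        /* estimated remaining cost (sizes of unreduced Rk's) */
      int    expanded; /* nonzero once its successors have been generated    */
  } Node;

  /* Return the index of the generated-but-unexpanded node with smallest f = g + h,
   * or -1 if every node has already been expanded. */
  static int select_for_expansion(const Node *nodes, int n)
  {
      int    best   = -1;
      double best_f = DBL_MAX;
      for (int i = 0; i < n; i++) {
          if (nodes[i].expanded) continue;
          double f = nodes[i].g + nodes[i].h;
          if (f < best_f) { best_f = f; best = i; }
      }
      return best;
  }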
Example 1: Consider the following join query: Select A, D from R1, R2, R3, R4 where (R1.A = R3.A) and (R1.B = R2.B) and (R2.C = R3.C) and (R3.D = R4.D). We suppose that R1, R2, R3 and R4 are located at different sites. The corresponding query graph and profile are given, respectively, in Figure 1 and Table 1.
Fig. 1. Join query graph for example 1.

1-PSJ finds the set {R1 ∝ R2, R3 ∝ R1, R3 ∝ R4}, with a total transmission cost to the final site R2 of 18,370. HA∗, in contrast, finds the general sequence of semijoins R1 ∝ R2, {R3 ∝ R1, R3 ∝ R4}, with a cost of 16,681. To gain more insight into the performance of the 1-PSJ and HA∗ heuristics, simulations were carried out on different queries, with n relations involved in each query and n ranging from 5 to 12. For comparison purposes, in addition to 1-PSJ and HA∗, we also apply the original method, OM, which consists of sending all the relations directly to the final site.
Table 1. Profile Table for example 1.

  Ri   |Ri|   X   WX   |Ri(X)|
  R1   1190   A    2     830
              B    1     850
  R2   3440   B    1     850
              C    3     900
  R3   2152   A    2     800
              C    3     900
              D    1     720
  R4   3100   D    1     700
In Figure 2, it is apparent that as the number of relations increases (n ≥ 8), the HA∗ heuristic becomes better than 1-PSJ. When n ≥ 9, HA∗ outperforms the other heuristics significantly (the cost reduction is about 45%).
[Plot: percentage transmission cost versus number of referenced relations (5 to 12) for OM, 1-PSJ, and HA∗.]
Fig. 2. Impact of the number of relations on transmission cost.
4 Distributed Join Method
A relation can be horizontally partitioned (declustered) into disjoint subsets, called fragments, distributed across several sites. We associate with each fragment its qualification, which is expressed by a predicate describing the common properties of the tuples in that fragment. One major issue in developing a horizontal fragmentation technique is determining the criteria to be used in guiding the fragmentation. A major difficulty is that there are no known significant criteria that can be used to partition relations horizontally. In the context of our study, we suggest a bipartition of each relation Ri, such that a relation is divided into mutually exclusive fragments.
To represent the fragments more specifically, we propose the following formula: |Ri| = α|Ri| + (1 − α)|Ri| = |Ri1| + |Ri2|, where α is a rational number ranging from 0 to 1 and Ri1, Ri2 are the fragments of Ri. The above fragmentation satisfies the three conditions of [2], completeness, reconstruction, and disjointness, which ensure a correct horizontal fragmentation. Note that bipartitioning can be applied to a relation repeatedly. To estimate the cost of an execution plan, the database profile maintains the following statistics: |Rij| denotes the cardinality of fragment number j of relation Ri, and |Rij(A)| represents the number of distinct values of attribute A in that fragment. When semijoins are used in a fragmented database system, they have to be performed in a relation-to-fragment manner, so that they do not cause the elimination of contributive tuples. At each site containing a fragment Rjk to be reduced, we proceed as follows: (i) every fragment of Ri¹ must participate in reducing Rjk; so, find the optimal set of applicable semijoins and send the values of the semijoin attributes from each fragment of Ri to Rjk; (ii) merge the fragments of Ri before eliminating any tuple of Rjk.
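The following sketch illustrates the relation-to-fragment discipline of steps (i) and (ii): the join-attribute values of all fragments of Ri are consulted (i.e., logically merged) before any tuple of the fragment Rjk is discarded. The fixed-size buffer and the data layout are assumptions for illustration only.

  #include <stdbool.h>

  /* Hypothetical fragment: the value of the join attribute A for each of its tuples. */
  typedef struct {
      const int *a_values;
      int        n_tuples;
  } Fragment;

  static bool contains(const Fragment *f, int v)
  {
      for (int i = 0; i < f->n_tuples; i++)
          if (f->a_values[i] == v) return true;
      return false;
  }

  /* Reduce Rjk by the semijoin Rjk <|A Ri, where Ri is split into n_frags fragments:
   * a tuple of Rjk survives iff its A-value appears in at least one fragment of Ri,
   * so no contributive tuple is eliminated before the fragments are merged. */
  static int reduce_fragment(Fragment *rjk, const Fragment *ri_frags, int n_frags)
  {
      static int surviving[1024];          /* illustrative fixed bound */
      int kept = 0;
      for (int t = 0; t < rjk->n_tuples; t++) {
          for (int f = 0; f < n_frags; f++) {
              if (contains(&ri_frags[f], rjk->a_values[t])) {
                  surviving[kept++] = rjk->a_values[t];
                  break;
              }
          }
      }
      rjk->a_values = surviving;
      rjk->n_tuples = kept;
      return kept;
  }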
Example 2: We illustrate the distributed join method (HA∗ on fragmented relations) with the same example discussed previously for the nondistributed join. After partitioning, the corresponding profile is given in Table 2.

Table 2. Profile Table for Example 1, after partitioning.

  Rij       |Rij|       X   WX   |Rij(X)|
  R11/R12   119/1071    A    2    90/809
                        B    1    91/818
  R21/R22   344/3096    B    1    91/818
                        C    3    97/872
  R31/R32   215/1936    A    2    87/782
                        C    3    97/872
                        D    1    79/800
  R41/R42   310/2790    D    1    76/683
The optimal general sequence is: {R12 ∝ R22, R12 ∝ (R31 ∪ R32)}, {R21 ∝ R11, R21 ∝ (R31 ∪ R32)}, R42 ∝ R31, {R32 ∝ (R11 ∪ R12), R32 ∝ (R21 ∪ R22), R32 ∝ R42}; it incurs a cost of 13,959, which is less than that of the nondistributed join method. A general conclusion is that the communication cost is substantially reduced if we use a “good fragmentation”. In the absence of a formal definition of a good fragmentation, we can approximate it by the α-factor. In effect, we have noted that a good choice of this criterion leads to a good fragmentation. Figure 2 shows the effect of the α-factor on the communication cost for a given query in which the number of relations is constant and α is varied.
¹ Ri such that the semijoin Rj ∝ Ri appears in the query.
5 Conclusion
In this paper, we proposed two distributed query processing strategies for join queries using semijoins as a query processing tactic. For these two strategies, we presented new heuristics that “intelligently” guide the search and return a general reducer sequence of semijoins. For the case of the distributed join strategy, we proposed a technique to bipartition each relation assuming a fixed α-factor.
References
1. P.A. Bernstein, N. Goodman, E. Wong, C. Reeve, and J.B. Rothnie. Query Processing in a System for Distributed Databases (SDD-1). ACM TODS, vol. 6(4), Dec. 1981, pp. 602-625.
2. S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1985.
3. M-S. Chen and P.S. Yu. Combining Join and Semijoin Operations for Distributed Query Processing. IEEE TKDE, vol. 5(3), Jun. 1993, pp. 534-542.
4. F. Najjar, Y. Slimani, S. Tlili, and J. Boughizane. Heuristics to determine a general sequence of semijoins in distributed query processing. Proc. of the 9th IASTED Int. Conf., PDCS, Washington D.C. (USA), Oct. 1997, pp. 354-359.
5. C. Wang and M-S. Chen. On the Complexity of Distributed Query Optimization. IEEE TKDE, vol. 8(4), Aug. 1996, pp. 650-662.
6. H. Yoo and S. Lafortune. An Intelligent Search Method for Query Optimization by Semijoins. IEEE TKDE, vol. 1(2), Jun. 1989, pp. 226-237.
Virtual Time Synchronization in Distributed Database Systems Using a Cluster of Workstations

Azzedine Boukerche¹, Timothy E. LeMaster², Sajal K. Das¹, and Ajoy Datta²

¹ Department of Computer Sciences, University of North Texas, Denton, TX, USA
² Department of Computer Science, University of Nevada, Las Vegas, NV, USA
Abstract. Virtual Time is the fundamental concept in optimistic synchronization schemes. In this paper, we investigate the efficiency of a distributed database management system synchronized by virtual time, using a LAN-connected collection of SPARCstations. To the best of our knowledge, there is little data reporting VT performance in DDBMS. The experimental results are compared with a multiversion timestamp ordering (MVTO) concurrency control method, a widely used technique in distributed databases, and approximately a 25-30% reduction in the response time is observed when using the VT synchronization scheme. Our results demonstrate that the VT synchronization approach is a viable alternative concurrency control method for distributed database systems and outperforms the MVTO one.
1 Introduction
Virtual Time is the fundamental concept in optimistic synchronization schemes. Many applications of virtual time have been proposed, including parallel simulations, distributed systems, and VLSI design problems. The Time Warp (TW) mechanism [4, 5] is a technique to implement virtual time by allowing application programs to cancel previously scheduled events. While conservative synchronization methods [3] rely on blocking to avoid violation of dependency constraints, Time Warp relies on detecting synchronization errors at run time and on recovery using rollback mechanisms. The principal advantage of Time Warp over more conventional blocking-based synchronization protocols is that Time Warp offers the potential for greater exploitation of concurrency and parallelism, and, more importantly, greater transparency of the synchronization mechanism to the programmer. The latter is due to the fact that Time Warp is less reliant on application-specific information regarding which computations may safely proceed concurrently. Despite the fact that research on virtual time has been ongoing for the past decade, there is little data reporting the performance of distributed database systems (DDBMS) synchronized by virtual time.
Part of this research is supported by Texas Advanced Technology Program grant TATP-003594031
Jefferson [4, 5] has suggested that virtual time could be used to synchronize DDBMS. Therefore, a question of considerable pragmatic interest is whether virtual time is really a viable alternative concurrency control method for distributed database systems and whether it can provide good performance. In this paper, our primary goal is to describe an implementation of virtual time synchronization in distributed database management systems and to study its performance. This is an important step towards determining the practicality of virtual time synchronized database systems. We then compare the performance of our method with a Multiversion Timestamp Ordering (MVTO) concurrency control method [1]. The empirical results demonstrate that the VT approach outperforms the MVTO one.
2 Virtual Time Database Synchronization
We view a database system as a collection of objects (sequential processes) that execute concurrently and communicate through timestamped messages. Each object is capable of both computation and communication, and it has an associated virtual clock which acts as the simulation clock for that object. Now, let us examine the components and the structure of a database management system. There are four major components of a DDBMS at each site: transactions, the transaction manager (TM), the database manager (DM), and data. Transactions are generated by users. The transaction manager supervises interaction between users and the DDBMS, while the database manager controls access to the data and provides an interface between the user (via the query) and the file system. It is also responsible for controlling the simultaneous use of the database and maintaining its integrity and security. There is no formal distinction between transaction and data objects. While data objects respond only to Read/Write messages, transaction objects receive messages directly from external users. Objects communicate with each other through Request and Response primitives which send messages. A Request either reads or writes a data value, while a Response returns a data value. After an object issues a Request, it blocks to await a response. The possibility of deadlock exists since two objects within a transaction could request data values from each other at the same time. Each would block, waiting for the other to respond to the Request. The virtual time scheme prevents deadlock between transactions, since all transactions are executed in timestamp order, and all transaction timestamps are unique. Virtual time can be viewed as a global, one-dimensional temporal coordinate system imposed on a distributed system to measure computational progress and define synchronization [4, 5]. There are two important properties of virtual time. First, it may increase or decrease with respect to real time. There is no restriction on the amount or frequency by which it changes. Second, the virtual times generated must form a total (or partial) ordering under the relation “less than.”
A manager (or process) may send a message to any other manager at any time. There are no restrictions placed upon potential communication paths between processes. No fixed or static declarations are required before execution begins. The communication medium is assumed to be reliable, but messages are not required to arrive in the order they are sent. Indeed, occasionally different transactions cause an out-of-order message to be received by an object. When this occurs, rollback is used for synchronization. However, instead of rolling back the entire transaction, only certain actions within the transaction are rolled back. These actions are the ones that are causally connected to the errant message. This is where the virtual time synchronization approach diverges from the traditional one, in which the entire transaction is either aborted and retried or is blocked. Virtual time incorporates several different data structures and control algorithms to produce optimistic processing. The most noteworthy are the Time Warp based mechanism to implement virtual time, and the use of rollback as the fundamental synchronizer in the system. The algorithms include the transaction and database managers. The former is responsible for generating and submitting requests while the latter processes the requests. The rollback mechanism is used by both managers. In this paper, we compare VT with the MVTO version based on Bernstein’s algorithm [1]. A complete description of VT and MVTO can be found in [2].
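A minimal sketch of this partial-rollback discipline is given below, with invented names and an abstract state: on receiving a straggler (a message whose timestamp is smaller than the local virtual clock), only the logged actions with larger virtual times are undone before processing resumes. Anti-messages and the re-execution of the undone messages, which a full Time Warp implementation needs, are omitted.

  #define MAX_LOG 256

  /* Timestamped message exchanged between managers. */
  typedef struct { long vt; int payload; } Message;

  /* One executed action and the state it started from, kept for possible rollback. */
  typedef struct { long vt; int state_before; } LogEntry;

  typedef struct {
      long     lvt;           /* local virtual clock of the object */
      int      state;         /* abstract local state              */
      LogEntry log[MAX_LOG];  /* processed actions, in vt order    */
      int      n_log;
  } Object;

  static void execute(Object *o, const Message *m)
  {
      o->log[o->n_log].vt           = m->vt;   /* snapshot for a later rollback */
      o->log[o->n_log].state_before = o->state;
      o->n_log++;
      o->state += m->payload;                  /* stand-in for the real action  */
      o->lvt    = m->vt;
  }

  /* Receive a message; if it is a straggler, undo only the causally later actions. */
  static void receive(Object *o, const Message *m)
  {
      if (m->vt < o->lvt) {
          while (o->n_log > 0 && o->log[o->n_log - 1].vt > m->vt) {
              o->n_log--;
              o->state = o->log[o->n_log].state_before;   /* roll back one action */
          }
          o->lvt = (o->n_log > 0) ? o->log[o->n_log - 1].vt : 0;
      }
      execute(o, m);
  }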
3 Simulation Experiments
All experiments were conducted on 12 bus-based, message-passing Sun Microsystems SPARCstations.¹ The SPARCstations were connected into a local area network (LAN) via Ethernet. It should be pointed out that in distributed database systems there are more structures than in distributed simulation environments; see [2] for more details. A simple database model was used. Each database manager was responsible for a specified number of items. Whenever a transaction requested a Read or Write access to an item, it was delayed 50 milliseconds to simulate the physical I/O operations. The goals of our experiments are to study the viability of virtual time as a concurrency control method in database systems, and to compare its performance with the multiversion timestamp ordering synchronization scheme. In our model, we made use of several database/transaction sizes and inter-transaction delays, and varied the number of processors from 4 to 12. The experimental data which follow were obtained by averaging several trial runs. Every test run required between 8 and 50 minutes of actual time to complete all experiments. The average length of the experiments was 24 minutes. Space limitations preclude us from including all of our experiments. Hence, we present our results below in the form of graphs of the response time and the throughput. The response time is defined as the average amount of wall clock time per processor required for a transaction manager to successfully submit a fixed number of transactions.
¹ We use a distributed-memory environment.
The throughput is defined as (K · Nt)/Tresp, where Tresp represents the total response time of all processors, K is the number of processors used, and Nt is the number of transactions.
Fig. 1. Response Time vs. Database Size (MVTO vs. VT).

Fig. 2. Response Time vs. Number of Processors (MVTO vs. VT).
In our experiments, we analyzed the effect of the database size on the performance (response time and throughput) of the DDBMS synchronized by virtual time. We made use of 8 processors. We varied the database size from 1 to 20. The transaction size was set to 7% of the database size. All Read requests were updated with a probability of 25%. The inter-transaction delay was set to zero. In Fig. 1, we present the response time as a function of the database size for both synchronization schemes, VT and MVTO. The results obtained show that MVTO and VT exhibit about the same response time when the database size is less than 12. Then, when we increase the database size from 12 to 20, VT responds faster than MVTO. We observe about a 25% reduction in the response time when we use a database size of 20. In the next set of experiments, in order to reduce the response time of the DDBMS, we ran some preliminary tests in which we varied the number of processors from 4 to 12. The results obtained for the response time are shown in Figure 2. As expected, we observe that as we increase the number of processors, the response time decreases for both synchronization schemes. Furthermore, we see that VT performs better than MVTO. For example, we see a reduction of the response time of between 25-30% with the VT scheme when compared to the MVTO one. Similar results were obtained for the throughput metric [2].
References
1. P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
2. A. Boukerche, T. LeMaster, Sajal K. Das, and A.K. Datta, “Virtual Time Synchronization in Distributed Systems”, Tech. Rep. TR-98, University of North Texas.
3. A. Boukerche, C. Tropper, “Parallel Simulation on the Hypercube Multiprocessor”, Distributed Computing, Springer-Verlag, 1993.
4. D. Jefferson and A. Motro, “The Time Warp Mechanism for Database Concurrency Control,” International Conference on Data Engineering, 1986, 474–481.
5. D.R. Jefferson, “Virtual Time,” ACM Transactions on Programming Languages and Systems, Vol. 7, No. 3, 1985, 404–425.
Load Balancing and Processor Assignment Statements

C. Rodríguez, F. Sande, C. León, I. Coloma, and A. Delgado

Dpto. Estadística, Investigación Operativa y Computación, Universidad de La Laguna, Tenerife, Spain
Abstract. This paper presents extensions of processor assignment statements, the language constructs to support them, and the methodology for their implementation on distributed memory computers. The data exchange produced by these algorithms follows a generalization of the concept of a hypercubic communication pattern. Due to the efficiency of the introduced weighted dynamic hypercubic communication patterns, the performance of programs built using the methodology is comparable to that obtained using classical message passing programs. The concept of a balanced processor assignment statement eases the portability of PRAM algorithms to multicomputers.
1 Balanced Processor Assignment Statements
Algorithms for the PRAM [1] model are usually described using a parallel statement like:

  for all x in m..n in parallel do instruction(x)

It is essential to differentiate between the concepts of processor activation and processor assignment statements. The for all construction is used to denote processor assignment statements, which are usually charged with constant time, although authors avoid explaining this point. Load balancing is a fundamental issue in parallel programming. It is straightforward to implement nested processor assignment statements using cyclic and block distribution mappings. However, the division of the processors among the tasks in groups of equal sizes, provided by cyclic and block processor distribution policies, may produce, especially for irregular problems, an unfair balance. Any divide and conquer algorithm can be used to illustrate this idea. Since the complexities of the resulting subproblems after the division of the problem may be considerably different, splitting the current group of processors into groups of equal size will likely lead to work load imbalance. One way to solve this inefficiency is to expand the language with a statement allowing the programmer to modify the processor distribution policy. This can be achieved by introducing a simple variant of the for all statement that provides extra heuristic information w(x) regarding the suitable number of processors required for each task instruction(x):

  for all x in a..b with weight w(x) in parallel do instruction(x)
The effect of this new “balanced parallel assignment statement” is to adjust the number of processors according to the weights attached to each instance of instruction(x). An example of this kind of construct can be found in the La Laguna C system (llc). The La Laguna C tool consists of a set of C macros and functions expanding the library model for message passing (PVM, MPI, Inmos C, etc.). It provides not only load balancing but also many other features, such as processor virtualization, nested parallelism and support for collective operations. Let us first consider the implementation in llc of the well-known quicksort algorithm. A first approach to the solution consists in using a classical processor assignment statement. In this case, the workload assigned to each processor depends on the selected pivot. The implementation of this solution in llc is what appears in Figure 1, substituting in line 6 the call to parweightvirtual by a call to the macro parvirtual.
1  void qs(int first, int last) {
2    int w1, w2, size1, size2, i, j;
3    if (first < last) {
4      partition(&i, &j, first, last);
5      size1 = w1 = (j - first + 1); size2 = w2 = (last - i + 1);
6      parweightvirtual(w1, qs(first,j), (A+first), size1,
7                       w2, qs(i,last), (A+i), size2);
8    } }
Fig. 1. Balancing the size of the groups according to the load
Figure 1 shows the use of one of the llc balanced parallel processor assignment statements, the macro parweightvirtual. When the macro parweightvirtual at line 6 is executed, the set of available processors is divided into two subsets, one executing the first call (second parameter) and the other the second one (sixth parameter). The third and fourth parameters, A + first and size1, indicate that the result of the call qs(first, j) on the first segment is constituted by the size1 integers beginning at A + first. The two last parameters, A + i and size2, describe the fact that the result of the second call is constituted by the size2 integers starting at A + i. The first and fifth parameters of the call (w1 and w2) provide heuristic information about the number of processors required by the corresponding sorting task. In this case they are simply the sizes of the subintervals to sort. At the end of the parallel call the two sets of processors exchange their results and rejoin to form the previous set. To increase performance, instead of using virtualization, the programmer can take control of the number of available processors through the use of the llc variable numprocessors. The former example is a Replicated problem. We define a Replicated problem (or parallel algorithm) as one in which the initial input data have to be replicated in all the processors and the solution is also required in all the processors.
1  void qs (int *size) {
2    int totalSize, i, j, pivot, sum, s;
3    if (numprocessors > 1) {
4      balance( *totalSize, (*size), A, decision, getWeight);
5      if (totalSize > numprocessors) {
6        reducebyadd( sum, A[*size/2]);
7        pivot = sum / numprocessors;
9        par( part( *size-1,p,&i,&j,&s), A+i, s,
             revPart( *size-1,p,&i,&j,&s), A+i, s);
10       *size = j + 1 + s;
11       split( qs(size), qs(size)); }
12     else qsSeq( 0, (*size-1)); }
13   else qsSeq( 0, (*size-1));
14 }
Fig. 2. Distributed quicksort
The example in Figure 2 presents the use of processor assignment statements in a Distributed problem. A Distributed problem is one in which both the input data and the results are distributed among the processors. The initial array is distributed among the processors and the goal is to have the items sorted according to the processor name. Through the all-to-all reduction reducebyadd, the common pivot is chosen in lines 6 and 7. The current set of processors is divided into two subsets of equal size using the par macro in line 9: one executes the classical partition procedure part, and the other executes a reversed version of partition, revPart, that stores elements greater than the pivot in the first part of the array (from 0 to j) and stores the s items smaller than the pivot between i and size − 1. The results are interchanged and the size of the array is updated in line 10. The recursive call to the macro split divides the group using the same processor distribution policy as the former call to the par macro. The work load is equilibrated by the call to the macro balance at line 4. This La Laguna C macro generalizes the dimension exchange algorithm [4] for hypercubes. The parameters decision and getWeight are provided by the user. When called, getWeight returns a measure of the difficulty of the problem. From the input weights of two arrays of problems, decision decides the number of items in the array of problems exceeding the ideal. Furthermore, balance returns in totalSize the sum of the sizes of the arrays in the current group.
2 Weighted Dynamic Hypercube Topologies
When nested unbalanced or balanced assignment statements are executed on a distributed system, each processor computes its partners in the other subsets created. Let us consider a set of nested balanced processor assignment statements:
  for all x1 in 1..n1 with weight w1(x1) in parallel do
    for all x2 in 1..n2 with weight w2(x2) in parallel do
      ...
      for all xm in 1..nm with weight wm(xm) in parallel do
        instruction(x1, x2, ..., xm)

The resulting communication pattern becomes a sort of distorted dynamic hypercube. Each nested balanced processor assignment statement creates a “new dimension” in the current set H of available processors. In the jth nested call, the set H is partitioned into a collection of nj subsets Γ = {H1, ..., Hnj} whose cardinalities are distributed according to the weights of the statements. Each processor p in each of the newly created sets Ht has at least one partner or neighbor in each of the other subsets in the collection Γ − {Ht}. In this way a graph is associated with the nesting of balanced processor assignments in which links join partner processors. The way partners are selected in the other groups is called a partnership policy. The family of topologies defined this way is called weighted dynamic hypercubes. Although the formulas for the algorithm to support nested load balanced processor assignment statements are by far more convoluted than those for the unbalanced case, they still reduce to elementary arithmetic and logical operations. This simplicity and the good performance of current multicomputers on weighted hypercube connection patterns explain the efficiency of the implementation.
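A rough sketch of how the cardinalities of the subsets H1, ..., Hn could be derived from the weights is shown below; the rounding policy (floor, then round-robin distribution of the leftover processors) is an assumption and not a description of llc's actual formulas.

  /* Split nprocs processors among n tasks proportionally to their weights w[0..n-1].
   * Every task receives at least one processor; if the weights force more shares
   * than processors, llc's processor virtualization would take over (not shown). */
  static void split_group(int nprocs, const double *w, int n, int *share)
  {
      double total = 0.0;
      int assigned = 0;
      for (int i = 0; i < n; i++) total += w[i];
      for (int i = 0; i < n; i++) {
          share[i] = (int)(nprocs * (w[i] / total));
          if (share[i] < 1) share[i] = 1;
          assigned += share[i];
      }
      for (int i = 0; assigned < nprocs; i = (i + 1) % n) {  /* leftovers, round robin */
          share[i]++;
          assigned++;
      }
  }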
3 Computational Results
Our experiments were performed on an IBM SP2 computer and a Silicon Graphics Origin 2000. The IBM SP2 nodes (Power2-thin2 processors) are connected through a synchronous multistage packet switch with wormhole routing providing a real data transfer rate of 22 Mb/s [5]. The Origin 2000 is a 32 + 32 R10000 processor (196 MHz) machine with 8 GB main memory, 288 GB disk and a 25.08 Gflop/s theoretical peak. For each number of processors and for each algorithm, the columns labeled S in Table 1 show the average speedup over ten experiments against the corresponding best sequential algorithm. The first and fifth pairs of columns of the table compare the results obtained for the llc parallel quicksort algorithm of Figure 1 (label balvirt) against a message passing parallel quicksort algorithm for a hypercube multicomputer due to Professor P. B. Hansen [2] (label bh). The Hansen algorithm achieves a perfect load balance through the use of the find procedure due to Hoare [3]. This procedure divides the array to sort into two equal halves. On both architectures the La Laguna C times are slightly better than those obtained by the Hansen algorithm. This advantage can be explained by the different approach taken in the implementation of nested parallelism. While nested parallelism produces a processor assignment in the llc version, the implicit solution used by the Hansen algorithm roughly corresponds to a processor activation.
The constant complexity of the processor assignment algorithm used by the llc system is smaller than the logarithmic complexity of the processor activation algorithm. No inefficiency is introduced by the methodology, in spite of the ease it brings to the expression of parallel algorithms.
Table 1. Speedups (S) and Speedup Standard Deviations (SSD) for the parallel quicksort. IBM SP2 and Origin 2000, 7M integer array.

                     bh           unbal         manual        balan
           procs   S     SSD    S     SSD    S     SSD    S     SSD
  SP2         2   1.49  0.05   1.27  0.17   1.53  0.05   1.23  0.20
              4   2.02  0.08   2.94  0.37   2.15  0.10   2.41  0.39
              8   2.42  0.12   2.77  0.66   2.66  0.14   3.45  0.31
             16   2.62  0.20   3.48  1.01   2.99  0.15   4.72  0.30
  Origin      2   1.59  0.04   1.27  0.16   1.61  0.04   1.23  0.16
              4   2.20  0.08   1.95  0.31   2.31  0.08   2.46  0.33
              8   2.68  0.13   2.74  0.52   2.88  0.13   3.60  0.24
Since all the N elements of the array to sort have to be received by processor 0, the size of the array N constitutes a lower bound on the Timepar of any parallel sorting algorithm. Therefore, the speedup of any replicated sorting algorithm is bounded by the logarithm of the array size.

Theorem 1. If N is the size of the array to sort, then it holds:

  Speedup = N log(N) / Timepar ≤ N log(N) / N = log(N)

This theorem explains the poor speedup in Table 1. As a consequence, the only way to increase efficiency is to consider huge arrays, much larger than the ones considered in these experiments. The results in the four rightmost speedup columns compare three different load balance approaches. The unbal label corresponds to the algorithm in Figure 1, but with line 6 using the unbalanced processor assignment statement par. The pivot is selected as the average of three randomly selected elements. The amount of work assigned to each processor depends on the selected pivot. The manual label stands for an llc program that uses Hoare's find procedure and the unbalanced processor assignment statement. The programmer takes responsibility for load balancing. The results of the algorithm of Figure 1 using the parweight construct instead of parweightvirtual are presented under the label balan. The performance obtained using balanced processor assignment statements is even better than that reached using a hand-written specific load balancing algorithm designed by an experienced programmer. The label balvirt corresponds to the algorithm in Figure 1. Processor virtualization does not introduce appreciable inefficiency. The columns labeled SSD present, for each
algorithm, the Speedup Standard Deviation of the experiments. Although the average speedup of the unbal entries is comparable to that of the manual entries, the standard deviation shows the unpredictability of its behavior. At the other extreme, the manual case presents the smallest deviation. The balan and balvirt versions keep the standard deviation in an acceptable range. When the number of processors grows, the standard deviation remains almost constant. The Origin 2000 scales slightly better than the IBM SP2. The average speedup values for the distributed parallel quicksort algorithm presented in Figure 2 with eight processors are 6.66 and 6.90 for the IBM SP2 and Silicon Origin respectively. These values were obtained as the average of 10 experiments with 16M integer arrays randomly generated according to a uniform distribution. The speedups show a better behaviour than for the replicated problems.
4 Conclusions
A methodology to efficiently implement balanced nested processor assignment statements in constant time was presented. The algorithm to implement the nesting of such statements can be used to build tools providing the capability to mix parallelism and recursion. A non-negligible advantage is that their performance can be predicted using the model presented in [5]. The computational results prove that the efficiency of this approach is comparable to that obtained using specific message passing algorithms. From an efficiency point of view, the results obtained for the different approaches considered are quite similar, so the advantage of the proposed paradigms must not be seen from this perspective. The most important benefit resides in the fact that it provides a comfortable framework to translate PRAM algorithms to standard message-passing libraries without loss of performance.

Acknowledgements. This research has been done using the resources at the Centre de Computació i Comunicacions de Catalunya (CESCA-CEPBA).
References
1. Fortune S., Wyllie, J.: Parallelism in Random Access Machines. STOC (1978) 114–118
2. Hansen P. B.: Do hypercubes sort faster than tree machines? Concurrency: Practice and Experience 6 (2) (1994) 143–151
3. Hoare, C. A. R.: Proof of a Program: Find. Communications of the ACM 14, 39–45
4. Murphy, T.A., Vaughan, J.: On the Relative Performance of Diffusion and Dimension Exchange Load Balancing in Hypercubes. Proceedings of the PDP’97. IEEE Computer Society Press (1997) 29–34
5. Rodríguez, C., Roda, J.L., Morales, D. G., Almeida, F.: h-relation Models for Current Standard Parallel Platforms. Proceedings of the Euro-Par’98. Springer Verlag LNCS (1998)
Mutual Exclusion Between Neighboring Nodes in a Tree That Stabilizes Using Read/Write Atomicity

Gheorghe Antonoiu¹ and Pradip K. Srimani¹

¹ Department of Computer Science, Colorado State University, Ft. Collins, CO 80523

Abstract. Our purpose in this paper is to propose a new protocol that can ensure mutual exclusion between neighboring nodes in a tree structured distributed system, i.e., under the given protocol no two neighboring nodes can execute their critical sections concurrently. This protocol can be used to run a serial model self-stabilizing algorithm in a distributed environment that accepts as atomic operations only send a message, receive a message and update a state. Unlike the scheme in [1], our protocol does not use time-stamps (which are basically unbounded integers); our algorithm uses only bounded integers (actually, the integers can assume values only 0, 1, 2 and 3) and can be easily implemented.
1 Introduction
Because of the popularity of the serial model and the relative ease of its use in designing new self-stabilizing algorithms, it is worthwhile to design lower level self-stabilizing protocols such that an algorithm developed for a serial model can be run in a distributed environment. This approach was used in [1] and can be compared with the layered approach used in network protocol stacks. The advantage of such a lower level self-stabilizing protocol is that it makes the job of the self-stabilizing application designer easier; one can work with the relatively easier serial model and does not have to worry about message management at the lower level. Our purpose in this paper is to propose a new protocol that can be used to run a serial model self-stabilizing algorithm in a distributed environment that accepts as atomic operations only send a message, receive a message and update a state. Unlike the scheme in [1], our protocol does not use time-stamps (which are basically unbounded integers); our algorithm uses only bounded integers (actually, the integers can assume values only 0, 1, 2 and 3) and can be easily implemented. Our algorithm is applicable to distributed systems whose underlying topology is a tree. It is interesting to note that the proposed protocol can be viewed as a special class of self-stabilizing distributed mutual exclusion protocol. In traditional distributed mutual exclusion protocols, self-stabilizing or non self-stabilizing (for references on self-stabilizing distributed mutual exclusion protocols, see [2], and for non self-stabilizing distributed mutual exclusion protocols, see [3,4]), the objective is to ensure that only one node in the system can execute its critical
section at any given time (i.e., critical section execution is mutually exclusive from all other nodes in the system); the objective in our protocol, as in [1], is to ensure that a node executes its critical section mutually exclusively from its neighbors in the system graph (as opposed to all nodes in the system), i.e., multiple nodes can execute their critical sections concurrently as long as they are not neighbors of each other; in the critical section the node executes an atomic step of a serial model self-stabilizing algorithm.
2 Model
We model the distributed system, as used in this paper, by an undirected graph G = (V, E). The nodes represent the processors while the symmetric edges represent the bidirectional communication links. We assume that each processor has its unique id. Each node x maintains one or more local state variables and one or more local state vectors (one vector for each local variable) that are used to store copies of the local state variables of the neighboring nodes. Each node x maintains an integer variable Sx denoting its status; the node also maintains a local state vector LSx where it stores the copies of the status of its neighbors (this local state vector contains dx elements, where dx is the degree of the node x, i.e., x has dx many neighbors); we use the notation LSx(y) for the local state vector element of node x that keeps a copy of the local state variable Sy of neighbor node y. The local state of a node x is defined by its local state variables and its local state vectors. A node can both read and write its local state variables and its local state vectors; it can only read (and not write) the local state variables of its neighboring nodes, and it neither reads nor writes the local state vectors of other nodes. A configuration of the entire system, or a system state, is defined to be the vector of local states of all nodes. Next, our model assumes the read/write atomicity of [2] (as opposed to the composite read/write atomicity as in [5,6]). An atomic step (move) of a processor node consists of either reading a local state variable of one of its neighbors (and updating the corresponding entry in the appropriate replica vector), or some internal computation, or writing one of its local state variables; any such move is executed in finite time. Execution of a move by a processor may be interleaved with moves by other nodes – in this case the moves are concurrent. We use the notation N(x) = {nx[1], ..., nx[dx]} to denote the set of neighbors of node x. The process executed by each node x consists of an infinite loop of a finite sequence of Read, Conditional critical section and Write moves.

Definition 1. An infinite execution is fair if it contains an infinite number of actions of each type.

Remark 1. The purpose of our protocol is to ensure mutual exclusion between neighboring nodes. Each node x can execute its critical section iff the predicate Φx is true at node x. Thus, in a legitimate system state, mutual exclusion is ensured iff Φx ⇒ ∀y such that y is a neighbor of x, ¬Φy,
i.e., as long as node x executes its CS, no neighbor y of node x can execute its CS (Φy is false for every neighbor y of x).
3 Self-Stabilizing Balance Unbalance Protocol for a Pair of Processes
We present an approach that, unlike that in [2], does not use any shared variables. The structure of the two processes A and B is the same, i.e., the infinite loop at each processor consists of an atomic read operation, critical section execution if a certain predicate is true, and an atomic write action. Sa and LSa(b) are two local variables maintained by process A (LSa(b) is the variable maintained at process A to store a copy of the state of its neighbor process B); process A can write both Sa and LSa(b) and process B can only read Sa. Similarly, Sb and LSb(a) are variables local to process B: process B can write both Sb and LSb(a) and process A can only read Sb. The difference is that the variables are now ternary, i.e., they can assume the values 0, 1 or 2. The proposed algorithm is shown in Figure 1:
Process A                                         Process B
  Ra : LSa(b) = Sb;                                 Rb : LSb(a) = Sa;
  CSa: if (Sa = 0) then Execute CS                  CSb: if (Sb = 1) then Execute CS
  Wa : if (LSa(b) = Sa) then Sa = (Sa + 1) mod 3    Wb : Sb = LSb(a);
Fig. 1. Self-Stabilizing Balance Unbalance Protocol for a Pair of Processes
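A small C simulation of the protocol of Figure 1 may help in following the proofs below. The four variables are started from arbitrary values in Z3 and the Read/Write moves of the two processes are interleaved at random (one of many possible fair schedules); after finitely many steps the run settles into the alternation in which the conditions Sa = 0 and Sb = 1 never hold at the same time.

  #include <stdio.h>
  #include <stdlib.h>

  static int Sa, LSa_b, Sb, LSb_a;                 /* all variables range over Z3 */

  static void Ra(void) { LSa_b = Sb; }             /* A reads B's status          */
  static void Wa(void) { if (LSa_b == Sa) Sa = (Sa + 1) % 3; }
  static void Rb(void) { LSb_a = Sa; }             /* B reads A's status          */
  static void Wb(void) { Sb = LSb_a; }

  int main(void)
  {
      srand(42);
      Sa = rand() % 3; LSa_b = rand() % 3;         /* arbitrary, possibly illegitimate */
      Sb = rand() % 3; LSb_a = rand() % 3;
      int a_read_next = 1, b_read_next = 1;        /* each process alternates R and W  */
      for (int step = 0; step < 40; step++) {
          if (rand() % 2) { if (a_read_next) Ra(); else Wa(); a_read_next ^= 1; }
          else            { if (b_read_next) Rb(); else Wb(); b_read_next ^= 1; }
          printf("step %2d: Sa=%d Sb=%d  A-may-enter-CS=%d B-may-enter-CS=%d\n",
                 step, Sa, Sb, Sa == 0, Sb == 1);
      }
      return 0;
  }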
Since each of the processes A and B executes an infinite loop, after an Ra action the next “A” type action is Wa, after a Wa action the next “A” type action is Ra, after an Rb action the next “B” type action is Wb, after a Wb action the next “B” type action is Rb, and so on.

Remark 2. An execution of the system is an infinite execution of the processes A and B, and hence an execution of the system contains an infinite number of each of the actions from the set {Ra, Wa, Rb, Wb}; thus, the execution is fair. The system may start from an arbitrary initial state and the first action in the execution of the system can be any arbitrary one from the set {Ra, Wa, Rb, Wb}. Note that the global system state is defined by the variables Sa and LSa(b) in process A and the variables Sb and LSb(a) in process B.

Remark 3. When a process makes a move (the system executes an action), the system state may or may not be changed. For example, in a system state where Sa ≠ LSa(b), the move Wa does not modify the system state, i.e., the system remains in the same state after the move.
Definition 2. A move (action) that modifies the system state is called a modifying action.

Our objective is to show that the system, when started from an arbitrary initial state (possibly illegitimate), converges to a legitimate state in finite time (after finitely many actions by the processes). We introduce a new binary relation.

Definition 3. We write x ⪰ y if x = y or x = (y + 1) mod 3, where x, y ∈ Z3.

Remark 4. The relation ⪰ is neither reflexive, nor transitive, nor symmetric. For example, 1 ⪰ 0, 2 ⪰ 2 and 2 ⪰ 1 hold, while 2 ⪰ 0 and 0 ⪰ 1 do not.

Definition 4. Consider the ordered sequence of variables (Sa, LSb(a), Sb, LSa(b)); a system state is legitimate if (i) Sa ⪰ LSb(a) ∧ LSb(a) ⪰ Sb ∧ Sb ⪰ LSa(b) and (ii) at most one pair of successive variables in the previous sequence are unequal.

Example 1. For example, {Sa = 1, LSa(b) = 0, Sb = 0, LSb(a) = 1} is a legitimate state while {Sa = 2, LSa(b) = 1, Sb = 1, LSb(a) = 0} is not.

Theorem 1. In a legitimate state the two processes A and B execute their critical sections in a mutually exclusive way, i.e., if process A is executing its CS then process B cannot execute its CS and vice versa, i.e., Sa = 0 ⇒ Sb ≠ 1 and Sb = 1 ⇒ Sa ≠ 0.

Proof. The proof is obvious since in a legitimate state Sa ⪰ LSb(a) ∧ LSb(a) ⪰ Sb ∧ Sb ⪰ LSa(b) and at most one pair of successive variables in the sequence can be unequal.

Theorem 2. Any arbitrary move from the set {Ra, Wa, Rb, Wb} made in a legitimate state of the system leads to a legitimate state after the move.

Proof. Since there are only four possible moves, it is easy to check the validity of the claim. For example, consider the move Ra; the variable LSa(b) is affected; if LSa(b) = Sb before the move, this move does not change the system state; if LSa(b) ≠ Sb before the move, then the system state before the move must satisfy Sa = LSb(a) = Sb (since it is legitimate) and after the move it will satisfy Sa = LSb(a) = Sb = LSa(b) (hence, the resulting state is legitimate).

Lemma 1. Any infinite fair execution contains infinitely many modifying actions (see Definition 2).

Proof. By contradiction. Assume that after a finite number of moves Sa does not change its value anymore. Then after a complete loop executed by process B, LSb(a) = Sa and Sb = Sa. In the next loop the process A must move, which contradicts our assumption. It is easy to see that if Sa changes its value infinitely many times, any other variable changes its value infinitely many times.
Lemma 2. For any given fair execution and for any initial state, a state in which three variables from the set {Sa, LSb(a), Sb, LSa(b)} are equal to each other is reached in a finite number of moves.

Proof. Consider the first move that modifies the state of Sa. After this move Sa ≠ LSa(b). To change the value of Sa again, the variable LSa(b) must change its value and become equal to Sa. But LSa(b) always takes the value of Sb. Since Sa changes its value infinitely many times, a state such that Sa = LSa(b) and LSa(b) = Sb is reached in a finite number of moves.

Lemma 3. For any given fair execution and for any initial state, a state such that Sa ≠ LSa(b), Sa ≠ Sb, Sa ≠ LSb(a) is reached in a finite number of moves.

Proof. Since we use addition modulo 3, the variables Sa, LSb(a), Sb, LSa(b) can have values in the set Z3 = {0, 1, 2}. When three variables from the set {Sa, LSb(a), Sb, LSa(b)} are equal to each other (Lemma 2), there is a value i ∈ Z3 such that Sa ≠ i, LSa(b) ≠ i, Sb ≠ i, LSb(a) ≠ i. When LSb(a), Sb, LSa(b) change their values, they only copy the value of one variable in the set {Sa, LSb(a), Sb, LSa(b)}. Thus, when Sa reaches the value i for the first time, the condition Sa ≠ LSa(b), Sa ≠ Sb, Sa ≠ LSb(a) is met.

Theorem 3. For any given fair execution, the system starting from any arbitrary state reaches a legitimate state in finitely many moves.

Proof. The system reaches a state such that Sa ≠ LSa(b), Sa ≠ Sb, Sa ≠ LSb(a) in a finite number of moves (Lemma 3). Let Sa = i ∈ Z3. Since LSb(a) ≠ i, Sb ≠ i and LSa(b) ≠ i, Sa cannot change its value until LSa(b) becomes equal to i, LSa(b) cannot become equal to i until Sb becomes equal to i, and Sb cannot become equal to i until LSb(a) becomes equal to i. Thus, Sa cannot modify its state until a legitimate state is reached.
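The legitimacy test of Definition 4 is simple enough to state as code; the function below checks the chain (Sa, LSb(a), Sb, LSa(b)) using the relation of Definition 3. With the modulus set to 4 instead of 3, the same test expresses the link legitimacy used in the next section; the function signature is of course only an illustration.

  #include <stdbool.h>

  /* x "succeeds" y in the sense of Definition 3: x == y or x == (y + 1) mod m. */
  static bool succ_rel(int x, int y, int m)
  {
      return x == y || x == (y + 1) % m;
  }

  /* chain[] holds (Sa, LSb(a), Sb, LSa(b)); the state is legitimate iff every
   * consecutive pair is related and at most one consecutive pair is unequal.
   * Use m = 3 here, and m = 4 for a link (Sx, LSy(x), Sy, LSx(y)) in Section 4. */
  static bool legitimate(const int chain[4], int m)
  {
      int unequal = 0;
      for (int i = 0; i < 3; i++) {
          if (!succ_rel(chain[i], chain[i + 1], m)) return false;
          if (chain[i] != chain[i + 1]) unequal++;
      }
      return unequal <= 1;
  }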
4 Self-Stabilizing Mutual Exclusion Protocol for a Tree Network Without Shared Memory
Consider an arbitrary rooted tree; the tree is rooted at node r. We use the notation dx for the degree of node x, nx[j] for the j-th neighbor of node x, N(x) = {nx[1], ..., nx[dx]} for the set of neighbors of node x, C(x) for the set of children of x, and Px for the parent of node x; since the topology is a tree, each node x knows its parent Px, and for the root node r, Pr is Null. As before, each node x maintains a local state variable Sx (readable by its neighbors) and a local state vector LSx used to store the copies of the states of its neighbors; we use the notation LSx(y) to denote the component of the state vector LSx that stores a copy of the state variable Sy of node y, ∀y ∈ N(x). All variables are now modulo 4 integers (we explain the reason later). We assume that each node x maintains a height variable Hx such that Hr = 0 for the root node and, for all x ≠ r, Hx is the number of edges in the unique path from node x to the root node.
Root node r
  Rr1  : LSr(nr[1]) = Snr[1];
   ...
  Rrdr : LSr(nr[dr]) = Snr[dr];
  CSr  : if Sr = 0 then Execute Critical Section;
  Wr   : if ∧_{y∈C(r)} (Sr = LSr(y)) then Sr = (Sr + 1) mod 4;

Leaf node y
  Ry1  : LSy(Py) = SPy;
  CSy  : if Φ(y) then Execute Critical Section;
  Wy   : Sy = LSy(Py);

Internal node x
  Rx1  : LSx(nx[1]) = Snx[1];
   ...
  Rxdx : LSx(nx[dx]) = Snx[dx];
  CSx  : if Φ(x) then Execute Critical Section;
  Wx   : if ∧_{y∈C(x)} (Sx = LSx(y)) then Sx = LSx(Px);
Fig. 2. Protocol for an arbitrary tree.

It is easy to see that if the root node sets Hr = 0 and any other node x sets its Hx to HPx + 1 (the level of its parent plus 1), the height variables will correctly indicate the height of each node in the tree after a finite number of moves, starting from any illegitimate values of those variables. To avoid cluttering the algorithm, we do not include the rules for handling the Hx variable in our algorithm specification. As before, the root node, internal nodes as well as the leaf nodes execute infinite loops of reading the states of their neighbor(s), executing critical sections (if a certain predicate is satisfied) and writing their new state. The protocols (algorithms) for the root, internal nodes and leaf nodes are shown in Figure 2, where the predicate Φ(x) is:

  Φ(x) = (Sx = 0 ∧ (Hx is even)) ∨ (Sx = 2 ∧ (Hx is odd))

Note, as before, the state of a node x is defined by the variable Sx and the vector LSx; the global system state is defined by the local states of all participating nodes in the tree.

Definition 5. Consider a link or edge (x, y) such that node x is the parent of node y in the given tree. The state of a link (x, y), in a given system state, is
defined to be the vector (Sx, LSy(x), Sy, LSx(y)). The state of a link is called legitimate iff Sx ⪰ LSy(x) ∧ LSy(x) ⪰ Sy ∧ Sy ⪰ LSx(y) and at most one pair of successive variables in the vector (Sx, LSy(x), Sy, LSx(y)) are unequal.

Definition 6. The system is in a legitimate state if all links are in a legitimate state.

Theorem 4. For an arbitrary tree in any legitimate state, no two neighboring processes can execute their critical sections simultaneously.

Proof. In a legitimate state (when the H variables at the nodes have stabilized), for any two neighboring nodes x and y we have either (Hx is even and Hy is odd) or (Hx is odd and Hy is even). Hence, Φ(x) and Φ(y) are simultaneously (concurrently) true iff Sx = 2 and Sy = 0 or Sx = 0 and Sy = 2. But since link (x, y) is in a legitimate state, such a condition cannot be met; hence, two neighboring nodes cannot execute their critical sections simultaneously.

Lemma 4. Consider an arbitrary link (x, y). The node x modifies the value of the variable Sx infinitely many times if and only if the node y modifies the value of the variable Sy infinitely many times.

Proof. If node x modifies Sx finitely many times, then after a finite number of moves the value of Sx is not modified anymore. The next complete loop of node y after the last modification of Sx makes LSy(x) = Sx, and LSy(x) is not modified by subsequent moves. Hence, after at most one modification, Sy remains unchanged. Conversely, if node y modifies Sy finitely many times, then after a finite number of moves the value of Sy is not modified anymore. The next complete loop of node x after the last modification of Sy makes LSx(y) = Sy, and LSx(y) is not modified by subsequent moves. Hence, after at most one modification, Sx remains unchanged.

Lemma 5. For any fair execution, the variable Sr at the root node r is modified infinitely many times.

Proof. By contradiction. Assume that the value of Sr is modified finitely many times. Then, after a finite number of moves, the value Sy for each child y of r will not be modified anymore (Lemma 4). Repeating the argument, after a finite number of moves no S or LS variables for any node in the tree may be modified. Consider now the leaf nodes. Since the execution is fair, the condition Sz = LSz(Pz) must be met for each leaf node z. If this condition is met for leaf nodes, it must be met for the parents of leaf nodes too. Repeating the argument, the condition must be met by all nodes in the tree. But, if this condition is met, the root node r modifies its Sr variable in its next move, which is a contradiction.

Lemma 6. For any fair execution and for any node x, the variable Sx is modified infinitely many times.
Proof. If node x modifies its variable Sx finitely many times, its parent, say node z, must modify its variable Sz only finitely many times (Lemma 4). Repeating the argument, the root node also modifies its variable Sr finitely many times, which contradicts Lemma 5.

Lemma 7. Consider an arbitrary node z (z ≠ r). If all links in the path from r to z are in a legitimate state, then these links remain in a legitimate state after an arbitrary move by any node in the system.

Proof. Let (x, y) be a link in the path from r to z. Since (x, y) is in a legitimate state, we have Sx ⪰ LSy(x) ∧ LSy(x) ⪰ Sy ∧ Sy ⪰ LSx(y) and at most one pair of successive variables in the sequence (Sx, LSy(x), Sy, LSx(y)) are unequal. We need to consider only those system moves (executed by nodes x and y) that can change the variables Sx, LSy(x), Sy, LSx(y). Considering each of these moves, we check that the legitimacy of the link state is preserved in the resulting system state. The read moves (that update the variables LSy(x), LSx(y)) obviously preserve legitimacy. To consider the move Wx, we look at two cases:

Case 1 (x = r): When Wx is executed, Sx can be modified (incremented by 1) only when Sx = LSx(y). Thus, since the state is legitimate, a Wx move can increment Sx only under the condition Sx = LSy(x) = Sy = LSx(y), and after the move the link (x, y) remains in a legitimate state.

Case 2 (x ≠ r): Since the link (t, x) (where node t is the parent of x, i.e., t = Px) is in a legitimate state, we have LSx(t) = Sx or LSx(t) = (Sx + 1) mod 4. When Wx is executed, Sx can be modified only by setting its value equal to LSx(t); hence, after the move the link (x, y) remains in a legitimate state.

Lemma 7 shows that if a path of legitimate links from the root to a node is created, the legitimacy of the links in this path is preserved for all subsequent states. The next lemma shows that a new link is added to such a path in finite time.

Lemma 8. Let (x, y) be a link in the tree. If all links in the path from the root node to node x are in a legitimate state, then the link (x, y) becomes legitimate in a finite number of moves.

Proof. First, we observe that the node y modifies the variable Sy infinitely many times (Lemma 6). Then, we use the same argument as in the proof of Theorem 3 to show that the link (x, y) becomes legitimate after a finite number of moves.

Theorem 5. Starting from an arbitrary state, the system reaches a legitimate state in finite time (in finitely many moves).

Proof. The first step is to prove that all links from the root node to its children become legitimate after a finite number of moves. This follows from the observation that each child x of the root node modifies its S variable infinitely many times and from an argument similar to the one used in the proof of Theorem 3. Using Lemma 8, the theorem follows.
Irreversible Dynamos in Tori
P. Flocchini (Université de Montréal, Canada), E. Lodi (Università di Siena, Italy), F. Luccio (Università di Pisa, Italy), L. Pagli (Università di Pisa, Italy), and N. Santoro (Carleton University, Canada)
Abstract. We study the dynamics of majority-based distributed systems in presence of permanent faults. In particular, we are interested in the patterns of initial faults which may lead the entire system to a faulty behaviour. Such patterns are called dynamos and their properties have been studied in many different contexts. In this paper we investigate dynamos for meshes with different types of toroidal closures. For each topology we establish tight bounds on the number of faulty elements needed for a system break-down, under different majority rules.
1 Introduction
Consider the following repetitive process on a synchronous network G: initially each vertex is in one of two states (colors), black or white; at each step, all vertices simultaneously (re)color themselves either black or white, each according to the color of the “majority” of its neighbors (majority rule). Different processes occur depending on how majority is defined (e.g., simple, strong, weighted) and on whether or not the neighborhood of a vertex includes that vertex. The problem is to study the initial configurations (assignment of colours) from which, after a finite number of steps, a monochromatic fixed point is reached, that is, all vertices become of the same colour (e.g., black). The initial set of black vertices is called a dynamo (short for “dynamic monopoly”); the study of dynamos was introduced by Peleg [17] as an extension of the study of monopolies. The dynamics of majority rules have been extensively studied in the context of cellular automata, and much effort has been concentrated on determining the asymptotic behaviors of different majority rules on different graph structures. In particular, it has been shown that, if the process is periodic, its period is at most two. Most of the existing research has focused, for example, on the study of the period-two behavior of symmetric weighted majorities on finite {0, 1}− and {0, . . . , p}-colored graphs [9,19], on the number of fixed points on finite {0, 1}-colored rings [1,2,10], on finite and infinite {0, 1}-colored lines [12,13], on the behaviors of infinite, connected {0, 1}-colored graphs [14]. Furthermore, dynamic majority has been applied to the immune system and to image processing [1,8]. Although the majority rule has been extensively investigated, not much is known regarding dynamos. Some results are known in terms of catastrophic fault
This work has been supported in part by MURST, NSERC, and FCAR.
patterns which are dynamos based on “one-sided” majority for infinite chordal rings (e.g., [5,15,20]). Further results are known in the study of monopolies, that is dynamos for which the system converges to all black in a single step [3,4,16]. Other more subtle definitions have been posed. An irreversible dynamo is one where the initial black vertices do not change their colour regardless of their neighbourhood. This is opposed to reversible dynamos, for which a vertex may switch colour several times according to a changing neighborhood. Among the latter, monotone reversible dynamos are the ones for which black vertices remain always black because the neighborhood never forces them to turn white. Recently, some general lower and upper bounds on the size of monotone dynamos have been established in [17], and a characterization of irreversible dynamos has been given for chordal rings in [7]. In this paper we consider irreversible dynamos in tori and we focus on their dimension, that is, the minimum number of initial black elements needed to reach the fixed point. The motivation for irreversible dynamos comes from fault-tolerance. Initial black vertices correspond to permanent faulty elements, and the white correspond to non-faulty. The faulty elements can induce a faulty behavior in their neighbors: if the majority of its neighbors is faulty (or has a faulty behavior), a non-faulty element will exhibit a faulty behavior and will therefore be indistinguishable from a faulty one. Irreversible dynamos are precisely those patterns of permanent initial faults whose occurrence leads the entire system to a faulty behavior (or catastrophe). In addition to its practical importance and theoretical interest, the study of irreversible dynamos gives insights on the class of monotone dynamos. In particular, all lower bounds established on the size of irreversible dynamos are immediately lower bounds for the monotone case. The torus is one of the simplest and most natural ways of connecting processors in a network. We consider different types of tori: the toroidal mesh (the classical architecture used in VLSI), the torus cordalis (also known as double-loop interconnection networks), and the torus serpentinus (e.g., used by ILLIAC IV). For each of these topologies we derive lower and upper bounds on the dimensions of dynamos. For a summary of results see Table 1. The upper bounds are constructive, that is we derive the initial black vertices constituting the dynamo, and we also analyze the completion time (i.e., the time necessary to reach the fixed point). Limiting the discussion to meshes with toroidal connections avoids examining border effects for some vertices, as would occur in simple meshes. Our results and techniques can be easily adapted to simple meshes. In the following, due to space limitations, all the proofs have been omitted; the interested reader is referred to [6].
2 Basic Definitions
Let us consider an m × n mesh M and denote by µi,j , 0 ≤ i ≤ m − 1, 0 ≤ j ≤ n − 1, a vertex of M .
Table 1. Bounds on the size of irreversible dynamos for tori of m × n vertices, with M = max{m, n}, N = min{m, n}, and H, K = m, n or H, K = n, m (choose the alternative that yields stricter bounds).

                      Simple majority                          Strong majority
                      Lower Bound            Upper Bound       Lower Bound   Upper Bound
Toroidal mesh         (m+n)/2 − 1            (m+n)/2 − 1       (mn+1)/3      (H/3)(K+1)
Torus cordalis        n/2                    n/2 + 1           (mn+1)/3      (m/3)(n+1)
Torus serpentinus     N/2, for M/3 ≥ N/2     N/2 + 1           (mn+1)/3      (H/3)(K+1)
                      M/3, for M/3 < N/2
The differences among the considered topologies consist only in the way that the border vertices (i.e. µi,0 , µi,n−1 with 0 ≤ i ≤ m − 1 and µ0,j , µm−1,j with 0 ≤ j ≤ n − 1) are linked to other processors. The vertices µi,n−1 on the last column are usually connected either to the opposite ones on the same rows (i.e. to µi,0 ), thus forming ring connections in each row, or to the opposite ones on the successive rows (i.e. to µi+1,0 ), in a snake-like way. The same linking strategy is applied for the last row. In the toroidal mesh rings are formed in rows and columns; in the torus cordalis there are rings in the columns and snake-like connections in the rows; finally, in the torus serpentinus there are snake-like connections in rows and columns. Formally, we have: Definition 1 Toroidal Mesh A toroidal mesh of m × n vertices is a mesh where each vertex τi,j , with 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1 is connected to the four vertices τ(i−1) mod m,j , τ(i+1) mod m,j , τi,(j−1) mod n , τi,(j+1) mod n (mesh connections). Definition 2 Torus Cordalis A torus cordalis of m × n vertices is a mesh where each vertex τi,j , with 0 ≤ i ≤ m − 1, 0 ≤ j ≤ n − 1 has mesh connections except for the last vertex τi,n−1 of each row i, which is connected to the first vertex τ(i+1) mod m,0 of row i + 1. Notice that this torus can be seen as a chordal ring with one chord. Definition 3 Torus Serpentinus A torus serpentinus of m × n vertices is a mesh where each vertex τi,j , with 1 ≤ i ≤ m − 1, 0 ≤ j ≤ n − 1 has mesh connections, except for the last vertex τi,n−1 of each row i which is connected to the first vertex τ(i+1) mod m,0 of row i + 1, and for the last vertex τm−1,j of each column j which is connected to the first vertex τ0,(j−1) mod n of column j − 1.
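For illustration, the wrap-around rules of Definitions 1–3 can be captured in a single neighbourhood function. The sketch below (in C) is only one possible rendering: the choice of which index is the row and which the column, and the order of the four neighbours, are assumptions made here.

```c
/* Neighbourhood rule for the three torus topologies of Definitions 1-3,
 * with vertices indexed 0..m-1 (rows) and 0..n-1 (columns). */
enum topology { TOROIDAL_MESH, TORUS_CORDALIS, TORUS_SERPENTINUS };

typedef struct { int i, j; } vertex;

/* Fills nb[0..3] with the four neighbours of vertex (i, j) on an m x n torus. */
static void neighbours(enum topology t, int m, int n, int i, int j, vertex nb[4])
{
    /* plain mesh connections with ring closure in rows and columns */
    nb[0] = (vertex){ (i - 1 + m) % m, j };        /* up    */
    nb[1] = (vertex){ (i + 1) % m, j };            /* down  */
    nb[2] = (vertex){ i, (j - 1 + n) % n };        /* left  */
    nb[3] = (vertex){ i, (j + 1) % n };            /* right */

    if (t != TOROIDAL_MESH) {
        /* snake-like row closure: (i, n-1) is linked to ((i+1) mod m, 0) */
        if (j == n - 1) nb[3] = (vertex){ (i + 1) % m, 0 };
        if (j == 0)     nb[2] = (vertex){ (i - 1 + m) % m, n - 1 };
    }
    if (t == TORUS_SERPENTINUS) {
        /* snake-like column closure: (m-1, j) is linked to (0, (j-1) mod n) */
        if (i == m - 1) nb[1] = (vertex){ 0, (j - 1 + n) % n };
        if (i == 0)     nb[0] = (vertex){ m - 1, (j + 1) % n };
    }
}
```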
Majority will be defined as follows: Definition 4 Irreversible-majority rule. A vertex v becomes black if the majority of its neighbours are black. In case of a tie v becomes black (simple majority), or keeps its color (strong majority). In the tori, simple (or strong) irreversible-majority asks for at least two (or three) black neighbours. We can now formally define dynamos: Definition 5 A simple (respectively: strong) irreversible dynamo is an initial set of black vertices from which an all black configuration is reached in a finite number of steps under the simple (respectively: strong) irreversible-majority rule. Simple and strong majorities will be treated separately because they exhibit different properties and are treated by different techniques.
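A minimal sketch of the resulting process is given below: starting from a candidate black set, the irreversible-majority rule of Definition 4 is applied synchronously (threshold two black neighbours for simple majority, three for strong) until a fixed point is reached; by Definition 5 the initial set is a dynamo exactly when that fixed point is all black. The neighbours() helper from the previous sketch is assumed, and the black set is stored as an m·n array of 0/1 flags.

```c
#include <stdlib.h>
#include <string.h>

/* Apply the irreversible-majority rule synchronously until a fixed point is
 * reached; 'threshold' is 2 for simple and 3 for strong majority.  Returns 1
 * if the initial black set is a dynamo, and reports the number of steps. */
static int is_irreversible_dynamo(enum topology t, int m, int n,
                                  const char *initial_black, int threshold,
                                  int *steps)
{
    char *cur = malloc(m * n), *next = malloc(m * n);
    int all_black, changed;

    memcpy(cur, initial_black, m * n);
    *steps = 0;
    do {
        changed = 0;
        all_black = 1;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                vertex nb[4];
                int black_nb = 0;
                neighbours(t, m, n, i, j, nb);
                for (int k = 0; k < 4; k++)
                    black_nb += cur[nb[k].i * n + nb[k].j];
                /* irreversible: black vertices never turn white again */
                next[i * n + j] = cur[i * n + j] || black_nb >= threshold;
                changed |= next[i * n + j] != cur[i * n + j];
                all_black &= next[i * n + j];
            }
        memcpy(cur, next, m * n);
        if (changed)
            (*steps)++;
    } while (changed && !all_black);

    free(cur);
    free(next);
    return all_black;
}
```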
3 Irreversible Dynamos with Simple Majority
Network behaviour changes drastically if we pass from simple to strong majority. We start our study from the former case.
3.1 Toroidal Mesh
Consider a toroidal mesh with a set T of m × n vertices, m, n > 2. Each vertex has four neighbors, so two black neighbors are enough to color a white vertex black. Let S ⊆ T be a generic subset of vertices, and RS be the smallest rectangle containing S. The size of RS is mS × nS . If S is all black, a spanning set for S (if any) is a connected black set σ(S) ⊇ S derivable from S with consecutive applications of the simple majority rule. We have: Proposition 1 Let S be a black set, mS < m − 1, nS < n − 1. Then, any (not necessarily connected) black set B derivable from S is such that B ⊆ RS . Proposition 2 Let S be a black set. The existence of a spanning set σ(S) implies |S| ≥ (mS + nS)/2. From Propositions 1 and 2 we immediately derive our first lower bound result: Theorem 1 Let S be a simple irreversible dynamo for a toroidal mesh m × n. We have: (i) mS ≥ m − 1, nS ≥ n − 1; (ii) |S| ≥ (m + n)/2 − 1. To build a matching upper bound we need some further results. Proposition 3 Let S be a black set, such that a spanning set σ(S) exists. Then a black set RS can be built from S (note: RS is then a spanning set of S).
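The necessary conditions of Theorem 1 can be tested mechanically. In the sketch below RS is taken with respect to the fixed mesh indexing (an interpretation, since wrapping rectangles are not discussed), and the cardinality test encodes |S| ≥ (m+n)/2 − 1 as 2|S| ≥ m + n − 2 to avoid rounding issues.

```c
/* Check the necessary conditions of Theorem 1 for a candidate black set,
 * given as an m*n array of 0/1 flags on a toroidal mesh. */
static int satisfies_theorem1(int m, int n, const char *black)
{
    int imin = m, imax = -1, jmin = n, jmax = -1, size = 0;

    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            if (black[i * n + j]) {
                if (i < imin) imin = i;
                if (i > imax) imax = i;
                if (j < jmin) jmin = j;
                if (j > jmax) jmax = j;
                size++;
            }
    if (size == 0)
        return 0;
    /* sides m_S and n_S of the smallest enclosing rectangle R_S */
    return (imax - imin + 1) >= m - 1 && (jmax - jmin + 1) >= n - 1
        && 2 * size >= m + n - 2;
}
```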
Definition 6 An alternating chain C is a sequence of adjacent vertices starting and ending with black. The vertices of C are alternating black and white, however, if C has even length there is exactly one pair of consecutive black vertices somewhere. Theorem 2 Let S be a black set consisting of the black vertices of an alternating chain C, with mS = m − 1 and nS = n − 1. Then, the whole torus can be colored black starting from S, that is S is a simple irreversible dynamo. An example of an alternating chain of proper length, and the phases of the algorithm, are illustrated in Figure 1. From Theorem 2 we have:
Fig. 1. The initial set S, the spanning set σ(S) and the smallest rectangle RS containing S.
Corollary 1 Any m × n toroidal mesh admits a simple irreversible dynamo S with |S| = (m + n)/2 − 1. Note that the lower bound of Theorem 1 matches the upper bound of Corollary 1. From the proof of this corollary we see that a dynamo of minimal cardinality can be built on an alternating chain. Furthermore we have: Corollary 2 There exists a simple irreversible dynamo S of minimal cardinality starting from which the whole toroidal mesh can be colored black in (m + n)/2 steps. An example of such a dynamo is shown in Figure 3.1. An interesting observation is that the coloring mechanism shown in Corollary 2 works also for asynchronous systems. In this case, however, the number of steps loses its significance.
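One concrete way to realise a dynamo of the cardinality stated in Corollaries 1 and 2 is to lay an alternating chain along a staircase from (0,0) to (m−2, n−2), so that its bounding rectangle has sides m−1 and n−1 as required by Theorem 2. The staircase shape and the placement of the single consecutive black pair for even-length chains are illustrative choices made here; they are not necessarily those of the omitted proof.

```c
/* Build a simple irreversible dynamo of minimal cardinality for an m x n
 * toroidal mesh (m, n > 2), following Definition 6 and Theorem 2: a staircase
 * chain from (0,0) to (m-2, n-2), coloured alternately black/white, starting
 * and ending with black.  'black' is an m*n array of 0/1 flags, zeroed. */
static void alternating_chain_dynamo(int m, int n, char *black)
{
    int len = (m - 2) + (n - 2) + 1;      /* number of vertices in the chain */
    int i = 0, j = 0;

    for (int k = 0; k < len; k++) {
        if (k % 2 == 0 || k == len - 1)   /* even positions black; force the last one black */
            black[i * n + j] = 1;
        /* staircase step: alternate column and row moves, then finish the longer side */
        if (j < n - 2 && (k % 2 == 0 || i == m - 2))
            j++;
        else if (i < m - 2)
            i++;
    }
}
```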
3.2 Torus Cordalis
If studied on a torus cordalis, simple irreversible-majority is quite easy. We now show that any simple irreversible dynamo must have size ≥ n/2, and that there exist dynamos with almost optimal size.
Fig. 2. Examples of dynamos requiring (m+n)/2 steps.
Theorem 3 Let S be a simple irreversible dynamo for a torus cordalis m × n. We have: |S| ≥ n/2. Theorem 4 Any m × n torus cordalis admits a simple irreversible dynamo S with |S| = n/2 + 1. Starting from S the whole torus can be colored black in ((m−3)/2)n + 3 steps. An example of such dynamos is given in Figure 3.
Fig. 3. Simple irreversible dynamos of n/2 + 1 vertices for tori cordalis, with n even and n odd.
3.3 Torus Serpentinus
Since the torus serpentinus is symmetric with respect to rows and columns, we assume without loss of generality that m ≥ n, and derive lower and upper bounds on the size of any simple irreversible dynamo. For m ≤ n simply exchange m with n in the expressions of the bounds. Consider a white cross, that is a set of white vertices arranged as in Figure 4, with height m and width n. The parallel white lines from the square of nine vertices at the center of the cross to the borders of the mesh are called rays. Note that each vertex of the cross is adjacent to three other vertices of the same cross, thus implying:
Fig. 4. The white cross configuration.
Proposition 4 In a torus serpentinus the presence of a white cross prevents a simple irreversible dynamo from existing. We then have: Proposition 5 If a torus serpentinus contains three consecutive white rows, any simple irreversible dynamo must contain ≥ n/2 vertices. Based on Proposition 5 we obtain the following lower bounds: Theorem 5 Let S be a simple irreversible dynamo for a torus serpentinus m × n. We have: 1. |S| ≥ n/2, for m/3 ≥ n/2; 2. |S| ≥ m/3, for m/3 < n/2.
The upper bound for the torus serpentinus is identical to the one already found for the torus cordalis. In fact we have: Theorem 6 Any m × n torus serpentinus admits a simple irreversible dynamo S with |S| = n/2 + 1. Starting from S the whole torus can be colored black in ((m−3)/2)n + 3 steps.
4 Irreversible Dynamos with Strong Majority
A strong majority argument allows us to derive a significant lower bound valid for the three considered families of tori (simply denoted by tori). Since these tori have a neighbourhood of four, three black neighbours are needed to color a white vertex black under strong majority. We have: Theorem 7 Let S be a strong irreversible dynamo for a torus m × n. Then |S| ≥ (mn + 1)/3. We now derive an upper bound also valid for all tori.
Theorem 8 Any m × n torus admits a strong irreversible dynamo S with |S| = (m/3)(n + 1). Starting from S the whole torus can be colored black in n/2 + 1 steps. An example of such a dynamo is shown in Figure 5.
m = 9, n = 8, |S| = (m/3)(n + 1) = 27
Fig. 5. A strong irreversible dynamo for toroidal mesh, torus cordalis or torus serpentinus.
For particular values of m and n the bound of Theorem 8 can be made stricter for the toroidal mesh and the torus serpentinus. In fact these networks are symmetrical with respect to rows and columns, hence the pattern of black vertices reported in Figure 5 can be rotated by 90 degrees, still constituting a dynamo. We immediately have: Corollary 3 Any m × n toroidal mesh or torus serpentinus admits a strong irreversible dynamo S with |S| = max{(m/3)(n + 1), (n/3)(m + 1)}.
References
1. Z. Agur. Fixed points of majority rule cellular automata with application to plasticity and precision of the immune system. Complex Systems, 2:351–357, 1988.
2. Z. Agur, A. S. Fraenkel, and S. T. Klein. The number of fixed points of the majority rule. Discrete Mathematics, 70:295–302, 1988.
3. J-C. Bermond, D. Peleg. The power of small coalitions in graphs. In Proc. of 2nd Coll. on Structural Information and Communication Complexity, 173–184, 1995.
4. J-C. Bermond, J. Bond, D. Peleg, S. Perennes. Tight bounds on the size of 2-monopolies. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 170–179, 1996.
5. R. De Prisco, A. Monti, L. Pagli. Efficient testing and reconfiguration of VLSI linear arrays. To appear in Theoretical Computer Science.
6. P. Flocchini, E. Lodi, F. Luccio, L. Pagli, N. Santoro. Irreversible dynamos in tori. Technical Report TR-98-05, Carleton University, School of Computer Science, 1998.
7. P. Flocchini, F. Geurts, N. Santoro. Dynamic majority in general graphs and chordal rings. Manuscript, 1997.
8. E. Goles, S. Martinez. Neural and automata networks, dynamical behavior and applications. Maths and Applications, Kluwer Academic Publishers, 1990.
9. E. Goles and J. Olivos. Periodic behavior of generalized threshold functions. Discrete Mathematics, 30:187–189, 1980.
10. A. Granville. On a paper by Agur, Fraenkel and Klein. Discrete Mathematics, 94:147–151, 1991.
11. N. Linial, D. Peleg, Y. Rabinovich, M. Sachs. Sphere packing and local majority in graphs. In Proc. of 2nd ISTCS, IEEE Comp. Soc. Press, pp. 141–149, 1993.
12. G. Moran. Parametrization for stationary patterns of the r-majority operators on 0–1 sequences. Discrete Mathematics, 132:175–195, 1994.
13. G. Moran. The r-majority vote action on 0–1 sequences. Discrete Mathematics, 132:145–174, 1994.
14. G. Moran. On the period-two-property of the majority operator in infinite graphs. Transactions of the American Mathematical Society, 347(5):1649–1667, 1995.
15. A. Nayak, N. Santoro, R. Tan. Fault intolerance of reconfigurable systolic arrays. Proc. 20th Int. Symposium on Fault-Tolerant Computing, 202–209, 1990.
16. D. Peleg. Local majority voting, small coalitions and controlling monopolies in graphs: A review. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 152–169, 1996.
17. D. Peleg. Size bounds for dynamic monopolies. In Proc. of 3rd Coll. on Structural Information and Communication Complexity, 151–161, 1997.
18. S. Poljak. Transformations on graphs and convexity. Complex Systems, 1:1021–1033, 1987.
19. S. Poljak and M. Sůra. On periodical behavior in societies with symmetric influences. Combinatorica, 1:119–121, 1983.
20. N. Santoro, J. Ren, A. Nayak. On the complexity of testing for catastrophic faults. In Proc. of 6th Int. Symposium on Algorithms and Computation, 188–197, 1995.
MPI-GLUE: Interoperable High-Performance MPI Combining Different Vendor’s MPI Worlds Rolf Rabenseifner Rechenzentrum der Universität Stuttgart Allmandring 30, D-70550 Stuttgart [email protected] http://www.hlrs.de/people/rabenseifner/ Abstract. Several metacomputing projects try to implement MPI for homogeneous and heterogeneous clusters of parallel systems. MPI-GLUE is the first approach which exports nearly full MPI 1.1 to the user’s application without losing the efficiency of the vendors’ MPI implementations. Inside of each MPP or PVP system the vendor’s MPI implementation is used. Between the parallel systems a slightly modified TCP-based MPICH is used, i.e. MPI-GLUE is a layer that combines different vendors’ MPIs by using MPICH as a global communication layer. Major design decisions within MPI-GLUE and other metacomputing MPI libraries (PACX-MPI, PVMPI, Globus and PLUS) and their implications for the programming model are compared. The design principles are explained in detail.
1 Introduction
The development of large scale parallel applications for clusters of high-end MPP or PVP systems requires efficient and standardized programming models. The message passing interface MPI [9] is one solution. Several metacomputing projects try to solve the problem that different vendors’ MPI implementations are mostly not interoperable. MPI-GLUE is the first approach that exports nearly full MPI 1.1 without losing the efficiency of the vendors’ MPI implementations inside of each parallel system.
2 Architecture of MPI-GLUE
MPI-GLUE is a library exporting the standard MPI 1.1 to parallel applications on clusters of parallel systems. MPI-GLUE imports the native MPI library of the system’s vendor for the communication inside of a parallel system. Parallel systems in this sense are MPP or PVP systems, but also homogeneous workstation clusters. For the communication between such parallel systems MPI-GLUE uses a portable TCP-based MPI library (e.g. MPICH). In MPI-GLUE the native MPI library is also used by the portable MPI to transfer its byte messages inside of each parallel system. This design allows any homogeneous and heterogeneous combination of any number of parallel or nonparallel systems. The details are shown in Fig. 1.
Fig. 1. Software architecture
3 Related Works
The PACX-MPI [1] project at the computing center of the University of Stuttgart has a similar approach to MPI-GLUE. At Supercomputing ’97 PACX-MPI was used to demonstrate that MPI-based metacomputing can solve large scale problems. Two CRAY T3E with 512 processors each were combined to run a flow solver (URANUS) which predicts aerodynamic forces and high temperatures acting on space vehicles when they reenter Earth’s atmosphere [2]. Furthermore a new world record in simulating an FCC crystal with 1.399.440.000 atoms has been reached on that cluster [11]. PACX-MPI uses a special module for the TCP communication in contrast to the portable MPI in MPI-GLUE. The functionality of PACX-MPI is limited by this module to a small subset of MPI 1.1. The TCP communication does not directly exchange messages between the MPI processes. On each parallel system two special nodes are used as concentrators for incoming and outgoing TCP messages. PACX-MPI can compress all TCP messages to enlarge bandwidth. PVMPI [4] has been developed at UTK, ORNL and ICL, and couples several MPI worlds by using PVM-based bridges. The user interface does not meet the MPI 1.1 standard, but is similar to a restricted subset of the functionality of MPI Comm join in MPI 2.0 [10]. It is impossible to merge the bridge’s intercommunicator into a global intra-communicator over all processes. Therefore, one cannot execute existing MPI applications with PVMPI on a cluster of parallel systems. In the PLUS [3] project at PC2 at the University of Paderborn a bridge has been developed too. With this bridge MPI and PVM worlds can be combined. Inefficient TCP implementations of some vendors have been replaced by a custom protocol based on UDP. It is optimized for wide area networks. As in PACX-MPI the communication is concentrated by a daemon process on each parallel system. The interface has the same restrictions as described for PVMPI, moreover, only a restricted subset of datatypes is valid. PLUS is part of the MOL (Metacomputer Online) project.
In the Globus [5] project and in the I-WAY [7] project MPICH [8] is used on top of the multi-protocol communication library NEXUS [6]. NEXUS is designed as a basic module for communication libraries like MPI. On the parallel system itself MPICH is used instead of the vendor’s MPI library.
4 Design Decisions and Related Problems
Several decisions must be taken in the design of an MPI library usable for heterogeneous metacomputing. Major choices and the decisions in the mentioned projects are reviewed in this section. The glue approach versus the portable multi-protocol MPI implementation: The goals of high efficiency and full MPI functionality can be achieved by the following alternatives (Fig. 2): (a) The glue layer exports an MPI interface to the application and imports the native MPIs in each parallel system and a global communication module which allows communication between parallel systems. This general glue approach is used in MPI-GLUE, PACX-MPI, PVMPI and PLUS. (b) In the portable multi-protocol MPI approach (used in Globus) each parallel system has its own local communication module. A common global communication module enables the communication between the parallel systems. The MPI library uses both modules. In this approach the global communication module is usually very small in comparison with the glue approach: it only allows byte message transfers. The main disadvantage of this approach is that the performance of the native MPI library cannot be used. Global world versus dynamically created bridges: The standards MPI 1.1 and MPI 2.0 imply two metacomputing alternatives: (a) MPI COMM WORLD collects all processes (in MPI-GLUE, PACX-MPI and Globus) or (b) each partition of the cluster has a separate MPI COMM WORLD and inter-communicators connect the partitions (this does not conform to MPI 1.1; used in PVMPI and PLUS). Global communication using its own module versus importing a portable MPI: The global communication module in the glue approach can be (a) a portable TCP-based MPI implementation (in MPI-GLUE), or (b) a special TCP-based module (in PACX-MPI, PVMPI and PLUS). With (a) it is possible to implement the full MPI 1.1 standard without reimplementing most of the MPI’s functionality inside of this special module.
Fig. 2. The glue approach (left) and the multi-protocol MPI approach (right)
Using daemons or not using daemons: The global communication can be (a) concentrated by daemons or can be (b) done directly. In the daemon-based approach (a) the application processes send their messages on the local network (e.g. by using the native MPI) to a daemon, which transfers them on the global network (e.g. with TCP) to the daemon of the target system. From there the messages are received by the application processes within the local network of the target system. In the direct-communication approach (b) the work of the daemons is done by the operating system and each node communicates directly via the global network. With the daemon-based approach the application can fully parallelize computation and communication. The application is blocked only by native MPI latencies although in the background slow global communication is executed. Currently only the direct-communication approach is implemented in MPI-GLUE, but it would be no problem to add the daemon-based approach. A detailed discussion can be found in [12]. The first progress problem: All testing and blocking MPI calls that have to look at processes connected by the local network and at other processes connected by the global network should take into account that the latencies for probing the local and global network are different. On clusters of MPPs the ratio can be 1:100. Examples are a call to MPI Recv with source=MPI ANY SOURCE, or a call to MPI Waitany with requests handling local and global communication. More a workaround than a solution are asymmetric polling strategies: if polling is necessary then only in the nth round the central polling routine will inquire the state of the global interface, while in each round the local interface will be examined. n is given by the ratio mentioned above. Without this trick the latency for local communication may be expanded to the latency of the global communication. But with this trick the latency for global communication may be in the worst case twice the latency of the underlying communication routine. This first progress problem arises in all systems that use no daemons for the global communication (MPI-GLUE, PVMPI configured without daemons, and Globus). In MPI-GLUE the asymmetric polling strategy has been implemented. In Globus it is possible to specify different polling intervals for each protocol. The second progress problem: The restriction in MPI 1.1, page 12, lines 24 ff (”This document specifies the behavior of a parallel program assuming that only MPI calls are used for communication”) allows MPI implementations to make progress on non-blocking and buffered operations only inside of MPI routines, e.g. in MPI Wait... / Test... / Iprobe or Finalize. This is a weak interpretation of the progress rule in MPI 1.1, page 44, lines 41ff. The rationale in MPI 2.0, page 142, lines 14-29 explicitly mentions this interpretation. This allows a native MPI to make no progress while the portable MPI is blocked by waiting for a message on the TCP link, and vice versa. Therefore, in the glue approach the problem arises that the glue layer must not implement anything by blocking or testing only within the native MPI without giving the global communication module a chance to make progress, and vice versa. The only hard problem is the combination of using collective routines of the native MPI (because there isn’t any non-blocking alternative) and
using sends in the global MPI and these sends are buffered at the sender and the sender is blocked in the native collective operation. This problem can be solved in daemon-based systems by sending all messages (with a destination in another parallel system) immediately to the daemons and buffering them there (implemented in PACX-MPI and PLUS). It can also be solved by modifying the central probe and dispatching routine inside the native MPI. The analogous problem (how to give the native MPI a chance to make progress, while the glue layer blocks in a global collective routine) usually only exists, if one uses an existing MPI as a global communication module, but in MPI-GLUE this problem has been solved, because the global MPI itself is a multi-protocol implementation which uses the local MPI as one protocol stack.
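The asymmetric polling strategy described above can be summarised in a small dispatch loop. In the sketch below the MPL_/MPX_ prefixes follow the renaming convention introduced in Sect. 5 for the native and the portable MPI libraries (the probe calls mirror the standard MPI_Iprobe signature); the routine itself and the constant POLL_RATIO are illustrative and do not reproduce the actual MPI-GLUE source.

```c
#define POLL_RATIO 100   /* illustrative: roughly the global/local latency ratio */

/* Probe both interfaces for a message from any source.  The cheap native
 * (MPL) interface is examined in every round; the expensive TCP-based
 * portable (MPX) interface only once every POLL_RATIO rounds. */
static void probe_any_source(int tag, MPL_Comm lcomm, MPX_Comm gcomm,
                             int *local_ready, int *global_ready)
{
    MPL_Status lstat;
    MPX_Status gstat;
    unsigned round = 0;

    *local_ready = *global_ready = 0;
    while (!*local_ready && !*global_ready) {
        MPL_Iprobe(MPL_ANY_SOURCE, tag, lcomm, local_ready, &lstat);
        if (!*local_ready && round++ % POLL_RATIO == 0)
            MPX_Iprobe(MPX_ANY_SOURCE, tag, gcomm, global_ready, &gstat);
    }
}
```

With such a loop the local latency is essentially unaffected, at the price of detecting a global message somewhat later, as discussed above.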
5 Design Details
The notation “MPI” is used for the MPI interface exported by this glue layer and for the glue layer software. The notation “MPX” is used for the imported TCP-based MPI-1.1 library in which all names are modified and begin with “MPX ” instead of “MPI ”. MPX allows the exchange of messages between all processes. The notation “MPL” is used for the native (local) MPI 1.1 library. All names are modified (“MPL ” instead of “MPI ”). All MPL COMM WORLDs are disjoint and in the current version each process is member of exactly one MPL COMM WORLD (that may have any size ≥ 1). Details are given in the table below and in [12]. In the table “X” and “L” are abbreviations for MPX and MPL.

MPI constant, handle, routine | implemented by using | remarks
ranks | X | the MPI ranks are equal to the MPX ranks.
group handling | X | group handles and handling are implemented with MPX.
communicators | L and X | and roots-communicator, see Sec. Optimization of collective operations.
datatypes | L and X | and a mapping of the L ranks to the X ranks and vice versa.
requests | L or X or delayed | MPI IRECVs with MPI ANY SOURCE and communicators spanning more than one MPL region store all arguments of the IRECV call into the handle's structure; the real RECV is delayed until the test or wait calls.
sends, receives | L or X | receives only with known source partition.
receives with ANY SOURCE, blocking | L and X | with asymmetric polling strategy.
receives with ANY SOURCE, nonblocking | GLUE | postponed until the wait and test routines, see MPI requests above.
wait and test, MPI Testall | L or X or L&X | Testing inside partial areas of the communicator may complete request handles, but the interface definition requires that only none or all requests are completed. Therefore (only for Testall), requests may be MPL/MPX-finished, but not MPI-finished.
The MPI process start and MPI Init is mapped to the MPL process start, MPL Init, MPX process start and MPX Init. The user starts the application processes on the first parallel system with its MPL process start facility. MPI Init first calls MPL Init and then MPX Init. The latter one is modified to reflect that the parallel partition is already started by MPL. Based on special arguments at the end of the argument list and based on the MPL ranks the first node recognizes inside of MPX Init its role as MPICH’s big master. It interactively starts the other partitions on the other parallel systems or uses a remote shell method (e.g. rsh) to invoke the MPL process start facility there. There as well the MPI Init routine first invokes MPL Init and then MPX Init. Finally MPI Init initializes the predefined handles. MPI Finalize is implemented in the reverse order. Optimization of collective operations: It is desirable that metacomputing applications do not synchronize processes over the TCP links. This implies that they should not call collective operations on a global communicators, especially those operations which make a barrier synchronization (barrier, allreduce, reduce scatter, allgather(v), alltoall(v)). The following schemes help to minimize the latencies if the application still uses collective operations over global communicators. In general a collective routine is mapped to the corresponding routine of the native MPI if the communicator is completely inside of one local MPL world. To optimize MPI Barrier, MPI Bcast, MPI Reduce and MPI Allreduce if the communicator belongs to more than one MPL world, for each MPI-communicator, there is an additional internal roots-communicator that combines the first process of any underlying MPL-communicator by using MPX, e.g. MPI Bcast is implemented by a call to native Bcast in each underlying MPL-communicator and one call to the global Bcast in the roots-communicator.
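A sketch of the two-level broadcast just described: one MPX broadcast over the internal roots-communicator moves the data between the parallel systems, and one native MPL broadcast distributes it inside each MPL world. The assumption that the broadcast root is rank 0 of its MPL world (and therefore a member of the roots-communicator), and the MPL_/MPX_ type and function names, are illustrative only; the actual implementation also handles arbitrary roots and the datatype mapping described above.

```c
/* Two-level broadcast: one portable (MPX) broadcast over the internal
 * roots-communicator (one process per MPL world, so the slow TCP path is
 * used exactly once), followed by a native (MPL) broadcast inside each
 * parallel system. */
static int glue_bcast(void *buf, int count,
                      MPL_Datatype ltype, MPX_Datatype gtype,
                      int i_am_local_root, int root_in_roots_comm,
                      MPX_Comm roots_comm, MPL_Comm local_comm)
{
    if (i_am_local_root)
        MPX_Bcast(buf, count, gtype, root_in_roots_comm, roots_comm);
    /* fan the data out inside the parallel system with the efficient native call */
    return MPL_Bcast(buf, count, ltype, 0, local_comm);
}
```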
6 Status, Future Plans, and Acknowledgments
MPI-GLUE implements the full MPI 1.1 standard except for some functionality. Some restrictions result from the glue approach: (a) It is impossible to allow an unlimited number of user-defined operations. (b) Messages must be received with datatype MPI PACKED if and only if they are sent with MPI PACKED. Some other (hopefully minor) functionalities are still waiting to be implemented. A detailed list can be found in [12]. MPI-GLUE is portable. Current test platforms are SGI, CRAY T3E and Intel Paragon. Possible future extensions include the implementation of MPI 2.0 functionalities, optimization of further collective routines, and integrating a daemon-based communication module inside the used portable MPI. I would like to thank the members of the MPI-2 Forum and of the PACX-MPI project as well as the developers of MPICH who have helped me in the discussion which led to the present design and implementation of MPI-GLUE. And I would like to thank Oliver Hofmann for implementing a first prototype in his diploma thesis, Christoph Grunwald for implementing some routines during his practical course, especially some wait, test and topology functions, and Matthias Müller for testing MPI-GLUE with his P3T-DSMC application.
7 Summary
MPI-GLUE achieves interoperability for the message passing interface MPI. For this it combines existing vendors’ MPI implementations, losing neither full MPI 1.1 functionality nor the efficiency of the vendors’ MPIs. MPI-GLUE targets existing MPI applications that need clusters of MPPs or PVPs to solve large-scale problems. MPI-GLUE enables metacomputing by seamlessly expanding MPI 1.1 to any cluster of parallel systems. It exports a single virtual parallel MPI machine to the application. It combines all processes in one MPI COMM WORLD. All local communication is done by the local vendor’s MPI and all global communication is implemented with MPICH directly, without using daemons. The design has only two unavoidable restrictions which concern user-defined reduce operations and packed messages.
References
1. Beisel, T., Gabriel, E., Resch, M.: An Extension to MPI for Distributed Computing on MPPs. In Marian Bubak, Jack Dongarra, Jerzy Wasniewski (Eds.) Recent Advances in Parallel Virtual Machine and Message Passing Interface, LNCS, Springer, 1997, 75-83.
2. T. Bönisch, R. Rühle: Portable Parallelization of a 3-D Flow-Solver, Parallel Comp. Fluid Dynamics ’97, Elsevier, Amsterdam, to appear.
3. Brune, M., Gehring, J., Reinefeld, A.: A Lightweight Communication Interface for Parallel Programming Environments. In Proc. High-Performance Computing and Networking HPCN’97, LNCS, Springer, 1997.
4. Fagg, G.E., Dongarra, J.J.: PVMPI: An Integration of the PVM and MPI Systems. Department of Computer Science Technical Report CS-96-328, University of Tennessee, Knoxville, May, 1996.
5. Foster, I., Kesselman, C.: Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications (to appear).
6. Foster, I., Geisler, J., Kesselman, C., Tuecke, S.: Managing Multiple Communication Methods in High-Performance Networked Computing Systems. J. Parallel and Distributed Computing, 40:35-48, 1997.
7. Foster, I., Geisler, J., Tuecke, S.: MPI on the I-WAY: A Wide-Area, Multimethod Implementation of the Message Passing Interface. Proc. 1996 MPI Developers Conference, 10-17, 1996.
8. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22, 1996.
9. MPI: A Message-Passing Interface Standard. Message Passing Interface Forum, June 12, 1995.
10. MPI-2: Extensions to the Message-Passing Interface. Message Passing Interface Forum, July 18, 1997.
11. Müller, M.: Weltrekorde durch Metacomputing. BI 11+12, Regionales Rechenzentrum Universität Stuttgart, 1997.
12. Rabenseifner, R.: MPI-GLUE: Interoperable high-performance MPI combining different vendor’s MPI worlds. Tech. rep., Computing Center University Stuttgart. http://www.hlrs.de/people/rabenseifner/publ/mpi glue tr apr98.ps.Z
High Performance Protocols for Clusters of Commodity Workstations P. Melas and E. J. Zaluska Electronics and Computer Science University of Southampton, U.K. [email protected] [email protected]
Abstract. Over the last few years technological advances in microprocessor and network technology have improved dramatically the performance achieved in clusters of commodity workstations. Despite those impressive improvements the cost of communication processing is still high. Traditional layered structured network protocols fail to achieve high throughputs because they access data several times. Network protocols which avoid routing through the kernel can remove this limit on communication performance and support very high transmission speeds which are comparable to the proprietary interconnection found in Massively Parallel Processors.
1 Introduction
Microprocessors have improved their performance dramatically over the past few years. Network technology has over the same period achieved impressive performance improvements in both Local Area Networks (LAN) and Wide Area Networks (WAN) (e.g. Myrinet and ATM) and this trend is expected to continue for the next several years. The integration of high-performance microprocessors into low-cost computing systems, the availability of high-performance LANs and the availability of common programming environments such as MPI have enabled Networks Of Workstations (NOWs) to emerge as a cost-effective alternative to Massively Parallel Processors (MPPs) providing a parallel processing environment that can deliver high performance for many practical applications. Despite those improvements in hardware performance, applications on NOWs often fail to observe the expected performance speed-up. This is largely because the communication software overhead is several orders of magnitude larger than the hardware overhead [2]. Network protocols and the Operating System (OS) frequently introduce a serious communication bottleneck, (e.g. traditional UNIX layering architectures, TCP/IP, interaction with the kernel, etc). New software protocols with less interaction between the OS and the applications are required to deliver low-latency communication and exploit the bandwidth of the underlying intercommunication network [9]. Applications or libraries that support new protocols such as BIP [4], Active Messages [6], Fast Messages [2], U-Net [9] and network devices for ATM and
Myrinet networks demonstrate a remarkable improvement in both latency and bandwidth, which are frequently directly comparable to MPP communication performance. This report investigates commodity “parallel systems”, taking into account the internode communication network in conjunction with the communication protocol software. Latency and bandwidth tests have been run on various MPPs (SP2, CS2) and a number of clusters using different workstation architectures (SPARC, i86, Alpha), LANs (10 Mbit/s Ethernet, 100 Mbit/s Ethernet, Myrinet) and OS (Solaris, NT, Linux). The work is organised as follows: a review of the efficiency of the existing communication protocols and in particular TCP/IP from the NOW perspective is presented in the first section. A latency and bandwidth test of various clusters and an analysis of the results is presented next. Finally an example specialised protocol (BIP) is discussed followed by conclusions.
2 Clustering with TCP/IP over LANs
The task of a communication sub-system is to transfer data transparently from one application to another application, which could reside on another node. TCP/IP is currently the most widely-used network protocol in LANs and NOWs, providing a connection-oriented reliable byte stream service [8]. Originally TCP/IP was designed for WANs with relatively high error rates. The philosophy behind the TCP/IP model was to provide reliable communication between autonomous machines rather than a common resource [6]. In order to ensure reliability over an unreliable subnetwork TCP/IP uses features such as an in-packet end-to-end checksum, the “SO NDELAY” flag, the “Time-to-live” IP field, packet fragmentation/reassembly, etc. All these features are useful in WANs but in LANs they represent redundancy and consume vital computational power [6]. Several years ago the bandwidth of the main memory and the disk I/O of a typical workstation was an order of magnitude higher than the physical network bandwidth. The difference in magnitude was invariably sufficient for the existing OS and communication protocol stacks to saturate network channels such as 10 Mbit/s Ethernet [3]. Despite hardware improvements workstation memory access time and internal I/O bus bandwidth in a workstation have not increased significantly during this period (the main improvements in performance have come from caches and a better understanding of how compilers can exploit the potential of caches). Faster networks have thus resulted in the gap between the network bandwidth and the internal computer resources being considerably reduced [3]. The fundamental design objective of traditional OS and protocol stacks at that time was reliability, programmability and process protection over an unreliable network. In a conventional implementation of TCP/IP a send operation involves the following stages of moving data: data from the application buffer are copied to the kernel buffer, then packet forming and calculation of the headers and the checksum takes place and finally packets are copied into the network
interface for transmission. This requires extra context switching between applications and the kernel for each system call, additional copies between buffers and address spaces, and increased computational overhead [7].
Fig. 1. a) Conventional TCP/IP implementation, b) MPI on top of the TCP/IP protocol stack, c) the BIP protocol stack approach.
As a result, clusters of powerful workstations still suffer a degradation in performance even when a fast interconnection network is provided. In addition, the network protocols in common use are unable to exploit fully all of the hardware capability, resulting in low bandwidth and high end-to-end latency. Parallel applications running on top of communication libraries (e.g. MPI, PVM, etc.) add an extra layer on top of the network communication stack (see Fig. 1). Improved protocols attempt to minimise the critical communication path of application-kernel-network device [4,6,2,9]. Additionally these protocols exploit advanced hardware capabilities (e.g. network devices with enhanced DMA engines and co-processors) by moving as much as possible of their functionality into hardware. Applications can also interact directly with the network interface avoiding any system calls or kernel interaction. In this way processing overhead is considerably reduced, improving both latency and throughput.
3 Latency and Bandwidth Measurements
In order to estimate and evaluate the effect of different communication protocols on clusters a ping-pong test measuring both peer-to-peer latency and bandwidth between two processors over MPI was written [2]. The latency test measures the time a node needs to send a sequence of messages and receive back the echo. The bandwidth test measures the time required for a sequence of back-to-back messages to be sent from one node to another. In both tests receive operations were posted before the send ones. The message size in tests always refers to the payload. Each test is repeated many times in order to reduce any clock jitter, first-time and warm-up effects, and the median time from each measurement test is presented in the results. Various communication models [1,5] have been developed in order to evaluate communication among processors in parallel systems. In this paper we follow a
linear approach with a start-up time α (constant per-segment cost) and a variable per-byte cost β. The message length at which half the maximum bandwidth is achieved (n1/2) is an important indication as well.
tn = α + βn    (1)
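For illustration, a minimal MPI ping-pong along the lines described above might look as follows; the repetition count, message-size handling and output format are arbitrary choices, and the measured half round-trip time corresponds to tn in (1).

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REPS 1000                        /* illustrative repetition count */

int main(int argc, char **argv)
{
    int rank, n = (argc > 1) ? atoi(argv[1]) : 1024;   /* payload size in bytes */
    char *buf;
    MPI_Request req;
    MPI_Status st;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(n > 0 ? n : 1);
    memset(buf, 0, n > 0 ? n : 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {                 /* post the receive before the send */
            MPI_Irecv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Wait(&req, &st);
        } else if (rank == 1) {          /* echo the message back */
            MPI_Irecv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &st);
            MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double t_n = (t1 - t0) / (2.0 * REPS);   /* one-way time: t_n = alpha + beta * n */
        printf("%d bytes: %.2f us one-way, %.3f MB/s\n",
               n, t_n * 1e6, n / t_n / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```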
Parameters that can influence tests and measurements are also taken into account for each platform in order to analyse the results better, i.e. measurements on non-dedicated clusters have also been made to assess the effect of interference with other workload. Tests on MPPs (SP2 and CS2) Latency and bandwidth tests have been run on the SP2 and CS2 in the University of Southampton and on the SP2 at Argonne National Laboratory. The results of the MPP test are used as a benchmark to analyse and compare the performance of NOW clusters. In both SP2 machines the native IBM’s MPI implementation was used while on the CS2 the mpich 1.0.12 version of MPI was used. The difference in performance between the two SP2 machines is due to the different type of the high-performance switch (TB2/TB3). The breakpoint at 4 KB on the SP2 graph is due to the change in protocol used for sending small and large messages in the MPI implementation. The NT cluster. The NT cluster is composed of 8 DEC Alpha based workstations and uses a dedicated interconnection network based around a 100Mbit/s Ethernet switch. The Abstract Device Interface makes use of the TCP/IP protocol stack that NT provides. Results on latency and bandwidth are not very impressive (Fig. 2), because of the experimental version of the MPI implementation used. The system was unable to make any effective use of the communication channel, even on large messages. An unnecessarily large number of context switches resulted in very low performance.
Fig. 2. Latency and bandwidth on NOW clusters (panels: latency benchmark, time (µs) versus message size, and bandwidth benchmark, MB/s versus message size; curves for Linux 100Mb/s, Linux 10Mb/s, SPARC-4, ULTRA SPARC, NT, and the theoretical TCP/IP limit over Ethernet).
The Linux cluster. This cluster is built around Pentium-Pro machines running Linux 2.0.x. Tests have run using both the 10Mbit/s channel and FastEthernet with MPI version mpich 1.1. A poor performance was observed, similar to the NT cluster, although in this case the non-dedicated interconnection network affected some of the results. Although its performance is slightly better than the NT cluster, the system fails to utilise the underlying hardware efficiently. In Fig. 2 we can see a breakpoint in the bandwidth plots close to 8 KB due to the interaction between TCP/IP and the OS. The Solaris cluster. The Solaris cluster uses a non-dedicated 10Mbit/s Ethernet segment running SunOS 5.5.1 and the communication library is mpich 1.0.12. This cluster has the best performance among the clusters for 10Mbit/s Ethernet. Both SPARC-4 and ULTRA SPARC nodes easily saturated the network channel and push the communication bottleneck down to the Ethernet board. The non-dedicated intercommunication network had almost no effect on the results. The Myrinet cluster. This cluster is built around Pentium-Pro machines running Linux 2.0.1, with an interconnection based on a Myrinet network. The communication protocol is the Basic Interface for Parallelism (BIP), an interface for network communication which is targeted towards message-passing parallel computing. This interface has been designed to deliver to the application layer the maximum performance achievable by the hardware [3]. Basic features of this protocol are a zero-copy protocol, user application level direct interaction with the network board, system call elimination and exploitative use of memory bandwidth. BIP can easily interface with other protocol stacks and interfaces such as IP or APIs (see Fig. 1). The cluster is flexible enough to configure the API either as a typical TCP/IP stack running over 10Mbit/s Ethernet, TCP/IP over Myrinet or directly on top of BIP. In the first case the IP protocol stack on top of the Ethernet network provides similar performance to the Linux cluster discussed above, although its performance is comparable to the Sun cluster. Latency for zero-size message length has dropped to 290 µs and the bandwidth is close to 1 MB/s for messages larger than 1 KB. Changing the physical network from Ethernet to Myrinet via the TCP/BIP protocol provides a significant performance improvement. Zero size message latency is 171 µs and the bandwidth reaches 18 MB/s, with an n1/2 figure below 1.5 KB. Further change to the configuration enables the application (MPI) to interact directly with the network interface through BIP. The performance improvement in this case is impressive, pushing the network board to its design limits. Zero length message latency is 11 µs and the bandwidth exceeds 114 MB/s with an n1/2 message size of 8 KB. Figure 3 shows the latency and bandwidth graphs for those protocol stack configurations. A noticeable discontinuity at message sizes of 256 bytes reveals the different semantics between short and long message transmission modes.
Fig. 3. Latency and bandwidth on a Myrinet cluster with page alignment (panels: latency benchmark and bandwidth benchmark; curves for MPI over Ethernet, over Myrinet, and over Myrinet-BIP).
4 Analysis of the Results
The performance difference between normal NOW clusters and parallel systems is typically two or more orders of magnitude. A comparable performance difference can sometimes be observed between the raw performance of a NOW cluster and the performance delivered at the level of the user application (the relative latency and bandwidth performance of the above clusters is presented in Fig. 2). With reference to the theoretical bandwidth of a 10 Mbit/s Ethernet channel with TCP/IP headers we can see that only the Sun cluster approaches close to that maximum. The difference in computation power between the Ultra SPARC and SPARC-4 improves the latency and bandwidth of short messages. On the other hand the NT and Linux clusters completely fail to saturate even a 10Mbit/s Ethernet channel, and an order of magnitude improvement in the interconnection channel to FastEthernet does not improve throughput significantly. In both clusters the communication protocol implementation drastically limits performance. It is worth mentioning that both software and hardware for the Solaris cluster nodes come from the same supplier, therefore the software is very well tuned and exploits all the features available in the hardware. By contrast the open market in PC-based systems requires software to be more general and thus less efficient. Similar relatively-low performance was measured on the Myrinet cluster using the TCP/IP protocol stack. The system exploits a small fraction of the network bandwidth and the latency improvement is small. Replacing the communication protocol with BIP, the Myrinet cluster is then able to exploit the network bandwidth and the application-level performance becomes directly comparable with MPPs such as the SP2, T3D, CS2, etc. As can be seen from Fig. 4, the Myrinet cluster when compared to those MPPs has significantly better latency features over the whole range of the measurements made. In bandwidth terms for short messages up to 256 bytes Myrinet outperforms all the other MPPs. Then for messages up to 4KB (which is the breakpoint of the SP2 at Argonne), the SP2 has a higher performance, but after this point the performance of Myrinet is again better.
Table 1. Latency and Bandwidth results

Configuration       Cluster       H/W        min Lat.   max BW      n1/2 (bytes)
MPI over TCP/IP     NT/Alpha      FastEth.   673 µs     120 KB/s    100
MPI over TCP/IP     Linux/P.Pro   Ethernet   587 µs     265 KB/s    1.5 K
MPI over TCP/IP     Linux/P.Pro   FastEth.   637 µs     652 KB/s    1.5 K
MPI over TCP/IP     SPARC-4       Ethernet   660 µs     1.03 MB/s   280
MPI over TCP/IP     ULTRA         Ethernet   1.37 ms    1.01 MB/s   750
MPI over TCP/IP     Linux/P.Pro   Ethernet   280 µs     1 MB/s      300
MPI over TCP/BIP    Linux/P.Pro   Myrinet    171 µs     17.9 MB/s   1.5 K
MPI over BIP        Linux/P.Pro   Myrinet    11 µs      114 MB/s    8 K

[Figure: latency (µs) and bandwidth (MB/s) against message size in bytes, comparing the Lyon BIP cluster with the Argonne SP2, Southampton SP2 and Southampton CS2.]
Fig. 4. Comparing latency and bandwidth between a Myrinet cluster and MPPs
5
Conclusion
In this paper the importance and the impact of some of the most common interconnection network protocols for clusters have been examined. The TCP/IP protocol stack was tested with different cluster technologies and comparisons made with other specialised network protocols. Traditional communication protocols utilising the critical application-kernel-network path impose excessive communication processing costs and cannot exploit any advanced features in the hardware. Conversely, communication protocols that enable applications to interact directly with network interfaces, such as BIP, Fast Messages, Active Messages, etc., can improve the performance of clusters considerably. The majority of parallel applications use small messages to coordinate program execution, and for this size of message the latency overhead dominates the transmission time. In this case the communication cost cannot be hidden by any programming model or technique such as overlapping or pipelining. Fast communication protocols that deliver a drastically reduced communication latency are necessary for clusters to be used effectively in a wide range of scientific and commercial parallel applications as an alternative to MPPs.
Acknowledgements
We acknowledge the support of the Greek State Scholarships Foundation, Southampton University Computer Services, Argonne National Laboratory and LHPC in Lyon.
References
1. J. J. Dongarra and T. Dunigan. Message-passing performance of various computers. Technical Report UT-CS-95-299, Department of Computer Science, University of Tennessee, July 1995.
2. Mario Lauria and Andrew Chien. MPI-FM: High performance MPI on workstation clusters. Journal of Parallel and Distributed Computing, 40(1):4-18, January 1997.
3. Loïc Prylli and Bernard Tourancheau. BIP: A new protocol design for high performance networking on Myrinet. Technical report, LIP-ENS Lyon, September 1997.
4. Loïc Prylli. Draft: BIP messages user manual for BIP 0.92. Technical report, Laboratoire de l'Informatique du Parallélisme, Lyon, 1997.
5. PARKBENCH Committee: Report-1, assembled by Roger Hockney (chairman) and Michael Berry (secretary). Public international benchmarks for parallel computers. Technical Report UT-CS-93-213, Department of Computer Science, University of Tennessee, November 1993.
6. Steven H. Rodrigues, Thomas E. Anderson, and David E. Culler. High-performance local-area communication with fast sockets. In USENIX 1997 Annual Technical Conference, January 6-10, 1997, Anaheim, CA, pages 257-274, Berkeley, CA, USA, January 1997. USENIX.
7. S. Pakin, M. Lauria, A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing '95, San Diego, California, 1995.
8. W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison Wesley, Reading, 1995.
9. Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: a user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), volume 29, pages 303-316, 1995.
Significance and Uses of Fine-Grained Synchronization Relations

Ajay D. Kshemkalyani
Dept. of Electrical & Computer Engg. and Computer Science, University of Cincinnati, P. O. Box 210030, Cincinnati, OH 45221-0030, USA
[email protected]
Abstract. In a distributed system, high-level actions can be modeled by nonatomic events. Synchronization relations between distributed nonatomic events have been proposed to allow applications a fine choice in specifying synchronization conditions. This paper shows how these fine-grained relations can be used for various types of synchronization and coordination in distributed computations.
1
Introduction
High-level actions in several distributed application executions are realistically modeled by nonatomic poset events (where at least some of the component atomic events of a nonatomic event occur concurrently), for example, in distributed multimedia support, coordination in mobile computing, distributed debugging, navigation, planning, industrial process control, and virtual reality. It is important to provide these and emerging sophisticated applications a fine level of granularity in the specification of various synchronization/causality relations between nonatomic poset events. In addition, [20] stresses the theoretical and practical importance of the need for such relations. Most of the existing literature, e.g., [1,2,4,5,7,10,13,15,18,19,20], does not address this issue. A set of causality relations between nonatomic poset events was proposed in [11,12] to specify and reason with a fine-grained specification of causality. This set of relations extended the hierarchy of the relations in [9,14]. An axiom system on the proposed relations was given in [8]. The objective of this paper is to demonstrate the use of the relations in [11,12]. The following poset event structure model is used, as in [4,9,10,12,14,15,16,20]. (E, ≺) is a poset such that E represents points in space-time, which are the primitive atomic events related by the causality relation ≺, an irreflexive partial ordering. Elements of E are partitioned into local executions at a coordinate in the space dimensions. In a distributed system, E represents a set of events and is discrete. Each local execution Ei is a linearly ordered set of events in partition i. An event e in partition i is denoted ei. For a distributed application, points in the space dimensions correspond to the set of processes (also termed nodes), and Ei is the set of events executed by process i. Causality between events at different nodes is imposed by message communication. High-level actions in the
computation are nonempty subsets of E. Formally, if 𝓔 denotes the power set of E and 𝒜 (≠ ∅) ⊆ (𝓔 − ∅), then each element A of 𝒜 is termed an interval or a nonatomic event. It follows that if A ∩ Ei ≠ ∅, then (A ∩ Ei) has a least and a greatest event. 𝒜 is the set of all the sets that represent a higher-level grouping of the events of E that is of interest to an application.
Table 1. Some causality relations [9].

Relation r   Expression for r(X, Y)
R1           ∀x ∈ X ∀y ∈ Y, x ≺ y
R1'          ∀y ∈ Y ∀x ∈ X, x ≺ y
R2           ∀x ∈ X ∃y ∈ Y, x ≺ y
R2'          ∃y ∈ Y ∀x ∈ X, x ≺ y
R3           ∃x ∈ X ∀y ∈ Y, x ≺ y
R3'          ∀y ∈ Y ∃x ∈ X, x ≺ y
R4           ∃x ∈ X ∃y ∈ Y, x ≺ y
R4'          ∃y ∈ Y ∃x ∈ X, x ≺ y
The relations in [9] formed an exhaustive set of causality relations to express the possible interactions between a pair of linear intervals. These relations R1-R4 and R1'-R4' from [9] are expressed in terms of the quantifiers over X and Y in Table 1. Observe that R2' and R3' are different from R2 and R3, respectively, when applied to posets. Table 2 gives the hierarchy and inclusion relationships of the causality relations R1-R4; each cell in the grid indicates the relationship of the row header to the column header. The notation for the inclusion relationship between causality relations is as follows. The inclusion relation "is a subrelation of" is denoted '⊑' and its inverse is '⊒'. For relations r1 and r2, we define r1 || r2 to be (r1 ⋢ r2 ∧ r2 ⋢ r1). The relations {R1, R2, R3, R4} form a lattice hierarchy ordered by ⊑.
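To make the quantifier patterns of Table 1 concrete, the following C sketch evaluates R1-R4 directly between two finite sets of atomic events. The Event type and the function prec(x, y), assumed to decide x ≺ y (for example from vector timestamps), are illustrative and not part of the paper.

/* Naive evaluation of R1-R4 between two sets of atomic events X and Y. */
#include <stdbool.h>
#include <stddef.h>

typedef int Event;                    /* placeholder event identifier */
extern bool prec(Event x, Event y);   /* assumed test for x "happened before" y */

bool R1(const Event *X, size_t nx, const Event *Y, size_t ny) {
    for (size_t i = 0; i < nx; i++)          /* forall x, forall y: x < y */
        for (size_t j = 0; j < ny; j++)
            if (!prec(X[i], Y[j])) return false;
    return true;
}

bool R2(const Event *X, size_t nx, const Event *Y, size_t ny) {
    for (size_t i = 0; i < nx; i++) {        /* forall x, exists y: x < y */
        bool found = false;
        for (size_t j = 0; j < ny && !found; j++)
            found = prec(X[i], Y[j]);
        if (!found) return false;
    }
    return true;
}

bool R3(const Event *X, size_t nx, const Event *Y, size_t ny) {
    for (size_t i = 0; i < nx; i++) {        /* exists x, forall y: x < y */
        bool all = (ny > 0);
        for (size_t j = 0; j < ny && all; j++)
            all = prec(X[i], Y[j]);
        if (all) return true;
    }
    return false;
}

bool R4(const Event *X, size_t nx, const Event *Y, size_t ny) {
    for (size_t i = 0; i < nx; i++)          /* exists x, exists y: x < y */
        for (size_t j = 0; j < ny; j++)
            if (prec(X[i], Y[j])) return true;
    return false;
}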
Table 2. Inclusion relationships between relations of Table 1 [9]. Each cell gives the relation of the row header to the column header.

            R1    R2    R3    R4
R1          =     ⊑     ⊑     ⊑
R2          ⊒     =     ||    ⊑
R3          ⊒     ||    =     ⊑
R4          ⊒     ⊒     ⊒     =
The relations in [9] formed a comprehensive set of causality relations to express all possible interactions between a pair of linear intervals using only the ≺ relation between atomic events, and extended the partial hierarchy of relations of [14]. However, when the relations of [9] are applied to a pair of poset intervals, the hierarchy they form is incomplete. Causality relations between a pair of nonatomic poset intervals were formulated by extending the results of [8,9] to nonatomic poset events [11,12]. The relations form a "comprehensive" set of causality relations between nonatomic poset events using first-order predicate logic and only the ≺ relation between atomic events, and fill in the existing partial hierarchy formed by relations in [9,14]. A relational algebra for the relations in [11,12] is given in [8]. Given any relation(s) between X and Y, the relational algebra allows the derivation of conjunctions, disjunctions, and negations of all other relations that are also valid between X and Y, as well as between Y and X. Section 2 reviews the hierarchy of causality relations [11,12]. Section 3 gives the uses of each of the relations. Some uses of the relations include modeling various forms of synchronization for group mutual exclusion, initiation of nested computing, termination of nested processing, and monitoring the start of a computation. Section 4 gives concluding remarks. The results of this paper are included in [8].
2
Relations between Nonatomic Poset Events
Previous work on linear intervals and time durations, e.g., [1,2,4,5], identifies an interval by the instants of its beginning and end. Given a nonatomic poset interval, one needs to define counterparts for the beginning and end instants. These counterparts serve as "proxy" events for the poset interval, just as the events at the beginning and end of linear intervals such as time durations serve as proxies for the linear interval. The proxies identify the durations, on each node, in which the nonatomic event occurs. Two possible definitions of proxies are (i) LX = {ei ∈ X | ∀e'i ∈ X, ei ⪯ e'i} and UX = {ei ∈ X | ∀e'i ∈ X, e'i ⪯ ei}, and (ii) LX = {e ∈ X | ∀e' ∈ X, e' ⊀ e} and UX = {e ∈ X | ∀e' ∈ X, e ⊀ e'}. Assume that one definition of proxies is consistently used, depending on context and application. Fig. 1 depicts the proxies of X and Y. There are two aspects of a relation that can be specified between poset intervals. One aspect deals with the determination of an appropriate proxy for each interval. A proxy for X and Y can be chosen in four ways, corresponding to the relations in {R1, R2, R3, R4}. From Table 1, it follows that these four relations form a lattice ordered by ⊑. The second aspect deals with how the atomic elements of the chosen proxies are related by causality. The chosen proxies can be related by the eight relations R1, R1', R2, R2', R3, R3', R4, R4' of Table 1, which are renamed a, a', b, b', c, c', d, d', respectively, to avoid confusion with their original names used for the first aspect of specifying the relations between poset intervals. The inclusion hierarchy among the six distinct relations forms a lattice ordered by ⊑; see Table 3. The two aspects of deriving causality relations, described above, are combined to define the relations. The lattice of relations {R1*, R2*, R3*, R4*} between proxies of X and Y, and the lattice of relations {a, a', b, b', c, c', d, d'} between the elements of the proxies, give a product lattice of 32 relations over 𝒜 × 𝒜 to express r(X, Y).
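A companion sketch for the proxies, under the reading of definition (ii) in which LX and UX are the sets of minimal and maximal elements of X (this reading is an assumption; the paper's exact formulation may differ in detail). It reuses the Event type and prec test assumed in the previous sketch.

/* Proxies of a poset event X, sketched as its minimal and maximal elements. */
#include <stdbool.h>
#include <stddef.h>

typedef int Event;                    /* as in the previous sketch */
extern bool prec(Event x, Event y);   /* assumed test for x "happened before" y */

size_t lower_proxy(const Event *X, size_t nx, Event *out) {
    size_t n = 0;
    for (size_t i = 0; i < nx; i++) {
        bool minimal = true;                 /* no element of X precedes X[i] */
        for (size_t j = 0; j < nx && minimal; j++)
            if (j != i && prec(X[j], X[i])) minimal = false;
        if (minimal) out[n++] = X[i];
    }
    return n;
}

size_t upper_proxy(const Event *X, size_t nx, Event *out) {
    size_t n = 0;
    for (size_t i = 0; i < nx; i++) {
        bool maximal = true;                 /* X[i] precedes no element of X */
        for (size_t j = 0; j < nx && maximal; j++)
            if (j != i && prec(X[i], X[j])) maximal = false;
        if (maximal) out[n++] = X[i];
    }
    return n;
}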
[Figure: poset events X and Y drawn against space (vertical axis) and time (horizontal axis); atomic events are marked, together with the proxies LX, UX of X and LY, UY of Y.]
Fig. 1. Poset events X and Y and their proxies.
Table 3. Full hierarchy of relations of Table 1 [9]. Relations R1, R1', R2, R2', R3, R3', R4, R4' of Table 1 are renamed a, a', b, b', c, c', d, d', respectively. Relations in the row and column headers are defined between X and Y; each cell gives the relation of the row header to the column header.

Relation (its quantifiers for x ≺ y)      R1,a   R2,b   R2',b'   R3,c   R3',c'   R4,d
R1,a (=R1',a'): ∀x∀y (=∀y∀x)               =      ⊑      ⊑        ⊑      ⊑        ⊑
R2,b: ∀x∃y                                 ⊒      =      ⊒        ||     ||       ⊑
R2',b': ∃y∀x                               ⊒      ⊑      =        ||     ||       ⊑
R3,c: ∃x∀y                                 ⊒      ||     ||       =      ⊑        ⊑
R3',c': ∀y∃x                               ⊒      ||     ||       ⊒      =        ⊑
R4,d (=R4',d'): ∃x∃y (=∃y∃x)               ⊒      ⊒      ⊒        ⊒      ⊒        =
[Figure: lattice diagram showing the hierarchy of the 24 distinct relations R1a-R4d of Table 4.]
Fig. 2. Hierarchy of relations in [11,12].
The resulting set of poset relations, denoted R, is given in Table 4. The relations in R form a lattice of 24 unique elements, as shown in Fig. 2. R is comprehensive using first-order predicate logic and only the ≺ relation between atomic events. Relation R?#(X, Y) means that proxies of X and Y are chosen as per ?, and events in the proxies are related by #. The two relations in [14], viz., −→ and −−→, correspond to R1a and R4d, respectively, whereas the relations in [9] listed in Table 1 correspond to these relations as follows: R1 (=R1'), R2, R2', R3, R3', R4 (=R4') correspond to R1a, R2b, R2b', R3c, R3c', R4d, respectively.
3
Significance of the Relations
The hierarchy of causality relations is useful for applications which use nonatomicity in reasoning and modeling and need a fine level of granularity of causality relations to specify synchronization relations and their composite global predicates between nonatomic events. Such applications include industrial process control applications, distributed debugging, navigation, planning, robotics, diagnostics, virtual reality, and coordination in mobile systems. The hierarchy provides a range of relations, and an application can use those that are relevant and useful to it. The relations and their composite (global) predicates provide a precise handle to express a naturally occurring, or to enforce a desired, fine-grained level of causality or synchronization in the computation. A specific meaning of each relation and a brief discussion of how it can be enforced are now given. In the following discussion, the "X computation" and "Y computation" refer to the computation performed by the nonatomic events X and Y, respectively.
Table 4. Relations r(X, Y) in R [11,12].

Relation r(X, Y)   Relation definition specified by quantifiers for x ≺ y, where x ∈ X, y ∈ Y
R1a                ∀x ∈ UX ∀y ∈ LY
R1a' (=R1a)        ∀y ∈ LY ∀x ∈ UX
R1b                ∀x ∈ UX ∃y ∈ LY
R1b'               ∃y ∈ LY ∀x ∈ UX
R1c                ∃x ∈ UX ∀y ∈ LY
R1c'               ∀y ∈ LY ∃x ∈ UX
R1d                ∃x ∈ UX ∃y ∈ LY
R1d' (=R1d)        ∃y ∈ LY ∃x ∈ UX
R2a                ∀x ∈ UX ∀y ∈ UY
R2a' (=R2a)        ∀y ∈ UY ∀x ∈ UX
R2b                ∀x ∈ UX ∃y ∈ UY
R2b'               ∃y ∈ UY ∀x ∈ UX
R2c                ∃x ∈ UX ∀y ∈ UY
R2c'               ∀y ∈ UY ∃x ∈ UX
R2d                ∃x ∈ UX ∃y ∈ UY
R2d' (=R2d)        ∃y ∈ UY ∃x ∈ UX
R3a                ∀x ∈ LX ∀y ∈ LY
R3a' (=R3a)        ∀y ∈ LY ∀x ∈ LX
R3b                ∀x ∈ LX ∃y ∈ LY
R3b'               ∃y ∈ LY ∀x ∈ LX
R3c                ∃x ∈ LX ∀y ∈ LY
R3c'               ∀y ∈ LY ∃x ∈ LX
R3d                ∃x ∈ LX ∃y ∈ LY
R3d' (=R3d)        ∃y ∈ LY ∃x ∈ LX
R4a                ∀x ∈ LX ∀y ∈ UY
R4a' (=R4a)        ∀y ∈ UY ∀x ∈ LX
R4b                ∀x ∈ LX ∃y ∈ UY
R4b'               ∃y ∈ UY ∀x ∈ LX
R4c                ∃x ∈ LX ∀y ∈ UY
R4c'               ∀y ∈ UY ∃x ∈ LX
R4d                ∃x ∈ LX ∃y ∈ UY
R4d' (=R4d)        ∃y ∈ UY ∃x ∈ LX
A proxy of X is denoted X̂. We first consider the significance of the groups of relations R*a(X, Y), R*b(X, Y), R*b'(X, Y), R*c(X, Y), R*c'(X, Y), and R*d(X, Y). Each group deals with a particular proxy X̂ and Ŷ.
– R*a(X, Y): All events in Ŷ know the results of the X computation (if any) up to all the events in X̂. This is a strong form of synchronization between X̂ and Ŷ.
– R*b(X, Y): For each event in X̂, some event in Ŷ knows the results of the X computation (if any) up to that event in X̂. The Y computation may then exchange information about the X computation up to X̂ among the nodes participating in the Y computation.
– R*b'(X, Y): Some event in Ŷ knows the results of the X computation (if any) up to all events in X̂. These relations are useful when it is sufficient for one node in NŶ to detect a global predicate across all nodes in NX̂. If the event in Ŷ is at a node that behaves as the group leader of NŶ, then it can either inform the other nodes in NŶ or make decisions on their behalf.
– R*c(X, Y): All events in Ŷ know the results of the X computation (if any) up to some common event in X̂. This group of relations is useful when it is sufficient for one node in NX̂ to inform all the nodes in NŶ of its state,
such as when all the nodes in NX̂ have a similar state. If the node at which the event in X̂ occurs has already collected information about the results/states of the X computation up to X̂ from other nodes in NX̂ (thus, that node behaves as the group leader of X̂), then the events in Ŷ will know the states of the X computation up to X̂.
– R*c'(X, Y): Each event in Ŷ knows the results of the X computation (if any) up to some event in X̂. If it is important to the application, then the state at each event in X̂ should be communicated to some event in Ŷ.
– R*d(X, Y): Some event in Ŷ knows the results of the X computation (if any) up to some event in X̂. The nodes under consideration, at which the events in Ŷ and X̂, respectively, occur, may be the group leaders of NŶ and NX̂, respectively. This group leader of NX̂ may have collected relevant state information from other nodes in NX̂, and conveys this information to the group leader of NŶ, which in turn distributes the information to all nodes in NŶ.
The above significance of each group of relations applies to each individual relation of that group. The specific use and meaning of each of the 24 relations in R is given next. We do not restrict the explanation that follows to any specific application.
R1*(X, Y): This group of relations deals with UX and LY. Each relation signifies a different degree of transfer of control for synchronization, as in group mutual exclusion (gmutex), from the X computation to the Y computation.
– R1a(X, Y): The Y computation at any node in NY begins only after that node knows that the X computation at each node in NX has ended, e.g., a conventional distributed gmutex in which each node in NY waits for an indication from each node in NX that it has relinquished control.
– R1b(X, Y): For every node in NX, the final value of its X computation is known by (or its mutex token is transferred to) some node in NY before that node in NY begins its Y computation. Thus, nodes in NY collectively (but not individually) know the final value of the X computation before the last among them begins its Y computation. This is a weak version of synchronization/gmutex.
– R1b'(X, Y): Before beginning its Y computation, some node in NY knows the final value of the X computation at each node in NX. This is a weak version of synchronization/gmutex (but stronger than R1b) with the property that at least one node in NY cannot begin its Y computation until the final value of the X computation at each node in NX is known to it.
– R1c(X, Y): The final value of the X computation at some node in NX is known to all the nodes in NY before they begin their Y computation. This is a weak form of synchronization/gmutex which is useful when it suffices for a particular node in NX to grant all the nodes in NY gmutex permission to proceed with the Y computation; this node in NX may be the group leader of NX, or simply all the nodes in NX have the same final local state of the X computation within this application.
– R1c'(X, Y): Each node in NY begins its Y computation only after it knows the final value of the X computation of some node in NX. This is a weak form of synchronization/gmutex (weaker than R1c) which requires each node in NY to receive a final value (or gmutex token) from at least one node in NX before starting its Y computation. This relation is sufficient for some applications, such as those requiring that at most one (additional) process be admitted to join those in the critical section when one process leaves it.
– R1d(X, Y): Some node in NY begins its Y computation only after it knows the final value of (or receives a gmutex token from) the X computation at some node in NX. This is the weakest form of synchronization/gmutex.
R2*(X, Y): This group of relations deals with UX and UY. The relations can signify various degrees of synchronization between the termination of computations X and Y, where X is nested within Y or X is a subcomputation of Y. Alternately, Y could denote activity at processes that have already spawned X activity in threads, and Y can complete only after X completes.
– R2a(X, Y): The Y computation at any node in NY can terminate only after that node knows the final value of (or termination of) the X computation at each node in NX. This is a strong synchronization before termination, between X and Y.
– R2b(X, Y): For every node in NX, the final value of its X computation is known by at least one node in NY before that node in NY terminates its Y computation. Thus, all the nodes in NY collectively (but not individually) know the final values of the X computation before they terminate their Y computation. This is a weak synchronization before termination.
– R2b'(X, Y): Before terminating its Y computation, some node in NY knows the final value of the X computation at all nodes in NX. This is a stronger synchronization before termination than R2b, wherein at least one node in NY cannot terminate its Y computation without knowing the final state of the X computation at all nodes in NX. This suffices for all applications in which it is adequate for one node in NY to detect the termination of the X computation at each node in NX before that node terminates its Y computation.
– R2c(X, Y): The final value of the X computation at some node in NX is known to all the NY nodes before they terminate the Y computation. This is a weak form of synchronization. The pertinent node in NX could represent a critical thread in the X computation, or could be the group leader of NX that represents the X computation at all nodes in NX.
– R2c'(X, Y): Each node in NY terminates its Y computation only after it knows the final value of the X computation at some node in NX. This is a weak form of synchronization before termination (weaker than R2c), but is adequate when all the nodes in NX are performing a similar X computation.
– R2d(X, Y): Some node in NY terminates its Y computation after it knows the final value of the X computation at some node in NX. This is a weak form of synchronization; however, if the concerned nodes in NX and NY are the respective group leaders of the X and Y computations and, respectively,
collect/distribute information from/to their groups, then a strong form of synchronization can be implicitly enforced, because when Y terminates, it is known to each node in NY that the X computation has terminated.
R3*(X, Y): This group of relations deals with LX and LY. The relations can signify various degrees of synchronization between the initiation of computations X and Y, where Y is nested within X or Y is a subcomputation of X. Alternately, X could denote activity at processes that have already spawned Y activity in threads.
– R3a(X, Y): The Y computation at any node in NY begins after that node knows the initial values of the X computation at each node in NX. This is a strong form of synchronization between the beginnings of the X and Y computations.
– R3b(X, Y): For each node in NX, the initial state of its X computation is known to some node in NY before that node in NY begins its Y computation. Thus, all the nodes in NY collectively (but not individually) know the initial state of the X computation. This synchronization is sufficient when the forked Y computations at each node in NY are only loosely coupled and should not know each other's initial states communicated by the X computation, while at the same time ensuring that the initial state of the X computation at each node in NX is available to at least one node in NY before it commences its Y computation.
– R3b'(X, Y): Before beginning its Y computation, some node in NY knows the initial state of the X computation at all the nodes in NX. Thus the Y computation at this node can run a parallel X computation for fault-tolerance, or can be made an entirely deterministic function of the inputs to the X computation. The subject node in NY can coordinate the Y computation of the other nodes in NY. This synchronization is weaker than R3a but stronger than R3b.
– R3c(X, Y): The initial state of the X computation at some node in NX is known to all the nodes in NY before they begin their Y computation. This is a weak synchronization; however, it is adequate when the subject node in NX has forked all the threads that will perform Y, and behaves as the group leader of X that initiates the nested computation Y.
– R3c'(X, Y): Each node in NY begins its Y computation only after it knows the initial state of the X computation at some node in NX. Thus each node executing the computation Y has its Y computation forked or spawned by some node in NX, and its Y computation corresponds to a nested branch of X. The nodes in NY may not know each other's initial values for the Y computation; the X computations at (some of) the NX nodes have semi-independently forked the Y computations at the nodes in NY.
– R3d(X, Y): Some node in NY begins its Y computation only after it knows the initial state of the X computation at some node in NX. This is a weak form of synchronization in which only one node in NX and one node in NY coordinate their respective initial states of their local X and Y computations. However, if the node in NX that initiated the X computation forks off
the main thread for the Y computation, then this form of synchronization between the initiations of X and Y is adequate to have Y as an entirely nested computation within X.
R4*(X, Y): This group of relations deals with LX and UY. The relations signify different degrees of synchronization between a monitoring computation Y that knows the initial values with which the X computation begins, and then the monitoring computation Y terminates.
– R4a(X, Y): The Y computation at any node in NY terminates only after that node knows the initial values of the X computation at each node in NX. This is a strong form of synchronization between the start of X and the end of Y.
– R4b(X, Y): For every node in NX, the initial state of its X computation is known by at least one node in NY before that node in NY terminates its Y computation. Even if there is no exchange of information in the Y computation about the state of the X computation at individual nodes in NX, this relation guarantees that when Y completes, the (initial) local states at each of the NX nodes are collectively (but not individually) known by NY.
– R4b'(X, Y): Before terminating its Y computation, some node in NY knows the initial state of the X computation at all the nodes in NX. This node in NY can detect if an initial global predicate of the X computation across the nodes in NX is satisfied, before it terminates its Y computation. If this node in NY is a group leader, it can then inform the other nodes in NY to terminate their Y computations.
– R4c(X, Y): The initial state of the X computation at some node in NX is known to all the nodes in NY before they terminate their Y computation. This weak synchronization is adequate for applications where all the NX nodes start their X computation with similar values. Alternately, if the node in NX behaves as a group leader, it can first detect the initial global state of the X computation and then inform all the nodes in NY.
– R4c'(X, Y): Each node in NY terminates its Y computation only after it knows the initial state of the X computation at some node in NX. This is a weaker form of synchronization than R4c because the states of all nodes in NX may not be observed before the nodes in NY terminate their Y computation. But this will be adequate for applications in which each node in NX is reporting the same state/value of the X computation, and each node in NY simply needs a confirmation from some node in NX before it terminates its Y computation. For example, a mobile host (an NY node) may simply need a confirmation from some base station (an NX node) before it exits its Y computation.
– R4d(X, Y): Some node in NY terminates its Y computation after it knows the initial state of the X computation at some node in NX. This weak form of synchronization is sufficient when the group leader of X, which is responsible for kicking off the rest of X, informs some node (or the group leader) of the monitoring distributed program Y that computation X has successfully begun.
Consider enforcing group mutual exclusion (gmutex) between two groups of processes G1 and G2. Multiple processes of either group, but not both groups, are permitted to access some critical resources, such as distributed database records, at any time. Relations R1* represent different degrees of gmutex that can be enforced, as explained for R1*(X, Y) earlier in this section. Also, the strongest form of gmutex, R1a, can be enforced by R1b', R1c, and R1d, if the communicating nodes in G1/G2 are the respective group leaders. Thus, for R1b', the nodes in G1 communicate their states (gmutex tokens) to the group leader of G2, which collects all these states (gmutex tokens) and distributes them within G2. For R1c, the group leader of G1 collects all the gmutex tokens from G1, then informs all the nodes in G2. For R1d, the group leader of G1 collects all the gmutex tokens from G1, then informs the group leader of G2, which then informs all the nodes in G2. The above four ways of expressing distributed mutual exclusion provide a choice in trade-offs of (i) knowledge of membership of G1 and/or G2, by members and/or group leaders, within each group and across groups (this is further complicated with mobile processes), (ii) different delay or number of message exchange phases to achieve gmutex (it is critical to have rapid exchange of access rights to the distributed database), (iii) different numbers of messages exchanged to achieve gmutex (bandwidth is a constraint, particularly with the use of crypto techniques), and (iv) fault-tolerance implications (critical to sensitive applications).
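As a concrete illustration of the R1c-style enforcement just described (the group leader of G1 collects the gmutex tokens, then informs every node of G2), here is a sketch using MPI point-to-point messages. The rank layout, message tags and zero-byte "token" messages are assumptions made only for this example.

/* Sketch: ranks 0..k-1 form G1 (rank 0 is its leader); ranks k..size-1
 * form G2.  Each G1 node reports completion of its X computation to the
 * G1 leader; the leader then tells every G2 node it may start Y (R1c). */
#include <mpi.h>

#define TAG_TOKEN 1
#define TAG_GO    2

void r1c_gmutex(MPI_Comm comm, int k /* size of G1 */)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {                           /* group leader of G1      */
        for (int i = 1; i < k; i++)            /* collect gmutex tokens   */
            MPI_Recv(NULL, 0, MPI_CHAR, i, TAG_TOKEN, comm,
                     MPI_STATUS_IGNORE);
        for (int j = k; j < size; j++)         /* grant permission to G2  */
            MPI_Send(NULL, 0, MPI_CHAR, j, TAG_GO, comm);
    } else if (rank < k) {                     /* other members of G1     */
        /* ... finish the X computation, release the resource ...         */
        MPI_Send(NULL, 0, MPI_CHAR, 0, TAG_TOKEN, comm);
    } else {                                   /* members of G2           */
        MPI_Recv(NULL, 0, MPI_CHAR, 0, TAG_GO, comm, MPI_STATUS_IGNORE);
        /* ... now safe to begin the Y computation ...                    */
    }
}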
4
Concluding Remarks
We showed the uses of fine-grained synchronization relations by applications that use nonatomicity in modeling actions and need a fine degree of granularity to specify synchronization relations and their composite global predicates. We showed the specific meaning and significance of each relation. Some uses of the relations in a distributed system include modeling various forms of synchronization for group mutual exclusion, initiation of nested computing, termination of nested computing, and monitoring the start of a computation. The synchronization between any X and Y computations can be performed at any "synchronization barrier" for the X and Y computations of any application. For example, R2*(X, Y) synchronization can be performed at a barrier to ensure that subcomputation X which is nested in subcomputation Y completes before subcomputation Y completes, following which R3*(Y, X) synchronization can be performed to kick off another nested subcomputation X within Y. This barrier synchronization may involve blocking of processes which need to wait for expected messages, analogous to the barrier synchronization for multiprocessor systems [17]. The synchronization relations provide a precise handle to express various types of synchronization conditions in first-order logic. Each synchronization performed satisfies a global predicate in the execution. (A classification of some types of global predicates is given in [3,6].) Complex global predicates can be formed by logical expressions on such synchronization relations. It is an interesting problem to classify the global predicates that can be specified using the synchronization relations. Observe that performing a synchronization (corresponding to one of the relations) involves message passing, and hence also provides a direct way to evaluate global state functions and global predicates involving the nodes participating in the synchronization. Thus, performing the synchronization enables the detection of global predicates. Also, global predicates can be enforced by initiating the synchronization only when some local predicates become true. Identifying the types of global predicates that can be detected or enforced using the synchronization relations is an interesting problem. The design of algorithms to enforce global predicates is also a topic for study.
References
1. Linear time, branching time, and partial orders in logics and models of concurrency, J. W. de Bakker, W. P. de Roever, G. Rozenberg (Eds.), LNCS 354, Springer-Verlag, 1989.
2. J. V. Benthem, The Logic of Time, Kluwer Academic Publishers, (1st ed. 1983), 2nd ed. 1991.
3. R. Cooper, K. Marzullo, Consistent detection of global predicates, ACM/ONR Workshop on Parallel and Distributed Debugging, 163-173, May 1991.
4. C. A. Fidge, Timestamps in message-passing systems that preserve partial ordering, Australian Computer Science Communications, Vol. 10, No. 1, 56-66, Feb. 1988.
5. P. C. Fishburn, Interval Orders and Interval Graphs: A Study of Partially Ordered Sets, J. Wiley & Sons, 1985.
6. V. Garg, B. Waldecker, Detection of weak unstable predicates in distributed programs, IEEE Transactions on Parallel and Distributed Systems, 5(3), 299-307, March 1994.
7. W. Janssen, M. Poel, J. Zwiers, Action systems and action refinement in the development of parallel systems, in J. C. Baeten, J. F. Groote (Eds.), Concur'91, LNCS 527, Springer-Verlag, 298-316, 1991.
8. A. Kshemkalyani, Temporal interactions of intervals in distributed systems, TR-29.1933, IBM, Sept. 1994.
9. A. Kshemkalyani, Temporal interactions of intervals in distributed systems, Journal of Computer and System Sciences, 52(2), 287-298, April 1996. (Contains some parts of [8].)
10. A. Kshemkalyani, Framework for viewing atomic events in distributed computations, Theoretical Computer Science, 196(1-2), 45-70, April 1998.
11. A. Kshemkalyani, Relative timing constraints between complex events, 8th IASTED Conf. on Parallel and Distributed Computing and Systems, 324-326, Oct. 1996.
12. A. Kshemkalyani, Synchronization for distributed real-time applications, 5th Workshop on Parallel and Distributed Real-time Systems, IEEE CS Press, 81-90, April 1997.
13. L. Lamport, Time, clocks, and the ordering of events in a distributed system, CACM, 21(7), 558-565, July 1978.
14. L. Lamport, On interprocess communication, Part I: Basic formalism, Part II: Algorithms, Distributed Computing, 1:77-101, 1986.
15. F. Mattern, Virtual time and global states of distributed systems, Parallel and Distributed Algorithms, North-Holland, 215-226, 1989.
16. F. Mattern, On the relativistic structure of logical time in distributed systems, in Datation et Controle des Executions Reparties, Bigre, 78 (ISSN 0221-525), 3-20, 1992.
17. J. Mellor-Crummey, M. Scott, Algorithms for scalable synchronization on shared-memory multiprocessors, ACM Transactions on Computer Systems, 9(1), 21-65, Feb. 1991.
18. E. R. Olderog, Nets, Terms, and Formulas, Cambridge Tracts in Theoretical Computer Science, 1991.
19. A. Rensink, Models and Methods for Action Refinement, Ph.D. thesis, University of Twente, The Netherlands, Aug. 1993.
20. R. Schwarz, F. Mattern, Detecting causal relationships in distributed computations: In search of the holy grail, Distributed Computing, 7:149-174, 1994.
A Simple Protocol to Communicate Channels over Channels

Henk L. Muller and David May
Department of Computer Science, University of Bristol, UK
{henkm,dave}@cs.bris.ac.uk
Abstract. In this paper we present the communication protocol that we use to implement first class channels. Ordinary channels allow data communication (like CSP/Occam); first class channels allow communicating channel ends over a channel. This enables processes to exchange communication capabilities, making the communication graph highly flexible. In this paper we present a simple protocol to communicate channels over channels, and we show that we can implement this protocol cheaply and safely. The implementation is going to be embedded in, amongst others, ultra mobile computer systems. We envisage that the protocol is so simple that it can be implemented at hardware level.
1
Introduction
The traditional approach to communication in concurrent and parallel programming languages is either very flexible or very restricted. A language like Occam [1], based on CSP [2], has a static communication graph. Processes are connected by means of channels. Data is transported over channels synchronously, meaning that input and output operations wait for each other before data can be communicated. In contrast, communication in concurrent object-oriented languages is carried out via object identifiers. Once a process obtains an identifier of another object, it can communicate with that object, and so can any other object that holds the identifier. Similarly, languages based on the π-calculus [3] use many-to-one communication. We have recently developed a form of communication that lies between the purely static and very dynamic communication constructs. We enforce communication over point-to-point channels, just like Occam, but instead of having a fixed communication graph we allow channels to be passed between processes. This gives us a high degree of flexibility. A channel can be seen as a communication capability to or from a process. This is close to the NIL approach [4]. These flexible channels can be extremely useful in many application areas. Allowing channel ends to be passed like ordinary data gives us an elegant paradigm to encode the (mobile) software required for wearable computers. Multi-user games often require communication capabilities to be exchanged between nodes in the system. A similar type of channel has been successfully used for audio processing [5], and mobile channels appear to be a natural communication medium for continuous media.
Fig. 1. A system with two nodes 'D' and 'O', and a channel connecting ports a and b: (i) the channel is local on 'O'; (ii) port b is passed to node 'D', stretching the channel; (iii) port a is transferred to 'D', and the channel snaps back.
In all of these applications, one wants to be able to pass around a communication capability from one part of a system to another. Having one-to-one communication channels, as opposed to the one-to-many communication models which many concurrent OO languages offer as their default mechanism, makes it easier for the programmer to reason about the program behaviour, while it simplifies and speeds up the implementation. This is especially important when we are going to implement these protocols at hardware level. In this paper we discuss the implementation of the mobile channel model. In Section 2 we first give a brief overview of the high-level paradigm that we use to model these flexible channels. After that, we discuss the protocols that we have developed to implement the channels in Section 3. In Section 4 we present some performance figures of our implementation.
2
Moving Ports Paradigm
The programming paradigm that we have developed, Icarus, is designed to enable the development of mobile distributed software. In this section we give a brief description of the Icarus programming model, concentrating on the communication aspects; for an in-depth discussion of Icarus we refer to a companion paper [6]. Icarus processes communicate by means of channels. Like CSP and Occam channels, Icarus channels are point-to-point and uni-directional. That means that, at any moment in time, a channel has exactly one process that may input from the channel, and one process that may output to the channel. We call these channel ends ports, so each channel has an associated input port and output port. The two ports that are connected by a channel are called companion ports. In contrast with CSP and Occam, Icarus ports are first-class. This means that one can declare a variable of the type "input port of integers", and assign it a channel end (the type system requires this to be the input end of an integer channel). Alternatively one can declare a "channel of input port of integers", and communicate a port over this channel. Because an Icarus port can be passed around between processes, a flexible communication structure can be created. There are two ways to look at mobile channels and ports. The easiest way to visualise mobile channels is to view a channel as a rubber pipe connecting two ports. Wherever the ports are moved, the channel connects them and transports data from the output port to the input port when required. The grey line in Figure 1(i) shows the channel between the two ports labelled a and b.
{ chan !int d ;
  chan ?int c ;
  par {
    { chan z ;
      c ! z ;
      d ! z ;
    } || {
      ?int x ;
      int b ;
      c ? x ;
      x ? b ;
    } || {
      !int y ;
      d ? y ;
      y ! 1 ;
    }
  }
}

(The annotated stages of the execution shown on the right of the figure are: 1, the initial configuration; 2, the ports of z are moving; 3, the ports have arrived in x and y.)
Fig. 2. Example Icarus program. Left: the code. Right: the execution, the processes before, during and after communication over c and d.
Another way to view channels and ports is to enumerate them. If we assign a unique even number to each channel, say C, then we can define the ports of that channel to have numbers C and C + 1. Data output over port C will always arrive at its companion port C + 1, regardless of the physical locations of the ports. In order to preserve the point-to-point property of channels, ports are not copied when assigned or sent over a channel. Instead, they are moved. When reading a port variable, the variable is not just read, but read-and-destroyed in one atomic operation. These moving semantics guarantee that at any time each channel connects exactly two ports. A port variable can thus either contain a port, that is, be connected via a channel to a companion port, or be unconnected, which we can represent by having no value in the port. The syntax of Icarus allows us to declare an output port of something by using an exclamation mark: !T is the type of a port over which values of type T can be output. The type ?T denotes the ports over which values of T can be input. So the type !?int is the type of a port over which one can output a port over which one can input an integer. Icarus has output (!) and input (?) operators, and a select statement to allow a choice between multiple possible inputs. Icarus communication is synchronous, which, as is shown later in this paper, simplifies the implementation dramatically. The example program in Figure 2 creates three processes. Process 1 sends an input port to process 2 and an output port to process 3. This establishes a channel between processes 2 and 3, over which they can exchange data.
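Under the enumeration view described above, where an even number C names a channel and C, C + 1 name its two ports, the companion of a port is found by flipping the lowest bit of its number; a one-line helper in C (illustrative only, not part of the Icarus run-time):

/* Companion of port p when channel C (even) owns ports C and C+1. */
static inline int companion(int p) { return p ^ 1; }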
3
The Protocol for Moving Ports
The protocol used to communicate ports over channels is simple. We present the protocol bottom up: we start with the basic communication of simple values, then we discuss sending ports themselves, and finally we consider the difficult cases, such as sending two companion ports simultaneously. We assume that the protocol is implemented on top of a reliable in-order packet delivery mechanism.

[Figure: two companion ports in the port data structure. Port 12 has companion port index 13, no remote address, and state idle; port 13 has companion port index 12, no remote address, and state idle.]
Fig. 3. Part of the port data structure, with two companion ports.
3.1 Communicating Single Values
The port data structure consists of an integer that denotes the index of the companion port, an optional field to denote the remote address of the companion port (not used for a port with a local companion), a field that gives the state of the port, and some additional fields for storing process identifications. Two companion ports are shown in Figure 3. Communicating single values is implemented conventionally, using the same algorithms as used in, for example, the Transputer family [7]. The channel can be in one of three states: idle, inputting, and outputting. This uniquely defines the state of both inputting and outputting processes. The idle state denotes that no process wishes to communicate over the channel. The inputting state denotes that a process has stopped for input and is waiting for communication, but that no process is ready for outputting yet. Conversely, the outputting state denotes that a process is ready for output but that no process is ready for input. There is no state for two processes being ready for input and output, as that state is resolved immediately when the second process performs its input/output operation. The state of a channel is represented as the state of its two ports. It is sufficient to store the state inputting on the output port, and to store the state outputting on the input port. This ensures that a check of the local port suffices to determine the state of the channel. If we want to support a select-operation, we will need an extra state, selecting, which indicates that a process is willing to input. Upon communication the selecting process will perform some extra work to clean other ports which have been placed in the state selecting.
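A minimal C sketch of the per-port bookkeeping just described; the field names and types are illustrative assumptions, not the paper's actual implementation.

/* Local (intra-node) channel state.  The state of a channel is kept on
 * its ports: INPUTTING is recorded on the output port and OUTPUTTING on
 * the input port, so inspecting the local port is enough. */
enum port_state { IDLE, INPUTTING, OUTPUTTING, SELECTING };

struct port {
    int             companion;   /* index of the companion port           */
    int             remote_node; /* companion's node; unused when local   */
    enum port_state state;
    int             waiting_pid; /* identification of a blocked process   */
};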
3.2 Communicating Simple Values Off-Node
Off-node communication is synchronous, but has been optimised to hide latency as much as possible. A process that can output data will immediately send the data on the underlying network, and then wait for a signal that the data has been consumed before continuing. Note that the process stops after outputting, until the data has been input; this is essential when implementing the full port-moving protocol.
Because the port-state of a remote companion port is not directly accessible, a port-state is stored at both nodes. Ports with a remote companion can be in the states remote idle, remote inputting, remote outputting and remote ready. The inputting port can be remote idle (no process willing to input at this moment), remote inputting (a process is willing to input and waiting) or remote ready (no process is willing to input, but data has been sent and is waiting to be input). The outputting port can be in the state remote idle (no process is willing to output at this moment) or remote outputting (a process has output data and is waiting for the data to be input). When implementing a select-operation, a port may get into a state remote selecting to denote that a process is waiting for input from more than one source (note that our select-operation only allows selective input; we do not support a completely general select-operation [8]).
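The corresponding per-port states once the companion lives on another node can be sketched as an enum (the names follow the text; GONE is the transitional state introduced in the next subsection, and the enum itself is only illustrative):

/* States of a port whose companion port lives on a remote node. */
enum remote_state {
    REMOTE_IDLE,        /* no local process engaged on this port           */
    REMOTE_INPUTTING,   /* a process is blocked waiting for input          */
    REMOTE_OUTPUTTING,  /* data has been sent and awaits consumption       */
    REMOTE_READY,       /* data has arrived and waits to be input          */
    REMOTE_SELECTING,   /* the port is one alternative of a select         */
    GONE                /* the port is currently being moved (Section 3.3) */
};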
3.3 Communicating Ports Off-Node
Locally, within a node, a port can be simply an index in an array. Within a node, ports can be freely moved around from one process to another simply by communicating this index value. We discuss transfers of ports to another node in four steps:
1. Transfer of a port from a source node to a destination node, where the companion port is on the source node (the channel stretches).
2. Transfer of a port from a source node to a destination node, where the companion port is on a third node.
3. Transfer of a port from a source node to a destination node, where the companion port is on the destination node (the channel snaps back).
4. Difficult cases (simultaneous transfers of two companion ports).

1. Transfer of a port from a source node to a destination node, where the companion port is on the source node. In this case the companion port is local, and stays behind. According to Section 3.1, the port that will be sent must be in the state idle, inputting, or outputting. Because ports are associated with exactly one process, we can be sure that the port that is going to be sent is not active. That is, no process can be inputting or outputting on the port which is being sent: if a process were inputting or outputting on this port, then the process could not send the port (it could only attempt to send the port over itself; the type system prevents this from happening). Because the port being sent is not active, it is sufficient to send some identification of the companion port to the destination node. The identification that we send consists of a local index number and the node address of the source (for example an IP address and IP port number). This tuple (local index, node address) is a globally unique identification of the port. Upon receiving the identification, a new port is allocated, and the companion port index and node address are stored in this port. The port-state is initialised to remote idle, for the port was inactive.
[Figure: port-table snapshots on the two nodes during the transfer; the labelled messages are 1. Send companion port-ID (137.222.102.50, 2314, 13), 2. Send ID of allocated port (137.222.103.28, 1127, 34), and 4. Acknowledge, system can go (34); the moved port passes through the state GONE and the newly created port starts in state REMOTE_IDLE.]
Fig. 4. Steps to send a port to a remote node. Grey arrows are channels.
The second step is to inform the companion node of the newly created port index, and the companion port's state is set to remote idle, remote inputting, or remote outputting. After the companion port is updated, step three is to delete the original port. Even though the port sent was inactive, at most one message may have been waiting on it; in that case the data is forwarded, in order to preserve the invariant that the data has been sent to the remote node. Finally, in the fourth step an acknowledgement is sent to the destination node, signalling that the companion port is ready. This four-step process is summarised in Figure 4. Between steps 1 and 2 there is a transitional period when the companion port is not connected to anything. For this reason, the companion port-state is set to gone in step 1. Any I/O operation between steps 1 and 2 will wait until this transient state disappears.

2. Transfer of a port from a source node to a destination node, where the companion port is on a different, third, node. This case is slightly more difficult, and we need a generalised version of the previous protocol. This generalised protocol has four steps, outlined in Figure 5. In step 1 the global identification of the companion port is sent from 'S' to 'D'. In step 2 the newly allocated port-index is sent from the destination node 'D' to the companion node 'C'. This will allow the companion port to be updated. Step 3 will acknowledge to 'S' that the port has been transferred, allowing the source node to delete the original port. Finally, step 4 is an acknowledgement to 'D' signalling that the port is ready for use. Performing these four steps in this order will ensure that any message that might have been sent to the source node is collected and forwarded to the destination node. In the worst case, a message might be output on the companion port just after the source node has initiated the transfer. This message will be delivered to the source node while the port is in state gone. The data will be kept there until step 3 of the protocol, whereupon it is transferred to the destination node.
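A sketch of how the destination node might handle step 1 and issue step 2, using the remote states sketched earlier. The helpers alloc_port and send_msg, the ports array, the message names and the wire format are hypothetical and do not come from the paper.

/* Destination-side handling of the port-transfer messages (sketch). */
struct port_id { unsigned node_addr; unsigned short node_port; int index; };

enum remote_state { REMOTE_IDLE, REMOTE_INPUTTING, REMOTE_OUTPUTTING,
                    REMOTE_READY, REMOTE_SELECTING, GONE };

enum xfer_msg {
    XFER_COMPANION_ID,  /* step 1: source -> destination, companion's id  */
    XFER_NEW_PORT,      /* step 2: destination -> companion's node        */
    XFER_DELETE_OLD,    /* step 3: companion's node -> source             */
    XFER_READY          /* step 4: acknowledgement, port is ready for use */
};

struct port_entry {
    int               companion_index;
    struct port_id    companion_node;
    enum remote_state state;
};
extern struct port_entry ports[];
extern int  alloc_port(void);                                 /* assumed */
extern void send_msg(struct port_id to, enum xfer_msg m, int payload);

/* Step 1 arrives at the destination: create the local end of the channel. */
int handle_companion_id(struct port_id companion)
{
    int p = alloc_port();
    ports[p].companion_index = companion.index;
    ports[p].companion_node  = companion;
    ports[p].state           = REMOTE_IDLE;     /* the moved port was idle  */
    send_msg(companion, XFER_NEW_PORT, p);      /* step 2: inform that node */
    /* The new port may be used only after the step-4 acknowledgement.      */
    return p;
}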
[Figure: the three nodes S (source), C (companion) and D (destination) with the four messages: 1, send companion ID (S to D); 2, inform C of new port-ID (D to C); 3, tell source to delete port (C to S); 4, acknowledge (S to D).]
Fig. 5. Steps to send a port in a system involving three nodes.
Note that the case with two nodes discussed in the previous section is a special case of this four-step protocol. If the companion port resides on the source node, then step 3 is a local operation.

3. Transfer of a port from a source node to a destination node, where the companion port is on the destination node. Because the port is being sent to the node where the companion resides, we must ensure that both companion ports are in a 'local' state, allowing data transfer to be optimised for local transport. We use the same protocol as before. Because the destination node and the companion node are now one and the same, step 2, informing the companion node of its new status, is now a local activity. We still need to execute steps 3 and 4, telling the source node that the port can be deleted, and acknowledging the completion. These steps take care of the situation in which the port sent was originally in the state remote ready. The companion had sent data to the input port, and the data was waiting to be picked up. Before the port is deleted, we forward the data to the destination node. Both companions are now on the destination node, so we change the input port-state to outputting, and keep the data in the outputting process until a process is ready to input it. The data can be safely kept in the outputting process because it had not been allowed to proceed.

4. Difficult cases. There are two difficult cases worth discussing: two companion ports that are sent at the same time (to the same or different destination nodes); and the case where a port is transferring a port, which is in turn transferring a port (which in turn might be transferring a port). When two ports are transferred simultaneously, the protocols moving the two ports might interfere with each other. The source of the interference is step 2, where the companion node is updated to reflect the new location of the port transferred. We overcome this problem by checking during step 2 whether the companion port is actually being moved, which is indicated by the port-state gone. If this is the case, then the message is forwarded to the companion's destination node. If a companion port is found in the state gone, then the other port must find its companion gone (because of the message ordering), and both nodes will forward the message of step 2. Eventually, both newly created ports will have a reference to each other's location, and both old ports are deleted.
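Continuing the earlier sketch, the step-2 handler on the companion's node would forward the update when its own port is in the state gone. The moving_to field and the helper below are, again, assumptions for illustration; the declarations of struct port_id, port_entry, the ports array and GONE are as in the previous sketch.

/* Step 2 arrives at the companion's node (sketch; assumes port_entry also
 * records moving_to, the destination of a port that is itself in transit). */
extern void send_new_port(struct port_id to, struct port_id new_location);

void handle_new_port(int local_index, struct port_id new_location)
{
    struct port_entry *q = &ports[local_index];
    if (q->state == GONE) {
        /* The companion port is itself being transferred: forward the
         * step-2 update to the node it is moving to, so that the two
         * simultaneous transfers resolve each other.                     */
        send_new_port(q->moving_to, new_location);
    } else {
        q->companion_node  = new_location;       /* rebind to the new node */
        q->companion_index = new_location.index;
    }
}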
[Figure: the four nodes A, B, C and D and the six ports a-f, connected by channels whose ends have types !int/?int, !?int/??int and !??int/???int.]
Fig. 6. Ports over ports over ports.
Note that if two companion ports are sent simultaneously, then both ports must have been idle, and no data will be sent until both have arrived at their destinations. The most difficult situation is the case where a port is sent, and this port is actually a port of ports, carrying a port. In Figure 6 we have sketched a situation with four nodes (A, B, C and D) and six ports (a, b, c, d, e and f). The ports are linked up as denoted by the grey lines. Outputting port d over e is a normal output operation as discussed before. Similarly, outputting port b over c to d is a normal operation. If the transfers of b and d happen simultaneously, then b is actually making a double hop, from node B via node C to node D. Indeed, there may be an arbitrarily long finite list of ports to be moved, only restricted by the number of '?' in the port type definition (the typing system guarantees it to be finite). This forwarding causes a serious complication in the protocol (around 20% of the code), and we are considering whether it is worthwhile to prohibit this.
3.4 Security
The protocol described above is not secure. Most notably, an attacker can forge a message for step 2 of the protocol, and deliver it to a node, in order to obtain a port in the process. The solution is to extend our protocol with an extra step. When executing step 1 of the protocol, we have to send a clearance message to the companion port, informing it that a move is imminent. The companion node will only execute step 2 of the protocol when the source node has sent this clearance. This provides security against unknown processes taking channel ends, on the assumption that the underlying network enforces a unique naming scheme and secure delivery of messages.
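The resulting message sequence can be summarised by the following C enumeration. The tag names are our own labels for the steps described in the text, not identifiers from the implementation; only the rule that step 2 is handled after the clearance comes from the paper.

    /* One possible labelling of the protocol messages, including the
     * clearance added for security (Sect. 3.4). */
    typedef enum {
        MSG_CLEARANCE,          /* source -> companion: a move of this port is imminent   */
        MSG_CREATE_PORT,        /* step 1: carries the port identifier to the destination */
        MSG_UPDATE_COMPANION,   /* step 2: only handled after the clearance has arrived   */
        MSG_DELETE_ORIGINAL,    /* step 3: the source may delete the original port        */
        MSG_ACK_COMPLETE        /* step 4: the newly created port may start communicating */
    } transfer_msg;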
3.5 Discussion
The protocol has been optimised so that the frequent operations, such as sending ordinary data or ports, have high performance. There are communication patterns which will perform badly. The worst case is a program where a process outputs a large array over a port, while there is no process willing to input this data. If the input-port is now sent from one process to the next, then the data is shipped with it. If the input port were to be sent many times, we would waste bandwidth and increase latency. We could optimise this case by not sending data until required, but this would seriously impair performance in the more common case, when data is communicated between two remote nodes (incurring double latency on each transferral).
Yo-yo: send a port up and down between two nodes.
Diabolo: send both ports up and down.
Bouncing between         Yo-yo     Diabolo
Bristol-a / Bristol-a    7.70 µs   16 µs
Bristol-a / Bristol-b    4.7 ms    9.9 ms
Bristol-a / Amsterdam    322 ms    660 ms
Fig. 7. The two test programs that we measured, with round trip times.
4 Results
The protocol has been implemented in C, to implement Icarus on a network of workstations. The implementation of the protocol takes less than 1000 lines of C, underlining its simplicity. We use UDP (User Datagram Protocol) for the underlying network, with an extra layer to guarantee reliable in-order delivery. We have tested the protocol with many different applications. A class of students has written multi-user games in Icarus. To convince the reader that our implementation works over the Internet, we show timings of the two test programs displayed in Figure 7. The first test program, Yo-yo, keeps one port at a fixed node, and sends the companion port forwards and backwards to a process on a second node. In this program the channel must stretch and snap back every time. The second test program, Diabolo, stretches the protocol with some more demanding communication. Two pairs of processes are running on two nodes; each pair of processes sends one of two companion ports forwards and backwards between the two nodes. We have measured three configurations: one where all processes are running within the same node (which measures communications inside the run-time support only), a configuration where we run the processes on two machines on the same LAN, and a configuration where we run the processes on two distant machines. The performance is linearly related to the average latency between the nodes involved, and to the number of messages needed for a round trip. This is clearly shown in the timings of Yo-yo versus Diabolo for the three configurations.
5 Conclusions
In this paper we presented a simple protocol that allows us to treat communication channels as first class objects. The programming paradigm allows processes to communicate over point-to-point channels. Values communicated can either
be simple values (like integers or arrays of integers), or ports (ends of channels). Because the paradigm allows a synchronous implementation, we have been able to produce a simple implementation. Whenever a port is transferred, the port is known to be idle (because ports are point-to-point and synchronous), so we do not have to transfer any state around. We have developed a four-step protocol to move a communication port from one node to another. Step one carries the port identifier, creating a new port on the destination node. Step two informs the companion port of its new companion. Step three informs the source that the original port is to be deleted. Step four allows the newly created port to start communication. These steps may be local or remote, depending on the relative positions of the source, destination, and companion nodes. If the two connected ports end up in the same node, the channel snaps back to within the node, and the implementation switches to a highly efficient local communication protocol. The applications of this protocol lie in many areas: multi-user games (where players receive channels that allow them to communicate with other players), continuous media switchboards (where a channel carrying, for example, video signals is handed out to two terminals), and wearable computing (establishing communication channels between parts of a changing environment). We are currently integrating this protocol into a wearable computer system and exploring its applications in continuous media.
SciOS: Flexible Operating System Support for SCI Clusters
Povl T. Koch and Xavier Rousset de Pina
SIRAC Laboratory, INRIA Rhône-Alpes, ZIRST - 655, avenue de l'Europe, 38330 Montbonnot Saint-Martin (France)
[email protected]
Abstract. The bottleneck for many parallel and distributed applications on networks of workstations is the high cost of communication on traditional network interfaces. Memory-mapped network interfaces provide latencies of a few microseconds and bandwidths close to the maximum of the local I/O bus. A major drawback of the current operating system support is that applications have to deal directly with data consistency and the allocation and placement of pinned physical memory. We propose a more elaborate interface, called SciOS, that provides a shared-memory abstraction where physical memory is treated as in a NUMA architecture. To lower the average memory access times, we use a relaxed memory model and dynamic page migration and replication, combined with the use of idle remote memory instead of disk swap. We describe the SciOS programming model and discuss the issues for its implementation on an SCI cluster.
1 Introduction
Traditional network interfaces and operating systems often require that every communication pass through expensive calls to the operating system, include extensive buffer handling, and require memory copies between address spaces to enforce protection and provide flexibility and reliability. These interface designs typically result in high message latencies, bandwidth limitations, and high software overheads of communication. This hurts the performance of a wide range of parallel and distributed applications. Many new networking technologies are based on virtual memory-mapped interfaces and have bandwidths in the range of Gigabits per second, message latencies of a few microseconds, and hardware-based reliable delivery and congestion control [1,6,8,14]. After the setup of remote memory mappings, all communication is handled through the processor's normal load and store instructions
This work was supported in part by the Region Rhône-Alpes in the framework of the Emergence Research Program (C.R. E901010001). Supported in part by the European Commission, the TMR Program. Also associated with the University of Copenhagen (Denmark).
or system-managed DMA with very little overhead on the sending and receiving processors. These types of interfaces have been shown to be well suited for message-passing interfaces [10] and parallel programming languages [9] because of their low latencies and the hardware support of zero-copy protocol implementations. Our approach is to use the user-level load and store interface of memory-mapped networks as a means to build a Non-Coherent NUMA architecture and to provide a coherent shared memory system through operating system mechanisms. A major performance issue for NUMA architectures is that applications have to deal with the large performance difference between local and remote memory references. On Cache-Coherent NUMA machines (CC-NUMA), remote accesses are typically 2-3 times more expensive than local accesses. Therefore, the physical placement of data is crucial to the performance of many applications. Each application has to deal explicitly with this locality issue, which can become complex when dealing with dynamic sharing patterns. Mechanisms have been implemented to dynamically migrate and replicate shared memory pages [2,3,5,16] to increase memory locality. For memory-mapped network interfaces, the difference between local and remote memory access is much higher than for CC-NUMA machines because the remote memory accesses have to pass through the I/O bus. In Sect. 2, we describe the capabilities of memory-mapped networks together with the current operating system support provided for message-passing interfaces and its limitations. Section 3 describes the main features and implementation issues of our system, SciOS, based mainly on a coherent shared memory abstraction. Section 4 describes related work and Sect. 5 concludes the paper.
2 Memory-Mapped Network Implementations
The implementation of memory-mapped networks is based on a processor's user-level access to the network interface placed on the I/O bus. Although the I/O bus, e.g., PCI, is slower than the memory bus, it allows more general implementations that can be used on a wide range of architectures. We describe below the main mechanisms of the PCI-SCI adapter and the operating system support provided for it.
2.1 Dolphin's PCI-SCI Adapter
Dolphin's PCI-SCI adapter is based on an IEEE standard called Scalable Coherent Interface (SCI) [11] which specifies the implementation of a coherent shared memory abstraction and a low-level message passing interface. The cache coherence is optional and is typically not implemented on the I/O-based SCI implementations we are considering. SCI specifies a shared 64-bit physical address space. The high-order 16 bits of an SCI address designate a node and the remaining 48 bits allow addressing of physical memory within each node. The
PCI-SCI adapter [6] provides a number of mechanisms to map remote memory, to send and receive low-level messages, generate remote interrupts, and to provide atomic operations on shared memory. SCI is based on a ring topology and Dolphin's implementation is based on 200 MB/s bidirectional point-to-point links and SCI switches which allow the interconnection of rings. The PCI-SCI adapter does not implement the cache consistency mechanisms in the SCI standard. Each application will therefore have to deal with issues such as when to flush or invalidate processor caches, buffers in the PCI-SCI adapter, etc. Below, we focus on the adapter's ability to map remote memory using a standard 32-bit PCI bus. To share physical memory between nodes, a process can allocate local physical memory. The physical memory is accessed locally through a normal virtual-to-physical memory mapping. For a process on a remote node to access the shared memory, the process must set up two mappings using the operating system. (1) The remote memory is mapped into the address space of the local PCI bus. This is done by updating an address translation table (ATT) in the PCI-SCI adapter with the node ID and physical address of the remote memory. (2) The operating system then maps the specific address range of the PCI bus into its virtual address space through manipulation of its page tables. This mapping also enforces protection since the process can only access mapped memory and only in the right mode (read/write/execute). The mappings are shown in Fig. 1 where Process A has remotely mapped a physical memory page allocated and locally mapped by Process B on another node.
[Figure 1 layout: Process A's virtual address space is mapped onto the local PCI bus and through the PCI-SCI I/O bridge (ATT) and the SCI interconnect to the remote node's PCI-SCI bridge and physical memory, which Process B maps directly into its own virtual address space.]
Fig. 1. Memory mappings for physical memory and PCI-SCI adapter.
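The 64-bit SCI address layout described in Sect. 2.1 (a 16-bit node identifier in the high-order bits and a 48-bit physical offset) can be captured by a couple of helper functions; the macro and function names below are our own, not part of the SCI standard or the Dolphin driver.

    #include <stdint.h>

    #define SCI_NODE_SHIFT   48
    #define SCI_OFFSET_MASK  ((UINT64_C(1) << SCI_NODE_SHIFT) - 1)

    /* Build an SCI physical address from a node ID and a 48-bit offset. */
    static inline uint64_t sci_make_addr(uint16_t node_id, uint64_t offset)
    {
        return ((uint64_t)node_id << SCI_NODE_SHIFT) | (offset & SCI_OFFSET_MASK);
    }

    /* Split an SCI address back into its two components. */
    static inline uint16_t sci_node_of(uint64_t addr)   { return (uint16_t)(addr >> SCI_NODE_SHIFT); }
    static inline uint64_t sci_offset_of(uint64_t addr) { return addr & SCI_OFFSET_MASK; }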
2.2 Driver Support for the PCI-SCI Adapter
The driver support for the PCI-SCI adapter is structured as an operating system independent module which has different interfaces for each operating system and adapter versions. The main primitives directly represent the hardware functions presented in Sect. 2.1. The management of physical memory and memory mappings is implemented using device files. DMA functionality is provided through standard read and write calls. A number of reliability functions are provided to shield nodes from hardware failures.
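To make the device-file interface concrete, the following user-space sketch shows how a remote segment might be mapped in the two steps described in Sect. 2.1: installing an ATT entry and then mapping the PCI range into the process. The device path, ioctl number and request structure are purely hypothetical stand-ins; the paper only states that memory management goes through device files and that mmap installs the virtual mapping.

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    #define SCI_IOCTL_MAP_REMOTE 0x5301   /* hypothetical request number */

    struct sci_map_req {                  /* hypothetical request layout */
        uint32_t key;                     /* unique key naming the segment          */
        uint16_t remote_node;             /* node owning the segment                */
        uint32_t length;                  /* 8 KB .. 512 KB for the adapter described */
    };

    void *map_remote_segment(uint32_t key, uint16_t node, uint32_t length)
    {
        int fd = open("/dev/sci0", O_RDWR);            /* hypothetical device file */
        if (fd < 0)
            return NULL;

        struct sci_map_req req = { key, node, length };
        if (ioctl(fd, SCI_IOCTL_MAP_REMOTE, &req) < 0)  /* (1) install an ATT entry */
            return NULL;

        /* (2) map the PCI address range into the caller's address space.
         * The descriptor is deliberately left open for the mapping's lifetime. */
        void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        return p == MAP_FAILED ? NULL : p;
    }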
All physical memory for a segment is marked as pinned so it is not placed on disk by the operating system. This is because the PCI bus only accesses the memory using physical addresses and the PCI-SCI adapter does not have a TLB functionality for incoming requests. An application can create a shared segment between 8 KB and 512 KB which is allocated in contiguous physical memory. The allocation follows certain alignment requirements because the address translation table entries in the PCI-SCI adapter are aligned at 512 KB boundaries. The number of address translation table entries is 8,192, which can map up to 4 GB of remote memory; this is more than sufficient. But the main limitation is how much address space the adapter is allowed to use on the PCI bus. Currently, 128 MB is reserved on the PCI bus for the PCI-SCI adapter, which corresponds to only 256 active ATT entries. For the mapping of small memory blocks, e.g., for special synchronization mappings, a whole ATT entry must be used to map only a few physical pages. ATT entries must therefore be considered a scarce resource which limits the number of remote mappings. After some allocation and de-allocation of physical memory, it can be hard for the operating system to find contiguous and correctly aligned physical memory. Depending on the operating system, this can mean that only a limited number of segments can be created. For memory not perfectly allocated, more ATT entries are therefore needed. (The next-generation PCI-SCI adapter has more ATT entries and allows mapping at a granularity smaller than 512 KB, thereby reducing the alignment requirements.) Each segment has a unique key which is used to name the segment when it is mapped from a remote node. When the segment is mapped to the local PCI bus, the mmap call can map the segment into the virtual address space of a process. The processor cache is disabled for the region of the PCI bus that is used by the PCI-SCI adapter. This means that all load and store operations have to pass through the PCI-SCI adapter, i.e., loading a remote location recently stored to by the same processor results in PCI and SCI traffic. Since writes can arrive out of order, special flush operations or remote reads can be used as store barriers to guarantee some memory consistency. Although the allocation, deallocation, mapping, and unmapping of segments is completely dynamic, the location of a segment is fixed during its lifetime. This is useful for simple, fixed sharing patterns, e.g., when implementing a message-passing interface. Since the bandwidth of remote writes is much higher than that of remote reads, a receive buffer should be fixed on the receiving node to obtain the best performance. This performance may be hurt seriously for segments that have a dynamic sharing pattern.
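A quick check of the figures quoted above, expressed as C constants (only the macro names are ours):

    #include <stdint.h>

    #define ATT_ENTRIES     8192                     /* entries in the adapter's table */
    #define ATT_ENTRY_SPAN  (512u * 1024)            /* each entry maps 512 KB         */
    #define PCI_WINDOW      (128u * 1024 * 1024)     /* PCI address space reserved     */

    /* 8192 entries * 512 KB = 4 GB of remote memory addressable in principle ... */
    #define MAX_MAPPABLE    ((uint64_t)ATT_ENTRIES * ATT_ENTRY_SPAN)   /* = 4 GB  */

    /* ... but only 128 MB / 512 KB = 256 entries can be active at any one time.  */
    #define ACTIVE_ENTRIES  (PCI_WINDOW / ATT_ENTRY_SPAN)              /* = 256   */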
3 The SciOS Prototype
The SciOS prototype provides a flexible coherent shared memory abstraction on a cluster of workstations connected by a load/store-based memory-mapped network interface. The goal is to minimize the cost of all memory references for different sharing patterns.
3.1 The SciOS Programming Model
SciOS provides a large, flat, shared address space. The abstraction for sharing is a memory block which is a contiguous range of virtual memory. To allow efficient implementations of the coherent shared memory, the SciOS model is based on lazy release consistency [12]. This is a relaxed consistency model that allows many low-level optimizations such as the delaying of remote writes and delivery of remote writes out of order. The observation behind release consistency is that correctly synchronized applications only need to have shared memory coherent while inside critical sections. For an application to see a sequentially consistent memory, as in uniprocessor systems, all accesses to shared data are required to be bracketed with appropriate acquire and release operations when respectively entering and leaving a critical section. These operations are provided by the SciOS implementation. Traditional synchronization primitives such as locks can be implemented directly using the corresponding acquire/release operations, while a barrier entry can use the release operation and a barrier departure, the acquire operation. To allow the application to exercise some control over the underlying memory system, SciOS supports five different memory protocols, listed below (a small illustration of how a protocol might be selected follows the list). The protocols range from a simple, fixed memory placement as described in Sect. 2.2 to a more complex implementation with migration and replication of shared memory blocks:

FIXED INIT  Physical pages are pinned and fixed on the node that allocates them. Virtual addresses are coordinated so the same memory block is mapped at the same addresses in all sharing processes. Implementations of message passing interfaces and applications with multiple-producers/single-consumer patterns can benefit from using this memory protocol.

FIRST TOUCH  Physical pages are allocated and pinned on the first node that accesses them. After being allocated, a physical page is fixed on the node, as with the FIXED INIT protocol. This protocol is targeted at parallel applications with a mostly-write sharing pattern or little active sharing.

MIGRATION  Physical pages are allocated as in the FIRST TOUCH protocol, but they may migrate individually between the nodes depending on the sharing pattern. Mechanisms ensure that thrashing (ping-pong) is minimized. Sequential sharing patterns can benefit from this protocol.

MIG REP  This is an extension of the MIGRATION protocol where physical pages are replicated when read by multiple nodes. Coherency mechanisms assure that the memory model is respected. Sharing patterns with mostly-read shared data can use this protocol.

GLOBAL  This is similar to the MIG REP protocol, but to globally minimize swap activity, decisions are coordinated with the virtual memory system and idle remote memory is used as swap space instead of a local disk. This protocol is primarily targeted at applications where each node is memory-bound (“out-of-core” applications).
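The protocol choice maps naturally onto a small C enumeration. How a protocol is actually attached to a memory block is not specified at this point in the paper, so the ioctl request and the code around it are hypothetical; only the five protocol names and the fact that blocks are SciOS files manipulated with open/ioctl/mmap come from the text.

    #include <fcntl.h>
    #include <sys/ioctl.h>

    enum scios_protocol {
        SCIOS_FIXED_INIT,   /* pinned and fixed on the allocating node         */
        SCIOS_FIRST_TOUCH,  /* pinned and fixed on the first node to touch it  */
        SCIOS_MIGRATION,    /* pages migrate, with anti-thrashing freezing     */
        SCIOS_MIG_REP,      /* migration plus replication of read-shared pages */
        SCIOS_GLOBAL        /* MIG REP plus remote memory used as swap space   */
    };

    #define SCIOS_SET_PROTOCOL 0x5353      /* hypothetical ioctl request */

    int create_block(const char *path, enum scios_protocol proto)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);  /* a file under the SciOS mount   */
        if (fd >= 0)
            ioctl(fd, SCIOS_SET_PROTOCOL, &proto);    /* choose the memory protocol     */
        return fd;
    }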
3.2 Our SciOS Testbed
We are currently implementing the SciOS prototype on a cluster of six uniprocessor Intel nodes running the Linux operating system. The nodes all have a standard 32-bit PCI bus. Each node has a single PCI-SCI adapter connected to our SCI switches. When the machines are connected back-to-back, we observe that the latency for remote 4-byte, non-cached accesses is 2.4 microseconds for writes and 4.6 microseconds for reads. Remote access bandwidth is measured at approximately 10 MB/s for reads and 35-70 MB/s for writes for packet sizes as small as 512 bytes and when using all hardware optimizations. A detailed performance analysis of this adapter type can be found in [15]. Dolphin's PCI-SCI driver runs as a Linux kernel module. SciOS is being implemented as another module that uses a few basic functions in the driver module. For SciOS's memory blocks, we use the normal UNIX file abstraction that can be implemented using Linux's Virtual File System (VFS) facility. After the SciOS module has been loaded and a directory mounted, an application can use the standard UNIX system calls open, ioctl, mmap, munmap, and close on memory blocks. The kernel directs these system calls — along with other kernel events such as page faults, swap in, and swap out — for all SciOS files to our SciOS module. This way, we can implement SciOS at the kernel level with full control of the memory blocks and the associated mappings and physical memory. It also allows us to manage the memory blocks with standard UNIX commands like touch, rm, and ls.
3.3 Issues for Implementing SciOS
In this section we will discuss the implementation issues for SciOS on our Linux cluster. The main SciOS data structure is a global table which holds information about each node and a directory for all the SciOS files. When a file is created, a page table holds information about the state of the physical memory allocated for it. With the SciOS implementation, we are dealing with the following four issues: (1) dynamic management of the scarce ATT entries in the PCI-SCI adapter, (2) allowing the use of normal, swappable physical memory for SciOS files, (3) the implementation of NUMA techniques for migration and replication of shared memory, and (4) the integration of the NUMA techniques with a remote swap mechanism. These issues are discussed below. The PCI-SCI driver described in Sect. 2.2 simply refuses more remote mappings when there are no more ATT entries available. In SciOS, applications are allowed to map remote memory because ATT entries are shared dynamically, like the entries in the processor’s TLB. When there are no ATT entries available for a new remote mapping, SciOS will invalidate all the virtual memory mappings for an already used ATT entry and reuse it. When the old mapping is used again, an ATT entry will be allocated in the same way. Since the PCI-SCI adapter does not provide statistics on the extent to which the ATT entry is used, we are forced to use a simple FIFO policy.
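Issue (1), the dynamic sharing of ATT entries, amounts to the simple FIFO reclamation sketched below. The data structure and the invalidation callback are our own simplification; the real module must also locate and tear down every virtual mapping that uses the reclaimed entry.

    #define NUM_ATT 256

    struct att_entry {
        int   in_use;
        void *mapping;           /* the remote mapping currently using this entry */
    };

    static struct att_entry att[NUM_ATT];
    static int next_victim;      /* FIFO cursor: the adapter gives no usage statistics */

    int att_alloc(void *mapping, void (*invalidate)(void *mapping))
    {
        for (int i = 0; i < NUM_ATT; i++)            /* prefer a free entry */
            if (!att[i].in_use) {
                att[i] = (struct att_entry){ 1, mapping };
                return i;
            }

        int v = next_victim;                         /* none free: evict the FIFO victim */
        next_victim = (next_victim + 1) % NUM_ATT;
        invalidate(att[v].mapping);                  /* tear down its virtual mappings    */
        att[v].mapping = mapping;
        return v;
    }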
To alleviate the problem of allocation and alignment of pinned physical memory for the PCI-SCI adapter, it is possible during the startup of the Linux kernel to reserve a large part of physical memory. Since a static reservation is hard to make optimally and it does not allow the sharing of the physical memory with other applications, SciOS uses normal swappable physical memory. When the operating system is running out of physical memory, a “least recently used” page will be placed on disk so the page can be used by another process. SciOS is notified about such events (swapout) for the pages it has allocated. All remote mappings for the page must then be invalidated to avoid wrongly accessing the data of another application. An invalidation request must therefore be sent to all nodes that map the page and the page cannot be placed on disk until all acknowledgments have been received. When a remote process accesses an invalidated remote mapping, the page must be brought back from disk into the physical memory and the remote mapping must then be reestablished. The NUMA techniques are reflected in SciOS’s MIGRATION and MIG REP protocols. During a page fault, two possibilities exist for the MIGRATION protocol: establish a remote mapping or migrate the page to allow local accesses. If the page has migrated much in the past, it is frozen for a period of time [5] and a remote mapping is established to avoid a ping-pong effect. Otherwise, the page is migrated to the node that accesses it. For the MIG REP protocol, a third possibility exists: a page that is not frozen can be replicated on a read access. Replication is preferred over migration if the page has not been modified recently. On write accesses, page replicas must be invalidated to maintain memory coherency. In addition to the page fault decisions, a daemon periodically collects sharing information and may redo migration/replication decisions. Normal NUMA replication protocols eagerly invalidate all replicas before a write access can be allowed to enforce sequential consistency. With lazy release consistency [12], the invalidation of remote replicas can be postponed until the remote nodes perform an acquire operation. We will use this lazy technique for implementing the MIG REP protocol. On current architectures, access to remote memory is much faster than to local disk [7] even with traditional network. By avoiding disk swap activity, performance can be gained for applications with high memory requirements or when multiprogramming each SciOS node. With the GLOBAL protocol, SciOS takes into consideration the amount of idle physical memory on each node when pages are swapped out and while deciding when to migrate and replicate pages. If there is no idle physical memory on a node, SciOS can decide to map a page remotely instead of migrating or replicating the page. When the operating system wants to swap out a page, SciOS can simply discard replica pages. Instead of placing the page on a disk, it can be migrated to a node with a remote mapping and idle physical memory. In cases where the mapping nodes have no idle memory or when the page is not remotely mapped, the page can be migrated to other nodes with idle memory. Only when all physical memory is used on all nodes, SciOS will place pages on disk. To support the GLOBAL protocol, each node places information about its amount of idle memory in the global table.
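The page-fault choices for the MIGRATION and MIG REP protocols described above can be summarised by a small decision function. The thresholds, field names, and the one-second freeze window are illustrative placeholders; the paper fixes only the policy structure (freeze thrashing pages and map them remotely, replicate on reads of pages not modified recently, otherwise migrate).

    #include <stdbool.h>
    #include <time.h>

    struct scios_page {
        int    home_node;
        int    migrations;        /* recent migration count          */
        time_t frozen_until;      /* anti ping-pong freeze window    */
        time_t last_write;
    };

    enum fault_action { MAP_REMOTE, MIGRATE, REPLICATE };

    enum fault_action on_fault(struct scios_page *pg, bool is_write, bool mig_rep)
    {
        time_t now = time(NULL);

        if (now < pg->frozen_until)               /* frozen: leave it where it is   */
            return MAP_REMOTE;

        if (pg->migrations > 4) {                 /* thrashing: freeze for a while  */
            pg->frozen_until = now + 1;
            return MAP_REMOTE;
        }

        if (mig_rep && !is_write && now - pg->last_write > 1)
            return REPLICATE;                     /* read of a page not written recently */

        pg->migrations++;
        return MIGRATE;
    }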
4 Related Work
Our work draws on the research in the area of operating system support for distributed shared memory (DSM) abstractions for workstation clusters and NUMA architectures. Numerous DSM systems implement a coherent shared memory abstraction on workstation clusters without any hardware support for remote memory access. Munin [4] implements multiple memory protocols to support different sharing patterns, as we will do in SciOS. The remote load/store capabilities of the PCI-SCI adapter widen the design space for SciOS compared to traditional DSM systems. The use of page migration and replication can lower memory access time and reduce load on the interconnect [3,13]. Ibel et al. have implemented a global address space at the user level where the application manages a pool of preallocated pinned segments and mappings of remote segments. Because SciOS is implemented in the kernel, we can share the physical memory and remote mappings between applications/processes. Verghese et al. [16] also take the amount of idle physical memory into consideration when replicating pages, as does SciOS's GLOBAL protocol. Feeley et al.'s Global Memory Service [7] uses idle physical memory to reduce expensive disk swap, but they do not deal with consistency of write-shared pages as SciOS's GLOBAL protocol does.
5 Summary
Instead of using memory-mapped network interfaces as a means to provide efficient message passing, we use them as a non-coherent NUMA architecture. We have presented SciOS, a system which provides a programming model based on a coherent shared memory that facilitates the programming of an SCI cluster. Since the placement of physical memory is important for performance, SciOS provides five different protocols so the programmer is able to control the behavior of the underlying memory system. Mechanisms such as migration and replication of shared pages will lower the average memory access times. A contribution of SciOS is a global memory protocol that tightly integrates the shared memory abstraction with the swap functionality of the virtual memory system. This will provide much better performance for applications with high memory requirements because the physical memory on all nodes is used as a cache before pages are placed on a disk. By only using normal swappable physical memory and dynamically sharing address translation table entries, the SciOS implementation will remove many of the restrictions found in the current driver support. We are currently implementing SciOS on an Intel cluster using Dolphin's PCI-SCI adapter.
Acknowledgments. Kaare Lochsen and Hugo Kohmann from Dolphin Interconnect Solutions (Norway) willingly granted us access to the source codes for the PCI-SCI adapter and answered all of our questions. Roger Butenuth from Universität Paderborn (Germany) kindly gave us access to his Linux port of the PCI-SCI driver.
References
1. Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, and Edward W. Felten. Virtual memory mapped network interface for the SHRIMP multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 142–153, April 1994.
2. William J. Bolosky, Robert P. Fitzgerald, and Michael L. Scott. Simple but effective techniques for NUMA memory management. In Proceedings of the 12th ACM Symposium on Operating System Principles, pages 19–31, December 1989.
3. Edouard Bugnion, Scott Devine, and Mendel Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 143–156, October 1997.
4. John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. In Proceedings of the 13th ACM Symposium on Operating System Principles, October 1991.
5. A. Cox and R. Fowler. The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experiences with PLATINUM. In Proceedings of the 12th ACM Symposium on Computer Architecture, December 1990.
6. Dolphin Interconnect Solutions. PCI-SCI cluster adapter specification, May 1996. Version 1.2. See also http://www.dolphinics.no.
7. Michael J. Feeley, William E. Morgan, Frederic H. Pighin, Anna R. Karlin, Henry M. Levy, and Chandramohan A. Thekkath. Implementing global memory management in a workstation cluster. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 201–212, December 1995.
8. Marco Fillo and Richard B. Gillett. Architecture and implementation of MEMORY CHANNEL. Digital Technical Journal, 9(1):27–41, 1997.
9. Richard B. Gillett. Memory channel network for PCI. In IEEE Micro, volume 16, pages 12–18, February 1996.
10. Maximilian Ibel, Klaus E. Schauser, Chris J. Scheiman, and Manfred Weis. High-performance cluster computing using SCI. In Hot Interconnects Symposium V, August 1997.
11. IEEE. IEEE Standard for Scalable Coherent Interface (SCI). 1992. Standard 1596.
12. Peter Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter USENIX Conference, pages 115–132, January 1994.
13. Richard P. Larowe Jr., Carla Schlatter Ellis, and Laurence S. Kaplan. The robustness of NUMA memory management. In Proceedings of the 13th ACM Symposium on Operating System Principles, pages 137–151, October 1991.
14. Evangelos P. Markatos and Manolis G.H. Katevenis. Telegraphos: High-performance networking for parallel processing on workstation clusters. In Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA), February 1996.
15. Knut Omang. Performance of a cluster of PCI based UltraSparc workstations interconnected with SCI. In Proceedings of Network-Based Parallel Computing, Communication, Architecture, and Applications (CANPC'98), number 1362 in Lecture Notes in Computer Science, pages 232–246, January 1998.
16. Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 279–289, October 1996.
Indirect Reference Listing: A Robust Distributed GC
José M. Piquer and Ivana Visconti
Universidad de Chile, Casilla 2777, Santiago, Chile
[email protected]
Abstract. Reference Listing is a distributed Garbage Collection (GC) algorithm which replaces Reference Counts by a list of sites holding references to a given object. It has been successfully implemented on many systems in recent years, and some extensions to it have been proposed to collect cycles. However, Reference Listing is difficult to extend to an environment supporting migration and must remain blocked during a reference duplication to avoid race conditions on the reference lists. On the other hand, Indirect GCs are a family of algorithms which use an inverse tree to group all the references to an object, supporting migration and in-transit references. This paper presents a new algorithm, called Indirect Reference Listing, which extends Reference Listing to the tree, supporting migration, in-transit references (without blocking) and some degree of fault-tolerance. The GC has been implemented on a distributed Lisp system and the execution time overhead (compared to a Reference Count) is under 3%.
1 Introduction
Distributed Reference Counting algorithms (Lermen and Maurer [8]), (Bevan [1]), (Watson and Watson [21]) and (Piquer [10]) are being replaced by Reference Listing algorithms (Shapiro, Dickman and Plainfossé [17]; Birrel et al. [2]) in order to tolerate failures, as the algorithm is more robust. On the other hand, some extensions to Reference Listing have been proposed to recover cycles based on back tracing (Fuchs [5]; Maheshwari and Liskov [9]) and on partial tracing (Rodrigues and Jones [15]). As in Reference Count, migration, in-transit references and the mixture of increment and decrement messages (now called add and delete) can cause problems. Indirect Garbage Collection (Piquer [12]) is a family of algorithms based on a diffusion tree of the references, eliminating the need for two kinds of messages (only decrements exist) and supporting migration. However, the tree structure was very fragile and did not support any kind of failure.
This work has been partially funded by Dirección de Investigación, Fac. Cs. Físicas y Matemáticas, U. de Chile.
The environment in which our distributed garbage collector is designed to run is a loosely-coupled multi-processor system with independent memories, and a reliable point-to-point message passing system. In general, we will suppose that message passing is very expensive and remote pointers already have many fields of information concerning the remote processor, object identifier, etc. In this kind of system, we can accept to lose memory (e.g., adding fields to the remote pointers) if extra messages can be avoided when requiring remote access, or when doing garbage collection. In this paper, we present a new GC, integrating Reference Listing with the Indirect GC family. This algorithm has been implemented on a distributed Lisp environment (Piquer [11]) showing a minimal overhead. Furthermore, we will discuss how Reference Listing can be used to enhance the failure tolerance of the system. The paper is organized as follows: section 2 presents the abstract model of a distributed system with remote pointers used throughout the paper. Section 3 presents the family of indirect garbage collectors, and the detailed algorithm of Indirect Reference Listing. Section 4 discusses the behavior of the algorithm in the presence of failures and section 5 compares it with other related work. Finally, section 6 presents the conclusions.
2 The Model
The basic model we will use in this paper is composed of the set of processors P , communicating through a reliable message-passing system, with finite but not bounded delays. The set of all objects is noted O. Given an object o ∈ O, it always resides at one and only one processor pi ∈ P : pi is called the owner of o. Every object has always one and only one valid owner at a given moment in time. The set of all the remote pointers to an object o is denoted RP(o). It is assumed that an object can migrate, thus changing its owner. However, this operation is considered an atomic operation, and objects are never in transit between two processors. This assumes a strong property in the underlying system, but in general a distributed system blocks every access to a migrating object, and the effect is the same. During a migration operation, a local object is transformed into a remote one and vice-versa. All the other remote pointers in the system remain valid after a migration as they refer to the object, not to a processor. A remote pointer to an object o is called an o -reference1 . For every object o ∈ O, RP(o) includes all of the existing o -references, including those contained in messages already sent but not yet received. This means that asynchronous remote pointer sending is allowed. In-transit remote pointers are considered to be always accessible. An o -reference is a symbolic pointer, usually implemented via an object identifier. It points to the object o, not to its owner. 1
The o -reference terminology and the basis of this model were proposed by Lermen and Maurer ([8]), Piquer ([10]) and Tel and Mattern ([20]).
Only remote pointers are considered in this model, so the local pointers to o are not included in RP(o) and they are not called o -references. In fact, this paper just ignores local references to objects, assuming that only distant pointers exist. The only acceptable values inside an object are remote pointers. The model also requires that, given an object o, every processor can hold at most one o -reference. Remote pointers are usually handled this way: if there are multiple pointers at a processor to the same remote object o, they pass by a local indirection. This indirection represents for us the only o -reference at the processor. This is just for simplicity of the model, but it is not a requisite for the algorithms presented, although it simplifies some implementations. In our model then, an object containing a remote pointer to another object o is seen as an object containing a local pointer to an o -reference. The four primitive operations supported are:

1. Creation of an o -reference. A processor pi, owner of an object o, transmits an o -reference to another processor pj. The operation happens each time that the owner of o sends a message with an o -reference to another processor.

2. Duplication of an o -reference. A processor pj, which already has an o -reference to an object o (not being its owner), transmits the o -reference to another processor pk.

3. Deletion of an o -reference. A processor pj discards an o -reference. This means that the o -reference is no longer locally accessible2.

4. Migration of an object o. A processor pj, holding an o -reference to an object o located at a distant processor pi, becomes the new owner of o. In general, this operation implies the following actions: the old owner transforms its local object o into an o -reference and the new owner transforms its o -reference into a local object o. This operation preserves the number of elements of RP(o) since it just swaps an object o and an o -reference. A migration also generates many Duplication operations: one for each remote pointer contained in the migrating object, from the old to the new owner. The migration protocol is independent of the GC algorithm. However, the GC protocol must support asynchronous owner changes, with in-transit messages, without losing self-consistency.

We suppose that every o -reference creation is performed via one of the above operations. Remote pointer redirection is not supported, so an o -reference always points to the same object o. The minimal set of primitive operations supported is powerful enough to model most of the existing distributed systems. Our simplified model enables a very simple exposition of the main ideas behind the algorithms but not their detailed implementation, which is usually very complex and system dependent. In fact, the model includes an underlying local GC: objects contain every remote
In real implementations, this operation is usually invoked by the local GC.
pointer accessible from them. The Delete operation provides the interface to be used by the local GC to signal a remote pointer that is no longer locally accessible.
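Rendered literally in C, the model boils down to a per-processor table of o -references and four event kinds. The types and the table bound below are our own illustration; the model itself says nothing about representation.

    typedef int processor_id;
    typedef int object_id;

    /* An o-reference is a symbolic pointer: it names the object o, not the
     * processor that currently owns it; locating the owner is the job of
     * the underlying object finder and is outside this model. */
    struct o_reference {
        object_id obj;
        int       locally_accessible;  /* cleared by the local GC via Delete */
    };

    /* The model allows at most one o-reference per object on a processor,
     * so a per-processor table indexed by object identifier is enough to
     * act as the local indirection mentioned in the text. */
    #define MAX_OBJECTS 1024                        /* illustrative bound */
    static struct o_reference *oref_table[MAX_OBJECTS];

    /* The four primitive operations of Sect. 2, as message/event tags. */
    enum primitive_op { OP_CREATE, OP_DUPLICATE, OP_DELETE, OP_MIGRATE };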
3 Indirect Reference Listing
In general, a distributed GC uses the information kept in the pointers by the underlying system to send its messages to the referenced object. This can be performed using the object finder, or just using the owner field of the pointer and following forward pointers (Fowler [4]). An Indirect GC is any GC modified to use a new data structure, exclusively managed and used by the GC, to send its messages: every object o has a distributed inverted tree associated with it, containing every host holding an o -reference. The o -references are extended with a new field Parent for the tree pointers. The object o is always at the root of the tree (see Fig. 1).
3.1 The GC Family
Any GC can be transformed into an Indirect GC, by simply replacing the messages sent directly to the object with messages sent through the inverted tree. This has two advantages: the GC is independent from the object manager, and it supports migrations, as will be shown later. The inverted tree structure represents the diffusion tree of the pointer throughout the system. This structure contains every o -reference and the object o itself. When an o -reference is created or duplicated, the destination node is added as a child of the sender. If the destination node already had a valid Parent for this o -reference, it refuses the second Parent to avoid cycles in the structure. When an o -reference is deleted, the corresponding GC algorithm will delete the o -reference from the tree only if it was a leaf. If not, a special o -reference (called a zombie o -reference) with all the GC fields, but no object access fields, is kept in the structure until it becomes a leaf. The zombie o -references are a side effect of the inverted tree: if one leaf of the sub-tree is still locally accessible, the GC needs zombie o -references to reach the object at the root. Obviously, when the root is the only node in the tree, there are no more o -references in the system, and the object itself is garbage. The Parent field contains a processor identifier3 . If the tree structure is always preserved (during every remote pointer operation), the GC can work on it without depending on the object finder. It is enough to use the Parent fields to send the GC messages to the ancestors in the tree, and following them recursively we can reach the object at the root, without using the underlying system. 3
This optimization cannot be applied if we accept multiple o -references to the same object from the same processor. The Parent field is in fact a pointer to a remote pointer, but our simplified model allows this to be implemented as a pointer to a processor.
Fig. 1. A migration in the inverted tree
Migration is also allowed, by simply making a root change in the tree structure, which is trivial in an inverted tree (see Fig. 1). On the other hand, the object finder itself does not need to use the Parent field. It can access the object by means of the owner fields, or whatever other fields are convenient.
3.2 Indirect Reference Listing
Indirect Reference Counting (IRC) was originally presented in (Piquer [10]). In a different context, this algorithm was also independently discovered at the same time by other authors (Ichisugi and Yonezawa [6]), (Rudalics [16]). In (Tel and Mattern [20]) it is shown that IRC is equivalent to the Dijkstra and Scholten ([3]) termination detection algorithm. Basically, IRC installs the reference counts at each node of the diffusion tree, counting the children of each one. To implement Indirect Reference Listing (IRL), we will simply replace the counters by child lists, containing the site identifier of each node to which a reference has been sent. The changes to IRC in the implementation are minimal:

– Creation and Duplication
  • At pi, when an o -reference is sent to pj, pj is added to Ref list(x) (x being the object o for a Creation or the original o -reference for a Duplication)4.
  • At pj, upon reception of an o -reference from pi, if it was already known, a delete message is sent to pi. If it is newly installed, the o -reference is created with Parent ← pi and Ref list ← NIL.

– Deletion
  • When an o -reference is deleted, it is marked as Deleted.
If pj was already in Ref list(x), this operation will generate a duplicate. It is safe to do so, because a delete message could be in transit from pj . It is a transient situation only, because pj will refuse a second Parent for o, generating a delete message.
Indirect Reference Listing: A Robust Distributed GC
615
  • Upon reception of a delete message for x from pj, pj is deleted from Ref list(x).
  • When an o -reference is marked as Deleted and its Ref list = NIL, a delete message is sent to Parent(o -reference), and the o -reference is reclaimed. This test must be made upon reception of a delete message for an o -reference and upon o -reference deletion.

Since we assume in the model that every processor holds at most one o -reference per object o, a deleted o -reference with a non-null Ref list is a zombie o -reference. If the same o -reference is received again in a message, a delete message must be sent to the o -reference sender, because the zombie already has a valid Parent.

– Migration
  The migration of an object o from pi to pj means a change of the root in the diffusion tree. This operation is trivial on an inverted tree if the old root is known. For IRL, the lists must also be kept consistent, so the new owner (pj) has a new child (adding pi to its Ref list) and has no Parent (sending a delete to its old Parent). The migration costs one delete message, plus some work to be done locally at the respective processors, pi and pj.

When the Ref list of an object is NIL, the object can be collected. An inverted tree with reference lists is shown in Fig. 2.
[Figure 2 layout: P1 is the root with Ref_list = (P2 P3); its children are P2 (Ref_list = ()) and P3 (Ref_list = (P4 P5)); P3's children are P4 and P5, both with Ref_list = ().]
Fig. 2. Indirect Reference Listing
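A minimal C rendering of the Ref list bookkeeping just described is shown below. The structure layout, the fixed-size child array, and the callback used to send delete messages are our own simplifications of the algorithm in the text.

    #include <stdbool.h>

    #define MAX_CHILDREN 16
    #define NO_PARENT    (-1)

    struct irl_ref {
        int  parent;                   /* Parent processor in the diffusion tree        */
        int  ref_list[MAX_CHILDREN];   /* children to which the reference was sent      */
        int  nchildren;                /* bounds checking omitted for brevity           */
        bool deleted;                  /* marked Deleted: a zombie while children remain */
    };

    /* Sending an o-reference to processor p: record p as a child. */
    void on_send(struct irl_ref *r, int p)
    {
        r->ref_list[r->nchildren++] = p;          /* duplicates are harmless (see text) */
    }

    /* Receiving a delete message from child p. */
    void on_delete_msg(struct irl_ref *r, int p,
                       void (*send_delete)(int to_processor))
    {
        for (int i = 0; i < r->nchildren; i++)
            if (r->ref_list[i] == p) {            /* drop one occurrence of p           */
                r->ref_list[i] = r->ref_list[--r->nchildren];
                break;
            }
        if (r->deleted && r->nchildren == 0 && r->parent != NO_PARENT)
            send_delete(r->parent);               /* leaf zombie: propagate and reclaim */
    }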
IRL is very straightforward to implement, but like Reference Count, it lacks the ability to collect cycles. However, the extensions proposed for Reference Listing to collect cycles could also be applied here. Back tracing could be implemented following the reference lists through the tree. IRL uses the inverted tree to send messages, so even as the Parent pointers are modified during migration (the other operations are only allowed to initialize them, not to change them), the reference lists never move. Objects migrate alone
and they get the local o -reference Ref list. Therefore, any in-transit delete message will always arrive at the correct destination. IRL was implemented in a distributed Lisp system, and compared to IRC it only adds a 3% overhead on execution time. Of course, the main overhead is in space, depending on the total number of o -references. However, an interesting point worth noting is that IRL is backward-compatible with IRC (if decrement and delete messages are considered equivalent), and thus some sites could be using IRL while others use IRC in a correct Distributed GC.
4 Failures
Using Reference Lists, it is possible to resist some failures. IRL is a little different, so we will examine in turn: a node wanting to leave the system, a node crash, and message loss or duplication.
4.1 Shutdown
One important problem for Indirect GCs is that it is not obvious for a node to leave the system once it has diffusion trees traversing it. In large distributed systems, it is common for a node to need a shutdown, for maintenance or load reasons. It is important to be able to execute a shutdown at the node, deleting all its o -references and migrating all its remote-pointed objects. For Indirect GCs, zombie o -references must also be eliminated. In IRC, a zombie o -reference could not be deleted, because its children need it to exist. On the other hand, we knew how many children it had, but not who they were. With IRL, we know the children identifiers, so we can start an intermediate node deletion algorithm (suppose pi is shutting down):

– For each zombie o -reference x:
  • pi sends (adopt, Ref list(x)) to Parent(x).
– At pj, upon reception of (adopt, list) for x, for each p in list:
  • Add p to Ref list(x), send (new parent, o -reference) to p.
– At pk, upon reception of new parent for x from pj:
  • Send deletion to Parent(x), set Parent(x) ← pj.

Processor pi waits until there are no more zombie o -references before finally shutting down. Any processor implementing IRL can shutdown cleanly using this protocol. If the other nodes are only using IRC, this protocol still works. If the node willing to leave does not implement IRL, this protocol cannot be engaged. Another possible use of this protocol is to implement intermediate node deletion in the tree, avoiding zombie o -references completely: instead of marking an o -reference as deleted, the Delete operation could send an adopt message to the Parent.
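A sketch of this shutdown exchange in C, reusing the irl_ref structure and MAX_CHILDREN bound from the earlier sketch; the message struct and the callbacks are again our own assumptions rather than the authors' implementation.

    struct adopt_msg { int object; int children[MAX_CHILDREN]; int nchildren; };

    /* At the node shutting down: hand each zombie's children to its Parent. */
    void shutdown_zombie(struct irl_ref *r, int object,
                         void (*send_adopt)(int to, struct adopt_msg *m))
    {
        struct adopt_msg m = { .object = object, .nchildren = r->nchildren };
        for (int i = 0; i < r->nchildren; i++)
            m.children[i] = r->ref_list[i];
        send_adopt(r->parent, &m);              /* wait for completion before exiting */
    }

    /* At the Parent: adopt the children and tell them about their new parent. */
    void on_adopt(struct irl_ref *r, const struct adopt_msg *m, int my_id,
                  void (*send_new_parent)(int to, int object, int new_parent))
    {
        for (int i = 0; i < m->nchildren; i++) {
            r->ref_list[r->nchildren++] = m->children[i];
            send_new_parent(m->children[i], m->object, my_id);
        }
    }

    /* At a child pk: switch Parent and release the old one. */
    void on_new_parent(struct irl_ref *r, int new_parent,
                       void (*send_delete)(int to))
    {
        send_delete(r->parent);
        r->parent = new_parent;
    }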
4.2 Crashes
Supposing that every node is informed of a node crash, we would like to have a protocol to rebuild a consistent tree for every object accessed through the failed site. Every site can search its Ref lists for the failed site and delete the references coming from it. However, indirect references could also exist, and the failed site could also be the Parent of some subtrees. These subtrees need to be rebuilt. One alternative is to generate an error on access, requiring the object to be requested again, or to try to move the subtrees to the root. In any case, the involved nodes can now detect this situation and handle it.
4.3 Message Loss/Duplication
Reference Lists are usually used to tolerate message loss and duplication. IRL does not tolerate these failures, because we accept multiple copies of a site in a Ref list to protect in-transit references. If it is necessary to deal with unreliable transports, exchanging the Ref lists as proposed by Shapiro, Dickman and Plainfossé [17] is probably the best solution.
5 Related Work
– A reference listing algorithm using indirect references to provide fault-tolerance is SSP (Stub-Scion Pairs) (Shapiro, Dickman and Plainfossé [17]; Plainfossé [14]). This algorithm is based on replacing the reference counts by a site list, containing every site where a duplication has been sent. The reference lists are exchanged between sites to keep the information up to date. This extra information provides tolerance to duplication and disorder if a timestamp is added to the messages. A prototype implementation has shown a 10% overhead compared to plain IRC (Plainfossé and Shapiro [13]), but supporting message failures. Indirect Reference Listing is simpler and more efficient, while being more fragile in the presence of failures.

– Using Reference Lists to tolerate failures was originally proposed by Birrel et al. [2] and it has been implemented in Sun's Remote Method Invocation for Java (Sun [19]). It does not support migration and must block the sender of a duplicated reference until an acknowledgment is received from the owner. IRL is not as strong in the presence of failures, but provides a more efficient solution to the problem.

– Fuchs [5] and Maheshwari and Liskov [9] propose back tracing techniques based on Reference Listing. Basically, they perform a tracing GC backwards, from the objects to their roots. Reference Lists are useful for that, because they give us the list of sites pointing to our site. However, inside local spaces the problem is harder, because an o -reference does not know which object contains it. Both papers propose techniques to deal with that issue. These
techniques can be equally applied to IRL, because following the Ref list pointers the tree can be traversed from the root to the leaves (which is a back trace in this case).
6 Conclusions
We propose a new distributed GC algorithm, mixing Reference Listing with Indirect Garbage Collection: IRL. The algorithm is very simple and straightforward to implement, and backward compatible with IRC. The advantage is that a node can leave the system using a shutdown protocol, zombie o -references can be deleted, and site crashes can be handled. Furthermore, back tracing can be applied to IRL, extending it to collect cycles (which is not possible with plain IRC). An implementation in a distributed Lisp system shows an execution overhead under 3% and demonstrates the ability to interoperate between IRC and IRL. Compared with related work, IRL is simpler, supports migration and in-transit references, and does not block a site during duplication. On the other hand, it is more robust than plain IRC.
References
1. Bevan, D. I.: “Distributed Garbage Collection Using Reference Counting,” LNCS 259, PARLE'87 Proceedings Vol. II, Eindhoven, Springer Verlag, June 1987.
2. Birrel, A., Evers, D., Nelson, G., Owicki, S. and Wobber, E.: “Distributed Garbage Collection for Network Objects,” Technical Report 116, DEC-SRC, 1993.
3. Dijkstra, E. W. and Scholten, C. S.: “Termination Detection for Diffusing Computations,” Information Processing Letters, Vol. 11, N. 1, August 1980.
4. Fowler, R. J.: “The Complexity of Using Forwarding Addresses for Decentralized Object Finding,” in Proc. 5th Annual ACM Symp. on Principles of Distributed Computing, pp. 108–120, Alberta, Canada, August 1986.
5. Fuchs, M.: “Garbage Collection on an Open Network,” LNCS 986, Proc. 1995 International Workshop on Memory Management (IWMM'95), Springer Verlag, pp. 251–265, Kinross, UK, September 1995.
6. Ichisugi, Y. and Yonezawa, A.: “Distributed Garbage Collection using group reference counting,” Tech. Report 90-014, Dept. Information Science, Univ. of Tokyo, 1990.
7. Lang, B., Queinnec, C. and Piquer, J.: “Garbage Collecting the World,” 19th ACM Conference on Principles of Programming Languages 1992, Albuquerque, New Mexico, January 1992, pp. 39–50.
8. Lermen, C. W. and Maurer, D.: “A Protocol for Distributed Reference Counting,” Proc. 1986 ACM Conference on Lisp and Functional Programming, Cambridge, Massachusetts, August 1986.
9. Maheshwari, U. and Liskov, B.: “Collecting Distributed Garbage Cycles by Back Tracing,” Proc. 1997 ACM Symp. on Principles of Distributed Computing (PODC'97), Santa Barbara, California, August 1997, pp. 239–248.
10. Piquer, J.: “Indirect Reference Counting: A Distributed GC,” LNCS 505, PARLE '91 Proceedings Vol. I, pp. 150–165, Springer Verlag, Eindhoven, The Netherlands, June 1991.
11. Piquer, J.: “A Re-Implementation of TransPive: Lessons from the Experience,” Proc. Parallel Symbolic Languages and Systems (PSLS'95), LNCS 1068, Vol. I, pp. 310–329, Springer Verlag, Beaune, France, October 1995.
12. Piquer, J.: “Indirect Distributed Garbage Collection: Handling Object Migration,” ACM Trans. on Programming Languages and Systems, V. 18, N. 5, September 1996, pp. 615–647.
13. Plainfossé, D. and Shapiro, M.: “Experience with a Fault-Tolerant Garbage Collector in a Distributed Lisp System,” LNCS 637, Proc. 1992 International Workshop on Memory Management (IWMM'92), Springer Verlag, pp. 116–133, St-Malo, France, September 1992.
14. Plainfossé, D.: “Distributed Garbage Collection and Referencing Management in the Soul Object Support System,” PhD Thesis, U. de Paris VI, France, June 1994.
15. Rodrigues, H. and Jones, R.: “A Cyclic Distributed Garbage Collector for Network Objects,” LNCS 1151, Proc. 10th Workshop on Distributed Algorithms (WDAG'96), Springer-Verlag, pp. 123–140, Bologna, Italy, October 1996.
16. Rudalics, M.: “Implementation of Distributed Reference Counts,” Tech. Report, RISC, J. Kepler Univ., Linz, Austria, 1990.
17. Shapiro, M., Dickman, P. and Plainfossé, D.: “Robust, Distributed References and Acyclic Garbage Collection,” ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, August 1992.
18. Shapiro, M., Dickman, P. and Plainfossé, D.: “SSP Chains: Robust, Distributed References Supporting Acyclic Garbage Collection,” INRIA Res. Report 1799, November 1992.
19. Sun Microsystems Computer Corporation: “Java Remote Method Invocation Specification,” http://sunsite.unc.edu/java, November 1996.
20. Tel, G. and Mattern, F.: “The Derivation of Distributed Termination Detection Algorithms from Garbage Collection Schemes,” ACM Trans. on Programming Languages and Systems, Vol. 15, N. 1, January 1993, pp. 1–35.
21. Watson, P. and Watson, I.: “An Efficient Garbage Collection Scheme for Parallel Computer Architectures,” LNCS 259, PARLE Proceedings Vol. II, Eindhoven, Springer Verlag, June 1987.
Active Ports: A Performance-Oriented Operating System Support to Fast LAN Communications

G. Chiola and G. Ciaccio

DISI, Università di Genova, via Dodecaneso 35, 16146 Genova, Italy
{chiola,ciaccio}@disi.unige.it
http://www.disi.unige.it/project/gamma/
Abstract. The Genoa Active Message MAchine (GAMMA) is an efficient communication layer for 100base-T clusters of Personal Computers running Linux. It is based on Active Ports, a communication mechanism derived from Active Messages. GAMMA Active Ports deliver excellent communication performance at user level (latency 12.7 µs, maximum throughput 12.2 MByte/s, half-power point reached with 192 byte long messages), thus enabling cost-effective cluster computing on 100base-T. Despite being implemented at kernel level in the Linux OS, GAMMA Active Ports perform much better than many other LAN-oriented communication layers, including so-called “user-level” ones (e.g. U-Net).
1 Introduction
Networks of workstations (NOWs) or, even better, clusters of Personal Computers (PCs) networked by inexpensive commodity interconnects potentially offer cost-effective support for parallel processing. The only obstacle to overcome is the inefficiency caused by the OS layers that have accumulated in the communication architecture during past eras of computer and communication technology. Thus the issue of enhancing or by-passing OS kernels with performance-oriented support for inter-process communication has become crucial. So far this issue has been addressed in two ways, which we call the “efficient OS support” approach and the “user-level” approach. In the first approach the messaging system is supported by the OS kernel with a small set of flexible and efficient low-level communication mechanisms and by simplified communication protocols carefully designed according to a performance-oriented approach. Issues like choosing the right communication abstraction and implementing such an abstraction efficiently are of primary concern here. The “efficient OS support” architecture fits nicely into the structure of modern OSs providing protected access to the communication devices. This way multi-user, multi-tasking features need not be limited or modified. The “user-level” approach aims at improving performance by minimizing the OS involvement in the communication path in order to obtain a closer integration between the application and the communication device: the use of
system calls is minimized, allowing direct access to the network interface whenever possible. Here the challenge is to provide direct access to the communication devices without violating the OS protection model. In order for such an approach not to compromise protection in the access to the communication devices, either a single-user environment or some form of gang scheduling is usually required. These two alternatives have their own obvious drawbacks. A third solution is to leverage programmable NICs which can run the necessary support for device multiplexing in place of the OS kernel. Currently this implies much higher hardware costs. The best known representative of the “user-level” approach is U-Net. With U-Net, user processes are given direct protected access to the network device with no virtualization. Any communication layer as well as any standard interface must move to user-level programming libraries. U-Net runs on an ATM network of SPARCstations [7], where a communication co-processor located on the NIC executes proper firmware in order to multiplex a pre-defined number of virtual “endpoints” over the NIC itself, with no OS involvement. In the Fast Ethernet emulation of U-Net [8], the low-cost NIC does not provide a programmable coprocessor. The host CPU itself multiplexes the endpoints over the NIC by means of proper OS support, thus actually following an “efficient OS support” rather than a “user-level” architecture. The “user-level” architecture is believed to deliver better communication performance than the “efficient OS support” architecture by avoiding the OS involvement. Such a belief is now contradicted by our experience, at least in the case of commodity off-the-shelf devices.
2 Active Ports in GAMMA
The Genoa Active Message MAchine (GAMMA) [3] is a fast messaging system running on a 100base-T cluster of Pentium PCs running Linux and equipped with either 3COM 3C595-TX Fast Etherlink or 3COM 3C905-TX Fast Etherlink XL PCI 100base-T adapters. Porting GAMMA to DEC “tulip”-based and Intel EtherExpressPro NICs is in progress. Following the “efficient OS support” principle, with GAMMA the Linux kernel has been enhanced with a communication layer implemented as a small set of additional light-weight system calls (that is, system calls with no intervention of the scheduler upon return) and a custom NIC driver. Most of the communication layer is embedded in the Linux kernel at the NIC driver level, the remaining part being placed in a user-level programming library. All the multiuser, multi-tasking functionalities of the Linux kernel have been preserved. All the communication software of GAMMA has been carefully developed according to a performance-oriented approach, extensively described in [3]. The adoption of a communication mechanism that we call Active Ports allowed a “zero copy” protocol, with no intermediate copies of messages along the whole user-to-user communication path.
With GAMMA, each process owns a number of (currently 255) active ports. GAMMA active ports are inspired by the Thinking Machines’ CMAML Active Message library for the CM-5. Each active port can be used to exchange messages with another process in the same as well as a different process group. GAMMA directly supports SPMD parallel programming with automatic replication of the same process on different computation nodes. However plain MIMD programming is allowed as well since sender and receiver need not share code address space. Indeed the message handlers as used in active ports are designated by the receiver, not by the sender as in Generic Active Messages [6].

Prior to using an active port for outbound communications, the owner process must bind it to a destination, in the form of a triple (process group, process instance, destination active port). To send a message through an active port, the process has to invoke a “send” light-weight system call which traps to the GAMMA device driver. Each message is then transmitted as a sequence of one or more Ethernet frames. Following a “zero copy” policy, frames are arranged directly into the NIC’s transmit FIFO by DMA of each frame body directly from the sender process’ data structure, with no intermediate buffering in kernel memory space. Frame headers are pre-computed at port binding time. As soon as the last frame of a message has been written to the NIC’s transmit FIFO, the “send” system call returns and the sender process resumes user-level activity. Active ports can be used to broadcast messages as well. The Ethernet broadcast service is exploited directly, so that one single physical transmission occurs. Efficient barrier synchronization is also implemented using this facility [2].

For inbound communications, prior to using a given active port L the owner process must bind L to an application-defined function called the receiver handler, as well as to an application-defined final destination data structure (spanning a contiguous virtual memory region) where each message incoming through port L is to be stored. The final destination must be pinned down in physical RAM. At this point the local instance of the GAMMA driver knows in advance the final destination in user space for messages incoming through port L. Therefore, when the GAMMA device driver detects frame arrivals for port L, the frame payloads can be copied directly into their final destination, with no temporary storage in kernel memory space and regardless of the scheduling state of the receiver process at the arrival time. Moreover, as soon as the last piece of a message has been delivered to port L, the GAMMA driver can immediately run the receiver handler bound to L so as to allow real-time management of the message itself. The receiver handler is run on the interrupt stack. If the receiver process is not currently running, a temporary context switch is performed by the GAMMA driver so as to ensure that the handler is run in the receiver context. The possibility of running the receiver handler in real time is crucial in order to prevent subsequent messages incoming to port L from overwriting the current one, when such overlap is to be avoided. Indeed a typical receiver handler: 1) integrates the current message into the main thread of the destination process; 2) notifies the message reception to the main thread itself; 3) re-binds port L to a fresh final destination in user space for the next incoming message.
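To make the binding and handler mechanism concrete, the following C fragment sketches the usage pattern just described. The function names and signatures (ap_bind_out, ap_bind_in, ap_send) are invented for illustration and are not the actual GAMMA system call interface; the port numbers and buffer sizes are arbitrary.

    /* Hypothetical prototypes standing in for the GAMMA light-weight system calls. */
    #include <stddef.h>

    extern void ap_bind_out(unsigned port, int group, int instance, unsigned dest_port);
    extern void ap_bind_in(unsigned port, void *final_dest, size_t len,
                           void (*handler)(unsigned port));
    extern void ap_send(unsigned port, const void *buf, size_t len);

    #define MSG_WORDS 1024

    static double inbuf[2][MSG_WORDS];     /* final destinations, pinned in RAM    */
    static volatile int msgs_arrived = 0;
    static int current = 0;

    /* Receiver handler: run on the interrupt stack as soon as the last frame of
     * a message has been stored into the bound final destination.                */
    static void on_receive(unsigned port)
    {
        msgs_arrived++;                    /* 2) notify the main thread            */
        current ^= 1;                      /* 1) hand the filled buffer over       */
        ap_bind_in(port, inbuf[current], sizeof inbuf[current],
                   on_receive);            /* 3) re-bind to a fresh destination    */
    }

    int exchange(int peer_group, int peer_instance)
    {
        unsigned out = 7, in = 8;          /* arbitrary port numbers (max 255)     */

        /* Outbound: bind port 'out' to the destination triple; frame headers
         * are pre-computed by the driver at binding time.                        */
        ap_bind_out(out, peer_group, peer_instance, 8);

        /* Inbound: designate handler and final destination for port 'in'.        */
        ap_bind_in(in, inbuf[0], sizeof inbuf[0], on_receive);

        double payload[MSG_WORDS] = { 0 };
        ap_send(out, payload, sizeof payload);  /* light-weight system call: DMA
                                                   from user memory into the NIC
                                                   transmit FIFO, no extra copy   */
        while (msgs_arrived == 0)
            ;                                   /* wait for the handler's notice  */
        return 0;
    }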
The GAMMA implementation of Active Ports relies on a very simplified communication protocol with neither error recovery nor flow control. However, the only source of frame corruption in modern LANs is frame collisions in shared networks. This may be avoided either by using switched networks or by explicitly scheduling communications at the application level. Flow control is unnecessary as long as the network is the throughput bottleneck. The flexibility of GAMMA allows either to use its simplified and efficient protocol “as is”, or to build more complex and reliable higher-level protocols using the GAMMA error detection features [3].

2.1 Error Handlers for User-Defined Error Recovery
As a by-product of the simplification of the communication protocol and of the elimination of intermediate copies of messages, GAMMA provides neither message acknowledgement nor explicit flow control. However:
– In a LAN environment with good quality wiring, the only source of frame corruption is collision with other frames in the case of shared Ethernet. Such a possibility could be eliminated by using a switch instead of a repeater hub. In any case frames corrupted by collision have a bad CRC and are discarded by the receiver NIC driver.
– If the receiver handler on the receiver side of a communication runs quickly enough (as is usually the case), both the RAM-to-NIC transfer rate on the sender side and the NIC-to-RAM transfer rate on the receiver side are much greater than the Fast Ethernet transfer rate. Hence, flow is implicitly controlled by simply testing whether there is enough room in the transmit FIFO of the sender’s NIC before writing a frame.
The choice of GAMMA is to implement some error detection mechanisms and to allow applications to have their own error management policies if required. This is supported by allowing applications to bind each active port with an error handler. An error handler is similar to a receiver handler but runs in case of communication errors rather than on message arrivals. Error handlers can implement higher-level reliable communication protocols, if really needed for a given application.
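A minimal sketch of such a binding is given below, again with invented function names (ap_bind_err is not the actual GAMMA interface). The recovery policy simply asks the main thread to re-send from its own copy of the data, which is still available thanks to the zero-copy design.

    /* Hypothetical error-handler registration, following the same convention
     * as the receiver handler above.                                           */
    extern void ap_bind_err(unsigned port,
                            void (*handler)(unsigned port, int errcode));

    static volatile int need_resend = 0;

    /* Runs on communication errors instead of on message arrivals.             */
    static void on_error(unsigned port, int errcode)
    {
        (void)port; (void)errcode;
        need_resend = 1;   /* zero copy: the main thread still owns the data,
                              so it can simply issue the send call again        */
    }

    void install_recovery(unsigned port)
    {
        ap_bind_err(port, on_error);
    }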
3 Communication Performance and Conclusions
According to our average estimates based on a “ping-pong” microbenchmark, the GAMMA messaging system exhibits an unprecedented one-way user-to-user latency (delay for a zero-byte message) as low as 12.7 µs, with a maximum throughput of 12.2 MByte/s (98% of the Fast Ethernet raw asymptotic bandwidth) with long (500 kByte) messages. The half-power point is achieved with messages as short as 192 bytes. Indeed such numbers enable cost-effective parallel computing on 100base-T clusters of PCs [1,4,5].
The improvement achieved by GAMMA with respect to Linux TCP/IP sockets on similar hardware (latency 150 µs, maximum throughput 7 MByte/s) is more than one order of magnitude for latency and 71% for bandwidth. A latency improvement of more than 2 times has been obtained with respect to the U-Net “user-level” approach on similar or superior hardware (U-Net latency is 30 µs on 100base-T and 35.5 µs on 140 Mbps ATM). Therefore GAMMA delivers better communication performance than U-Net while providing a higher-level and more flexible communication abstraction with the same quality of service. As a conclusion, we contradict the common belief that only a “user-level” communication architecture may yield a satisfactory answer to the demand for efficient inter-process communication in a low-cost LAN environment: the improved application-to-device integration yielded by the “user-level” approach, and the corresponding performance gain, is negligible given that communication hardware cannot be closely integrated with the host CPU in commodity-based NOWs. Much better results can be obtained by optimizing the communication protocol and choosing a suitable communication abstraction. Indeed the GAMMA Active Ports are a flexible communication abstraction which preserves the Linux multi-user multi-tasking environment while still allowing the best communication performance, based on traditional kernel-level protection mechanisms and low-cost commodity interconnects.
References 1. G. Chiola and G. Ciaccio. Architectural Issues and Preliminary Benchmarking of a Low-cost Network of Workstations based on Active Messages. In Proc. 14th conf. on Architecture of Computer Systems (ARCS’97), Rostock, Germany, Sept. 1997. 623 2. G. Chiola and G. Ciaccio. Fast Barrier Synchronization on Shared Fast Ethernet. 2nd Int. Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC’98), LNCS 1362, pages 132–143, Feb. 1998. 622 3. G. Ciaccio. Optimal Communication Performance on Fast Ethernet with GAMMA. In Proc. Workshop PC-NOW, IPPS/SPDP’98, pages 534–548, Orlando, Fl., April 1998. LNCS 1388. 621, 623 4. G. Ciaccio and V. Di Martino. Porting a Molecular Dynamics Application on a Low-cost Cluster of Personal Computers running GAMMA. In Proc. Workshop PC-NOW, IPPS/SPDP’98, pages 524–533, Orlando, Fl., April 1998. LNCS 1388. 623 5. G. Ciaccio, V. Di Martino, and P. Lanucara. Porting the Flame Front Propagation Problem on GAMMA. In Proc. HPCN Europe ’98, Amsterdam, The Netherlands, April 1998. 623 6. D. Culler, K. Keeton, L.T. Liu, A. Mainwaring, R. Martin, S. Rodriguez, K. Wright, and C. Yoshikawa. Generic Active Message Interface Specification. Tech. Report White Paper of the NOW Team, CS Dept., U. California at Berkeley, 1994. 622 7. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. 15th ACM Symp. on Operating Systems Principles (SOSP’95), Copper Mountain, Co., Dec. 1995. ACM Press. 621 8. M. Welsh, A. Basu, and T. von Eicken. Low-latency Communication over Fast Ethernet. In Proc. Euro-Par’96, Lyon, France, August 1996. 621
Workshop 6+16+18 Languages Henk Sips, Antonio Corradi, and Murray Cole Co-chairmen
Three Workshops (6, 16, and 18) have programming languages and models as their central theme. Workshop 6 focuses on the use of object-oriented paradigms in parallel programming; Workshop 16 has the design of parallel languages as its primary focus, and Workshop 18 deals with programming models and methods. Together they present a nice overview of current research in these areas.
Object Oriented Programming. The Object-Oriented (OO) technology has received renewed stimulus from the ever-increasing usage of the Web and Internet technology. Globalisation has enlarged the number of potentially involved users to such an extent as to suggest a reconsideration of the available environments and tools. This has also motivated the attempt to clarify all debatable and ambiguous points, not all of which are of practical and immediate application. On the one hand, several issues connected to program correctness and semantics are still unclear. In particular, the introduction of concurrency and parallelism within an object framework is still subject to discussion, in its verification and modelling. The same holds for the aggregation of different objects and behaviours into predetermined patterns. How to accommodate several execution capacities and resources within the same object or pattern still requires work and new proposals. In any case, there is much to be done, both in the abstract area and in the applied field. On the other hand, the distributed framework has not only introduced examples in need of practical solutions and environments, but has also forced a reflection on all applicable models, starting from traditional ones, such as the client-server model and RPCs, to less traditional ones, such as the agent models. The growth of Java and CORBA, and their capacity to attract implementors and resources, produces unifying perspectives, with the possibility of offering a new integrated framework in which to solve most common problems. This starts to make it possible to create generally available components that can really be employed, reducing the appeal of redesigning from scratch. This conference session is an occasion to expose to an enlarged audience working in parallelism some of the hot topics and research going on in the OO area. And, even if the OO community has many occasions to meet and many forums in which to exchange opinions, Euro-Par seems a particular opportunity both to present experiences and to receive contributions, with possibilities of cross-fertilisation. The five papers presented in the conference explore several of the most strategic directions of evolution of the OO area.
The first paper, “Dynamic Type Information in Process Types” by Puntigam, uses the process as an example of objects with dynamic type. The goal is to make possible all the checks typical of static types even in the dynamic case: the process is modelled as an active object that, depending on its state, is capable of accepting different messages from different clients. The presented model is a refinement of previous work by the same author and is based on a calculus of objects that communicate with asynchronous message passing. The third paper of the session, by Gehrke, “An Algebraic Semantics for an Abstract Language with Intra-Object-Concurrency,” addresses the problems of intra-object concurrency, working with a process algebra method. The goal is to introduce a formal semantics for intra-object concurrency in OO frameworks where active processes can be distinguished from passive objects. Let us recall that this is the Java assumption. The topics of the other papers are all connected to Java. The paper by Launay and Pazat, “Generation of distributed parallel Java programs”, and the fifth paper, by Giavitto, De Vito and Sansonnett, “A Data Parallel Java Client-Server Architecture for Data Field Computations”, address the point of enlarging the usage of the Java framework. Launay and Pazat propose a framework capable of transparently distributing Java components of an application onto the available target architecture. Giavitto, De Vito and Sansonnett apply their effort to make Java usable in the data-parallel paradigm, for client-server applications. The fourth paper, by Lorcy and Plouzeau, “An object-oriented framework for managing the quality of service of distributed applications”, addresses the quality of service problem for interactive applications. The authors elaborate on the known concept of ‘contract’, with new considerations and insight.
Programming Languages. Up to now, data parallel languages have been the most successful attempt to bring parallel programming closer to the application programmer. The most serious attempt has been the definition of High Performance Fortran (HPF) as an extension of Fortran 90. Several commercial compilers are available today. However, user experience indicates that for irregularly structured problems, the current definition is often inadequate. The first two papers in the HPF session deal with this problem. The paper by Brandes and Germain, called “A tracing protocol for optimizing data parallel irregular computations”, describes a dynamic approach by allowing the user to specify which data has to be traced for modifications. The second paper, by Brandes, Bregier, Counilh, and Roman, proposes a programming style for irregular problems close to that for regular problems. In this way compile-time and run-time techniques can be readily combined. A recent language extension for shared-memory programming that has caught a lot of attention is OpenMP. OpenMP is a kind of reincarnation of the old PCF programming model. The paper by Chapman and Mehrotra describes various ways in which HPF and OpenMP can be combined to form a powerful programming system. Also the Java language can be fruitfully used as a basis
for parallel programming. The paper by Carpenter, Zhang, Fox, Li, Li, and Wen outlines a conservative set of language extensions to Java to support an SPMD style of programming. General parallel programming languages have a hard time obtaining optimal performance for specific cases. If the application domain is restricted, better performance can be obtained by using a domain-specific parallel programming language. The paper by Spezzano and Talia describes the language CARPET, intended for programming cellular automata systems. Finally, the paper by Hofstadt presents the integration of task-parallel extensions into a functional programming language. This approach is illustrated by a branch-and-bound problem example.
Programming Models and Languages. Producing correct software is already a difficult task in the sequential context. The challenge is compounded by the conceptual complexity of parallelism and the requirement for high performance. Building on the foundations laid in Workshop 7 in the preceding instantiation of Euro-Par, Workshop 18 focuses on programming and design models that abstract from low-level programming techniques, present software developers with interfaces that reduce the complexity of the parallel software construction task, and support correctness issues. It is also concerned with methodological aspects of developing parallel programs, particularly transformational and calculational approaches, and associated ways of integrating cost information into them. The majority of papers this year work from a “skeletal” programming perspective, in which syntactic restrictions are used both to raise the conceptual level at which parallelism is invoked and to constrain the resulting implementation challenge. Mallet’s work links the themes of program transformation (here viewed as a compilation strategy) and cost analysis, using symbolic methods to choose between distribution strategies. His source language is the by now conventional brew of nested vectors and the map, fold, scan skeleton family, while the cost analysis borrows from the polytope volume techniques of the Fortran parallelization world, an interesting and encouraging hybrid. Skillicorn and colleagues work with the P3L language as source and demonstrate that the use of BSP as an implementation mechanism enables a significant simplification of the underlying optimisation problem. The link is continued in the work of Osoba and Rabhi, in which a skeleton abstracting the essence of the multigrid approach benefits from the portability and costability of BSP. In contrast, Keller and Chakravarty work with the well know data-parallel language NESL, introducing techniques which allow the existing concept of “flattening transformations” (which allow efficient implementation of the nested parallel structures expressible in the source language) to be extended to handle user-defined recursive types, and in particular parallel tree structures. Finally, Vlassov and Thorelli apply the ubiquitous principle of simplification through abstraction to the design of a shared memory programming model.
In summary, we expect from these sessions a fruitful discussion of the hot topics in concurrency and parallelism in several areas. This enlarged exchange of ideas can have an impact on advances of the discipline in the whole distributed and concurrency field.
A Tracing Protocol for Optimizing Data Parallel Irregular Computations

Thomas Brandes (1) and Cécile Germain (2)

(1) Institute for Algorithms and Scientific Computing (SCAI), GMD, Schloß Birlinghoven, D-53754 St. Augustin, Germany, [email protected]
(2) Laboratoire de Recherche en Informatique (LRI-CNRS), Université Paris-Sud, F-91405 Orsay Cedex, France, [email protected]
Abstract. High Performance Fortran (HPF) is the de facto standard language for writing data parallel programs. In case of applications that use indirect addressing on distributed arrays, HPF compilers have limited capabilities for optimizing such codes on distributed memory architectures, especially for optimizing communication and reusing communication schedules between subroutine boundaries. This paper describes a dynamic approach for optimizing unstructured communication in codes with indirect addressing. The basic idea is that runtime data reflecting the communication patterns will be reused if possible. The user has only to specify which data in the program has to be traced for modifications. The experiments and results show the effectiveness of the chosen approach.
1 Introduction
Data parallel languages such as High Performance Fortran [9,10] have been designed to make possible efficient and high-level parallel programming for distributed memory machines. As calculations in computationally intensive codes mostly occur in loops or intrinsic calls involving loops, a challenging problem is to handle efficiently loops with indirectly referenced distributed arrays. For irregular parallel loops, complex data structures, called schedules, are required for accessing remote items of distributed arrays and for communication optimization. The schedules cannot be worked out at compile time, but only at run-time, when the values of the indirection arrays are known. The code design which first builds the schedule, then uses it to carry out the actual communication and computation, has been coined the inspector/executor scheme; it was pioneered very early [16,12], and developed in the PARTI [6,2] and CHAOS [17] libraries. It is the state-of-the-art design for academic compilers [13,11,18,4] as well as commercial ones [15]. The preprocessing associated with the inspector creates a large run-time overhead, because it involves a large amount of computation and all-to-all communications. Hence the inspector/executor scheme targets irregular loops which
exhibit temporal locality in particular, where the schedule can be reused. The schedule reusing problem can be addressed at compile-time, through data-flow analysis [6,8]. The main drawbacks of this approach are the limits of inter-procedural analysis and the wall of separate compilation. More precisely, real codes with indirect addressing often require inter-procedural analysis [1,7,5]. At the other extreme, language support makes the concept of schedule visible to the end-user. The HPF+ project [3] proposes gather and scatter clauses referring to explicit schedule variables, or a reuse clause for independent loops. The user is responsible for tracking the modifications, but through high-level conditional constructs. Reusing schedules across procedure calls relies on the Fortran 90 save attribute for the schedule variable. The seminal paper [14] proposes a schedule reusing method based on time-stamps. Each modification of an array is mirrored by an update of the time-stamp, and the time-stamps are recorded at inspector invocation. A schedule may be reused if all relevant arrays have the same time-stamps at the point of reuse as the recorded ones. As no directive is provided, all array modifications have to be tracked. Moreover, array descriptors are shared among all arrays having the same structure parameters, so that an indirection array may be considered modified unnecessarily. In this paper, we propose a new protocol to handle temporal locality. It has been implemented and evaluated in the framework of the ADAPTOR HPF compilation system [4], which already supported the inspector/executor scheme. The protocol is called URB, because the run-time system can either Use, Refresh or Build a schedule for each parallel loop. With some help from the user, URB is able to track changes in the irregular loops exactly: a new schedule is built only when either an irregular parallel loop is structurally different from the previously executed ones, or when the indirection arrays have been modified. The structural information includes the distribution, shape and size of the arrays, and is completely handled by the compiler and the run-time system. The intervention from the user is limited to providing an attribute for each indirection array, through a directive. This attribute indicates that all modifications of this array have to be recorded at run-time, and allows procedure calls to be handled without interprocedural analysis. Finally, the tracking scheme is conservative in the sense that, when tracing directives are missing, the generated code simply rebuilds a new schedule for each execution of an irregular loop. Hence, codes using the trace directive and other codes (e.g. library routines) can securely be mixed. The rest of the paper is organized as follows. Section 2 describes the realization of indirect addressing of distributed arrays via the inspector/executor scheme. Section 3 introduces the protocol that is needed for tracking indirection arrays and how it can be realized. Section 4 presents performance results for kernels and a real application, and we conclude in Section 5.
2 Unstructured Communication
In the following, we introduce a formal definition of the communication schedule that is built up during the inspector phase. We discuss it for a typical gather operation. The array T specifies the home of the iteration where the data is needed:

    forall (I ∈ IT, M(I))  T(I) = A(L(I))

Such an indirect addressing will be noted as IT,A,L,M = {IT, dT, IA, dA, L, M}, the inspector info, where IT and IA specify the index spaces of the arrays T and A, and dT and dA their distributions on the available processors P. M is a predicate defining the subset ST that is the iteration space of the parallel loop. L stands for the indirect addressing of the array A. All indices and arrays can be multi-dimensional (I = I1, . . . , In), and L stands for a set of integer arrays L1, . . . , Lm.

    dT : IT → P                     dA : IA → P
    M : IT → {true, false}          ST := {I | M(I) = true} ⊆ IT
    L : ST (⊆ IT) → IA              I → L(I)
We assume that the function L is specified by an integer array or by a set of integer arrays Lj and that they are aligned with the array T. In other words, L(I) is available on the processor that owns T(I). Also the predicate M, if available as a logical array, will not involve any communication. For every processor pair (p, q), p ≠ q, the set Cp←q specifies which elements of IA processor p needs from processor q:

    Cp←q := {(I, L(I)) | I ∈ ST, dT(I) = p, dA(L(I)) = q}

Cp←q ⊆ dT^-1(p) × dA^-1(q), so for every (I, K) ∈ Cp←q, I is a local index of T on processor p and K is a local index of A on processor q. Data movement from processor q to processor p will be necessary if Cp←q is not empty. Processor q has to send the corresponding data and processor p must receive it. In the following, the collection of sets Cp←q, ST,A,L,M = {Cp←q}, is called the communication schedule. Due to the fact that L is not given in closed form, the computation cannot be done by closed formulas as it might be possible in the case of structured communication.
Schedule computation (Fig. 1, left):
    for q ∈ P : Cp←q := {} end for
    for I in dT^-1(p) ∩ ST
        q := dA(L(I))
        Cp←q := Cp←q + {(I, L(I))}
    end for

Index exchange (Fig. 1, right):
    for q ∈ P, p ≠ q
        send Cp←q to processor q
        recv Cq←p from processor q
    end for

Fig. 1. Schedule building algorithms
Gather (Fig. 2, left):
    for q ∈ P, p ≠ q, Cq←p ≠ {}
        compute Vq←p (local on p)
        send Vq←p to processor q
    end for
    for q ∈ P, p ≠ q, Cp←q ≠ {}
        recv Vp←q from processor q
        for (I, v) ∈ Vp←q set T(I) = v
    end for

Scatter (Fig. 2, right):
    for q ∈ P, p ≠ q, Cp←q ≠ {}
        compute Wp←q (local on p)
        send Wp←q to processor q
    end for
    for q ∈ P, p ≠ q, Cq←p ≠ {}
        recv Wq←p from processor q
        for (v, K) ∈ Wq←p set A(K) = v
    end for
Fig. 2. Executor algorithms

Each processor p has to compute the sets Cp←q for all processors q, as shown in Fig. 1, left part. By an all-to-all communication, the processors exchange the information about which data they need from each other (index exchange, Fig. 1, right part). This is necessary in order to make sure that the other processors know which values they have to send. The inspector info together with its schedule builds the inspector data. In the executor phase, the processors use the communication schedule of the inspector data to carry out the communication for gathering or scattering data. Therefore the values of T and A, belonging to a domain X, are transferred:

    valA : IA → X,    Vp←q := {(I, valA(K)) | (I, K) ∈ Cp←q}
Figure 2 (left part) shows the algorithm which is executed on each processor p to gather the values of the distributed array A. For a scatter operation, forall (I ∈ IT, M(I)) A(L(I)) = T(I), the communication schedule can be used in the same way to realize the necessary communication (Fig. 2, right part):

    valT : IT → X,    Wp←q := {(valT(I), K) | (I, K) ∈ Cp←q}
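As an illustration of the inspector phase, the following self-contained C sketch builds the sets Cp←q for a small example. The block distributions, array sizes, the indirection pattern and the printf-based stand-in for the index exchange are simplifying assumptions made for this example only; they do not reflect the ADAPTOR run-time interface.

    #include <stdio.h>
    #include <stdlib.h>

    #define P  4          /* number of processors                  */
    #define NT 16         /* size of index space I_T               */
    #define NA 16         /* size of index space I_A               */

    static int dT(int i) { return i / (NT / P); }   /* owner of T(i) */
    static int dA(int k) { return k / (NA / P); }   /* owner of A(k) */

    typedef struct { int n; int idxT[NT]; int idxA[NT]; } Sched;   /* one C_{p<-q} */

    int main(void)
    {
        int L[NT], M[NT];
        for (int i = 0; i < NT; i++) { L[i] = (3 * i) % NA; M[i] = 1; }

        static Sched C[P][P];          /* C[p][q] corresponds to C_{p<-q}          */

        /* Inspector: each processor p scans the iterations it owns (dT(I) = p).   */
        for (int p = 0; p < P; p++)
            for (int i = 0; i < NT; i++)
                if (dT(i) == p && M[i]) {
                    int q = dA(L[i]);
                    if (q != p) {      /* only remote accesses need a schedule     */
                        Sched *s = &C[p][q];
                        s->idxT[s->n] = i;
                        s->idxA[s->n] = L[i];
                        s->n++;
                    }
                }

        /* Index exchange: in the real runtime each p sends C[p][q] to q
         * (all-to-all), so that q knows which elements of A it must ship back.    */
        for (int p = 0; p < P; p++)
            for (int q = 0; q < P; q++)
                if (p != q && C[p][q].n)
                    printf("p=%d needs %d elements of A from q=%d\n",
                           p, C[p][q].n, q);

        /* Executor (gather): q sends the values A(K) for (I,K) in C_{p<-q};
         * p stores each received value v into T(I).  Reusing C amortizes the
         * inspector cost over many executions of the same loop.                   */
        return 0;
    }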
3 The URB Protocol

3.1 Dynamic Reuse of Communication Schedules
Let IT,A,L,M = {IT, dT, IA, dA, L, M} be an indirect addressing for which a communication schedule ST,A,L,M has been computed. This communication schedule can be reused if the sets Cp←q have not changed. This is the case if IT, IA, dT, dA, L and M have not changed. This model points out that the inspector does not depend on the actual values of the arrays T and A, but only on their shape and distribution. Thus, an inspector has the potential to be reused, for instance, for different sections of the same array, or for aligned arrays. Our URB protocol is based on run-time analysis. For this dynamic approach, it will be decided at runtime whether a communication schedule can be reused or not. Therefore, inspectors are kept in a database. Two problems must be addressed: first, tracing the modifications of the indirection and mask arrays, and second, avoiding the explosion of the database while keeping useful information.
The penalty of this approach is that modifications of the indirection arrays must be traced. In the presence of a mask, this is also necessary for the corresponding arrays within the mask.

3.2 Principle of Dynamic Tracing
The previous formulation shows the inspector database management as an instance of a cache management problem. The URB protocol is based on this analogy. To allow tracing, the descriptor of an indirection array L includes a dirty flag. Each write to L sets its dirty flag. The schedule can thus be reused if and only if the dirty flag of L is clear. The approach has to guarantee that every update or every possible update of L will set its dirty flag. Furthermore, if the dirty flag of an array L is set, all inspectors in which this array is involved become invalid. The entries for these inspectors can be freed, avoiding the explosion of the database. In fact, the previous scheme fails if an indirection array Lj is involved in more than one indirect assignment. The reason is simply that the flag of Lj may be shared by many inspectors, but that its status with respect to these inspectors should not be shared. Hence, the descriptor of a traced array actually includes not only one dirty flag, but a dirty mask, with a flag for each possible inspector. Each write to the array sets all flags of the dirty mask, while each inspector update clears only the corresponding flag in the dirty mask. The overall run-time algorithm for building or retrieving schedules is sketched in Fig. 3. Even when the schedule is reusable, the actual schedule can need refreshing. This means that all the hard work of computing ownership and local addresses is done, but that a translation term from the base address has to be taken into account. The inspector data is then updated with the refreshed schedule.
    Schedule GetSchedule (InspectorInfo I)
    begin
        var InspectorData : P        /* InspectorData = InspectorInfo + Schedule */
        P := DBSearch (I)
        L := IndirectionArrayDescriptor(I)
        if (P == NULL) or DirtyMask(L, P) then
            DBFree (P)
            P := InspectorBuild (I)
            DirtyMaskClear(L, P)
        elseif not EqualBaseAddress (P, I) then
            ScheduleRefresh (P, I)
        endif
        return (Schedule(P))
    end
Fig. 3. The URB protocol
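The following C sketch illustrates the dirty-mask bookkeeping behind this protocol. The structure layout, the MAX_INSPECTORS bound and the function names are illustrative assumptions, not the actual ADAPTOR run-time data structures.

    #include <string.h>

    #define MAX_INSPECTORS 64            /* one dirty bit per cached inspector */

    typedef struct {
        void          *base;             /* base address of the array data     */
        unsigned char  dirty[MAX_INSPECTORS];
    } TracedDesc;

    /* Inserted (conceptually) after every statement that may write the traced
     * array: the array becomes invalid for all cached inspectors.              */
    static void mark_dirty(TracedDesc *d)
    {
        memset(d->dirty, 1, sizeof d->dirty);
    }

    /* Called when inspector 'id' has (re)built its schedule from the current
     * contents of the array: only this inspector's flag is cleared; the other
     * inspectors keep their own view of the array's state.                     */
    static void clear_dirty(TracedDesc *d, int id)
    {
        d->dirty[id] = 0;
    }

    /* The Use/Refresh/Build decision of Fig. 3, reduced to its reuse test.     */
    static int schedule_reusable(const TracedDesc *d, int id, int found_in_db)
    {
        return found_in_db && !d->dirty[id];
    }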
3.3 Tracing of Arrays
Tracing clearly belongs to the compiler, which has to insert the code setting the dirty flags as a side-effect of a write to some arrays. The requirement for tracing is described by the user directive !ADP$ TRACE. This directive has Fortran attribute semantics, and follows Fortran 90 rules of scoping. By default, arrays are not traced. To ensure correctness, this case must lead to the conservative scheme where the schedule is computed at each access. If at least one indirection array is not traced, no entry in the database is created. The schedule is built, used and destroyed immediately after its use. An array must be flagged dirty when it is changed. The syntactic constructs that can modify an array are limited, and can be analyzed at compile-time: assignments with this array as a left-hand side, redistributions, allocations and deallocations. Thus, the trace attribute is compatible with the DYNAMIC and ALLOCATABLE attributes, but not with the POINTER or TARGET attribute, because the compiler would not be able to record modifications of such arrays.

3.4 Procedure Handling
We want the URB protocol to be compatible with the general HPF framework that does not include the tracing directive at the present time. Thus, all combination of traced and not traced for actual and dummy arguments must be considered. The run-time system copies information from the descriptor of the actual argument into the descriptor of the dummy argument at the entry of the subprogram, and vice versa when leaving the subprogram. Four cases must be considered, following the fact that the actual and the dummy arguments have or have not the trace attribute. If none has, we are done. If the actual and the dummy arguments both have, the modifications of the array will be traced through the subroutine boundaries, by copying forth and back the dirty mask. If the actual does not have the attribute and the dummy has, the modifications of the dirty mask of the dummy inside the procedure have no impact on the actual: reusing of schedules where this array is involved is only possible during the lifetime of this subroutine call. Finally, procedures without explicit or implicit interface, or with an explicit or implicit interface that does not give the trace attribute to an output dummy argument, are assumed to modify the actual argument. Thus, on the calling side, the compiler will set the dirty flag of the corresponding array when the procedure returns. With this scheme, the case of library subroutines that modify an indirection array, and that are not aware of the trace attribute, is correctly handled.
4 Performance Results
All experiments were performed on an IBM SP2 equipped with P2SC nodes, with an 80 MB/s maximum bandwidth per node. We used the user space protocol, and the hardware timer of the Power2 architecture. Measurements are averages.
4.1 Basic Performance
We studied the basic performance of the URB protocol for a typical gather operation, A(I) = B(L(I)), in order to make evident the various components of the inspector-executor scheme. Three schemes of indirect addressing L have been considered: identity (i), a cyclic shift (c), and uniform random values (r). The random and shift distributions stand here for two typical degrees of irregularity, heavy and light. Identity is useful to measure pure overhead. It also describes data-dependent locality, where no communication is needed because the indirection array operates inside disjoint index subsets. Domain decomposition methods are a typical example of this situation. First, a breakdown of the various components of the inspector and of the executor, for the random scheme, on an 8 processor configuration and a 4K array of reals (Table 1), shows that the only potential overhead of the URB protocol, namely searching the database, is negligible. It also highlights the cost of the inspector step.

Table 1. Timing of unstructured communication (seconds, 8 processors).

                inspector step   executor step
    no reuse    0.16 sec         0.07 sec
    reuse       0.25e-4 sec      0.07 sec
6e-02
Time (s)
2e-02
4e-03
1e-03
2e-04
6e-05 2K
4K
8K
16K
32K
64K 128K 256K Array Size (real)
512K
1M
Fig. 4. Performance effects of schedule reusing: 16 processors. Curves labeled ie are the non-reuse case, and curves labeled ex are the reuse case. Figure 4 presents a detailed comparison of timings with and without schedule reuse, on a 16 processors configuration. The random scheme displays a nearly
linear difference (constant speedup) between the reuse and non-reuse case, because of the index exchange, which in this case has exactly the same volume as the data movement. The identity scheme, in the reuse case, describes the minimal cost of an indirect operation, which includes reading the schedule and the local copy between the user arrays. The shift scheme, in the reuse case, has only the additional overhead of each processor sending one 4-byte message. This overhead is very significant for small arrays, but becomes negligible for large arrays, where the performance converges with that of the identity case. The shift and identity schemes have very similar performance in the non-reuse case, and display the overhead of building the schedule (ownership and offset computations).

4.2 Performance Results for a Real Application
The AEROLOG computational fluid dynamic software developed by MATRA is devoted to the study of compressible fluid flows around complex geometries. The kernel of the numerical solver (called CLASS for Conservative LAws Systematic Solver) is common to all physical models and is based on a second order accurate time and space finite volume scheme. Within the European Esprit project PHAROS, this code has been ported to HPF [5]. The HPF parallelization strategy is based on the coarse grain parallelism given by the decomposition of the mesh data into sub-domains. In one time step, the time increments for the physical variables are computed. The routines called within one time step can be sorted into two groups, the local routines and the boundary routines. The local routines are called independently over subdomains. This group represents typically up to 90 % of the total CPU time of the serial code. The local computations scale well as no communication is necessary. The boundary routines perform the boundary conditions. These routines make an intensive use of indirect addressing and involve dependences between data belonging to different sub-domains.
Table 2. Execution times in seconds (304200 mesh points, 20 iterations).

                              P=1      P=2     P=4     P=8     P=16
    local                     190.46   95.84   46.68   23.43   12.86
    boundary  replicated       11.24   35.99   38.53   42.97   52.19
              compress         12.03   11.49   11.56   13.24   14.05
              notrace         165.65   96.88   52.43   28.66   21.12
              trace            34.42   18.72   12.16    7.73    5.88
In the initial HPF version, the whole mesh data involved in the boundary computations, as well as the boundary computations themselves, are replicated on all processors. This avoids unstructured communication, but the replication is very expensive. In the tuned HPF version, the mesh data is packed before the replication. This reduces the communication volume, but the scalability is still limited.
The third version distributes the mesh data and thus requires unstructured communication. Performance results are very poor if the communication schedule is not reused (notrace). Only the reuse of communication schedules (trace) gives acceptable and scalable performance results. With ADAPTOR, the user gets this performance gain only by inserting some TRACE directives in his HPF code. Table 2 shows the performance results for a medium industrial test case with 16 sub-domains and 304200 total mesh points, executing for 20 iterations.
5 Conclusions
In this paper, we have presented a strategy to implement effectively an inspector/executor paradigm where communication schedules can be reused. It no longer requires data-flow and inter-procedural analysis at compile time. It will also reuse communication schedules in situations where even the best compile-time analysis might fail, e.g. if the indirection arrays are modified under a mask. Although the method described here requires a new language construct, the TRACE directive, the user does not have to track the multiple execution paths. This should be the task of the compiler and of the run-time system. The realization of our concept in the ADAPTOR HPF compilation system showed that it can be easily integrated in an existing HPF compiler that supports the inspector/executor scheme. The compiler itself has only to deal with the new TRACE attribute. The runtime system requires support for reusing communication schedules and the implementation of a database for these schedules. For the tracing of arrays with the TRACE attribute, the array descriptor has to be extended by a dirty mask. Tracing works across subroutine boundaries if an array or a section of an array is passed via such a descriptor. This is generally the case for ADAPTOR, but tracking fails if the HPF compiler passes an array only by a pointer. The work described here demonstrates two new ideas. The first is that a schedule can be associated with more than one parallel indirect assignment, allowing aggressive optimization even in dynamic applications. The second is the analogy between schedule reusing and cache management. We believe that the ideas of the URB protocol might also be useful for other optimizations in an HPF compiler, especially in the context of indirect and dynamic distributions, and for the efficient support of DOACROSS loops with irregular dependencies. In particular, the fact that the URB protocol is very efficient on the identity case has an important application: the HPF 2.0 Approved Extensions deal with data-dependent locality through the coupling of the ON HOME and RESIDENT directives. Schedule reusing is simpler, for the user and the compiler, and more versatile, because the same code can be kept unmodified whether data-dependent locality exists or not. The improvement of the current strategy as well as the investigation of benefits for other optimizations will be part of our future work.
Acknowledgments We acknowledge the CRI and the CNUSC for access to their IBM SP2.
References 1. G. Agrawal and J. Saltz. Interprocedural communication optimizations for distributed memory compilation. In Language and Compilers for Parallel Computing, pages 1–16, Aug. 1994. 630 2. G. Agrawal, A. Sussman, and J. Saltz. Compiler and run-time support for structured and block-structured applications. In Supercomputing, pages 578–587. IEEE, 1993. 629 3. S. Benkner and H. Zima. Definition of HPF+ Rel.2. Technical report, HPF+ Consortium, 1997. 630 4. T. Brandes and F. Zimmermann. ADAPTOR - A Transformation Tool for HPF Programs. In K. Decker and R. Rehmann, editors, Programming Environments for Massively Parallel Distributed Systems, pages 91–96. Birkh¨ auser Verlag, Apr. 1994. 629, 630 5. T. Brandes, F. Zimmermann, C. Borel, and M. Br´edif. Evaluation of High Performance Fortran for an Industrial Computational Fluid Dynamics Code. In Proceedings of VECPAR 98, Porto, Portugal, June 1998. Accepted for publication. 630, 636 6. R. Das et al. Communication optimization for irregular scientific computations on distributed memory architecturess. Journal of Parallel and Distributed Computing, (22):462–478, 1994. 629, 630 7. C. Germain, J.Laminie, M. Pallud, and D. Etiemble. An HPF Case Study of a Domain-Decomposition Based Irregular Application. In PACT’97. LNCS, Sep 1997. 630 8. M. Gupta, E. Schonberg, and H. Srinivasan. A unified data-flow framework for optimizing communication in data-parallel programs. IEEE Trans. on Parallel and Distributed Systems, 7(7):689–704, 1996. 630 9. High Performance Fortran Forum. High Performance Fortran Language Specification. Rice Univ., Nov. 1994. Version 1.1. 629 10. High Performance Fortran Forum. High Performance Fortran Language Specification, Oct. 1997. Version 2.0. 629 11. S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD Distributed-Memory machines. CACM, 35(8):66–80, Aug. 1992. 629 12. S. Hiranandani, J. Saltz, P. Mehrotra, and H. Berryman. Performance of hashed cache migration schemes on multicomputers. Journal of Parallel and Distributed Computing, 12:415–422, 1991. 629 13. C. Koelbel, P. Mehrotra, and J. V. Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd. Symp. on Principles and Practice of Parallel Programming, pages 177–186. ACM, 1990. 629 14. R. Ponnusamy, J. Saltz, and A. Choudhary. Runtime compilation techniques for data partitioning and communication schedule reuse. In ACM Int. Conf. on Supercomputing, pages 361–370, 1993. 630 15. Portland Group, Inc. PGHPF User’s Guide. Manual Release 2.2, PGI, 1997. 629 16. R. Mirchandaney et al. Principles of run-time support for parallel processing. In ACM Int. Conf. on Supercomputing, pages 140–152, 1988. 629 17. S. D. Sharma and al. Run-time and Compile-time Support for Adaptive Irregular Problems. In Supercomputing’94, pages 99–106. IEEE, 1994. 629 18. H. Zima and B. Chapman. Compiling for distributed-memory systems. Proceedings of the IEEE, Feb. 1993. 629
Contribution to Better Handling of Irregular Problems in HPF2

Thomas Brandes (1), Frédéric Brégier (2), Marie Christine Counilh (2), and Jean Roman (2)

(1) GMD/SCAI, Institute for Algorithms and Scientific Computing, German National Research Center for Computer Science, Schloss Birlinghoven, PO Box 1319, 53754 St. Augustin, Germany
(2) LaBRI, ENSERB and Université Bordeaux I, 33405 Talence Cedex, France

Abstract. In this paper, we present our contribution for handling irregular applications with HPF2. We propose a programming style of irregular applications close to the regular case, so that both compile-time and run-time techniques can be more easily performed. We use the well-known tree data structure to represent irregular data structures with hierarchical access, such as sparse matrices. This algorithmic representation avoids the indirections coming from the standard irregular programming style. We use derived data types of Fortran 90 to define trees and some approved extensions of HPF2 for their mapping. We also propose a run-time support for irregular applications with loop-carried dependencies that cannot be determined at compile-time. Then, we present the TriDenT library, which supports distributed trees and provides run-time optimizations based on the inspector/executor paradigm. Finally, we validate our contribution with experimental results on IBM SP2 for a sparse Cholesky factorization algorithm.
1 Introduction
High Performance Fortran (HPF) [10] is the current standard language for writing data parallel programs for shared and distributed memory parallel architectures. HPF is well suited for regular applications; however, many scientific and engineering applications (fluid dynamics, structural mechanics, . . . ) use irregular data structures such as sparse matrices. Irregular data need a special representation in order to save storage requirements and computation time, and the data accesses use indirect addressing through pointers stored in index arrays (see for example SPARSKIT [14]). In these applications, patterns of computations are generally irregular and special data distributions are required to provide both high data locality for each processor and good load balancing. Then, compile-time analysis is not sufficient to determine data dependencies, location of data and communication patterns; indeed, they are only known at run-time because they depend on the input data. Therefore, it is currently difficult to achieve good performance with these irregular applications. The second version of the language, HPF2 [10], introduces new features for efficient irregular data distributions and for data locality specification in irregular computations. To deal with the lack of compile-time information on irregular
codes, run-time compilation techniques based on the inspector/executor method have been proposed and widely used. Major works include the PARTI [16] and CHAOS [12] libraries used in the Vienna Fortran and Fortran90D [13] compilers, and the PILAR library [11] used in the PARADIGM compiler [2]. They are efficient for solving iterative irregular problems in which communication and computation phases alternate. RAPID [9] is another runtime system based on a computation specification library for specifying irregular data objects and tasks manipulating them. It is not limited to iterative computations and can address algorithms with loop-carried dependencies. In order to reduce the overhead due to these run-time techniques, the technique called sparse-array rolling (SAR) [15] exploits compile-time information about the representation of distributed sparse matrices for optimizing data accesses. In the Vienna Fortran Compilation System [6], a new directive (SPARSE directive) [15] specifies the sparse matrix structure used in the program to improve the performance of existing sparse codes. Bik and Wijshoff [3] propose another approach based on a “sparse compiler” that converts a code operating on dense matrices and annotated with sparsity-related information into an equivalent sparse code. Our approach consists of combining compile-time analysis and run-time techniques, but in a way that differs from the previously cited works. We propose a programming style of irregular applications close to the regular case so that both compile-time and run-time techniques can be more easily performed. In order to do so, we use the well-known tree data structure to represent sparse matrices and, more generally, irregular data structures with a hierarchical access. This algorithmic representation masks the implementation structure and avoids the indirections coming from the standard irregular programming style. As in the SAR approach [15], the compiler knows more information about the distributed data, but unlike this approach, the information naturally results from the representation of the irregular data structure itself. Our approach is not restricted to sparse matrices but it necessitates a coding of irregular applications using our tree structure. In order to increase portability, we use Derived Data Types (DDT) of Fortran 90 to define trees and hence irregular data structures, and we use some approved extensions of HPF2 for their mapping. In the remainder of this paper, we will refer to our language proposition as HPF2/Tree. We also propose a run-time support for irregular applications with loop-carried dependencies that cannot be determined at compile-time. Each iteration is performed on a subset of processors only known at run-time. This support is based on an inspector/executor scheme which builds the processor sets and the loop indices for each processor, and generates the necessary communications. So, the first step of our work has been to develop a library, called TriDenT, which supports distributed trees and implements the above-mentioned optimizations. The performance of this library has been demonstrated on irregular applications such as sparse Cholesky factorization. The next step of this work is to provide an HPF2/Tree compiler; we are currently working on the integration of the TriDenT library in the ADAPTOR compilation platform [5].
This paper (see [4] for a more detailed version) is organized as follows. In Section 2, we give a short background on a sparse column Cholesky factorization algorithm which will be our illustrative and experimental example for the validation of our approach. Section 3 presents HPF2/Tree and Section 4 describes the TriDenT library. Section 5 gives an experimental study on an IBM SP2 for sparse Cholesky factorization applied to real-size problems, and provides an analysis of the performance. Finally, Section 6 gives some perspectives on this work.
2
An Illustrative Example
The Cholesky factorization of sparse symmetric positive definite matrices is an extremely important computation arising in many scientific and engineering applications. However, this factorization step is quite time-consuming and is frequently the computational bottleneck in these applications. Consequently, it is a particularly relevant example for our study. The goal is to factor a sparse symmetric positive definite n × n matrix A into the form A = LL^T, with L lower triangular. Two steps are typically performed for this computation. First, we perform a symbolic factorization to compute the non-zero structure of L from the ordering of the unknowns in A; this ordering, obtained for example with a nested dissection strategy, must reduce the fill-in and increase the parallelism in the computations. This (irregular) data structure is allocated and its initial non-zero coefficients are those of A. In the pseudo-code given below, this step is implicitly contained in the instruction L = A. From this symbolic factorization, one can deduce for each k the two following sets, defined as the sparsity structure of row k and the sparsity structure of column k of L:

      Struct(Lk*) = {j < k such that lkj ≠ 0}
      Struct(L*k) = {i > k such that lik ≠ 0}

Second, the numerical factorization computes the non-zero coefficients of L in the data structure. This step, which is the most time-consuming, can be performed by the following sparse column-Cholesky factorization algorithm (see for example [7,1] and included references).

      1. L = A
      2. for k = 1 to n do
      3.   for j ∈ Struct(Lk*) do
      4.     for i ∈ Struct(L*j), i ≥ k do
      5.       lik = lik - lkj * lij          % cmod(k,j,i) %
      6.   lkk = √lkk                         % cdiv(k) %
      7.   for i ∈ Struct(L*k) do
      8.     lik = lik / lkk
where cmod(k, j, i) represents the modification of the rows of column k by the corresponding terms in column j, and cdiv(k) represents the division of column k by the scalar √lkk. This algorithm is said to be a left-looking algorithm, since at each stage it accesses the needed columns to the left of the current column in the matrix. It is also referred to as a fan-in algorithm, since the basic operation is to combine the effects of multiple previous columns on a single subsequent column.
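The routines cmod and cdiv appear again later (Program 3) without being defined; the following Fortran sketch of cdiv(k) is our own reading of steps 6-8 above, under the assumption (ours, not stated in the paper) that the first stored entry of a column is its diagonal element:

      ! Hedged sketch of cdiv(k) applied to one packed column of L.
      ! COLVAL holds the stored non-zero values of column k, diagonal l_kk first.
      SUBROUTINE CDIV (COLVAL)
         REAL, INTENT(INOUT) :: COLVAL(:)
         COLVAL(1)  = SQRT(COLVAL(1))          ! l_kk = sqrt(l_kk)
         COLVAL(2:) = COLVAL(2:) / COLVAL(1)   ! l_ik = l_ik / l_kk for i in Struct(L_*k)
      END SUBROUTINE CDIV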
If we now consider the parallelism induced by sparsity and achieved by distributing the columns on processors (so we exploit the parallelism of the outer loop of instruction 2.), we can see that a given column k depends only on columns belonging to Struct(Lk*); so, we have loop-carried dependencies in this algorithm. In order to take advantage of this structural parallelism, we use an irregular distribution called subtree-to-subcube mapping, which leads to an efficient reduction of communication while keeping a good load balance between processors; this mapping is computed algorithmically from the sparse data structure for L (see for example [7] and included references).
3
HPF2/Tree
3.1
Representation of Irregular Data Structures with Trees
In this paper, we consider irregular data structures with a hierarchical access and we propose the use of a tree data structure for their representation. For example, the sparse matrix used in the column-oriented Cholesky algorithm above must be a data structure with hierarchical column access, and the user can represent it by a tree with three levels numbered from 0 to 2. The k-th node on level 1 represents the k-th column. Its p-th son represents the p-th non-zero element, defined by its value and row number. Fig. 1 gives a sparse matrix and its corresponding tree. In order to distinguish the levels, we number the columns with roman numerals and the rows with arabic numerals.

[Fig. 1 shows an 8 × 8 sparse matrix (columns I-VIII, rows 1-8), the corresponding three-level tree, and the following HPF2/Tree declaration.]

      type level2
         integer ROW     ! row number
         real VAL        ! non-zero value
      end type level2

      type level1
         type (level2), allocatable :: COL(:)
      end type level1

      type (level1), allocatable :: A(:)

!ADP$ TREE A

Fig. 1. A Sparse Matrix and its HPF2/Tree Declaration.
3.2
Tree Representation and TREE Directive in HPF2
Trees are embedded in the HPF2 programming language by using the Derived Data Types (DDT) of Fortran 90. For every level of a tree (except level 0), a DDT must be defined. It contains all the data type definitions for this level and (except for the level with the greatest number) the declaration of an array variable whose type is the one associated with the following level. A tree is then declared as an array whose type is the DDT associated with level 1 (cf. Fig. 1). The TREE directive distinguishes tree variables from other variables using DDTs, because tree variables are used in a restricted way; in particular, pointer
attributes of Fortran 90 and recursion are not allowed. So, our tree data structures must not be recursively defined and cannot be used, for example, to represent quadtree decompositions of matrices [8]. However, this limitation leads to an efficient implementation of trees by the compiler (cf. 4.1). Accesses to a specific tree element or to a subtree are performed using the usual Fortran 90 notation. The advantage of this programming style is the analogy between the level notion for trees and the classical dimension notion for arrays used in regular applications. Thus, the compiler can perform the classical optimizations based on dependence analysis much more easily than with the indirection arrays currently used in irregular applications.
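As a small illustration of this notation (our own fragment, based only on the declarations of Fig. 1; the variables S, K and P are ours), element and subtree accesses read like ordinary Fortran 90 component and section references:

      ! Summing all stored values of the tree A of Fig. 1.
      ! A(K) is the subtree for column K; A(K)%COL(P)%VAL is its P-th stored value.
      REAL    :: S
      INTEGER :: K, P
      S = 0.0
      DO K = 1, SIZE(A)                ! level 1: the columns
         DO P = 1, SIZE(A(K)%COL)      ! level 2: the stored entries of column K
            S = S + A(K)%COL(P)%VAL
         END DO
      END DO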
3.3 Distribution of Trees
In order to support irregular applications, HPF2/Tree includes the GEN_BLOCK and INDIRECT distribution formats and allows the mapping of the components of DDTs according to the HPF 2.0 approved extensions [10]. The constraints for the mapping of derived type components allow the mapping of structure variables at only one level, the reference level. For the two distributions of tree A given in Fig. 2, the reference level is level 1. In HPF2/Tree, the levels preceding the reference level are replicated, while the levels following it are distributed according to the distribution of this reference level. This distribution implies an implicit alignment of data with the data of the reference level. This data locality is an important advantage of tree distribution, and a compiler can exploit it to generate efficient code. This is clearly more difficult to obtain when a sparse storage format, such as the CSC format [14], is used, because the indirection and data arrays are of different sizes and their distribution formats also differ.

[Fig. 2 shows the tree A of Fig. 1 mapped onto 3 processors, first with a BLOCK distribution and then with an INDIRECT distribution, specified by the following directives.]

!HPF$ distribute A(BLOCK)

      integer ind(1:8) = (/1,2,3,2,1,2,3,2/)
!HPF$ distribute A(INDIRECT(ind))

Fig. 2. Tree Distributions with 3 Processors.
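Fig. 2 only shows the BLOCK and INDIRECT mappings; for completeness, a GEN_BLOCK mapping of the same tree could be written as below (our own sketch: the block-size array SIZES is an example we chose so that its entries sum to the eight columns of A, and is not taken from the paper):

      ! Hypothetical GEN_BLOCK distribution of the 8 columns of A over 3
      ! processors: 3 columns to the first, 3 to the second, 2 to the third.
      INTEGER, PARAMETER :: SIZES(3) = (/ 3, 3, 2 /)
!HPF$ DISTRIBUTE A(GEN_BLOCK(SIZES))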
3.4
Application to the Sparse Fan-In Cholesky Algorithm
This section describes an HPF2/Tree code (cf. Program 3 (a)) for the sparse fan-in Cholesky algorithm presented in Section 2. This code uses the tree A declared in Fig. 1 and another tree, named B, used to store the dependencies between columns. Thus, the first value of B(K)%COL(:)%DEP is K, and the others are the column numbers given by Struct(Lk*). In Section 4.3, we will show that these
data can be computed during an appropriate inspection phase, so that this tree B can be avoided. The distribution of tree A uses a specific INDIRECT distribution of the nodes on level 1 according to a subtree-to-subcube mapping (cf. Section 2). Then, the tree B is aligned with the tree A using the HPF ALIGN directive. In the outer K loop, the ON directive asserts that the statements at iteration K are performed by the set of processors that own at least one column with a non-zero value in row K. The inner L loop performs a reduction inside this set of processors using the NEW variable TMP_VAL specified in the ON directive. Each iteration L is performed by the processor which owns the column with number J = B(K)%COL(L)%DEP; so, the computation of the contributions for column K is distributed. Moreover, a compiler might identify that this reduction uses an all-to-one scheme, due to the instruction following the loop. The tree notation, which avoids indirections in the code, together with the data locality and alignment provided by the distribution of the tree, enables the compiler to extract the needed communications outside this inner loop (for the A(K)%COL(:)%ROW and B(K)%COL(:)%DEP variables). This would not be possible with a sparse storage format (such as CSC) because of the indirections.

Version (a) : HPF2/Tree

      type Blevel2
         integer DEP      ! Column number (in B(K)) which modifies A(K)
      end type Blevel2

      type Blevel1
         type (Blevel2), allocatable :: COL(:)
      end type Blevel1

      type (Blevel1), allocatable :: B(:)

!ADP$ TREE B
!HPF$ DISTRIBUTE A(INDIRECT(map_array))
!HPF$ ALIGN B WITH A

      DO K = 1, size(A)
!HPF$ ON HOME (A(B(K)%COL(:)%DEP)), NEW(TMP_VAL) BEGIN
         allocate (TMP_VAL(size(A(K)%COL)))
         TMP_VAL(:) = 0.0
!HPF$ INDEPENDENT, REDUCTION (TMP_VAL)
         DO L = 2, size(B(K)%COL)
!HPF$ ON HOME (A(B(K)%COL(L)%DEP)) BEGIN
            J = B(K)%COL(L)%DEP
            CMOD(TMP_VAL(:), A(K)%COL(:)%ROW,
     $           A(J)%COL(:)%VAL, A(J)%COL(:)%ROW)
!HPF$ END ON
         END DO
         A(K)%COL(:)%VAL = A(K)%COL(:)%VAL - TMP_VAL(:)
         CDIV(A(K)%COL(:)%VAL)
!HPF$ END ON
      END DO

Version (b) : HPF2/Tree + ON HOME Extension

      DO K = 1, size(A)
!HPF$ ON HOME(A(J), J = 1:K, ANY(A(J)%COL(:)%ROW) .EQ. K),
     $   NEW(TMP_VAL) BEGIN            ! Active processor set
         ! Local array for the reduction
         allocate (TMP_VAL(size(A(K)%COL)))
         TMP_VAL(:) = 0.0
!HPF$ INDEPENDENT, REDUCTION (TMP_VAL)
         DO J = 1, K-1
!HPF$ ON HOME(A(J), ANY(A(J)%COL(:)%ROW) .EQ. K) BEGIN
            ! Only for contribution columns
            CMOD(TMP_VAL(:), A(K)%COL(:)%ROW,
     $           A(J)%COL(:)%VAL, A(J)%COL(:)%ROW)
!HPF$ END ON
         END DO
         A(K)%COL(:)%VAL = A(K)%COL(:)%VAL - TMP_VAL(:)
         ! Implies an all-to-one reduction
         CDIV(A(K)%COL(:)%VAL)
!HPF$ END ON
      END DO

Program 3. Two Versions of the Fan-In Cholesky Algorithm.
However, this example also shows that run-time techniques are required, because the set of active processors specified by the ON directive cannot be known until run time. In the following section, we describe the TriDenT run-time system, which provides support for trees and for such irregular active processor sets.
4
The TriDenT Library
TriDenT is a library which consists of two parts: a set of routines for the manipulation of trees, and another one for the optimization of computations and communications in irregular processor sets based on the inspector/executor paradigm.
We currently use this library to validate the tree approach and our optimization proposals by writing HPF2 codes with explicit calls to TriDenT primitives. Furthermore, the TriDenT library has been designed to be integrated as easily as possible into the ADAPTOR compilation platform [5], in order to automatically translate HPF2/Tree codes into SPMD codes with calls to TriDenT primitives. ADAPTOR also supports HPF2 features and is based on the DALIB (Distributed Array LIBrary) run-time system for distributed arrays. The following sections present the two components of the TriDenT library.
4.1 Support for Distributed Trees
TriDenT implements trees by using an array representation. The TREE directive allows the compiler to generate one-dimensional arrays for the implementation of the Derived Data Types associated with the tree; this implementation consists of one array for each scalar variable of a DDT and of some pointer arrays (cf. Fig. 4). Hence, TriDenT provides efficient access functions to tree elements and to subtrees; moreover, the array implementation makes it possible to take advantage of potential cache effects.

[Fig. 4 shows the array implementation of the tree A of Fig. 1 (a pointer array plus the data arrays A_Col_Row and A_Col_Val), its indirect distribution over the processors a, b and c, and the local data owned by processor a.]

Fig. 4. Indirect Distribution and Local Data for Processor a.
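As a rough sketch of the layout suggested by Fig. 4 (the name A_Col_Ptr and the access formula below are our own guesses; only A_Col_Row and A_Col_Val appear in the figure, and TriDenT's actual data structures certainly differ), the tree A is flattened into one offset array plus one data array per scalar component of the level-2 DDT:

      ! Assumed flattened representation of the tree A (sketch only).
      INTEGER, ALLOCATABLE :: A_Col_Ptr(:)   ! A_Col_Ptr(K) = start of column K
      INTEGER, ALLOCATABLE :: A_Col_Row(:)   ! ROW components, stored column by column
      REAL,    ALLOCATABLE :: A_Col_Val(:)   ! VAL components, stored column by column
      ! With this layout the access A(K)%COL(P)%VAL would translate to
      !    A_Col_Val(A_Col_Ptr(K) + P - 1)
      ! i.e. a plain array reference, which is what makes the cache-friendly
      ! behaviour mentioned above possible.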
The distribution of trees (cf. 3.3) is performed using the technique described in [13]. It consists of distributing all the data arrays, for the levels from the reference level onwards, with the GEN_BLOCK distribution format according to the tree distribution format. Then, the processor local data are stored in a contiguous memory space (cf. Fig. 4 for the indirect distribution of Fig. 2).
4.2 Support for Irregular Processor Subsets
TriDenT is a run-time system which integrates the active processor set notion introduced in HPF2 by the ON directive. This directive specifies how computation is partitioned among processors and therefore may affect the efficiency of the computation. TriDenT thus includes context push and pop, as well as reduction and broadcast operations, for active processor sets. Moreover, if the ON directive includes an indirection (e.g. ON HOME (A(INDIR(1:4)))), the set can only be determined at run time, so an inspection phase is necessary. TriDenT provides further optimizations for handling an ON directive used inside a global DO loop with loop-carried dependencies which can be determined before this loop. Our main goal is to improve the efficiency of the global computation (inspection and execution), even if the executor is used only once. The inspector consists of three parts.
The Set Inspector analyzes the ON directive of the global loop in order to create in advance the processor subsets associated with each iteration. This avoids, during the execution of the global loop, the synchronizations due to the set-building messages. In order to allow an automatic generation of these irregular sets, we propose to extend the ON directive by specifying the algorithmic properties to be verified by the variables in the HOME clause (cf. Program 3 (b)). The Loop Index Inspector determines the useful loop indices for each processor. The Communication Inspector inspects two kinds of communications: those to extract out of the global loop and across-iteration communications. Moreover, the executor optimizes the communications by splitting the send and receive operations and integrating them in the code so as to minimize the synchronizations.
4.3 Application to the Sparse Fan-In Cholesky Algorithm
We briefly describe the optimizations achieved by the inspection mechanisms on the HPF2/Tree code for the sparse fan-in Cholesky algorithm given in Program 3 (b). This code uses the HOME clause extension introduced in Section 4.2 and avoids the effective storage of the dependence tree B used in Program 3 (a) (whose size is of the same order of magnitude as A(:)%COL(:)%ROW). The Set Inspector involved by the extended ON directive for the global loop directly generates, from the A(:)%COL(:)%ROW data (and so without the tree B), the processor set for each iteration K (ACTIVE_PROC(K)). In this example, the Loop Index Inspection for the inner loop makes it possible to avoid scanning A(J)%COL(:)%ROW, and to directly use the valid iteration index sets for each processor. The compiler can identify A(K)%COL(:)%ROW as loop invariant, and can detect that these data must be sent to the ACTIVE_PROC(K) set; then the Communication Inspector can extract the emission outside the global loop and keep the reception inside iterations.
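A minimal sketch of the set and loop-index inspection described here is given below; every name in it (OWNER, MYPROC, ACTIVE, N) is ours, and the real TriDenT structures certainly differ. Each processor scans only its local columns; a global exchange (not shown) would then combine the per-processor contributions into ACTIVE_PROC(K):

      ! Inspection pass run once before the factorization loop (sketch).
      ! OWNER(J) = processor owning column J; MYPROC = local processor id;
      ! N = order of the matrix. Assumes the diagonal is the first stored entry.
      LOGICAL :: ACTIVE(N)     ! ACTIVE(K): this processor contributes to iteration K
      INTEGER :: J, P, K
      ACTIVE = .FALSE.
      DO J = 1, N
         IF (OWNER(J) /= MYPROC) CYCLE       ! inspect local columns only
         DO P = 2, SIZE(A(J)%COL)            ! skip the diagonal entry
            K = A(J)%COL(P)%ROW              ! column J has a non-zero in row K,
            ACTIVE(K) = .TRUE.               ! so it contributes to iteration K
         END DO
      END DO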
5
Experimental Results
We validate our contribution with experimental results for a sparse fan-in Cholesky factorization on an IBM SP2 with 16 processors. The test matrix is an n × n sparse matrix with n = 65024, obtained from a 2D-grid finite element problem. This matrix is distributed with an INDIRECT mapping according to the subtree-to-subcube distribution (cf. Section 2). We compare the three versions described in Fig. 5. The A version uses the tree B (cf. Program 3 (a)), whereas the B and C versions use the proposed extended HOME clause (cf. Program 3 (b)). In what follows, the global time is the sum of the inspection and execution times. All these versions use the Processor Set Inspection. We have verified the effectiveness of this basic inspection by comparing the execution of A with another version which does not use the processor subsets (35 seconds against 460 seconds for the global time with 16 processors).
Name   Irregular Set Insp.   Loop Index Insp.   Communication Insp.   Dep.: (T)ree B or (O)N HOME Ext.
A      Yes                   No                 No                    T
B      Yes                   Yes                No                    O
C      Yes                   Yes                Yes                   O
Fig. 5. Characteristics of the Three Versions of the Fan-In Cholesky Factorization
Fig. 6 presents the relative efficiencies with regard to version A, first for the execution time only, and second for the global time. The first graph demonstrates the improvement obtained: from 2% to 8% for the Loop Index Inspection (B), and from 3% to 13% with the Communication Inspection added (C). The second graph illustrates that the global time, even for only one executor phase, can be better than for the reference version A. It appears that the Loop Index Inspection leads to an interesting improvement (4% for B). However, the cost of the Communication Inspection is only just recovered by the executor on 16 processors; nevertheless, one can expect a greater improvement with more processors. In all our experiments, we notice that the inspection times are proportional to the global time. This property is important for the scalability of our execution support. For example, this time represents between 9% and 15% for version C, depending on the number of processors.
[Fig. 6 shows two graphs of relative efficiency (in %) versus the number of processors (1, 2, 4, 8, 16) for the three versions (A) Processor Set, (B) Set and Loop Indices, (C) Set, Loop Indices and Communications; the first graph is for the execution time and the second for the global time.]

Fig. 6. Relative Efficiencies (Execution Time and Global Time).
So, these experimental results validate our run-time system. Of course, if the factorization step is executed many times, then the inspector cost will be negligible compared with the gain achieved by multiple executions. See [4] for other measures which validate this approach.
6
Conclusion and Perspectives
In this paper, we present a contribution to the better handling of irregular problems in an HPF2 context. We use a tree representation to take data or distribution irregularity into account. This can help the compiler to perform standard analyses and optimizations. We also introduce an extension of the ON HOME directive in order to enable the automatic creation of irregular processor sets, and we propose a run-time support, based on the inspector/executor paradigm, for active processor sets, loop indices and communications for algorithms with loop-carried dependencies. Our approach is validated by experimental results for sparse Cholesky factorization.
Our future work is to extend the properties of our trees (geometrical distributions such as BRS or MRD [15], flexible trees in order to enable reallocations and redistributions), and to improve the performance of our inspector/executor system (use of the task parallelism induced by the loops). Finally, we are currently working on the integration of TriDenT in the ADAPTOR compilation platform.
References

1. C. Ashcraft, S. C. Eisenstat, J. W.-H. Liu, and A. H. Sherman. A Comparison of Three Column Based Distributed Sparse Factorization Schemes. In Fifth SIAM Conference on Parallel Processing for Scientific Computing, 1991.
2. P. Banerjee, J. A. Chandy, M. Gupta, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The PARADIGM Compiler for Distributed-Memory Message Passing Multicomputers. In the First International Workshop on Parallel Processing, Bangalore, India, December 1994.
3. A. J. C. Bik and H. A. G. Wijshoff. Simple Quantitative Experiments with a Sparse Compiler. In A. Ferreira, J. Rolim, Y. Saad, and T. Yang, editors, Proc. of Third International Workshop, IRREGULAR'96, volume 1117 of Lecture Notes in Computer Science, pages 249-262, Springer, August 1996.
4. T. Brandes, F. Brégier, M. C. Counilh, and J. Roman. Contribution to Better Handling of Irregular Problems in HPF2. Technical Report RR 120598, LaBRI, May 1998.
5. T. Brandes and F. Zimmermann. Programming Environments for Massively Parallel Distributed Systems, chapter ADAPTOR - A Transformation Tool for HPF Programs, pages 91-96. In K. M. Decker and R. M. Rehmann, editors, Birkhauser Verlag, April 1994.
6. B. Chapman, S. Benkner, R. Blasko, P. Brezany, M. Egg, T. Fahringer, H. M. Gerndt, J. Hulman, B. Knaus, P. Kutschera, H. Moritsch, A. Schwald, V. Sipkova, and H. Zima. Vienna Fortran Compilation System, User's Guide Edition, 1993.
7. K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, 1990.
8. J. D. Frens and D. S. Wise. Auto-blocking Matrix-Multiplication or Tracking BLAS3 Performance with Source Code. Technical Report 449, Computer Science Department, Indiana University, December 1996.
9. C. Fu and T. Yang. Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 42:143-156, 1997.
10. HPF Forum. High Performance Fortran Language Specification, January 1997. Version 2.0.
11. A. Lain. Compiler and Run-time Support for Irregular Computations. PhD thesis, Illinois, 1996.
12. S. S. Mukherjee, S. D. Sharma, M. D. Hill, J. R. Larus, A. Rogers, and J. Saltz. Efficient Support for Irregular Applications on Distributed-Memory Machines. In ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), July 1995.
13. R. Ponnusamy, Y. S. Hwang, R. Das, J. Saltz, A. Choudhary, and G. Fox. Supporting Irregular Distributions in Fortran 90D/HPF Compilers. IEEE Parallel and Distributed Technology, 1995. Technical Report CS-TR-3268 and UMIACS-TR-9457.
14. Y. Saad. SPARSKIT: a Basic Tool Kit for Sparse Matrix Computations - Version 2. Technical report, CSRD, University of Illinois, June 1994.
15. M. Ujaldon, E. L. Zapata, B. M. Chapman, and H. P. Zima. New Data-Parallel Language Features for Sparse Matrix Computations. In Proc. of 9th IEEE International Parallel Processing Symposium, Santa Barbara, California, April 1995.
16. J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani. Distributed Memory Compiler Design for Sparse Problems. IEEE Transactions on Computers, 44(6), 1995.
OpenMP and HPF: Integrating Two Paradigms

Barbara Chapman (VCPC, University of Vienna, Vienna, Austria) and
Piyush Mehrotra (ICASE, MS 403, NASA Langley Research Center, Hampton VA 23681)
Abstract. High Performance Fortran is a portable, high-level extension of Fortran for creating data parallel applications on non-uniform memory access machines. Recently, a set of language extensions to Fortran and C based upon a fork-join model of parallel execution was proposed; called OpenMP, it aims to provide a portable shared memory programming interface for shared memory and low latency systems. Both paradigms offer useful features for programming high performance computing systems configured with a mixture of shared and distributed memory. In this paper, we consider how these programming models may be combined to write programs which exploit the full capabilities of such systems.
1
Introduction
Recently, high performance systems with physically distributed memory, and multiple processors with shared memory at each node, have come onto the market. Some have relatively low latency and may support at least moderate levels of shared memory parallelism. This has motivated a group of hardware and compiler vendors to define OpenMP, an interface for shared memory programming which they hope to see adopted by the community. OpenMP [2] supports a model of shared memory parallel programming which is based to some extent on PCF [3], an earlier effort to define a standard interface, but increases the power of its parallel constructs by permitting parallel regions to include subprograms. HPF was designed to exploit data locality and does not provide features for utilizing shared memory in a system; OpenMP provides the latter, however it does not support data locality. In this paper we consider how these paradigms might be combined to program distributed memory multiprocessing systems (DMMPs) which may require both of these. We assume that such a system consists of a number of interconnected processing nodes, or simply nodes, each of which is associated with a specific physical memory module and one or more general-purpose processors which share this memory and execute a program’s instructions at run time: we emphasize this two-level structure by calling such systems SM-DMMPs.
This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-97046 while both authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
We first consider interfacing the paradigms, enabling each of them to be used where appropriate. However, there is also potential for a tighter integration of at least some features of HPF with those offered by OpenMP. We therefore examine an approach which combines HPF data mappings with OpenMP constructs, permitting exploitation of shared memory whilst coordinating the mapping of data and work to the target machine in a manner which preserves data locality. This paper is organized as follows. Section 2 briefly describes the features of each paradigm, and in Section 3 we consider an interface between them based upon the HPF extrinsic mechanism. We speculate on the potential for a closer integration of the two, and illustrate each of them with a multiblock application example.
2
A Brief Comparison of HPF and OpenMP
High Performance Fortran (HPF) is a set of extensions to Fortran, designed to facilitate data parallel programming on a wide range of architectures [1]. It presents the user with a conceptual single thread of control. HPF directives allow the programmer to specify the distribution of data across processors, thus providing a high level description of the program's data locality, while the compiler generates the actual low-level parallel code for communication and scheduling of instructions. HPF also defines directives for expressing loop parallelism and simple task parallelism, where there is no interaction between tasks. Mechanisms for interfacing with other languages and programming models are provided. OpenMP is a set of directives, with bindings for Fortran 77 and C/C++, for explicit shared memory parallel programming [2]. It is based upon the fork-join execution model, in which a single master thread begins execution and spawns worker threads to perform computations in parallel as required. The user must specify the parallel regions, and may explicitly coordinate threads. The language thus provides directives for declaring such regions, for sharing work among threads, and for synchronizing them. An order may be imposed on variable updates and critical regions defined. Threads may have private copies of some of the program's variables. Parallel regions may differ in the number of threads which execute them, and the assignment of work to threads may also be dynamically determined. Parallel sections easily express other forms of task parallelism; interaction is possible via shared data. The user must take care of any potential race conditions in the code, and must consider in detail the data dependences which need to be respected. Both paradigms provide a parallel loop with private data and reduction operations, yet these differ significantly in their semantics. Whereas the HPF INDEPENDENT loop requires data independence between iterations (except for reductions), OpenMP requires only that the construct be specified within a parallel region in order to be executed by multiple threads. This implies that the user has to handle any inter-iteration data dependencies explicitly. HPF permits private variables only in this context; otherwise, variables may not have different values in different processes. OpenMP permits private variables – with
potentially different values on each thread – in all parallel regions and worksharing constructs. This extends to common blocks, private copies of which may be local to threads. Even subobjects of shared common blocks may be private. Sequence association is permitted for shared common block data under OpenMP; for HPF, this is the case only if the common blocks are declared to be sequential. The HPF programming model allows the code to remain essentially sequential. It is suitable for data-driven computations, and on machines where data locality dominates performance. The number of executing processes is static. In contrast, OpenMP creates tasks dynamically and is more easily adapted to a fluctuating workload; tasks may interact in non-trivial ways. Yet it does not enable the binding of a loop iteration to the node storing a specific datum. Thus each programming model has specific merits with respect to programming SM-DMMPs, and provides functionality which cannot be expressed within the other.
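To make the loop-level contrast above concrete, the same reduction loop under the two models might be written as follows (our own minimal illustration, not an example from either specification):

!     HPF: iterations must be independent apart from the declared reduction;
!     T is NEW (one instance per iteration) and S is a reduction variable.
!HPF$ INDEPENDENT, NEW (T), REDUCTION (S)
      DO I = 1, N
         T = A(I) * B(I)
         S = S + T
      END DO

!     OpenMP: the loop merely has to appear in a parallel region; any other
!     inter-iteration dependences are the programmer's responsibility.
!$OMP PARALLEL DO PRIVATE (T) REDUCTION (+:S)
      DO I = 1, N
         T = A(I) * B(I)
         S = S + T
      END DO
!$OMP END PARALLEL DO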
3
Combining HPF with OpenMP
In this section we examine two ways of combining HPF and OpenMP such that we can exploit the features of both approaches. Note that the logical processors of HPF can be associated with either the nodes or with the individual processors of an SM-DMMP. In the former case, the user must rely on automatic shared memory parallelization to exploit all processors on a node. In the latter case, a typical implementation would create separate address spaces for each individual processor, and even communication between two processors on the same node will require an explicit data transfer. 3.1
OpenMP as an HPF Extrinsic Kind
It is a stated goal of HPF to interoperate with other programming paradigms. The language specification defines the extrinsic mechanism for interfacing to program units which are written in other languages or do not assume the HPF model of computation. The programming language and computation model must be defined. An interface specification must be provided for each extrinsic procedure. This approach keeps the two models completely separate, which implies that the compilers can also be kept distinct; in particular, the OpenMP compiler does not need to know anything about HPF data distributions. For the HPF calling program, execution should proceed just as if an HPF subprogram was invoked. HPF provides language interfaces for Fortran (90/95), Fortran 77 and C. We consider only Fortran programs and OpenMP constructs in this paper.1 HPF provides the pre-defined programming models global, serial and local for use with extrinsics. It also permits additional models so long as conceptually one copy of the called procedure executes initially. That is the case for an OpenMP program. We define an extrinsic model below which allows us to invoke OpenMP subroutines and functions from within an HPF program. We also indicate how an alternative model might be defined. 1
The two Fortran language versions differ in the assumptions on sequence and storage association, particularly when arguments are passed to subroutines.
The OpenMP extrinsic model We base the OpenMP extrinsic kind upon the predefined local model. It assumes that processors defined in an HPF program are associated with nodes of the target system (the number of nodes, not the total number of processors, is returned by HPF’s NUMBER OF PROCESSORS function). An OpenMP extrinsic procedure called under this model is realized by creating a single local procedure on each node (or HPF processor). These execute concurrently and independently, exclusively within their respective node, until they return to the HPF calling program upon completion. The calling program blocks until all local procedures have terminated. With the extrinsic kind OpenMP, therefore, each local procedure is an independent OpenMP routine, initially executed by a master thread on a processor of the node it is invoked on. It will create worker threads on the same node when a parallel region is encountered; no communication is possible with a local OpenMP procedure on another node. Only the processors and memory on that node are available to it. An explicit interface will describe the HPF mapping of data required by the routine; any remapping required is performed before the OpenMP extrinsic procedure is called. Each invocation will access the local part of distributed data only. Scalar data arguments are available on each node. Only data visible to the master thread will be returned: if they are scalar, the value must be the same on each node. The OpenMP routine may not return to the caller from a parallel region or from within a synchronization construct, and may not have alternate entries. An OpenMP library routine will return information related to the procedure executing on the node where it is invoked only. This programming model might be very useful for systems with high latency between nodes, such as clusters of workstations, where it is important to map data to the node which needs them. It permits use of a finer grain of parallelism to exploit the processors within each such node once the data is in place. However, it is also restrictive: an OpenMP extrinsic routine can only exploit parallelism within a single node, since other nodes, and their data, are not visible to it.
An alternative extrinsic model Other extrinsic kinds may be defined which conform to the requirements of HPF. In particular, an alternative can be defined which permits an OpenMP routine to utilize an entire platform, i.e. it begins with a single master thread which is cognizant of all processors executing the program. The team of threads created to execute a parallel region in the OpenMP code may thus span the entire machine. Since the OpenMP routine will not understand HPF data mappings, all distributed arguments must be mapped to the processor which will initiate the routine during its setup. This model enables the user to switch between programming models, depending upon which is most suitable for some region of the program. However, it does not enable OpenMP to take advantage of HPF’s data locality, and the initial overhead might be large. Its practical value is therefore not clear.
3.2
Merging HPF and OpenMP
An interface between two paradigms obviously does not permit full exploitation of their features. The OpenMP extrinsic may be satisfactory for SM-DMMP architectures with high latency. However, the alternative model is less appealing as an approach to programming large low-latency SM-DMMPs, which may benefit from both data locality and the flexibility of a shared memory model. We thus consider augmenting the OpenMP directives with HPF directives directly, so that both shared-memory parallelism and data locality can be expressed within the same routines. OpenMP directives can then describe the parallel computation, while HPF directives may map data to the nodes where it is used and bind the execution of loop iterations to nodes storing distributed data. Data Mappings The HPF data mapping directives were carefully defined to affect a program's performance, but not its semantics. Thus they can be easily integrated into OpenMP without changing the semantics of OpenMP programs. It is natural to associate logical HPF processors with processing nodes under this model, since data will be shared within a node unless it is thread-private. Techniques from HPF compilers may be applied to prefetch data in bulk transfers where permitted by the semantics of the code. The resulting language might allow privatization of distributed variables just as HPF does inside INDEPENDENT loops. In general, however, we expect that HPF directives will be used to explicitly map large data objects, which are likely to be shared rather than being private to each thread. HPF does not support sequence or storage association for explicitly mapped variables. This restriction must carry over to the integrated language so that the compiler may exploit the mapping information. Common blocks with distributed constituents must follow the rules of both HPF and OpenMP. That is, if they are named in a THREADPRIVATE directive, or in a PRIVATE clause, each private copy inherits the original distribution. Storage for the explicitly mapped constituents is dissociated from the storage for the common block itself. In HPF programs, mapped data objects may be implicitly remapped across procedure boundaries to match the prescriptive mapping directives for the formal arguments. On the other hand, if they have been declared DYNAMIC, objects can be explicitly remapped using a REDISTRIBUTE or REALIGN directive. In order to facilitate remapping within an OpenMP program, we must impose the following constraint: any code which implicitly or explicitly remaps a data object must be encountered by all the threads that share that object. This ensures that if the object is remapped, the new mapping is visible to all the threads which have access to the data. Additional HPF Constructs HPF provides the ON directive to specify the locus of computation of a block of statements, a loop iteration or subroutine call. This might be used as an alternative to the OpenMP scheduling options to bind the execution of code in a work-sharing construct to one or more nodes, or
logical HPF processors. An ON clause enables the user to specify, for example, the node where a SINGLE region or a SECTION of a parallel sections construct is to be executed. Similarly, it may bind each iteration of a DO construct to the node where a specific data item is stored. Finally, in contrast to the INDEPENDENT directive of HPF, parallel loops in OpenMP may have inter-iteration data dependencies. Thus an additional worksharing construct could be based upon the INDEPENDENT loop, which permits the prefetching of data and may eliminate expensive synchronizations.
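As a sketch of what such a combination might look like (our own illustration of the proposal, not code from either standard), an HPF mapping plus an ON clause could pin each iteration of an OpenMP loop to the node owning its data:

!     Illustration only: U is block-distributed over the nodes; the ON
!     directive binds each iteration to the node that owns U(I).
      REAL U(1000), V(1000)
!HPF$ DISTRIBUTE U(BLOCK)
!HPF$ ALIGN V(I) WITH U(I)
!$OMP PARALLEL DO
      DO I = 2, 999
!HPF$    ON HOME (U(I))
         V(I) = 0.5 * (U(I-1) + U(I+1))
      END DO
!$OMP END PARALLEL DO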
4
An Example
Scientific and engineering applications sometimes use a set of grids, instead of a single grid, to model a complex problem domain. Such multiblock codes use tens, hundreds, or even thousands of such grids, with widely varying shapes and sizes.

Structure of the Application The main data structure in such an application stores the values associated with each of the grids in the multiblock domain. It is realized by an array, MBLOCK, whose size is the number of grids NGRID, which is defined at run time. Each of its elements is a Fortran 90 derived type containing the data associated with a single grid.

      TYPE GRID
         INTEGER NX, NY
         REAL, POINTER :: U(:, :), V(:, :), F(:, :), R(:, :)
      END TYPE GRID

      TYPE (GRID), ALLOCATABLE :: MBLOCK(:)
The decoupling of a domain into separate grids creates internal grid boundaries. Values must be correctly transferred between these at each step of the iteration. The connectivity structure is also input at run time, since it is specific to a given computational domain. It may be represented by an array CONNECT (not shown here) whose elements are pairs of sections of two distinct grids. The structure of the computation is as follows:

      DO ITERS = 1, MAXITERS
         CALL UPDATE_BDRY (MBLOCK, CONNECT)
         CALL IT_SOLVER (MBLOCK, SUM)
         IF ( SUM .LT. EPS ) THEN finished
      END DO
The first of these routines transfers intermediate boundary results between grid sections which abut each other. We do not discuss it further. The second routine comprises the solution step for each grid. It has the following main loop:
      SUBROUTINE IT_SOLVER( MBLOCK, SUM )
      SUM = 0.0
      DO I = 1, NGRID
         CALL SOLVE_GRID( MBLOCK(I)%U, MBLOCK(I)%V, ... )
         CALL RESID_GRID( ISUM, MBLOCK(I)%R, MBLOCK(I)%V, ... )
         SUM = SUM + ISUM
      END DO
      END
Such a code generally exhibits two levels of parallelism. The solution step is independent for each individual grid in the multiblock application. Hence each iteration of this loop may be invoked in parallel. If the grid solver admits parallel execution, then major loops in the subroutine it calls may also be parallelized. Finally, the routine also computes a residual; for each grid, it calls a procedure RESID_GRID to carry out a reduction operation. The sum of the results across all grids is produced and returned in SUM. This would require a parallel reduction operation if the iterations are performed in parallel.

Distributing the Grids We again presume that an HPF processor maps onto a node of the underlying machine rather than to an individual physical processor. We map each grid to a single node (which may "own" several grids) by distributing the array MBLOCK. Such a mapping permits both levels of parallelism to be exploited. It implies that the boundary update routine may need to transfer data between nodes, whereas the solver routine can run in parallel on the processors of a node using the shared memory in the node to access data. A simple BLOCK distribution, as shown below, may not ensure load balance.

!HPF$ DISTRIBUTE MBLOCK( BLOCK )
The INDIRECT distribution of HPF may be used to provide a finer grain of control over the mapping.
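One way such an INDIRECT map might be produced is sketched below (our own illustration; MAP, LOAD, NNODES and GRID_SIZE are invented names, and a real partitioner would be more careful, e.g. placing the largest grids first):

      ! Naive greedy assignment of grids to nodes, balancing total grid size.
      INTEGER :: MAP(NGRID), LOAD(NNODES), I, P, PMIN
      LOAD = 0
      DO I = 1, NGRID
         PMIN = 1                                  ! find the least-loaded node
         DO P = 2, NNODES
            IF (LOAD(P) < LOAD(PMIN)) PMIN = P
         END DO
         MAP(I)     = PMIN
         LOAD(PMIN) = LOAD(PMIN) + GRID_SIZE(I)
      END DO
!HPF$ DISTRIBUTE MBLOCK(INDIRECT(MAP))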
4.1 Using an OpenMP Extrinsic Routine
This version calls an OpenMP extrinsic to perform the solution step on each grid as well as compute local contributions to the residual. The boundary updates and the final residual summation are performed in the calling HPF program. An interface declaration is required for the OpenMP extrinsic routine:

      EXTRINSIC ('OPENMP', 'LOCAL') SUBROUTINE OPEN(MBLOCK, M, LSUM)
      INTEGER M(:)
      REAL LSUM(:)
      TYPE (GRID), ALLOCATABLE :: MBLOCK(:)
Only the local segment of MBLOCK will be passed to the local procedure; M is its size and LSUM is the residual value for the local grids on the node. The extrinsic subroutine is invoked on each node independently. The master thread creates the required number of threads locally. It will return when each of the threads has terminated. The calling HPF routine waits until all extrinsic procedures have terminated.

      SUBROUTINE OPEN( LOC_MBLOCK, NLOC, LOC_SUM )
      LOC_SUM = 0.0
      CALL GET_THREAD(NLOC, LOC_MBLOCK, N)
!$OMP CALL OMP_SET_NUM_THREADS(N)
!$OMP PARALLEL DO SCHEDULE ( DYNAMIC ) DEFAULT ( SHARED )
!$OMP&   REDUCTION (+: LOC_SUM)
      DO I = 1, NLOC
         CALL SOLVE_GRID( LOC_MBLOCK(I)%U, ... )
         CALL RESID_GRID( ISUM, LOC_MBLOCK(I)%R, ... )
         LOC_SUM = LOC_SUM + ISUM
      END DO
      RETURN
      END
This approach is suitable for systems where the latency between nodes is comparatively high. It permits direct experimentation with, and exploitation of, threads in order to balance work despite large differences in grid sizes.

4.2 Combining HPF and OpenMP

The combined model is very similar to the above; however, it is simpler to write in the sense that there is no need to construct an extrinsic interface. The data declarations and distribution are identical to those of the HPF program. The IT_SOLVER routine can be directly written with OpenMP directives in a manner very similar to the code shown above:

      SUBROUTINE IT_SOLVER( MBLOCK, SUM )
!HPF$ DISTRIBUTE MBLOCK( BLOCK )
      SUM = 0.0
      CALL GET_THREAD(NGRID, MBLOCK, N)
!$OMP CALL OMP_SET_NUM_THREADS(N)
!$OMP PARALLEL DO SCHEDULE ( DYNAMIC ) DEFAULT ( SHARED )
!$OMP&   REDUCTION (+: SUM)
      DO I = 1, NGRID
!HPF$    ON ( HOME (MBLOCK(I)) ), RESIDENT
         CALL SOLVE_GRID( MBLOCK(I)%U, ... )
         CALL RESID_GRID( ISUM, MBLOCK(I)%R, ... )
!HPF$    END ON
         SUM = SUM + ISUM
      END DO
      RETURN
      END
This version again deals with the full array MBLOCK , rather than with local segments. Thus, there may be a need to specify the locus of computation so that each iteration is performed on the processor which stores its data. This can be done using an HPF-style ON-block along with the RESIDENT directive, as shown. The execution differs. This time, the initialization of SUM will be performed once only, on a single master thread which controls the entire machine. The number of threads, N , is computed for the entire system, and the work is distributed across all processors. This may enable an improved load balance. We have avoided a two-step computation of SUM . This approach is appropriate for a system with comparatively low latency, which allows a single master thread on the machine and a global scheduling policy, permitting a compiler/runtime system to do global load balancing if appropriate.
5
Summary
OpenMP was recently proposed for adoption in the community by a group of major vendors. It is not yet clear what level of acceptance it will find, yet there is certainly a need for portable shared-memory programming. The functionality it provides differs strongly from that of HPF, which considers the needs of distributed memory systems. Since many current architectures have requirements for both data locality and efficient shared memory parallel programming, there are potential benefits in combining elements of these two paradigms. We have shown in this paper that the HPF extrinsic mechanism enables usage of each within a single program. This may facilitate the programming of workstation clusters with standard interconnection technology, for example. It is less clear if a large low latency machine will benefit from a simple interface. For these, there appears to be added value in providing some level of integration. Given the orthogonality of many concepts in the two programming models, they are fundamentally highly compatible and a useful integrated model can be specified. In particular, a combination of HPF data locality directives with OpenMP parallelization constructs extends the scope of applicability of each of them. Since at least one major vendor has already provided data mapping options together with a shared memory parallel programming model [4], we expect that this avenue will be closely explored by the OpenMP consortium in the near future.
References

1. High Performance Fortran Forum: High Performance Fortran Language Specification. Version 2.0, January 1997.
2. OpenMP Consortium: OpenMP Fortran Application Program Interface, Version 1.0, October 1997.
3. B. Leasure (Ed.): Parallel Processing Model for High Level Programming Languages, Draft Proposed National Standard for Information Processing Systems, April 1994.
4. Silicon Graphics, Inc.: MIPSpro Fortran 77 Programmer's Guide, 1996.
Towards a Java Environment for SPMD Programming

Bryan Carpenter, Guansong Zhang, Geoffrey Fox, Xiaoming Li, Xinying Li, and Yuhong Wen
NPAC at Syracuse University, Syracuse, New York, NY 13244, USA
{dbc,zgs,gcf,lxm,xli,wen}@npac.syr.edu
Abstract. As a relatively straightforward object-oriented language, Java is a plausible basis for a scientific parallel programming language. We outline a conservative set of language extensions to support this kind of programming. The programming style advocated is Single Program Multiple Data (SPMD), with parallel arrays added as language primitives. Communications involving distributed arrays are handled through a standard library of collective operations. Because the underlying programming model is SPMD programming, direct calls to other communication packages are also possible from this language.
1
Introduction
Java boasts a direct simplicity reminiscent of Fortran, but also incorporates many of the important ideas of modern object-oriented programming. Of course it comes with an established track record in the domains of Web and Internet programming. The idea that Java may enable new programming environments, combining attractive user interfaces with high performance computation, is gaining increasing attention amongst computational scientists [7,8]. This article will focus specifically on the potential of Java as a language for scientific parallel programming. We envisage a framework called HPJava. This would be a general environment for parallel computation. Ultimately it should combine tools, class libraries, and language extensions to support various established paradigms for parallel computation, including shared memory programming, explicit message-passing, and array-parallel programming. This is a rather ambitious vision, and the current article only discusses some first steps towards a general framework. In particular we will make specific proposals for the sector of HPJava most directly related to its namesake: High Performance Fortran. For now we do not propose to import the full HPF programming model to Java. After several years of effort by various compiler groups, HPF compilers are still quite immature. It seems difficult to justify a comparable effort for Java
Current address: Peking University
before success has been convincingly demonstrated in Fortran. In any case there are features of the HPF model that make it less attractive in the context of the integrated parallel programming environment we envisage. Although an HPF program can interoperate with modules written in other parallel programming styles through the HPF extrinsic procedure interface, that mechanism is quite awkward. Rather than follow the HPF model directly, we propose introducing some of the characteristic ideas of HPF—specifically its distributed array model and array intrinsic functions and libraries—into a basically SPMD programming model. Because the programming model is SPMD, direct calls to MPI [1] or other communication packages are allowed from the HPJava program. The language outlined here provides HPF-like distributed arrays as language primitives, and new distributed control constructs to facilitate access to the local elements of these arrays. In the SPMD mold, the model allows processors the freedom to independently execute complex procedures on local elements. All access to non-local array elements must go through library functions—typically collective communication operations. This puts an extra onus on the programmer; but making communication explicit encourages the programmer to write algorithms that exploit locality, and simplifies the task of the compiler writer. On the other hand, by providing distributed arrays as language primitives we are able to simplify error-prone tasks such as converting between local and global array subscripts and determining which processor holds a particular element. As in HPF, it is possible to write programs at a natural level of abstraction where the meaning is insensitive to the detailed mapping of elements. Lower-level styles of programming are also possible.
2
Distributed Arrays
HPJava adds class libraries and some additional syntax for dealing with distributed arrays. These arrays are viewed as coherent global entities, but their elements may be divided across a set of cooperating processes. The new objects represent true multidimensional arrays. They allow regular section subscripting, similar to Fortran 90 arrays. These properties all appear very attractive within the scientific parallel programming community, and have become established in HPF and related languages. But they imply a kind of array structurally very different to Java’s builtin arrays. A design decision was made not to attempt integration of the new arrays with standard Java array types: instead they are a new kind of entity that coexists in the language with ordinary Java arrays. The type-signatures and constructors of the multidimensional array use double brackets to distinguish them from ordinary arrays.
In this example:

      Procs2 p = new Procs2(3, 2) ;

      Range x = new BlockRange(100, p.dim(0)) ;
      Range y = new BlockRange(200, p.dim(1)) ;

      float [[,]] a = new float [[x, y]] on p ;
a is created as a 100 × 200 array, block-distributed over the 6 processes in p. The ideas will be discussed in more detail in the following paragraphs, but the fragment is essentially equivalent to the HPF declarations

!HPF$ PROCESSORS P(3, 2)
      REAL A(100, 200)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
Before examining the HPJava syntax for declaration of distributed arrays we will discuss the process groups over which their elements are scattered. A base class Group describes a general group of processes. It has subclasses Procs1, Procs2, ..., representing one-dimensional process grids, two-dimensional process grids, and so on. In the example p represents a 3 by 2 grid of processes. At the time the constructor of p is executed the program should be running on six or more processes. Declaration of p in the Java program corresponds directly to declaration of the processor arrangement P in the HPF fragment. In HPJava additional objects describing individual dimensions of a process grid are accessed through the inquiry member dim. Some or all of the dimensions of a multi-dimensional array can be declared as distributed ranges. In general a distributed range is represented by an object of class Range. A Range object defines a range of integer subscripts, and specifies how they are mapped into a process grid dimension. For example, the class BlockRange is a subclass of Range describing a simple block-distributed range of subscripts. Like BLOCK distribution format in HPF, it maps blocks of contiguous subscripts to each element of its target process dimension1. The constructor of BlockRange usually takes two arguments: the extent of the range and a Dimension object defining the process dimension over which the new range is distributed. In general the constructor of the distributed array must be followed by an on clause, specifying the process group over which the array is distributed. Distributed ranges of the array must be distributed over distinct dimensions of this group. The on clause can be omitted in some circumstances—see Sect. 3. A distributed array can also have ordinary sequential dimensions, in which case the range slot in the constructor is replaced by an integer extent, and the corresponding slot in the type signature should contain an asterisk:

      float [[,*]] b = new float [[x, 10]] on p ;
Other range subclasses include CyclicRange, which produces the equivalent of CYCLIC distribution format in HPF.
The asterisk is not mandatory (array types decorated in this way are regarded as subtypes of the undecorated ones) but it allows some extra flexibility in subscripting sequential dimensions2. With one or two extra constructors for groups and ranges, the framework described here captures all the distribution and alignment features of HPF. Because arrays like a are declared as collective objects we can apply collective operations to them. For example a standard library called Adlib (one of a number of existing communication libraries that may eventually be brought into the HPJava framework) provides many of the array functions of Fortran 90.

      float [[,]] c = new float [[x, y]] on p ;

      Adlib.shift(a, c, -1, 0, CYCL) ;
At the edges of the local segment of a the shift operation causes the local values of a to be overwritten with values of c from a processor adjacent in the x dimension. Subscripting operations on distributed arrays are subject to certain restrictions. The HPJava model is a formalization of the distributed memory SPMD programming style, and all memory accesses must refer to the memory of the local processor. In general therefore an array reference like

      a [17, 23] = 13 ;
is illegal because it may imply access to an element held on a different processor. The language provides several distributed control constructs to alleviate the inconvenience of this restriction. These will be discussed in the next few sections.
3 The on Construct and the Active Process Group
The concept of process groups has an important role in the HPJava model. The class Group (process grid classes are special cases) has a member function called local. This returns a boolean value true if the local process is a member of the group, false otherwise. In if(p.local()) { ... }
the code inside the conditional is executed only if the local process is a member of p. HPJava provides a short way of writing this construct:

on(p) { ... }
Incidentally, because b has no range distributed over p.dim(1) it is understood to be replicated over this dimension of the grid.
The on construct provides some extra value. The language incorporates a formal idea of the active process group (APG). At any point of execution some process group is singled out as the APG. An on(p) construct specifically changes the value of the APG to p. On exit from the construct, the APG is restored to its value on entry. Elevating the APG to a part of the language allows some simplifications. For example, it provides a natural default for the on clause in array constructors. More importantly, formally defining the APG simplifies the statement of various rules about what operations are legal inside distributed control constructs like on. A set of formal rules defines how these constructs can be nested and what data accesses are legal inside them. These rules have been presented elsewhere.
4 Locations and the at Construct
As noted at the end of Sect. 2, a mechanism is needed to ensure that the array reference a [17, 23] = 13 ;
is legal, because the local process holds the element in question. In general determining whether an element is local may be a non-trivial task. Rather than use integer values directly as local subscripts in distributed array dimensions, the idea of a location is introduced. A location can be viewed as an abstract element, or “slot”, of a distributed range. An individual location is described by an object of the class Location. Each Location element is mapped to a particular slice of a process grid. In general two locations are identical only if they come from the same position in the same range. A subscripting syntax is used to represent location n in range x: Location i = x [n]
Locations are used to parametrize a second distributed control construct called the at construct. This is similar to on, except that its body is executed only on processes that hold a specified location. For distributed array dimensions, locations rather than integers are used as subscripts. Access to element a [17,23] can be safely written as: Location i = x [17], j = y [23] ; at(i) at(j) a [i, j] = 13 ;
Locations used as array subscripts must be elements of the corresponding ranges of the array.
5 Distributed Loops
The at mechanism of the previous section is often useful, but a more urgent requirement is a mechanism for parallel access to distributed array elements.
The last and most important distributed control construct in the language is called overall. It implements a distributed parallel loop. Conceptually it is quite similar to the FORALL construct of Fortran, except that the overall construct specifies exactly where its parallel iterations are to be performed. The argument of overall is a member of the special class Index. This class is a subclass of Location, so it is syntactically correct to use an index as an array subscript. Here is an example of a pair of nested overall loops: float [[,]] a = new float [[x, y]], b = new float [[x, y]] ; ... Index i, j ; overall(i = x | :) overall(j = y | :) a [i, j] = 2 * b [i, j] ;
The body of an overall construct executes, conceptually in parallel, for every location in the range of its index. An individual "iteration" executes on just those processors holding the location associated with the iteration. The net effect of the example above should be reasonably clear. It assigns twice the value of each element of b to the corresponding element of a. Because of the rules about where an individual iteration iterates, the body of an overall can usually only combine elements of arrays that have some simple alignment relation relative to one another. The idx member of Range can be used in parallel updates to yield expressions that depend on global index values.

The ":" in the overall examples above can be replaced by a general Fortran-like subscript triplet of the form l : u : s, where l, u and s are respectively the lower bound, upper bound and stride. The ": s" part can be omitted, in which case s defaults to 1; l or u can be omitted, in which case they default to 0 or N − 1 respectively, where N is the size of the range specified before the "|". Hence the degenerate isolated ":" selects the whole of the range.

With the overall construct we can give some useful examples of parallel programs. Figure 1 gives a parallel implementation of red-black relaxation in the extended language. To support the important stencil-update paradigm, ghost regions are allowed on distributed arrays. Ghost regions are extensions of the locally held block of a distributed array, used to cache values of elements held on adjacent processors. In our case the width of these regions is specified in a special form of the BlockRange constructor. The ghost regions are explicitly brought up to date using the library function writeHalo. Its arguments are an array with suitable extensions and a vector defining in each dimension the width of the halo that must actually be updated. Note that the new range constructor and writeHalo function are library features, not new language extensions. One new piece of syntax is needed: the addition and subtraction operators are overloaded so that integer offsets can be added to or subtracted from locations, yielding new, shifted, locations. This kind of shift does not imply access to off-processor data. It only works if the subscripted array has suitable ghost extensions.

Figure 2 gives a parallel implementation of Cholesky decomposition in the extended language. The first dimension of a is sequential ("collapsed" in HPF
parlance). The second dimension is distributed (cyclically, to improve load-balancing). This is a column-oriented decomposition. The example introduces Fortran-like array sections. These are distinguished from local subscripting operations by the use of double brackets. The subscripts are typically scalar (integers) or triplets (in the example, upper bounds of various triplets default to N − 1, determined by the extent of the corresponding array range). The example also involves one new operation from the Adlib library. The function remap copies the elements of one distributed array or section to another of the same shape. The two arrays can have any, unrelated decompositions. In the current example remap is used to implement a broadcast. Because b has no range distributed over p, it implicitly has replicated mapping; remap accordingly copies identical values to all processors.

We have covered most of the important language features we are implementing. Two additional features that are quite important in practice but have not been discussed are subranges and subgroups. A subrange is simply a range which is a regular section of some other range, created by syntax like x [0 : 49]. Subranges can be used to create distributed arrays with general HPF-like alignments. A subgroup is some slice of a process array, formed by restricting process coordinates in one or more dimensions to single values. Subgroups formally describe the state of the active process group inside at and overall constructs. For a more complete description of a slightly earlier version of the proposed language, see [3].

Note that if distributed array accesses are genuinely irregular, the necessary subscripting cannot usually be directly expressed in our language, because subscripts cannot be computed randomly in parallel loops without violating the fundamental SPMD restriction that all accesses be local. This is not regarded as a shortcoming: on the contrary it forces explicit use of an appropriate library package for handling irregular accesses (such as CHAOS [6]). Of course a suitable binding of such a package is needed in our language.
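As a small illustration of subscript triplets and the idx inquiry described above (a sketch only, reusing the arrays a, x and y from the earlier examples), the following loop visits every second location of x and assigns a value that depends on the global indices:

Index i, j ;
overall(i = x | 0 : N - 1 : 2)              // stride 2: every second location of x
  overall(j = y | :)                        // the whole of range y
    a [i, j] = x.idx(i) * N + y.idx(j) ;    // value depends on global index values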
6 Discussion
We have described a conservative set of extensions to Java. In the context of an explicitly SPMD programming environment with a good communication library, we claim these extensions provide much of the concise expressiveness of HPF, without relying on very sophisticated compiler analysis. The object-oriented features of Java are exploited to give an elegant parameterization of the distributed arrays in the extended language. Because of the relatively low-level programming model, interfacing to other parallel-programming paradigms is more natural than in HPF. With suitable care, it is possible to make direct calls to, say, MPI from within the data parallel program (in [2] we suggest a concrete Java binding for MPI).

It is necessary to address various questions: Why introduce language extensions and additional syntax—why not work entirely with class libraries for distributed arrays? In fact we consciously put
Procs2 p = new Procs2(P, P) ;
on(p) {
  Range x = new BlockRange(N, p.dim(0), 1) ;   // ghost width 1
  Range y = new BlockRange(N, p.dim(1), 1) ;   // ghost width 1

  float [[,]] u = new float [[x, y]] ;
  int [] widths = {1, 1} ;                     // Widths updated by 'writeHalo'

  // ... some code to initialise 'u'

  for(int iter = 0 ; iter < NITER ; iter++) {
    for(int parity = 0 ; parity < 2 ; parity++) {
      Adlib.writeHalo(u, widths) ;

      Index i, j ;
      overall(i = x | 1 : N - 2)
        overall(j = y | 1 + (x.idx(i) + parity) % 2 : N - 2 : 2)
          u [i, j] = 0.25 * (u [i - 1, j] + u [i + 1, j] +
                             u [i, j - 1] + u [i, j + 1]) ;
    }
  }
}

Fig. 1. Red-black iteration using writeHalo.

Procs1 p = new Procs1(P) ;
on(p) {
  Range x = new CyclicRange(N, p.dim(0)) ;

  float [[*,]] a = new float [[N, x]] ;
  float [[*]]  b = new float [[N]] ;           // buffer

  // ... some code to initialise 'a'

  Location l ;
  Index m ;

  for(int k = 0 ; k < N - 1 ; k++) {
    at(l = x [k]) {
      float d = Math.sqrt(a [k, l]) ;
      a [k, l] = d ;
      for(int s = k + 1 ; s < N ; s++)
        a [s, l] /= d ;
    }

    Adlib.remap(b [[k + 1 : ]], a [[k + 1 : , k]]) ;

    overall(m = x | k + 1 : )
      for(int i = x.idx(m) ; i < N ; i++)
        a [i, m] -= b [i] * b [x.idx(m)] ;
  }

  at(l = x [N - 1])
    a [N - 1, l] = Math.sqrt(a [N - 1, l]) ;
}

Fig. 2. Cholesky decomposition.
as much of the framework into libraries as seemed practical, based on extensive experimentation (in Java and C++). All communication operations and collective operations are library functions. Although we introduce a basic syntactic framework for creating distributed arrays (even this framework could be eliminated if Java provided a feature like C++ templates), all variants of distribution format are captured in runtime libraries, in the Range hierarchy. Similarly, all variants of processor arrangement are captured in the Group hierarchy. Unfortunately it seems to be difficult to obtain a pure library implementation of the basic data-parallel loops and local array subscripting operations that is flexible, convenient, and efficient. Our suggestion is that, while it would certainly be unrealistic to expect the syntax extensions outlined here to be incorporated in the standard Java language, relatively simple preprocessors can convert the extended language to ugly but efficient standard Java plus class-library calls, as required (a rough sketch of what such a hand translation might look like is given at the end of this section).

Why does the language seem to have awkward restrictions on subscripting, compared with HPF? The answer is: because HPJava is supposed to be a thin layer on top of a straightforward distributed SPMD program with calls to class libraries. In general the restrictions make it much easier to implement efficiently than HPF.

Why adopt the distributed memory model—why not do parallel programming with Java threads? Mainly because our target hardware is clusters of commodity processors, not expensive SMPs.

The language extensions described were devised partly to provide a convenient interface to a distributed-array library developed in the PCRC project [5,4]. Hence most of the run-time technology needed to implement the language is available "off-the-shelf". The existing library includes the run-time descriptor for distributed arrays and a comprehensive array communication library. The HPJava compiler itself is being implemented initially as a translator to ordinary Java, through a compiler construction framework also developed in the PCRC project [12].

A complementary approach to communication in a distributed array environment is the one-sided-communication model of Global Arrays (GA) [9]. For task-parallel problems this approach is often more convenient than the schedule-oriented communication of CHAOS (say). Again, the language model we advocate here appears quite compatible with the GA approach—there is no obvious reason why a binding to a version of GA could not be straightforwardly integrated with the distributed array extensions of the language described here.

Finally we mention two language projects that have some similarities. Spar [11] is a Java-based language for array-parallel programming. There are some similarities in syntax, but semantically Spar is very different to our language. Spar expresses parallelism but not explicit data placement or communication—it is a higher level language. ZPL [10] is a new programming language for scientific computations. Like Spar, it is an array language. It has an idea of performing computations over a region, or set of indices. Within a compound statement prefixed by a region specifier, aligned elements of arrays
distributed over the same region can be accessed. This has similarities to our overall construct.
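To make the preprocessor remark above concrete, the fragment below shows one plausible shape for a hand translation of a one-dimensional overall loop into ordinary Java. It is purely illustrative: the Block helper is an invented stand-in for the runtime's knowledge of which subscripts each process owns, not the actual PCRC/Adlib interface, and the real translation scheme is not described in this paper.

// Extended source (conceptually):
//   overall(i = x | :)
//     a [i] = 2 * b [i] ;
//
// A hand-written SPMD equivalent in plain Java: each process loops only over
// the block of subscripts it holds locally, so all accesses stay local.
final class OverallSketch {
  static final class Block {
    final int lo, hi;                          // local subscript bounds on this process
    Block(int lo, int hi) { this.lo = lo; this.hi = hi; }
  }

  static void body(Block myBlock, float[] aLocal, float[] bLocal) {
    for (int l = myBlock.lo; l < myBlock.hi; l++)
      aLocal[l] = 2 * bLocal[l];               // purely local loop, no communication
  }
}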
References

1. Bryan Carpenter, Yuh-Jye Chang, Geoffrey Fox, Donald Leskiw, and Xiaoming Li. Experiments with HPJava. Concurrency: Practice and Experience, 9(6):633, 1997.
2. Bryan Carpenter, Geoffrey Fox, Xinying Li, and Guansong Zhang. A draft Java binding for MPI. http://www.npac.syr.edu/projects/pcrc/doc.
3. Bryan Carpenter, Guansong Zhang, Geoffrey Fox, Xinying Li, and Yuhong Wen. Introduction to Java-Ad. http://www.npac.syr.edu/projects/pcrc/doc.
4. Bryan Carpenter, Guansong Zhang, and Yuhong Wen. NPAC PCRC runtime kernel definition. Technical Report CRPC-TR97726, Center for Research on Parallel Computation, 1997. Up-to-date version maintained at http://www.npac.syr.edu/projects/pcrc/doc.
5. Parallel Compiler Runtime Consortium. Common runtime support for high-performance parallel languages. In Supercomputing '93. IEEE Computer Society Press, 1993.
6. R. Das, M. Uysal, J. H. Salz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462–479, September 1994.
7. Geoffrey C. Fox, editor. Java for Computational Science and Engineering—Simulation and Modelling, volume 9(6) of Concurrency: Practice and Experience, June 1997.
8. Geoffrey C. Fox, editor. Java for Computational Science and Engineering—Simulation and Modelling II, volume 9(11) of Concurrency: Practice and Experience, November 1997.
9. J. Nieplocha, R. J. Harrison, and R. J. Littlefield. The Global Array: Non-uniform-memory-access programming model for high-performance computers. The Journal of Supercomputing, 10:197–220, 1996.
10. Lawrence Snyder. A ZPL programming guide. Technical report, University of Washington, May 1997. http://www.cs.washington.edu/research/projects/zpl/.
11. Kees van Reeuwijk, Arjan J. C. van Gemund, and Henk J. Sips. Spar: A programming language for semi-automatic compilation of parallel programs. Concurrency: Practice and Experience, 9(11):1193–1205, 1997.
12. Guansong Zhang, Bryan Carpenter, Geoffrey Fox, Xiaoming Li, Xinying Li, and Yuhong Wen. PCRC-based HPF compilation. In 10th International Workshop on Languages and Compilers for Parallel Computing, 1997. To appear in Lecture Notes in Computer Science.
Language Constructs and Run-Time System for Parallel Cellular Programming

Giandomenico Spezzano and Domenico Talia

ISI-CNR c/o DEIS, Università della Calabria, 87036 Rende (CS), Italy
{spezzano,talia}@si.deis.unical.it
Abstract. This paper describes the main features of CARPET, a high-level programming language based on cellular automata theory. The language has been designed for supporting the development of parallel high-performance software abstracting from the parallel architecture on which programs run. A CARPET user can write cellular programs to describe the actions of a very large number of simple active agents interacting locally. The CARPET run-time system allows a user to observe, also in a graphical format, the global results that arise from their parallel execution.
1 Introduction
The lack of high-level languages, tools, and application-oriented environments often limits the design and implementation of parallel algorithms that are portable, efficient, and expressive. The restricted-computation structures represent one of the most important models of parallel processing [8]. The interest in these models is due to the possibility of restricting the form of computations so as to restrict communication volume, achieving high performance. Restricted-computation models offer a user a structured paradigm of parallel programming and improve the performance of parallel algorithms by reducing the overheads due to communication latency. Further, tools can be designed to estimate the performance of various constructs of a high-level language on a specific parallel architecture.

Cellular processing languages based on the cellular automata (CA) model [10] represent a significant example of restricted computation that is used to model parallel computation for a large number of applications in biology, physics, geophysics, chemistry, economics, artificial life, and engineering. A cellular automaton consists of a one-dimensional or multi-dimensional lattice of cells, each of which is connected to a finite neighborhood of cells that are nearby in the lattice. Each cell in the regular spatial lattice can take any of a finite number of discrete state values. Time is discrete, as well, and at each time step all the cells in the lattice are updated by means of a local rule, called the transition function, that determines the cell's next state based upon the states of its neighbors. That is, the state of
a cell at a given time depends only on its own state and the states of its nearby neighbors at the previous time step. Different neighborhoods can be defined for the cells. The most common neighborhoods in the two-dimensional case are the von Neumann neighborhood, consisting of the North, South, East, West neighbors, and the Moore neighborhood, composed of eight neighbor cells. The global behavior of an automaton is defined by the evolution of the states of all cells as a result of multiple interactions.

CA are intrinsically parallel, so they can be simulated on parallel computers running the cell transition functions in parallel with high efficiency, as the communication flow between processors can be kept low. In fact, in our approach, a cellular algorithm is composed of all the transition functions of the cells that compose the lattice. Each transition function generally contains the same local rule, but it is also possible to define some cells with different transition functions (inhomogeneous cellular automata). According to this approach, we designed and implemented a high-level programming language, called CARPET (CellulAR Programming EnvironmenT) [9], that allows a user to design cellular algorithms. In particular, CARPET has been used for programming cellular algorithms in CAMEL (Cellular Automata environMent for systEms ModeLing) [3]. A user can design cellular programs in CARPET, describing the actions of many simple active agents (implemented by the cells) interacting locally; then the CAMEL system runs the cell transition functions in parallel, allowing a user to observe the global complex evolution that arises from all the local interactions. A number of cellular programming languages such as Cellang [4], CDL [6], and CARP [7] have been defined in the last decade. However, none of those contains all the features of CARPET, nor has a parallel run-time support for them been implemented to date.
2 CARPET
The rationale of CARPET is to make parallel computers available to application-oriented users, hiding the implementation issues coming from their architectural complexity. A CARPET user can program complex problems that may be represented as discrete elements across a lattice. Parallelism inherent to its programming model is not apparent to the programmer. CARPET implements a cellular automaton as an SPMD program. CA are implemented as a number of processes, each one mapped onto a distinct PE that executes the same code on different data. According to this approach, a user must specify in CARPET only the transition function of a single cell of the system he wants to simulate. The language uses the control structures, the types, and the operators of the C language. A CARPET program is composed of a declaration part, which appears only once in the program and must precede any statement (except those of the C pre-processor), and a program body. The program body has the usual C statements, without I/O instructions, and a set of statements to deal with the state of a cell and its neighborhood. Further, CARPET users may use C functions or procedures to improve the structure of programs.
The declaration section includes constructs to define the dimensions of the automaton (dimension), the radius of the neighborhood (radius), the type of the neighborhood (neighbor), and to specify the state of a cell (state) as a set of typed substates that can be chars, shorts, integers, floats, doubles and arrays of these basic types. The dimension declaration defines the number of dimensions of the automaton in a discrete Cartesian space. The maximum number of dimensions allowed in the current implementation is 3. Radius defines the set of cells that can compose the neighborhood of a cell.

In CARPET the state of a cell is composed of a record of typed substates, unlike classical cellular automata where the cell state is represented by a few bits. The typification of the substates extends the range of applications that can be coded in CARPET, simplifying the writing of programs and improving their readability. In the following example the state is composed of three substates:

state (short direction, float mass, speed);

A substate of the current cell can be referred to by the variable cell_substate (e.g., cell_speed). To guarantee the semantics of cell updating in cellular automata, the value of a substate of a cell can be modified only by the update operation. After an execution of the update statement, the value of the substate in the current iteration is unchanged. The new value takes effect in the next iteration.

The neighbor declaration assigns a name to a set of specified neighbor cells (of the current cell). This mechanism allows a cell to access by name the values of the substates of its neighbor cells. Neighborhoods can be asymmetrical or have any other special topological properties (e.g., hexagonal). Furthermore, in the neighbor declaration the name of a vector must be defined whose length is the number of elements composing the logical neighborhood. The name of the vector can be used as an alias in referring to the neighbor cells. For instance, the von Neumann neighborhood shown in figure 1 can be defined as follows:

neighbor Neum[4]([0,-1]North, [-1,0]West, [0,1]South, [1,0]East);

A substate of a neighbor cell is referred to, for instance, as North_speed. Using the vector name the same substate can also be referred to as Neum[0]_speed. This way of referring makes it simpler to write loops in CARPET programs.

CARPET allows the program to know the number of iterations that have been executed through the predefined variable step. Step is silently updated by the run-time system. Initially its value is 0 and it is incremented by 1 each time all the cells of the automaton are updated. This feature also allows a user to define time-dependent neighborhoods. The step variable also permits changing the values of the substates dynamically, depending upon the iteration.

In modeling a complex system, it is often necessary to describe some global features of the system. CARPET allows a user to define global parameters and initialize them to specific values. The value of a parameter is the same in every cell of the automaton. For this reason, the value of each parameter cannot be
Fig. 1. The von Neumann neighborhood in a two-dimensional CA lattice.
changed in the program but it can only be modified, during the simulation, by the UI. An example of a parameter declaration in CARPET is:

parameter (permeability 0.9) ;

Input and output of a CARPET program can be performed by files or by the edit function of the UI. A file can contain the values of one substate of all cells, so these values can be loaded at step 0 to initialize the automaton. The output of a CARPET program can automatically be saved in files at regular intervals to be post-processed by mathematical or visualization tools. CARPET offers a user the possibility to define non-deterministic rules by use of a random function. Finally, CARPET allows a user to define cells with different transition functions by means of the Getx, Gety, Getz operations that return the value of the coordinates X, Y, and Z of a cell in the automaton.

Differently from other cellular languages, CARPET does not provide statements to configure the automata, to visualize the cell values or to define the process-to-processor mapping. These features can be defined by means of the GUI of its runtime system. The CAMEL GUI allows a user to define the size of the cellular automata, the number of processors on which an automaton must be executed, and to choose the colors to be assigned to the cell substates to support the graphical visualization of their values. By excluding constructs for configuration and visualization from the language, it is possible to execute the same CARPET compiled code using different configurations.
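The deferred-update semantics described above (a value written by update becomes visible only at the next iteration) amounts to double buffering of the cell states. The following plain Java sketch, unrelated to the CARPET implementation and with invented names and a toroidal boundary, illustrates the idea using the parity rule of Fig. 2:

final class ParityRuleSketch {
  // cur holds the current grid of H x W cells; the method returns the grid after `steps' iterations.
  static short[][] run(int H, int W, int steps, short[][] cur) {
    short[][] next = new short[H][W];
    for (int t = 0; t < steps; t++) {
      for (int x = 0; x < H; x++)
        for (int y = 0; y < W; y++) {
          int n = cur[(x + H - 1) % H][y] + cur[(x + 1) % H][y]    // von Neumann neighbours,
                + cur[x][(y + W - 1) % W] + cur[x][(y + 1) % W];   // wrapping at the borders (assumption)
          next[x][y] = (n % 2 == 1) ? 0 : cur[x][y];               // parity rule from Fig. 2
        }
      short[][] tmp = cur; cur = next; next = tmp;                 // swap: updates take effect next step
    }
    return cur;
  }
}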
3 A Parallel Cellular Run Time System
Parallel computers are the best practical support for the effective implementation of high-performance CA [1]. According to this basic idea, CAMEL has been developed. It is a parallel software system based on the cellular automata model that constitutes the parallel run-time system of CARPET. CAMEL has been implemented on a parallel computer composed of a mesh of Transputers connected to a host node [2]. CAMEL provides a GUI to configure a program, to monitor the parameters of a simulation and to dynamically change them at run time. The parallel execution of cellular algorithms is implemented by the parallel execution of the transition function of each cell in the Single Program Multiple
Data (SPMD) way. A portable implementation of CAMEL for MIMD parallel computers based on the MPI communication library is being developed. The CAMEL system is composed of a set of Macrocell processes, each one running on a single processing element of the parallel machine, and a Controller process running on a master processor. Each Macrocell process implements an automaton partition that includes several elementary cells, and it makes use of a communication system that handles the data exchange among cells. All the Macrocells of the lattice execute in parallel the local rule, which results in a global transformation of the whole lattice state. CAMEL uses a load balancing strategy for assigning lattice partitions to the processors of a parallel computer [2]. This load balancing is a domain decomposition strategy similar to the scattered decomposition technique.
4 A Simple Program
This section shows a simple program written in CARPET. This example should familiarize the reader with the CARPET approach. The program in figure 2 shows how the CARPET constructs can be used to implement the parity rule algorithm. The parity rule is a very simple example of "self-reproduction" in cellular automata; an initial pattern is replicated at some specific iteration that is a power of two. The cells can have 0 or 1 values only. A cell takes the sum of its neighbors and goes to 1 if the sum is odd, or to 0 if the sum is even. Let us call N the number of '1' cells among the four nearest neighbors of a given cell. The transition rule is the following: given a cell, if N is odd, the new value of the cell will be 0; if N is even the cell's value does not change.
cadef
{
  dimension 2;            /* bidimensional lattice */
  radius 1;
  state (short value);
  neighbor cross[4]([0,-1]North, [-1,0]West, [0,1]South, [1,0]East);
}

{
  int i;
  short N = 0;
  for (i = 0; i < 4; i++)
    N = cross_value[i] + N;
  if (N%2 == 1)
    update (cell_value, 0);   /* updating the state of a cell */
}
Fig. 2. The parity rule algorithm written in CARPET.
5 Performance
Together with the minimization of elapsed time, scalability is a major goal in the design of parallel computing applications. Scalability shows the potential ability of parallel computers to speed up their performance by adding more processing elements. The scalability of CARPET programs has been measured by increasing the number of PEs used to solve the same problem, and by using the same number of processing elements to solve bigger problems [2]. The speedup obtained on 32 Transputers with an automaton composed of 224x70 cells which simulated the Ontake mountain (Japan) landslide was 25.0 (78.1% efficient). Table 1 shows the times (in seconds) and the speed-up obtained running the landslide simulation on CAMEL using 1, 2, 4, 8, 16 and 32 PEs. The second column shows the number of cells mapped on each PE. Similar performance results have been obtained in other applications developed with CARPET [3].
Table 1. Execution times and speed-up.

Number of PEs   Cells/PE   Time for 2000 steps   Speed-up
 1              15680      10231.59               1
 2               7840       6392.48               1.6
 4               3920       3181.40               3.2
 8               1960       1447.96               7.0
16                980        728.11              14.0
32                490        407.84              25.0
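The quoted efficiency follows directly from the last row of Table 1:

efficiency = speedup / number of PEs = 25.0 / 32 ≈ 0.781, i.e. the 78.1% mentioned above; likewise the 8-PE row gives a speedup of 10231.59 / 1447.96 ≈ 7.07, consistent with the reported 7.0.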
6 Final Remarks and Future Work
As stated also by M. J. Flynn [5], the cellular automata model is a new mathematical way to represent problems that allows parallel computers to be used effectively, achieving scalable performance. In this paper, we described the CARPET programming language designed for programming cellular algorithms on parallel computers. CARPET has been used successfully to implement several real-life applications such as landslide simulation, lava-flow models, freeway traffic simulation, image processing, and genetic algorithms. The experience during the design, implementation, and use of the CARPET language showed us that high-level languages are very useful in the development of cellular algorithms for solving complex problems in science and engineering. Currently, the CARPET language is used in the COLOMBO (Parallel COmputers improve cLean up of sOils by Modelling BiOremediation) project within the ESPRIT framework. The main objective of COLOMBO is the use of CA models for the bioremediation of contaminated soils. This project is developing a
portable MPI-based implementation of CARPET and CAMEL on MIMD parallel computers such as the Meiko CS-2, the CRAY T3E, and workstation clusters. The new implementation will make CARPET programs portable to a large number of MIMD machines.
Acknowledgements

This research has been partially funded by the CEC ESPRIT project no. 24907.
References

1. B. P. Brinch Hansen, Parallel Cellular Automata: a Model for Computational Science. Concurrency: Practice and Experience, 5:425-448, 1993.
2. M. Cannataro, S. Di Gregorio, R. Rongo, W. Spataro, G. Spezzano, and D. Talia, A Parallel Cellular Automata Environment on Multicomputers for Computational Science. Parallel Computing, 21:803-824, 1995.
3. S. Di Gregorio, R. Rongo, W. Spataro, G. Spezzano, and D. Talia, A Parallel Cellular Tool for Interactive Modeling and Simulation. IEEE Computational Science & Engineering, 3:33-43, 1996.
4. J. D. Eckart, Cellang 2.0: Reference Manual. ACM Sigplan Notices, 27:107-112, 1992.
5. M. J. Flynn, Parallel Processors Were the Future and May Yet Be. IEEE Computer, 29:152, 1996.
6. C. Hochberger and R. Hoffmann, CDL - a Language for Cellular Processing. In: Proc. 2nd Intern. Conference on Massively Parallel Computing Systems, IEEE Computer Society Press, 1996.
7. G. Junger, Cellular Automaton Tool User Manual. GMD, Sankt Augustin, Germany, 1994.
8. D. B. Skillicorn and D. Talia, Models and Languages for Parallel Computation. ACM Computing Surveys, 30, 1998.
9. G. Spezzano and D. Talia, A High-Level Cellular Programming Model for Massively Parallel Processing. In: Proc. 2nd Int. Workshop on High-Level Programming Models and Supportive Environments (HIPS97), IEEE Computer Society, pages 55-63, 1997.
10. J. von Neumann, Theory of Self Reproducing Automata. University of Illinois Press, 1966.
Task Parallel Skeletons for Irregularly Structured Problems

Petra Hofstedt

Department of Computer Science, Dresden University of Technology
[email protected]
Abstract. The integration of a task parallel skeleton into a functional programming language is presented. Task parallel skeletons, as other algorithmic skeletons, represent general parallelization patterns. They are introduced into otherwise sequential languages to enable the development of parallel applications. Into functional programming languages, they are naturally integrated as higher-order functional forms. We show, by means of the branch-and-bound example, that the introduction of task parallel skeletons into a functional programming language is advantageous with regard to the comfort of programming, while achieving good computation performance at the same time.
1 Introduction
Most parallel programs were and are written in imperative languages. In many of these languages, the programmer has to use low-level constructs to express parallelism, synchronization and communication. To support platform-independent development of parallel programs, standards and systems such as MPI and PVM have been invented. In functional languages, such supporting libraries have been added in rudimentary form only recently. Hence, the advantages of functional programs, such as their ability to state powerful algorithms in a short, abstract and precise way, cannot be combined with the ability to control the parallel execution of processes on parallel architectures. Our aim is to remedy that situation. A functional language has been extended by constructs for data and task parallel programming. We want to provide comfortable tools for the user to exploit parallelism, so that she is burdened as little as possible with communication, synchronization, load balancing, and data and task distribution, while at the same time reaching good performance by exploiting parallelism. The extension of functional languages by algorithmic skeletons is a promising approach to introducing data parallelism as well as task parallelism into these languages. As demonstrated for imperative languages, e.g. by Cole [3], there are several approaches to introducing skeletons into functional languages as higher-order
parallel forms. However, most authors concentrated on data parallel skeletons, e.g. [1], [4], [5]. Hence, our aim has been to explore the promising concept of task parallel skeletons for functional languages by integrating them into a functional language. A major focus of our work is on the reusability of methods implemented as skeletons.

Algorithmic skeletons are integrated into otherwise sequential languages to express parallelization patterns. In our approach, skeletons are currently implemented in a lower-level imperative programming language, but presented as higher-order functions in the functional language. Implementation details are hidden within the skeletons. In this way, it is possible to combine the expressiveness and flexibility of the sequential functional language with the efficiency of parallel special-purpose algorithms. Depending on the type of parallelism exploited, skeletons are divided into data and task parallel ones. Data parallel skeletons apply functions on multiple data at the same time. Task parallel skeletons express which elements of a computation may be executed in parallel. The implementation in the underlying system determines the number and the location of parallel processes that are generated to execute the task parallel skeleton.
2 The Branch-and-Bound Skeleton in a Functional Language
Branch-and-bound methods are systematic search techniques for solving discrete optimization problems. Starting with a set of variables with a finite set of discrete values (a domain) assigned to each of the variables, the aim is to assign a value of the corresponding domain to each variable in such a way that a given objective function reaches a minimum or a maximum value and several constraints are satisfied.

First, mutually disjoint subproblems are generated from a given initial problem by using an appropriate branching rule (branch). For each of the generated subproblems an estimation (bound) is computed. By means of this estimation, the subproblem to be branched next is chosen (select) and decomposed (branched). If the chosen problem cannot be branched into further subproblems, its solution (if it exists) is an optimal solution. Subproblems with non-optimal or inadmissible variable assignments can be eliminated during the computation (elimination). The four rules branch, bound, select and elimination are called basic rules.

The principal difference between parallel and sequential branch-and-bound algorithms lies in the way of handling the generated knowledge. Subproblems generated from problems by decomposition and knowledge about local and global optima belong to this knowledge. While with sequential branch-and-bound one processor generates and uses the complete knowledge, the distribution of work causes a distribution of knowledge, and the interaction of the processors working together to solve the problem becomes necessary.

The starting point for our implementation was the functional language DFS ('Datenparallele funktionale Sprache' [6]), which already contained data parallel skeletons for distributed arrays. DFS is an experimental programming
language to be used on parallel computers. The language is strict and accordingly evaluates DFS programs with a call-by-value strategy. To give the user the possibility to exploit parallelism in a very comfortable way, we have extended the functional language DFS by task parallel skeletons. One of them was a branch-and-bound skeleton. The user provides the basic rules branch, bound, select and elimination using the functional language. Then she can make a function call to the skeleton as follows: branch&bound branch bound select elimination problem.

A parallel abstract machine (PAM) represents the runtime environment for DFS. The PAM consists of a number of interconnected nodes communicating by messages. Each node consists of three units: the message administration unit, which handles the incoming and outgoing messages, the skeleton unit, which is responsible for skeleton processing, and the reduction unit, which performs the actual computation. Skeletons are the only source of parallelism in the programs.

To implement the parallel branch-and-bound skeleton, several design decisions had to be made with the objective of good computation performance and high comfort. In the following, the implementation is characterized according to Trienekens' classification ([7]) of parallel branch-and-bound algorithms.

Table 1. Classification by Trienekens

knowledge sharing    global/local knowledge base
                     complete/partial knowledge base
knowledge use        update strategy
                     access strategy
                     reaction strategy
dividing the work    units of work
                     load balancing strategy
synchronicity        synchronicity of each process
basic rules          branch, bound, select, elimination
Each process uses a local partial knowledge base containing only a part of the complete generated knowledge. In this way, the bottleneck arising from the access of all processes to a shared knowledge base is avoided, but at the expense of how up to date the knowledge base is. A process stores newly generated knowledge in its local knowledge base only; if a local optimum has been computed, the value is broadcast to all other processes (update strategy). When a process has finished a subtask, it accesses its local knowledge base to store the results and to get a new subtask to solve. If a process receives a message containing a local optimum, the process compares this optimum with its current local optimum and the bounds of the subtasks still to be solved (access strategy). A process receiving a local optimum from another process first finishes its current task and then reacts according to the received
message (reaction strategy). This may result in the execution of unnecessary work. But the extent of this work is small because of the high granularity of the distributed work. A unit of work consists of branching a problem into subproblems and computing the bounds of the newly generated subproblems. The load balancing strategy is simple and suited to the structure of the computation, because new subproblems are generated during computation. If a processor has no more work, it asks its neighbours one after the other for work. If a processor receives a request for work, it returns a unit of work – if one is available – to the asking processor. The processor sends that unit of work which is nearest to the root of the problem tree and has not been solved yet. The implemented distributed algorithm works asynchronously. The basic rules are provided by the user using the functional language.
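To make the division of labour between user code and skeleton concrete, the following is a rough sequential Java analogue of the call branch&bound branch bound select elimination problem; the interface and the best-bound-first driver are our own illustration for a minimisation problem, not the DFS skeleton or its distributed implementation described above.

import java.util.*;

// User-supplied basic rules, passed to the skeleton as higher-order arguments.
interface Rules<P> {
  List<P> branch(P problem);        // decompose a problem into disjoint subproblems
  double  bound(P problem);         // lower bound on the best solution reachable from `problem'
  boolean isLeaf(P problem);        // true if the problem cannot be branched any further
}

final class BranchAndBoundSketch {
  // Sequential sketch: `select' = best bound first, `elimination' = prune against the incumbent.
  static <P> P solve(Rules<P> rules, P root) {
    PriorityQueue<P> open = new PriorityQueue<>(Comparator.comparingDouble(rules::bound));
    open.add(root);
    P best = null;
    double bestValue = Double.POSITIVE_INFINITY;
    while (!open.isEmpty()) {
      P problem = open.poll();                     // select
      if (rules.bound(problem) >= bestValue)       // elimination
        continue;
      if (rules.isLeaf(problem)) {                 // candidate optimum
        best = problem;
        bestValue = rules.bound(problem);
      } else {
        open.addAll(rules.branch(problem));        // branch; bounds are evaluated on selection
      }
    }
    return best;
  }
}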
3 Performance Evaluation
To evaluate the performance of task parallel skeletons, we implemented branch-and-bound for the language DFS as a task parallel skeleton in C for a GigaCluster GCel1024 with 1024 T805 transputers (30 MHz, each with 4 MByte local memory) running the operating system Parix. Performance measurements for several machine scheduling problems – typical applications of the branch-and-bound method – were made to demonstrate the advantageous application of skeletons.

In the following, three cases of a machine scheduling problem for 2 machines and 5 products have been considered. The number of possible orders of machine allocation is 5! = 120. This very small problem size is sufficient to demonstrate the consequences for the distribution of work and the computation performance when the part of the problem tree that must be computed has a different extent. In case (a) the complete problem tree had to be generated. In case (b) only one branch of the tree had to be computed. Case (c) is a case where a larger part of the problem tree than in case (b) had to be computed. Each of the machine scheduling problems has been defined first in the standard functional way and second by use of the branch-and-bound skeleton. The measurements were made using different numbers of processors.

First we counted branching steps, i.e. we measured the average overall number of decompositions of subproblems of all processors working together in the computation of the problem. These measurements showed the extent of the computed problem tree when working sequentially and in parallel. It became obvious that the overall number of branching steps increases when only a small number of subproblems has to be branched. The local partial knowledge bases and the asynchronous behaviour of the algorithm cause the execution of unnecessary work. If the whole problem tree had to be generated, we observed a decrease of the average overall number of branching steps with increasing number of processors. This behaviour is called an acceleration anomaly ([2]). Acceleration anomalies occur if the search tree generated in the parallel case is smaller than the one generated in the sequential case. This can happen in the parallel case because of
branching several subproblems at the same time. Therefore it is possible to find an optimum earlier than in the sequential case. Acceleration anomalies cause a disproportionate decrease of the average maximum number of branching steps per processor with increasing number of processors, a super speedup.
Table 2. Average overall number of reduction steps

      1 proc.      1 proc.    4 proc.    6 proc.    8 proc.    16 proc.
      functional   skeleton   skeleton   skeleton   skeleton   skeleton
(a)   17197        20727      1848,2     1792,9     1074,9     709,3
(b)   54           47         138,0      218,6      299,0      451,6
(c)   138          118        179,8      261,8      258,6      484,8

Table 3. Average maximum number of reduction steps per processor

      1 proc.      1 proc.    4 proc.    6 proc.    8 proc.    16 proc.
      functional   skeleton   skeleton   skeleton   skeleton   skeleton
(a)   17197        20727      896,0      710,2      390,5      113,1
(b)   54           47         43,2       44,7       47,0       46,4
(c)   138          118        65,5       66,1       60,9       51,7
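Table 3 makes the acceleration anomaly of case (a) explicit: the skeleton version needs 20727 reduction steps on one processor but at most about 113 per processor on 16 processors, a reduction by a factor of roughly 20727 / 113,1 ≈ 183, far more than the 16-fold reduction that linear scaling would give; this is the super speedup described above.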
To compare sequential functional programs with programs defined by means of skeletons, we counted reduction steps. The reduction steps include, besides the branching of a problem, the computation of a bound of the optimal solution of a subproblem, the comparison of these bounds for selection and elimination of a subproblem from the set of subproblems still to be solved, and the comparison of bounds to determine an optimum. Table 2 shows the average overall number of reduction steps of all processors participating in the computation. In Table 3 the average maximum numbers of reduction steps per processor are given.

Table 2 and Table 3 clearly show the described effect of an acceleration anomaly. Because the set of reduction steps contains comparison steps of the operation 'selection of subproblems from the set of subproblems still to be solved', the distribution of work causes a decrease of the number of comparison steps for this operation at each processor. Looking at both Table 2 and Table 3 it becomes apparent that, in the sequential cases of (a), (b), and (c), the numbers of reduction steps differ between the problem defined in the standard functional way and the one defined using the skeleton. That is caused by different styles of programming in functional and imperative languages. In case (a) an obvious decrease of the average maximum number of reduction steps per processor (Table 3), caused by the distribution of the subproblems onto several processors, is observable. At the same time the average overall number of reduction steps (Table 2) is also decreasing
as explained before. The distribution of work onto several processors yields a large increase in efficiency in cases where a large part of the problem tree must be computed. In case (b) the average maximum number of reduction steps per processor hardly changes while the overall number of reduction steps increases, because, firstly, subproblems which are to be branched are distributed, and secondly, a larger part of the problem tree is computed. Because in case (b) the solution can be found in a short time, working in parallel as well as sequentially, the use of several processors produces overhead only. In case (c) the same phenomena as in case (b) are observable. Moreover, the average maximum number of reduction steps decreases to nearly 50% for parallel computation in comparison to sequential computation.
4 Conclusion
The concept, implementation, and application of task parallel skeletons in a functional language were presented. Task parallel skeletons appear to be a natural and elegant extension to functional programming languages. This has been shown using the language DFS and a parallel branch-and-bound skeleton as an example. Performance evaluations showed that using the implemented skeleton to find solutions for a machine scheduling problem yields better performance, especially if a large part of the problem tree has to be generated. Even in the case where only a smaller part of the problem tree needs to be computed, a distribution of work is advantageous.
Acknowledgements

The author would like to thank Herbert Kuchen and Hermann Härtig for discussions, helpful suggestions and comments. The work of the author was supported by the 'Graduiertenkolleg Werkzeuge zum effektiven Einsatz paralleler und verteilter Rechnersysteme' of the German Research Foundation (DFG) at the Dresden University of Technology.
References

1. Botorog, G.H., Kuchen, H.: Efficient Parallel Programming with Algorithmic Skeletons. In: Bougé, L. (Ed.): Proceedings of Euro-Par'96, Vol. 1. LNCS 1123. 1996.
2. de Bruin, A., Kindvater, G.A.P., Trienekens, H.W.J.M.: Asynchronous Parallel Branch and Bound and Anomalies. In: Ferreira, A.: Parallel algorithms for irregularly structured problems. Irregular '95. LNCS 980. 1995.
3. Cole, M.: Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press. 1989.
4. Darlington, J., Field, A.J., Harrison, P.G., Kelly, P.H.J., Sharp, D.W.N., Wu, Q., While, R.L.: Parallel Programming Using Skeleton Functions. In: Bode, A. (Ed.): Parallel Architectures and Languages Europe: 5th International PARLE Conference. LNCS 694. 1993.
5. Darlington, J., Guo, Y., To, H.W., Yang, J.: Functional Skeletons for Parallel Coordination. In: Haridi, S. (Ed.): Proceedings of Euro-Par'95. LNCS 966. 1995.
6. Park, S.-B.: Implementierung einer datenparallelen funktionalen Programmiersprache auf einem Transputersystem. Diplomarbeit. RWTH Aachen 1995.
7. Trienekens, H.W.J.M.: Parallel Branch and Bound Algorithms. Dissertation. Universität Rotterdam 1990.
Synchronizing Communication Primitives for a Shared Memory Programming Model

Vladimir Vlassov and Lars-Erik Thorelli

Royal Institute of Technology, Electrum 204, S-164 40 Kista, Sweden
{vlad,le}@it.kth.se
Abstract. In this article we concentrate on the semantics of operations on synchronizing shared memory that provide useful primitives for shared memory access and update. The operations are a part of a new shared-memory programming model called mEDA. The main goal of this research is to provide a flexible, effective and relatively simple model of inter-process communication and synchronization via synchronizing shared memory.
1 Introduction
There is a belief [7], which we share, that dataflow computing in combination with multithreading is a general answer to two fundamental issues in multiprocessing: memory latency and synchronization [3]. Synchronization of parallel processes [11] is a subject of considerable interest for various research groups, both in industry and academia [8,9,12,13,14]. This article presents some aspects of a new shared-memory programming model that is based on dataflow principles and provides a unified approach for communication and synchronization of parallel processes via synchronizing shared memory. The model is called mEDA, Extended Dataflow Actor, where the prefix "m" is used to distinguish a new revision of the model from the EDA model presented by Wu [17,15]. The model was inspired by different computation and memory models, such as the dataflow model [6], the Actor model [1], I-structures [3] and M-structures [5].

In this article we concentrate on the semantics of the synchronizing shared memory operations introduced in the model. The model states that a shared-memory operation (store, fetch) holds a synchronization type which specifies special synchronization requirements for both the accessing process and the accessed cell. In this way, common patterns of parallel programs such as data dependency, shared protected data, barriers, stream communication, and AND- and OR-parallelism can be conveniently expressed. The notion of processes accessing shared "boxes" (cells) using such operations provides a natural way of communication and synchronization.

Section 2 informally presents the semantics of mEDA synchronizing memory operations. Section 3 outlines synchronization and communication properties of the operations. Section 4 shortly illustrates how mEDA allows parallel processes to communicate and be synchronized via shared memory. Our conclusions are given in Section 5.
2 Synchronizing Memory Operations
Assume that a parallel application consists of a dynamic set of processes (or threads) and a set of cells which are shared, uniquely identified and known to the processes of the application. The processes interact with one another in a one-sided fashion, storing data to, or fetching data from, shared cells. The processes can also transfer values from one cell to another. A process may access cells only by special store/fetch/transfer operations defined in the model. Since each process has a notion of local state, on this level of abstraction a shared cell can be regarded as a special kind of process.

An mEDA operation sends to the shared memory an access request with the following attributes: the address of the requested cell, the operation code, the value to be stored (in the case of a store operation), and the identifier of the requesting process. The latter is used to direct a fetch response or a store acknowledgment when needed. A shared cell may contain stored data (a value) and a FIFO queue that holds the postponed access requests which cannot be satisfied because of an inappropriate state of the requested cell. We extend the set of values which can be stored to, or fetched from, a shared cell by an "empty" value. We call a shared cell full if it contains a normal value. The cell is empty if it contains the empty value.

mEDA recognizes four synchronization types of store operations (denoted by X, S, I, and U-store), and four synchronization types of fetch operations (X, S, I, and U-fetch). In addition, the model defines 16 transfer operations, briefly described in Section 2.3. The semantics of the store and fetch operations is based on the considerations given below. See also Fig. 1.
Fig. 1. Categorization of EDA operations
2.1 Store Operations
First, we state that a store operation to an empty cell always succeeds. If the cell is full, the store request can either be postponed until the cell is emptied (buffered store), or be served somehow as soon as it arrives (non-buffered store). Next we assume that a buffered store can be either blocking (synchronous)
or non-blocking (asynchronous). A blocking store, denoted by X in Fig. 1, requires an acknowledgment that a value has been stored. The requesting process is suspended until the acknowledgment arrives. On the other hand, the process executing a non-blocking buffered store (S) continues its computation as soon as the request is sent to the shared cell. A non-buffered store request to a full cell can be either ignored (I), or served by updating the requested cell independently of its state (U). In both cases, a synchronizing acknowledgment is not needed since non-buffered requests are served without delay. Based on the above observations we arrive at four types of store operations: X-store ("eXclusive"), S-store ("Stream"), I-store ("Ignore"), and U-store ("Update").

2.2 Fetch Operations
We distinguish two kinds of fetch operations, categorized in Fig. 1: (i) extracting operations, which empty a shared cell after its value has been copied to the accessing process, and (ii) copying operations, which copy a value from the cell to the process and leave the cell intact. We also distinguish whether a fetch operation may fetch (extract or copy) only a normal value or any value (normal or empty). A request to fetch a normal value is postponed on an empty cell until the latter is filled. In all cases, a requesting process is blocked until the requested value (any or only normal) is fetched from the cell. The operations requesting any value can be classified as non-blocking. Thus, mEDA includes four types of fetch operations: X-fetch, S-fetch, I-fetch, and U-fetch, where the first two are extracting and the first and third are blocking.

2.3 Transfer Operations
A transfer operation combines a fetch/store pair in one operation. Consider the expression store(s_s, B, fetch(s_f, A)), which fetches a value from cell A and stores it to cell B. Here s_f and s_s specify the synchronization types of the fetch from cell A and the store to cell B, respectively. To implement such expressions efficiently, we introduce special operations (XX-trans, SX-trans, etc.) which transfer data between shared cells and increase the efficiency of the model. All transfer operations are non-blocking and do not affect the requesting process: the value fetched from A is directly stored to B.
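As an illustration, the sketch below contrasts the two-step fetch/store expression with a single transfer request. The function and type names are hypothetical placeholders invented for this sketch, not the actual mEDA library interface; only the operation categories follow the description above.

    /* Hypothetical mEDA-style declarations -- illustrative only. */
    typedef int cell_id;                          /* identifier of a shared cell       */
    typedef int value_t;
    typedef enum { X, S, I, U } sync_t;           /* synchronization types             */

    extern value_t meda_fetch(sync_t s_f, cell_id a);             /* fetch from cell a    */
    extern void    meda_store(sync_t s_s, cell_id b, value_t v);  /* store v into cell b  */
    extern void    meda_trans(sync_t s_f, sync_t s_s, cell_id a, cell_id b);

    void copy_two_step(cell_id a, cell_id b) {
        /* store(s_s, B, fetch(s_f, A)): the value travels through the requesting process. */
        value_t v = meda_fetch(S, a);
        meda_store(X, b, v);
    }

    void copy_transfer(cell_id a, cell_id b) {
        /* SX-trans: the value moves from A to B inside the shared memory,
           without blocking or otherwise involving the requesting process. */
        meda_trans(S, X, a, b);
    }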
3
Synchronization and Communication Properties of mEDA Operations
The mEDA operations on a shared cell issued by cooperating processes trigger each other. When a process empties a full cell or fills an empty cell, one or more postponed requests (if any) queued on the cell are served. This may cause blocked processes waiting for the completion of an X-store, X-fetch or I-fetch to be resumed. Storing a normal value to an empty cell allows all pending copy requests (I-fetch) at the head of the queue and the one pending extract request (X-fetch) that follows them, if any,
to be resumed. Extracting a normal value from a full cell (X-fetch, S-fetch) or storing the empty value to the full cell (U-store) allows resuming and servicing one postponed store request, if any, queued on the cell. It is important to note that the shared memory may not service incoming requests until all postponed requests which can be resumed, are served.
Fig. 2. State Diagram of a Shared Cell
Figure 2 illustrates a state diagram of a shared cell. Each state is marked by a triple index (C, qs, qf), where:
– C indicates which value, normal or empty, is stored in the cell: C = Ø if the cell is empty, C = d if the cell is full.
– qs ∈ {0, 1, 2, . . .} is the number of store-requests (X or S) queued on the cell.
– qf ∈ {0, 1, 2, . . .} is the number of fetch-requests (X or I) queued on the cell.
Each transition is marked by the incoming or resumed access-requests causing it. For instance:
– fetch(i ∨ u) means a fetch request with either I or U synchronization.
– store(∗, Ø) means store the empty value Ø with any synchronization.
– store(u, d) means store a normal value d with U synchronization.
Intuitively it is easy to see that a queue of requests postponed on a shared cell cannot contain a mixture of storing and fetching requests. If the queue is not empty, it contains either only store requests of X and/or S types, queued on the empty cell, or only fetch requests (X and/or I) queued on the full cell. This observation is confirmed by the diagram in Fig. 2. While resuming postponed requests, the cell passes through the states depicted in the dashed shape until a request (if any) that heads the queue can be resumed.
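The following self-contained sketch mirrors the behaviour summarised by the state diagram: it decides whether an incoming request is served immediately, postponed on the cell's queue, or discarded. It is a sequential simulation written purely for illustration; the type and function names are invented here and are not taken from the mEDA implementation.

    #include <stdio.h>

    typedef enum { X, S, I, U } sync_t;
    typedef enum { SERVE, POSTPONE, DISCARD } action_t;

    /* Decide what happens to a store request arriving at a cell. */
    static action_t on_store(int cell_full, sync_t s) {
        if (!cell_full) return SERVE;        /* a store to an empty cell always succeeds    */
        switch (s) {
        case X: case S: return POSTPONE;     /* buffered stores wait until the cell empties */
        case I:         return DISCARD;      /* ignored                                     */
        default:        return SERVE;        /* U-store updates the cell regardless of state */
        }
    }

    /* Decide what happens to a fetch request arriving at a cell. */
    static action_t on_fetch(int cell_full, sync_t s) {
        if (cell_full) return SERVE;         /* a normal value is available                 */
        switch (s) {
        case X: case I: return POSTPONE;     /* requests for a normal value wait            */
        default:        return SERVE;        /* S- and U-fetch accept the empty value       */
        }
    }

    int main(void) {
        /* A full cell postpones an X-store but serves a U-store immediately. */
        printf("X-store on full cell : %d\n", on_store(1, X));  /* POSTPONE */
        printf("U-store on full cell : %d\n", on_store(1, U));  /* SERVE    */
        printf("I-fetch on empty cell: %d\n", on_fetch(0, I));  /* POSTPONE */
        return 0;
    }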
4
Process Interaction via Synchronizing Shared Memory
The store and fetch operations introduced in mEDA can be used in different combinations in order to provide a wide range of various communication and synchronization patterns for parallel processes such as data dependency, locks, barriers, and OR-parallelism. For example, X-fetch in combination with any buffered store (X- or S-store) supports data-driven computation where a process accessing shared location by X-fetch is blocked until requested data are available. One can see that the pair X-fetch/X-store is similar to the pair of take and put operations on M-structures [5], and to the pair of synchronizing read and write introduced in the memory model presented by Dennis and Gao in [8]. Locks are easily implemented using X-fetch/X-store, and OR-parallelism is achieved with I-fetch/I-store, since an I-store to a full cell is discarded.
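As a concrete illustration of the lock pattern, the sketch below builds acquire/release on top of X-fetch and X-store. The primitive names are hypothetical stand-ins for whatever the mEDA library actually exports; only the synchronization behaviour follows the description above.

    /* Hypothetical mEDA-style primitives -- illustrative declarations only. */
    typedef int cell_id;
    typedef int value_t;
    extern value_t meda_x_fetch(cell_id c);             /* blocks until the cell is full, then empties it */
    extern void    meda_x_store(cell_id c, value_t v);  /* buffered blocking store                        */

    #define LOCK_TOKEN 1

    void lock_init(cell_id lock)    { meda_x_store(lock, LOCK_TOKEN); }  /* full cell = lock free        */
    void lock_acquire(cell_id lock) { (void) meda_x_fetch(lock); }       /* empties the cell; subsequent
                                                                            callers block in X-fetch     */
    void lock_release(cell_id lock) { meda_x_store(lock, LOCK_TOKEN); }  /* refills the cell, resuming
                                                                            one postponed X-fetch        */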
5
Conclusions
We have presented the semantics of shared memory operations introduced in the mEDA model, and demonstrated how these operations allow implementing a wide variety of inter-process communication and synchronization patterns in a convenient way. Orthogonal mEDA operations provide a unified approach for communication and synchronization of parallel processes via synchronizing shared memory supporting different access and synchronization needs. The model was implemented on top of the PVM system [10]. Two software implementations of mEDA were reported earlier in [2] and [16]. In the last version [16], the synchronizing shared memory for PVM is implemented as distributed virtual shared variables addressed by task and variable identifiers. The values stored to and fetched from the shared memory are PVM messages. The mEDA library can be used in combination with PVM message passing. This allows a wide variety of communication and synchronization patterns to be programmed in a convenient way.
Finally, we believe that the memory system of a multiprocessor supporting mEDA synchronizing memory operations can be implemented relatively efficiently and cheaply. This issue may be of interest to computer architects.
References
1. Agha, G.: Concurrent Object-Oriented Programming. Communications of the ACM 33 (1990) 125–141
2. Ahmed, H., Thorelli, L.-E., Vlassov, V.: mEDA: A Parallel Programming Environment. Proc. of the 21st EUROMICRO Conference: Design of Hardware/Software Systems, Como, Italy. IEEE Computer Society Press (1995) 253–260
3. Arvind and Thomas, R. E.: I-structures: An Efficient Data Structure for Functional Languages. TM-178. LCS/MIT (1980)
4. Arvind and Iannucci, A.: Two fundamental issues in multiprocessing. Parallel Computing in Science and Engineering. Lecture Notes in Computer Science, Vol. 295. Springer-Verlag (1978) 61–88
5. Barth, P. S., Nikhil, R. S., Arvind: M-Structures: Extending a Parallel, Non-Strict, Functional Language with States. TM-237. LCS/MIT (1991)
6. Dennis, J. B.: Evolution of "Static" dataflow architecture. In: Gaudiot, J.-L., Bic, L. (eds.): Advanced Topics in Dataflow Computing. Prentice-Hall (1991) 35–91
7. Dennis, J. B.: Machines and Models for Parallel Computing. International Journal of Parallel Programming 22 (1994) 47–77
8. Dennis, J. B., Gao, G. R.: On memory models and cache management for shared-memory multiprocessors. In: Proc. of the IEEE Symposium on Parallel and Distributed Processing, San Antonio (1995)
9. Fillo, M., Keckler, S. W., Dally, W. J., Carter, N. P., Chang, A., Gurevich, Ye., Lee, W. S.: The M-Machine Multicomputer. In: Proc. of the 28th IEEE/ACM Annual Int. Symp. on Microarchitecture, Ann Arbor, MI (1995)
10. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press (1994)
11. Mellor-Crummey, J. M., Scott, M. L.: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. on Computer Systems 9 (1991) 21–65
12. Ramachandran, U., Lee, J.: Cache-Based Synchronization in Shared Memory Multiprocessors. Journal of Parallel and Distributed Computing 32 (1996) 11–27
13. Scott, S. L.: Synchronization and Communication in the T3E Multiprocessor. In: Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII) (1996) 26–36
14. Skillicorn, D. B., Hill, J. M. D., McColl, W. F.: Questions and Answers about BSP. Tech. Rep. PRG-TR-15-96. Oxford University Computing Laboratory (1996)
15. Thorelli, L.-E.: The EDA Multiprocessing Model. Tech. Rep. TRITA-IT-R 94:28. Dept. of Teleinformatics, Royal Institute of Technology, Stockholm, Sweden (1994)
16. Vlassov, V., Thorelli, L.-E.: A Synchronizing Shared Memory: Model and Programming Implementation. In: Recent Advances in PVM and MPI. Proc. of the 4th European PVM/MPI Users' Group Meeting. Lecture Notes in Computer Science, Vol. 964. Springer-Verlag (1995) 288–293
17. Wu, H.: Extension of Data-Flow Principles for Multiprocessing. Tech. Rep. TRITA-TCS-9004 (Ph.D. thesis). Royal Institute of Technology, Stockholm, Sweden (1990)
Symbolic Cost Analysis and Automatic Data Distribution for a Skeleton-Based Language
Julien Mallet
IRISA, Campus de Beaulieu, 35042 Rennes, France
[email protected]
Abstract. We present a skeleton-based language which leads to portable and cost-predictable implementations on MIMD computers. The compilation process is described as a series of program transformations. We focus in this paper on the step concerning the distribution choice. The problem of automatic mapping of input vectors onto processors is addressed using symbolic cost evaluation. Source language restrictions are crucial since they permit to use powerful techniques on polytope volume computations to evaluate costs precisely. The approach can be seen as a cross-fertilization between techniques developed within the FORTRAN parallelization and skeleton communities.
1
Introduction
A good parallel programming model must be portable and cost predictable. General sequential languages such as FORTRAN achieve portability but cost estimations are often very approximate. The approach described in this paper is based on a restricted language which is portable and allows an accurate cost analysis. The language enforces a programming discipline which ensures a predictable performance on the target parallel computer (there will be no “performance bugs”). Language restrictions are introduced through skeletons which encapsulate control and data flow in the sense of [6], [7]. The skeletons act on vectors which can be nested. There are three classes defined as sets of restricted skeletons: computation skeletons (classical data parallel skeletons), communication skeletons (data motion over vectors), and mask skeletons (conditional data parallel computation). The target parallel computers are MIMD computers with shared or distributed memory. Our compilation process can be described as a series of program transformations starting from a source skeleton-based language to an SPMD-like skeleton-based language. There are three main transformations: in-place updating, making all communications explicit and distribution. Due to space concerns, we focus in this paper only on the step concerning the choice of the distribution. We tackle the problem of automatic mapping of input vectors onto processors using a fixed set of classical distributions (e.g. row block, block cyclic,...). The goal is to determine mechanically the best distribution for each input vector
through cost evaluation. The restrictions of the source language and the fixed set of standard data distributions guarantee that the parallel cost (computation + communication) can be computed accurately for all source programs. Moreover, these restrictions allow us to evaluate and compare accurate execution times in a symbolic form. So, the results do not depend on specific vector sizes nor on a specific number of processors. Even if our approach is rooted in the field of data-parallel and skeleton-based languages, one of its specificities is to reuse techniques developed for FORTRAN parallelization (polytope volume computation) and functional languages (singlethreading analysis) in a unified framework. The article is structured as follows. Section 2 is an overview of the whole compilation process. Section 3 presents the source language and the target parallel language. In Section 4, we describe the symbolic cost analysis through an example: LU decomposition. We report some experiments done on a MIMD distributed memory computer in Section 5. We conclude by a review of related work. The interested reader will find additional details in an extended version of this paper ([12]).
2
Compilation Overview
The compilation process consists of a series of program transformations:

    L1 → L2 → L3 → L4 → L5

Each transformation compiles a particular task by mapping skeleton programs from one intermediate language into another. The source language (L1) is composed of a collection of higher-order functions (skeletons) acting on vectors (see Section 3). It is primarily designed for a particular domain where high performance is a crucial issue: numerical algorithms. L1 is best viewed as a parallel kernel language which is supposed to be embedded in a general sequential language (e.g. C).

The first transformation (L1 → L2) deals with in-place updating, a standard problem in functional programs with arrays. We rely on a type system in the spirit of [17] to ensure that vectors are manipulated in a single-threaded fashion. The user may have to insert explicit copies in order for his program to be well-typed. As a result, any vector in an L2 program can be implemented by a global variable.

The second transformation (L2 → L3) makes all communications explicit. Intuitively, in order to execute an expression such as map (λx.x+y) in parallel, y must be broadcast to every processor before applying the function. The transformation makes this kind of communication explicit. In L3, all communications are expressed through skeletons.

The transformation L3 → L4 concerns automatic data distribution. We restrict ourselves to a small set of standard distributions. A vector can be distributed cyclicly, by contiguous blocks or allocated to a single processor. For a
matrix (vector of vectors), this gives 9 possible distributions (cyclic cyclic, block cyclic, line cyclic, etc.). The transformation constructs a single vector from the input vectors according to a given data distribution. This means, in particular, that all vector accesses have to be changed according to the distributions. Once the distribution is fixed, some optimizations (such as copy elimination) become possible and are performed on the program. The transformation yields an SPMD-like skeleton program acting on a vector of processors. In order to choose the best distribution, we transform the L3 program according to all the possible distributions of its input parameters. The symbolic cost of each transformed program can then be evaluated and the smallest one chosen (see Section 4). For most numerical algorithms, the number of input vectors is small and this approach is practical. In other cases, we would have to rely on the programmer to prune the search space. The L4 → L5 step is a straightforward translation of the SPMD skeleton program to an imperative program with calls to a standard communication library. We currently use C with the MPI library. Note that all the transformations are automatic. The user may have to interact only to insert copies (L1 → L2 step) or, in L4 , when the symbolic formulation does not designate a single best distribution (see 4.4).
3
The Source and Target Languages
The source language L1 is basically a first-order functional language without general recursion, extended with a collection of higher-order functions (the skeletons). The main data structure is the vector which can be nested to model multidimensional arrays. An L1 program is a main expression followed by definitions. A definition is either a function definition or the declaration of the (symbolic) size of an input vector. Combination of computations is done through function composition (.) and the predefined iterator iterfor. iterfor n f a acts like a loop applying n times its function argument f on a. Further, it makes the current loop index accessible to its function argument. The L1 program implementing LU decomposition is:

    LU(M) where
      M :: Vect n (Vect n Float)
      LU(a) = iterfor (n-1) (calc.fac.apivot.colrow) a
      calc(i,a,row,piv) = (i,(map (map first) .rect i (n-1) (rect i (n-1) fcalc)
                              .map zip3.zip3)(a,row,piv))
      fcalc(a,row,piv) = (a-row*piv,row,piv)
      fac(i,a,row,col,piv) = (i,a,row,(map(map /).map zip.zip)(col,piv))
      apivot(i,a,row,col) = (i,a,row,col,map (brdcast (i-1)) row)
      colrow(i,a) = (i,a,brdcast (i-1) a,map (brdcast (i-1)) a)
      zip3(x,y,z) = map p2t (zip(x,zip(y,z)))
      p2t(x,(y,z)) = (x,y,z)
For handling vectors, there are three classes of skeletons: computation, communication, and mask skeletons. The computation skeletons include the classical higher-order functions map, fold and scan. The communication skeletons describe restricted data motion in vectors. In the definition of colrow in LU, brdcast (i-1) copies the (i-1)th vector element to all the other elements. Another communication skeleton is gather i M which copies the ith column of the matrix M to the ith row of M. There are seven communication skeletons which have been chosen because of their availability on parallel computers as hard-wired or optimized communication routines. The mask skeletons are data-parallel skeletons which apply their function argument only to a selected set of vector elements. In the definition of calc in LU, rect i (n-1) fcalc applies fcalc on vector elements whose index ranges from i to (n-1). Note that our approach is not restricted to these sole skeletons: even if it requires some work and caution, more iterators, computation or mask skeletons could be added.

In order to enable a precise symbolic cost analysis, additional syntactic restrictions are necessary. The scalar arguments of communication skeletons, the iterfor operator and mask skeletons must be linear expressions of iterfor indexes and variables of vector size. We rely on a type system with simple subtyping (not described here) to ensure that the variables appearing in such expressions are only iterfor and vector-size variables. It is easy to check that these restrictions are verified in LU. The integer argument of iterfor is a linear expression of an input vector size. The arguments of the rect and brdcast skeletons are made of linear expressions of the iterfor variable i or the vector size variable n.

The target language L4 expresses SPMD (Single Program Multiple Data) computations. A program acts on a single vector whose elements represent the data spaces of the processors. Communications between processors are made explicit by data motion within the vector of processors in the spirit of the source parallel language of [8]. L4 introduces new versions of skeletons acting on the vector of processors. An L4 program is a composition of computations, with possibly a communication inserted between each computation. More precisely, a program FunP has the following form:

    FunP ::= (pimap Fun | piterfor E FunP) (. Com)^{0/1} (. FunP)^{0/1}

A computation is either a parallel operation pimap Fun which applies Fun on each processor, or an iteration piterfor E FunP which applies E times the parallel computation FunP to the vector of processors. The functions Fun are similar to functions of L1. The only differences are the skeleton arguments which may include modulo, integer division, minimum and maximum functions. Com denotes the communication between processors. It describes data motion in the same way as the communication skeletons of L1 but on the vector of processors. For example, the corresponding parallel communication skeleton of brdcast is pbrdcast which broadcasts values from one processor of the vector of processors to each other.
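To make the informal skeleton descriptions concrete, the sketch below gives plain sequential C semantics (0-based indexing) for map, brdcast and rect on a vector of doubles. This is only a reference model written for this presentation; the real L1 skeletons are higher-order and polymorphic, and their compiled versions run in parallel.

    #include <stddef.h>

    /* map f v : apply f to every element. */
    static void map_d(double (*f)(double), double *v, size_t n) {
        for (size_t i = 0; i < n; i++) v[i] = f(v[i]);
    }

    /* brdcast k v : copy element k to all the other elements. */
    static void brdcast_d(size_t k, double *v, size_t n) {
        for (size_t i = 0; i < n; i++) v[i] = v[k];
    }

    /* rect lo hi f v : apply f only to elements whose index lies in [lo, hi]. */
    static void rect_d(size_t lo, size_t hi, double (*f)(double), double *v, size_t n) {
        for (size_t i = lo; i <= hi && i < n; i++) v[i] = f(v[i]);
    }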
4
Accurate Symbolic Cost Analysis
After transformation of the program according to different distributions, this step aims at automatically evaluating the complexity of each L4 program obtained in order to choose the most efficient. Our approach to get an accurate symbolic cost is to reuse results on polytope volume computations ([16], [5]). It is possible because the restrictions of the source language and the fixed set of data distributions guarantee that the abstracted cost of all transformed source programs can be translated into a polytope volume description. First, an abstraction function CA takes the program and yields its symbolic parallel cost. The cost expression is transformed so that it can be seen as the definition of a polytope volume. Then, classic methods to compute the volume of a polytope are applied to get a symbolic cost in polynomial form. Finally, a symbolic math package (such as Maple) can be used to compare symbolic costs and find the smallest cost among the L4 programs corresponding to the different distribution choices. 4.1
Cost Abstraction and Cost Language LC
The symbolic cost abstraction CA is a function which extracts cost information from parallel programs. The main rules are shown below. We use a non-standard notation for indexed sums: instead of $\sum_{i=1}^{n}$, we write $\sum_i \{1 \le i \le n\}$. Communication costs are expressed as polynomials whose constants depend on the target computer. For example, the cost of pbrdcast involves the parameters $\alpha_{br}$ and $\beta_{br}$ which denote respectively the time of a one-word transfer between two processors and the message startup time on the parallel computer considered. One just has to set those constants to adapt the analysis for a specific parallel machine.

    CA[[f1 . f2]]            = CA[[f1]] + CA[[f2]]
    CA[[piterfor e (λi.f)]]  = $\sum_i \{1 \le i \le e\}$ CA[[f]]
    CA[[pimap (λi.f)]]       = $\max_{i=0}^{p-1}$ CA[[f]]            where p is the number of processors
    CA[[pbrdcast]]           = $(\alpha_{br} \cdot b + \beta_{br}) \cdot p$    where p is the number of processors and b is the size of the broadcast data
    CA[[rect e1 e2 f]]       = $\sum_i \{e_1 \le i \le e_2\}$ CA[[f]]
The cost abstraction yields accurate costs expressed in a simple language LC. LC is made of sums of generalized sums (G-sums), $\sum_i \{\ldots\} (\ldots)$, and generalized maxs (G-maxs), $\max_{i=e_1}^{e_2} (\ldots)$, possibly nested. A G-max denotes the maximum expression one can get when the max variable ranges over the integer interval. Inequalities in G-sums involve only linear expressions and maximum and minimum functions of them. Polynomials may be included in the body of G-sums and G-maxs. The variables of these polynomials denote only vector sizes or the number of processors. In order to illustrate the analysis, we consider here only the two best distribution choices for LU: row block and row cyclic.
The abstracted costs are given below; note that only the part of the costs which differs is detailed:

$C^1_{bloc} \equiv \sum_i \{1 \le i \le n-1\}\ \max_{i_p=0}^{p-1}\ \sum_j \{\max(0,\, i - i_p b) \le j \le \min(b-1,\, n-1-i_p b + b)\}\ \sum_k \{i \le k \le n-1\}\ 1 \;+\; C$

$C^1_{cyc} \equiv \sum_i \{1 \le i \le n-1\}\ \max_{i_p=0}^{p-1}\ \sum_j \{\max\!\big(0,\, \tfrac{i-i_p}{p}\big) \le j \le \min\!\big(b,\, \tfrac{n-1-i_p}{p}\big)\}\ \sum_k \{i \le k \le n-1\}\ 1 \;+\; C$
We emphasize that writing the source program in L1 is crucial to get an accurate symbolic cost. First of all, without a severe limitation of the use of recursion no precise cost could be evaluated in general. Further, the restrictions imposed by L1 ensure that every abstract parallel cost belongs to LC. The mask skeletons (which limit conditional application to index intervals), the communication and the computation skeletons all have a complexity which depends polynomially on the vector size. Their costs can be described by nested sums. Another important restriction is that expressions involving iteration indexes (mask skeletons and iterfor bounds) are linear. This restriction, in addition to the form of the standard distributions which keep the linearity of the vector accesses, ensure that the inequalities occurring in LC are linear. 4.2
Transformation to Descriptions of Polytope Volume
Polytope volumes are only defined in terms of nested G-sums with linear inequalities containing neither max nor min ([5]). In order to apply methods to compute polytope volumes, we must remove the G-max, min, max occurring in the cost expressions. The transformation takes the form of a structural recursive scan of the cost expression. The min and max appearing in inequalities are transformed into conjunctions of inequalities. For example, the expression $\max(0, i - i_p b) \le j$ in $C^1_{bloc}$ becomes $0 \le j \wedge i - i_p b \le j$. G-max expressions are propagated through their subexpressions. For example, in $C^1_{bloc}$,

$\max_{i_p=0}^{p-1} \sum_j \{0 \le j \wedge i - i_p b \le j \wedge j \le b-1 \wedge j \le n-1-i_p b + b\}\ (\ldots)$

is temporarily transformed into

$\sum_j \{0 \le j \wedge \min_{i_p=0}^{p-1}(i - i_p b) \le j \wedge j \le b-1 \wedge j \le \max_{i_p=0}^{p-1}(n-1-i_p b + b)\}\ (\ldots)$

The max value of a G-sum is a sum where the upper bounds are maximized and the lower ones minimized. For a sum inequality $i \le e$, we propagate the G-max inside $e$ until it reaches a variable or a constant. We can finally use the facts that $\max_{i_p=0}^{p-1} i_p = p-1$ and $\max_{i_p=0}^{p-1} k = k$ if $k \not\equiv i_p$. For an inequality $e \le i$, the transformation propagates a G-min analogously. Since no polynomial in LC may contain a G-max index, the G-max of a polynomial is just this polynomial. In the case of distributed LU, the cost expressions are transformed into

$C^1_{bloc} \equiv C^2_{bloc} = \sum_{(i,j,k)} \{1 \le i \le n-1 \wedge 0 \le j \wedge i-(p-1)b \le j \wedge j \le b-1 \wedge j \le n-1+b \wedge i \le k \le n-1\}\ 1 \;+\; C$

$C^1_{cyc} \equiv C^2_{cyc} = \sum_{(i,j,k)} \{1 \le i \le n-1 \wedge 0 \le j \wedge \tfrac{i-p+1}{p} \le j \wedge j \le b-1 \wedge j \le \tfrac{n}{p}-1 \wedge i \le k \le n-1\}\ 1 \;+\; C$

The technique described to remove G-max expressions may maximize the real cost. This is due to the duplication of G-max in G-sums. A more complicated
but accurate symbolic solution also exists. The idea is to evaluate (as described in the next section) the nested G-sums first. When the G-sum is reduced to a polynomial the problem amounts to computing symbolic maximum of polynomials over symbolic intervals. This can be done using symbolic differentiation, solution and comparison of polynomials. At this point, we have only used the simpler approximated solution because the approximation is minor for our examples (e.g. none for LU) and the accurate symbolic solution is not implemented as such in existing symbolic math packages. 4.3
Parametrized Polytope Volume Computation
A parametrized polytope is a set of points whose coordinates satisfy a system of linear inequalities with possibly unknown parameters. After the previous transformation, the cost expression denotes a polytope volume, that is the number of points in the associated polytope. The works of [16] and [5] describe algorithms for computing symbolically the volume of parametrized polytopes. The result is a polynomial whose variables are the system parameters. Further, in [5], Clauss
presents an extension for systems of non-linear inequalities with floor ($\lfloor\,\rfloor$) and ceiling ($\lceil\,\rceil$) functions, as appearing in $C^2_{cyc}$. Thus, the algorithm of [5] allows one to transform each parallel cost expression into a polynomial. This method applied to LU yields:

$C^3_{bloc} \equiv C^2_{bloc} = \frac{n^3(3p^2-1)}{6p^3} - \frac{n^2}{2p} + \frac{n}{6p} + C$

$C^3_{cyc} \equiv C^2_{cyc} = \frac{n^3}{3p} + \frac{n^2(p-2)}{2p} + \frac{n(p^2-3p+2)}{6p} + C$

4.4
Symbolic Cost Comparison
The last step is to compare the symbolic costs of different distribution choices. It amounts to computing the symbolic intervals where the difference of the costs (a polynomial) is positive or negative. Symbolic math packages such as Maple can be used for solving this problem. In the case of LU, Maple produces the following condition:

$C^3_{bloc} - C^3_{cyc} \ge 0 \;\Leftrightarrow\; n \ge p$

The programmer may have to indicate if the relations given by Maple are satisfied or not. In our example, he must indicate whether n ≥ p. Another (automatic) solution is to use the relations as run-time tests which choose between several versions of the program. In our example, the difference in the two costs can be explained by the fact that the cyclic distribution provides a much better load balancing than the block distribution whereas communications are identical.
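The run-time alternative mentioned above can be realised as a trivial guard around the two generated versions of the program. The sketch below assumes two hypothetical entry points, lu_block and lu_cyclic, standing for the code produced for the two distributions; the test itself is exactly the relation returned by Maple.

    /* Hypothetical entry points for the two compiled versions -- illustrative only. */
    extern void lu_block(double *a, int n, int p);   /* row-block distribution  */
    extern void lu_cyclic(double *a, int n, int p);  /* row-cyclic distribution */

    /* Choose at run time according to the symbolic condition
       C3_bloc - C3_cyc >= 0  <=>  n >= p. */
    void lu_auto(double *a, int n, int p) {
        if (n >= p)
            lu_cyclic(a, n, p);   /* cyclic is at least as cheap when n >= p */
        else
            lu_block(a, n, p);
    }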
5
Experiments
We have performed experiments on an Intel Paragon XP/S with a handful of standard linear algebra programs (LU, Cholesky factorization, Householder,
Jacobi elimination, ...). Our implementation is not complete and some compilation steps, such as the destructive update step and part of the symbolic cost computation, were done manually. The table below gathers the execution times obtained for LU decomposition. They are representative of the results we got for the few other programs. For all programs, the distribution chosen by the cost analysis proved to be the best one in practice.

    Processors   Skel. cyclic   Skel. bloc   C Seq.   HPF cyclic   ScaLAPACK cyclic
         1          14.77          15.07      13.61      15.36            3.78
         4           5.25           6.75        ×          5.41           1.84
        16           2.97           5.33        ×          3.06           1.50
        32           2.57           5.58        ×          2.67           1.41

We compared the sequential execution of skeleton programs with standard (and portable) C versions and our parallel implementation with High Performance Fortran (a manual distribution approach). No significant sequential or parallel runtime penalty seems to result from programming using skeletons, at least for such regular algorithms. We compared our code with the parallel implementation of NESL, a skeleton-based language [1]. The work on the implementation of NESL has mostly been directed towards SIMD machines. On the Paragon, the NESL compiler distributes vectors uniformly on processors and communications are not optimized. Not surprisingly, the parallel code is very inefficient (at least fifty times slower than our code). We also compared our implementation with ScaLAPACK, an optimized library of linear algebra programs designed for distributed-memory MIMD parallel computers [4]. In ScaLAPACK, the user may explicitly indicate the data distribution. So, we indicated the best distribution found by the cost analysis in each ScaLAPACK program considered. If our code on 1 processor is much slower than its ScaLAPACK equivalent (between 3 to 5 times slower), the difference decreases as the number of processors increases (typically, 1.8 times slower on 32 processors). Much of this difference comes from the machine-specific routines used by ScaLAPACK for performing matrix operations (the BLAS library). This suggests a possible interesting extension of our source language. The idea would be to introduce new skeletons corresponding to the BLAS operations in order to benefit from these machine-specific routines. These preliminary results are promising but more experiments are necessary to assess both the expressiveness of the language and the efficiency of the compilation. We believe that these experiments may also indicate useful linguistic extensions (e.g. new skeletons) and new optimizations.
6
Related Work and Conclusion
The community interested in the automatic parallelization of FORTRAN has studied automatic data distribution through parallel cost estimation ([10], [3]).
If the complete FORTRAN language (unrestricted conditional, indexing with runtime value, ...) is to be taken into account, communication and computation costs cannot be accurately estimated. In practice, the approximated cost may be far from the real execution time leading to a bad distribution choice. [16], [5] focus on a subset of FORTRAN: loop bound and array indexes are linear expressions of the loop variables. This restriction allows them to compute a precise symbolic computation cost through their computations of polytope volume. Unfortunately, using this approach to estimate communication costs is not realistic. Indeed, the cost would be expressed in terms of point-to-point communications without taking into account hard-wired communication primitives. All these works estimate real costs too roughly to ensure that a good distribution is chosen. The skeleton community has studied the transformation of restricted computation patterns into lower-level parallel primitives. [7] defines a restricted set of skeletons which are transformed using cost estimation. Only cost-reducing transformations are considered. It is, however, well known that often intermediary cost-increasing transformations are necessary to derive a globally cost optimal algorithm. [15], [11] and [14] define cost analyses for skeleton-based languages. Their skeletons are more general than ours leading to approximate parallel cost (communication or/and computation). Furthermore, the costs are not symbolic (the size of input matrices and the number of processors are supposed to be known). [9] define a precise communication cost for scan and fold skeletons on several parallel topologies (hypercube, mesh, ...) which allows them to apply optimizations of communications through cost-reducing transformations. There are also a few real parallel implementations of skeleton-based languages. [2] uses cost estimations based on profiling which does not ensure a good parallel performance for different sizes of inputs. [13] uses a finer cost estimation but implementation decisions are taken locally and no arbitration of tradeoffs is possible. We have presented in this paper the compilation of a skeleton-based language for MIMD computers. Working by program transformations in a unified framework simplifies the correctness proof of the implementation. One can show independently for each step that the transformation preserves the semantics and that the transformed program respects the restrictions enforced by the target language. The overall approach can be seen as promoting a programming discipline whose benefit is to allow precise analyses and a predictable parallel implementation. The source language restrictions are central to the approach as well as the techniques to evaluate the volume of polytopes. We regard this work as a rare instance of cross-fertilization between techniques developed within the FORTRAN parallelization and skeleton communities. A possible research direction is to study dynamic redistributions chosen at compile-time. Some parallel algorithms are much more efficient in the context of dynamic data redistribution. A completely automatic and precise approach to this problem would be possible in our framework. However, this would lead to a search space of exponential size. A possible solution is to have the user select the redistribution places.
Acknowledgements Thanks to Pascal Fradet, Daniel Le Métayer, Mario Südholt for commenting on an earlier version of this paper.
References
1. G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. In 4th ACM Symp. on Princ. and Practice of Parallel Prog., pages 102–112, 1993.
2. T. Bratvold. A Skeleton-Based Parallelising Compiler for ML. In 5th Int. Workshop on the Imp. of Fun. Lang., pages 23–33, 1993.
3. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. Teng. Automatic array alignment in data-parallel program. In 20th ACM Symp. on Princ. of Prog. Lang., pages 16–28, 1993.
4. J. Choi and J. J. Dongarra. Scalable linear algebra software libraries for distributed memory concurrent computers. In Proc. of the 5th IEEE Workshop on Future Trends of Distributed Computing Systems, pages 170–177, 1995.
5. P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In ACM Int. Conf. on Supercomputing, 1996.
6. M. Cole. A skeletal approach to the exploitation of parallelism. In CONPAR'88, pages 667–675. Cambridge University Press, 1988.
7. J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, D. W. N. Sharp, Q. Wu, and R. L. While. Parallel programming using skeleton functions. In PARLE '93, pages 146–160. LNCS 694, 1993.
8. J. Darlington, Y. K. Guo, H. W. To, and Y. Jing. Skeletons for structured parallel composition. In 5th ACM Symp. on Princ. and Practice of Parallel Prog., pages 19–28, 1995.
9. S. Gorlatch and C. Lengauer. (De)Composition rules for parallel scan and reduction. In 3rd IEEE Int. Conf. on Massively Par. Prog. Models, 1998.
10. M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179–193, 1992.
11. C. B. Jay, M. I. Cole, M. Sekanina, and P. Steckler. A monadic calculus for parallel costing of a functional language of arrays. In Euro-Par'97 Parallel Processing, pages 650–661. LNCS 1300, 1997.
12. J. Mallet. Compilation for MIMD computers of a skeleton-based language through symbolic cost analysis and automatic data distribution. Technical Report 1190, IRISA, May 1998.
13. S. Pelagatti. A Methodology for the Development and the Support of Massively Parallel Programs. PhD thesis, Pise University, 1993.
14. R. Rangaswami. A Cost Analysis for a Higher-order Parallel Programming Model. PhD thesis, Edinburgh University, 1996.
15. D. B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Technical report, Queen's University, 1993.
16. N. Tawbi. Estimation of nested loops execution time by integer arithmetic in convex polyhedra. In Int. Symp. on Par. Proc., pages 217–223, 1994.
17. P. Wadler. Linear types can change the world! In Programming Concepts and Methods, pages 561–581. North Holland, 1990.
Optimising Data-Parallel Programs Using the BSP Cost Model
D.B. Skillicorn¹, M. Danelutto, S. Pelagatti, and A. Zavanella²
¹ Department of Computing and Information Science, Queen's University, Kingston, Canada
² Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 Pisa, Italy
Abstract. We describe the use of the BSP cost model to optimise programs, based on skeletons or data-parallel operations, in which program components may have multiple implementations. BSP’s view of communication transforms the problem of finding the best implementation choice for each component into a one-dimensional minimisation problem. A shortest-path algorithm that finds optimal implementations in time linear in the number of operations of the program is given.
1
Problem Setting
Many parallel programming models gain expressiveness by raising the level of abstraction. Important examples are skeletons, and data-parallel languages such as HPF. Programs in these models are compositions of moderately-large building blocks, each of which hides significant parallel computation internally. There are typically multiple implementations for each of these building blocks, and it is straightforward to order these implementations by execution cost. What makes the problem difficult is that different implementations require different arrangements of their inputs and outputs. Communication steps must typically be interspersed to rearrange the data between steps. Choosing the best global implementation is difficult because the cost of a program is the sum of two different terms, an execution cost (instructions), and a communication cost (words transmitted). In most parallel programming models, it is not possible to convert the communication cost into comparable units to the execution cost because the message transit time depends on which other messages are simultaneously being transmitted, and this is almost impossible to determine in practice. The chief advantage of the BSP cost model, in this context, is that it accounts for communication (accurately) in the same units as computation. The problem of finding a globally-optimal set of implementation choices becomes a one-dimensional minimisation problem, in fact with some small adaptations, a shortest path problem. The complexity of the shortest path problem with positive weights is quadratic in the number of nodes (Dijkstra's algorithm) or O((e + n) log n), where e is
the number of edges, for the heap algorithm. We show that the special structure of this particular application reduces the problem to shortest path in a layered graph, with complexity linear in the number of program steps. BSP’s cost model assumes that the bottleneck in communication performance is at the processors, rather than in the network itself. Thus the cost of communication depends on the fan-in and fan-out of data at each processor, and a single architectural parameter, g, that measures effective inverse bandwidth (in units of time per word transferred). Such a model is extremely accurate for today’s parallel computers. Other programming models can use the technique described here to the extent that the BSP cost model reflects their performance. In practice, this requires being able to decide which communication actions will occur in the same time frame. This will almost certainly be possible for P3L, HPF, GoldFish, and other data-parallel models. In Section 2 we show how the BSP cost model can be used to model the costs of skeletons or data-parallel operations. In Section 3, we present an optimisation algorithm with linear complexity. In Section 4, we illustrate its use on a nontrivial application written in the style of P3L. In Section 5, we review some related work.
2
Cost Modelling
Consider the program in Figure 1, in which program operations, BSP supersteps [7], are shown right to left in a functional style. Step A has two potential implementations, and we want to account for the total costs of the two possible versions of the program. Step A s1
s2
h2
h1
w1
w2
h4
h3
Alternate Implementation of Step A Fig. 1. Making a Local Optimisation. Rectangles represent computation, ovals communication, and bars synchronisation. Each operation is labelled with its total cost parameter. Both implementations may contain multiple supersteps although they are drawn as containing one. The alternate implementation of step A requires extra communication (and hence an extra barrier) to place data correctly. In general, it is possible to overlap this with the communication in the previous step. The communication stage of the operation preceding step A arranges the data in a way suitable for the first implementation of step A. Using the second imple-
mentation for step A requires inserting extra data manipulation code to change this arrangement. This extra code can be merged with the communication at the end of the preceding superstep, saving one barrier synchronisation, and potentially reducing the total communication load. For example, identities involving collective communication operations may allow some of the data movements to ‘cancel’ each other. We can describe the cost of these two possible implementations of Step A straightforwardly in the BSP cost framework. The cost of the original implementation of Step A is

$w_1 + h_1 g + s_1 l$

and the cost of the new implementation is

$w_2 + h_2 g + s_2 l$

plus a further rearrangement cost

$h_3 g + l$

To determine the extent to which the $h_3 g$ cost can be overlapped with the communication phase of the previous step we must break both costs down into fan-out costs and fan-in costs. So suppose that $h_4$ is the communication cost of the preceding step and

$h_3 = \max(h_3^{out}, h_3^{in}) \qquad h_4 = \max(h_4^{out}, h_4^{in})$

since fan-ins and fan-outs are costed additively. The cost of the combined, overlapped communication is

$\max(h_3^{out} + h_4^{out},\; h_3^{in} + h_4^{in})\, g$

The new implementation is cheaper than the old, therefore, if

$w_2 + h_2 g + s_2 l + \max(h_3^{out} + h_4^{out},\; h_3^{in} + h_4^{in})\, g \;<\; w_1 + h_1 g + s_1 l + h_4 g$
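The inequality above translates directly into a small decision procedure. The sketch below is a plain C restatement of exactly that test; the parameter names follow the cost expressions in the text and nothing else is assumed.

    static double max2(double a, double b) { return a > b ? a : b; }

    /* Returns nonzero if the alternate implementation of Step A (cost w2 + h2*g + s2*l,
       plus a rearrangement whose fan-out/fan-in parts are h3_out/h3_in) is cheaper than
       the original one (w1 + h1*g + s1*l), given that the preceding step communicates
       h4_out/h4_in words per processor. */
    int alternate_is_cheaper(double w1, double h1, double s1,
                             double w2, double h2, double s2,
                             double h3_out, double h3_in,
                             double h4_out, double h4_in,
                             double g, double l)
    {
        double h4        = max2(h4_out, h4_in);                   /* cost of the preceding communication */
        double overlap   = max2(h3_out + h4_out, h3_in + h4_in);  /* merged, overlapped communication    */
        double original  = w1 + h1 * g + s1 * l + h4 * g;
        double alternate = w2 + h2 * g + s2 * l + overlap * g;
        return alternate < original;
    }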
3
Optimisation
The analysis in the previous section shows how a program is transformed into a directed acyclic graph in which the nodes are labelled with the cost of computation (and synchronisation) and the edges with the cost of communication. Alternate implementation corresponds to choosing one path between two points rather than another. In what follows, we simply assume that different implementations for each program step are known, without concerning ourselves with how these are discovered. Given a set of alternate implementations for each program step, the communication costs of connecting a given implementation of one step with a given implementation of the succeeding step can be computed as outlined above. If there are a alternate implementations for each program step, then there are a² possible communication patterns connecting them. If the program consists of n steps, then the total cost of setting up the graph is (n − 1)a². We want to find the shortest path through this layered graph. Consider the paths from any implementation block at the end of the program. There are (at most) a paths leading back to implementations of the previous step. The cost of a path is the sum of the costs of the blocks it passes through and the edges it contains. After two communication steps, there are a² possible paths, but only (at most) a of them can continue, because we can discard all but the cheapest of any paths that meet at a block. Thus there are only a possible paths to be extended beyond any stage. Eventually, (at most) a paths reach the start of the program, and the cheapest overall can be selected. The cost of this search is (n − 1)a², and it must be repeated for each possible starting point, so the overall cost of the optimisation algorithm is (n − 1)a³. This is linear in the length of the program being optimised. The value of a is typically small, perhaps 3 or 4, so that the cubic term in a is unlikely to be significant in practice.
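A direct way to read the argument above is as dynamic programming over the layered graph: keep, for each implementation of the current step, the cheapest cost of any path reaching it. The sketch below (plain C, with fixed upper bounds on n and a purely for brevity) is one straightforward realisation; it is illustrative code, not the compiler's actual optimiser.

    #include <float.h>

    #define MAX_STEPS 64   /* n : number of program steps (illustrative bound)       */
    #define MAX_ALTS   4   /* a : alternate implementations per step (typically 3-4) */

    /* node_cost[i][j]   : cost of implementation j of step i (computation + synchronisation)
       edge_cost[i][j][k]: communication cost of connecting implementation j of step i
                           with implementation k of step i+1
       choice[i]         : on return, the index of the implementation chosen for step i */
    double cheapest_plan(int n, int a,
                         const double node_cost[MAX_STEPS][MAX_ALTS],
                         const double edge_cost[MAX_STEPS][MAX_ALTS][MAX_ALTS],
                         int choice[MAX_STEPS])
    {
        double best[MAX_STEPS][MAX_ALTS];   /* cheapest path cost reaching each block */
        int    pred[MAX_STEPS][MAX_ALTS];

        for (int k = 0; k < a; k++) best[0][k] = node_cost[0][k];

        /* (n-1) * a^2 relaxations: linear in the number of program steps. */
        for (int i = 1; i < n; i++)
            for (int k = 0; k < a; k++) {
                best[i][k] = DBL_MAX;
                for (int j = 0; j < a; j++) {
                    double c = best[i-1][j] + edge_cost[i-1][j][k] + node_cost[i][k];
                    if (c < best[i][k]) { best[i][k] = c; pred[i][k] = j; }
                }
            }

        /* Pick the cheapest final implementation and recover the choices. */
        int k_best = 0;
        for (int k = 1; k < a; k++)
            if (best[n-1][k] < best[n-1][k_best]) k_best = k;
        choice[n-1] = k_best;
        for (int i = n - 1; i > 0; i--) choice[i-1] = pred[i][choice[i]];
        return best[n-1][k_best];
    }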
4
An Example
We illustrate the approach with a non-trivial program, Conway's game of Life, expressed in the intermediate code used by the P3L compiler [5,1]. The operations available to the P3L compiler are:

    distribute :: Distr pattern → SeqArray α → ParArray (SeqArray α)
    gstencil   :: ParArray α → Distr pattern → Stencil → Distr pattern → ParArray α
    lstencil   :: ParArray α → Stencil → Distr pattern → ParArray (SeqArray α)
    map        :: ParArray α → (α → β) → ParArray β
    reduce     :: ParArray α → (α → α → α) → α
    reduceall  :: ParArray α → (α → α → α) → ParArray α
    gather     :: ParArray α → Distr pattern → SeqArray α

where distribute distributes an array to a set of worker processes according to a distribution pattern (Distr pattern), gstencil fetches data from neighbour workers according to a given stencil, lstencil arranges the data local to a worker to have correct haloes in a stencil computation, map and reduce are the usual functional operations, reduceall is a reduce in which all the workers get the result, and gather collects distributed data in an array to a single process. Possible intermediate P3L code for the game of life is:

    1. d1 = (block block p)
    2. W  = distribute d1 World
    3. while NOT(C2)
    4.   W1 = gstencil W d1 [(1,1),(1,1)] d1
    5.   W2 = lstencil W1 [(1,1),(1,1)] d1
    6.   (W3,C) = map (map update) W2
    7.   C1 = map (reduce OR C)
    8.   C2 = reduceall OR C
    9. X = gather d1 W3
where d1 is the distribution adopted for the World matrix (block, block over a rectangle of p processors), and W is the ParArray of the distributed slices of the initial World matrix. Line 4 gathers the neighbours from non-local processors. Line 5 creates the required copies. Line 6 performs all the updates in parallel. Array C contains the boolean values indicating whether the corresponding element has changed. Line 9 gathers the global results. We concentrate on the computation inside the loop. It may be divided into three phases: Phase A gathers the needed values at each processor (Line 4). Phase B computes all of the updates in parallel
(Lines 5–7). Phase C checks for termination: nothing has changed in the last iteration (Line 8). There are several possible implementations for these steps:

    Operation   Computation Cost       Comment
    A1          l                      direct stencil implementation
    A2          2l                     gather to a single process then distribute
    A3          3l                     gather to a single process then broadcast
    B           (9 + t⊕)N² + l         local update and reduce
    C1          (p − 1)tOR             total exchange of results, then local reductions
    C2          (log p − 1)(l + tOR)   tree reduction

where t⊕ is the cost of a local update, tOR is the time taken to compute a logical OR, and N is the length of the side of the world region held by each processor. In A1, all workers in parallel exchange the boundary elements (4(N + 1)), requiring a single barrier (costing l). In A2, all the results are gathered on a single processor and redistributed giving each worker its halo; this requires much more data exchange (N²p + 4(N + 1)) and two synchronisations (2l). In A3, data are collected as in A2 and then broadcast to all the workers. This requires three synchronisations (3l), as optimal BSP broadcast is done in two steps [7]. The cost of the communication between different implementations is:

    Pair       Communication Cost   Comment
    (A1, B)    4(N + 1)g            send the halo of an N × N block
    (A2, B)    N²pg + 4(N + 1)g     gather and distribute the halo
    (A3, B)    3N²pg                gather and broadcast, no haloes
    (B, C1)    (p − 1)g             total exchange
    (B, C2)    2g
    (C1, ∗)    0                    data already placed as needed
    (C2, ∗)    2(log p − 1)g

where p is the number of target processors, and N the length of the side of the block of World on each processor. The optimal solution depends on the values of g, l and s of the target parallel computer. In general, for this example, A1 is better than A2 and A3, but C1 is better than C2 only for small numbers of processors or networks with large values of g. For instance, if we consider a CRAY T3D (with values of g ≈ 0.35 µs/word, l ≈ 25 µs and s ≈ 12 Mflops as in [7]), C1 is faster than C2 when p ≤ 128, so the optimal implementation is A1-B-C1. For larger numbers of processors, A1-B-C2 is the best implementation. For architectures with slower barriers (Convex Exemplar, IBM SP2 or Parsytec GC) the best solution is always A1-B-C1.
5
Related Work
The optimisation problem described here has, of course, been solved pragmatically by a large number of systems. Most of these use local heuristics and a
small number of program and architecture parameters, and give equivocal results [4,3,2]. Unpublished work on P3L compiler performance has shown that effective optimisation can be achieved, but it requires detailed analysis of the target architecture and makes compilation time-consuming. The technique presented here is expected to replace the present optimisation algorithm. A number of global optimisation approaches have also been tried. To [8] gives an algorithm with complexity n²a to choose among block and cyclic distributions of data for a sequence of data-parallel collective operations. Rauber and Rünger [6] optimise data-parallel programs that solve numerical problems. Different implementations are homogeneous families of algorithms with tunable algorithmic parameters (the number of iterations, the number of stages, the size of systems), and the cost model is a simplification of LogP.
6
Conclusions
The contribution of this paper is twofold: a demonstration of the usefulness of the perspective offered by the BSP cost model in simplifying a complex problem to the point where its crucial features can be seen; and then the application of a shortest-path algorithm for finding optimal implementations of skeleton and data-parallel programs. The known accuracy of the BSP cost model in practice reassures us that the simplified problem dealt with here has not lost any essential properties. Reducing the problem from a global problem with many variables to a one-dimensional problem that can be optimised sequentially makes a straightforward optimisation algorithm, with linear complexity, possible.
Acknowledgements We gratefully acknowledge the input of Barry Jay and Fabrizio Petrini, in improving the presentation of this paper.
References
1. B. Bacci, B. Cantalupo, M. Danelutto, S. Orlando, D. Pasetto, S. Pelagatti, and M. Vanneschi. An environment for structured parallel programming. In L. Grandinetti, M. Kowalick, and M. Vaitersic, editors, Advances in High Performance Computing, pages 219–234. Kluwer, Dordrecht, The Netherlands, 1997.
2. Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, and M.-Y. Wu. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Journal of Parallel and Distributed Computing, 21(1):15–26, April 1994.
3. S. Ciarpaglini, M. Danelutto, L. Folchi, C. Manconi, and S. Pelagatti. ANACLETO: a template-based P3L compiler. In Proceedings of the Seventh Parallel Computing Workshop (PCW '97), Australian National University, Canberra, 1997.
4. M. Danelutto, F. Pasqualetti, and S. Pelagatti. Skeletons for data parallelism in P3L. In C. Lengauer, M. Griebl, and S. Gorlatch, editors, Proc. of EURO-PAR '97, Passau, Germany, volume 1300 of LNCS, pages 619–628. Springer-Verlag, August 1997.
5. S. Pelagatti. Structured development of parallel programs. Taylor & Francis, London, 1997.
6. T. Rauber and G. Rünger. Deriving structured parallel implementations for numerical methods. The Euromicro Journal, (41):589–608, 1996.
7. D.B. Skillicorn, J.M.D. Hill, and W.F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249–274, 1997.
8. H.W. To. Optimising the Parallel Behaviour of Combinations of Program Components. PhD thesis, Imperial College, 1995.
A Parallel Multigrid Skeleton Using BSP
Femi O. Osoba and Fethi A. Rabhi
Department of Computer Science, University of Hull, Hull HU6 7RX
{B.O.Osoba,F.A.Rabhi}@dcs.hull.ac.uk
Abstract. Skeletons offer the opportunity to improve parallel software development by providing a template-based approach to program design. However, due to the large number of architectural models available and the lack of adequate performance prediction models, such templates have to be optimised for each architecture separately. This paper proposes a programming environment based on the Bulk Synchronous Parallel(BSP) model for multigrid methods, where key implementation decisions are made according to a cost model.
1
Introduction
Programming environments and CASE tools are increasingly becoming popular for developing parallel software. Skeleton-based programming environments offer known advantages of intellectual abstraction and development of structured programs [1,9]. Most skeleton-based systems are implemented at a low level using functional languages [3,12] or imperative languages with parallel constructs, e.g. PVM/C, data-parallel languages, etc. [1,9]. However, such models suffer from having no simple cost calculus, thereby hindering the functionality of the skeleton. To generate the required target code for different architectures, such skeleton-based systems have to be optimised for each architecture. This paper describes a skeleton-based programming environment that is implemented at the low level with a programming model that possesses a cost calculus, namely the Bulk Synchronous Parallelism (BSP) model. The next section presents our case study skeleton and Section 3 presents an overview of the proposed programming environment. The next sections present an overview of the user interface, the specification analysis phase and the code generation process. Finally, the last section presents some conclusions and future directions of our work.
2
A Case Study : Multigrid Methods
Most scientific computing problems represent a physical system by a mathematical model equation. Presently we are considering a model differential equation of the form ∇.K∇u = f in the 2D space. To make it suitable for a computer solution, the continuous physical domain of the system being modelled is discretised D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 704–708, 1998. c Springer-Verlag Berlin Heidelberg 1998
and this leads to a set of linear equations of the form Ax = b with n unknowns, which needs to be solved for. A particular class of methods for solving such equations is known as multigrid(MG) methods. Such methods have been selected as a case study on the basis of their usefulness in an important application area (CFD) and their similarity with SIT algorithms which were the focus of earlier work [9,12]. The principle behind a multigrid method is to use a relaxation method (e.g. Gauss Seidel or Jacobi) on a sequence of m discretisation grids G1 . . . Gm (G1 represents the finest grid and Gm the coarsest). By employing several levels of discretisation, the multigrid method is able to accelerate the convergence rate of the relaxation method on the finest grid. One iteration of the MG method from finest to coarsest grid and back is called a cycle. During a cycle, the different stages in an MG algorithm are relaxation which involves the use of an iterative method on the current grid level, restriction which involves the transfer of the residual from one grid to the next coarser grid, coarsest grid solution which provides an exact solution on the coarsest grid and interpolation which is concerned with the transfer of the solution obtained from a coarser grid to the next finer grid. Additional details on multigrid methods can be found in [2].
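For readers unfamiliar with the method, the following heavily simplified sketch shows how the four stages compose into one recursive V-cycle. It is illustrative C pseudocode rather than the skeleton code generated by the environment; the relaxation, transfer and exact-solve routines are left as elided placeholders, and boundary handling is omitted.

    /* Elided helpers -- placeholders for the relaxation, transfer and exact-solve stages. */
    extern void relax(double *u, const double *f, int n);
    extern void solve_exactly(double *u, const double *f, int n);
    extern void restrict_residual(const double *u, const double *f, double *f_coarse, int n);
    extern void zero(double *u, int n);
    extern void interpolate_and_correct(const double *u_coarse, double *u, int n);

    /* One V-cycle on grid level l (l = 1 is the finest grid, l = m the coarsest). */
    void vcycle(int l, int m, double **u, double **f, const int *size)
    {
        relax(u[l], f[l], size[l]);                              /* relaxation on the current grid   */
        if (l == m) {
            solve_exactly(u[l], f[l], size[l]);                  /* coarsest grid solution           */
        } else {
            restrict_residual(u[l], f[l], f[l + 1], size[l]);    /* restriction of the residual      */
            zero(u[l + 1], size[l + 1]);
            vcycle(l + 1, m, u, f, size);                        /* solve the coarse-grid correction */
            interpolate_and_correct(u[l + 1], u[l], size[l]);    /* interpolation back to finer grid */
        }
        relax(u[l], f[l], size[l]);                              /* post-smoothing                   */
    }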
3
Overview of a Skeleton-Based Programming Environment
The overall structure of the system is very similar to other skeleton-based programming environments [9] as figure 1 illustrates. The primary aim is separating the role of the user and the system developer. The user decides on how the multigrid method should be used and what parameters are to determine the optimal performance irrespective of the underlying architecture, while the system developer addresses issues concerned with generating efficient code and determining suitable or recommended parameters for different architectures based on performance prediction and cost modelling. The main components of the system as shown in figure 1 are described in turn in the next sections. 3.1
3.1 User Interface
The role of the User Interface is to allow the user to enter and edit the parameters of the multigrid algorithm. The User Interface also displays the results of the computation which consist of a visual representation of the solution. Figure 2 shows three sets of parameters: multigrid parameters, physical parameters and implementation parameters. Multigrid parameters The multigrid parameters include the number of grids, the relaxation method, the stencil (e.g. 4 or 8 points) and the cycling strategy (V or W cycles). Another parameter is the number of relaxation iterations performed at various grid levels. Some parameters (e.g. the relaxation method) are expected to be changed after the Specification Analyser has been invoked. When selecting values, the user primarily attempts to reduce the overall execution time by minimizing the number of cycles required to converge to the solution.
Fig. 1. Overall description of the system (diagram showing the User Interface, the Specification Analyser, the Parameters, the MG Modules Library, the Code Generator, the BSP Library, the generated target BSP code, the BSP architecture and the Results).

Fig. 2. User Interface parameters: multigrid parameters (number of grids, relaxation method, stencil, number of iterations, cycling strategy), physical parameters (number of points, initial data, boundary data, number of cycles, K) and implementation parameters (BSP architecture (p, g, l), partitioning strategy, tiles shape).

Physical parameters These parameters are dependent on the physical attributes of the problem, e.g. the constant K describes the anisotropy in the problem. Implementation parameters These parameters represent the characteristics of the parallel implementation. First, the BSP architecture is specified as a triplet (p, g, l) [6]. Another parameter represents the partitioning strategy, which is generally based on grid partitioning techniques. Finally, the tiles shape parameter refers to the shape of sub-partitions (e.g. square or rectangular tiles).
3.2 Specification Analyser
The role of the Specification Analyser is to predict the optimal value of some of the parameters given a minimal subset of values provided by the user. These predictions are based on estimates of the computation and communication requirements for parallel execution. We base our theoretical domain-partitioning cost model on a simple analytic model presented in [11]. Because BSP removes the notion of network topology [6], the goal is to build solutions that are optimal
with respect to total computation, total communication, and total number of supersteps [6]. For our system this is represented by a parameter set. Each set represents one variation of the algorithm. A parameter set consists of the number of cycles (Cy), the relaxation method (RM), the stencil type (St), and the tile shape (Ts). Designing a particular program then becomes a matter of choosing a parameter set that is optimal for the range of machine sizes envisaged for the application. Our theoretical BSP cost model for a standard MG algorithm on a regular grid is of the form:

Cy · [ C · (n^2/p) + g · n · (1/tr + 1/tc) · M ] + l · S
(1)
where C is the total number of computation steps, M is the total amount of communication, p is the number of processors (and also the number of tiles), tr is the number of tiles in a row, tc the number of tiles in a column, and S is the total number of supersteps. C is a function of (RM, St, n^2, p), M is a function of (Ts, RM, St) and S is a function of (RM, Ts). So, for a given number of cycles (Cy), the analyser produces a parameter set (Cy, RM, St, Ts) which results in the lowest execution time for the particular architecture.
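To make the selection step concrete, the following sketch ranks candidate parameter sets by formula (1). It is only an illustration of the selection logic: the dictionaries comp_steps, comm_vol and supersteps, the candidate values and the tile layout rule are our own placeholders for the functions C, M and S described above, which the paper does not list explicitly.

    from itertools import product

    def bsp_cost(Cy, C, M, S, n, p, tr, tc, g, l):
        # Formula (1): Cy[C(n^2/p) + g n (1/tr + 1/tc) M] + l S
        return Cy * (C * (n * n / p) + g * n * (1.0 / tr + 1.0 / tc) * M) + l * S

    def best_parameter_set(Cy, n, p, g, l, comp_steps, comm_vol, supersteps):
        # Rank every (relaxation method, stencil, tile shape) combination and
        # return the combination with the lowest predicted cost.  The three
        # dictionaries stand in for C(RM, St, n^2, p), M(Ts, RM, St), S(RM, Ts).
        best = None
        for rm, st, ts in product(("Jacobi", "GaussSeidel"), (4, 8), ("square", "rect")):
            if ts == "square":
                tr = tc = int(round(p ** 0.5))   # assumes p is a perfect square
            else:
                tr, tc = p, 1
            cost = bsp_cost(Cy, comp_steps[(rm, st)], comm_vol[(ts, rm, st)],
                            supersteps[(rm, ts)], n, p, tr, tc, g, l)
            if best is None or cost < best[0]:
                best = (cost, (rm, st, ts))
        return best

In the actual environment this choice is made by the Specification Analyser from the user-supplied subset of parameters.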
3.3 The Code Generator
The role of the Code Generator is to produce BSP code according to the current values of the parameters. The Code Generator uses a set of reusable library modules which form the MG Module Library and the BSP library (BSPlib) [4] for handling communication and synchronisation. The MG modules are coded in C and implement each of the phases identified in Section 2. For further details on the multigrid programming environment, the reader is referred to the Ph.D thesis in [8].
4
Conclusion and Future work
This paper described a skeleton-based programming environment for multigrid methods which generates BSP code. The user is offered an interface that abstracts from the underlying hardware, thus ensuring portability and intellectual abstraction. By combining the skeleton idea with BSP, accurate analysis and prediction of the performance of the code on different architectures can be made. Since the system is aware of the global properties of the underlying BSP architecture, it can decide which aspects of the MG algorithm may need to be changed to provide optimal performance and then perform the code generation. Future work will refine the performance prediction model according to practical results, develop a systematic framework for automatic code generation and enlarge the parameter set to cater for more realistic problems. Finally, we will also investigate a suitable notation for specifying and modifying parameters and evaluate the proposed system with real users who work with such methods.
Acknowledgements We would like to thank members of the BSP team at Oxford University Computing Laboratory U.K, for their continuous support on the use of the BSP library, and also for making the SP-2 available to us.
References
1. G.H. Botorog and H. Kuchen, Algorithmic skeletons for adaptive multigrid methods, in Proceedings of Irregular '95, LNCS 980, Springer Verlag, 1995, pp. 27-41.
2. J.H. Bramble, Multigrid methods, Pitman Research Notes in Mathematics, Series 294, Longman Scientific & Technical, 1993.
3. J. Darlington, A. Field, P. Harrison, P. Kelly, D. Sharp, Q. Wu, R. While, Parallel programming using skeleton functions, in Proceedings of PARLE '93, LNCS, Munich, Springer Verlag, June 1993.
4. M. W. Goudreau, J.M.D. Hill, K. Lang, W.F. McColl, S. B. Rao, D.C. Stefanescu, T. Suel, and T. Tsantilas, A proposal for the BSP worldwide standard library, July 1996, available on WWW http://www.bsp-worldwide.org/.
5. O.A. McBryan, P.O. Frederickson, J. Linden, A. Schuller, K. Solchenbach, K. Stuben, C.A. Thole, and U. Trottenberg, Multigrid methods on parallel computers - a survey of recent developments, IMPACT Comput. Sci. Eng., vol. 3, pp. 1-75, 1991.
6. W.F. McColl, Bulk synchronous parallel computing, in Abstract Machine Models for Highly Parallel Computers, J.R. Davy and P.M. Dew (eds), Oxford University Press, 1995, pp. 41-63.
7. M. Nibhanupudi, C. Norton, and B. Szymanski, Plasma simulation on networks of workstations using the Bulk Synchronous Parallel model, in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Athens, GA, November 1995.
8. B.O. Osoba, Design of a Parallel Multigrid Skeleton-Based System using BSP, Ph.D. thesis in preparation, Department of Computer Science, University of Hull, 1998.
9. P.J. Parsons and F.A. Rabhi, Generating parallel programs from paradigm-based specifications, to appear in the Journal of Systems Architectures, 1998.
10. F.A. Rabhi, A Parallel Programming Methodology Based on Paradigms, in Transputer and Occam Developments, P. Nixon (Ed.), IOS Press, 1995, pp. 239-252.
11. P. Ramanathan and S. Chalasani, Parallel multigrid algorithms on CM-5, in IEE Proc. Computers and Digital Techniques, vol. 142, no. 3, May 1995.
12. J. Schwarz and F.A. Rabhi, A skeleton-based implementation of iterative transformation algorithms using functional languages, in Abstract Machine Models for Parallel and Distributed Computing, M. Kara et al. (eds), IOS Press, 1996.
Flattening Trees
Gabriele Keller (1) and Manuel M. T. Chakravarty (2)
(1) Fachbereich Informatik, Technische Universität Berlin, Germany, [email protected]
(2) Inst. of Inform. Sciences and Electronics, University of Tsukuba, Japan, [email protected]
Abstract. Nested data-parallelism can be efficiently implemented by mapping it to flat parallelism using Blelloch & Sabot's flattening transformation. So far, the only dynamic data structure supported by flattening is the vector. We extend flattening with support for user-defined recursive types, which allow parallel tree structures to be defined. Thus, important parallel algorithms can be implemented more clearly and efficiently.
1
Introduction
The flattening transformation of Blelloch & Sabot [6,4] implements nested data-parallelism by mapping it to flat data-parallelism. Compared to flat parallelism, nested parallelism allows algorithms to be expressed on a higher level of abstraction while providing a language-based performance model [5]; in particular, algorithms operating on irregular data structures and divide-and-conquer algorithms benefit from nested parallelism. Efficient code can be generated for a wide range of parallel machines [2,11,12,10]. However, flattening supports only vectors (i.e., homogeneous, ordered sequences) as dynamic data structures. Thus, important parallel algorithms, like hierarchical n-body codes and other adaptive algorithms based on tree structures, are awkward to program: the tree structure has to be mapped to a vector structure, implying explicit index calculations to keep track of the parent-child relation, and leading to a suboptimal data distribution on distributed-memory machines. In this paper, we propose user-defined recursive types to tackle the mentioned problems. We extend flattening such that it maps the new structures to efficient, flat data-parallel code. Our extension fits easily into existing formalizations and implementations of flattening; in particular, the optimization techniques of previous work [11,12,10,7,9] remain applicable. This paper makes the following three main contributions: (1) It demonstrates the usefulness of recursive types for nested data-parallel languages (Section 2), (2) it formally specifies our extension of flattening including user-defined recursive types (Section 3), and (3) it provides experimental results gathered with the resulting code on a Cray T3E (Section 4). Regarding point (2), as a side-effect of our extension, we contribute to a rigorous specification of flattening by formalizing the instantiation of polymorphic primitives. Thereby, we also introduce a new kind
of primitives, so-called chunkwise operations, for more efficient data redistribution on distributed-memory machines. We use the functional language Nesl [5] throughout this paper, but the discussed techniques also work for imperative languages [1]. Many details and discussions have been omitted in this paper due to shortage of space—this material can be found in [8]. Section 2 discusses the benefit of recursive types for tree-based algorithms in a purely vector-oriented language. Section 3 formalizes our extended flattening transformation. Section 4 presents benchmarks. Finally, Section 5 concludes.
2
The Problem: Encoding Trees by Vectors
Nesl [5] is a strict functional language featuring nested vectors as its central data structure. In addition to built-in parallel operations on vectors, the apply-to-each construct is used to express parallelism. In its general form {e : x1 in e1, ..., xn in en | f} we call e the body, the xi in ei the generators, and f (which is optional) the filter. The body e is evaluated for each element of the vectors ei and the result is included in the result vector if f evaluates to T (true); the vectors ei are required to be of equal length and are processed in lock-step. For example, {x + y : x in [1, -2, 3], y in [4, 5, 6] | x > 0} evaluates to [5, 9]. Nested parallelism occurs where parallel operations appear in the body of an apply-to-each, e.g., {plus_scan (x) : x in xs}, which for xs = [[1, 2, 3], [4], [5, 6]] yields [[0, 1, 3], [0], [0, 5]]. The built-in plus_scan is a prescan using addition [4]. The implementation of a tree-based algorithm in such a language implies representing trees by nested vectors, obscuring the code with explicit index calculations to keep track of the parent-child relation. Moreover, in an implementation based on flattening, these nested vectors are represented by a set of flat vectors in the target code. All data elements of the tree are mapped to a single vector and are uniformly distributed to achieve load balancing. This, however, leads to superfluous redistributions, as those algorithms usually traverse the trees breadth-first, i.e., level by level: all nodes on one level are processed in parallel. We illustrate these problems with the example of Barnes & Hut's hierarchical n-body code [3]. It minimizes the number of force calculations by grouping particles hierarchically into cells according to their spatial position. The hierarchy is represented by a tree. This allows approximating the accelerations induced by a group of particles on distant particles by using the centroid of that group's cell. The algorithm has two phases: (1) the tree is constructed from a particle set, and (2) the acceleration for each particle is computed in a down-sweep over the tree. The following Nesl function outlines the tree construction:

function bhTree (ps) : [MassPnt] -> [Cell] =
  if #ps == 1 then [Cell (ps[0], [])]        -- cell has only one particle
  else let
    split particles ps into spatial groups pgs
    subtrees = {bhTree (pg) : pg in pgs};
    children = {subtree[0] : subtree in subtrees};                      ()
    cd       = centroid ({mp : Cell (mp, is) in children});
    sizes    = {#subtree : subtree in subtrees}                         ()
  in [Cell (cd, sizes)] ++ flatten (subtrees)  -- whole tree as flat vector ()

Fig. 1. Example of a Barnes-Hut tree and its representation as a vector (particles p1-p9 are grouped into cells c0-c5; the resulting vector [[c1 c4 c5 p6 p7 p8 p9] [c2 p2 p3] [c3 p4 p5] [p1]] is distributed over Processor 1 and Processor 2).
We represent the tree as a vector of cells. Each cell of the form Cell (mp, szs) corresponds to a node of the tree and is a pair of a mass point mp and the sizes of the subtrees szs (tuples can be named in Nesl). Given such a vectorized tree, the acceleration of a set of mass points mps is computed by

function accels (tree, mps) : ([Cell], [MassPnt]) -> [Vec] =
  if #mps == 0 then []
  else let
    Cell (cd, chNos) = tree[0];                               -- get root ()
    split mps into closeMps and farMps (direct force calculation)
    farAcs    = {accel (cd, mp) : mp in farMps};
    subTrees  = partition (drop (tree, 1), chNos);                      ()
    closeAcss = {accels (tree, closeMps) : tree in subTrees};
  in combine farAcs and closeAcss
It computes the acceleration for the mass points in farMps directly (using the function accel) and recurses into the tree for those in closeMps. The function drop omits the first element of a vector and partition forms a nested vector according to the lengths passed in the second argument (it is the same as P in the next section). The type Vec represents ‘vectors’ in the sense of physics. The lines marked by () in the functions are (partially) artifacts of maintaining the tree as a vector. Figure 1 depicts the grouping of an exemplary particle distribution and the corresponding tree. The tree is both built and traversed level by level, i.e., all nodes in one level of the tree are processed in a parallel step. Let us consider the data layout (over two processors) for the example tree in Figure 1. To ensure proper load balancing, all cells of the already constructed subtrees have to be redistributed in each recursive step of bhTree. Similarly, accels, while descending the tree, has to redistribute those cells that correspond to one level of the tree. We will quantify these costs experimentally in Section 4. It should be clear that it is more suitable to store the nodes of the tree in a distinct vector for each level of the tree, and then, to chain these vectors to represent the whole tree. At the source level, such a structure corresponds to regarding each node as being composed from some node data plus a vector of tree nodes that have exactly the same structure; in other words, we need a recursive type. For our example, we can use datatype Node (MassPnt, [Node]). Then,
we simplify the function bhTree as follows: we omit computing children and sizes, we compute cd by centroid ({mp : Node (mp, chs) in trees}), and change the body of the let into Node (cd, subtrees). In accels, computing subTrees becomes superfluous, and closeAcss is computed by {accels (tree, closeMps) : tree in subtrees} where Node (cd, subtrees) = tree. Generally, we extend Nesl with named tuples that refer to themselves, but only in the element type of a vector (to avoid infinite types), as we can terminate recursion only by empty vectors; there are no arbitrary sum types. Handling sum types efficiently in flattening seems much harder than recursive types.
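As an aside, the level-by-level storage that such a recursive type suggests can be mimicked with ordinary lists. The following sketch is our own illustration (not Nesl code): for every level of the tree it keeps one flat list of node data together with the child counts of the nodes on that level, informally mirroring the Node (MassPnt, [Node]) datatype.

    def levels_of(node):
        """node = (data, children); returns one (data-list, child-counts) pair per level."""
        levels, frontier = [], [node]
        while frontier:
            levels.append(([d for d, _ in frontier], [len(c) for _, c in frontier]))
            frontier = [child for _, cs in frontier for child in cs]
        return levels

    # Example: the tree c0 -> [c1 -> [p2, p3], p1]
    tree = ("c0", [("c1", [("p2", []), ("p3", [])]), ("p1", [])])
    print(levels_of(tree))
    # [(['c0'], [2]), (['c1', 'p1'], [2, 0]), (['p2', 'p3'], [0, 0])]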
3
Flattening Trees by Program Transformation
After the source language extension, we proceed to the implementation of userdefined recursive types by flattening. The flattening transformation serves three aims: First, it replaces all nested parallel expressions by flat parallel expressions that contain the same degree of parallelism, second, it changes the representation of data structures such that vectors contain only elements of base type, and third, it replaces all polymorphic vector primitives with monomorphic instances. In this section, we begin by introducing the flat data-parallel kernel language Fkl, which is the target language of the first part of the flattening transformation. We continue with a discussion of the target representation of data structures, and then, describe an instantiation procedure, which implements the change of the representation of data structures as well as the generation of all necessary monomorphic instances of the primitives. Due to changing the representation of data structures, the instantiation of the polymorphic primitives becomes technically challenging—especially so, in the presence of recursive types. The representation of recursive types together with the instantiation procedure are the central technical contributions of this paper. Although the instantiation of the polymorphic primitives—excluding recursive types—is already implicit in previous work, it was never described in detail (for example, Blelloch [4] merely gives some examples and leaves the non-trivial induction of a general algorithm to the inclined reader); in particular, we provide the first formal specification. Our treatment also leads to more efficient code on distributed-memory machines than previous approaches (which concentrated on vector and shared-memory machines). A central idea of our method is the fact that only the primitive vector operations actually access or manipulate elements of nested vectors. Therefore, we can regard nested vectors as an abstract data type with some freedom in the concrete implementation. We will not discuss the first step of the flattening, as it is already detailed in previous work [4,11,10] and not affected by the addition of recursive types— however, a complete specification of the flattening transformation can be found in an unabridged version of this paper [8]. 3.1
The Flat Kernel Language
A kernel language (Fkl) program consists of a list of declarations produced by the rule D → V (V1 , . . . , Vn ) = E, with variables V and expressions produced by
E → C | V | V (E1 , . . . , En ) | let V = E1 in E2 | if E1 then E2 else E3 | [E1 , . . . , En ] where
C are constants. We assume programs are typed, with types from
T → Int | Bool | V | (T1 , . . . , Tn ) | [T] | µV.T For brevity, we only have Int and Bool as primitive types. In a recursive type µx .T , all occurrences of x in T must be within element types of vectors (to get a finite type). For example, the type Node of Section 2 is represented as µx .(MassPnt, [x ]), where MassPnt abbreviates the tuple of a mass point. The second component of Fkl are its primitive operations. Among the most important are the usual arithmetic and logic operations as well as construction of tuples τ n : α1 × · · · × αn → (α1 , . . . , αn ) and the corresponding projections πni : (α1 , . . . , αn ) → αi (the function type is given to the right of the colon). The vector operations include operations length # : [α] → Int, concatenation ++ : [α] × [α] → [α], and indexing ind : [α] × Int → α. Moreover, we have distribution dist : α × Int → [α], where dist (a, n) yields [a, . . . , a] with length n, permutation perm : [α] × [Int ] → [α], where perm (xs, is) permutes xs according to the index vector is, packing pack : [α] × [Bool ] → [α], where pack (xs, fs) removes all elements from xs that correspond to a false value in fs. Furthermore, there are families of reduction ⊕τ -reduce : [τ ] → τ and prescan ⊕τ -scan : [τ ] → [τ ] functions for associative binary primitives ⊕τ operating on data of basic type τ . Finally, Fkl contains three functions that form the basis for handling nested vectors: F : [[α]] → [α] correspond to ++ -reduce and removes one level of nesting, e.g., F ([[1, 2, 3], [], [4, 5]]) = [1, 2, 3, 4, 5]; S : [[α]] → [Int] corresponds to {#xs : xs ← xss} and returns the toplevel nesting structure, e.g., S ([[1, 2, 3], [], [4, 5]]) = [3, 0, 2]; and P : [α] × [Int ] → [[α]] reconstructs a nested vector from the results of the previous two function, e.g., P ([1, 2, 3, 4, 5], [3, 0, 2]) = [[1, 2, 3], [], [4, 5]]. For each nested vector xs, we have P (F (xs), S (xs)) = xs. We write the application of primitives like ++ infix. Some of the primitives are polymorphic (those with α in the type) and, as said, we discuss their instantiation later, but we assume that in an Fkl program all polymorphism in user-defined functions is already removed (by code duplication and type specialization). To compensate the lack of general nested parallelism (i.e., no apply-to-each construct), Fkl supports all primitive functions p : T1 × · · · × Tn → T in a vectorized form p↑ : [T1 ] × · · · × [Tn ] → [T ], which applies p in parallel to all the elements of its argument vectors (which must be of the same length). For example, we have [1, 2, 3] +↑Int [4, 5, 6] = [5, 7, 9]. In general, a primitive and its vectorized form relate through p ↑ (xs1 , . . . , xsn ) = {p(x1 , . . . , xn ) : x1 ← xs1 , . . . , xn ← xsn }. Note, the part of the transformation generating Fkl guarantees that nothing like (p↑ )↑ is needed. The next subsection introduces two additional sets of primitives: One handles recursive types and the other, although not strictly necessary, handles some operations on nested vectors more efficiently. We delay the discussion of these primitives as it benefits from knowing the target data representation. We choose Fkl as the target language as there are optimizing code-generation techniques mapping it on different parallel architectures [2,7,9].
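As a purely illustrative reading of this relation (our own sketch, not part of Fkl or its implementation), the lifting of a scalar primitive to its vectorized form can be written as follows:

    # p_up(xs1,...,xsn) = {p(x1,...,xn) : x1 <- xs1, ..., xn <- xsn}
    def lift(p):
        def p_up(*vectors):
            assert len({len(v) for v in vectors}) == 1, "arguments must have equal length"
            return [p(*xs) for xs in zip(*vectors)]
        return p_up

    add_up = lift(lambda x, y: x + y)
    print(add_up([1, 2, 3], [4, 5, 6]))   # [5, 7, 9], matching [1,2,3] +^ [4,5,6]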
3.2
Concrete Data Representation
Before presenting the instantiation procedure for polymorphic primitives, we discuss an efficient target representation of nested vectors, vectors of tuples, and vectors of recursive types. Nested vectors. We can represent nested vectors of basic type using only flat vectors by separating the data elements from the nesting structure. For example, [[1, 2, 3], [], [4, 5]] is represented by a pair consisting of the data vector [1, 2, 3, 4, 5] and the segment descriptor [3, 0, 2] containing the lengths of the subvectors. The primitives F and S extract these two components, whereas P combines them into an abstract compound structure representing a nested vector. In general, this representation requires a single data vector and one segment descriptor per nesting level of the represented vector. Instances of polymorphic primitives operating on these nested vectors can be realised by combining primitives on vectors of basic type with the functions F , S, and P, as we will see below. This representation allows an optimized treatment of costly reordering primitives, such as permutation. Consider the expression perm (as, is), where as is of type [[Int]]. Both the data vector and the segment descriptor of as have to be permuted. We have perm (as, is) = P (perm (F (as), is ), perm (S (as), is)), where perm operates on vectors of type Int and is is a new permutation vector computed from is and S (as). So, for example, perm ([[3, 4, 5], [1, 3]], [1, 0]) = P (perm ([3, 4, 5, 1, 3], [2, 3, 4, 0, 1]), perm ([3, 2], [1, 0])). This scheme is expensive, because (a) a new index vector is is computed for each level of nesting, and moreover, (b) the data vector is permuted elementwise, whereas the original expression allows to deduce that several continuous blocks of elements (the subvectors) are permuted, i.e., we lose information about the structure of communication. We can prevent this behaviour by employing an additional set of primitive functions, the so-called chunkwise operations distC , permC , and indC . The operation distC gets a vector together with a natural number and yields a vector that repeats the input vector as often as specified by the second argument. The operations permC and indC both get three arguments: (1) a flat data vector, (2) a segment descriptor, and (3) an index vector or index position (depending on the function). They permute and index blocks of the data vector chunkwise, where the chunk size is specified by the segment descriptor. Their semantics is defined as permC (xs, s, is) = perm (P (xs, s), is) and indC (xs, s, i) = ind (P (xs, s), i), respectively, where perm and ind operate on vectors of type [[s]] when xs is of type [s] and s is a basic type. Their implementation using blockwise communication is straightforward. Vectors of tuples. Vectors of tuples are represented by tuples of vectors. Accordingly, applications of vector primitives operating on such vectors are pulled inside the tuple, as proposed by [4]. Recursive Types. The most complicated case is the representation of vectors of recursive types by vectors of basic type in such a way that the nodes of each
level of the represented tree structure are stored in a separate vector (as discussed in Section 2). In Subsection 3.1 we required that, in a recursive type µx .T , each occurrence of x in T has the form [x ]. Let us consider the possible contexts of these occurrences. If the outermost type constructor of T is a primitive type (e.g., Int), there is no recursion and we represent T as usual. But, if the outermost type constructor of T is a vector, we have [[x ]], i.e., a nested vector of recursive type. Similar to nested vectors of basic type, we regard them as an abstract structure manipulated by the three functions F , S, and P. Next, if the outermost type constructor of T is a tuple, we have to represent it as a tuple of vectors like in the non-recursive case, while treating the components of the resulting tuple in the same way as T itself. Finally, if the outermost type constructor of T is a recursive type of the form µx .T , no special treatment is necessary, apart from handling T in the same way as T . Overall, any occurrence of a recursive type variable x (except in the root of a tree) is represented by [[x ]], due to the propagation of vectors inside tuples and the requirement that x occurs only as [x ]. Thus, these occurrences are manipulated by F , S, and P. These functions allow us to represent a tree structure of potentially unbounded depth using only flat vectors by separating data from nesting structure. Nevertheless, a problem remains: For nested vectors, the nesting depth of the data is statically known (due to strong typing), but not so for the depth of a recursive type. Hence, we get a sequence of segment descriptors of statically unknown length. Each of these is accompanied by the data vector for one level of the tree. A value of recursive type µx .T is always terminated by empty vectors of a recursive occurrence x in T , but due to the propagation of vectors inside tuples, we need a special value to identify the recursion termination. Hence, we wrap the representation of vectors of recursive type into a maybe type α , which is either x (where x is of type α) or . To process values of type α , we add two primitives (·?) : α → Bool and (·↑) : α → α. Operation (·?) yields true if the argument has the form x , in this case, (·↑) returns x (it is undefined, otherwise). The tree of Figure 1 is represented by (c0 , v ) with v = ([c1 , c2 , c3 , p1 ]), P (w , [2, 2, 2, 0])) x = ([p6 , p7 , p8 , p9 ], P (y, [0, 0, 0, 0])) P (, [])) w = ([c4 , c5 , p2 , p3 , p4 , p5 ], P (x , [2, 2, 0, 0, 0, 0])) y = ([],
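The following sketch (ours, in plain Python) mimics the segment-descriptor representation of nested vectors described above: a nested vector is accessed only through F, S and P over a flat data vector and a segment descriptor, so that P(F(xs), S(xs)) = xs.

    def F(nested):        # flatten one nesting level
        return [x for seg in nested for x in seg]

    def S(nested):        # top-level nesting structure (segment descriptor)
        return [len(seg) for seg in nested]

    def P(data, segd):    # rebuild the nested vector from data + segment descriptor
        out, i = [], 0
        for n in segd:
            out.append(data[i:i + n])
            i += n
        return out

    xs = [[1, 2, 3], [], [4, 5]]
    assert F(xs) == [1, 2, 3, 4, 5] and S(xs) == [3, 0, 2]
    assert P(F(xs), S(xs)) == xs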
3.3
Instantiation of Polymorphic Functions
Now, we are in a position to define the instantiation of the polymorphic uses of primitives. We denote the type of an instance of a primitive by annotating the type substituted for the type variable (α in the signatures from Subsection 3.1) as a subscript. For example, ++[Int] has type [[Int]] × [[Int]] → [[Int]], and we have to generate instances for all occurrences apart from ++Int and ++Bool . The generic primitives τ n , πin , #, F , S, and P are not instantiated as their code is independent of the type instance. The transformation algorithm. In the presence of recursive types, we cannot transform the program by replacing uses of polymorphic primitives with expressions that merely use primitives on basic type; instead, we add new declarations to a program, until all uses of primitives are either directly supported or covered
by a previously added declaration. The addition of a declaration may introduce primitives occurring at new instances, which in turn triggers the addition of declarations for these instances. We discuss the termination of this process later. Each equation says that if a primitive occurs at a type matching the left hand-side, add a declaration that is an instance of the right hand-side for that type. We start with the distribution dist , together with its chunkwise variant: dist(T1 ,...,Tn ) (x , n) = (distT1 (π1n (x ), n), . . . , distTn (πnn (x ), n)) distµx .T (x , n) dist[T ] (x , n)
= distT {µx .T /x } (x , n) = P(distC[T ] (x , n), distInt (#x , n))
distCµx .T (xs, n) distC[T ] (xs, n)
= if xs? then distCT {µx .T /x } (xs↑, n) else = P(distC[T ] (F (xs), n), distCInt (S (xs), n))
The rules for tuples propagate the function inside. Vectors are distributed using the chunkwise version, of which the second rule demonstrates the handling of the maybe type α . We omit the rules for tuples in the following, as they have the same structure as above. The next set of rules covers chunkwise concatenation: xs ++µx .T ys = if xs? then (if ys? then xs↑ ++T {µx .T /x } ys↑ else xs) else ys xs ++[T ] ys
= P(F (xs) ++T F (ys)), S(xs) ++Int S(ys))
We continue with plain and chunkwise indexing. Note that indexing an empty vector of recursive type leads to an error as the index is out of bounds. indµx .T (xs, i) ind[T ] (xs, i)
= if xs? then indT {µx .T /x } (xs, i) else error = indCT (F (xs), S(xs), i)
indCµx .T (xs, s, i) = if xs? then indCT {µx .T /x } (xs, s, i) else indC[T ] (xs, s, i)
= P(indCT (F (xs), +Int -reduce ↑ (P (S(xs), s)), i), indCInt (S(xs), s, i))
For indC[T ] , the segment descriptor passed to the recursive call is computed by summing over the segments of the current level. For space reasons, we omit the rules for perm, pack , and combine. They are similar to ind , but, e.g., permuting an empty vector does not raise an error, but returns the empty vector. Finally, considering vectorized primitives, the rules for types (T1 , . . . , Tn ) and µx .T are as above; for [T ], the vectorized version is generated by vectorizing the above rules—this vectorization is essentially the same as the first step of the flattening transformation that was mentioned in the beginning of the present section. The details of this and of handling perm, pack , and combine can be found in [8]. Termination. Considering the termination of the above process, we see that for each primitive p, the rules for instances p(T1 ,...,Tn ) and p[T ] decrease the structural depth of the type subscript. Merely the rules for pµx .T may be problematic as they recursively unfold the type in their right hand-side. However, in
the definition of the kernel languages, we required that type variables bound by µ occur only as the element type of a vector. So, for all occurrences where the type unfolding substitutes the recursive type, we get an expression that requires p[µx.T]. If p is a chunk operation or ++, the right hand-side of an instance of the rule for p[T] with T = µx.T requires pµx.T, which is exactly the instance we started with and which, therefore, is already defined. If p is neither a chunk operation nor ++, it requires the corresponding chunkwise operation and we already discussed the termination for chunkwise operations.

Fig. 2. Absolute time and relative speed-up for Barnes-Hut on a Cray T3E (one simulation step, 9,000 and 20,000 particles, up to 20 processors).
4
Experimental Results
We measured the accumulated runtime of the expressions in the let-binding of bhTree (Section 2) for the original Nesl program using 15,000 particles and CMU-Nesl [2] (on a single-processor 200 MHz Pentium/Linux machine), to quantify the overhead of mapping the tree structure to a vector. From the overall runtime of 1.75 seconds, 0.5 seconds (29%) are spent on the operations introduced by the mapping. We gathered this data on a sequential machine, due to inaccurate profiling information from CMU-Nesl on the Cray T3E. The inefficiencies would be even worse in a parallel run, since these operations include algorithmically unnecessary reordering operations, which cause communication. To measure our proposed techniques for implementing recursive types, we timed an implementation of Barnes-Hut generated using the presented rules.* Figure 2 shows the timing of a single simulation step on a Cray T3E for 9,000 and 20,000 particles as well as the relative speedup and the absolute speedup (we only had access to 20 processors). The relative speedup is already quite close to the theoretical optimal speedup, but the absolute speedup shows that there is still room for improvement and we already know some possible optimizations. We did not compare CMU-Nesl and our implementation on the Cray T3E directly, since our code generation techniques for the flat language already outperform CMU-Nesl by more than an order of magnitude [9].
* The code was ‘hand-generated’; we are currently implementing a compiler.
5
Conclusion
After its introduction by Blelloch & Sabot, flattening has received considerable attention, but to the best of our knowledge, we are the first to extend flattening to tree structures. There are other approaches to implementing nested dataparallelism, and of course, there is a wealth of literature on trees in parallel computing, but we do not have the space to discuss this work here—some of it is discussed in [8]. Our extension is easily integrated into existing systems and the gathered benchmarks support the efficiency of the generated code.
Acknowledgements. We are indebted to Martin Simons and Wolf Pfannenstiel for fruitful discussions and helpful comments on earlier versions of this paper. Furthermore, we are grateful to the anonymous referees of EuroPar’98 for their valuable comments and the members of the CACA seminar (University of Tokyo) for lively discussions. The first author receives a PhD scholarship from the DFG (German Research Council). She thanks the Score group of the University of Tsukuba, especially, Tetsuo Ida, for the hospitality while being a guest at Score.
References
1. P. K. T. Au, M. M. T. Chakravarty, J. Darlington, Yike Guo, Stefan Jähnichen, G. Keller, M. Köhler, W. Pfannenstiel, and M. Simons. Enlarging the scope of vector-based computations: Extending Fortran 90 with nested data parallelism. In W. K. Giloi, editor, Proc. of the Intl. Conf. on Advances in Parallel and Distributed Computing. IEEE Computer Society Press, 1997.
2. G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. In 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 1993.
3. J. Barnes and P. Hut. A hierarchical O(n log n) force calculation algorithm. Nature, 324, December 1986.
4. G. E. Blelloch. Vector Models for Data-Parallel Computing. The MIT Press, 1990.
5. G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, 1996.
6. G. E. Blelloch and G. W. Sabot. Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing, 8:119–134, 1990.
7. S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. on Prog. Lang. and Systems, 15(3), 1993.
8. Gabriele Keller and Manuel M. T. Chakravarty. Flattening trees -- unabridged. Forschungsbericht 98-6, Technical University of Berlin, 1998. http://cs.tu-berlin.de/cs/ifb/TechnBerichteListe.html.
9. G. Keller. Transformation-based Implementation of Nested Parallelism for Parallel Computers with Distributed Memory. PhD thesis, Technische Universität Berlin, Fachbereich Informatik, 1998. Forthcoming.
10. G. Keller and M. Simons. A calculational approach to flattening nested data parallelism in functional languages. In J. Jaffar, editor, The 1996 Asian Computing Science Conference, LNCS. Springer Verlag, 1996.
11. J. Prins and D. Palmer. Transforming high-level data-parallel programs into vector operations. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 119–128, San Diego, CA, May 19-22, 1993. ACM.
12. D. Palmer, J. Prins, and S. Westfold. Work-efficient nested data-parallelism. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Processing (Frontiers 95). IEEE, 1995.
Dynamic Type Information in Process Types
Franz Puntigam
Technische Universität Wien, Institut für Computersprachen, Argentinierstr. 8, A-1040 Vienna, Austria. [email protected]
Abstract. Static checking of process types ensures that each object accepts all messages received from concurrent clients, although the set of acceptable messages can depend on the object’s state. However, conventional approaches of using dynamic type information (e.g., checked type casts) are not applicable in the current process type model, and the typing of self-references is too restrictive. In this paper a refinement of the model is proposed. It solves these problems so that it is easy to handle, for example, heterogeneous collections.
1
Introduction
The process type model was proposed as a statically checkable type model for concurrent and distributed systems based on active objects [10,11]. A process type specifies not only a set of acceptable messages, but also constraints on the sequence of these messages. Type safety can be checked statically by ensuring that each object reference is associated with an appropriate type mark which specifies all message sequences accepted by the object via this reference. The object accepts messages sent through different references in arbitrary interleaving. A type mark is a limited resource that represents a “claim” to send messages. It can be used up, split into several, more restricted type marks or forwarded from one user to another, but it must not be exceeded by any user. Support of using dynamic type information shall be added. But, checked type casts break type safety: An object’s dynamic type is not a “claim” to send messages. This problem is solved by checking against dynamic type marks. Each object has better knowledge about its own state than about the other objects’ states. The additional knowledge can be used in changing the selfreferences’ type marks in a very flexible and still type-safe way. The rest of the paper is structured as follows: An improved version of the typed process calculus presented in [11] is introduced in Sect. 2. Support of using dynamic type information is added in Sect. 3, a more flexible typing of self-references in Sect. 4. Static type checking is dealt with in Sect. 5.
2
A Typed Calculus of Active Objects
The proposed calculus describes systems composed of active objects that communicate through asynchronous message passing. An object has its own thread
of execution, a behavior, an identifier, and an unlimited buffer of received messages. According to its behavior, an object accepts messages from its buffer, sends messages to other objects, and creates new objects. All messages are received and accepted in the same logical order as they were sent. We use following notation. Constants (denoted by x, y, z, . . . ) are used as message selectors and in types. Identifiers (u, v, w, . . . ) in the infinite set V are used as object and procedure identifiers as well as formal parameters. For each ˜ is a sequence of constants. {˜ e} symbol e, e˜ is an abbreviation of e1 , . . . , en ; e.g., x denotes the smallest set containing e˜, and |˜ e| the length of the sequence. For each g for e1 ·g1 , . . . , en ·gn (|˜ e| = |˜ g |). symbol ·, e˜·g stands for e1 ·g, . . . , en ·g, and e˜·˜ Processes (p, q, . . . ) specify object behavior. Their syntax is: p s m α τ a ϕ µ
::= ::= ::= ::= ::= ::= ::= ::=
{˜ s} | u.m; p | u := (˜ up:ϕ); q | v$u˜ α, u ˜; p | u˜ α, u ˜ | u=v ? p | q x˜ u; p x˜ α, u˜ τ | ϕ a}[˜ x] | σ + τ | u {˜ a}[˜ x] | ∗u{˜ x˜ u:˜ µ, α ˜ [˜ y][˜ z] µ, α ˜ τ | ∗u˜ u:˜ µ, α ˜ τ | u ˜ u:˜ t | ot | pt
A selector {˜ s} specifies a possibly empty, finite set of guarded processes (r, s, . . . ). Each si is of the form x˜ u; p, where x is a message selector, u ˜ a list of formal parameters, and p a process to be executed if si is selected. An si is selectable if the first received message in the object’s buffer is of the form x˜ α, v˜, where α ˜ , v˜ are arguments to be substituted for the parameters u ˜ (|˜ α, v˜| = |˜ u|); α ˜ is a list of types and v˜ a list of object and procedure identifiers. A process u.m; p sends a message m to the object with identifier u, and then behaves as p. A up:ϕ); q introduces an identifier u of a procedure of procedure definition u := (˜ type ϕ and then behaves as q; the procedure takes u ˜ as parameters and specifies α, u ˜; p creates a new object with identifier v and the behavior p. A process v$u˜ then behaves as p; the new object behaves as u˜ α, u˜. A call u˜ α, u ˜ behaves as specified by the procedure with identifier u. A conditional expression u=v ? p | q behaves as p if the identifiers u and v are equal, otherwise as q. There are two kinds of types, object types (π, , σ, τ, . . . ) and procedure types (ϕ, ψ, . . . ). A type is denoted by α, β, γ, . . . if its kind does not matter. Furthermore, there are meta-types (µ, ν, . . . ): “ot” is the type of all object types, “pt” the type of all procedure types, and “t” the type of all types of any kind. The activating set [˜ x] of an object type {˜ a}[˜ x] is a multi-set of constants, the behavior descriptor {˜ a} a finite collection of message descriptors (a, b, c, . . . ). A µ, α ˜ [˜ y][˜ z ] describes a message with selector x, type pamessage descriptor x˜ u:˜ rameters u ˜ of meta-types µ ˜ , and parameters of types α ˜ . The u˜ can occur in α ˜ . The message descriptor is active if the activating set [˜ x] contains all constants in the multi-set [˜ y] (in-set ). When a corresponding message is sent (or accepted), the type is updated by removing the constants in the in-set from the activating set and adding those in the multi-set [˜ z ] (out-set ). Using type updating repeatedly
for all active message descriptors, the type specifies a set of acceptable message sequences. An expression ∗u{˜ a}[˜ x] is a recursive version of an object type. A type combination σ + τ specifies a set of acceptable message sequences that includes all arbitrary interleavings of acceptable message sequences specified by σ and τ . An identifier u can be used as type parameter. µ, α ˜ τ specifies that a procedure of this type takes type A procedure type ˜ u:˜ parameters u ˜ of meta-types µ ˜ and parameters of types α ˜ . Objects behaving according to the procedure accept messages as specified by τ . For example, a procedure B := (u{putv; {getw; w.backv; Bu}}:ϕB ) specifies the behavior of a buffer accepting “put” and “get” in alternation and sending “back” to the argument of “get” after receiving “get”. The type of B is given by ϕB =def u:t{putu[e][f], get{backu[once][]}[once][f][e]}[e]. The type of a procedure specifying the behavior of an infinite buffer that accepts as many “get” messages as there are elements in the buffer can be given by ϕBI =def u:t{putu[][f], get{backu[once][]}[once][f][]}[]. Each underlined occurrence of an identifier binds this and all following occurrences of the identifier. An occurrence is free if it is not bound. Free(e) denotes the set of all identifiers occurring free in e. Two processes are regarded as equal if they can be made identical by renaming bound identifiers (α-conversion) and repeatedly applying the equations {˜ r , s} = {˜ r , s, s} and {˜ r , s, s˜} = {˜ r , s˜, s} to selectors. Two types are regarded as equal if they can be made identical by renaming bound identifiers (α-conversion) and applying these equations: {˜ a, b} = {˜ a, b, b}
{˜ a, c, c˜} = {˜ a, c˜, c}
[˜ x, y, y˜] = [˜ x, y˜, y]
σ+τ =τ +σ ( + σ) + τ = + (σ + τ ) τ + {}[] = τ {˜ a}[˜ x, y˜] = {˜ a}[˜ x] (˜ y∈ / Eff {˜ a}) ∗uα = [∗uα/u]α {x˜ u:˜ µ, α[˜ ˜ x][˜ y , y˜ ], a ˜}[˜ z ] = {x˜ u:˜ µ, α ˜ [˜ x][˜ y ], a ˜}[˜ z ] (˜ y ∈ / {˜ x}∪Eff {˜ a}) {˜ a}[˜ x] + {˜ c}[˜ y ] = {˜ a, c˜}[˜ x, y˜] (˜ x ∈ Eff {˜ a}; y˜ ∈ Eff {˜ c}) where Eff {x1 ˜ u1 :˜ µ1 , α ˜ 1 [˜ x1 ][˜ y1 ], . . . , xi ˜ ui :˜ µi , α ˜ i [˜ xi ][˜ yi ]} = {˜ x1 } ∪ · · · ∪ {˜ xi }. Subtyping on meta-types is defined by ot ≤ t and pt ≤ t. Subtyping for types depends on an environment Π containing typing assumptions of the forms u ≤ α and α ≤ u. The subtyping relation ≤ on types and the redundancy elimination relation on object types are the reflexive, transitive closures of the rules: Π ∪ {u ≤ α} u ≤ α Π ∪ {α ≤ u} α ≤ u Π τ + σ Π π ≤ Π σ ≤ τ Π γ˜ ≤ α ˜ Π σ ≤ τ ν˜ ≤ µ ˜ Π σ≤τ Π π+σ ≤+τ Π ˜ u:˜ µ, ασ ˜ ≤ ˜ u:˜ ν , γ˜ τ Π {˜ a, c˜}[˜ x] {˜ a}[˜ x] (∀c ∈ {˜ c} • ∃a ∈ {˜ a} • c = x˜ u:˜ µ, α ˜ [˜ y , y˜ ][˜ z , z˜ ] ∧ z˜ ∈ / Eff {˜ a} ∧ a = x˜ u:˜ ν , γ˜ [˜ y ][˜ z , z˜ ] ∧ µ ˜ ≤ ν˜ ∧ Π α ˜ ≤ γ˜ ) (α ≤ β and σ τ are simplified notations for ∅ α ≤ β and ∅ σ τ .) The relation is used for removing redundant message descriptors from behavior descriptors. Using these definitions we can show, for example, ϕBI ≤ ϕB .
A type {˜ a}[˜ x] is deterministic if all message descriptors in {˜ a} have pairwise different message selectors. For each deterministic type σ and message m there is only one way of updating σ according to m. A system configuration C contains expressions of the forms u → p, m ˜ and u → ˜ up:ϕ. The first expression specifies an object with identifier u, behavior p and sequence of received messages m. ˜ The second expression specifies a procedure → of identifier u. C contains at most one expression u → e for each u ∈ V. C[u1 e1 , . . . , u n → en ] is a system configuration with expressions added to C, where → in C. u1 , . . . , un are pairwise different and do not occur to the left of u For each u ∈ V the relation −→ on system configurations (defined by the rules given below) specifies all possible execution steps of object u. ([˜ e/˜ u]f is the simultaneous substitution of the expressions e˜ for all free occurrences of u ˜ in f .) u
α, v˜, m] ˜ −→ C[u → ([˜ α, v˜/˜ u]p), m] ˜ C[u → {x˜ u; p, s˜}, x˜ u
C[u → v.m; p, m, ˜ v → q, m ˜ ] −→ C[u → p, m, ˜ v → q, m ˜ , m] u
up:ϕ); q, m] ˜ −→ C[u → q, m, ˜ v → ˜ up:ϕ] C[u → v := (˜ u
α, u ˜; p, m] ˜ −→ C[u → p, m, ˜ v → w˜ α, u ˜] C[u → v$w˜ u
C[u → v˜ α, u ˜, m, ˜ v → ˜ v p:ϕ] −→ C[u → ([˜ α, u ˜/˜ v]p), m, ˜ v → ˜ v p:ϕ] u
→ p, m] ˜ C[u → v=v ? p | q, m] ˜ −→ C[u u
→ q, m] ˜ C[u → v=w ? p | q, m] ˜ −→ C[u
(v = w) ∗
−→ denotes the closure of these relations over all u ∈ V, and −→ the reflexive, ∗ transitive closure of −→. −→ defines the operational semantics of object systems.
3
Dynamic Type Information
It seems to be easy to extend the calculus with a dynamic type checking concept: Processes of the form α≤β ? p | q are introduced, where α or β initially is a type parameter to be replaced with a (dynamic) type during computation. The typing assumption α ≤ β (for α or β in V) is statically known to hold in p. The relation u −→ (for each u ∈ V) is extended by the rules: u
→ p, m] ˜ (α ≤ β) C[u → α≤β ? p | q, m] ˜ −→ C[u u
→ q, m] ˜ (α ≤ β) C[u → α≤β ? p | q, m] ˜ −→ C[u For example, assume that a reference w is of type mark St[f], where St = {putu:t, u[e][f], getR[one][f][e]} and R =def {backv:t, v[one][]}. The def process w.getw ; {backu, v; u≤St[e] ? v.putSt[f], v; {} | {}} (w denotes the object’s self-reference) is type-safe. First, “get” is sent to w; then “back” is accepted. If the type mark of v is a subtype of St[e], v is put into v. The type used as argument of “put” is St[f] because the type is updated when sending “put”. Type information is lost when types are updated. St[e] is a supertype of v’s type mark u; but, putu, v cannot be sent to v: So far it is not possible to specify
that v’s type mark is equal to u, except that u is updated. To solve the problem, processes of the form σ≤τ ? v; p | q are introduced, where σ initially is a type u parameter to be replaced with an object type. The relation −→ is extended: u
C[u → σ≤τ ? v; p | q, m] ˜ −→ C[u → ([/v]p), m] ˜ (τ + σ) u
C[u → σ≤τ ? v; p | q, m] ˜ −→ C[u → q, m] ˜ (σ ≤ τ ) v in p is replaced with a dynamically determined object type which specifies the difference between σ and τ . This version of the above process keeps type information: w.getw ; {backu, v; u≤St[e] ? u ; v.putSt[f] + u , v; {} | {}}. The type St[f] + u equals u after removing e and adding f to the activating set.
4
Typing Self-References
Without special treatment, static typing of self-references is very restrictive: If an object behaves according to a type τ , the type mark σ of a self-reference is a subtype of τ . Both, σ and τ specify all messages that may return results of possibly needed services. In general, it is impossible to determine statically which services are needed. Hence, σ and τ usually specify much larger sets of message sequences than needed. This problem can be solved because each object has much more knowledge about its own state than other objects: Before a message with a self-reference as argument is sent to a server, this reference’s type mark is selected so that it supports the messages returned by the server. At the same time the object’s type is extended correspondingly. The object’s type and selfreference’s type mark are determined dynamically as needed in the computation. We extend the calculus: The distinguished name “self” can be used as arguments in procedures; “self” is replaced with a self-reference when the procedure u is called. The fifth rule in the definition of −→ has to be replaced with: u
→ ([u/self][˜ α, u ˜/˜ v ]p), m, ˜ v → ˜ v p:ϕ] C[u → v˜ α, u ˜, m, ˜ v → ˜ v p:ϕ] −→ C[u The type mark of an occurrence of “self” as argument is equal to the type of the corresponding formal parameter. (Elements of V ∪ {self} are denoted by o, . . . ) An example demonstrates the use of self and dynamic type comparisons: w := ((u, u , v, w u≤St[f] ? v.getself; {backu, v; wu, u , v, w } | w u, v) : u:t, u :ot, u, u:t, uu u ); . . . The procedure w takes as parameters a type u, an object type u , an identifier v of type u and a procedure identifier w of type u:t, uu . If v is a non-empty store, an element is got from the store and w is called recursively with the element and its type as arguments. Otherwise w is called with u and v as arguments. The type mark of self is R[one], as specified by St. The corresponding message “back” is accepted if the expression containing self is executed.
5
Static Type Checking
An important result is that the process type model with the proposed extensions supports static type checking. Type checking rules are given in Appendix A. → p1 , . . . , un → pn } be a system configuration Theorem 1. Let C = {u1 and σi,j and τi (1 ≤ i, j ≤ n) object types such that τi ≤ σi,1 + · · · + σi,n and ∅, {u1 :σ1,i , . . . , un :σn,i } pi :τi as defined by the type checking rules. Then, ∗ for each system configuration D with C −→ D, if D contains an expression u u → p, m, m, ˜ there exists a system configuration E with D −→ E. (The proof is omitted because of lack of space.) Provided that type checking succeeds for an initial system configuration, if an object has a nonempty buffer of received messages, the execution of this object cannot be blocked. Especially, the next message in the buffer is acceptable. Theorem 1 also implies that separate compilation is supported: If a process pi in C is replaced with another pi satisfying the conditions, the consequences of the theorem still hold.
6
Related Work
Much work on types for concurrent languages and models was done. The majority of this work is based on Milner’s π-calculus [3,4] and similar calculi. Especially, the problem of inferring most general types was considered by Gay [2] and Vasconcelos and Honda [14]. Nierstrasz [5], Pierce and Sangiorgi [8], Vasconcelos [13], Colaco, Pantel and Sall’e [1] and Ravara and Vasconcelos [12] deal with subtyping. But their type models differ in an important aspect from the process type model: They cannot represent constraints on message sequences and ensure statically that all sent messages are acceptable; the underlying calculus does not keep the message order. The proposals of Nielson and Nielson [6] can deal with constraints on message sequences. As in the process type model, a type checker updates type information while walking through an expression. However, their type model cannot ensure that all sent messages are understood, and dealing with dynamic type information is not considered. The process type model can ensure statically that all sent messages are understood, although the set of acceptable messages can change. Therefore, this model is a promising approach to strong, static types for concurrent and distributed applications based on active objects. However, it turned out that the restrictions caused by static typing are an important difficulty in the practical applicability of the process type model. So far it was not clear how to circumvent this difficulty without breaking type safety. New techniques for dealing with dynamic type information (as presented in this paper) had to be found. The approach of Najm and Nimour [7] has a similar goal as process types. However, this approach is more restrictive and cannot deal with dynamic type information, too.
The work presented in this paper refines previous work on process types [9,10,11]. The major improvement is the addition of dynamic type comparisons and a special treatment of self-references. Some changes to earlier versions of the process type model were necessary. For example, the two sets of type checking rules presented in [11] had to be modified and combined into a single set so that it was possible to introduce “self”. Dynamic type comparisons exist in nearly all recent object-oriented programming languages. However, the usual approaches cannot be combined with process types because these approaches do not distinguish between object types and type marks and cannot deal with type updates. Conventional type models do not have problems with the flexibility of typing self-references.
7
Conclusions
The process type model is a promising basis for strongly typed concurrent programming languages. The use of dynamic type information can be supported in several ways so that the process type model becomes more flexible and useful. An important property is kept: A type checker can ensure statically that clients are coordinated so that all received messages can be accepted in the order they were sent, even if the acceptability of messages depends on the server’s state.
References

1. J.-L. Colaco, M. Pantel, and P. Sallé. A set-constraint-based analysis of actors. In Proceedings FMOODS ’97, Canterbury, United Kingdom, July 1997. Chapman & Hall. 725
2. Simon J. Gay. A sort inference algorithm for the polyadic π-calculus. In Conference Record of the 20th Symposium on Principles of Programming Languages, January 1993. 725
3. Robin Milner. The polyadic π-calculus: A tutorial. Technical Report ECS-LFCS-91-180, Dept. of Comp. Sci., Edinburgh University, 1991. 725
4. R. Milner, J. Parrow, and D. Walker. A calculus of mobile processes (parts I and II). Information and Computation, 100:1–77, 1992. 725
5. Oscar Nierstrasz. Regular types for active objects. ACM SIGPLAN Notices, 28(10):1–15, October 1993. Proceedings OOPSLA’93. 725
6. Flemming Nielson and Hanne Riis Nielson. From CML to process algebras. In Proceedings CONCUR’93, number 715 in Lecture Notes in Computer Science, pages 493–508. Springer-Verlag, 1993. 725
7. E. Najm and A. Nimour. A calculus of object bindings. In Proceedings FMOODS ’97, Canterbury, United Kingdom, July 1997. 725
8. Benjamin Pierce and Davide Sangiorgi. Typing and subtyping for mobile processes. In Proceedings LICS’93, 1993. 725
9. Franz Puntigam. Flexible types for a concurrent model. In Proceedings of the Workshop on Object-Oriented Programming and Models of Concurrency, Torino, June 1995. 726
10. Franz Puntigam. Types for active objects based on trace semantics. In Elie Najm et al., editor, Proceedings FMOODS ’96, Paris, France, March 1996. IFIP WG 6.1, Chapman & Hall. 720, 726
11. Franz Puntigam. Coordination requirements expressed in types for active objects. In Mehmet Aksit and Satoshi Matsuoka, editors, Proceedings ECOOP ’97, number 1241 in Lecture Notes in Computer Science, Jyväskylä, Finland, June 1997. Springer-Verlag. 720, 726
12. António Ravara and Vasco T. Vasconcelos. Behavioural types for a calculus of concurrent objects. In Proceedings Euro-Par ’97, Lecture Notes in Computer Science. Springer-Verlag, 1997. 725
13. Vasco T. Vasconcelos. Typed concurrent objects. In Proceedings ECOOP’94, number 821 in Lecture Notes in Computer Science, pages 100–117. Springer-Verlag, 1994. 725
14. Vasco T. Vasconcelos and Kohei Honda. Principal typing schemes in a polyadic pi-calculus. In Proceedings CONCUR’93, July 1993. 725
A
Type Checking Rules
[Type checking rules (sel), (send), (def), (new), (call), (equ1), (equ2), (sub1), (sub2), (split), (obj), (self), (proc), (tpar), (objt1), (objt2) and (proct), together with the auxiliary definitions of Act, a, ψ and Γ.]
Generation of Distributed Parallel Java Programs Pascale Launay and Jean-Louis Pazat IRISA, Campus de Beaulieu, F35042 RENNES cedex [email protected] [email protected]
Abstract. The aim of the Do! project is to ease the standard task of programming distributed applications using Java. This paper gives an overview of the parallel and distributed frameworks and describes the mechanisms developed to distribute programs with Do!.
1
Introduction
Many applications have to cope with parallelism and distribution. The main targets of these applications are networks and clusters of workstations (nows and cows) which are cheaper than supercomputers. As a consequence of the widening of parallel programming application domains, programming tools have to cope both with task and data parallelism for distributed program generation. The aim of the Do! project is to ease the task of programming distributed applications using object-oriented languages (namely Java). The Do! programming model is not distributed, but is explicitly parallel. It relies on structured parallelism and shared objects. This programming model is embedded in a framework described in section 2, without any extension to the Java language. The distributed programs use a distributed framework described in section 3. Program distribution is expressed through the distribution of collections of tasks and data; the code generation is described in section 4.
2
Parallel Framework
In this section, we give an overview of the parallel framework described in [8]. The aim of this framework is to separate computations from control and synchronizations between parallel tasks, allowing the programmer to concentrate on the definition of tasks. This framework provides a parallel programming model without any extension to the Java language. It is based on active objects (tasks) and structured parallelism (through the class par) that allows the execution of tasks grouped in a collection in parallel. The notion of active objects is introduced through task objects. A task is an object whose type extends the task class, which represents a model of task and is part of the framework class library. The task default behavior is inherited from the behavior described in the task class through its run method. This behavior
can be re-defined to implement a specific task behavior. A task can be activated synchronously or asynchronously. We have extended the operators design pattern [6], designed to express regular operations over collections through operators, in order to write data-parallel spmd programs: a collection manages the storage and the accesses to elements; an operator represents an autonomous agent processing elements. Including the concept of active objects, we offer a parallel programming model integrating task parallelism: active and passive objects are stored in collections; a parallel program consists of a processing over task collections, task parameters being grouped in data collections. The par class implements the parallel activation and synchronization of tasks, providing us with structured parallelism. Nested parallelism can be expressed, the par class being itself a task. Figure 1 shows an example of a simple parallel program, using collections of type array; the class my_task represents the program-specific tasks; it extends task, and takes an object of type my_data as parameter.

import do.shared.*;

/* the task definition */
public class my_task extends task {
  public void run (Object param) {
    my_data data = (my_data)param;
    data.select(criterion);
    data.print(out);
    data.add(value);
  }
}

/* the task parameter */
public class my_data {
  public void add(...) { ... }
  public void remove(...) { ... }
  public void print(...) { ... }
  public void select(...) { ... }
}

public class simple_parallel {
  public static void main (String argv[]) {
    /* task and data array initializations */
    array tasks = new array(N);
    array data = new array(N);
    for (int i=0; i
Fig. 1. A simple parallel program
3
Distributed Framework
The distributed framework [9] is used as a target of our preprocessor to distribute parallel programs expressed with our parallel framework. In the parallel framework, we use collections to manage the storage and accesses to active and passive objects. The distributed framework is based on distributed collections: a distributed collection is a collection that manages elements mapped on distinct processors, the location of the elements being masked to the user: when a client retrieves a remote element through a distributed collection, it gets a remote reference to the element, that can be invoked transparently. Distributed tasks are activated in parallel by remote asynchronous invocations. A distributed collection is
composed of fragments mapped on distinct processors, each fragment managing the local subset of the collection elements. A local fragment is a non distributed collection; the distributed collection processes an access to a distributed element by remote invocation to the fragment owning this element (local to this element). To access an element of a distributed collection, a client identifies this element with a global identifier (relative to the whole set of elements). The distributed collection has to identify the fragment owning the element and transform the global identifier into a local identifier relevant to the local fragment. The task of converting a global identifier into the corresponding local identifier and the owner identifier devolves on a layout manager object. Different distribution policies are provided by different types of layout managers. The program distribution is guided by the user choice of a specific layout manager implementation. Nevertheless the distributed framework does not manage the remote creations and accesses. They are handled by a specific runtime, and require to transform objects of the program (section 4).
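To make the role of the layout manager concrete, the following sketch shows a possible block-wise layout manager written in Java; the interface and its two methods are assumptions introduced for this illustration and do not claim to be the actual Do! interfaces.

// Hedged sketch of the layout-manager idea: a global element identifier is
// mapped to the owning fragment and to a local identifier inside that fragment.
// The interface and the names are assumptions for illustration only.
interface layout_manager {
  int owner(int globalId);    // identifier of the fragment holding the element
  int localId(int globalId);  // identifier of the element inside that fragment
}

// Block distribution: element i is stored on fragment i / blockSize.
class block_layout implements layout_manager {
  private final int blockSize;
  block_layout(int nbElements, int nbFragments) {
    this.blockSize = (nbElements + nbFragments - 1) / nbFragments;
  }
  public int owner(int globalId)   { return globalId / blockSize; }
  public int localId(int globalId) { return globalId % blockSize; }
}

A cyclic layout would implement the same two methods with modulo arithmetic on the number of fragments, which is how different distribution policies can be selected by the user without changing the code that uses the collection.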
4
Distributed Code Generation
We have developed a preprocessor to transform parallel programs expressed with our parallel framework into distributed programs, using our distributed framework. The parallel and distributed frameworks have the same interface, so the framework replacement is obtained by changing the imports in the program. Despite this, we have to transform some objects of the program: the objects stored in the distributed collections are distributed upon processors and need to have access to other objects. So, we have to: – create (map) objects of the program on processors: when an object is located on a processor, its attributes are managed by the local memory and its methods run on this processor; – have a mechanism allowing objects to be accessed remotely without any change from the caller point of view. We have extended the object creation semantics to take into account remote creations of objects. Servers running on each host as separate threads are responsible for remote object creations. This extension is implemented using the standard Java reflection mechanism. To allow the transparent accesses to remote objects, the Do! preprocessor transforms a class into two classes: – the implementation class contains the source methods implementations. An implementation object is not replicated and is located where the source object has been created (mapped). It is shared between all proxy objects. – the proxy class has the same name and interface as the source class, but the method bodies consist in remote invocations of the corresponding methods in the implementation class. The proxy object handles a remote reference on the implementation object; it catches the invocations to the source object and redirects them to the right host. The proxy object state is never modified, so it can be replicated on each processor getting a reference on the source object.
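As an illustration of this transformation, the sketch below shows how a hypothetical source class counter could be split; the generated names (counter_impl, counter_remote), the use of Java RMI and the exact signatures are assumptions made for this example, not the exact code emitted by the Do! preprocessor.

// Remote interface shared by the proxy and the implementation (assumed name).
interface counter_remote extends java.rmi.Remote {
  void inc() throws java.rmi.RemoteException;
  int  get() throws java.rmi.RemoteException;
}

// Implementation class: holds the state, stays on the processor where the
// source object was created (mapped), and is shared by all proxies.
class counter_impl extends java.rmi.server.UnicastRemoteObject implements counter_remote {
  private int value;
  counter_impl() throws java.rmi.RemoteException { }
  public synchronized void inc() { value++; }
  public synchronized int  get() { return value; }
}

// Proxy class: it keeps the name and interface of the source class, every
// method body is a remote invocation on the implementation, and it holds no
// state of its own apart from the remote reference, so it can be replicated.
class counter {
  private final counter_remote impl;
  counter(counter_remote impl) { this.impl = impl; }
  public void inc() { try { impl.inc(); } catch (java.rmi.RemoteException e) { throw new RuntimeException(e); } }
  public int  get() { try { return impl.get(); } catch (java.rmi.RemoteException e) { throw new RuntimeException(e); } }
}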
5
Related Work
Many research projects have appeared around the Java language, aiming at filling the gaps of parallelism and distribution in Java. Some of them are presented in [1]. Some projects are based on parallel extensions to the Java language: tools [2,4] produce Java parallel (multi-threaded) programs, relying on a standard Java runtime system using thread libraries and synchronization primitives; they do not generate distributed programs; others [7,10] are based on distributed objects and remote method invocations. Other projects use the Java language without any extension: some environments [5] rely on a data-parallel programming model and a spmd execution model; as in the Do! project, parallelism may be introduced through the notion of active objects [3].
6
Conclusion
In this paper, we have presented an overview of the Do! project, that aims at automatic generation of distributed programs from parallel programs using the Java language. We use the Java rmi as run-time for remote accesses; a foreseen extension of this work is to use corba for objects communications. Ongoing work consists in extending the parallel and distributed frameworks to handle dynamic creation of tasks, through dynamic collections (e.g. lists) and distributed scheduling.
References

1. ACM 1997 Workshop on Java for Science and Engineering Computation. Concurrency: Practice and Experience, 9(6):413–674, June 1997. 732
2. A. J. C. Bik and D. B. Gannon. Exploiting implicit parallelism in Java. Concurrency, Practice and Experience, 9(6):579–619, 1997. 732
3. D. Caromel. Towards a method of object-oriented concurrent programming. Communications of the ACM, 36(9):90–102, September 1993. 732
4. Y. Ichisugi and Y. Roudier. Integrating data-parallel and reactive constructs into Java. In OBPDC’97, France, October 1997. 732
5. V. Ivannikov, S. Gaissaryan, M. Domrachev, V. Etch, and N. Shtaltovnaya. DPJ: Java class library for development of data-parallel programs. Institute for System Programming, Russian Academy of Sciences, 1997. 732
6. J.-M. Jézéquel and J.-L. Pacherie. Parallel operators. In P. Cointe, editor, ECOOP’96, number 1098 in LNCS, Springer Verlag, pages 384–405, July 1996. 730
7. L. V. Kalé, M. Bhandarkar, and T. Wilmarth. Design and implementation of Parallel Java with a global object space. In Conference on Parallel and Distributed Processing Technology and Applications, Las Vegas, Nevada, July 1997. 732
8. P. Launay and J.-L. Pazat. A framework for parallel programming in Java. In HPCN’98, LNCS, Springer Verlag, Amsterdam, April 1998. To appear. 729
9. P. Launay and J.-L. Pazat. Generation of distributed parallel Java programs. Technical Report 1171, Irisa, February 1998. 730
10. M. Philippsen and M. Zenger. JavaParty – transparent remote objects in Java. In PPoPP, June 1997. 732
An Algebraic Semantics for an Abstract Language with Intra-Object-Concurrency Thomas Gehrke Institut für Informatik, Universität Hildesheim Postfach 101363, D-31113 Hildesheim, Germany [email protected]
Object-oriented specification and implementation of reactive and distributed systems are of increasing importance in computer science. Therefore, languages like Java [1] and Object REXX [3,7] have been introduced which integrate object-orientation and concurrency. An important area in programming language research is the definition of semantics, which can be used for verification issues and as a basis for language implementations. In recent years several semantics for concurrent object-oriented languages have been proposed which are based on the concepts of process algebras; see, for example, [2,6,8,9]. In most of these semantics, the notions of processes and objects are identified: Objects are represented by sequential processes which interact via communication actions. Therefore, in each object only one method can be active at a given time. Furthermore, language implementations based on these semantics tend to be inefficient, because the simulation of message passing among local objects by communication actions is more expensive than procedure calls in sequential languages. In contrast to this identification of objects with processes, languages like Java and Object REXX allow for intra-object-concurrency by distinguishing passive objects and active processes (in Java, processes are objects of a special class). Objects are data structures containing methods for execution, while system activity is done by processes. Therefore, different processes may perform methods of the same object at the same time. To avoid inconsistencies of data, the languages contain constructs to control intra-object-concurrency. In this paper, we introduce an abstract language based on Object REXX for the description of the behaviour of concurrent object-based systems, i.e. we ignore data and inheritance. The operational semantics of this language is defined by a translation of systems into terms of a process calculus developed previously. This calculus makes use of process creation and sequential composition instead of the more common action prefixing and parallel composition. Due to the translation of method invocation into process calls similar to procedure calls in sequential languages, the defined semantics is appropriate as a basis for implementation. The language O: An O-system consists of a set of objects and an additional sequence of statements, which describes the initial system behaviour. Each object is identified by a unique object identifier O and must contain at least one
This work was supported by the DFG grant EREAS (Entwurf reaktiver Systeme).
object O1
  method m1
    a; b
  method m2
    a; reply; b
  method m3
    a ✷ reply; b

object O2
  method m1 guarded
    c; reply; d
  method m2 guarded
    e; guard off; reply; f

object O3
  method m1 guarded
    a; reply; b

object O4
  method m1 guarded
    c; d
Fig. 1. O examples.
method. The special object identifier self can be used for the access of methods of the calling object (self is not allowed in the initial statement sequence). Methods are identified by method identifiers m and must contain at least one statement. Statements include single instructions and nondeterministic choices between sequences: s1 ✷s2 performs either s1 or s2 . We distinguish five kinds of instructions: atomic actions a representing the visible actions of systems (e.g. access to common resources), methods calls O.m and the three control instructions reply, guard on and guard off. ; denotes sequential composition; for syntactical convenience we assume that ; has a higher priority than ✷ (e.g., a ✷ b; c is a ✷ (b; c)). System runs are represented by sequences a1 ...an of atomic actions, called traces. During the execution of a called method, the caller is blocked until the called method has terminated. For example, the execution of the sequence O1.m1; c in combination with the object O1 in Figure 1 generates the trace a b c. In some cases it is desirable to continue the execution of the caller before the called method has finished (e.g., the execution of the remaining instructions of the called method does not influence the result delivered to the caller). The instruction reply allows the caller to continue its execution concurrently to the remaining instructions of the called method. Therefore, the execution of O1.m2; c leads to the traces a b c and a c b. Furthermore, the execution of O1.m3; c leads to the traces a c, b c and c b. Methods of different objects can always be executed in parallel. Concurrency within objects can be controlled by the notion of guardedness. If a method m of an object O is guarded, no other method of O is allowed to become active when m is taking place. If m is unguarded, other methods of O can be performed in parallel with m. The option guarded declares a method to be guarded. The instructions guard on and guard off allow to change the guardedness of a method during its execution. Consider the object O2 in Figure 1 in which both methods are declared as guarded. Although O2.m1 contains a reply instruction, the execution of the sequence O2.m1; O2.m2 can only generate the trace c d e f , because the guardedness of the methods prevents their concurrent execution. On
Fig. 2. Transition rules of the calculus P (rules T1–T4 and R1–R9).
the other hand the sequence O2.m2; O2.m1 leads to the traces e f c d and e c d f (the trace e c f d is prevented by the guardedness of O2.m1). Guardedness does not play a role in the initial statement sequence. As a third example, consider the sequence O3.m1; O4.m1. This sequence leads to the possible traces a b c d, a c b d and a c d b, because the called methods belong to different objects. The process calculus P: To model the behaviour of O-systems, we introduce a process calculus P, which is a slightly modified version of a calculus studied in [5]. We assume a countable set C of action names, ranged over by a, b, c. A name a can be used either for input, denoted a?, or for output, denoted with only the name a itself. We sometimes use a† as a “meta-notation” denoting either a? or a. The set of actions is denoted A = {a† | a ∈ C} ∪ {τ }, ranged over by α, β. τ is a special action to indicate internal behaviour of a process. Ω = A ∪ {δ}, ranged over by ω, is the set of transition labels, where δ signals successful termination. N , ranged over by n, n , is a set of names of processes. The calculus P, ranged over by t, u, v, is defined through the following grammar: t ::= 0 | 1 | α | n | t; t | spawn(t) | t + t | (a : t) 0 denotes the inactive process, 1 denotes a successfully terminated term. A process name n is interpreted by a function Θ : N → P, called process environment, where n denotes a process call of Θ(n). t; u denotes the sequential composition of t and u, i.e. u can perform actions when t has terminated. spawn(t) creates a new process which performs t concurrently to the spawning process: spawn(t); u represents the concurrent execution of t and u. The choice operator t + u performs either t or u. (a : t) restricts the execution of t to the actions in A \ {a, a?}. The semantics of P is given by the transition rules in Figure 2. Translation: An object system is translated into a process environment Θ and a term representing the initial statement sequence. The set N of process names
O1 m1 → a; b
O1 m2 → a; spawn(b)
O1 m3 → τ; a + τ; spawn(b)
O2 m1 → o2 l; c; spawn(d; o2 u)
O2 m2 → o2 l; e; o2 u; spawn(o2 l; f; o2 u)
O2 mutex → o2 l?; o2 u?; O2 mutex
O3 m1 → o3 l; a; spawn(b; o3 u)
O3 mutex → o3 l?; o3 u?; O3 mutex
O4 m1 → o4 l; c; d; o4 u
O4 mutex → o4 l?; o4 u?; O4 mutex
Fig. 3. Process semantics of the objects in Figure 1.
is defined as N = {O m | O object, m method of O} ∪ {O mutex | O object}. For each method m of an object O, we include a process name O m into the set N; the process term Θ(O m) models the behaviour of the body of m. For example, the method m1 of object O1 in Figure 1 is translated into O1 m1 → a; b (see Figure 3). The additional O mutex processes are used for synchronizing methods. Method calls are translated into process calls of the corresponding process definitions. Therefore, the statement sequence O1.m1; c is translated into O1 m1; c, which is able to perform the following transitions: O1 m1; c −τ→ a; b; c −a→ 1; b; c −b→ 1; 1; c −c→ 1; 1; 1. Note that the sequence of the transition labels without τ corresponds to the trace for O1.m1; c. The translation of the reply-instruction is realized by enclosing the remaining instructions of the method in a spawn-operator. For example, the method m2 of O1 is translated into the process O1 m2 → a; spawn(b). The sequence O1.m2; c is translated into O1 m2; c, which leads to the following transition system:
O1 m2; c −τ→ a; spawn(b); c −a→ 1; spawn(b); c
1; spawn(b); c −b→ 1; spawn(1); c −c→ 1; spawn(1); 1
1; spawn(b); c −c→ 1; spawn(b); 1 −b→ 1; spawn(1); 1
Choice operators s1 ✷ s2 are translated into terms τ ; t1 + τ ; t2 where t1 , t2 are the translations of s1 and s2 . The initial τ -actions simulate the internal nondeterminism of the ✷-operator. If an instruction sequence contains subterms of the form (s1 ✷ s2 ); s3 , we have to distribute sequential composition over choice, i.e. (s1 ✷ s2 ); s3 has to be transformed into s1 ; s3 ✷ s2 ; s3 before the translation into the process calculus can be applied. This transformation is necessary for correct translation of the reply-instruction. For example, the
sequence (a; b ✷ reply; c); d is transformed into a; b; d ✷ reply; c; d, and then translated into τ ; a; b; d + τ ; spawn(c; d). Guardedness is implemented by mutual exclusion with semaphores. In the absence of data, we have to simulate semaphores by communication. For each object O using guardedness, we introduce a special semaphore process → o l?; o u?; O mutex. Communication on channel o l means lockO mutex ing the semaphore, communication on o u the corresponding release (unlock). To ensure that only one guarded method of an object can be active at the same time, methods of objects using guardedness have to interact with the corresponding semaphore process. For example, consider the translation of object O2 in Figure 3. In order to integrate unguarded actions into the synchronization mechanism, we have to enclose every unguarded action by lock and unlock actions. Therefore, in O2.m2 the instruction f is translated into o2 l; f ; o2 u. Without the synchronization actions, unguarded actions could be performed although a guarded method has locked the semaphore. In objects without guardedness, the insertion of lock and unlock actions is not necessary and therefore omitted (see Figure 3). To enforce communication over the l and u channels, we have to restrict these actions to the translation of the initial instruction sequence. Furthermore, the semaphore process has to be spawned initially. Therefore, the initial statement sequence O2.m1; O2.m2 is translated into ({o2 l, o2 u} : spawn(O2 mutex); O2 m1; O2 m2). The semaphore process is spawned initially and runs concurrently to the method calls. o2 l and o2 u are restricted, therefore they can only be performed in communications between O2 m1, O2 m2 and the semaphore process. It is easy to see that the only possible trace is c d e f (with τ -actions omitted). In the full paper [4], the translation of O-systems is defined via translation functions. Furthermore, weak bisimulation is used as a equivalence relation on object systems.
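For readers who prefer a runtime analogy, the hedged Java sketch below mimics this discipline for object O2 of Figure 1: a binary semaphore plays the role of the O2 mutex process, acquire/release correspond to communications on o2 l and o2 u, and starting a new thread corresponds to spawn. The class and method names are chosen only for this illustration; the paper's semantics is the process-calculus translation itself, not this code.

import java.util.concurrent.Semaphore;

// Hedged runtime analogy of the guardedness encoding for object O2.
class o2_analogy {
  private final Semaphore mutex = new Semaphore(1);   // plays the role of O2 mutex

  // method m1 (guarded): c; reply; d   ~   o2 l; c; spawn(d; o2 u)
  void m1() throws InterruptedException {
    mutex.acquire();                  // o2 l
    action("c");
    new Thread(() -> {                // reply: the remainder runs concurrently
      action("d");
      mutex.release();                // o2 u: the guard is held until d is done
    }).start();
  }

  // method m2 (guarded): e; guard off; reply; f   ~   o2 l; e; o2 u; spawn(o2 l; f; o2 u)
  void m2() throws InterruptedException {
    mutex.acquire();                  // o2 l
    action("e");
    mutex.release();                  // guard off: o2 u
    new Thread(() -> {
      try {
        mutex.acquire();              // the unguarded f is bracketed by lock/unlock
        action("f");
        mutex.release();
      } catch (InterruptedException ignored) { }
    }).start();
  }

  private void action(String name) { System.out.println(name); }
}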
References

1. K. Arnold and J. Gosling. The Java Programming Language. Addison-Wesley, 1996. 733
2. P. Di Blasio and K. Fisher. A Calculus for Concurrent Objects. In Proceedings of CONCUR ’96, LNCS 1119. Springer, 1996. 733
3. T. Ender. Object-Oriented Programming with REXX. John Wiley and Sons, Inc., 1997. 733
4. T. Gehrke. An Algebraic Semantics for an Abstract Language with Intra-Object-Concurrency. Technical Report HIB 7/98, Institut für Informatik, Universität Hildesheim, May 1998. 737
5. T. Gehrke and A. Rensink. Process Creation and Full Sequential Composition in a Name-Passing Calculus. In Proceedings of EXPRESS ’97, vol. 7 of Electronic Notes in Theoretical Computer Science. Elsevier, 1997. 735
6. K. Honda and M. Tokoro. An Object Calculus for Asynchronous Communication. In Proceedings of ECOOP ’91, LNCS 512. Springer, 1991. 733
7. C. Michel. Getting Started with Object REXX. In Proceedings of the SHARE Technical Conference, March 1996. 733
8. E. Najm and J.-B. Stefani. Object-Based Concurrency: A Process Calculus Analysis. In Proceedings of TAPSOFT ’91 (vol. 1), LNCS 493. Springer, 1991. 733
9. D. Walker. Objects in the π-Calculus. Information and Computation, 116:253–271, 1995. 733
An Object-Oriented Framework for Managing the Quality of Service of Distributed Applications Stéphane Lorcy and Noël Plouzeau Irisa Campus de Beaulieu - 35042 Rennes Cedex - France [email protected]
Abstract. In this paper we present a framework aiming at easing the design and implementation of distributed object-oriented applications, especially interactive ones, where quality of service has to be monitored and controlled. Our framework relies on a new computation model based on a contract concept. Contracts are used to specify execution requirements and to monitor the remote execution of methods; they enable a good separation of concern between the application domain issues and the distribution domain ones, while promoting structured interactions between objects in these two domains, especially regarding quality of service.
1
Introduction
It is now a well-known fact that the design, implementation and testing of distributed applications pose many challenges to the computer engineer. A potential and partial cure of these problems may lie in the use of object-oriented techniques. But parallel and distributed applications have specific problems requiring the development of specific object-oriented techniques. Handling distributed objects (ie objects executing in execution spaces disseminated over a network) is a serious concern to the application designer. A lot of effort has been devoted to techniques for using and managing distributed objects. In this paper we focus on the quality of service management issue in object-oriented distributed applications. Designers of distributed computations build on a huge amount of experience both on architectural side and on the algorithmic side. Concurrency control, efficiency, fault-tolerance, best use of the resources are some of the main qualities that distributed software designer strive to obtain [1,9]. However, once obtained in some particular implementation , these qualities cannot be transposed and adapted easily to other application settings. Cross-breeding object-oriented techniques with distributed computations ones is a fairly hot topic and many interesting works have been published [8]. Let us mention a few of major approaches.
This work has been supported in part by the council of Région Bretagne
1. One of the most popular approaches relies on hiding distribution. To the application program developer, all objects are local (except maybe at initialization time). The bad side is the lack of control and the fact that some important issues are hidden although they are too important to be ignored.
2. Another popular approach uses a different strategy and gives tools to the application designer to build a custom distributed object framework. The possibilities that this scheme gives are numerous and it overpowers the simple yet limited scheme of proxies. The drawback here is that using and assembling parts requires a great deal of knowledge on distributed application architectures.
3. A third scheme goes farther on the reification path. Even higher order entities such as protocols can be reified [4]. Defining a protocol between objects is done by stacking or composing existing classes of protocols. On the down side, an application needs specific protocols which are difficult to identify by the application developer.

In this paper we present another approach to the cross-breeding of object-orientedness and distributed computation domains. The main concept we use is the contract model. Most features derive from the semantics of this contract notion. The main aim of the contract notion is to capture the interaction properties between two or more objects. A method invocation is a common case of object interaction. By using an adequate contract, a customer object may invoke a method on a provider object and still get vital information on the method execution, such as delays, failures, etc. This paper does not aim at providing an extensive definition of the contract model we designed. We rather aim at presenting the overall objectives, structure and logic of a distributed framework we built for managing quality of service issues in distributed applications. Section 2 contains a general exposition of our contract model. Section 3 gives a very brief overall description of our framework for distributed objects.
2
The Contract Concept
A contract defines the ways and means of interaction between two or more objects. Several contract models have already been defined to enhance the clarity of design and the reliability of object-oriented applications. A common form of contract uses method call preconditions, postconditions, loop invariants, class invariants, etc. For instance, the Eiffel language [7] includes constructs to state this kind of contract explicitly within the code. Other forms of contracts may include the specification of valid interaction protocols between objects [5]. Also, real-time software designs include timing constraints on object interaction. Our work on contracts and distribution aims at extending the contract notion to deal properly with many distribution-related problems, such as quality of service management (e.g. varying network latency, network failures, site failures, . . . ) or load balancing (e.g. performance evaluation of sites, selection of a site for method evaluation).
Our model relies on four important types of entities : the contract, the contractable, the customer and the command. Briefly stated, a contract is setup by a negotiation between the customer and the contractable. The contract specifies how a command will be executed, at a customer’s request. The most common form of request is method invocation, ie an object (customer) wants another one (provider) to execute some method. Typical contracts may include specifications of preconditions, postconditions, execution deadlines, alternate behaviors to follow in case of contract violation. From the language point of view, a contract is an object instantiated from some class which inherits from the abstract Contract class. Instantiation of contract objects is performed by contractable objects only, in a way similar to the abstract factory design pattern [3]. As it is customary with object-oriented architectures, the exact concrete class of a contract object is hidden to the customer object, allowing sophisticated means of contract implementation selection and tuning within the contract management software. Contract negotiation is the first phase within a contract’s life. If some given customer object aims at requesting a command execution under specific conditions (eg a maximum duration of execution), the customer object must ask a known contractable object to provide a contract stating these execution conditions (which command, which delays, etc). The contractable object is fully responsible for finding out whether the conditions are reasonable or not. This implies that some estimation of the feasibility has to be performed, eg a network load evaluation. By returning a valid contract a contractable object give hints to the customer object about the current quality of service of command execution. A contract is used by a customer object when it requests execution of a command to its contractable partner. The contractable object is responsible for finding out means to execute the command under the specifications of the contract. The contractable object is also responsible for monitoring the execution, based on the contract features.
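The following sketch illustrates this negotiation and execution cycle from a customer's point of view. The method names and signatures used here (negotiate, submit, the refusal convention and the 200 ms figure) are assumptions made for the illustration; they are not the published interfaces of the framework, whose contracts are in fact instantiated only by contractable objects, as described above.

// Hedged sketch of the contract life cycle: the customer asks a contractable
// for a contract stating its execution conditions, then requests the execution
// of a command under that contract. Names and signatures are illustrative only.
interface Command { Object run(); }

interface Contract { }                 // instantiated by contractable objects only

class ContractViolation extends Exception { }

interface Contractable {
  // Negotiation: the contractable estimates feasibility (e.g. current network
  // load) and either returns a contract it has instantiated, or null (refusal).
  Contract negotiate(Command command, long maxDurationMillis);

  // Execution: the contractable executes the command and monitors the run
  // against the features of the previously negotiated contract.
  Object submit(Contract contract) throws ContractViolation;
}

class Customer {
  Object compute(Contractable provider, Command work) throws ContractViolation {
    Contract c = provider.negotiate(work, 200);   // ask for at most 200 ms
    if (c == null) {
      return work.run();                          // refused: degrade or run locally
    }
    return provider.submit(c);                    // monitored execution
  }
}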
3
Contracts Between Distributed Objects
Our contract model states that a contract negotiation starts at initiative of a customer object. In a distributed execution environment, work has to be done to find out whether a contract proposed by a customer object is feasible, which site or sites can deal with requests under the contract, what adjustments have to be made. This involves specific trading algorithms. Our framework includes several subclasses of the Contract class, to provide the application developer with distributed services such as execution of command in a group of objects, etc. The simplest contract is a maximum delay of execution contract. To get this kind of contract, the designer of the scene management subsystem simply needs to subclass a basic contract class of our framework, named TimeoutContract. Another useful contract is the Delta T Reachable Set Contract (DTRSContract for short). A DTRSContract contains a list of remote sites which have to be reached in less than T units of time. DTRSContract objects are created
and configured by the kernel of our framework, which monitors the communication between a set of sites using the fail awareness datagram scheme of Fetzer and Cristian [2]. This scheme uses physical clocks on each site to compute the message transmission delays and take decisions about the reachability of sites. The Delta Connected Group Contract includes requirements for the connectivity of members of groups: all members of a group must follow the requirements of the same group contract. We designed a specific group management algorithm to compute a partition of the sites set [6].
4
Conclusion
Like in the Bast system [4], we use high level building blocks to help the distributed application designer. However, our contract model has more expressive power to selectively hide or expose features of the underlying distribution mechanisms. Thanks to this, distributed applications can be constructed with a limited expertise in distribution issues, without sacrificing the means of control. This improves the separation of concerns: some part of the application design does not care about distribution issues, some other part includes application-specific distributed mechanisms and exposes them via the contract user & contract provider metaphor.
References

1. R. Van Renesse, K. P. Birman, and S. Maffeis. Horus: a flexible group communication system. Communications of the ACM, 39(4), April 1996. 738
2. C. Fetzer and F. Cristian. Fail-awareness in timed asynchronous systems. In 15th ACM Symposium on Principles of Distributed Computing, Philadelphia, May 1996. 741
3. E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. 740
4. B. Garbinato, P. Felber, and R. Guerraoui. Composing reliable protocols using the strategy design pattern. In Usenix International Conference on Object-Oriented Technologies (COOTS’97), 1997. 739, 741
5. R. Helm, I. M. Holland, and D. Gangopadhay. Contracts: Specifying behavioral compositions in object-oriented systems. In ECOOP/OOPSLA, pages 169–179, October 1990. 739
6. S. Lorcy and N. Plouzeau. A distributed algorithm for managing group membership with multiple groups. In Proc. of the PDPTA’98 conference, 1998. 741
7. B. Meyer. Applying ”design by contract”. Computer (IEEE), 25(10):40–51, October 1992. 739
8. D. C. Schmidt. The adaptive communication environment: Object-oriented network programming components for developing client/server applications. 12th Sun Users Group Conference, June 1994. 738
9. P. Verissimo, L. Rodriguez, F. Cosquer, H. Fonseca, and J. Frazao. An overview of the navtech system. Technical report, Department of Informatics, Faculty of Sciences of the University of Lisboa, Portugal, 1995. 738
A Data Parallel Java Client-Server Architecture for Data Field Computations over ZZ n Jean-Louis Giavitto, Dominique De Vito, and Jean-Paul Sansonnet LRI u.r.a. 410 du CNRS, Bâtiment 490 – Université de Paris-Sud, F-91405 Orsay Cedex, France. {giavitto,devito}@lri.fr
Abstract. We describe FieldBroker, a software architecture dedicated to data parallel computations on fields over ZZ n. Fields are a natural extension of the parallel array data structure. From the application point of view, field operations are processed by a field server, leading to a client/server architecture. Requests are translated successively into three languages corresponding to a tower of three virtual machines processing respectively mappings on ZZ n, sets of arrays and flat vectors in core memory. The server is itself designed as a master/multithreaded-slaves program. The aim of FieldBroker is to mutually incorporate approaches found in distributed computing, functional programming and the data parallel paradigm. It provides a testbed for experiments with language constructs, evaluation mechanisms, on-the-fly optimizations, load-balancing strategies and data field implementations.
1
Introduction
Collections, Data Fields and Data Parallelism. The data parallel paradigm relies on the concept of collection: it is an aggregate of data handled as a whole [6]. A data field is a theoretically well founded abstract view of a collection as a function from a finite index set to a value domain. Higher order functions or intensional operations on these mappings correspond to data parallel operations: point-wise applied operation (map), reduction (fold), etc. Data fields make it possible to represent irregular data by using a suitable index set. Another attractive advantage of the data field approach, in addition to its generality and abstraction, is that many ambiguities and semantical problems of “imperative” data parallelism can be avoided in the declarative framework of data fields.

A Distributed Paradigm for Data Parallelism. Data parallelism was motivated to satisfy the increasing needs of computing power in scientific applications. Thus, the main target of data parallel languages has been supercomputers and the privileged linguistic framework was Fortran (cf. HPF). Several factors urge to reconsider this traditional framework:

– Advances in network protocols and bandwidths have made practical the development of high performance applications whose processing is distributed over several supercomputers (metacomputing).
– The widening of parallel programming application domains (e.g. data mining, virtual reality, generalization of numerical simulations) urges the use of cheaper computing resources, like NOWs (networks of workstations).
– Developments in parallel compilation and run-time environments have made possible the integration of data parallelism and control parallelism, e.g. to hide the communication latency with the multithreaded execution of independent computations.
– New algorithms exhibit more and more a dynamic behavior and perform on irregular data. Consequently, new applications depend more and more on the facilities provided by a run-time (dynamic management of resources, etc.).
– Challenging applications consist of multiple heterogeneous modules interacting with each other to solve an overall design problem. New software architectures are needed to support the development of such applications.

All these points require the development of portable, robust, high-performance, dynamically adaptable, architecture neutral applications on multiple platforms in heterogeneous, distributed networks. Many of these attributes can be cited as descriptive characteristics of distributed applications. So, it is not surprising that distributed computing concepts and tools, which precisely face this kind of problem, become an attractive framework for supporting data parallel applications. In this perspective, we propose FieldBroker, a client-server architecture dedicated to data parallel computations on data fields over ZZ n. Data field operations in an application are requests processed by the FieldBroker server. FieldBroker has been developed to provide an underlying virtual machine to the 81/2 language [5] and to compute recursive definitions of group based fields [2]. However, FieldBroker also aims to investigate the viability of client-server computing for data parallel numerical and scientific applications, and the extent to which this paradigm can efficiently integrate a functional approach of the data parallel programming model. This combination naturally leads to an environment for dynamic computation and collaborative computing. This environment provides and facilitates interaction and collaboration between users, processes and resources. It also provides a testbed for experiments with language constructs, evaluation mechanisms, on-the-fly optimizations, load-balancing strategies and data field implementations.
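To give a concrete flavour of the client-side view, the sketch below treats a data field as a mapping from points of an index set of ZZ n to values, on which map and fold requests are issued to the server. All the names and signatures used here (DataField, FieldServer, rectangle, map, fold) are assumptions made for this illustration only; the real client interface is the L0 language described in the next section.

// Hedged sketch of the client view of FieldBroker: a data field is handled as a
// mapping from an index set to values, and the intensional operations map and
// fold are shipped to the field server as requests. Names are illustrative only.
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

interface DataField {
  DataField map(DoubleUnaryOperator f);                  // point-wise operation
  double fold(DoubleBinaryOperator op, double neutral);  // reduction
}

interface FieldServer {
  // a field defined on the rectangular index set [0,n) x [0,m) of ZZ 2
  DataField rectangle(int n, int m, double initialValue);
}

class FieldClient {
  static double demo(FieldServer server) {
    DataField t = server.rectangle(1024, 1024, 1.0);   // request sent to the server
    DataField u = t.map(x -> 2.0 * x);                  // evaluated by master/slaves
    return u.fold((a, b) -> a + b, 0.0);                // result returned to the client
  }
}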
2
A Distributed Software Architecture for Scientific Computation
The software architecture of the data field server is illustrated by Fig. 1 right. Three layers are distinguished. They correspond to three virtual machines: – The server handles requests on functions over ZZ n . It is responsible for parallelization and synchronization between requests from one client and between different clients.
– The master handles operations between sets of arrays. This layer is responsible for various high-level optimizations on data field expressions. It also decides the load balancing strategy and synchronizes the computations of the slaves. – The slaves implement sequential computations over contiguous data in memory (vectors). They are driven by the master requests. Master requests are of two kinds: computations to perform on the slave’s data or communications (send data to other slaves; receives are implicit). Computations and communications are multithreaded in order to hide communication latency.
Fig. 1. Left: Relationships between field algebras L0 , L1 and L2 . Right: A client/server-master/multithreaded-slaves architecture for the data parallel evaluation of data field requests. The software architecture described on the right implements the field algebras sketched on the left. Functions i Ti+1 are phases of the evaluation. The functions [[ ]]i are the semantic functions that map an expression to the denoted element of ZZ n → Value. They are defined such that the diagram commutes, that is [[ei ]]i = [[i Ti+1 (ei )]]i+1 is true for i ∈ {0, 1} and ei ∈ Li . This property ensures the soundness of the evaluation process.
Our software architecture corresponds to a three-level language tower. Each language specifies the communications between two levels of the architecture and describes a data structure and the corresponding operations. Three languages are used, going from the more abstract L0 (client view on a field) to L1 and to the more concrete L2 (in-core memory view on a field). The server-master and the slave programs are implemented in Java. The rationale of this design decision is to support portability and dynamic extensibility. The expected benefits of this software architecture are the following:
– Accessibility and client independence: requests for the data field computation are issued by a client through an API. However, because the slave is a Java program, Java applets can be easily used to communicate with the server. So, an interactive access could be provided through a web client at no further cost. In this case, the server appears as a data field desk calculator.
– Autonomous services: the server lifetime is not linked to the client lifetime. Thus, implementing persistence, sharing and checkpointing will be much easier with this architecture than with a monolithic SPMD program.
– Multi-client interactions: this architecture enables application composition by pipelining, data sharing, etc.

Figure 1 illustrates that L0 terms are successively translated into L1 and L2 and that L2 terms are dispatched to the slaves to achieve the final data parallel processing. More details about these languages can be found in [1].
3
Conclusion
The aim of our first ongoing implementation is to evaluate the functionalities provided by such an architecture. At this stage, we have not paid attention to its performance, which is certainly disappointing. A promising way to tackle this drawback consists in the use of just-in-time Java compilers that are able to translate Java bytecode into executable machine-dependent code. FieldBroker integrates concepts and techniques that have been developed separately. For example, relationships between the definition of functions and data fields are investigated in [4]. A proposal for an implementation is described in [3] but focuses mainly on the management of the definition domain of data fields. One specific feature of FieldBroker is the use of heterogeneous representations, i.e. extensional and intensional data fields, to simplify field expressions. Clearly, the algebraic framework is the right one to reason about the mixing of multiple representations.
References

1. J.-L. Giavitto and D. De Vito. Data field computations on a data parallel Java client-server distributed architecture. Technical Report 1167, Laboratoire de Recherche en Informatique, Apr. 1998. 9 pages. 745
2. J.-L. Giavitto, O. Michel, and J.-P. Sansonnet. Group based fields. Parallel Symbolic Languages and Systems (International Workshop PSLS’95), vol. 1068 of LNCS, pages 209–215, Beaune (France), 2-4 October 1995. Springer-Verlag. 743
3. J. Halén, P. Hammarlund, and B. Lisper. An experimental implementation of a highly abstract model of data parallel programming. Technical Report TRITA-IT 9702, Royal Institute of Technology, Sweden, March 1997. 745
4. B. Lisper. On the relation between functional and data-parallel programming languages. In Proc. of the 6th Int. Conf. on Functional Languages and Computer Architectures. ACM, ACM Press, June 1993. 745
5. O. Michel. Introducing dynamicity in the data-parallel language 81/2. EuroPar’96 Parallel Processing, vol. 1123 of LNCS, pages 678–686. Springer-Verlag, Aug. 1996. 743
6. J. M. Sipelstein and G. Blelloch. Collection-oriented languages. Proceedings of the IEEE, 79(4):504–523, Apr. 1991. 742
Workshop 7+20 Numerical and Symbolic Algorithms Maurice Clint and Wolfgang Kreuchlin Co-chairmen
Numerical Algorithms

The workshop on numerical algorithms is organised in three sessions under the headings Parallel Linear Algebra, Parallel Decomposition Methods and Parallel Numerical Software. The use of numerical linear algebra in the development of scientific and engineering applications software is ubiquitous. With the rapid increase in the size of the problems which can now be solved there is a concentration on the use of iterative methods for the solution of linear systems of equations and eigenproblems. Such methods are particularly amenable to efficient implementation on many types of parallel computer. Efficient methods for the partial eigensolution of large systems include Davidson’s method and the Krylov-space based method of Lanczos. An important method for the iterative solution of linear systems is another Krylov-space based method, GMRES. All of these methods incorporate an orthogonalization phase. Fraysée, Giraud and Kharraz-Aroussi compare the effectiveness, on a distributed memory machine, of four variants of the Gram-Schmidt orthogonalization procedure when used as components of the GMRES method. They conclude that, in a parallel environment, it is best to use ICGS (Iterative Classical Gram-Schmidt) rather than the more widely recommended modified versions of the procedure since it offers the best trade-off between numerical robustness and parallel efficiency. Borges and Oliveira present a new parallel algorithm (RDME), based on Davidson’s method, for computing a group of the smallest eigenvalues of a symmetric matrix. They demonstrate the effectiveness of their algorithm by comparing its performance on a Paragon with that of a parallel Lanczos-type method taken from PARPACK. Adam, Arbenz and Geus compare the performances on an HP Exemplar machine of a number of methods, including Davidson-based and Lanczos-based procedures, for computing partial eigensolutions of a large sparse symmetric matrix arising from an application of Maxwell’s equations. They observe that parallel efficiencies are as might be expected from an analysis of the algorithms and that subspace iteration is not competitive with the other methods they investigate. Many powerful algorithms for the solution of problems in numerical mathematics are based on the decomposition of linear operators into simpler forms. It is thus imperative that the potential for efficient and robust parallel implementation of such methods be studied. In addition, the use of decomposition may allow
the solution of a problem to be constructed from an appropriate combination of the simultaneously computed solutions of a number of subproblems. Osawa and Yamada present a decomposition-based waveform relaxation method, well suited to parallel implementation, for the solution of second order differential equations. The usefulness of their approach is supported by the results of experiments conducted on a KSR1 machine for both linear and non-linear equations. Incomplete LU factorization is an important operation in preconditioned iterative methods for solving sets of linear equations. This phase of the overall process is, however, the least tractable as far as parallel implementation is concerned. Nodera and Tsuma present a generalization of Bastion and Horton’s parallel preconditioner and demonstrate its effectiveness on a distributed memory machine when used with the solvers BiCGStab and GMRES. Kiper addresses the problem of computing the inverse of a dense triangular matrix and proposes an approach, based on Gaussian elimination, in which arithmetic parallelism is exploited by means of diagonal and triangular sweeps. A PRAM-based theoretical analysis is given and it is shown that, if sufficient processors are available, the method outperforms the standard algorithm of Sameh and Brent. Maslennikov, Kaniewski and Wyrzkowski present a Givens rotation-based QR decomposition method, designed for execution on a linear processor array, which permits the correction of errors arising from transient hardware faults: this is achieved at the expense of a modest increase in arithmetic complexity. Developers of applications software intended for execution on high performance computers increasingly rely on the availability of libraries of efficient parallel implementations of algorithms from which their systems may be built. The provision of such libraries is the main goal of the ScaLAPACK project in the USA and the Parallel NAG Library project in the UK. Krommer reports on work being undertaken within the PINEAPL project, an aim of which is to extend the NAG Parallel Library to meet the needs of developers of parallel industrial applications. He reports favourably on the efficiency of execution and the scalability of a number of core sparse linear algebra routines which have been implemented on a Fujitsu AP3000 distributed memory machine. However, he sounds a note of caution by indicating that, in some cases, parallel performance may be poor. De Bono, di Serafino and Ducloux integrate a parallel 2D FFT routine from the PINEAPL Library into an industrial application concerned with the design of high performance lasers: most of the computational load in this application arises from the execution of this routine. The performance of the parallel software on an IBM SP2 is evaluated. Wagner presents an efficient parallel implementation, for execution on an IBM SP2, of an algorithm for computing the Hermite normal form of a polynomial matrix with its associated unimodular transformation matrix. The latter is used in the context of convolutional decoders to find the decoder for a given encoder. The implementation is expressed in C/C++ and uses the IBM message passing library, MPL.
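As background to the Gram-Schmidt comparison mentioned above, the sketch below contrasts the classical scheme, in which all projection coefficients of a step are computed from the same unmodified vector (so a single global reduction suffices per pass on a distributed memory machine), with the iterated variant, which simply repeats the classical pass to restore numerical robustness. This is an illustrative summary written for this overview, not code taken from the paper.

// Illustrative sketch of one step of iterated classical Gram-Schmidt (ICGS):
// the vector v is orthogonalized against the k orthonormal vectors stored in
// the rows of Q. Each pass computes all coefficients h[j] from the same v, so
// in a distributed setting they can be obtained with one global reduction;
// repeating the pass (here twice) recovers the robustness of the modified scheme.
class GramSchmidt {
  static void icgsStep(double[][] Q, int k, double[] v) {
    for (int pass = 0; pass < 2; pass++) {   // pass 0 = classical GS, pass 1 = reorthogonalization
      double[] h = new double[k];
      for (int j = 0; j < k; j++)            // all dot products use the same v
        for (int i = 0; i < v.length; i++)
          h[j] += Q[j][i] * v[i];
      for (int j = 0; j < k; j++)            // subtract the projections
        for (int i = 0; i < v.length; i++)
          v[i] -= h[j] * Q[j][i];
    }
  }
}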
Symbolic Algorithms
Symbolic computation is concerned with doing exact mathematics by computer, using the laws of Symbolic Logic or the laws of Symbolic Algebra. Consequently we are dealing with a wide spectrum, including theorem proving, logic programming, functional programming, and Computer Algebra. Symbolic computation is usually extremely resource demanding, because computation proceeds on a highly abstract level. Systems typically have to engage in some form of list processing with automated garbage collection, and they have little or no compile-time knowledge about the ensuing workloads. After the demise of special purpose list-processing hardware, however, there is little hardware support for symbolic computation except for the use of many processors in parallel. Therefore, investigating parallelism is one of the major hopes in symbolic computation to achieve new levels of performance in order to tackle yet larger and more significant applications. At the same time the system issues in parallelisation are among the most demanding one can face: highly dynamic data structures, highly data-dependent algorithms, and extremely irregular workloads. This is most obvious in logic programming, where the data form a program and parallelisation must follow the structure of the data. Hence workloads and other dynamic behaviour are even impossible to predict in theory. Especially in the fields of Theorem Proving and Computer Algebra, there is also still the problem of finding good parallel algorithms. Frequently, naive parallelisations are possible, which however ignore sequential optimisations and perform worse than standard approaches. Parallel algorithms which outperform highly tuned sequential ones on realistic problems are still sorely needed. The field has seen very good progress in the past years. It is a great advantage that shared memory machines and standard programming environments have taken an upswing recently. This greatly favors parallel symbolic computation applications, which usually have complex code with frequent synchronization needs. We may therefore hope to see more real speedups on real problems, which still is the great overall challenge to parallel symbolic computation. Costa and Bianchini optimise the Andorra parallel logic programming system. They analyse the caching behavior of the existing system using a simulator and develop 5 techniques for improving the system. As a result, the speedups improve significantly. Although itself based on simulation, this work is important for closing the gap between theoretical parallelisations and real architectures for which caching behaviour is critical in achieving speedup. Németh and Kacsuk's paper is concerned with the improvement of the LOGFLOW parallel logic programming system. They analyse the closed binding method and propose a new hybrid binding scheme for the logic variables. Ogata, Hirata, Ioroi and Futatsugi present an extension of earlier work on their parallel term-rewriting machine PTRAM. PTRAM was designed for shared memory multiprocessors; PTRAM/MPI in addition supports message passing computation and was implemented on a Cray T3E. Scott, Fisher and Keane investigate the parallelisation of the tableau method for theorem proving in temporal logic. Temporal logic permits reasoning about time-varying properties. Proof methods are PSPACE-complete
and hence provers are time- and resource-intensive. They present two shared-memory parallel systems, one of which achieves good speedups over the sequential system.
On the Influence of the Orthogonalization Scheme on the Parallel Performance of GMRES
Valérie Frayssé¹, Luc Giraud¹, and Hatim Kharraz-Aroussi²
¹ CERFACS, 42 av. Gaspard Coriolis, 31057 Toulouse Cedex 1, France. fraysse,[email protected].
² ENSIAS, BP713, Agdal, Rabat, Morocco.
Abstract. In Krylov-based iterative methods, the computation of an orthonormal basis of the Krylov space is a key issue in the algorithms because the many scalar products are often a bottleneck in parallel distributed environments. Using GMRES, we present a comparison of four variants of the Gram-Schmidt process on distributed memory machines. Our experiments are carried out on an application in astrophysics and on a convection-diffusion example. We show that the iterative classical Gram-Schmidt method outperforms its three competitors in speed and in parallel scalability while keeping robust numerical properties.
1 Introduction
Krylov-based iterative methods for solving linear systems are attractive because they can be rather easily integrated in a parallel distributed environment. This is mainly because they are free from matrix manipulations apart from matrix-vector products, which can often be parallelized. The difficulty is then to find an efficient preconditioner which is good at reducing the number of iterations without degrading too much the parallel performance. We consider the solution of a large sparse linear system Ax = b where A is a nonsingular n × n matrix, and b and x are two vectors of length n. Given a starting vector x0, the GMRES method [10] consists in building an orthonormal basis Vm for the Krylov space K(A) = span{x0, Ax0, . . . , A^(m-1)x0}. The integer m is called the projection size. GMRES produces a solution whose restriction to this Krylov space has a minimal 2-norm residual. When one wants to limit the amount of storage for Vm, m is kept fixed and the method is restarted with a better initial guess: m is then called the restart parameter and the restarted GMRES method is denoted by GMRES(m). The construction of the basis Vm is an important step of GMRES. Several variants of the Gram-Schmidt (GS) process are available; they have different efficiencies and different numerical properties in finite precision arithmetic [1]. Classical GS (CGS) is the most efficient but is numerically unstable, modified GS (MGS) improves on
CGS but can still suffer from a loss of orthogonality. The iterative MGS (IMGS) method and the iterative CGS (ICGS) have been designed to reach the numerical quality of the Householder (or Givens) factorization with a reasonable computational cost. From the point of view of computational efficiency, CGS and ICGS are the best because the scalar products can be gathered and implemented in a matrix-vector product form. However, the iterative procedures ICGS and IMGS may require more than twice as much work as their counterparts CGS and MGS, when reorthogonalization is needed. Our implementations are the ones described in [1]. From a numerical point of view, it has been proved that MGS, IMGS and ICGS are able to produce a good enough computed Krylov basis to ensure the convergence of GMRES [4,6]. In this paper, we wish to compare the efficiency of GMRES(m) with these four orthogonalization processes in a parallel distributed environment. It is well-known that the many scalar products arising in the orthogonalization phase are the bottleneck of Krylov based methods. Our tests are based on a restarted GMRES code developed at CERFACS which implements the four GS variants and uses reverse communication for the preconditioner, the matrix-vector products and dot products [5]. The numerical experiments have been performed on the 128 node CRAY T3D available at CERFACS, using the MPI message passing library. We present results concerning an application in astrophysics developed in collaboration with M. Rieutord (Observatoire Midi-Pyrénées, Toulouse) and L. Valdettaro (Politecnico di Milano). In order to explain these results in depth, we study a model problem deriving from a convection-diffusion equation, on which we can test the scalability of GMRES.
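To make the differences between the variants concrete, the following minimal numpy sketch (not the CERFACS implementation; the function names and the fixed two-sweep reorthogonalization are illustrative assumptions) orthogonalizes one new vector w against an already orthonormal basis V with CGS, MGS and ICGS:

```python
import numpy as np

def cgs(V, w):
    # Classical Gram-Schmidt: all coefficients come from one gathered
    # product V^T w (a single global reduction in parallel).
    h = V.T @ w
    return w - V @ h, h

def mgs(V, w):
    # Modified Gram-Schmidt: one dot product at a time against the vector
    # updated so far (one synchronization point per basis vector).
    h = np.zeros(V.shape[1])
    for j in range(V.shape[1]):
        h[j] = V[:, j] @ w
        w = w - h[j] * V[:, j]
    return w, h

def icgs(V, w, sweeps=2):
    # Iterative classical Gram-Schmidt: repeat the gathered CGS update;
    # a second sweep usually restores the orthogonality plain CGS loses.
    h = np.zeros(V.shape[1])
    for _ in range(sweeps):
        c = V.T @ w
        w = w - V @ c
        h += c
    return w, h
```

In GMRES the orthogonalized vector is then normalized and appended to V; the parallel appeal of CGS/ICGS is that V.T @ w can be formed from local data and summed with a single reduction.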
2 An Application in Astrophysics
2.1 Description of the Application
The application we are interested in belongs to the class of flow stability problems and arises in astrophysics, when modelling the internal structure of stars and planets, to study for instance the electromagnetic field in the Earth's core [8,9]. The study of the stability of a solution of a nonlinear problem begins by solving the eigenvalue problem verified by small perturbations of the original solution. Therefore, the inertial modes (or eigenmodes) for an incompressible viscous fluid located between two concentric spheres are obtained by solving the linearized Navier-Stokes and continuity equations
E Δ ∇×u − ∇×(e_z × u) = λ ∇×u,   div u = 0,
where u is the scaled velocity of the fluctuations, and (e_z × u) is the non-dimensional Coriolis force. The Ekman number E is equivalent to the inverse of a scaled Reynolds number and is intended to be very small. When using a spherical geometry and after projecting on the space of Chebyshev polynomials,
one obtains a generalized eigenproblem Ax = λBx where A and B are two non-hermitian matrices, and B may be singular. The smaller E, the finer must be the discretization and thus A and B must be larger. All the eigenvalues have an imaginary part between −1 and 1. The eigenvalues of interest are those closest to the imaginary axis. The eigenproblem is solved by the Arnoldi method with a Chebyshev acceleration [2]. The strategy for selecting the interesting eigenvalues is based on a shift-and-invert technique, with several imaginary shifts σ in the interval [−i, i]. We then solve for µ the eigenproblem (A − σB)^(-1)Bx = µx with µ = 1/(λ − σ) being the largest eigenvalue of C = (A − σB)^(-1)B. The Arnoldi method, like any Krylov-based iterative method, requires only the application of C to a vector. Clearly in this case, this operation involves the solution of linear systems such as (A − σB)x = y. Even if A and B are sparse, the use of direct methods for solving this system imposes severe limitations on the size of A and B, and consequently limits the range of possible values of the Ekman number E. We have therefore investigated the use of GMRES(m) in a parallel distributed memory environment. After detailing some choices about the data distribution and the preconditioning, we give some results on the performance of GMRES(m) on this example, where, for the sake of simplicity, we have assumed the shift σ to be zero. In the rest of this paper, the acronym GMRES will always refer to the restarted method.
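Since the Arnoldi process only needs the action of C = (A − σB)^(-1)B on vectors, one way to realize it is as a matrix-free operator whose application triggers an inner linear solve. The following SciPy sketch is illustrative only; it is not the authors' code, and the inner solver, its default tolerances and the optional preconditioner prec are assumptions:

```python
import scipy.sparse.linalg as spla

def shift_invert_operator(A, B, sigma=0.0, prec=None):
    # Wrap C = (A - sigma*B)^{-1} B as a LinearOperator: each matvec solves
    # (A - sigma*B) y = B x with an inner (restarted) GMRES iteration.
    shifted = A - sigma * B
    def matvec(x):
        y, info = spla.gmres(shifted, B @ x, M=prec, restart=100)
        if info != 0:
            raise RuntimeError("inner GMRES did not converge")
        return y
    return spla.LinearOperator(A.shape, matvec=matvec, dtype=float)

# An Arnoldi-based eigensolver can then be fed this operator, e.g.
# spla.eigs(shift_invert_operator(A, B), k=10, which='LM')
```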
2.2 The Data Distribution
The matrix A is block tridiagonal. The size of the diagonal blocks is usually twice the size of the off-diagonal blocks (also called coupling blocks). For example, the test matrix A5850 used in our experiments is 5850 × 5850, with 45 diagonal blocks of size 130 × 130 and 88 off-diagonal blocks of size 61 × 61. Only the blocks are stored. They are dense enough so that there is no substantial gain to hope for from a sparse storage. Note that the matrix B is block diagonal, with a block size compatible with the storage of A: therefore the matrices A − σB can be stored identically to A. The matrix is distributed over the processors so that each processor possesses complete blocks only. The advantage of this distribution is that it allows an easy implementation of the preconditioner, but the drawback is that it may induce some load imbalance when the number of blocks is not a multiple of the number of processors. In the case of two processors for instance, processor 1 will receive the first 23 diagonal blocks of A5850 and processor 2 the remaining 22 blocks. In other words, processor 1 (resp., processor 2) will treat a subproblem of size 130 × 23 = 2990 (resp., 130 × 22 = 2860). But out of 32 processors, 13 will have two blocks whereas 19 processors will have half that load, that is, one block only. Let n(p) be the size of the subproblem treated by the first processor when the
matrix is distributed over p processors. If the data distribution were perfectly balanced, n(p) would be proportional to 1/p. Figure 1 shows the evolution of the ratio n(2)/n(p), for which the optimal distribution (dotted line) would give n(2)/n(p) = p/2. We see that our distribution is reasonably balanced whenever p ≤ 16. When p ≥ 32, the first processor is penalized by a larger amount of data than expected from the increase in processors. The opposite phenomenon arises for the last processor, which sees its load decrease faster than the increase in processors.
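A small sketch, under the assumption that the 45 diagonal blocks are dealt out in contiguous chunks (the actual mapping in the code may differ), reproduces the figures quoted above and the ratio n(2)/n(p) plotted in Figure 1:

```python
def n_first(p, n_blocks=45, block_size=130):
    # rows owned by the most loaded (first) processor: ceil(n_blocks/p) blocks
    return -(-n_blocks // p) * block_size

n2 = n_first(2)                   # 23 blocks -> 2990 rows, as in the text
for p in (2, 4, 8, 16, 32):
    print(p, n2 / n_first(p))     # compare with the ideal value p/2
```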
Fig. 1. Load of the first processor: n(2)/n(p) versus the number of processors for the Astro 5850 matrix.
2.3 Preconditioning
The restarted GMRES method applied to this problem does not converge, even with large restarts. We have chosen one of the simplest preconditioners: the block Jacobi preconditioner. The preconditioner consists of the diagonal blocks of A and is well adapted to the data distribution. Each processor computes once a dense LU factorization of its blocks at the beginning of the computation. Applying the preconditioner consists in performing two triangular solves per block locally on each processor. There is no communication involved when building or applying the preconditioner.
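A hedged serial sketch of this block Jacobi preconditioner (the data layout and class name are illustrative, not the authors' code) factorizes the local diagonal blocks once and then applies two triangular solves per block at every application:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

class BlockJacobi:
    def __init__(self, diag_blocks):
        # one dense LU factorization per locally stored diagonal block
        self.factors = [lu_factor(D) for D in diag_blocks]
        self.sizes = [D.shape[0] for D in diag_blocks]

    def apply(self, r):
        # block-wise solve D_i z_i = r_i; no communication is needed
        z, offset = np.empty_like(r), 0
        for f, m in zip(self.factors, self.sizes):
            z[offset:offset + m] = lu_solve(f, r[offset:offset + m])
            offset += m
        return z
```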
2.4 Performance Results
We describe now the results obtained for the matrix of size 5850. Since the number of processors on the CRAY T3D has to be a power of 2, we show results for 2, 4, 8, 16 and 32 processors (which is the maximum allowed for 45 blocks). The sequential code could not be run due to memory limitations: consequently, our reference for speed-ups will be the time obtained for 2 processors.
From a numerical point of view, the matrix A is very ill-conditioned (with a condition number larger than 10^13). GMRES with CGS orthogonalization does not converge on this example. In our tests, we stop the iterations when the backward error on the preconditioned system becomes lower than 10^-7. The original system then has a normwise backward error ‖Ax̃ − b‖₂/(‖A‖₂‖x̃‖₂ + ‖b‖₂) smaller than 10^-10 (here x̃ denotes the computed solution). We will show the results obtained with a restart of 100 since we obtained similar behaviors for other values. Because the matrix is ill-conditioned, the number of iterations may vary significantly with the orthogonalization strategy and the number of processors (see Figure 2). However, these variations are not monotonic and one cannot predict the preeminence of one method over the other. The variation with the number of processors is due to the different orderings of the floating-point operations in the parallel matrix-vector and dot products. Because of this sensitivity, we have chosen to evaluate the performance of the code mainly on one iteration rather than on the complete solution: the times measured over the solution are then scaled by the number of iterations required for this solution. However, to be fair, we first look at the total computational time needed to obtain the solution (see Figure 3). Regardless of the number of iterations, the fastest method is GMRES-ICGS and the slowest is GMRES-IMGS. The success of GMRES-ICGS is mainly due to the better scalability properties of the ICGS orthogonalization scheme, as we will see in the next section.
Fig. 2. Number of iterations required for convergence (MGS, IMGS, ICGS) versus the number of processors for the Astro 5850 matrix.
Fig. 3. Computational time for convergence (in s) for MGS, IMGS and ICGS versus the number of processors for the Astro 5850 matrix.
Speed-up for the preconditioner. The speed-up for the preconditioner is taken as the ratio
(preconditioning time on 2 processors) / (preconditioning time on p processors)
and is plotted in Figure 4. We cannot compare with the sequential time because the problem is too large to fit on one processor. It departs from its optimal value (dotted line) when p ≥ 16. This only reflects the imbalance of the processor loads (see Figure 1) and, because there is no communication overhead, is the best one can expect from this data distribution. Speed-up for the matrix-vector product. The matrix-vector product in parallel involves a local part corresponding to the contribution of the diagonal blocks and a communication part with the left and right neighbours to take into account the contribution of the coupling blocks. It is implemented so as to overlap communication and computation as much as possible. The speed-up, computed as for the preconditioning, is shown in Figure 5. Here again, it reflects rather faithfully the distribution of the load on the processors (see Figure 1), which proves that the overhead due to communications does not penalize the computation.
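A hedged mpi4py sketch of such an overlapped product is given below; it assumes equal-sized boundary pieces and coupling blocks and is meant to illustrate the communication pattern, not to reproduce the authors' T3D implementation.

```python
import numpy as np
from mpi4py import MPI

def local_diag_matvec(diag_blocks, x_local):
    # contribution of the locally stored diagonal blocks
    y, offset = np.zeros_like(x_local), 0
    for D in diag_blocks:
        m = D.shape[0]
        y[offset:offset + m] = D @ x_local[offset:offset + m]
        offset += m
    return y

def matvec(diag_blocks, low_cpl, up_cpl, x_local, b, comm):
    # b: length of a boundary piece / coupling block (61 in the example above)
    rank, size = comm.Get_rank(), comm.Get_size()
    reqs, from_left, from_right = [], None, None
    if rank > 0:
        from_left = np.empty(b)
        reqs += [comm.Isend(x_local[:b].copy(), dest=rank - 1),
                 comm.Irecv(from_left, source=rank - 1)]
    if rank < size - 1:
        from_right = np.empty(b)
        reqs += [comm.Isend(x_local[-b:].copy(), dest=rank + 1),
                 comm.Irecv(from_right, source=rank + 1)]
    y = local_diag_matvec(diag_blocks, x_local)   # local work overlaps the messages
    MPI.Request.Waitall(reqs)
    if from_left is not None:
        y[:b] += low_cpl @ from_left              # coupling with the left neighbour
    if from_right is not None:
        y[-b:] += up_cpl @ from_right             # coupling with the right neighbour
    return y
```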
Fig. 4. Speed-up of the preconditioning (Astro 5850, ICGS) versus the number of processors.
Fig. 5. Speed-up of the matrix-vector product (Astro 5850, ICGS) versus the number of processors.
Influence of dot products. The dot products are the well-known bottleneck for Krylov methods in a distributed environment. Figure 6 shows the time spent, per iteration, in the dot products and Figure 7 gives the percentage of the solution time spent in the dot products. Both figures indicate clearly that GMRES with ICGS is the best method for avoiding the degradation of the performance generally induced by the scalar products. We even see in Figure 6 that, when p ≥ 32, the time spent per iteration in the dot products starts increasing for IMGS and MGS whereas it continues to decrease with ICGS: this happens when the communication time overcomes the computation time in the dot products for IMGS and MGS. By gathering the scalar products, ICGS ensures more computational work to the processors. Finally, Figure 8 gives the speed-up, for an average iteration, of the complete solution. When comparing with the best possible curve with the given processor
Fig. 6. Time spent in dot products per iteration (in s) for MGS, IMGS and ICGS versus the number of processors (Astro 5850).
Fig. 7. Percentage of the solution time spent in dot products for MGS, IMGS and ICGS versus the number of processors (Astro 5850).
load on Figure 1, we see that GMRES-MGS and GMRES-IMGS give bad performance whereas GMRES-ICGS is less affected by scalar products. We now present a test problem for which it is easier to vary the size in order to test the parallel scalability properties of GMRES.
Fig. 8. Speed-up for an average iteration (MGS, IMGS, ICGS) versus the number of processors (Astro 5850).
3 Scalability on a Model Problem
We intend to investigate the parallel scalability of the GMRES code and the influence of the orthogonalization schemes. For this purpose we consider the solution, via two classic domain decomposition techniques, of an elliptic equation
−(∂²u/∂x² + ∂²u/∂y²) + a ∂u/∂x + b ∂u/∂y = f    (1)
on the unit square Ω = (0, 1)² with Dirichlet boundary conditions. Assume that the original domain Ω is triangulated by a set of non-overlapping coarse elements defining the N sub-domains Ωi. We first consider an additive Schwarz preconditioner that can be briefly described as follows. Each substructure Ωi is extended to a larger substructure Ω'i, within a distance δ from Ωi, where δ refers to the amount of overlap. Let Ai denote the discretization of the differential operator on the sub-domain Ω'i. Let Ri^T denote the extension operator which extends by zero a function on Ω'i onto Ω, and Ri the corresponding pointwise restriction operator. With these notations the additive Schwarz preconditioner, M_AD, can be compactly described as M_AD u = Σ_i Ri^T Ai^(-1) Ri u. We also consider the class of domain decomposition techniques that use non-overlapping sub-domains. The basic idea is to reduce the differential operator on the whole domain to an operator on the interfaces between the sub-domains. Let I denote the union of the interior nodes in the sub-domains, and let B denote the interface nodes separating the sub-domains. Then grouping the unknowns corresponding to I in the vector u_I and the unknowns corresponding to B in the vector u_B, we obtain the following reordering of the problem:
Au = [ A_II  A_IB ] [ u_I ]   [ f_I ]
     [ A_BI  A_BB ] [ u_B ] = [ f_B ] .    (2)
For standard discretizations, A_II is a block diagonal matrix where each diagonal block, Ai, corresponds to the discretization of (1) on the sub-domain Ωi. Eliminating u_I in the second block row of (2) leads to the following reduced equation for u_B:
S u_B = g_B = f_B − A_BI A_II^(-1) f_I,    (3)
where
S = A_BB − A_BI A_II^(-1) A_IB.
S is referred to as the Schur complement matrix (or also the capacitance matrix). In our experiments, Equation (1) is discretized using finite elements. The Schur complement matrix can then be written as
S = Σ_{i=1}^{N} S^(i)    (4)
where S^(i) is the contribution from the ith sub-domain, with S^(i) = A_BB^(i) − A_BI^(i) (A_II^(i))^(-1) A_IB^(i), and A_XX^(i) denotes submatrices of Ai. Without more sophisticated preconditioners, these two domain decomposition methods are not numerically scalable [12]; that is, the number of iterations required grows significantly with the number of sub-domains. However, in this section we are only interested in the study of the influence of the orthogonalization schemes on the scalability of the GMRES iterations from a computer science
point of view. This is the reason why we selected those two domain decomposition methods that exhibit a large amount of parallelism, and we intend to see how their scalability is affected by the GMRES solver. For a detailed overview of domain decomposition techniques, we refer to [12]. For both parallel domain decomposition implementations, we allocate one sub-domain to one processor of the target distributed computer. To reduce as much as possible the time per iteration we use an efficient sparse direct solver from the Harwell library [7] for the solution of the local Dirichlet problems arising in the Schur and Schwarz methods. To study the scalability of the code, we keep the number of nodes per sub-domain constant when we increase the number of processors. In the experiments reported in this paper, each sub-domain contains 64 × 64 nodes. For the additive Schwarz preconditioner, we selected an overlap of one element (i.e. δ = 1) between the sub-domains, which only requires one communication after the local solution with Ai^(-1), while one more communication would be necessary before the solution for a larger overlap. In Figure 9 the elapsed times for both the Schur and the Schwarz approaches observed on a 128 node Cray T3D are displayed. If the parallel code were perfectly scalable, the elapsed time would remain constant when the number of processors is increased.
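For reference, a hedged serial numpy sketch of the reduction (2)–(4) to the interface system S u_B = g_B is given below; the per-sub-domain blocks are assumed to be already expanded to the global interface numbering, which is an illustrative simplification of the parallel implementation.

```python
import numpy as np

def schur_system(subdomains, A_BB, f_B):
    # subdomains: list of (A_II_i, A_IB_i, A_BI_i, f_I_i) per sub-domain,
    # with A_BB equal to the sum of the local interface blocks A_BB^(i)
    S, g = A_BB.copy(), f_B.copy()
    for A_II, A_IB, A_BI, f_I in subdomains:
        S -= A_BI @ np.linalg.solve(A_II, A_IB)   # subtract A_BI A_II^{-1} A_IB
        g -= A_BI @ np.linalg.solve(A_II, f_I)    # g_B = f_B - A_BI A_II^{-1} f_I
    return S, g

# In practice S is never formed explicitly: GMRES only needs S times a vector,
# which amounts to one local Dirichlet solve per sub-domain plus interface exchanges.
```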
Fig. 9. Elapsed time for the Schur complement method and for the Schwarz method (MGS, IMGS, CGS, ICGS) versus the number of processors.
We also define the scaled speed-up by SU_p = p × T_4 / T_p, where T_p is the elapsed time to perform a complete restart step of GMRES(50) on p processors. For both the Schur and the Schwarz approaches we report in Figure 11 the scaled speed-ups associated with the elapsed times displayed in Figure 9.
Fig. 10. Elapsed time for the Schur complement method and for the Schwarz method on 16 to 256 processors (+ : MGS, * : IMGS, ◦ : CGS, × : ICGS).
Fig. 11. Scaled speed-ups for the Schur complement method and for the Schwarz method (MGS, IMGS, CGS, ICGS) on 4 to 128 processors.
Fig. 12. Scaled speed-ups for the Schur complement method and for the Schwarz method on 16 to 256 processors (+ : MGS, * : IMGS, ◦ : CGS, × : ICGS).
As can be seen in Figure 9, the solution of the Schur complement system does not require iterative orthogonalization; the curves associated with CGS/ICGS and with MGS/IMGS perfectly overlap each other. In such a situation, the CGS/ICGS orthogonalizations are more attractive than MGS/IMGS. They are faster and exhibit a better parallel scalability, as can be seen in the left picture of Figure 11. On 128 nodes the scaled speed-up is equal to 114 for CGS/ICGS and only 95 for MGS/IMGS. When iterative orthogonalization is required, as for the Schwarz method on our example for instance, CGS is the fastest and the most scalable but may lead to a loss of orthogonality in the Krylov basis, resulting in poor convergence or even a loss of convergence. In that case ICGS offers the best trade-off between numerical robustness and parallel efficiency. Among the numerically reliable orthogonalization schemes, ICGS gives rise to the fastest and the most scalable iterations.
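The communication pattern behind this observation can be sketched with mpi4py as follows (row-block storage of the basis is assumed; this is an illustration, not the authors' code): CGS/ICGS need a single reduction on a length-j vector per orthogonalization, whereas MGS needs j separate latency-bound reductions.

```python
import numpy as np
from mpi4py import MPI

def gathered_coefficients(V_local, w_local, comm):
    # CGS/ICGS style: local partial products, then one global reduction
    h = V_local.T @ w_local
    comm.Allreduce(MPI.IN_PLACE, h, op=MPI.SUM)
    return h

def one_by_one_coefficients(V_local, w_local, comm):
    # MGS style: one scalar reduction per basis vector
    # (true MGS also updates w_local between the dot products;
    #  only the reduction count matters for this illustration)
    h = np.empty(V_local.shape[1])
    for i in range(V_local.shape[1]):
        h[i] = comm.allreduce(float(V_local[:, i] @ w_local), op=MPI.SUM)
    return h
```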
4 Conclusion
The choice of the orthogonalization scheme is crucial to obtain good performance from Krylov-based iterative methods in a parallel distributed environment. From a numerical point of view, CGS should be discarded together with MGS in some eigenproblem computations [3]. Recent works have proved that for linear systems, MGS, IMGS and ICGS ensure enough orthogonality to the computed basis so that the method converges. Finally, in a parallel distributed environment, ICGS is the orthogonalization method of choice because, by gathering the dot products, it reduces significantly the overhead due to communication.
References
1. Å. Björck. Numerics of Gram-Schmidt orthogonalization. Linear Algebra Appl., 197–198:297–316, 1994.
2. T. Braconnier, V. Frayssé, and J.-C. Rioual. ARNCHEB users' guide: Solution of large non symmetric or non hermitian eigenvalue problems by the Arnoldi-Tchebycheff method. Tech. Rep. TR/PA/97/50, CERFACS, 1997.
3. F. Chaitin-Chatelin and V. Frayssé. Lectures on Finite Precision Computations. SIAM, Philadelphia, 1996.
4. J. Drkošová, A. Greenbaum, Z. Strakoš, and M. Rozložnik. Numerical stability of GMRES. BIT, 35, 1995.
5. V. Frayssé, L. Giraud, and S. Gratton. A set of GMRES routines for real and complex arithmetics. Technical Report TR/PA/97/49, CERFACS, 1997.
6. A. Greenbaum, Z. Strakoš, and M. Rozložnik. Numerical behaviour of the modified Gram-Schmidt GMRES implementation. BIT, 37:707–719, 1997.
7. HSL. Harwell Subroutine Library. A Catalogue of Subroutines (Release 12). AEA Technology, Harwell Laboratory, Oxfordshire, England, 1995.
8. M. Rieutord. Inertial modes in the liquid core of the earth. Phys. Earth Plan. Int., 90:41–46, 1995.
9. M. Rieutord and L. Valdettaro. Inertial waves in a rotating spherical shell. J. Fluid Mech., 341:77–99, 1997.
10. Y. Saad and M. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856–869, 1986.
11. J. N. Shadid and R. S. Tuminaro. A comparison of preconditioned nonsymmetric Krylov methods on a large-scale MIMD machine. SIAM J. Sci. Comp., 14(2):440–459, 1994.
12. B. F. Smith, P. Bjørstad, and W. Gropp. Domain Decomposition, Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, 1st edition, 1996.
A Parallel Solver for Extreme Eigenpairs
Leonardo Borges and Suely Oliveira
Computer Science Department, Texas A&M University, College Station, TX 77843-3112, USA. [email protected].
Abstract. In this paper a parallel algorithm for finding a group of extreme eigenvalues is presented. The algorithm is based on the well known Davidson method for finding one eigenvalue of a matrix. Here we incorporate knowledge about the structure of the subspace through the use of an arrowhead solver which allows more parallelization in both the original Davidson and our new version. In our numerical results various preconditioners (diagonal, multigrid and ADI) are compared. The performance results presented are for the Paragon but our implementation is portable to machines which provide MPI and BLAS.
1 Introduction
A large number of scientific applications rely on the computation of a few eigenvalues for a given matrix A. Typically they require the lowest or highest eigenvalues. Our algorithm (DSE) is based on the Davidson algorithm, but calculates various eigenvalues through implicit shifting. DSE was first presented in [10] under the name RDME to express its ability to identify eigenvalues with multiplicity bigger than one. The choice of preconditioner is an important issue in eliminating convergence to the wrong eigenvalue [14]. In the next section, we describe the Davidson algorithm and our version for computing several eigenvalues. In [9] Oliveira presented convergence rates for Davidson-type algorithms depending on the type of preconditioner. These results are summarized here in Section 3. Section 4 addresses parallelization strategies, discussing the data distribution in a MIMD architecture and a fast solver for the projected subspace eigenproblem. In Section 5 we present numerical and performance results for the parallel implementation on the Paragon. Further results about the parallel algorithm and other numerical results are presented in [2].
2 The Davidson Algorithm
Two of the most popular iterative methods for large symmetric eigenvalue problems are Lanczos and Davidson algorithms. Both methods solve the eigenvalue
This research is supported by NSF grant ASC 9528912 and a Texas A&M University Interdisciplinary Research Initiative Award.
problem Au = λu by constructing an orthonormal basis Vk = [v1, . . . , vk] at each kth iteration step, and then finding an approximation for the eigenvector u of A by using a vector uk from the subspace spanned by Vk. Specifically, the original problem is projected onto the subspace, which reduces the problem to a smaller eigenproblem Sk y = λ̃k y, where Sk = Vk^T A Vk. Then the eigenpair (λ̃k, yk) can be obtained by applying an efficient procedure for small matrices. To complete the iteration, the eigenvector yk is mapped back as uk = Vk yk, which is an approximation to the eigenvector u of the original problem. The difference between the two algorithms consists in the way the basis Vk is built. The attractiveness of the Lanczos algorithm results from the fact that each projected matrix Sk is tridiagonal. Unfortunately, sometimes this method may require a large number of iterations. The Davidson algorithm defines a dense matrix Sk on the subspace, but since we can incorporate a preconditioner in this algorithm the number of iterations can be much lower than for Lanczos. In Davidson-type algorithms, a preconditioner Mλk is applied to the current residual, rk = A uk − λ̃k uk, and the preconditioned residual tk = Mλk rk is orthonormalized against the previous columns of Vk = [v1, v2, . . . , vk]. Although in the original formulation Mλ is the diagonal matrix (diag(A) − λI)^(-1) [6], the Generalized Davidson (GD) algorithm allows the incorporation of different operators for Mλ. The DSE algorithm can be summarized as follows.
Algorithm 1 – Restarted Davidson for Several Eigenvalues
Given a matrix A, a normalized vector v1, the number of eigenpairs p, the restart index q, and the minimal dimension m for the projected matrix S (m > p), compute approximations λ and u for the p smallest eigenpairs of A.
1. Set V1 ← [v1]. (initial guess)
2. For j = 1, . . . , p (approximation for the j-th eigenpair)
   While (k = 1) or (‖rk−1‖ ≥ ε) do
   (a) Project Sk = Vk^T A Vk.
   (b) If (m + q) ≤ dim S (restart Sk): reduce Sk ← (Λk)_{m×m} to its m smallest eigenpairs, and update Vk for the new basis.
   (c) Compute the j-th smallest eigenpair λk, yk of Sk.
   (d) Compute the Ritz vector uk ← Vk yk.
   (e) Check convergence for rk ← A uk − λk uk.
   (f) Apply the preconditioner tk ← M rk.
   (g) Expand the basis Vk+1 ← [Vk, tk] using modified Gram-Schmidt (MGS).
   End while
3. End For.
The core ideas of DSE (Algorithm 1) are based on the projection of A into the subspace spanned by the columns of Vk. The iteration number k is not necessarily equal to dim Sk, since we have incorporated implicit restarts. The matrix Sk is obtained by adding one more column and row Vk^T A vk to the matrix Sk−1 (step 2.a). Other important aspects of the DSE algorithm are:
(1) the eigenvalue solver for the subspace matrix Sk (step 2.c); (2) the use of an auxiliary matrix Wk = [w1, . . . , wk] to provide a residual calculation rk = A uk − λk uk = Wk yk − λk uk with less computational work (step 2.e); (3) the choice of a preconditioner M (step 2.f); and (4) the use of modified Gram-Schmidt orthonormalization (step 2.g), which preserves numerical stability when updating the orthonormal basis Vk+1. At each iteration, the algorithm expands the matrix S either until all of the first p eigenvalues have converged, or until S reaches a maximum dimension m + q. In the latter case, restarting is applied by using the orthonormal decomposition Sk = Yk^T Λk Yk of S. It corresponds to step 2.b in the algorithm. Because of our choice for m, note that in step 2.c dim S will always be greater than or equal to j.
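A compact numpy sketch of the Davidson iteration underlying Algorithm 1 is shown below for the single smallest eigenpair, with dense algebra, the diagonal (original Davidson) preconditioner and implicit restarting. It follows the structure of the algorithm but is not the DSE code, and it assumes diag(A) − θ has no zero entries.

```python
import numpy as np

def davidson_smallest(A, v1, tol=1e-7, m=15, q=10, maxit=500):
    n = A.shape[0]
    V = (v1 / np.linalg.norm(v1)).reshape(n, 1)
    W = A @ V                                   # auxiliary matrix W_k = A V_k
    dA = np.diag(A)
    theta, u = None, None
    for _ in range(maxit):
        S = V.T @ W                             # projected matrix S_k (step 2.a)
        if S.shape[0] >= m + q:                 # restart (step 2.b)
            lam, Y = np.linalg.eigh(S)
            V, W = V @ Y[:, :m], W @ Y[:, :m]
            S = np.diag(lam[:m])
        lam, Y = np.linalg.eigh(S)
        theta, y = lam[0], Y[:, 0]              # smallest Ritz pair (step 2.c)
        u = V @ y                               # Ritz vector (step 2.d)
        r = W @ y - theta * u                   # residual via W, cheaper (step 2.e)
        if np.linalg.norm(r) < tol:
            break
        t = r / (dA - theta)                    # diagonal preconditioner (step 2.f)
        t -= V @ (V.T @ t)                      # orthogonalize against V (step 2.g)
        t /= np.linalg.norm(t)
        V = np.hstack([V, t[:, None]])
        W = np.hstack([W, (A @ t)[:, None]])
    return theta, u
```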
3 Convergence Rate
A proof of convergence (but without a rate estimate) for the Davidson algorithm is given in Crouzeix, Philippe and Sadkane [5]. A bound on the convergence rate was first presented in [10]. The complete proof is shown in Oliveira [9]. Let A be the given matrix whose eigenvalues and eigenvectors are wanted. The preconditioner M is given for one step, and Davidson's algorithm is used with uk being the current computed approximate eigenvector. The current eigenvalue estimate is the Rayleigh quotient λ̃k = ρ_A(uk) = (uk^T A uk)/(uk^T uk). Let the exact eigenvector with the smallest eigenvalue of A be u, and Au = λu. (If λ is a repeated eigenvalue of A, then we can let u be the normalized projection of uk onto this eigenspace.)
Theorem 1. Let P be the orthogonal projection onto ker(A − λI)⊥. Suppose that A and M are symmetric positive definite. If ‖P − P M P (A − λI)‖₂ ≤ σ < 1, then for almost any starting value x1, the eigenvalue estimates λ̃k converge to λ ultimately geometrically with convergence factor bounded by σ², and the angle between the computed eigenvector and the exact eigenspace goes to zero ultimately geometrically with convergence factor bounded by σ.
A geometric convergence rate can be found for DSE (which obtains eigenvalues beyond the smallest (or largest) eigenvalue) by modifying Theorem 1. In the following theorem assume that σ′ = ‖P − P M P (A − λp I) P‖₂, where P is the orthogonal projection onto the orthogonal complement of the span of the first p − 1 eigenvectors. Then we can show, in a similar way to Theorem 1, that the convergence factor for the new algorithm is bounded by (σ′)².
To prove Theorem 2 we use the fact that P sk = sk, as sk is orthogonal to the bottom p eigenvectors, and that although (A − λp I) is no longer positive semi-definite, P(A − λp I)P is.
Theorem 2. Suppose that A and M are symmetric positive definite and that the first p − 1 eigenvectors have been found exactly. Let P be the orthogonal projection onto the orthogonal complement of the span of the first p − 1 eigenvectors of A. If ‖P − P M P (A − λp I) P‖₂ ≤ σ′ < 1, then for almost any starting value x1, the eigenvalue estimates λ̃k obtained by our modified Davidson algorithm for several eigenvalues converge to λp ultimately geometrically with convergence factor bounded by (σ′)², and the angle between the exact and computed eigenvectors goes to zero ultimately geometrically with convergence factor bounded by σ′.
4 Parallel Implementation
Previous implementations of the Davidson algorithm solve the eigenvalue problem in the subspace S by using algorithms for dense matrices: early works [3,4,17] adopt EISPACK [12] routines, and later implementations [13,15] use LAPACK [1] or reductions to tridiagonal form. Partial parallelization is obtained through the matrix-vector operations and a sparse storage format for the matrix A [13,15]. Here we explore the relationship between two successive matrices Sk, which allows us to represent Sk through an arrowhead matrix. The arrowhead structure is extremely sparse and the associated eigenvalue problem can be solved by a highly parallelizable method.
4.1 Data Distribution
Data partitioning significantly affects the performance of a parallel system by determining the actual degree of concurrency of the processors. Matrices are partitioned across distinct processors so that the program exploits the best possible data parallelism: the final distribution is well balanced, and most of the computational work can be performed without communication. These two conditions make the parallel program well suited for distributed memory architectures. Both the computational workload and the storage requirements are the same for all processors, and the communication overhead is kept as low as possible. Matrix A is split into row blocks Ai, i = 1, . . . , N, each one containing about n/N rows of A. Thus processor i, i = 1, . . . , N, stores Ai, the ith row block of A. Matrices Vk and Wk are stored in the same fashion. This data distribution allows us to perform many of the matrix-vector computations in place. The orthonormalization strategy is also an important aspect in parallel environments. Recall that the modified Gram-Schmidt (MGS) algorithm will be applied to the extended matrix [Vk, tk], where the current basis Vk has been previously orthonormalized. This observation reduces the computational work by eliminating the outer loop from the two nested loops in the full MGS algorithm.
4.2 The Arrowhead Relationship Between Matrices Sk
As pointed out in [2,10], the relationship between Sk and Sk−1 can be used to show that Sk is explicitly similar to an arrowhead matrix S̃k of the form
S̃k = [ Λk−1    s̃k  ]
     [ s̃k^T   skk ] ,    (1)
where s̃k = Yk−1^T Vk−1^T wk, skk = vk^T wk, and the diagonal matrix Λk−1 corresponds to the orthonormal decomposition Sk−1 = Yk−1^T Λk−1 Yk−1. In practice, the matrix Sk does not need to be stored: only a vector for Λk and a matrix for Yk are required from one iteration to the next. Thus, given the eigenvalues Λk−1 and eigenvectors Yk−1 of Sk−1, the matrix S̃k can be used to find the eigenvalues Λk of Sk. Arrowhead eigensolvers [8,11] are highly parallelizable and typically perform O(k²) operations, instead of the usual O(k³) effort of algorithms for dense matrices S.
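A short numpy sketch of the update (1), with a dense symmetric eigensolver standing in for a dedicated O(k²) arrowhead solver, could look as follows (illustrative names only):

```python
import numpy as np

def arrowhead_eigen(lam_prev, Y_prev, V_prev, v_k, w_k):
    # lam_prev, Y_prev: eigenvalues/eigenvectors of S_{k-1}; w_k = A v_k
    s_tilde = Y_prev.T @ (V_prev.T @ w_k)      # s~_k = Y_{k-1}^T V_{k-1}^T w_k
    s_kk = float(v_k @ w_k)                    # bottom-right entry
    k = lam_prev.size + 1
    S = np.zeros((k, k))
    S[:-1, :-1] = np.diag(lam_prev)            # diagonal shaft of the arrowhead
    S[:-1, -1] = S[-1, :-1] = s_tilde          # the arrow
    S[-1, -1] = s_kk
    return np.linalg.eigh(S)                   # eigenpairs of S_k
```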
5 Numerical Results
In our numerical results we employ three kinds of preconditioners: a diagonal preconditioner (as in the original Davidson), multigrid and ADI. A preconditioner can be expressed as the matrix M which solves Ax = b by applying an iterative method to MAx = Mb instead. In the case of a diagonal preconditioner this corresponds to scaling the system and then solving. Multigrid and ADI preconditioners are more complex and for these we refer the reader to [16,18,19]. In our implementation level 1, 2 and 3 BLAS and the Message Passing Interface (MPI) library were used for easy portability. The computational results in this section were obtained with a finite difference approximation for
−Δu + gu = f    (2)
on a unit square domain. Here g is null inside a 0.2 × 0.2 square at the center of the domain and g = 100 on the remaining domain. To compare the performance delivered by distinct preconditioners we observe the total timing and number of iterations required by the sequential DSE for finding the ten smallest eigenpairs (p = 10), assuming convergence for residual norms less than or equal to 10^-7. The restart indexes were q = 10 and m = 15. This corresponds to applying restarting every time the projected matrix Sk reaches order 25, reducing its order to 15. Table 1 presents the actual running timings on a single processor of the Intel Paragon for three grid sizes: 31×31, 63×63, and 127×127 (matrices of orders 961, 3969 and 16129, respectively). It reflects the trade-off between preconditioning strategies: although the diagonal preconditioner (DIAG) is the easiest and fastest to compute, it requires an increasing number of iterations for larger matrices. Multigrid preconditioners (MG) are more expensive than DIAG, but they turn out to be more effective for larger matrices. Finally, the ADI method combines the advantages of the previous preconditioners in the sense that it is more effective and less expensive than
MG. More details about the preconditioners used here can be found in [2] and its references.
Table 1. Sequential times and number of iterations for three preconditioners.

        matrix order 961        matrix order 3969       matrix order 16129
        iterations  time (sec)  iterations  time (sec)  iterations  time (sec)
ADI         29         4.7          26        16.2          27        80.7
MG          34         8.8          40        43.0          40       254.1
DIAG       174         8.6         319        43.2         700       386.0
The overall behavior of the DSE algorithm (with a multigrid preconditioner) is shown in Figure 1 for matrix sizes 3969 and 16129, as a function of the number of processors. Note that the estimated optimal number of processors is not far from the actual optimum. The model for our estimates is presented in [2].
Fig. 1. Actual times compared with theoretical estimates for equation (2): time (sec) versus the number of processors. Results for two matrix sizes (orders 3969 and 16129) on the Paragon are shown.
To conclude, we compare the performance of the parallel DSE with PARPACK [7], a parallel implementation of ARPACK 1 . Figure 2 presents the total running times for both algorithms for the problem described above. For these 1
ARPACK implements an Implicitly Restarted Arnoldi Method (IRAM) which in the symmetric case corresponds to the Implicitly Restarted Lanczos algorithm. We used the regular mode when running PARPACK.
A Parallel Solver for Extreme Eigenpairs
769
runs, DSE used our parallel implementation of the ADI as its preconditioner. The problem was solved by using 4, 8, and 16 processors to obtain relative residuals Au − λu/u of order less than 10−5 . We show our theoretical analysis for the parallel algorithm in [2]. Other numerical results for the sequential DSE algorithm, including examples showing the behavior of the algorithm for eigenvalues with multiplicity greater than one, were presented in [10].
Fig. 2. Running times for DSE (with ADI preconditioner) and PARPACK using 4, 8, and 16 processors on the Paragon, for three distinct problem sizes (time in sec versus problem size, log scale).
References
1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK User's Guide. SIAM, Philadelphia, 1992.
2. L. Borges and S. Oliveira. A parallel Davidson-type algorithm for several eigenvalues. Journal of Computational Physics. To appear.
3. G. Cisneros, M. Berrondo, and C. F. Brunge. DVDSON: A subroutine to evaluate selected sets of eigenvalues and eigenvectors of large symmetric matrices. Compu. Chem., 10:281–291, 1986.
4. G. Cisneros and C. F. Brunge. An improved computer program for eigenvectors and eigenvalues of large configuration interaction matrices using the algorithm of Davidson. Compu. Chem., 8:157–160, 1984.
5. M. Crouzeix, B. Philippe, and M. Sadkane. The Davidson method. SIAM J. Sci. Comput., 15(1):62–76, 1994.
6. E. R. Davidson. The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comp. Phys., 17:87–94, 1975.
7. K. Maschhoff and D. Sorensen. A portable implementation of ARPACK for distributed memory parallel architectures. In Proceedings of Copper Mountain Conference on Iterative Methods, April 9–13 1996.
8. D. P. O'Leary and G. W. Stewart. Computing the eigenvalues and eigenvectors of symmetric arrowhead matrices. J. Comp. Phys., 90:497–505, 1990.
9. S. Oliveira. On the convergence rate of a preconditioned algorithm for eigenvalue problems. Submitted.
10. S. Oliveira. A convergence proof of an iterative subspace method for eigenvalues problem. In F. Cucker and M. Shub, editors, Foundations of Computational Mathematics Selected Papers, pages 316–325. Springer, January 1997.
11. S. Oliveira. A new parallel chasing algorithm for transforming arrowhead matrices to tridiagonal form. Mathematics of Computation, 67(221):221–235, January 1998.
12. B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler. Matrix eigensystem routines: EISPACK guide. Number 6 in Lecture Notes Comput. Sci. Springer-Verlag, Berlin, Heidelberg, New York, second edition, 1976.
13. A. Stathopoulos and C. F. Fischer. A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comp. Phys. Comm., 79:268–290, 1994.
14. A. Stathopoulos, Y. Saad, and C. F. Fisher. Robust preconditioning of large, sparse, symmetric eigenvalue problems. J. Comp. and Appl. Mathematics, 64:197–215, 1995.
15. V. M. Umar and C. F. Fischer. Multitasking the Davidson algorithm for the large, sparse eigenvalue problem. Int. J. Supercomput. Appl., 3:28–53, 1989.
16. R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, 1963.
17. J. Weber, R. Lacroix, and G. Wanner. The eigenvalue problem in configuration interaction calculations: A computer program based on a new derivation of the algorithm of Davidson. Compu. Chem., 4:55–60, 1980.
18. J. R. Westlake. A handbook of numerical matrix inversion and solution of linear equations. Wiley, 1968.
19. D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, 1971.
Parallel Solvers for Large Eigenvalue Problems Originating from Maxwell's Equations
Peter Arbenz and Roman Geus
Swiss Federal Institute of Technology (ETH), Institute of Scientific Computing, CH-8092 Zurich
{arbenz,geus}@inf.ethz.ch
Abstract. We present experiments with two new solvers for large sparse symmetric matrix eigenvalue problems: (1) the implicitly restarted Lanczos algorithm and (2) the Jacobi-Davidson algorithm. The eigenvalue problems originate in the computation of a few of the lowest frequencies of standing electromagnetic waves in cavities that have been discretized by the finite element method. The experiments have been conducted on up to 12 processors of an HP Exemplar X-Class multiprocessor computer.
1 Introduction
Most particle accelerators use standing waves in cavities to produce the high voltage RF (radio frequency) fields required for the acceleration of the particles. The mathematical model for these high frequency electromagnetic fields is the eigenvalue problem solving Maxwell's equations in a bounded volume. Usually, the eigenfield corresponding to the fundamental mode of the cavity is used as the accelerating field. Due to higher harmonic components contained in the RF power fed into the cavity, and through interactions between the accelerated particles and the electromagnetic field, an excitation of higher order modes can occur. The RF engineer designing such an accelerating cavity therefore needs a tool to compute the fundamental and about ten to twenty of the following eigenfrequencies together with the corresponding electromagnetic eigenfields. After separation of variables depending on space and time and after elimination of the magnetic field terms, the variational form of the eigenvalue problem for the electric field intensity is given by [1]
Find (λ, u) ∈ ℝ × W0 such that u ≠ 0 and (curl u, curl v) = λ(u, v), ∀v ∈ W0.    (1)
Let L²(Ω) be the Hilbert space of square-integrable functions over the 3-dimensional domain Ω (the cavity) with inner product (u, v) = ∫_Ω u(x) · v(x) dx and norm ‖u‖_{0,Ω} := (u, u)^{1/2}. In (1), W0 := {v ∈ W | div v = 0} with W := {v ∈ L²(Ω)³ | curl v ∈ L²(Ω)³, div v ∈ L²(Ω), n × v = 0 on ∂Ω}.
The difficulty with (1) stems from the condition div v = 0, as it is hard to find divergence-free finite elements. Therefore, ways have been sought to get around the condition div v = 0. In this process care has to be taken in order not to introduce so-called spurious modes, i.e. eigenmodes that have no physical meaning [8]. We considered two approaches free of spurious modes, a penalty method and an approach based on Lagrange multipliers. In the penalty method approach (1) is replaced by [9,11]
For fixed s > 0, find (λ, u) ∈ ℝ × W such that u ≠ 0 and (curl u, curl v) + s (div u, div v) = λ(u, v), ∀v ∈ W.
(2)
Here, s is a positive, usually small parameter. The eigenmodes u(x) of (2) corresponding to eigenvalues λ < µ1 s are eigenmodes of (1). µ1 is the smallest eigenvalue of the negative Laplace operator −∆ on Ω. When discretized by ordinary nodal-based finite elements, (2) leads to matrix eigenvalue problems of the form Ax = λM x.
(3)
For positive s, both A and M are symmetric positive definite sparse n-by-n matrices. In the mixed formulation the divergence-free condition is enforced by means of Lagrange multipliers [9].
Find (λ, u, p) ∈ ℝ × H0(curl; Ω) × H0¹(Ω) such that u ≠ 0 and
(a) (curl u, curl Ψ) + (grad p, Ψ) = λ(u, Ψ), ∀Ψ ∈ H0(curl; Ω)
(b) (u, grad q) = 0, ∀q ∈ H0¹(Ω)
(4)
Here, H0(curl; Ω) := {v ∈ L²(Ω)³ | curl v ∈ L²(Ω)³, n × v = 0 on ∂Ω}. The finite element discretization of (4) yields a matrix eigenvalue problem of the form
[ A    C ] [ x ]     [ M  O ] [ x ]
[ C^T  O ] [ y ] = λ [ O  O ] [ y ] .    (5)
where A and M are n-by-n and C is n-by-m. M is positive definite, A is only positive semidefinite. It turns out that (5) becomes most convenient to handle if the finite elements for the vector fields are chosen to be edge elements as proposed by Nédélec [12,8] and the Lagrange multipliers are represented by nodal-based finite elements of the same degree. Then, the columns of M^(-1)C form a basis for the nullspace of A and, in principle, it suffices to compute the eigenvalues of an eigenproblem formally equal to (3) but with A and M from (5). To get rid of the high-dimensional eigenspace associated with the eigenvalue zero, Bespalov [4] proposed to replace (5) by
Ãx = λMx,   Ã = A + C H C^T,    (6)
where H is a positive definite matrix chosen such that the zero eigenvalues are shifted to the right of the desired eigenvalues and do not disturb the computations. In this note we consider solvers for Ax = λM x with A and M from the penalty approach (3) as well as from the mixed element approach (6).
2 Algorithms
In this section we briefly survey the numerical methods that we will apply to the model problem, which is a cavity in the form of a rectangular box, Ω = (0, a) × (0, b) × (0, c). We investigate two algorithms for computing a few of the smallest (positive) eigenvalues of the matrix eigenvalue problems originating from both the penalty method and the mixed method. For computing a few, say p, eigenvalues of a sparse matrix eigenvalue problem
Ax = λMx,   A = A^T,   M = M^T > 0,    (7)
closest to a number τ, it is advisable to make a so-called shift-and-invert approach and apply a spectral transformation with a shift σ close to τ and solve [7]
(A − σM)^(-1) M x = µx,   µ = 1/(λ − σ),    (8)
instead of solving (7). Notice that (A − σM)^(-1)M is M-symmetric, i.e., it is symmetric with respect to the inner product x^T M y. The spectral transformation leaves the eigenvectors unchanged. The eigenvalues of (7) close to the shift σ become the eigenvalues of (8) that are largest in absolute value. They are relatively well separated, which improves the speed of convergence. The cost of the improved convergence rate is the need to solve (at least approximately) systems of equations with the matrix A − σM. In all algorithms the shift σ was chosen to be 48 < λ1 ≈ 48.4773.
1. The Implicitly Restarted Lanczos algorithm (IRL). Because of the large memory consumption of the Lanczos algorithm it is often impossible to proceed until convergence. It is then necessary to restart the iterative process in some way with as little loss of information as possible. An elegant way to restart has been proposed by Sorensen for the Arnoldi algorithm [16]; see [5] for the symmetric Lanczos case. Software is publicly available in ARPACK [10]. The algorithm is based on the spectral transformation Lanczos algorithm. The iteration process is executed until j = p + k, where k is some positive integer, often k = p. Complete reorthogonalization is done for stability reasons. This is possible since by assumption p + k is not big. In the restarting phase a clever application of the QR algorithm [14] reduces the dimension of the search space to p in such a way that the p new orthogonal basis vectors still form a Krylov sequence [16,5]. This allows the Lanczos phase to be resumed smoothly. Here, we solved the symmetric indefinite system of equations (A − σM)x = y iteratively by SYMMLQ [13,3]. The accuracy of the solution of the linear system has to be at least as high as the desired accuracy in the eigenvalue calculation in order that the coefficients of the Lanczos three-term recurrence are reasonably accurate [10]. We experimented with various preconditioners. None of them was satisfactory. We obtained the best results with diagonal preconditioning. In our implementation we chose p = k = 15. Besides the storage for the matrices A and M, IRL requires space for two n × (p + k) arrays.
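For comparison, SciPy exposes the same ARPACK routines; a hedged sketch of the shift-and-invert setup (7)–(8), using a sparse direct factorization for A − σM instead of the iterative SYMMLQ solves described above, is:

```python
import scipy.sparse.linalg as spla

def lowest_modes(A, M, sigma=48.0, p=15):
    # implicitly restarted Lanczos (ARPACK) in shift-and-invert mode:
    # returns the p eigenvalues of A x = lambda M x closest to sigma
    return spla.eigsh(A, k=p, M=M, sigma=sigma, which='LM')
```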
2. The Jacobi-Davidson algorithm (JDQR). In the Jacobi-Davidson algorithm the eigenpairs of Ax = λMx are computed one by one. As with the implicitly restarted Lanczos procedure, the search space Vj is expanded up to a certain dimension, say j = jmax. The basis of this space is constructed to be M-orthogonal but there is no three-term recurrence relation. In the expansion phase the search space is extended by a vector v orthogonal to the current eigenpair approximation (λ̃, q̃) by means of the equation [15]
(I − q̃ q̃^T M)(A − λ̃M) v = −(I − q̃ q̃^T M) r,   (I − q̃ q̃^T M) v = v.    (9)
(10)
˜ . They also where K is a good and easily invertible approximation of A − λM give a generalized inverse of the matrix in (10). We solved the systems by the conjugate gradient squared (CGS) method [3] with K equal to the diagonal of ˜ . The next approximate eigenvalue λ ˜ and corresponding Ritz vector A − λM ˜ are obtained by a Ritz step for the subspace Vj+1 . If the search space has q dimension jmax it is shrunk to dimension jmin by selecting only those jmin Ritz vectors corresponding to Ritz values closest to the target value τ = 48. In our experiments we set jmin = p = 15 and jmax = 2p as suggested in [10]. Thus, besides the storage for the matrices A and M , memory space is needed for a n × jmax array and for three n × p arrays.
3
Numerical Experiments
In this section we compare IRL and JDQR for computing the 15 smallest eigenvalues of (1) with Ω = (0, a) × (0, b) × (0, c) where a = 0.8421, b = 0.5344, c = 0.2187. We apply penalty as well as mixed methods to this model problem whose eigenvalues can be computed analytically [1]. The computational results have been obtained with the HP Exemplar X-class system at the ETH Zurich. 1. Sequential results. In Figs. 1 and 2 execution times for computing the 15 lowest eigenvalues and associated eigenvectors vs. the accuracy of the computed solutions relative to the analytic solution are plotted for different mesh sizes. The largest problems sizes were n = 65 125 and n = 45 226 for the linear and quadratic edge elements, and n = 28 278 and n = 118 539 for the linear and quadratic node elements. For all experiments we used a tolerance of 10−8 in the stopping criterion of the eigensolvers. Loosely speaking, this means that the eigenvalues are computed to at least 8 significant digits. The execution times comprise the solution time of the eigensolver but not the time for building the matrices. For each finite element type (linear/quadratic, node element/edge element) the performance of the two algorithms is shown. Fig. 1 shows the results
Parallel Solvers for Large Eigenvalue Problems
775
performance comparision of ARPACK and JDQR with nodal elements
5
10
lin node w/ arpack lin node w/ jdqr quad node w/ arpack quad node w/ jdqr
4
10
3
time
10
2
10
1
10
0
10
−1
10
−6
10
−5
−4
10
−3
10
−2
10 rel accuracy of λ
−1
10
10
0
10
1
Fig. 1. Comparison of IRL and JDQR with node elements: Accuracy of the computed λ1 relative to the analytic solution vs. computation time performance comparision of ARPACK and JDQR with edge elements
5
10
lin edge w/ arpack lin edge w/ jdqr quad edge w/ arpack quad edge w/ jdqr
4
10
3
time
10
2
10
1
10
0
10
−1
10
−5
10
−4
10
−3
−2
10
10
−1
10
0
10
rel accuracy of λ
1
Fig. 2. Comparison of IRL and JDQR with edge elements: Accuracy of the computed λ1 relative to the analytic solution vs. computation time
776
Peter Arbenz and Roman Geus
that we obtained with linear and quadratic node elements for λ1 . The higher eigenvalues behave similarly but are of course less accurate. Fig. 2 shows the results for the linear and quadratic edge elements. For edge elements the convergence is improved by introducing the matrix C as given in (6). Experiments to that end are reported in [2]. H in (6) was heuristically chosen to be αI with α = 100/h, where h is the mesh width. The comparison of linear with quadratic element types reveals immediately the inferiority of the former. They give much lower accuracy for a given computation time. The node elements in turn are to be preferred to the edge elements, at least in this simplified model problem. For a given computational effort, the eigenvalues obtained with the node elements are about an order of magnitude more accurate than those obtained with the edge elements. The situation is not so evident with the linear elements. With the quadratic elements and linear node elements JDQR is consistently faster than IRL by about 10 to 30% for the large problem sizes. For linear edge elements JDQR still is 10 to 20% ahead of IRL for most of the larger problem sizes. For the smaller problems the situation is not so clear. We now discuss the behavior of the two algorithms in the case of the quadratic edge elements where the problem size grows from n = 1408 to 45226. (In the latter case A and M have each about 1’730’000 nonzero elements.) With both algorithms the number of outer iteration steps varies only little. It is ∼ 65 with IRL and ∼ 210 with JDQR. So, the number of restarts does not depend on the problem size. Each outer iteration step requires the solution of one system of equations. We used the solver SYMMLQ with IRL and CGS for JDQR. One (inner) iteration step of CGS counts for about two iteration steps of CGS. We applied diagonal preconditioning in all cases. The superiority of JDQR over IRL for large problem sizes can be explained by the number of inner iteration steps. The average number of inner iteration steps per outer iteration step grows from 64 to 91 with JDQR but from 423 to 1140 with IRL. In ARPACK each system of equations is solved to high accuracy. In JDQR the accuracy requirement is not so stringent and actually varies from (outer) step to step, cf. §2.2. This explains the higher iteration numbers for IRL. The projections in (9) improve the condition number of the system matrix. Further, the shift σ is updated in JDQR while it stays constant with ARPACK. These may be the reasons for the less pronounced increase of the number of inner iteration steps with JDQR. Notice that the projections made in each iteration step account for less than 10% of the execution time of JDQR. Similar observations can be made with the other element types. 2. Parallel results. The parallel experiments were carried out on the 32 processor HP Exemplar X-class system at ETH Zurich. This shared-memory CCNUMA machine consists of two hypernodes with 16 HP PA-8000 processors and 4 GBytes memory each. The processors and memory-banks within a hypernode are connected through a crossbar-switch with a bandwidth of 960 MBytes/s per port. The hypernodes themselves are connected by a slower network with a ring
Parallel Solvers for Large Eigenvalue Problems
777
350
2200 arpack node arpack edge jdqr node jdqr edge
2000 1800
arpack node arpack edge jdqr node jdqr edge
300
1600
250
1400
200 time
time
1200 1000
150
800
100
600 400
50 200 0
1
2
3
4
5
6 7 8 number of processors
9
10
11
0
12
1
2
3
4
5
(a)
6 7 8 number of processors
9
10
11
12
(b)
Fig. 3. Execution times (a) for the large eigenproblem (n = 22323 for node, n = 5798 for edge elements) and (b) for the small eigenproblem (n = 4981 and n = 1408, respectively) 6
6 arpack node arpack edge jdqr node jdqr edge
5.5
5
5
4.5
4.5
4 speedup
speedup
4
3.5
3.5
3
3
2.5
2.5
2
2
1.5
1.5
1
arpack node arpack edge jdqr node jdqr edge
5.5
1
2
3
4
5
6 7 8 number of processors
(a)
9
10
11
12
1
1
2
3
4
5
6 7 8 number of processors
9
10
11
12
(b)
Fig. 4. Speedups (a) for the large eigenproblem and (b) for the small eigenproblem topology. The HP PA-8000 processors have a clock rating of 180 MHz and a peak performance of 720 MFLOPS. We carried out the parallel experiments for IRL (ARPACK) and JDQR using both quadratic node and quadratic edge elements and two different problem sizes. Linear elements are not considered because of the inferior results in the sequential case. For both edge and node elements we chose two problem sizes: the larger problems require about 1700 seconds one processor, whereas the smaller problems require only about 220 seconds. For our experiments we were allowed to use up to 12 processors. With ARPACK about 95% of the sequential computation time is spent in the inner loops forming sparse matrix-vector products with the matrices A, M and C. JDQR spends about 90% for this task. We therefore parallelized the sparse matrix-vector product using the so called HP Exemplar advanced sharedmemory programming technique, that is directive-based.
778
Peter Arbenz and Roman Geus
The matrix C and the strictly lower triangles of A and M are stored in the compressed sparse row format [3]. The diagonals of A and M are stored separately. We parallelized the outermost loops with the loop parallel directives. The necessary privatization of variables was done manually by means of the loop private directive. For the product with C and the lower triangular parts of A and M no special considerations were necesary. For the product with C T and the upper triangular parts of A and M the processors store their “local” result into distinct vectors, that are accumulated into the global result vector afterwards. To get a well-balanced work load we distributed the triangular matrices in block-cyclic fashion over the processors. In Fig. 3 the execution times for both problem sizes are plotted. In Fig. 4 the corresponding speedups are found. The best speedups are reached with 10 processors for the large problems. ARPACK’s IRL gives speedups of 5.8 for node elements and of 3.7 for edge elements. With JDQR we get 3.8 and 3.0, respectively. These numbers are lower for the small problem size. The reason for the better speedups with ARPACK is, that the implicit classical Gram-Schmidt orthogonalization which consumes between 5 and 10% of the sequential execution time in JDQR doesn’t scale well on the HP-Exemplar. The parallelized BLAS routine DGEMV in the HP MLIB library shows no speedup even though the matrices have more than 100’000 nonzero elements! We are currently resolving this issue with HP. Using a reasonably parallelizing DGEMV routine we expect JDQR to scale as well as ARPACK. Fig. 4 further shows that in our experiments node elements scale better than edge elements. Since we chose the problem sizes such that both edge and node elements require about the same computation time on one processor, and the linear systems arising from node elements are better conditioned, the matrices originating from edge elements are smaller. Matrices A and M stemming from edge elements together have about 7 times fewer non-zero elements than in the nodal case. So, the relative parallelization overhead is bigger for edge elements. For IRL the situation is improved because instead of A we store the shifted matrix A − σM , which has about twice as many non-zero elements. Our parallel experiments show the limitations of directive-based sharedmemory programming. Before and after each parallel section (e.g. parallel loops) of the code, the compiler inserts global synchronization operations to ensure correct execution. However, in most cases, these global synchronization operations are unnecessary. That is why our implementation doesn’t scale well. In order to remove unnecessary synchronization points, a different programming paradigm, such as message-passing or low-level shared-memory programming, has to be employed. Our parallel results depend very much on the computer architecture, the parallel libraries and the programming paradigm we used. They can therefore not be easily generalized. Nevertheless, our experiments prove that reasonable parallel performance can be obtained on small processor numbers with only modest programming effort using directive-based shared-memory programming on the HP Exemplar.
Parallel Solvers for Large Eigenvalue Problems
779
References 1. St. Adam, P. Arbenz, and R. Geus, Eigenvalue solvers for electromagnetic fields in cavities, Tech. Report 275, ETH Z¨ urich, Computer Science Department, October 1997, (Available at URL http://www.inf.ethz.ch/publications/tr.html). 771, 774 2. P. Arbenz and R. Geus, Eigenvalue solvers for electromagnetic fields in cavities, FORTWIHR International Conference on High Performance Scientific and Engineering Computing (F. Durst et al., ed.), Springer-Verlag, 1998, (Lecture Notes in Computational Science and Engineering). 776 3. R. Barret, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, Ch. Romine, and H. van der Vorst, Templates for the solution of linear systems: Building blocks for iterative methods, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1994, (Available from Netlib at URL http://www.netlib.org/templates/index.html). 773, 774, 778 4. A. N. Bespalov, Finite element method for the eigenmode problem of a RF cavity resonator, Soviet Journal of Numerical Analysis and Mathematical Modelling 3 (1988), 163–178. 772 5. D. Calvetti, L. Reichel, and D. C. Sorensen, An implicitely restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transmissions on Numerical Analysis 2 (1994), 1–21. 773 6. D. R. Fokkema, G. L. G. Sleijpen, and H. A. van der Vorst, Jacobi-Davidson style QR and QZ algorithms for the partial reduction of matrix pencils, Preprint 941, revised version, Utrecht University, Department of Mathematics, Utrecht, The Netherlands, January 1997. 774 7. R. Grimes, J. G. Lewis, and H. Simon, A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems, SIAM J. Matrix Anal. Appl. 15 (1994), 228–272. 773 8. J. Jin, The finite element method in electromagnetics, Wiley, New York, 1993. 772 9. F. Kikuchi, Mixed and penalty formulations for finite element analysis of an eigenvalue problem in electromagnetism, Computer Methods in Applied Mechanics and Engineering 64 (1987), 509–521. 772 10. R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK users’ guide: Solution of large scale eigenvalue problems by implicitely restarted Arnoldi methods, Department of Mathematical Sciences, Rice University, Houston TX, October 1997, (Available at URL http://www.caam.rice.edu/software/ARPACK/index.html). 773, 774 11. R. Leis, Zur Theorie elektromagnetischer Schwingungen in anisotropen Medien, Mathematische Zeitschrift 106 (1968), 213–224. 772 12. J. C. N´ed´elec, Mixed finite elements in IR3 , Numerische Mathematik 35 (1980), 315–341. 772 13. C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal. 12 (1975), 617–629. 773 14. B. N. Parlett, The symmetric eigenvalue problem, Prentice Hall, Englewood Cliffs, NJ, 1980. 773 15. G. L. G. Sleijpen, A. G. L. Booten, D. R. Fokkema, and H. A. van der Vorst, Jacobi-Davidson type methods for generalized eigenproblems and polynomial eigenproblems, BIT 36 (1996), 595–633. 774 16. D. Sorensen, Implicite application of polynomial filters in a k-step Arnoldi method, SIAM J. Matrix Anal. Appl. 13 (1992), 357–385. 773
Waveform Relaxation for Second Order Differential Equation y = f (x, y) Kazufumi Ozawa and Susumu Yamada Graduate School of Information Science, Tohoku University, Kawauchi Sendai 980-8576, JAPAN
Abstract. Waveform relaxation (WR) methods for second order equations y = f (t, y), y(t0 ) = y0 , y (t0 ) = y0 are studied. For linear case, the method converges superlinearly for any splittings of the coefficient matrix. For nonlinear case, the method converges quadratically only for waveform Newton method. It is shown, however, that the method with approximate Jacobian matrix converges superlinearly. The accuracy, execution times and speedup ratios of the WR methods on a parallel computer are discussed.
1
Introduction
In this paper we propose a waveform relaxation (WR) method for solving y = f (x, y) on a parallel computer. The basic idea of this method is to solve a sequence of differential equations, which converges to the exact solution, with a starting solution y [0] (x). In this iteration, we can solve each component(block) of the equation in parallel.
2
Linear Differential Equation
Consider first the linear equation of the form y (x) = Qy(x) + g(x),
y(x0 ) = y0 ,
y (x0 ) = y0 ,
y ∈ Rm ,
(1)
where we assume −Q is a positive definite matrix. The Picard iteration for solving (1) is given by
y [ν+1] (x) = Qy [ν] (x) + g(x),
y [ν] (x0 ) = y0 , y [ν] (x0 ) = y0 .
(2)
Here we consider the rate of convergence of (2) and propose an acceleration of the iteration. The solutions of (1) and (2) are given by x s {Qy(τ ) + g(τ )}dτ ds, (3) y(x) = y0 + (x − x0 )y0 + x0 x0 x s {Qy [ν] (τ ) + g(τ )}dτ ds, (4) y [ν+1] (x) = y0 + (x − x0 )y0 + x0
x0
D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 780–787, 1998. c Springer-Verlag Berlin Heidelberg 1998
Waveform Relaxation for Second Order Differential Equation
respectively. By subtracting (3) from (4) we can obtain x s ε[ν+1] (x) = Qε[ν] (τ )dτ ds, x0
781
(5)
x0
where ε[ν] (x) = y [ν] (x)−y(x). Taking the norm of the left-hand side and assuming ||ε[0] (x)|| ≤ K(x − x0 ), we have by induction ||ε[ν] (x)|| ≤ K
cν1 (x − x0 )2ν+1 , (2ν + 1)!
(6)
where we set c1 = ||Q||. The inequality shows that Picard iteration (2) converges on any finite interval x ∈ [x0 , T ] and the rate of convergence is superlinear. It is, however, expected that the method converges slowly if the system is stiff or the length of the interval is large. Here we propose an acceleration of the Picard iteration using the splitting Q = N − M . The iterative method to be considered is
y [ν+1] (x) + M y [ν+1] (x) = N y [ν] (x) + g(x),
(7)
where we assume that matrix M is symmetric and positive definite. The error of the iterate y [ν] (x) satisfies the differential equation
ε[ν+1] (x) + M ε[ν+1] (x) = N ε[ν] (x).
(8)
In order to obtain the explicit expression for ε[ν] (x), here we define the square root and the trigonometric functions of matrices. Let λk (k = 1, . . . , m) be the eigenvalues of M, and P is a unitary matrix that diagonalizes M , i.e. M = P diag (λ1 , . . . , λm ) P −1 , then using P the square root of M can be defined by √ M = P diag λ1 , . . . , λm P −1 . For any square matrix H and scalar x, the trigonometric functions can be defined by cos(Hx) =
∞ (−1)k H 2k k=0
(2k)!
2k
x ,
sin(Hx) =
∞ (−1)k H 2k+1 k=0
(2k + 1)!
x2k+1 .
Using the functions defined above we have ε[ν+1] (x) = x √ √ √ √ √ ( M )−1 cos( M τ ) sin( M x) − cos( M x) sin( M τ ) N ε[ν] (τ )dτ x0 x √ √ = ( M )−1 sin M (x − τ ) N ε[ν] (τ )dτ x0 x τ √ cos M (x − τ ) N ε[ν] (s)dsdτ, (9) = x0
x0
782
Kazufumi Ozawa and Susumu Yamada
d [ν+1] where the condition ε[ν+1] (x)|x=x0 = dx ε (x)|x=x0 = 0 is used. To bound [ν] ε (x) if we use the Euclidean norm then we have √ || cos( M x)|| ≤ ||P || diag cos( λ1 x), . . . , cos( λn x) ||P −1 || ≤ 1, (10)
since ||P || = ||P −1 || = 1. Assuming ||ε[0] (x)|| ≤ K(x − x0 ) as before and letting c2 = ||N ||, we find by induction the inequality ||ε[ν] (x)|| ≤ K
c2 ν (x − x0 )2ν+1 . (2ν + 1)!
(11)
The result shows that although the rate of convergence of splitting method (7) remains superlinear, we can accelerate the convergence by making the value of ||N || small.
3
Nonlinear Differential Equation
Next we discuss the rate of convergence of the waveform relaxation for the second order nonlinear equation y (x) = f (x, y),
y (x0 ) = y0 ,
y(x0 ) = y0 ,
y ∈ Rm .
(12)
For the first order nonlinear equation y = f (x, y), the method called the waveform Newton method is proposed[5]. The method is given by
y [ν+1] (x) − Jν y [ν+1] (x) = f (y [ν] (x)) − Jν y [ν] (x),
(13)
[ν]
where Jν is the Jacobian matrix of f (y ). It is shown by Burrage[1] that the waveform Newton method converges quadratically on all finite intervals x ∈ [x0 , T ]. In this section we first consider the convergence for the case that Jν is replaced with an approximation in order to enhance the efficiency of the parallel computation. Instead of (13) let us consider the iteration y [ν+1] (x) − J˜ν y [ν+1] (x) = f (y [ν] (x)) − J˜ν y [ν] (x),
where J˜ν is an approximation to Jν . Let ε
[ν]
f (y [ν] ) = f (y + ε[ν] ) = f (y) +
=y
[ν]
(14)
− y then we have
∂f [ν] ε + O(||ε[ν] ||2 ). ∂y
Substituting (15) into (14) and ignoring the term O(||ε[ν] ||2 ), we have ε[ν+1] = J˜ν ε[ν+1] + Jν − J˜ν ε[ν] ,
(15)
(16)
and taking the inner product we have
d [ν+1] [ν+1] ε ,ε = J˜ν ε[ν+1] , ε[ν+1] + (Jν − J˜ν )ε[ν] , ε[ν+1] dx ≤ µ1 ||ε[ν+1] ||2 + µ2 ||ε[ν+1] || ||ε[ν] ||,
(17)
Waveform Relaxation for Second Order Differential Equation
783
where the norm || · || is defined by ||u||2 :=< u, u >, and µ1 and µ2 are the subordinate matrix norms of J˜ν and Jν − J˜ν , respectively. The left-hand side of (17) can be written as d [ν+1] [ν+1] d 1 d [ν+1] 2 ε ||ε ,ε || = ||ε[ν+1] || ||ε[ν+1] ||, = dx 2 dx dx so that, under the assumption that ||ε[ν+1] || = 0, we have d [ν+1] ||ε || ≤ µ1 ||ε[ν+1] || + µ2 ||ε[ν] ||. dx
(18)
Assuming ||ε[0] (x)|| ≤ K(x − x0 ) as before and using ||ε[0] (x0 )|| = 0, we have x e−µ1 (s−x0 ) ||ε[ν] (s)|| ds ||ε[ν+1] (x)|| ≤ µ2 eµ1 (x−x0 ) x0 x ≤ µ2 eµ1 (x−x0 ) ||ε[ν] (s)|| ds, (19) x0
which leads to
(µ2 eµ1 (x−x0 ) )ν (x − x0 )ν+1 , (20) (ν + 1)! showing that the rate of convergence of iteration (14) is not quadratic but superlinear. By the way, the second order nonlinear equation ||ε[ν] (x)|| ≤ K
y (x) = f (x, y),
y(x0 ) = y0 ,
y (x0 ) = y0 ,
y ∈ Rm
(21)
can be rewritten as the first order system z (x) = (y (x), f (x, y))T ≡ F (z(x)),
(22)
where z(x) = (y(x), y (x))T . The waveform Newton method for (22) is given by
z [ν+1] (x) − Jν z [ν+1] (x) = F (z [ν] (x)) − Jν z [ν] (x). In this case the Jacobian matrix Jν is given by ∂y ∂y 0 I ∂F ∂y ∂y = ∂f = Jν = , ∂f ∂f ∂z ∂y 0 ∂y ∂y
(23)
(24)
where I is an identity matrix. This result shows that for second order equation (12), if we define the waveform Newton method analogously by ∂f [ν+1] ∂f [ν] d2 [ν+1] y y (x), y (x) − (x) = f (y [ν] (x)) − 2 dx ∂y ∂y
(25)
then the method also converges quadratically. Moreover, from the same discussion as for the first order equation, we can conclude that if an approximate Jacobian J˜ν is applied, then the rate of convergence of the iteration d2 [ν+1] y (x) − J˜ν y [ν+1] (x) = f (y [ν] (x)) − J˜ν y [ν] (x) dx2 is superlinear.
(26)
784
4
Kazufumi Ozawa and Susumu Yamada
Numerical Experiments
4.1
Linear Equations
In order to examine the efficiency of the waveform relaxations, we solve first the large linear system of second order differential equations. Consider the wave equation ∂2u ∂ 2u = , 0 ≤ x ≤ 1, 2 ∂t ∂x2 (27) u(x, 0) = sin(πx), u(0, t) = u(1, t) = 0, where the exact solution is u(x, t) = cos(πt) sin(πx). The semi-discretization by 3-point spatial difference yields the system of linear equations y (t) = Qy(t), where
T
y(t) = (u1 , . . . , um ) ,
(28)
−2 1 1 −2 1 1 1 .. .. .. . Q= ∈ Rm×m , ∆x = . . . (∆x)2 m+1 1 −2 1 1 −2
0
0
In the splitting method given here we take M as the block diagonal matrix given by M = diag(M1 , M2 , . . . , Mµ ), and each block Ml is the tridiagonal matrix given by 2 −1 −1 2 −1 1 .. .. .. Ml = ∈ Rd×d , l = 1, 2, . . . , µ, µ = m/d, . . . (∆x)2 −1 2 −1 −1 2
0
0
where we have assumed that m is divisible by d and, as a result, µ = m/d is an integer. In the implementation, the system is decoupled into µ subsystems and each of the subsystems is integrated concurrently on different processors, if the number of processors available are greater than that of the number of the subsystems. If this is not the case each processor integrates several subsystems sequentially. In any way our code is designed so as to have a uniform work load across the processors. The basic method to integrate the system is the 2-stage Runge-Kutta-Nystr¨ om method called the indirect collocation, which has an excellent stability property[3]. In our experiment we set m = 256 and integrate the system from x = 0 to 1 with the stepsize h = 0.1 under the stopping criteria ||y [ν] −y [ν−1]|| ≤ 10−7 . The result on the parallel computer KSR1, which is a parallel computer with shared address space and distributed physical memory(see e.g. [6]), is shown in Table 1.
Waveform Relaxation for Second Order Differential Equation
785
Table 1. Result for the linear problem on KSR1 CPU time(sec) d iterations serial parallel Sp Sp E E P µ = m/d 1 11579 566.02 61.92 9.14 2.58 0.57 0.16 16 256 2 6012 586.73 53.79 10.91 2.97 0.68 0.19 16 128 4 3052 362.94 31.44 11.54 5.08 0.72 0.32 16 64 8 1462 240.81 19.56 12.31 8.17 0.76 0.51 16 32 16 688 181.74 13.94 13.04 11.5 0.81 0.72 16 16 32 427 201.56 27.11 7.43 5.89 0.93 0.73 8 8 64 403 369.86 97.27 3.80 1.64 0.95 0.41 4 4 128 403 791.10 407.41 1.94 0.39 0.97 0.20 2 2 256 1 159.76 – – 1.00 – – 1 1 CPU time(serial) CPU time(serial, d = 256) Sp = , Sp = , CPU time(parallel) CPU time(parallel) Sp Sp , E = , P =number of processors E= P P
4.2
Nonlinear Equations
Next we consider the nonlinear wave equation given by vi = exp(vi−1 (x)) − 2 exp(vi (x)) + exp(vi+1 (x)),
i = 0, ±1, ±2, . . . , . (29)
This equation describes the behaviour of the well-known Toda lattice and has the soliton solution given by vi (x) = log(1 + sinh2 τ sech2 (iτ − w(x + q))),
i = 0, ±1, ±2, . . . ,
w = sinh τ, where q is an arbitrary constant. The equation to be solved numerically is not (29) but its m-dimensional approximation given by yi = exp(yi−1 ) − 2 exp(yi ) + exp(yi+1 ), i = 2, . . . , m − 1 y = 1 − 2 exp(y1 ) + exp(y2 ), (30) 1 ym = exp(ym−1 ) − 2 exp(ym ) + 1 initial conditions :
yi (0) = vi (0), yi (0) = vi (0),
where we have to choose a sufficiently large m to lessen the effect due to the finiteness of the boundary. As the parameters we take q = 40, τ = 0.5 and w = sinh 0.5. The approximate Jacobian J˜ν used here is the block diagonal matrix given by Jν (i, j), d(l − 1) + 1 ≤ i, j ≤ dl, l = 1, . . . , m/d J˜ν (i, j) = 0, otherwise,
786
Kazufumi Ozawa and Susumu Yamada
where we have assumed m is divisible by d as before. In our numerical experiment we use the KSR1 parallel computer, and integrate the 1024-dimensional system from x = 0 to 5 with stepsize h = 0.05 under the stopping criteria ||y [ν] − y [ν−1] || ≤ 10−10 . The result is shown in Table 2. In our algorithms, if the value of d becomes small then the degree of paral-
Table 2. Results for the nonlinear problem on KSR1
d 1 2 4 8 16 32 64 128 256 512 1024 Sp =
CPU time(sec) serial parallel Sp Sp E E P µ = m/d 17 53.98 6.10 8.85 3.49 0.55 0.22 16 1024 15 61.44 6.43 9.55 3.31 0.60 0.21 16 512 14 58.14 6.24 9.32 3.41 0.58 0.21 16 256 14 60.19 6.37 9.46 3.34 0.59 0.21 16 128 13 54.33 5.87 9.25 3.63 0.58 0.23 16 64 12 50.81 5.71 8.91 3.73 0.56 0.23 16 32 6 25.74 3.05 8.43 6.98 0.53 0.44 16 16 6 26.79 4.78 5.60 4.46 0.70 0.56 8 8 4 20.33 6.43 3.16 3.31 0.79 0.83 4 4 4 21.03 11.84 1.78 1.71 0.89 0.86 2 2 4 21.30 — — 1.00 – – 1 1 CPU time(serial) CPU time(serial, d = 1024) , Sp = CPU time(parallel) CPU time(parallel) P =number of processors Sp Sp E= , E = P P
iterations
lelism becomes large, since the number of the subsystems µ(= m/d) becomes large and, moreover, the cost of the LU decomposition that takes place in the inner iteration in each of the Runge-Kutta-Nystr¨ om schemes, which is proportional to d3 , becomes small. However, as is shown in the tables, it is in general true that the smaller the value of d, the larger the number of the WR iterations, which means the increase of the overheads such as synchronisation and communications. Therefore, small d implementation is not necessary advantageous. In our experiments we can achieve the best speedups when the number of subsystems is equal to that of processors.
References 1. Burrage, K., Parallel and Sequential Methods for Ordinary Differential Equations, Oxford University Press, Oxford, 1995. 782
Waveform Relaxation for Second Order Differential Equation
787
2. Hairer, E., Nørsett S. P., and Wanner, G., Solving Ordinary Differential Equations I (Nonstiff Problems, 2nd Ed.), Springer-Verlag, Berlin, 1993. 3. van der Houwen, P. J., Sommeijer, B.P. Nguyen Huu Cong, Stability of collocationbased Runge-Kutta-Nystr¨ om methods, BIT 31(1991), 469-481. 784 4. Miekkala, U. and Nevanlinna, O., Convergence of dynamic iteration methods for initial value problems, SIAM J. Sci. Comput. 8(1987), pp.459-482. 5. White, J. and Sangiovanni-vincentelli, A.L., Relaxation Techniques for the Simulation of VLSI Circuits, Kluwer Academic Publishers, Boston, 1987. 782 6. Papadimitriou P., The KSR1 — A Numerical analyst’s perspective, Numerical Analysis Report No.242, University of Mancheter/UMIST. 784
The Parallelization of the Incomplete LU Factorization on AP1000 Takashi Nodera and Naoto Tsuno Department of Mathematics Keio University 3-14-1 Hiyoshi Kohoku Yokohama 223, Japan
Abstract. Using a finite difference method to discretize a two dimensional elliptic boundary value problem, we obtain systems of linear equations Ax = b, where the coefficient matrix A is a large, sparse, and nonsingular. These systems are often solved by preconditioned iterative methods. This paper presents a data distribution and a communication scheme for the parallelization of the preconditioner based on the incomplete LU factorization. At last, parallel performance tests of the preconditioner, using BiCGStab() and GMRES(m) method, are carried out on a distributed memory parallel machine AP1000. The numerical results show that the preconditioner based on the incomplete LU factorization can be used even for MIMD parallel machines.
1
Introduction
We are concerned with the solution of linear systems of equations Ax = b
(1)
which arises from the discretization of partial differential equations. For example, a model problem is the two dimensional elliptic partial differential equation: − uxx − uyy + σ(x, y)ux + γ(x, y)uy = f (x, y)
(2)
which defined on the unit square Ω with Dirichlet boundary conditions u(x, y) = g(x, y) on ∂Ω. We require σ(x, y) and γ(x, y) to be a bounded and sufficiently smooth function taking on strictly positive values. We shall use a finite difference approximation to discretize the equation (2) on a uniform grid points (MESH × MESH) and allow a five point finite difference molecule A. The idea of ˜U ˜ − R, where incomplete LU factorization [1,4,7] is to use A = L
A=
AW i,j
AN i,j C E Ai,j Ai,j ASi,j
˜= L
LW i,j
0 LC 0 , i,j S Li,j
N Ui,j C E ˜ U = 0 Ui,j Ui,j 0
(3)
˜ and U ˜ are lower and upper triangular matrices, respectively. The inwhere L complete LU factorization preconditioner which is efficient and effective for the D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 788–792, 1998. c Springer-Verlag Berlin Heidelberg 1998
The Parallelization of the Incomplete LU Factorization on AP1000
Ω E Ui−1,j
✲LC i,j
✻
Ω
PU(N )
.. .
.. .
PU(N ) : PU(1) PU(N ) : PU(1)
PU(2)
N Ui,j−1
PU(1)
Fig. 1. The data dependency among neighborhoods.
Fig. 2. The data distribution for N processors.
789
Fig. 3. The cyclic data distribution for N processors.
single CPU machine, may not be appropriate for the distributed memory parallel machine just like AP1000. In this paper, we describe some recent work on the efficient implementation of preconditioning for a MIMD parallel machine. In section 2, we discuss the parallelization of our algorithm, and then we describe the implementation of incomplete LU factorization on the MIMD parallel machine. In section 3, we present some numerical experiments in which BiCGStab() algorithm and GMRES(m) algorithm are used to solve a two dimensional self-adjoint elliptic boundary value problem.
2
Parallelization
In section 1, as we presented the incomplete LU factorization, it involves in two ˜ and U ˜ . The another is the inversion parts. One is the decomposition of A into L ˜ ˜ of L and U by forward and backward substitution. First of all, we consider the decomposition process. The data dependency of this factorization is given in C C Fig. 1. Namely, the entry of LC 1,j can be calculated as soon as Li−1,j and Li,j−1 C C have been calculated on the grid points. The L1,j only depends on L1,j−1 on the C west boundary and LC i,1 only depends on Li−1,1 on the south boundary. Next, we ˜ −1 v and wi,j = U ˜ −1 v consider the inversion process. The calculation of wi,j = L is given by forward and backward substitution as follows. C wi,j = (vi,j − LSi,j wi,j−1 − LW i,j wi−1,j )/Li,j
wi,j = vi,j −
N Ui,j wi,j+1
−
E Ui,j wi+1,j
(4) (5)
We now consider brief introduction to the method of parallelization by Bastian and Horton [4]. They proposed the algorithm of dividing the domain Ω, as shown in Figure 2, when we do the parallel processing by N processors such as PU(1), . . . , PU(N ). Namely, PU(1) starts with its portion of the first low of grid points. After end, a signal is sent to PU(2), and then it starts to process its
790
Takashi Nodera and Naoto Tsuno
Ω
Ω
(a) CYCLIC=1
(b) CYCLIC=2
Low
High
Fig. 4. The efficiency of parallel computation in the grid points on Ω. section of the first row. At the time, PU(1) has started with the calculation of its component of the second row. All processors are running when PU(N ) begins with the calculation of its component of the first row. 2.1
Generalization of the Algorithm by Bastian and Horton
As we can show in Fig. 3, we divide the domain Ω by N processors cyclically. Here, we denote CYCLIC which denotes the number of cyclic division of the domain. The parallelization of Fig. 2 is equal to CYCLIC=1. If it divides as shown in the Fig. 3, the computation of the k-th processor PU(k) can be begun more quickly than the case of Bastian and Horton. Therefore, in this case, the efficiency of parallel processing will be higher than the ordinary one. In Fig. 4, we show the efficiency of parallel computation in the square grid points for CYCLIC= 1, 2, respectively. In these figures, the efficiency of parallel processing is shown a deeper color of the grids is low. The efficiency E of the parallel processing by this division is calculated as following equation 6. The detailed derivation of this equation is given in the forthcoming our paper [10,11]. E=
CYCLIC × MESH CYCLIC × MESH + N − 1
(6)
On the other hand, for distributed memory parallel machine, the number of CYCLIC is increased, the communication between processors is also increased. Moreover, concealment of the communication time by calculation time also becomes difficult. Therefore, we will consider the trade-off between the efficiency of parallel processing and the overhead of communication.
3
Numerical Experiences
In this section we demonstrate the effectiveness of our proposed implementation of ILU factorization on a distribute memory machine Fujitsu AP1000 for some
The Parallelization of the Incomplete LU Factorization on AP1000
791
Table 1. AP1000 specification Architecture Distributed Memory, MIMD Number of processors 64 Inter processor networks Broadcast network(50MB/s) Two-dimensional torus network (25MB/s/port) Synclronization network
Table 2. The computational time (sec.) for the multiplica˜ U) ˜ −1 and vector v on example 1. tion of matrix A(L Mesh Size CYCLIC 1 128×128 2 1 256×256 2 4 1 2 512×512 4 8
˜U ˜ )−1 v (L 2.52 × 10−2 5.86 × 10−2 5.44 × 10−2 1.41 × 10−1 2.12 × 10−1 1.56 × 10−1 3.31 × 10−1 5.05 × 10−1 8.26 × 10−1
Av 2.88 × 10−3 6.08 × 10−3 9.18 × 10−3 1.41 × 10−2 2.68 × 10−2 4.16 × 10−2 5.05 × 10−2 6.70 × 10−2 1.21 × 10−1
Total 2.81 × 10−2 6.47 × 10−2 6.36 × 10−2 1.55 × 10−1 2.38 × 10−1 1.98 × 10−1 3.81 × 10−1 5.72 × 10−1 9.47 × 10−1
large sparse matrices problems. The Specification of AP1000 is given in Table 1. Each cell of AP1000 employs RISC-type SPARC or SuperSPARC processor chip. ˜ U) ˜ −1 v for various CYCLIC, As Table 2 shows, the computational time for A(L which obtained by the following example 1. From these results, it is considered suitable by the AP1000 to decrease data volume of the communication between processors, using CYCLIC= 1. So, we use 1 as the number of CYCLIC in the following numerical experiments. [Example 1.] This example involves discretizations of boundary value problem for the partial differential equation (Joubert [5]). −uxx − uyy + σux = f (x, y) u(x, y)|∂Ω = 1 + xy where the right-hand side is taken such that the true solution is u(x, y) = 1 + xy on Ω. We discretize above equation 7 on a 256 × 256 mesh of equally spaced discretization points. By varying the constant σ, the amount of nonsymmetricity of the matrix may be changed. We utilize the initial approximation vector of x0 = 0 and use simple stopping criterion rk /r0 ≤ 10−12 . We also used double precision real arithmetic to perform the runs presented here. In Table 3, we show the computational times required to reduce the residual norm by a factor of 10−12 for an average over 3 trials of preconditioned or un-
792
Takashi Nodera and Naoto Tsuno
Table 3. The computational time (sec.) for example 1.
0 2−3 GMRES(5) ∗ ∗ ∗ 48.21 GMRES(5)+ILU ∗ 117.04 GMRES(10) GMRES(10)+ILU 296.18 36.25 ∗ 109.68 GMRES(20) GMRES(20)+ILU 197.47 60.54 BiCGStab(1) 30.11 23.83 BiCGStab(1)+ILU 30.34 21.28 35.16 26.78 BiCGStab(2) BiCGStab(2)+ILU 31.86 22.87 47.28 33.40 BiCGStab(4) BiCGStab(4)+ILU 34.38 24.56 ∗ It does not converge after 3000 Algorithm
2−2 2−1 63.25 31.74 25.74 27.29 50.52 47.75 39.58 40.95 89.45 88.79 76.01 70.29 20.34 20.66 17.82 17.25 24.14 24.67 19.67 21.27 30.93 32.86 22.47 23.17 iterations.
σh 20 21 33.51 32.15 22.91 18.10 50.41 50.89 37.02 25.45 94.16 92.41 54.82 36.51 23.18 23.04 15.09 10.91 26.38 29.65 16.47 10.37 39.23 39.51 18.99 12.00
22 32.98 12.24 50.75 16.08 92.00 29.32 48.42 8.46 30.07 7.47 36.47 8.49
23 32.82 9.23 47.82 11.94 90.39 15.55 101.04 6.73 25.64 6.51 36.75 7.79
24 36.68 7.07 44.42 8.31 84.65 9.83 ∗ 4.86 24.69 5.23 32.33 5.69
25 43.91 6.13 43.54 6.63 79.01 5.89 ∗ 3.13 24.90 3.63 30.96 4.30
preconditioned BiCGStab() and GMRES(m) algorithm. For the preconditioner of incomplete LU factorization, in most cases both methods worked quite well.
References 1. Meijerink, J. A. and van der Vorst, H. A.: An iterative solution method for linear systems of which the coefficient matrix is symmetric M -matrix, Math. Comp., Vol. 31, pp. 148-162 (1977). 788 2. Gustafsson, I.: A class of first order factorizations, BIT Vol. 18, pp. 142-156 (1978). 3. Saad, Y. and Schultz, M. H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., Vol. 7, No. 3, pp. 856–869 (1986). 4. Bastian, P. and Horton, G.: Parallelization of robust multigrid methods: ILU factorization and frequency decomposition method, SIAM J. Sci. Stat. Comput., Vol. 12, No. 6, pp. 1457–1470 (1991). 788, 789 5. Joubert, W.: Lanczos methods for the solution of nonsymmetric systems of linear equations, SIAM J. Matrix Anal. Appl., Vol. 13, No. 3, pp. 926–943 (1992). 791 6. Sleijpen, G. L. G. and Fokkema, D. R.: BiCGSTAB() for linear equations involving unsymmetric matrices with complex spectrum, ETNA, Vol. 1, pp. 11–32 (1993). 7. Bruaset, A. M.: A survey of preconditioned iterative methods, Pitman Research Notes in Math. Series 328, Longman Scientific & Technical (1995). 788 8. Nodera, T. and Noguchi, Y.: Effectiveness of BiCGStab() method on AP1000, Trans. of IPSJ (in Japanese), Vol. 38, No. 11, pp. 2089-2101 (1997). 9. Nodera, T. and Noguchi, Y.: A note on BiCGStab() Method on AP1000, IMACS Lecture Notes on Computer Science, to appear (1998). 10. Tsuno, N.: The automatic restarted GMRES method and the parallelization of the incomplete LU factorization, Master Thesis on Graduate School of Science and Technology, Keio University (1998). 790 11. Tsuno, N. and Nodera, T.: The parallelization and performance of the incomplete LU factorization on AP1000, Trans. of IPSJ (in Japanese), submitted. 790
An Efficient Parallel Triangular Inversion by Gauss Elimination with Sweeping Ay¸se Kiper Department of Computer Engineering, Middle East Technical University, 06531 Ankara Turkey [email protected]
Abstract. A parallel computation model to invert a lower triangular matrix using Gauss elimination with sweeping technique is presented. Performance characteristics that we obtain are O(n) time and O(n2 ) processors leading to an efficiency of O(1/n). A comparative performance study with the available fastest parallel matrix inversion algorithms is given. We believe that the method presented here is superior over the existing methods in efficiency measure and in processor complexity.
1
Introduction
Matrix operations have naturally been the focus of researches in numerical parallel computing. One fundamental operation is the inversion of matrices and that is mostly accompanied with the solution of linear system of equations in the literature of parallel algorithms. But it has equally well significance individually and in matrix eigenproblems. Various parallel matrix inversion methods have been known and some gave special attention to triangular types [1,2,3,5]. The time complexity of these algorithms is O(log2 n) using O(n3 ) processors on the parallel random accesss machine (PRAM). No faster algorithm is known so far although the best lower bound proved for parallel triangular matrix inversion is Ω(log n). In this paper we describe a parallel computation model to invert a dense triangular matrix using Gauss elimination (although it is known [2] that Gauss elimination is not the fastest process) with sweeping strategy which allows maximum possible parallelisation. We believe that the considerable reduction in the processors used results in an algorithm that is more efficient than the available methods.
2
Parallel Gauss Elimination with Sweeping
A lower triangular matrix L of size n and its inverse L−1 = B satisfy the equation LB = I. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 793–797, 1998. c Springer-Verlag Berlin Heidelberg 1998
(1)
794
Ay¸se Kiper
The parallel algorithm that we will introduce is based on the application of the standard Gauss elimination on the augmented matrix LI . (2) Elements of the inverse matrix B are determined by starting from the main diagonal and progressing in the subdiagonals of decreasing size. The Gauss elimination by sweeping (hereafter referred to as GES) has two main type of sweep stages: (i) Diagonal Sweeps (DS) : evaluates all elements of each subdiagonal in parallel in one division step. (ii) Triangular Sweeps (TS) : updates all elements of each subtriangular part in parallel in two steps (one multiplication and one subtraction). We can describe GES as follows: Stage 1. Initial Diagonal Sweep (IDS) : the main diagonal elements of B are all evaluated in terms of lij (are the elements of the matrix L ) in parallel by αii =
1 , lii
i = 1, 2, . . . , n.
(3)
Stage 2. Initial Triangular Sweep (ITS) : updates the elements of the lower triangular part below the main diagonal in columnwise fashion αij =
−lij , ljj
i = 2, 3, . . . , n; j = 1, 2, . . . , n − 1
(4)
in one parallel step being different than the other regular triangular sweep steps. The IDS and ITS lead to the updated matix α11 α21 α22 0 . . BIS = . (5) . . αnn αn1 Then the regular DS and TS are applied alternately (denoting their each application by a prime over the elements of the current matrix) until B is obtained as α11 α21 α22 0 α31 α32 α33 . B= (6) . α . . 42 . . . (n−1) . . αn,n−2 αn,n−1 αnn αn1 If k denotes the application number of the sweeps in DSk and TSk , the stage computations can be generalised as:
An Efficient Parallel Triangular Inversion by Gauss Elimination
795
Stage (2k + 1). DSk : (k + 1)th subdiagonal elements of B are computed using (k−1)
(k)
αk+i,i =
αk+i,i
lk+i,k+i
,
i = 1, 2, . . . , n − k
leading to the intermediate matrix α11 α21 α22 . α32 . . . . (k) α . . BDSk = k+1,1 (k−1) αk+2,1 . . . . . . . . (k−1) (k−1) (k) αn,1 . . αn,n−k−1 αn,n−k
(7)
0
. .. . . . . αn,n−1 αnn
.
(8)
Stage (2k + 2). TSk : The constants are evaluated in the first step by (k)
(k)
Ck+p,i = αi+1,i lk+p,i+1 ,
p = i + 1, i + 2, . . . , n − k; i = 1, 2, . . . , n − (k + 1) (9) After the constant evaluations are completed, new updated values of elements below the kth subdiagonal are computed columnwise using (k)
(k−1)
(k)
αk+p,i = αk+p,i − Ck+p,i , p = i + 1, i + 2, . . . , n − k; i = 1, 2, . . . , n − (k + 1) (10) in parallel giving matrix α11 α21 . α . 0 32 . . . (k) α . . BT Sk = k+1,1 (11) . (k) αk+2,1 . . . . . . .. . . . . . (k) (k) (k) αn,1 . . αn,n−k−1 αn,n−k . . αn,n−1 αnn GES algorithm totally requires (n − 2) diagonal, (n − 2) triangular sweeps (k = 1, 2, ..., n−2, in (7) and (10)) in addition to initial diagonal, initial triangular and final diagonal sweeps (FDS) (k = n − 1, in (7)) .
3
Performance Characteristics and Discussion
The total number of parallel arithmetic operation steps Tp is obtained as Tp = 3(n − 1)
(12)
796
Ay¸se Kiper
with the maximun number of processors used pmax =
n(n − 1) . 2
(13)
The speedup Sp = Ts /Tp with respect to the the best sequential algorithm [4] for which Ts = n2 time is obtained as Sp =
n2 3(n − 1)
(14)
and the associated efficiency Ep = Sp /pmax is Ep =
2n . 3(n − 1)2
(15)
All efficient parallel algorithms [1,2,3] of triangular matrix inversion run in O(log2 n) time using O(n3 ) processors and are designed for PRAM system. So we find it appropriate to compare the theoretical performance of GES algorithm with the complexity bounds stated by Sameh and Brent [3] which are the known model reference expressions. We select the same computational model PRAM in the discussions to keep the common comparison base. The speedup and efficiency expressions (denoted by a superscript ∗ to differentiate from those of GES algorithm) for the algorithm of Sameh and Brent [3] over the best sequential time are 2n2 (16) Sp∗ = (log2 n + 3 log n + 6) and Ep∗ =
256 . (log n + 3 log n + 6)(21n + 60) 2
(17)
The best comparative study can be done by observing the variation of speedup RS = Sp /Sp∗ and efficiency RE = Ep /Ep∗ ratios with respect to the problem size. These ratios expilicitely are log 2 n + 3 log n + 6 6(n − 1)2
(18)
n(21n + 60)(log2 n + 3 log n + 6) . 384(n − 1)2
(19)
RS = and RE =
The efficiency ratio curve (Figure 1) demonstrates a superior behaviour of GES algorithm compared to that of Sameh and Brent for n > 8. Even for the worst case (occurs when n ∼ = 8 ) efficiency of GES exceeds the value of that for Sameh and Brent more than twice. This is the natural result of considerable smaller number of processor needed in GES. As the speedup ratio concerned the GES algrithm is slower than that given by Sameh and Brent and this property is dominant particularly for small size problems.
An Efficient Parallel Triangular Inversion by Gauss Elimination
797
6
Efficiency Ratio
5 4 3 2 1 0 2
4
8
16 32 log n
64
128
256
Fig. 1. Efficiency Ratio
The implementation studies of GES on different parallel architectures is open for future work. The structure of the algorithm suggests an easy application on array processors with low (modest) communication cost that meets the PRAM complexity bounds of the problem with a small constant.
References 1. L. Csansky, On the parallel complexity of some computational problems, Ph.D. Dissertation, Univ. of California at Berkeley, 1974. 793, 796 2. L. Csansky, Fast parallel matrix inversion algorithms, SIAM J.Comput., 5 (1976) 618-623. 793, 796 3. A. H. Sameh and R. P. Brent, Solving triangular systems on parallel computer, SIAM J. Numer. Anal., 14 (1977) 1101-1113. 793, 796 4. U. Schendel, Introduction to Numerical Methods for Parallel Computers (Ellis Harwood, 1984). 796 5. A. Schikarski and D. Wagner, Efficient parallel matrix inversion on interconnection networks, J. Parallel and Distributed Computing, 34 (1996) 196-201. 793
Fault Tolerant QR-Decomposition Algorithm and its Parallel Implementation Oleg Maslennikow1, Juri Kaniewski1 , and Roman Wyrzykowski2 1
2
Dept. of Electronics, Technical University of Koszalin, Partyzantow 17, 75-411 Koszalin, Poland Dept. of Math. & Comp. Sci, Czestochowa Technical University, Dabrowskiego 73, 42-200 Czestochowa, Poland
Abstract. A fault tolerant algorithm based on Givens rotations and a modified weighted checksum method is proposed for the QR-decomposition of matrices. The algorithm enables us to correct a single error in each row or column of an input M × N matrix A occurred at any among N steps of the algorithm. This effect is obtained at the cost of 2.5N 2 +O(N ) multiply-add operations (M = N ). A parallel version of the proposed algorithm is designed, dedicated for a fixed-size linear processor array with fully local communications and low I/O requirements.
1
Introduction
The high complexity of most of matrix problems [1] implies the necessity of solving them on high performance computers and, in particular, on VLSI processor arrays [3]. Application areas of these computers demand a large degree of reliability of results. while a single failure may render computations useless. Hence fault tolerance should be provided on hardware or/and software levels [4]. The algorithm-based fault tolerance (ABFT) methods [5,6,7,8,9] are very suitable for such systems. In this case, input data are encoded using error detecting or/and correcting codes. An original algorithm is modified to operate on encoded data and produce encoded outputs, from which useful information can be recovered easily. The modified algorithm will take more time in comparison with the original one. This time overhead should not be excessive. An ABFT method called the weighted checksum (WCS) one, especially tailored for matrix algorithms and processor arrays, was proposed in Ref. [5]. However, the original WCS method is little suitable for such algorithms as Gaussian elimination, Choleski, and Faddeev algorithms, etc., since a single transient fault in a module may cause multiple output errors, which can not be located. In Refs. [8,9], we proposed improved ABFT versions of these algorithms. For such important matrix problems as least squares problems, singular value and eigenvalue decompositions, more complicated algorithms based on the QRdecomposition should be applied [2]. In this paper, we design a fault tolerant version of the QR-decomposition based on Givens rotations and a modified W CS method. The derived algorithm enables correcting a single error in each row or D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 798–803, 1998. c Springer-Verlag Berlin Heidelberg 1998
Fault Tolerant QR-Decomposition Algorithm
799
column of an input M × N matrix A occurred at any among N steps of the algorithm. This effect is obtained at the cost of 2.5N 2 + O(N ) multiply-add operations. A parallel version of the algorithm is designed, dedicated for a fixedsize linear processor array with local communications and low I/O requirements.
2
Fault Model and Weighted Checksum Method
Module-level faults are assumed [6] in algorithm-based fault tolerance. A module is allowed to produce arbitrary logical errors under physical failure mechanism. This assumption is quite general since it does not assume any technology-dependent fault model. Without loss of generality, a single module error is assumed in this paper. Communication links are supposed to be fault-free. In the W CS method [6], redundancy is encoded at the matrix level by augmenting the original matrix with weighted checksums. Since the checksum property is preserved for various matrix operations, these checksums are able to detect and correct errors in the resultant matrix. The complexity of correction process is much smaller than that of the original computation. For example, a W CS encoded data vector a with the Hamming distance equal to three (which can correct a single error) is expressed as aT = [a1 P CS = pT [a1
a2
...
a 2 . . . aN aN ] ,
P CS
QCS]
QCS = qT [a1
(1) a2
...
aN ]
Possible choices for encoder vectors p, q are, for example, [10] : qT = [1 2 . . . N ] pT = 20 21 . . . 2N −1 ,
(2)
(3)
For the floating-point implementation, numerical properties of single-error correction codes based on different encoder vectors were considered in Refs. [7,11]. Based on an encoder vector, a matrix A can be encoded as either a row encoded matrix AR , a column encoded matrix AC , or a full encoded matrix ARC [11]. For example, for the matrix multiplication A B = D, the column encoded matrix AC is exploited [6]. Then choosing the linear weighted vector (3), the equation AC ∗ B = DC is computed. To have the possibility of verifying computations and correcting a single error, syndromes S1 and S2 for the j-th column of D-matrix should be calculated, where S1 =
M i=1
3
cij − P CSj ,
S2 =
M
i ∗ cij − QCSj
(4)
i=1
Design of the ABFT QR-Decomposition Algorithm
The complexity of the Givens algorithm [2] is determined by 4N 3 /3 multiplications and 2N 3 /3 additions, for a real N × N matrix A. Based on equivalent matrix transformations, this algorithm preserves the Euclidean norm for columns of A during computations. This property is important for error detection and enable us to save computations.
800
Oleg Maslennikow et al.
In the course of the Givens algorithm, an M × N input matrix A = A1 = {aij } is recursively modified in K steps to obtain the upper triangular matrix R = AK+1 , where K = M −1 for M ≤ N , and K = N for M > N . The i-th step consists in eliminating elements aiji in the i-th column of Ai by multiplications on rotation matrices Pji , j = i + 1, . . . , M , which correspond to rotation coefficients j−1 2 i i )2 , i 2 2 / (a ) + (a s = a / (aj−1 cji = aj−1 ji ji ji ii ii ii ) + (aji ) Each step of the algorithm includes two phases. The first phase consists in recomputing M − i times the first element aii of the pivot (i.e. i-th) row, and computing the rotation coefficients. The second phase includes computation of ai+1 jk , and resulting elements rik in the i-th row of R. This phase includes also recomputing M − i times the rest of elements in the pivot row. Consequently, if during the i-th step, i = 1, . . . , K, an element ajii is wrongly calculated, then errors firstly appear in coefficients cji and sji , j = i + 1, . . . , M , and then in all the elements of Ai+1 . Moreover, if at the i-th step, any coefficient cji or sji is wrongly calculated, then errors firstly appear in all the elements of the pivot row, and then in all the elements of Ai+1 . All these errors can not be located and corrected by the original W CS method. To remove these drawbacks, the following lemmas are proved. It is assumed that a single transient error may appear at each row or column of Ai at any step of the algorithm. Lemma 1. If at the i-th step, an element ai+1 jk (i < j, i < k) is wrongly calculated, then errors will not appear among other elements of Ai+1 . However, if aijk is erroneous, then error appears while computing either the element aj+1 jk in the pivot row at the j-th step of the algorithm, for j ≤ k, or values of ajkk , cjk and sjk (j = i + 1, . . . , M ) at the k-th step, for j > k. Hence in these cases, we should check and possibly correct elements of the i-th and j-th rows of Ai , each time after their recomputing. Lemma 2. Let an element ajik of the pivot row (j = i + 1, . . . , M ; k = 1, . . . , N ) or an element ai+1 jk of a non-pivot row (k = i + 1, . . . , N ) was wrongly calculated when executing phase 2 of the i-th step of the Givens algorithm. Then it is possible to correct its value while executing this phase, using the W CS method for the row encoded matrix AR , where AR = [A Ap Aq] = [A PCS QCS] +
i+1 i+1 i i = ai+1 j,i+1 + aj,i+2 + . . . aj,N = (−sji ai,i+1 + cji aj,i+1 ) + . . . (−sji aii,N + cji aij,N ) = −sji (aii,i+1 + aii,N ) + cji (aij,i+1 + aij,N )
(5)
P CSji+1
(6)
So before executing phase 2 of the i-th step we should be certain that cji , sji and ajii were calculated correctly at phase 1 of this step (we assume that the remaining elements were checked and corrected at the previous step). For this aim, the following properties of the Givens algorithm may be used: – (cji )2 +(sji )2 = 1 (7) – preserving the Euclidean norm for column of A during computation M (aii,i )2 + (aii+1,i )2 + . . . + (aiM,i )2 = aM i,i (where aii = rii )
(8)
Fault Tolerant QR-Decomposition Algorithm
801
The triple time redundancy (TTR) method [4] may also be used for its modest time overhead. In this case, values of cji , sji or/and ajii are calculated three times. Hence, for cji and sji , the procedure of error detection and correction consists in computing the left part of expression (7), and its recomputing if equality (7) is not fulfilled, taking into account a given tolerance τ [6]. For elements ajii , i = 1, . . . , K, this procedure consists in computing the both parts of expression (8), and their recomputing if they are not equal. The correctness of this procedure is based on the assumption that only one transient error may appear at each row or column of Ai at any step. Moreover, instead of using the correction procedure based on formulae (4), we recompute all elements in the row with an erroneous element detected. The resulting ABFT Givens algorithm is as follows: 1. The original matrix A = {ajk } is represented as the row encoded matrix AR = {a1jk } with a1jk = ajk , a1j,N +1 = P CSj1 , for j = 1, . . . , M ; k = 1, . . . , N . 2. For i = 1, 2, . . . , K, stages 3-9 are repeated.
i 2 2 3. The values of ajii = (ai−1 ii ) + (aji ) , are calculated, j = i + 1, . . . , M . 4. The norm | ai | for the i-th column of A is calculated. This stage needs approximately M −i multiply-add operations. The value of | ai | is compared M with the value of aM ii . If aii =| ai |, then stages 3,4 are repeated. 5. The coefficients cji and sji are computed, and correctness of equation (7) is checked, for j = i + 1, . . . , M . In case of non-equality, stage 5 is repeated. 6. For j = i + 1, . . . , M , stages 7-10 are repeated. 7. The elements ajik of the i-th row of Ai+1 are computed, k = 1, . . . , N + 1. 8. The value of P CSij is calculated according to (6). This stage needs approximately N −i additions. The obtained value is compared with that of aji,N +1 . In case of the negative answer, stages 7,8 are performed again. i+1 are calculated, k = i+1, . . . ,N+1. 9. The elements ai+1 jk in the j-th row of A i+1 10. The value of P CSj is computed according to expression (6). This stage also needs approximately N − i additions. The computed value is compared with that of ai+1 j,N +1 . In case of the negative case, stages 9,10 are repeated.
The procedures of error detection and correction increase the complexity of the Givens algorithm on N 2 /2 + O(N ) multiply-add operations and N 2 + O(N ) additions, for M = N . Due to increased sizes of the input matrix, the additional overhead of the proposed algorithm is 2N 2 + O(N ) multiplications and N 2 + O(N ) additions (M = N ). As a result, the complexity of the whole algorithm is increased approximately on 2.5N 2 + O(N ) multiply-add operations and 2N 2 + O(N ) additions. At this cost, the proposed algorithm enables us to correct one single transient error occured in each row or column of A at any among K steps of computations. Consequently, for M = N , it is possible to correct up to N 2 single errors when solving the whole problem.
4
Parallel Implementation
The dependence graph G1 of the proposed algorithm is shown in Fig.1 for M = 4, N = 3. Nodes of G1 are located in vertices of the integer lattice
802
Oleg Maslennikow et al.
Fig. 1. Graph of the algorithm
Q = {K = (i, j, k) : 1 ≤ i ≤ K; i + 1 ≤ j ≤ M ; i ≤ k ≤ N }. There are two kind of nodes in G1 . Note that non-local arcs marked with broken lines, and given by vectors d5 = (0, i + 1 − M, 0), d6 = (0, 0, i − N ) are result of introducing ABFT properties into the original algorithm. To run the algorithm in parallel on a processor array with local links, all the vectors d5 are excluded using the TTR technique for computing values of ajii , cji and sji. This needs to execute additionally 6N 2 + O(N ) multiplications and additions. Then all the non-local vectors d6 are eliminated by projecting G1 along k-axis. As a result, a 2-D graph G2 is derived (see Fig.2). To run G2 on a linear array with a fixed number n of processors, G2 should be decomposed into a set of s =]N/n[ subgraphs with the ”same” topology and without bidirectional data dependencies. Such a decomposition is done by cutting G2 by a set of straight lines parallel to j-axis. These subgraphs are then mapped into an array with n processors by projecting each subgraph onto iaxis [12]. The resulting architecture, which is provided with an external RAM module, features a simple scheme of local communications and a small number of I/O channels. The proposed ABFT Givens algorithm is executed on this array s in [( N + 3 − n(i − 1)) + (M − n(i − 1) )] T = i=1
time steps. For M = N, we have T = N^3/n − (N − n)N^2/2 + (2N − n)N^2/(6n). The processor utilization is E_n = W/(T · n), where W is the computational complexity of the proposed algorithm. Using these formulae, for example in the case of s = 10, we obtain E_n = 0.86. Note that as the parameter s increases, the value of E_n also increases.
Fig. 2. Fixed-sized linear array
References
1. G.H. Golub, C.F. Van Loan, Matrix computations, John Hopkins Univ. Press, Baltimore, Maryland, 1996.
2. Å. Björck, Numerical methods for least squares problems, SIAM, Philadelphia, 1996.
3. S.Y. Kung, VLSI array processors, Prentice-Hall, Englewood Cliffs, N.J., 1988.
4. S.E. Butner, Triple time redundancy, fault-masking in byte-sliced systems. Tech. Rep. CSL TR-211, Dept. of Elec. Eng., Stanford Univ., CA, 1981.
5. K.H. Huang, J.A. Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., C-35 (1984) 518-528.
6. V.S. Nair, J.A. Abraham, Real-number codes for fault-tolerant matrix operations on processor arrays. IEEE Trans. Comp., C-39 (1990) 426-435.
7. J.-Y. Han, D.C. Krishna, Linear arithmetic code and its application in fault tolerant systolic arrays, in Proc. IEEE Southeastcon, 1989, 1015-1020.
8. J.S. Kaniewski, O.V. Maslennikow, R. Wyrzykowski, Algorithm-based fault tolerant matrix triangularization on VLSI processor arrays, in Proc. Int. Workshop "Parallel Numerics'95", Sorrento, Italy, 1995, 281-295.
9. J.S. Kaniewski, O.V. Maslennikow, R. Wyrzykowski, Algorithm-based fault tolerant solution of linear systems on processor arrays, in Proc. 7-th Int. Workshop "PARCELLA '96", Berlin, Germany, 1996, 165-173.
10. D.E. Schimmel, F.T. Luk, A practical real-time SVD machine with multi-level fault tolerance. Proc. SPIE, 698 (1986) 142-148.
11. F.T. Luk, H. Park, An analysis of algorithm-based fault tolerance techniques. Proc. SPIE, 696 (1986) 222-227.
12. R. Wyrzykowski, J. Kanevski, O. Maslennikov, Mapping recursive algorithms into processor arrays, in Proc. Int. Workshop Parallel Numerics'94, M. Vajtersic and P. Zinterhof eds., Bratislava, 1994, 169-191.
Parallel Sparse Matrix Computations Using the PINEAPL Library: A Performance Study

Arnold R. Krommer

The Numerical Algorithms Group Ltd, Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, UK
[email protected]
Abstract. The Numerical Algorithms Group Ltd is currently participating in the European HPCN Fourth Framework project on Parallel Industrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the existing NAG Parallel Library for dealing with computationally intensive industrial applications by appropriately extending the range of library routines. Additionally, several industrial applications are being ported onto parallel computers within the PINEAPL project by replacing sequential code sections with calls to appropriate parallel library routines. A substantial part of the library material being developed is concerned with the solution of PDE problems using parallel sparse linear algebra modules. This talk provides a number of performance results which demonstrate the efficiency and scalability of core computational routines – in particular, the iterative solver, the preconditioner and the matrix-vector multiplication routines. Most of the software described in this talk has been incorporated into the recently launched Release 1 of the PINEAPL Library.
1 Introduction
The NAG Parallel Library enables users to take advantage of the increased computing power and memory capacity offered by multiple processors. It provides parallel subroutines in some of the areas covered by traditional numerical libraries (in particular, the NAG Fortran 77, Fortran 90 and C libraries), such as dense and sparse linear algebra, optimization, quadrature and random number generation. Additionally, the NAG Parallel Library supplies support routines for data distribution, input/output and process management purposes. These support routines shield users from having to deal explicitly with the message-passing system – which may be MPI or PVM – on which the library is based. Targeted primarily at distributed memory computers and networks of workstations, the NAG Parallel Library also performs well on shared memory computers whenever efficient implementations of MPI or PVM are available. The fundamental design principles of the existing NAG Parallel Library are described in [7], and information about contents and availability can be found in the library Web page at http://www.nag.co.uk/numeric/FD.html.
NAG is currently participating in the HPCN Fourth Framework project on Parallel Industrial NumErical Applications and Portable Libraries (PINEAPL). One of the main goals of the project is to increase the suitability of the NAG Parallel Library for dealing with a wide range of computationally intensive industrial applications by appropriately extending the range of library routines. In order to demonstrate the power and the efficiency of the resulting Parallel Library, several application codes from the industrial partners in the PINEAPL project are being ported onto parallel and distributed computer systems by replacing sequential code sections with calls to appropriate parallel library routines. Further details on the project and its progress can be found in [1] and in the project Web page at http://www.nag.co.uk/projects/PINEAPL.html. This paper focuses on the parallel sparse linear algebra modules being developed by NAG.1 These modules are used within the PINEAPL project in connection with the solution of PDE problems based on finite difference and finite element discretization techniques. Most of the software described in this paper has been incorporated into the recently launched Release 1 of the PINEAPL Library. The scope of support for parallel PDE solvers provided by the parallel sparse linear algebra modules is outlined in Section 2. Section 3 provides a study of the performance of core computational routines – in particular, the iterative solver, the preconditioner and the matrix-vector multiplication routines. Section 4 summarizes the results and gives an outlook on future work.
2 Scope of Library Routines
A number of routines have been and are being developed and incorporated into the PINEAPL and the NAG Parallel Libraries in order to assist users in performing some of the computational steps (as described in [7]) required to solve PDE problems on parallel computers. The extent of support provided in the PINEAPL Library is – among other things – determined by whether or not there is demand for parallelization of a particular step in at least one of the PINEAPL industrial applications. The routines in the PINEAPL Library belong to one of the following classes:

Mesh Partitioning Routines which decompose a given computational mesh into a number of sub-meshes in such a way that certain objective functions (measuring, for instance, the number of edge cuts) are optimized. They are based on heuristic, multi-level algorithms similar to those described in [4] and [2].

Sparse Linear Algebra Routines which are required for solving systems of linear equations and eigenvalue problems resulting from discretizing PDE problems. These routines can be classified as follows:
1 Other library material being developed by PINEAPL partners includes optimization, Fast Fourier Transform, and Fast Poisson Solver routines.
Iterative Schemes which are the preferred method for solving large-scale linear problems. PINEAPL Library routines are based on Krylov subspace methods, including the CG method for symmetric positive-definite problems, the SYMMLQ method for symmetric indefinite problems, as well as the RGMRES, CGS, BICGSTAB(ℓ) and (TF)QMR methods for unsymmetric problems.

Preconditioners which are used to accelerate the convergence of the basic iterative schemes. PINEAPL Library routines employ a range of preconditioners suitable for parallel execution. These include domain decomposition-based, specifically additive and multiplicative Schwarz preconditioners.2 Additionally, preconditioners based on classical matrix splittings, specifically Jacobi and Gauss-Seidel splittings, are provided, and multicolor orderings of the unknowns are utilized in the latter case to achieve a satisfactory degree of parallelism [3]. Furthermore, the inclusion of parallel incomplete factorization preconditioners based on the approach taken in [5] is planned.

Basic Linear Algebra Routines which, among other things, calculate sparse matrix-vector products.

Black-Box Routines which provide easy-to-use interfaces at the price of reduced flexibility.

Preprocessing Routines which perform matrix transformations and generate auxiliary information required to perform the foregoing parallel sparse matrix operations efficiently.

In-Place Generation Routines which generate the non-zero entries of sparse matrices concurrently: each processor computes those entries which – according to the given data distribution – have to be stored on it. In-place generation routines are also provided for vectors which are aligned with sparse matrices.

Distribution/Assembly Routines which (re)distribute the non-zero entries of sparse matrices or the elements of vectors, stored on a given processor, to other processors according to given distribution schemes. They also assemble distributed parts of dense vectors on a given processor, etc.

Additionally, all routines are available for both real and complex data.
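To illustrate why multicolor orderings make splitting-based preconditioners parallelizable, the following sketch (ours, not a PINEAPL routine) performs one red-black Gauss-Seidel sweep on the 1D model problem 2x[i] − x[i−1] − x[i+1] = b[i]; within each color, every update depends only on unknowns of the other color, so each inner loop could be executed fully in parallel or distributed over processors.

```cpp
#include <cstddef>
#include <vector>

// One two-color (red-black) Gauss-Seidel sweep for the 1D model problem
// 2*x[i] - x[i-1] - x[i+1] = b[i] with homogeneous Dirichlet boundaries.
void redBlackSweep(std::vector<double>& x, const std::vector<double>& b) {
    const std::size_t n = x.size();
    for (int color = 0; color < 2; ++color) {
        // All unknowns of one color are independent of each other.
        for (std::size_t i = static_cast<std::size_t>(color); i < n; i += 2) {
            const double left  = (i > 0)     ? x[i - 1] : 0.0;
            const double right = (i + 1 < n) ? x[i + 1] : 0.0;
            x[i] = 0.5 * (b[i] + left + right);
        }
    }
}
```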
3 Performance of Library Routines
For performance evaluation purposes, a simple PDE solver module, which uses a number of sparse matrix routines available in Release 1 of the PINEAPL Library, has been developed. See Krommer [6] for a description of the solver module and the numerical problem it solves. Performance data generated by this module is used in this section to demonstrate the efficiency and scalability of core computational routines, such as the iterative solver, the matrix-vector multiplication and the preconditioner routines.
2 The subsystems of equations arising in these preconditioners are solved approximately on each processor, usually based on incomplete LU or Cholesky factorizations.
3.1 Test Settings
All experiments were carried out for a small-size problem configuration, n1 = n2 = n3 = 30, resulting in a system of n = 27 000 equations, and for a medium-size problem configuration, n1 = n2 = n3 = 60, resulting in a system of n = 216 000 equations. Most equations, except those corresponding to boundary grid points, contained seven non-zero coefficients. The low computation cost of the small problem helps to reveal potential inefficiencies – in terms of their communication requirements – of the library routines tested, whereas a comparison of the results for the two different problem sizes makes it possible to assess the scalability of the routines tested for increasing problem size. All performance tests were carried out on a Fujitsu AP3000 at the Fujitsu European Centre for Information Technology (FECIT). The Fujitsu AP3000 is a distributed-memory parallel system consisting of UltraSPARC processor nodes and a proprietary two-dimensional torus interconnection network. Each of the twelve computing nodes at FECIT used in the experiments was equipped with a 167 MHz UltraSPARC II processor and 512 Mb of memory.
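The test systems arise from a seven-point finite-difference stencil; a small hypothetical sketch (the actual test problem and its coefficients are described in [6]) shows where the "seven non-zero coefficients per interior equation" come from: every interior grid point couples to itself and to its six axis-aligned neighbours.

```cpp
#include <cstddef>

// Count the non-zeros of a 7-point finite-difference operator on an
// n1 x n2 x n3 grid: interior points have 7 coefficients, points on the
// boundary lose one coefficient per missing neighbour.
std::size_t sevenPointNonZeros(std::size_t n1, std::size_t n2, std::size_t n3) {
    std::size_t nnz = 0;
    for (std::size_t i = 0; i < n1; ++i)
        for (std::size_t j = 0; j < n2; ++j)
            for (std::size_t k = 0; k < n3; ++k) {
                nnz += 1;                      // diagonal entry
                nnz += (i > 0) + (i + 1 < n1); // neighbours in x
                nnz += (j > 0) + (j + 1 < n2); // neighbours in y
                nnz += (k > 0) + (k + 1 < n3); // neighbours in z
            }
    return nnz;
}
```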
3.2 Performance of Unpreconditioned Iterative Solvers
Figure 1 shows the speed-ups of unpreconditioned iterative solvers for the two test problems. The iterative schemes tested – BICGSTAB(1), BICGSTAB(2) and CGS – have been found previously to be the most effective in the unsymmetric solver suite. BICGSTAB(2) and CGS can be seen to scale (roughly) equally well, whereas BICGSTAB(1) scales considerably worse (in particular, for the smaller problem). The results are generally better for the larger problem than for the smaller, especially when the maximum number of processors is used. In this case, the improvement in speed-up is between 10 and 20 %, depending on the specific iterative scheme.
3.3 Performance of Matrix-Vector Multiplication
When solving linear systems using iterative methods, a considerable fraction of time is spent performing matrix-vector multiplications. Apart from its use in iterative solvers, matrix-vector multiplication is also a crucial computational kernel in time-stepping methods for PDEs. It is therefore interesting to study the scalability of this algorithmic component separately. Figure 2 shows the speed-ups of the matrix-vector multiplication routine in the PINEAPL Library. The results demonstrate a degree of scalability of this routine in terms of both increasing number of processors and increasing problem size. In particular, near-linear speed-up is attained for all numbers of processors for the larger problem.
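For orientation, the following sketch shows one simple way a row-distributed sparse matrix-vector product can be organised; it is not the PINEAPL routine and, for brevity, assembles the whole vector on every process with MPI_Allgatherv, whereas a scalable library implementation would exchange only the halo entries actually referenced by the local rows.

```cpp
#include <mpi.h>
#include <vector>

// Each process owns a block of rows in CSR format and the matching block of x.
struct LocalCsr {
    std::vector<int>    rowPtr;  // size localRows + 1
    std::vector<int>    col;     // global column indices
    std::vector<double> val;
};

void matVec(const LocalCsr& A, const std::vector<double>& xLocal,
            std::vector<double>& yLocal, MPI_Comm comm) {
    int nProcs = 0;
    MPI_Comm_size(comm, &nProcs);

    // Gather the block sizes, then the full vector, on every process.
    int localN = static_cast<int>(xLocal.size());
    std::vector<int> counts(nProcs), displs(nProcs);
    MPI_Allgather(&localN, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);
    int globalN = 0;
    for (int p = 0; p < nProcs; ++p) { displs[p] = globalN; globalN += counts[p]; }

    std::vector<double> xGlobal(globalN);
    MPI_Allgatherv(xLocal.data(), localN, MPI_DOUBLE,
                   xGlobal.data(), counts.data(), displs.data(), MPI_DOUBLE, comm);

    // Purely local CSR product over the owned rows.
    const int localRows = static_cast<int>(A.rowPtr.size()) - 1;
    yLocal.assign(localRows, 0.0);
    for (int i = 0; i < localRows; ++i)
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            yLocal[i] += A.val[k] * xGlobal[A.col[k]];
}
```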
3.4 Performance of Multicolor SSOR Preconditioners
Multicolor SSOR preconditioners are very similar to matrix-vector multiplication in their computational structure and their communication requirements.
Fig. 1. Speed-Up of Unpreconditioned Iterative Solvers
Fig. 2. Speed-Up of Matrix-Vector Multiplication
In particular, the two operations are equal in their computation-to-communication ratios. However, the number of messages transferred in multicolor SSOR preconditioners is larger than in matrix-vector multiplication (i.e. the messages are shorter). As a result, one would expect multicolor SSOR preconditioners to exhibit similar, but slightly worse scalability than matrix-vector multiplication. Figure 3 confirms this expectation for configurations with up to eight processors. Between eight and twelve processors, however, the speed-up curves flatten out significantly. The precise reasons for this uncharacteristic behavior will have to be investigated in future work.
Fig. 3. Speed-Up of Multicolor SSOR Preconditioners
3.5 Performance of Additive Schwarz Preconditioners
The multicolor orderings used in the PINEAPL SSOR preconditioners (see Section 3.4) may vary depending on the number of processors and/or data distribution used. However, experiments have shown that the quality of multicolor SSOR preconditioners is fairly independent of the particular multicolor orderings they are based on: The number of iterations required to achieve convergence does not vary significantly when different multicolor SSOR preconditioners are used in an iterative solver. With additive Schwarz preconditioners, the situation is very different: The quality of additive Schwarz preconditioners usually decreases when the number of processors – and hence, the number of subdomains – is increased, i. e. the
number of iterations required to achieve convergence with a corresponding preconditioned iterative solver increases. Therefore, the performance of additive Schwarz preconditioners cannot be evaluated separately, but only in conjunction with their application in iterative solvers. Figure 4 shows the speed-ups of the BICGSTAB(1) iterative solver in conjunction with several additive Schwarz preconditioners.3 The additive Schwarz preconditioners tested differ in the size of the overlap regions between subdomains. In particular, the choice "overlap=0" corresponds to a traditional block Jacobi preconditioner. The scalability results are clearly worse than those shown in the previous sections. In particular, the transition from one to two processors actually results in a performance decrease for the block Jacobi preconditioner. Generally speaking, the overlapping additive Schwarz preconditioners perform significantly better than the block Jacobi preconditioner for small numbers of processors; for larger numbers of processors, the performance gain from using overlap regions decreases or – for the smaller problem – is completely lost. In Krommer [6] the hardware performance4 speed-ups corresponding to the temporal performance5 speed-ups represented in Figure 4 are investigated. It turns out that the hardware performance scales significantly better than the temporal performance – indeed, the results for the hardware performance are quite comparable to the results obtained for the unpreconditioned solvers in Figure 1. This fact demonstrates that (i) the implementation of the additive Schwarz preconditioners is efficient in terms of communication requirements and that (ii) the main reason for the relatively poor temporal scalability lies in the unsatisfactory quality of the additive Schwarz preconditioners.
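The notion of "overlap" can be made concrete with a small sketch (ours, not the library's interface): starting from the unknowns owned by a processor, each overlap level adds the neighbours, in the adjacency graph of the matrix, of the current set, so overlap 0 reproduces block Jacobi and larger overlaps enlarge the local subdomain problems.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Grow a subdomain by 'overlap' layers in the adjacency graph of the matrix.
std::set<int> growSubdomain(const std::vector<std::vector<int>>& adjacency,
                            const std::vector<int>& ownedUnknowns,
                            int overlap) {
    std::set<int> domain(ownedUnknowns.begin(), ownedUnknowns.end());
    std::vector<int> frontier(ownedUnknowns);
    for (int level = 0; level < overlap; ++level) {
        std::vector<int> next;
        for (int u : frontier)
            for (int v : adjacency[u])
                if (domain.insert(v).second)   // newly added unknown
                    next.push_back(v);
        frontier.swap(next);
    }
    return domain;
}
```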
4 Conclusion and Outlook
The performance study in Section 3 has established the efficiency and scalability of a number of core computational routines in the sparse linear algebra modules of Release 1 of the PINEAPL Library. The only main deficiency detected is the poor quality of additive Schwarz preconditioners for two or more processors when compared to the additive Schwarz preconditioner for a single processor (=incomplete LU factorization). It is planned to overcome this problem by providing parallel incomplete factorization preconditioners in Release 2 of the PINEAPL Library.
3 The subsystems associated with different domains were approximately solved using local modified ILU(0) factorizations.
4 The hardware performance of a program is defined as the ratio between the number of floating-point operations carried out and the run-time of the program.
5 The temporal performance of a program is defined as the reciprocal of the run-time of the program.
Fig. 4. Speed-Up of BICGSTAB(1) with Additive Schwarz Preconditioners
References
1. M. Derakhshan, S. Hammarling, A. Krommer, PINEAPL: A European Project on Parallel Industrial Numerical Applications and Portable Libraries, in "Recent Advances in Parallel Virtual Machine and Message Passing Interface" (M. Bubak et al., Eds.), Vol. 1332 of LNCS, Springer-Verlag, Berlin, 1997, pp. 337–342.
2. A. Gupta, Fast and Effective Algorithms for Graph Partitioning and Sparse Matrix Ordering, IBM Research Report RC 20496, IBM T.J. Research Center, Yorktown Heights, New York, 1996.
3. M. T. Jones, P. E. Plassman, Scalable Iterative Solution of Sparse Linear Systems, Parallel Computing 20 (1994), pp. 753–773.
4. G. Karypis, V. Kumar, Multilevel k-Way Partitioning Scheme for Irregular Graphs, Technical Report TR-95-064, Department of Computer Science, University of Minnesota, 1995.
5. G. Karypis, V. Kumar, Parallel Threshold-Based ILU Factorization, Technical Report TR-96-061, Department of Computer Science, University of Minnesota, 1996.
6. A. Krommer, Parallel Sparse Matrix Computations Using the PINEAPL Library: A Performance Study, Technical Report TR1/98, The Numerical Algorithms Group Ltd, Oxford, UK, 1998.
7. A. Krommer, M. Derakhshan, S. Hammarling, Solving PDE Problems on Parallel and Distributed Computer Systems Using the NAG Parallel Library, in "High-Performance Computing and Networking" (B. Hertzberger, P. Sloot, Eds.), Vol. 1225 of LNCS, Springer-Verlag, Berlin, 1997, pp. 440–451.
Using a General-Purpose Numerical Library to Parallelize an Industrial Application: Design of High-Performance Lasers

Ida de Bono (1), Daniela di Serafino (1,2), and Eric Ducloux (3)

(1) Center for Research on Parallel Computing and Supercomputers, CPS-CNR, Naples, Italy
(2) The Second University of Naples, Caserta, Italy
(3) Thomson-CSF/LCR, Orsay Cedex, France
Abstract. We describe the development of a parallel version of an industrial application code, used to simulate the propagation of optical beams inside diodes pumped rod lasers. The parallel version makes an intensive use of a 2D FFT routine from a general-purpose parallel numerical library, the PINEAPL Library, developed within an ESPRIT project. We discuss some issues related to the integration of the library routine into the application code and report first performance results.
1 Introduction
The development of high performance lasers is presently a very active field. An industrial challenge is the production of 100W-1KW diodes pumped rod lasers with high compactness, high beam quality and high energy conversion. The numerical simulation of the propagation of optical beams inside these lasers plays a central role in the optimization of the whole design process and leads to a reduction of the development time. On the other hand, the industrial design of rod lasers requires an intensive simulation and an effective use of high-performance computers allows to reduce the so-called “time-to-market”. It also gives the possibility of increasing the level of detail in the simulation without increasing the computation time, which is an important requirement in an efficient design process. In this work we present the main issues concerning the development of a parallel version of an industrial code, used at Thomson-CSF LCR to model the propagation of optical beams in diodes pumped rod lasers. This version, developed for MIMD distributed-memory machines, has been obtained exploiting software from a general-purpose parallel numerical library. Numerical software libraries play a significant role in the development of application codes, since they provide building blocks for the solution of computational kernels arising in the modeling of scientific problems. The availability of reliable, accurate and efficient numerical software allows users to develop their application codes without dealing with
This work has been supported by the European Commission under the ESPRIT contract 20018 (PINEAPL Project).
details related to the implementation of numerical algorithms, improving at the same time the quality of the computed solutions [2]. In particular, the use of numerical software for High-Performance Computing environments can simplify the porting of application codes to such environments, allowing also the introduction of different mathematical models and numerical techniques, that can better exploit the powerful resources of advanced machines. This work has been carried out as a part of the PINEAPL (Parallel Industrial NumErical Applications and Portable Libraries) Project, an ESPRIT Project in the area of High-Performance Computing and Networking, with a duration of three years (1996-1998) [3]. The main goal of the project is to produce a general-purpose library of parallel numerical software suitable for a wide range of computationally intensive industrial applications. This project is a collaborative effort among academic/research partners (CERFACS (F), CPS-CNR (I), IBM Italia (I), Manchester University (UK) and Math-Tech (DK)) and industrial partners (British Aerospace (UK), Danish Hydraulic Institute (DK), Piaggio Veicoli Europei (I) and Thomson-CSF (F)), under the coordination of NAG (UK). The industrial partners are the end-users of the project and have provided several application codes, that are ported on parallel machines using routines from the PINEAPL Library. A first release of the library is already available and includes numerical routines in the areas of Fast Fourier Transforms (FFTs), Nonlinear Optimization and Sparse Linear Algebra.1 The parallel version of the rod laser application code, described in this paper, integrates the 2D FFT routine of the PINEAPL Library.
2 Numerical Simulation of Beam Propagation Inside Rod Lasers
The numerical simulation is mainly devoted to describe the evolution of optical beams, which propagate inside the rod laser following a “zig-zag” trajectory (Figure 1). The final goal is to estimate the deformation and the amplification of laser spots, and to find an optimal coupling between laser beams and pumping devices. The complete simulation is performed using three application codes,
Fig. 1. Schematic picture of a rod laser.
1 For more details see the URL http://www.nag.co.uk/Projects/PINEAPL.html.
modeling the heating of the rod, the propagation of the beams and the transformation due to the reflection on mirrors. The work described here is devoted to the development of a parallel version of the propagation code. The mathematical model of the propagation of an optical beam is based on the Helmholtz equation:

∆Φ + k_0² n² Φ = 0,

where Φ = Φ(x, y, z) is the (polarized) electromagnetic field, n is the index of the medium, k_0 = 2π/λ is the wave number corresponding to the wavelength λ, and Φ and n have complex values. Let us assume for simplicity that the direction of propagation is z. The fast variation of the field along z is separated from the "modulation", Ψ(x, y, z), leading to an equation of the form:

∆Ψ + 2i k_z ∂Ψ/∂z + k_0² (n² − n_z²) Ψ = 0,    (1)

where k_z is a suitable wave number, n_z = k_z/n_0 and i = √−1. The solution of equation (1) is performed using a method in the class of Beam Propagation Methods (BPMs) [6], that are widely used in integrated optics. The BPM used here is based on a splitting of equation (1) into two equations, modelling the propagation and the "lens" diffraction, respectively:

∆_{x,y} Ψ + ∂²Ψ/∂z² + 2i k_z ∂Ψ/∂z = 0,    (2)

2i k_z ∂Ψ/∂z + k_0² (n² − n_z²) Ψ = 0,    (3)
where ∆_{x,y} is the Laplace operator in the (x, y)-plane. The solution is obtained by solving in turn equations (2) and (3) along the z-direction, that is, by repeating the following steps:
– propagation from z to z + dz/2 (eq. (2));
– lens diffraction from z to z + dz (eq. (3));
– propagation from z + dz/2 to z + dz (eq. (2)).
Each propagation step is carried out by performing a transformation from the (x, y)-plane to the 2D frequency space, where the solution is known analytically, and coming back to the (x, y)-plane. The transformations are obtained as direct and inverse Discrete Fourier Transforms (DFTs) on complex data arising from the discretization of equation (2) on a uniform grid. The DFTs are computed using FFT algorithms. We note that the computational (x, y)-domain depends on the angle of incidence of the beam at the entry of the rod laser. For large angles, the beam can cross the boundaries of the rod; in this case the domain is unfolded, i.e. it is reflected with respect to the crossed boundaries, in x- or y-direction or both, and the computation is performed in a domain that is twice or four times the original domain.
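For orientation, and under the simplifying assumption (ours, made only for illustration) that the second z-derivative in (2) can be neglected, the analytic substep solutions take a simple form. Writing Ψ̂(κ_x, κ_y, z) for the 2D Fourier transform of Ψ, equation (2) gives Ψ̂(κ_x, κ_y, z + h) = Ψ̂(κ_x, κ_y, z) · exp(−i (κ_x² + κ_y²) h / (2 k_z)) for a propagation substep of length h, while equation (3) gives the pointwise phase factor Ψ(x, y, z + dz) = Ψ(x, y, z) · exp(i k_0² (n² − n_z²) dz / (2 k_z)). This is why the propagation substep reduces to a forward FFT, a diagonal multiplication in frequency space and an inverse FFT, whereas the lens substep requires no communication at all.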
3 Integration of the Parallel 2D FFT Routine
Until now, most of the work has been carried out on a simplified version of the propagation code, named PROBLA, which neglects the coupling with thermal effects and some physical details that increase the complexity of the model. This version is considered as a testbed, to evaluate efforts and improvements concerning the integration of the parallel 2D FFT of the PINEAPL Library. The experience gained in working on the simplified code will be exploited to develop a parallel version of the complete industrial code. PROBLA is written in Fortran 77, apart from some routines for the dynamical allocation of memory, that are written in C. Its structure is sketched in Figure 2. The bulk of computation is in step 5.1, as it is shown by the plots
1. read input parameters;
2. initialize the electro-magnetic field;
3. compute the mean index of the medium;
4. compute the energy of the beam;
5. for i = 1, visual_steps
   5.1 simulate the beam propagation (BPM);
   5.2 rearrange the data to be visualized;
   endfor
6. reconstruct the real field from the computational field;
7. compute the energy of the beam;
8. output the electro-magnetic field to be visualized.
Fig. 2. Structure of the PROBLA code.
of the execution times reported in Section 4. This step implements the BPM method and hence makes an intensive use of 2D FFTs. The BPM is applied between two successive visualization positions, that is positions along the direction of propagation where selected values of the electro-magnetic field are required for a future visualization of the evolution of the beams along that direction. Taking into account the above considerations, a speedup can be achieved parallelizing the execution of each FFT. Hence, our work has been devoted to the integration of the parallel 2D FFT routine of the the PINEAPL Library into the application code. This routine performs a complex-to-complex FFT on a 2D array of data, distributed in row-block form over a 2D logical grid of processors, i.e. assigning approximately the same number of consecutive rows of the array to each processor, moving row by row on the processor grid. In the rod laser application code, such data distribution corresponds to partitioning the (x, y) computational grid along the x-direction, that is into vertical strips. The algorithm implemented in the parallel FFT routine is the so-called fourstep algorithm, that is: 1D FFTs along the rows of the 2D array, transpose,
1D FFTs along the rows, and transpose. All the communication is performed in the transpose steps, and the second transpose can be avoided when a direct transform followed by an inverse transform must be executed, as it is in the PROBLA code. All the communication is performed using the Basic Linear Algebra Communication Subprograms (BLACS) [4], as in most of the PINEAPL Library routines. More details are given in [1]. An efficient integration of the parallel FFT routine requires the introduction of the row-block distribution in the whole application code, to avoid unnecessary data movements among different stages of the simulation. Therefore, the different steps of PROBLA have been analyzed and modified to use this data distribution. In the BPM step (5.1), the solution in the frequency space and the computation of the lens diffraction are local processes, and hence are performed without any communication among processors, doing only minor changes in the corresponding parts of the application software. A few changes have been introduced to rearrange the data to be visualized (step 5.2), although this step does not require any communication. The initialization of the electro-magnetic field (step 2) is local too, and has been only slightly modified to be run in parallel. Global communications are required in some initial and final steps, such as the computation of the mean index of the medium (step 3) and of the energy of the beam (steps 4 and 7). These steps have been reduced to global sum operations, that are executed using ad hoc routines from the BLACS. Modifications have been introduced to reconstruct the real field from the computational field (step 6), when the unfolding along the x-direction is performed. In this case, processors holding rows that are symmetric with respect to the central row of the computational domain, exchange data to reconstruct the field in the effective domain. The input/output has not been parallelized and is performed by only one processor, which distribute to the others the input parameters and gathers the data to be printed out. This allows to have the same input/output files as in the sequential code; moreover, it has been observed that generally the input/output does not account for a large percentage of the total execution time (see Section 4).
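The row-FFT / transpose / row-FFT structure of the four-step algorithm can be illustrated with the following compact sketch (ours, not the PINEAPL routine); a naive O(n²) DFT stands in for the real 1D FFT kernel purely to keep the example self-contained, and the distribution of the rows over processors is ignored.

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cd   = std::complex<double>;
using Grid = std::vector<std::vector<cd>>;   // grid[row][col]

// Naive 1D DFT, used here only as a stand-in for a real FFT kernel.
static std::vector<cd> dft1d(const std::vector<cd>& a, int sign) {
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += a[j] * std::polar(1.0, sign * 2.0 * pi *
                                        static_cast<double>(j) * static_cast<double>(k) /
                                        static_cast<double>(n));
    return out;
}

static Grid transpose(const Grid& g) {
    Grid t(g[0].size(), std::vector<cd>(g.size()));
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = 0; j < g[i].size(); ++j)
            t[j][i] = g[i][j];
    return t;
}

// Transform the rows, transpose, transform the rows again.  The final
// transpose is omitted, which is harmless when a direct transform is
// immediately followed by an inverse one, as in the BPM propagation step.
Grid fft2dNoFinalTranspose(Grid g, int sign) {
    for (auto& row : g) row = dft1d(row, sign);
    g = transpose(g);
    for (auto& row : g) row = dft1d(row, sign);
    return g;                                  // result left in transposed layout
}
```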
4 Numerical Experiments
First experiments have been carried out to evaluate the efficiency of the parallel version of the rod laser application code, using two sets of benchmarks of interest for Thomson-CSF. In the following, they are referred to as Benchmark 1 and Benchmark 2. In both benchmark sets, the number of grid points in the ydirection, ny , is kept fixed, while the number of grid points in the x-direction, nx , varies. In Benchmark 1 nx = 64, 128, 256, 512, 1024 and ny = 64; in Benchmark 2 nx = 64, 128, 256, 512, 1024 and ny = 128. Moreover, the increment along z is dz = 10 µm, the number of visualization steps, i.e. the number of times the BPM step is repeated, is 100, and the domain is unfolded along x, thus the number of grid points in this direction is doubled.
The experiments have been performed on an IBM SP2, at CPS-CNR. This machine has 12 Power2 Super Chip thin nodes (160 MHz), each with 512 MBytes of memory and a 128 Kbytes L1 data cache; the nodes are connected via an SP switch, with a peak bi-directional bandwidth of 110 MBytes/sec. The SP2 has the AIX 4.2.1 operating system; the BLACS 1.1 are implemented on top of the IBM proprietary version 2.3 of the MPI message-passing library [5]. The Fortran part of the parallel code has been compiled with the XL Fortran compiler, vers. 4.1, and the C part with the XL C compiler, vers. 3.1.4. Figure 3 shows the execution times, in seconds, of a single BPM step and of the whole PROBLA (including input/output), on 2, 4, 8, 12 processors, for Benchmark 1.

Fig. 3. Execution times of a BPM step and of PROBLA for Benchmark 1.

Remembering that a BPM step is performed 100 times, we see
that it accounts for a very large percentage of the total execution time. This percentage decreases as the number of processors increases; for example, it is more than 98% on 2 processors and reaches the lowest value of 95% on 12 processors. One reason for such reduction is the load imbalance generated by unfolding along the x-direction. In this case, the rearrangement of the data to be visualized (Figure 2, step 5.2) and the reconstruction of the real field from the computational field (Figure 2, step 6) are mainly performed by half the processors, that is by the processors that hold the effective domain, and hence the distribution of the computational work is not balanced. One more reason is the sequential input/output; its time is at most 1.5% of the total execution time on 2 processors, while it reaches 3.3% on 12 processors. Figure 4 shows the speedup of a BPM step and of PROBLA (including input/output) with respect to the original sequential code, for Benchmark 1.2 The parallel code has been compared with the sequential code to see the effective
The speedup is here defined as Sp = T1 /Tp , where T1 is the execution time of the original sequential code on one processor and Tp is the execution time of the parallel code on p processors, for the same problem.
gain obtained with the parallelization. The speedup lines of a BPM step and of
Fig. 4. Speedup of a BPM step and of PROBLA for Benchmark 1. PROBLA have similar behaviours; the values concerning PROBLA are generally slightly lower, according to the previous considerations on the execution times. The speedup plots show that for each number of processors there is a “best problem size” where the parallel code gets the largest speedup. For example, the largest speedup of the BMP step on 4 processors is obtained for nx = 64 and is 3.25, while on 8 processor it is obtained for nx = 256 and is 5.35 (the value of ny is not considered because it is constant). On the other hand, we can see that there is a “best number of processors” for each problem size, where the efficiency reaches the largest value.3 For example, for the BPM step with nx = 64 such best number is 4, with an efficiency of 0.78, while for nx = 256 it is 8 with an efficiency of 0.67. Using more processors than the optimal number does not increase the performance, as it is evident for nx = 64 and nx = 128. Experiments on more than 12 processors should be performed to confirm this on problems with larger sizes. The behaviour discussed above is due to the use of the cache memory and to the communications performed in the parallel global transpose operation inside the FFT routine. Finally, we note that there is no gain in terms of execution time where only 2 processors are used, except for nx = 64. This is because the sequential FFT used in the original code is generally faster than the parallel FFT on one processor and is comparable with it on 2 processors. This is probably due to the fact that the sequential FFT routine does not perform transpositions of data as the parallel FFT routine; moreover, the sequential FFT has been implemented specifically for radix-2 transforms, that are the transforms considered in PROBLA, while the parallel FFT has been implemented to perform more general mixed-radix transforms. 3
The efficiency is defined as Sp /p, i.e. as the speedup divided by the number of processors.
The previous considerations apply also to the execution times and the speedup of BPM and PROBLA for Benchmark 2, shown in Figures 5 and 6. Since the value of ny has been doubled with respect to Benchmark 1, the best
8 number of processors
12
Fig. 5. Execution times of a BPM step and of PROBLA for Benchmark 2.
4
8 number of processors
12
Fig. 6. Speedup of a BPM step and of PROBLA for Benchmark 2.
number of processors after which the performance degrades is generally larger, and is visible only for the cases nx = 64 and nx = 128. Moreover, the speedup values are generally higher than in Benchmark 1; on 12 processors, the BPM step reaches a speedup of 9 and the whole PROBLA code a speedup of 8.
5 Conclusions
A parallel version of an industrial application code has been developed, which integrates the 2D FFT routine of the parallel numerical library developed in the ESPRIT Project PINEAPL. First experiments have been performed to evaluate the performance of the parallel application code, obtaining satisfactory speedup values. Work is presently devoted to evaluate the scalability of the parallel code. Preliminary tests have been performed on a single BPM step, obtaining encouraging results; values of scaled efficiency4 between 0.75 and 0.84 have been measured on 4 and 8 processors, using problems of size nx × ny , with nx /p = 256, 512, 1024 and ny = 128.
References
1. Carracciuolo, L., de Bono, I., De Cesare, M.L., di Serafino, D., Perla, F.: Development of a Parallel Two-dimensional Mixed-radix FFT Routine. Tech. Rep. TR-978, Center for Research on Parallel Computing and Supercomputers (CPS) - CNR, Naples, Italy (1997)
2. di Serafino, D., Maddalena, L., Messina, P., Murli, A.: Some Perspectives on High-Performance Mathematical Software. In A. Murli and G. Toraldo eds., Computational Issues in High Performance Software for Nonlinear Optimization, Kluwer Academic Pub. (to appear)
3. di Serafino, D., Maddalena, L., Murli, A.: PINEAPL: a European Project to Develop a Parallel Numerical Library for Industrial Applications. In Euro-Par'97 Parallel Processing, C. Lengauer, M. Griebl and S. Gorlatch eds., Lecture Notes in Computer Science 1300 (1997), Springer, 1333–1339.
4. Dongarra, J.J., Whaley, R.C.: A Users' Guide to the BLACS v 1.0. LAPACK Working Note No. 94, Technical Report CS-95-281, Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN (1995)
5. Snir, M., Otto, S.W., Huss-Lederman, S., Walker, D.W., Dongarra, J.J.: MPI: The Complete Reference. MIT Press, Cambridge, MA (1996)
6. Van Roey, J., Van der Donk, J., Lagasse, P.E.: Beam-propagation method: analysis and assessment. J. Opt. Soc. Am. 71 (1981)
4 The scaled efficiency is here defined as SE_p = T_1(n_x, n_y)/T_p(p × n_x, n_y), where T_i(r, s) is the execution time of the parallel code on i processors, for a problem of size r × s.
Fast Parallel Hermite Normal Form Computation of Matrices over F[x]

Clemens Wagner

Rechenzentrum der Universität Karlsruhe, Germany
[email protected]
Abstract. We present an algorithm for computing the Hermite normal form of a polynomial matrix and an unimodular transformation matrix on a distributed computer network. We provide an algorithm for reducing the off-diagonal entries which is a combination of the standard algorithm and the reduce off-diagonal algorithm given by Chou and Collins. This algorithm is parametrised by an integer variable. We provide a technique for producing small multiplier matrices if the input matrix is not full-rank, and give an upper bound for the degrees of the entries in the multiplier matrix.
1
Introduction
We study the problem of computing the Hermite normal form and an unimodular transforming matrix of a large, rectangular input matrix over F[x]. Our motivation comes from the area of convolutional encoders where we need the transforming matrix to construct the decoder of a given minimal, basic encoder (cf. [6]). Since the degrees of the entries in the transforming matrix determine the size of the decoder's circuit, we are interested in a transforming matrix whose entries have small degrees.

Definition 1. A matrix H ∈ Mat_{m×n}(F[x]) with rank r ∈ N_0 is in Hermite normal form (HNF) if
1. the first r columns of H are nonzero, and the last n − r columns are zero;
2. for 1 ≤ j ≤ r let i_j be the index of the first nonzero entry in the j-th column; then 1 ≤ i_1 < i_2 < ... < i_{r−1} < i_r ≤ m;
3. the pseudo diagonal entries H_{i_j,j} for 1 ≤ j ≤ r are monic;
4. for 1 ≤ j ≤ r and 1 ≤ k < j the degree of the off-diagonal entry H_{i_j,k} is less than deg(H_{i_j,j}).

We call 1 ≤ i_1 < i_2 < ... < i_{r−1} < i_r ≤ m the row indices of the pseudo diagonal entries of H. If only the first two conditions are fulfilled we say H is in echelon form. Two matrices A, B ∈ Mat_{m×n}(F[x]) are right equivalent if there exists an unimodular matrix V ∈ GL_n(F[x]) such that AV = B. We call V a transforming matrix or multiplier.
Theorem 1. For each matrix A ∈ Mat_{m×n}(F[x]) there exists a unique, right equivalent matrix H ∈ Mat_{m×n}(F[x]) in Hermite normal form.
There are many well-known algorithms for computing the Hermite normal form and a multiplier of an integer matrix. Gaussian elimination based methods, as in Sims [13, p. 323], eliminate the above-diagonal entries of a whole row by extended greatest common divisor computations. There are many different specialisations of this general algorithm (e.g. [2], [4], [8]) which can be generalised to arbitrary Euclidean domains [14]. Gaussian elimination has worst-case exponential time and space complexity for matrices over F_{2^k}[x] and Q[x] [9]. But it was shown by Wagner [14], [15] that some variants of Gaussian elimination have a better practical performance than algorithms guaranteeing a polynomial time and space complexity.
2 Parallel Implementation
A parallel computer P := {π_0, ..., π_{N−1}} consists of N processors with distinct memory and a communication network. Let A ∈ Mat_{m×n}(F[x]), and for each 0 ≤ k ≤ N − 1 let

(l, k) := ⌊l/N⌋ + 1 if k < (l mod N), and (l, k) := ⌊l/N⌋ otherwise.

Each processor π_k on the parallel computer stores a ((m, k)+1) × n matrix A^(k) and a (n, k) × n matrix V^(k) which are submatrices of A and of the multiplier V ∈ GL_n(F[x]), respectively. The additional row (m, k)+1 of A^(k) is used for computations. We will refer to this row as the computation row.
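The row-cyclic placement described below can be written out as a small helper (a sketch; the names are ours): global row i goes to processor (i − 1) mod N as local row ⌊(i − 1)/N⌋ + 1, and the count of local rows reproduces the quantity (l, k) defined above.

```cpp
#include <cstddef>

// Row-cyclic distribution of an m x n matrix over processors 0..N-1.
struct Placement { std::size_t owner; std::size_t localRow; };

Placement place(std::size_t i, std::size_t N) {            // i is 1-based
    return { (i - 1) % N, (i - 1) / N + 1 };
}

// Number of rows of a length-l block stored on processor k:
// floor(l/N), plus one extra row for the first (l mod N) processors.
std::size_t localRowCount(std::size_t l, std::size_t k, std::size_t N) {
    return l / N + (k < l % N ? 1 : 0);
}
```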
a b c d
(a)
Fig. 1. Distribution of a matrix on a parallel computer (a & b.1), Broadcasting a part of a row (b.1 & b2) We distribute the input matrix A by storing the ith row of A in row i of matrix A(k) on processor πk where k := (i − 1) mod N and i := (i − 1)/N + 1
Fast Parallel Hermite Normal Form Computation of Matrices
823
for 1 ≤ i ≤ m. Note: The computation row of every A(k) will not be set here. We construct the identity matrix In in the matrices V (k) in a corresponding way. Figure 1 shows how a matrix with eight rows (a) is distributed on four processors of a parallel computer (b.1). In contrast to Gaussian elimination over fields Gaussian elimination over Euclidean domains needs several steps for eliminating the entries in a row. Over Euclidean domains we have just division with remainder instead of exact division. This is the reason why we distribute rows instead of columns on the parallel computer. The main HNF algorithm is given by pseudocode in Algorithm 1. Parallel-HNF(A(k) , V (k) , b) 1 input k processor index A(k) πk -part of a distributed m × n matrix A V (k) πk -part of the distributed n × n identity matrix V b non-negative integer 2 r←0 3 for i ← 1 to m do 4 g ← (i − 1) mod N 5 if g = k then 6 t ← (i − 1)/N + 1 (k) (k) 7 broadcast (At,r+1, . . . , At,n ) to all other 8 else 9 t ← (m, k) + 1 Index of computation row (k) (k) 10 receive (At,r+1 , . . . , At,n ) from πg (k) 11 A ← Compute-GCD(A(k) , V (k) , t, r + 1, n) (k) = 0 then 12 if At,r+1 13 r ←r+1 14 ir ← i 15 (A(k) , V (k) ) ← Parallel-ROD(A(k) , V (k) , r, b, i1 , . . . , ir ) 16 return (A(k) , V (k) )
Algorithm 1. Parallel HNF computation The matrix A(k) is the πk -part of A which means the rows of A (and the computation row) lying on processor πk ∈ P. This algorithm is executed on each processor in P with the appropriate A(k) , V (k) and a non-negative integer b as input. We explain the meaning of b below. In the description of the algorithm we refer to A and V , respectively, if we want to consider the whole matrix consisting of the πk -parts on each processor. In Line 4 we compute the index g of the processor which stores row Ai,∗ . Processor πg computes in Line 6 the index t of the row of A(g) which corresponds to row Ai,∗ , and broadcasts the entries Ai,r+1 , . . . , Ai,n to each other processor. Broadcasting a row to the computation rows of the other processors is illustrated in Figure 1 (b.1) and (b.2). The function Compute-GCD takes two matrices B and C with n columns, an integer i ≥ 1 and 1 ≤ j1 ≤ j2 ≤ n as input, where i is a valid row index of B. It produces right equivalent matrices
824
Clemens Wagner
B and C as output where B i,j1 = gcd(Bi,j1 , . . . , Bi,j2 ) and B i,j = 0 for j1 < j ≤ j2 . This algorithm computes B from B by unimodular column operations (swapping of two columns, multiplying a column with an unit, and adding a multiple of one column to another column) and does the same transformations on C for computing C . There are several different methods to obtain B from B (e. g. [1], [3], [7], [14]). The variable r corresponds to the number of linearly independent rows of A with index less than i. After execution of the for-loop r is the rank of A, i1 , . . . , ir are the indices of the pseudo diagonal entries of A, and A is in echelon form. The technique of computation rows make it easy to reuse already implemented sequential algorithms for the gcd computation. If we use a gcd algorithm whose execution depends only from the entries in row Ai,∗ we need no communication for function Compute-GCD. The algorithms of Blankinship [1], Bradley [3] or Majewski-Havas [12] and the Multiple-Remainder-GCD as in Wagner [14] are examples for that. The Norm-Driven-Sorting-GCD due to Havas-Majewski-Matthews [8] needs communication for computing column norms of A, and it was shown by examples in Wagner [14], that this algorithm is not useful for parallel HNF computation.
3
Reducing the Off-Diagonal Entries
After leaving the for-loop in Algorithm 1 the current matrix A is in echelon form, and the current V is a multiplier matrix. There are two important algorithms to obtain the Hermite normal form from a matrix in echelon form. There is a row based (standard), as in [11], [13] and a column based method, due to Chou-Collins [5]. Algorithm 2 is a hybrid technique which is controlled by parameter b ∈ N0 . If we choose b equal to zero the algorithm does not change the input matrices. For b = 1 this algorithm is a parallel variant of the reduction method of Chou-Collins, and if b + 1 ≥ r = rank(A) we have a parallel variant of the standard reduction algorithm. The function Monic-Multiplier takes f ∈ F[x] as input and gives 1 if f = 0 and a is the leading coefficient of f a 1 if f = 0 as output. The calls of this function are needed to guarantee point 3. in the definition of the Hermite normal form. For f, g ∈ F[x] \ {0} the expression f /g is equal to (the unique) q ∈ F[x] given by f = g · q + r with deg(r) < deg(g) where deg(0) := −∞. Algorithm 2 parts the input matrix into stripes of width b, and it reduces all off-diagonal entries in each stripe. The macro reduction order of the stripes is from right to left, and the micro reduction order reduces the entries in the stripes
Fast Parallel Hermite Normal Form Computation of Matrices
825
Parallel-ROD(A(k) , V (k) , r, b, i1 , . . . , ir ) 1 input k processor index A(k) πk -part of a distributed m × n matrix A V (k) πk -part of a distributed n × n matrix V r, b, i1 , . . . , ir non-negative integers 2 if r > 0 and b > 0 then 3 f ←r 4 repeat 5 l←f −1 6 f ← max{1, f − b} 7 for j ← f + 1 to r do 8 g ← (ij − 1) mod N 9 h ← min{j − 1, l} 10 if g = k then 11 t ← (ij − 1)/N + 1 (k) (k) (k) 12 broadcast (At,f , . . . , At,h , At,j ) to all other 13 else 14 t ← (m, k) + 1 Index of computation row (k) (k) (k) 15 receive (At,f , . . . , At,h , At,j ) from πg 16 for s ← f to h do (k) 17 q ← At,s /At(k),j (k)
(k)
(k)
(k)
(k)
(k)
A∗,s ← A∗,s − q · A∗,j
18 19 20 21
V∗,s ← V∗,s − q · V∗,j if j ≤ l + 1 then (k) (k) (k) A∗,j ← Monic-Multiplier(At,j )A∗,j
22 23 24 25 26 27 28 29 30 31
V∗,j ← Monic-Multiplier(At,j )V∗,j until f = 1 µ ← (i1 − 1) mod N if µ = πk then t ← (m, k) + 1 (k) broadcast At,1 to all other else t ← (i − 1)/N + 1 (k) receive At,1 from µ (k) (k) (k) A∗,1 ← Monic-Multiplier(At,1 )A∗,1
32 33
(k)
(k)
(k)
(k)
(k)
(k)
V∗,1 ← Monic-Multiplier(At,1 )V∗,1 return (A(k) , V (k) )
Algorithm 2. Parallel reduction of off-diagonal entries
from top to bottom. The macro reduction order is similar to the order of ChouCollins, while the micro reduction order is similar to the standard reduction order. Figure 2 shows the reduction order of the standard method (a), hybrid with b = 2 (b) and for Chou-Collins (c), where the consecutive numbers in the squares describe the order of reduction.
826
Clemens Wagner
The reduction order of Chou-Collins needs smaller degrees than the standard reduction order, since the method of Chou-Collins uses only reduced column vectors of the input matrix for reducing other columns. On the other hand the standard order needs less communication operations (broadcasts) which we will show in the next section.
* 1 2 4 7
* 3 * 5 6 * 8 9 10 * b (a)
* 4 5 7 9
* 6 * 8 1 * 10 2 3 * b
b (b)
* 7 8 9 10
* 4 * 5 2 * 6 3 1 *
b b b b (c)
Fig. 2. Reducing of diagonal entries: (a) Standard, (b) hybrid, (c) Chou-Collins
4
Analysis and Performance Examples
Lemma 1. Given a matrix A ∈ Mat_{m×n}(F[x]) with r := rank(A) and 1 ≤ b ≤ r − 1, let q := ⌊(r − 2)/b⌋. Computing the Hermite normal form with Algorithm 1 needs at most m + (q(q + 1)/2)·b + r broadcasts if the implementation of Compute-GCD needs no communication.
Proof. Algorithm 1 needs at most m broadcasts. The repeat loop in Algorithm 2 is executed q + 1 times. In the last run of this loop f is equal to 1 and it does r − 1 broadcasts. In the i-th run, for 1 ≤ i ≤ q, f is equal to r − ib and it does r − f = ib broadcasts. There is one further broadcast after the loop, and we get m + r − 1 + (Σ_{i=1}^{q} ib) + 1 = m + (q(q + 1)/2)·b + r
broadcasts.
For a m × n matrix with rank r Algorithm 1 needs m + r broadcasts if b ≥ r − 1. We could reduce the number of broadcasts for b ≥ r − 1. Algorithm 3 We modify Algorithm 1 in the following way. (k)
– Replace Line 7 by broadcast At,∗ to all other (k)
– Replace Line 10 by receive At,∗ from πg
Fast Parallel Hermite Normal Form Computation of Matrices
827
– Insert the following lines after Line 13: (k) (k) (k) A∗,r ← Monic-Multiplier(At,r )A∗,r (k)
(k)
(k)
V∗,r ← Monic-Multiplier(At,r )V∗,r for j ← 1 to r − 1 do (k) (k) q ← At,j /At,r (k)
(k)
(k)
(k)
(k)
(k)
A∗,j ← A∗,j − qA∗,r V∗,j ← V∗,j − qV∗,r – Remove Line 15.
Lemma 2. Algorithm 3 needs at most m broadcasts for computing the Hermite normal form and a transforming matrix of a m × n input matrix. It seems that Algorithm 3 is faster than Algorithm 1 with b < r−1 in practice where r is the rank of the input matrix. We give some running time examples showing that Algorithm 3 needs larger degrees than Algorithm 1 with b = 1 for the HNF computation. We have implemented the algorithms in C/C++ on the IBM SP2 at the GMD in Sankt Augustin. We have used the xlC compiler and the message passing library MPL (both IBM products). We have used the Sorting-GCD algorithm, due to Majewski-Havas [12], for the implementation of the Compute-GCD function, where we used a heap for determining the polynomial with the largest degree in a subvector. The input matrix is a random 80 × 80 matrix over F2 [x], and the degree of the entries are less than or equal to 80. The rank of this matrix is r = 80. Table 1 shows the results of our measurements. We have used 8, 16 and 24 nodes of the SP2. The rows in Table 1 have the following meaning: b: the value of the parameter b for Algorithm 1. For b = 80 we have taken Algorithm 3. max. degree: The largest degree of all polynomials which occurred during the computation. This value is independent of the number of processors. N = x: Running-time in minutes:seconds.hundredth seconds on the SP2 using x processors. We have used the high performance switch and us-protocol for the communication. t1 /tb : The fraction of the running time for the b in this column and the running time for b = 1. For b = 1 Algorithm 1 runs faster than for every other b > 1. The relative running times tb /t1 become better if the number of processors become larger (with exceptions for b ∈ {2, 4}), since the time for a broadcast depends on the number of processors.
5
Producing Small Multipliers
For every nonzero A ∈ Matm×n (F[x]) we denote the largest degree of the entries of A as A := max{deg(Ai,j ) | 1 ≤ i ≤ m ∧ 1 ≤ j ≤ n} .
828
Clemens Wagner
Table 1. Run-times for Algorithm 1 for a 80 × 80 matrix over F2 [x] depending on parameter b and the number of processors. b max. degree N =8 tb /t1 N = 16 tb /t1 N = 24 tb /t1
1 9965 2:11.77 1.00 1:36.01 1.00 1:25.75 1.00
2 12571 2:36.02 1.18 1:46.99 1.11 1:37.37 1.14
4 13129 3:19.45 1.51 2:09.78 1.35 1:57.04 1.36
8 19306 4:29.39 2.04 2:46.17 1.73 2:27.46 1.72
17 27227 6:06.97 2.78 3:40.84 2.30 3:12.58 2.25
20 29009 6:52.61 3.13 4:05.44 2.56 3:34.04 2.50
40 36802 8:47.29 4.00 5:08.17 3.21 4:27.68 3.12
60 40989 9:36.34 4.37 5:44.60 3.59 4:56.81 3.46
80 43486 11:40.47 5.32 7:04.60 4.42 5:58.80 4.18
Lemma 3. Given matrices A, H ∈ Matm×n (F[x])\ {0} where H := HNF(A). A multiplier V ∈ GLn (F[x]) satisfying AV = H is unique if and only if rank(A) = n. In this case V ≤ (n − 1)A. Proof. Let 1 ≤ i1 < i2 < . . . < in ≤ m be the row indices of the pseudo diagonal entries of A, and let A , H ∈ Matn (F[x]) be defined as A j,∗ := A ij ,∗ and H j,∗ := H ij ,∗ for 1 ≤ j ≤ n, respectively. Note that H is in Hermite normal form and is right equivalent to A , so H = HNF(A ). Then d := det(A ) = 0, and it is easy to prove that adj(A ) ≤ (n − 1)A ≤ (n − 1)A where A˜ ∈ Matn (F[x]) is the adjoint matrix of A . Let V ∈ GLn (F[x]) a transforming matrix with A V = H . Since adj(A )H dIn = A adj(A ) ⇐⇒ dIn H = A adj(A )H ⇐⇒ H = A d and H as well as adj(A ) are unique V must be unique, and thus adj(A )H ≤ (n − 1)A ≤ (n − 1)A V ≤ d AV = H because the uniqueness of H. Suppose there exists another V ∈ GLn (F[x]) then H = A V but this is a contradiction to the uniqueness of the multiplier V for A V = H . In the case rank(A) < n the largest degree V of a multiplier matrix V can be arbitrary large. We will modify Gaussian elimination for not full-rank input matrices producing small multipliers. Kaltofen, Krishnamoorthy and Saunders [10] extend the input matrix by the n × n identity matrix to obtain a full-rank m + n × n matrix. We will apply this technique to our algorithms. Algorithm 4 Input: A ∈ Matm×n (F[x]), Output: H := HNF(A) and a multiplier V ∈ GLn (F[x]).
1. Set B to the (m + n) × n matrix obtained by stacking A on top of the n × n identity matrix I_n.
2. Compute G := HNF(B) without computing a transforming matrix, using either Algorithm 1 or 3.
3. Set H to the first m rows of G and set V to the last n rows of G.

Algorithm 4 computes the HNF of a full-rank matrix even if the input matrix is not full-rank. The matrix B ∈ Mat_{(m+n)×n}(F[x]) obtained by stacking A on I_n has a unique multiplier V ∈ GL_n(F[x]) satisfying ‖V‖ ≤ (n − 1)‖B‖ = (n − 1)‖A‖ for nonzero A, and this multiplier is equal to the returned V. A more accurate analysis leads to the following lemma.

Lemma 4. Algorithm 4 computes the Hermite normal form of a matrix A ∈ Mat_{m×n}(F[x]) with rank r and a multiplier V ∈ GL_n(F[x]) satisfying ‖V‖ ≤ min{r, n − 1}‖A‖.

Proof. If rank(A) = n we can apply Lemma 3. Otherwise, without loss of generality let A ≠ 0, set B ∈ Mat_{(m+n)×n}(F[x]) to the matrix obtained by stacking A on I_n, and proceed with B as with A in the proof of Lemma 3. The full-rank submatrix B' ∈ Mat_n(F[x]) of B must contain at least n − r identity vectors, and thus we get ‖adj(B')‖ ≤ r‖B'‖ ≤ r‖A‖, which completes the proof.

In Table 2 we give some running time examples for Algorithms 1 and 4.
Table 2. Running times for Algorithms 1 and 4 for b = 1.

Algorithm   m×n         ‖A‖   rank   N    Time       max. degree   ‖V‖
1           500×500      10    500   16   29:35.26    7763          4987
1           500×500      10    500   32   19:00.96    7763          4987
1           512×512       8    512   16   11:47.61     161            55
1          1000×1000      2   1000   32   34:50.45    3117          1997
1           128×256      32    128    8    4:34.64    9200          9200
4           128×256      32    128    8    8:39.56    6395          4092
1           250×260      32    250   16   31:39.07   49400         49400
4           250×260      32    250   16   15:32.74   12457          8000
6 Conclusions and Acknowledgements
We have described parallel implementations for Hermite normal form computations and we have studied a technique for computing small multiplier matrices when the input matrix is not full-rank. The running time examples in Table 1 demonstrate that we can obtain better performance if we allow higher communication costs. We have observed this effect for other examples, too. Algorithm 4 has a good upper bound for the degrees
of the computed transforming matrix, and we have seen that this algorithm provides very good transforming matrices in practice. We have used previously implemented sequential gcd algorithms for our parallel implementations. This reuse of sequential code was really well supported by the usage of computation rows. The author is very grateful to the GMD, Sankt Augustin, Germany, for offering CPU time on their IBM SP2 installation. Special thanks go to George Havas for his corrections and suggestions.
References

1. W. A. Blankinship. A new version of the Euclidean algorithm. Amer. Math. Monthly, 70(9):742–745, 1963.
2. W. A. Blankinship. Matrix triangulation with integer arithmetic. Comm. ACM, 9(7):513, 1966.
3. G. H. Bradley. Algorithms and bound for the greatest common divisor of n integers and multipliers. Comm. ACM, 13:447, 1970.
4. G. H. Bradley. Algorithms for Hermite and Smith normal matrices and linear Diophantine equations. Math. Comp., 25(116):897–907, 1971.
5. T. W. J. Chou and G. E. Collins. Algorithms for the solution of systems of linear Diophantine equations. SIAM J. Comput., 11(4):687–708, 1982.
6. G. D. Forney. Convolutional codes I: Algebraic structure. IEEE Trans. Inform. Theory, IT-16(6):720–738, 1970.
7. G. Havas and B. S. Majewski. Extended gcd calculation. Congr. Numer., 111:104–114, 1995.
8. G. Havas, B. S. Majewski, and K. R. Matthews. Extended gcd algorithms. Technical Report TR0302, The University of Queensland, Brisbane, 1995.
9. G. Havas and C. Wagner. Matrix reduction algorithms for Euclidean rings. In Proc. of the 3rd Asian Symposium on Computer Mathematics, to appear.
10. E. Kaltofen, M. S. Krishnamoorthy, and B. D. Saunders. Parallel algorithms for matrix normal forms. Linear Algebra Appl., 136:189–208, 1990.
11. R. Kannan and A. Bachem. Polynomial algorithms for computing the Smith and Hermite normal forms of an integer matrix. SIAM J. Comput., 8(4):499–507, 1979.
12. B. S. Majewski and G. Havas. A solution to the extended gcd problem. In ISSAC'95 (Proc. 1995 Internat. Sympos. Symbolic Algebraic Comput.), pages 248–253. ACM Press, 1995.
13. C. C. Sims. Computation with Finitely Presented Groups. Cambridge University Press, 1994.
14. C. Wagner. Normalformberechnung von Matrizen über euklidischen Ringen. PhD thesis, Institut für Experimentelle Mathematik, Universität/GH Essen, 1997. Published by Shaker-Verlag, 52013 Aachen/Germany, 1998.
15. C. Wagner. Hermite normal form computation over Euclidean rings. Interner Bericht 71, Rechenzentrum der Universität Karlsruhe, 1998. Submitted for publication.
Optimising Parallel Logic Programming Systems for Scalable Machines

Vítor Santos Costa1 and Ricardo Bianchini2

1 LIACC and DCC-FCUP, 4150 Porto, Portugal
2 COPPE Systems Engineering, Federal University of Rio de Janeiro, Brazil
Abstract. Parallel logic programming (PLP) systems have obtained good performance on traditional bus-based shared-memory architectures. However, the scalable multiprocessors being developed today pose new challenges. Our experience with a sophisticated PLP system, Andorra-I, demonstrates that indeed performance suffers greatly on modern architectures. In order to improve performance, we perform a detailed analysis of the cache behaviour of all Andorra-I data structures via execution-driven simulation of a DASH-like multiprocessor. Based on this analysis we optimise the Andorra-I code using 5 different techniques. Our results show that the techniques provide significant performance improvements, leading to the conclusion that PLP systems can and should perform well on modern scalable multiprocessors.
1 Introduction
Parallel computers can improve performance of both numerical and symbolic applications. Logic programs are good examples of symbolic applications that often exhibit large amounts of implicit parallelism and that can greatly benefit from parallel computers. Several PLP systems have been developed so far and have obtained good performance for traditional bus-based shared-memory architectures. However, the scalable multiprocessors being developed today pose new challenges, such as the high latency of memory accesses and the demand for scalability. The complexity of PLP systems and the large amount of data they process raise the issue of whether PLP systems can obtain good performance on these new parallel architectures. In order to address this issue, we experiment with a sophisticated PLP system, Andorra-I [8], that exploits both dependent and-parallelism and or-parallelism. Andorra-I is a particularly interesting example of PLP system, since most of its data structures are similar to the ones of several other PLP systems. Andorra-I has obtained good performance on the Sequent Symmetry, but our experience with it running on modern multiprocessors demonstrates indeed that scalability suffers greatly on these architectures [7]. This paper addresses the question of whether the poor scalability of Andorra-I is inherent to the complex structure of PLP systems or can be improved through careful analysis and tuning. In order to answer this question, we analyse the cache behaviour of all Andorra-I data areas when applied to several different
logic programs. The analysis pinpoints the areas that are responsible for most misses and the main sources of the misses. Based on this analysis we remove the main performance limiting factors in Andorra-I through a small set of optimisations that did not require a redesign of the system. More specifically, we optimise Andorra-I using 5 different techniques: trimming of shared variables, data layout modification, privatisation of shared data structures, lock distribution, and elimination of locking in scheduling. We present the isolated and combined performance improvements provided by the optimisations on a simulated DASH-like multiprocessor with up to 24 processors. In isolation, shared variable trimming and the modification of the data layout produced the greatest improvements. The improvements achieved when all optimisation techniques are combined are substantial. A few of our programs approach linear speedups as a consequence of our modifications. In fact, for one program the speedup of the modified Andorra-I is a factor of 3 higher than that of the original version of the system on 24 processors. Our main conclusion is then that, even though PLP systems are indeed complex and sometimes irregular, these systems can and should scale well on modern scalable multiprocessors.
2 Methodology
In this section we detail the methodology used in our experiments. The experiments consisted of the simulation of the parallel execution of Andorra-I [8,10]. The Andorra-I parallel logic programming system employs a very interesting method for exploiting and-parallelism, namely to execute determinate goals first and concurrently, where determinate goals are goals that match at most one clause in a program. The Andorra-I system also exploits or-parallelism that arises from the non-determinate goals. Andorra-I requires access to shared memory for both execution and scheduling of work. In order to study the Andorra-I execution more fully, we divided its shared memory into ten different areas. Andorra-I implements the standard Prolog work areas. The Code Space includes the compiled code for every procedure and is read-only during execution of our benchmarks. The Heap Space stores structured terms and variables. The Goal Frame Space keeps the goals yet to be executed. The Choicepoint Stack maintains alternatives for open goals. The Trail Stack records any conditional bindings of variables. Andorra-I also requires several new areas for and/or parallelism. The Or-Scheduler Data Structures are used to manage or-parallelism. The data structures for and-parallelism are in the Worker area. The Binding Arrays are used to implement the SRI model [5] for or-parallelism, by storing conditional bindings. A Lock Array was needed in our port to establish a mapping between a shared memory position (such as a variable in the heap) and a lock. Finally, the Miscellaneous Shared Variables include the remaining data structures. To simulate Andorra-I we ported the system to a detailed on-line, execution-driven simulator. The simulator uses the MINT front-end [9], and a back-end
Fig. 1. Speedups for bt-cluster
Fig. 2. Speedups for tsp
that simulates the memory and interconnection systems. We simulate a 24-node, DASH-like [4], directly-connected multiprocessor. Each node of the simulated machine contains a single scalar processor, a write buffer, a 128-KB directmapped data cache with 64-byte cache blocks, local memory, a full-map directory, and a network interface. We use the DASH write-invalidate protocol with release consistency [3] in order to keep caches coherent. We classify cache misses under this protocol using the algorithm presented in [1].
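For concreteness, the small helper below (ours, not part of the MINT simulator) shows how the simulated cache geometry maps addresses: with a 128-KB direct-mapped cache and 64-byte blocks, any two data items whose addresses fall in the same 64-byte run share a block, which is exactly the situation that gives rise to the false-sharing misses discussed in the following sections.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE   64u                    /* bytes per cache block      */
#define CACHE_SIZE   (128u * 1024u)         /* 128-KB direct-mapped cache */
#define NUM_SETS     (CACHE_SIZE / BLOCK_SIZE)

/* Index of the cache set an address maps to (direct-mapped cache). */
static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / BLOCK_SIZE) % NUM_SETS);
}

/* Two addresses share a cache block iff they lie in the same 64-byte run;
 * writes by different processors to such addresses cause false sharing. */
static int same_block(uintptr_t a, uintptr_t b)
{
    return (a / BLOCK_SIZE) == (b / BLOCK_SIZE);
}

int main(void)
{
    uintptr_t x = 0x10040, y = 0x10078;     /* 56 bytes apart             */
    printf("set %u, same block: %d\n", set_index(x), same_block(x, y));
    return 0;
}
```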
3 Workload and Original Performance
The benchmarks we used in this work are applications representing predominantly and-parallelism, predominantly or-parallelism, and both and- and orparallelism. We next discuss application performance for the original Andorra-I (more detailed information on the benchmarks can be found in the extended version of this paper [6] and in Dutra’s thesis [2]). Note that our results correspond to the first run of an application; results would be somewhat better for other runs. We use two example And-parallel applications, the clustering algorithm for network management from British Telecom, bt-cluster, and a program to calculate approximate solutions to the traveling salesman problem, tsp. To obtain best performance, we rewrote the original applications to make them determinate-only computations. Figure 1 shows the bt-cluster speedups for the simulated architecture as compared to an idealised shared-memory machine, where data items can always be found in cache. The idealised curve shows that the application has excellent and-parallelism and can achieve almost linear speedups up to twenty four processors. Unfortunately, performance for the DASH-like machine is barely acceptable. Figure 2 shows that the tsp application achieves worse speedups than bt-cluster on a modern multiprocessor. The maximum speedup actually decreases for 24 processors, whereas the ideal machine would achieve a speedup of 20 for 24 processors. Figure 3 illustrates the number and sources of cache misses
Fig. 3. Misses by data area for bt-cluster (cold, eviction, true-sharing and false-sharing misses per data area)

Fig. 4. Misses by data area for chat80 (cold, eviction, true-sharing and false-sharing misses per data area)
per data area in the bt cluster application running on 16 processors as a representative example. The Figure shows that the overall miss rate of bt-cluster is dominated by true and false sharing misses from the Worker and Misc areas. This suggests that the system could be much improved by reducing false sharing and studying activity in the Worker and Misc areas. We use two Or-parallel applications. Our first application, chat80, is an example from the well-known natural language question-answering system chat80, written at the University of Edinburgh by Pereira and Warren. The second application, fp, is an example query for a knowledge-based system for the automatic generation of floor plans. This application should at least in theory have significant or-parallelism. Figure 5 shows the speedups for the chat80 application from 1 to 24 processors. These speedups are very similar to those obtained by Andorra-I on the Sequent Symmetry architecture. In contrast, the DASH curve reaches a maximum speedup of 4.2 for 16 processors. Figure 6 shows the speedups for the fp application. The theoretical speedup is very good, in fact quite close to linear, in sharp contrast to the actual speedup for the DASH-like machine.
Fig. 5. Speedups for chat80 (idealised vs. DASH, 1–24 processors)

Fig. 6. Speedups for fp (idealised vs. DASH, 1–24 processors)
Fig. 7. Speedups for pan2 (idealised vs. DASH, 1–24 processors)

Fig. 8. Misses by data area for pan2 (cold, eviction, true-sharing and false-sharing misses per data area)
Figure 4 shows the number and source of misses for chat80 running on 16 processors, again as an example of this type of application. Note that chat80 does not have enough parallelism to feed 16 processors, suggesting that most sharing misses should result from or-parallel scheduling areas, OrSch and ChoiceP. Indeed, the figure shows that these areas are responsible for a large number of sharing misses, but the areas causing the most misses are Worker and Misc as in the and-parallel applications, indicating again that these two areas should be optimised.

As an example of And/Or application we used a program to generate naval flight allocations, based on a system developed by Software Sciences and the University of Leeds for the Royal Navy. Figure 7 shows the speedups for pan2. The idealised curve shows that the application has less parallelism than all other applications; the ideal speedup does not even reach 12 on 24 processors. When run on the DASH simulator, pan2 exhibits unacceptable speedups for all numbers of processors; speedup starts out at about 1.8 for 2 processors and slowly improves to a maximum of only 4.8 for 24 processors. Figure 8 shows the distribution of cache misses by the different Andorra-I data areas for 16 processors. In this case, the Worker area clearly dominates, since the contribution from the Misc area is not as significant as in the and-parallel benchmarks. Note that there is more true than false sharing activity in Worker. The true sharing probably results from idle processors looking for work.
4 Optimisation Techniques and Performance
The previous analysis suggests that relatively high miss rates may be causing the poor scalability of Andorra-I. It is interesting to note that most misses come from fixed layout areas, such as Worker and Misc, and not from the execution stacks, as one would assume. We next discuss how several optimisations can be applied to the system, particularly in order to improve the utilisation of the Worker and Misc areas.
The first two optimisations were prompted by our simulation-based analysis of caching behaviour, and they are the ones that give the best improvement. The other three were based on our original intuitions regarding the system, and were the ones we would have performed first without simulation data. We studied performance for three applications, bt-cluster, chat80 and pan2. A detailed discussion of our experiments and results can be found in [6]. In the remainder of this section, we simply summarise the impact of each of the techniques studied when applied in isolation. Variable Trimming. In this technique we investigate the areas that have unexepected sharing, and try to eliminate this sharing if possible. For two of our applications, the Misc area gave a surprisingly significant contribution to the number of misses. The area is mostly written at the beginning of the execution to set up execution parameters. During execution it is used by the reconfigurer and to keep reduction and failure counters. By investigating each component in the area, we detected that the counters were the major source of misses. As they are only used for research and debugging purposes, we were able to eliminate them from the Andorra-I code. The results in [6] show that chat80 benefits the most from this optimisation. This is because the failure counter is never updated by and-parallel applications, but often updated by this or-parallel application. The optimisation does not impact the and-parallel benchmarks as much, leading to less than 10% speedup improvements. Data Layout Modification. All benchmarks but pan2 exhibit a high false sharing rate, showing a need for this technique. On 16 processors, 15% of the misses in pan2 are false sharing misses, whereas in the other applications false sharing causes between 40% (chat80) and 51% (bt-cluster) of all misses. These results suggest that improving false sharing is of paramount importance. According to our detailed analysis of caching behaviour, false sharing misses are concentrated in the Worker, OrSch, ChoiceP and BA areas, besides the Misc area optimised by the previous technique. The Worker and OrSch data areas are allocated statically. This indicates that we can effectively reduce false sharing. We applied two common techniques to tackle false sharing, padding between fields that belonged to different workers or that were logically independent, and field reordering to separate fields that were physically close but logically distinct. Although these are well-known techniques, padding required careful analysis, as it increases eviction misses significantly. The field reordering technique was not easily applied either, as the relationships between fields are quite complex. Padding may lead to serious performance degradation for the dynamic data areas, such as ChoiceP and BA. This restricted our options for layout modification to just field reordering for these areas. The BA area was the target of one final data layout modification, since the analysis of cache behaviour surprised us with a high number of false sharing misses in this area for chat80. Further investigation showed that this was a memory allocation problem. The engines’
top of stacks were being shmalloc’ed separately and appeared as part of the BA area in the analysis. This increased sharing misses in the area and was especially bad for the or-parallel applications, as different processor’s top of stacks would end up in the same cache line. We addressed the problem by moving these pointers into the Worker area, where they logically belong. Our results show that the bt-cluster and chat80 applications benefit the most from this optimisation; speedup improvements can be as significant as 60%. In contrast, the pan2 application achieves improvements of less than 10% from this optimisation. Privatisation of Shared Variables. This technique reduces the number of shared memory accesses by making local copies in each node of the machine. In the best case, the shared variables are read-only and hence local copies can actually be allocated in private memory. The high number of references to Worker suggested that privatisation could be applied there. In fact, Andorra-I did already use private copies of the variables in Worker and there was little room for improvement. The Locks and Code data areas are the major candidates to privatisation in Andorra-I. The Locks area only includes pointers to the actual locks, is thus read-only during execution, and can be easily privatised. Another area that is also read-only during parallel execution of our benchmarks is Code. Unfortunately, logic programs in general can change the database and, therefore, update Code, making privatisation complex. Our results show that privatisation improves speedups by up to 10% at most and that the impact of this optimisation decreases as the number of processors increases. Lock Distribution. This technique was considered to reduce contention on accesses to logical variables, and-scheduling, or-scheduling, and stack management. The original implementation used a single array of locks to implement these operations. In the worst case, several workers would contend for the same lock causing contention. To improve scalability, we implemented different lock data structures for different purposes. We expected best results for or-parallel applications, as the optimisation prevents different teams from contending on accesses to logical variables. The cost of this optimisation is that, if the arrays of locks are shared, there will be more expensive remote cache misses. Our results show that the bt-cluster and chat80 applications benefit somewhat from this optimisation, but that the pan2 application already exhibited a significant number of misses in the Locks area and suffers a slowdown. Elimination of Locking in Scheduling. This technique improves performance in benchmarks with significant and-parallelism by testing whether there is available work, before actually locking the work queue. This modification is equivalent to replacing a test and set lock with a test and test and set lock. This optimisation provides a small speedup improvement for pan2, as it avoids locking when there is no and-work. For bt-cluster the technique does not improve speedups as this application exhibits enough and-work to keep processors busy.
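To make the padding and field-reordering idea concrete, the sketch below shows the general shape of such a layout change in C. The structure and field names are invented for illustration only (they are not Andorra-I's actual declarations); the point is simply that logically independent per-worker fields are grouped together and each worker's record is padded and aligned to the 64-byte cache-block size used in our simulations, so that a write by one worker never invalidates a block holding another worker's fields.

```c
#include <stddef.h>

#define CACHE_BLOCK 64          /* block size of the simulated caches      */

/* Hypothetical per-worker state: fields written by the owning worker are
 * kept together, and the record is padded to a full cache block so that
 * two different workers never share a block (no false sharing). */
struct worker_state {
    long  run_queue_head;       /* invented field names, illustration only */
    long  status;
    void *top_of_stack;
    char  pad[CACHE_BLOCK - (2 * sizeof(long) + sizeof(void *))];
};

_Static_assert(sizeof(struct worker_state) % CACHE_BLOCK == 0,
               "each worker record must fill whole cache blocks");

/* One record per worker, aligned so each record starts on its own block
 * (aligned attribute is a GCC/Clang extension). */
static struct worker_state workers[24]
    __attribute__((aligned(CACHE_BLOCK)));
```

Field reordering follows the same principle within a single record: fields written by the same worker, or during the same phase of execution, are placed next to each other so they share blocks, while fields written by different workers are pushed apart.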
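The last technique, checking for available work before locking the queue, is, as noted above, the classic move from a test-and-set to a test-and-test-and-set lock. A generic C11 sketch of the two acquire loops (again illustrative, not Andorra-I's actual lock code) is:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

/* Plain test-and-set: every attempt is an atomic read-modify-write, which
 * keeps bouncing the cache block between contending processors. */
void lock_tas(spinlock_t *l)
{
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        ;  /* spin */
}

/* Test-and-test-and-set: spin on a cheap load first, and only attempt the
 * atomic exchange once the lock appears to be free. */
void lock_ttas(spinlock_t *l)
{
    for (;;) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;  /* local spinning in the cache, no write traffic */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

void unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

With test-and-test-and-set, idle workers spin on a plain load that hits in their own cache and only issue the invalidating atomic exchange when the lock — or, in Andorra-I's case, the work queue — actually appears to have something for them.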
Fig. 9. Speedups for bt-cluster (idealised, optimised DASH and initial DASH)

Fig. 10. Speedups for tsp (idealised, optimised DASH and initial DASH)

Fig. 11. Misses by data area for bt-cluster (optimised system)

Fig. 12. Misses by data area for chat80 (optimised system)
5 Combined Performance of Optimisation Techniques
We next discuss the overall system performance with all optimisations combined. We compare speedups against the idealised and original results. The idealised speedups were recalculated for the new version of Andorra-I, but, as it is shown in the figures, the optimisations did not have any significant impact for the idealised machine. Figures 9 and 10 show the speedups for the two and-parallel applications running on top of the modified Andorra-I system. The maximum speedup for bt-cluster jumped from 12 to 20, whereas the maximum speedup for tsp jumped from 6.3 to 19. This indicates that the realistic machine is now able to exploit the available parallelism more fully. The explanation for the better speedups is a significant decrease in miss rates. For bt-cluster, the new version of Andorra-I exhibits a miss rate of only 0.6% for 16 processors, versus the 1.6% of the previous version. In the case of tsp, the optimisations decreased the miss rate from 3% to 1.2% again on 16 processors. Figure 11 shows the number and source of misses for bt-cluster on 16 processors. Note that the figure keeps the same Y-axis as in Figure 3 to simplify
comparisons against the cache behaviour of the original version of Andorra-I. The figure shows that the number of misses in the Worker area was reduced by a factor of 4, while the number of misses in the Misc area was reduced by an order of magnitude. The figure also shows that there is still significant true sharing in Worker, but false sharing is much less significant. The number of misses from Misc is now almost irrelevant. The or-parallel benchmarks also show remarkable improvements due to the combination of the optimisation techniques we applied. Figure 13 shows the speedups for chat80 and Figure 14 shows the speedups for fp. The maximum speedup for chat80 almost doubles from one version of the system to the other. Note that speedups for the optimised system still flatten out on 16 processors, but at a much better efficiency. The other benchmark, fp, displays our most impressive result. The speedup for 24 processors jumps from 6.2 with the original Andorra-I system to 20 when all our optimisations are applied. This result represents more than a three-fold improvement. Figure 12 shows the distribution of misses for chat80 with 16 processors. The figure demonstrates that the number of misses in the Worker and Misc areas was reduced by an order of magnitude. The large number of eviction and cold start misses in the Code area remains however. Sharing misses are now concentrated in the OrSch and ChoiceP areas, as they should. Figure 15 shows the speedups of the new version of Andorra-I for the pan2 benchmark. In this case, the improvement resulting from our optimisations was quite small. Figure 16 shows the cache miss distribution for the optimised Andorra-I. The main source of misses was true sharing originating in the Worker region. A more detailed analysis proved that these misses originate from lack of work. Workers are searching each other’s queues and generating misses. We are investigating more sophisticated scheduling strategies to address this problem.
Fig. 13. Speedups for chat80 (idealised, optimised DASH and initial DASH)

Fig. 14. Speedups for fp (idealised, optimised DASH and initial DASH)

Fig. 15. Speedups for pan2 (idealised, optimised DASH and initial DASH)

Fig. 16. Misses by data area for pan2 (optimised system)
6 Conclusions and Future Work
Andorra-I is an example of an and/or-parallel system originally designed for traditional bus-based shared-memory architectures. We have demonstrated that the system can also achieve good performance on scalable shared-memory systems. The key to these results was the extensive data available from detailed simulations of Andorra-I. This information showed that there was no need to restructure the system or its schedulers. Instead, performance could be dramatically improved by focusing on accesses to shared data. We believe there is potential for improving the performance of PLP systems even further. To prove so will require more radical changes to data structures within Andorra-I itself, as the system was simply not designed for such large numbers of processors. Last, but not least, we are interested in studying performance of other parallel logic programming systems, such as the systems that exploit independent and-parallelism.
Acknowledgements

The authors would like to thank Leonidas Kontothanassis and Jack Veenstra for their help with the simulation infrastructure, and Rong Yang, Tony Beaumont, and D. H. D. Warren for their work on Andorra-I. This paper results from collaborative work with Inês Dutra, and has benefited from Márcio da Silva's studies. Vítor Santos Costa would like to thank the Praxis PROLOPPE and FCT MELODIA projects for their support. Ricardo Bianchini would like to thank the support of the Brazilian CNPq.
References

1. R. Bianchini and L. I. Kontothanassis. Algorithms for categorizing multiprocessor communication under invalidate and update-based coherence protocols. In Proceedings of the 28th Annual Simulation Symposium, April 1995.
2. Inês Dutra. Distributing And- and Or-Work in the Andorra-I Parallel Logic Programming System. PhD thesis, University of Bristol, Department of Computer Science, February 1995.
3. D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. Proceedings of the 17th International Symposium on Computer Architecture, pages 148–159, May 1990.
4. D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH prototype: Logic overhead and performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41–61, Jan 1993.
5. Ewing Lusk, David H. D. Warren, Seif Haridi, et al. The Aurora Or-parallel Prolog System. New Generation Computing, 7(2,3):243–271, 1990.
6. V. Santos Costa and R. Bianchini. Optimising Parallel Logic Programming Systems for Scalable Machines. Technical Report DCC-97-7, DCC - FC & LIACC, UP, October 1997.
7. V. Santos Costa, R. Bianchini, and I. C. Dutra. Evaluating the impact of coherence protocols on parallel logic programming systems. In Proceedings of the 5th EUROMICRO Workshop on Parallel and Distributed Processing, pages 376–381, 1997.
8. V. Santos Costa, D. H. D. Warren, and R. Yang. Andorra-I: A Parallel Prolog System that Transparently Exploits both And- and Or-Parallelism. In Third ACM SIGPLAN PPOPP, pages 83–93. ACM Press, April 1991.
9. J. E. Veenstra and R. J. Fowler. Mint: A front end for efficient simulation of shared-memory multiprocessors. In Proceedings of MASCOTS '94, 1994.
10. Rong Yang, Tony Beaumont, Inês Dutra, Vítor Santos Costa, and David H. D. Warren. Performance of the Compiler-Based Andorra-I System. In Proceedings of the Tenth International Conference on Logic Programming, pages 150–166. MIT Press, June 1993.
Experiments with Binding Schemes in LOGFLOW

Zsolt Németh and Péter Kacsuk

MTA Computer and Automation Research Institute
H-1518 Budapest, P.O. Box 63, Hungary
{zsnemeth, kacsuk}@sztaki.hu
http://www.lpds.sztaki.hu
Abstract. The handling of variables is a crucial issue in designing a parallel Prolog system. In the framework of the LOGFLOW project some experiments were made with binding methods. The work is an analysis of possible schemes, a modification of an existing one and a plan for future implementations of LOGFLOW on new architectures. The conditions, principles and results are presented in this paper.
1 Introduction
Logic programming languages offer a high degree of inherent parallelism [4]. The central question in implementing an OR-parallel Prolog system is how the conditional bindings can be handled consistently and effectively. There are a number of techniques for treating conditional bindings. They can be categorized as stack sharing [1][3][5][6], stack copying [11] and recomputation methods. All these techniques have a common feature: references (either physical or logical) are allowed to point from an environment to any other environments. However, in a distributed memory system these environments may reside on different processing elements, and dereferencing a variable through several processing elements is a tremendously expensive operation. LOGFLOW is a distributed-memory implementation of Prolog [8]. Its abstract execution model is the logicflow model based on dataflow principles. Prolog programs are transformed into the so-called Dataflow Search Graph (DSG). Nodes in this graph represent Prolog procedures. The execution is governed by token streams. Tokens represent a given state of the machine where the computation is continued. LOGFLOW is based entirely on the closed binding environment scheme. Thus, tokens holding an environment are allowed to be distributed over the processor space freely, since there are no external references from the token. The implementation shows some of the drawbacks of the closed binding scheme. There is a long line of research on combining the benefits of different binding schemes (e.g. [10][12][14]). The overheads of the closed binding scheme in LOGFLOW were analysed and a hybrid binding method was elaborated where the idea basically relies on the Quasi Closed Environment (QCE [12]) but in fact, the new scheme is an amalgamation of many more binding techniques.
2 The Closed Binding Environment
A closed environment in Conery's definition is a set of frames E such that no pointers or links originating from E dereference to slots in frames that are not members of E [2]. There are two special procedures to maintain the closed state of an environment frame: the (2-stage) unification and the back-unification [7]. The environment closing is a costly operation both in time and in memory consumption. The most important contributors to this overhead are the following. Compound terms must be scanned each time a unification or back-unification takes place. Scanning a compound term means checking whether each argument of the term is a reference to a variable or not and performing the necessary actions. Since compound terms (structures and lists) are the natural data structures in logic programming languages, they are used extensively. Structures must not be shared by different environments because their arguments may be changed by binding variables or by modifying a reference item during the environment closing. If a token is duplicated, all the structures in its environment must be copied for each new environment. This results in enormous memory consumption and slows down the computation.
3 The Proposed Solutions
The previously listed overheads may occur because of the very special handling of addresses in the closed binding environment scheme. The proposed solutions try to combine the local addressing scheme of the closed environment with the global addressing scheme of non-closed environments. In [10] molecules are introduced to avoid copying when all the arguments of a structure become ground. In [14] the so-called tagged variable scheme is used, which is a variation of hash tables for dataflow machines. In [12] variables are divided into two groups according to whether they can occur in a structure or not. Variables occurring in a structure are stored as global ones, whereas the others are still stored according to the local addressing scheme. This distinction does not affect the location of the variables but the way they can be accessed. In this way, the most expensive part of the closing procedure, the structure scanning and copying, can be postponed until the task really migrates between processors.
4 Combining the Closed and Non-Closed Binding Schemes in LOGFLOW
The new hybrid binding scheme alloys many ideas from different binding methods. The basis is still a variation of Conery's closed environment. By default, variables are handled in the usual way according to the closed binding environment principles. Variables are made global, and thus handled differently, when they occur in the arguments of a structure. The fundamental elements of the new binding scheme are the following: 1. The special storage of global variables. Variable access and the duplication of environments should be done efficiently. Any well-known global method is a good candidate (e.g. binding arrays, directory trees, etc.); finally
a hashing method was chosen. It is a hash technique on variables rather than on environments, unlike Borgwardt's hash windows [1]. There are comparisons of the two hashing methods in [5]. 2. The way the global variables are detected, created and named according to a global naming scheme. The naming scheme is very similar to that of the binding array technique [3]. 3. Binding the global variables to a value. Basic considerations were: how to represent unbound variables, and where to put newly bound ones. 4. Handling different OR-branches (the question of conditional global variables). As a result of the decisions made at 3, the duplication of environments at OR-nodes is easy and efficient - there is no need to take extra care of unbound variables. 5. The new binding scheme operates within a single PE. Some special procedures must be carried out when the computation migrates between PEs. The problem shows many common features with the time stamping method. None of the above mentioned binding techniques can solve all the problems arising in LOGFLOW, but their combination could be a promising candidate. Details of the design and implementation principles can be found in [13].
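As a purely illustrative sketch — the data layout and names below are invented for this description and are not LOGFLOW's actual code (see [13] for the real design) — hashing on variables rather than on environments can be pictured as a per-worker table mapping a globally named variable to its binding, where an absent entry means the variable is still unbound:

```c
#include <stdlib.h>

typedef struct term Term;               /* opaque heap term                 */

typedef struct binding {
    unsigned long   var;                /* global variable name             */
    Term           *value;              /* term the variable is bound to    */
    struct binding *next;               /* chaining on hash collisions      */
} Binding;

#define TABLE_SIZE 1024
static Binding *table[TABLE_SIZE];      /* one such table per worker (PE)   */

/* Look up a global variable; NULL means the variable is still unbound. */
Term *lookup(unsigned long var)
{
    for (Binding *b = table[var % TABLE_SIZE]; b != NULL; b = b->next)
        if (b->var == var)
            return b->value;
    return NULL;
}

/* Record a (conditional) binding for a global variable. */
void bind(unsigned long var, Term *value)
{
    Binding *b = malloc(sizeof *b);
    if (b == NULL)
        abort();                        /* error handling kept minimal      */
    b->var   = var;
    b->value = value;
    b->next  = table[var % TABLE_SIZE];
    table[var % TABLE_SIZE] = b;
}
```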
5 Evaluation of the Hybrid Binding Method
The hybrid binding scheme was implemented and tested on programs of different types and sizes. Test results can be summarized as follows:

– the cost of the closing procedure within a single processing element (intra-processor overhead) was significantly reduced;
– the cost of task migration (inter-processor overhead) became slightly higher due to the procedures involved.

Tests showed some further properties of the binding scheme, e.g.:

– the bigger the structures are, the more time can be saved by the new binding method;
– when other kinds of optimization are present (e.g. a ground term table) the new scheme can add a little to the performance.
6 Conclusions and Further Work
The goal of the revision of the binding method was to improve the performance of a distributed Prolog system. While efforts were made to achieve this goal, the test results pointed out an interesting conclusion which can set the direction of future research work. The test results suggested that this kind of hybrid binding scheme is an excellent candidate for a binding environment on multithreaded architectures like Datarol-II and KUMP/D, where the intra-processor overhead associated with closing procedures is reduced by the binding method, whereas the inter-processor copying overhead is eliminated by the facilities of the architecture. This idea is presented in [9].
There are ongoing research projects on implementing LOGFLOW on a kind of hybrid von Neumann-dataflow multithreaded architecture. It has been proven that the runtime model of the Datarol-II family and that of LOGFLOW are very similar. Furthermore, the thread scheduling policies and granularities of the two models are quite close to each other. The efficient implementation of LOGFLOW on Datarol-II can be greatly supported by a binding method like the one presented in this paper.
Acknowledgments

The work reported in this paper was partially supported by the National Research Grant 'Massively Parallel Implementation of Prolog on Multithreaded Architectures, Particularly on Datarol-II', registered under No. T-022106.
References

1. P. Borgwardt: Parallel Prolog Using Stack Segments on Shared-Memory Multiprocessors. Proceedings of the 1984 Symposium on Logic Programming.
2. J. S. Conery: Binding Environments for Parallel Logic Programs in Non-Shared Memory Multiprocessors. Proceedings of the 1987 Symp. on Logic Programming, 1987.
3. M. Carlsson: Design and Implementation of an OR-Parallel Prolog Engine. SICS Dissertation Series 02.
4. J. Chassin de Kergommeaux, P. Codognet: Parallel Logic Programming Systems. ACM Computing Surveys, Vol. 26, No. 3, Sept 1994.
5. A. Ciepielewski, B. Hausman: Performance Evaluation of a Storage Model for OR-Parallel Execution of Logic Programs. SICS 86003 Research Report.
6. B. Hausman, A. Ciepielewski, S. Haridi: OR-Parallel Prolog Made Efficient on Shared Memory Multiprocessors. SICS R87006 Research Report.
7. P. Kacsuk: Distributed Data Driven Prolog Abstract Machine. In: P. Kacsuk, M. J. Wise: Implementations of Distributed Prolog. Wiley, 1992.
8. P. Kacsuk, Zs. Németh and Zs. Puskás: Tools for Mapping, Load Balancing and Monitoring in the LOGFLOW Parallel Prolog Project. Parallel Computing Journal, Elsevier, Vol. 22, No. 13, Feb. 1997, pp. 1853-1882.
9. P. Kacsuk, M. Amamiya: A Multithreaded Implementation Concept of Prolog for Datarol-II. Proceedings of ISHPC97.
10. L. V. Kalé, B. Ramkumar: The Reduce-Or Process Model for Parallel Logic Programming on Non-Shared Memory Machines. In: P. Kacsuk, M. J. Wise: Implementations of Distributed Prolog. Wiley, 1992.
11. R. Karlsson: A High Performance OR-Parallel Prolog System. SICS Dissertation Series 07, March 1992.
12. H. Kim, J-L. Gaudiot: A Binding Environment for Processing Logic Programs on Large-Scale Parallel Architectures. Technical Report, University of Southern California, 1994.
13. Zs. Németh, P. Kacsuk: Analysis and Improvement of the Variable Binding Scheme in LOGFLOW. Workshop on Parallelism and Implementation Technologies for (Constraint) Logic Programming Languages, Port Jefferson, 1997.
14. A. V. S. Sastry, L. M. Patnaik: OR-Parallel Evaluation of Logic Programs on a Multi-Ring Dataflow Machine. New Generation Computing, 10 (1991), pp. 23-53.
Experimental Implementation of Parallel TRAM on Massively Parallel Computer

Kazuhiro Ogata, Hiromichi Hirata, Shigenori Ioroi, and Kokichi Futatsugi

JAIST, Japan
{ogata,h-hirata,ioroi,kokichi}@jaist.ac.jp
Abstract. MPTRAM is an abstract machine for parallel rewriting that is intended to be reasonably implemented on massively parallel computers. MPTRAM has been implemented on Cray T3E carrying 128 PEs with MPI. It rewrites terms efficiently in parallel (about 65 times faster than TRAM) if there is little message traffic. But we have to reduce the overhead of message passing somehow, e.g. by using the shared globally addressable memory subsystem of Cray T3E, so that MPTRAM can be used to implement algebraic specification languages.
1 Introduction
Algebraic specification languages such as OBJ3 and CafeOBJ allow users to use computers to verify some properties of software systems and observe their dynamic behavior at an early stage of software development. However, there are still some problems that have to be solved with algebraic specification languages. One of them is the improvement of their execution (rewriting) speed. That is why we have designed some rewriting abstract machines. TRAM [5] is an abstract machine for rewriting [4] that may be used to efficiently implement algebraic specification languages. Its rewriting strategy is the user definable E-strategy [2]. Parallel TRAM [7] (PTRAM) is a parallel variant of TRAM that is intended to be implemented on shared-memory multiprocessors. It allows users to control parallel rewriting using the parallel E-strategy that is an extension of the E-strategy and is also a subset of the concurrent E-strategy [3]. Furthermore, we have altered PTRAM so as to reasonably implement it on massively parallel computers. The altered version is called Massively Parallel TRAM (MPTRAM). We have experimentally implemented MPTRAM on Cray T3E [1] carrying 128 processing elements with MPI [8]. In this paper, we briefly describe parallel rewriting with the parallel E-strategy, PTRAM and MPTRAM, and report some experiments carried out on Cray T3E.
2 Parallel TRAM
Outline of Parallel Rewriting. Parallel TRAM (PTRAM) uses as its rewriting strategy the parallel E-strategy [7], which is an extension of the E-strategy [2] and is also a subset of the concurrent E-strategy [3], so as to control parallel
rewriting. The strategy allows each operation to have its own parallel local strategy. Parallel local strategies are specified for an n-ary operation by using lists that are basically ones of natural numbers and sets of positive numbers, or to be more exact, that are defined by the following extended BNF notation:

Definition 1 (Syntax of parallel local strategies).

ParallelLocalStrategy ::= ( ) | ( SerialElem* 0 )
SerialElem            ::= 0 | ArgNum | ParallelArgs
ArgNum                ::= 1 | 2 | ... | n
ParallelArgs          ::= { ArgNum+ }

A positive number x (1 ≤ x ≤ n) in parallel local strategies denotes the xth argument of a term whose top operation (if the term is f(t1, ..., tn), f is the top operation) is that n-ary operation, and zero stands for the term itself. The rewriting of terms with the parallel E-strategy is briefly described, which is followed by the operational semantics. A term is rewritten according to the parallel local strategy of its top operation. If the first element of the strategy is some positive number x, the xth argument is rewritten and the rewriting of the original term continues according to the remainder of the strategy. If the first one is zero, the term is first replaced with a new term and then the new one is rewritten according to the parallel local strategy of its top operation if there exists a rewrite rule whose left-hand side matches with the term, or otherwise the rewriting of the original term continues according to the remainder of the strategy. If the first one is a set of positive numbers, all the subterms corresponding to the positive numbers in the set are rewritten in parallel and then the rewriting of the original term continues according to the remainder of the strategy. For example, if a binary operation f is given the parallel local strategy ({1 2} 0), both arguments of f(t1, t2) are rewritten in parallel before f(t1, t2) itself is rewritten. The following is the operational semantics written in rewrite rules:

Definition 2 (Operational semantics of the parallel E-strategy).

eval(t)            → if(eval?(t), t, reduce(t, strategy(t)))
reduce(t, nil)     → mark(t)
reduce(t, c(0, l)) → if(redex?(t), eval(contractum(t)), reduce(t, l))
reduce(t, c(x, l)) → reduce(replace(t, x, eval(arg(t, x))), l)
reduce(t, c(m, l)) → reduce(fork(t, m), l)
fork(t, empty)     → t
fork(t, p(x, m))   → exit(eval(arg(t, x)), x, fork(t, m))
exit(s, x, t)      → replace(t, x, s)

eval rewrites a term according to the parallel local strategy of its top operation by calling reduce if the term has not been rewritten yet. t and s are variables for terms, x for positive numbers (that is why reduce(t, c(x, l)) does not overlap with reduce(t, c(0, l))), l for ParallelLocalStrategy and m for ParallelArgs, and the other symbols are operations. c and nil, and p and empty are constructors for representing ParallelLocalStrategy and ParallelArgs respectively. exit and if are given ({1 3} 0) and (1 0) as their strategies, and the other operations eager local strategies such as (1 2 ... 0).
Fig. 1. Rough sketch of PTRAM architecture
PTRAM Architecture. PTRAM is intended to be implemented on shared-memory multiprocessors with a small number of processors. Fig. 1 shows a rough sketch of PTRAM architecture. There are three kinds of processing units (an interpreter, a rule compiler and a term compiler), and six kinds of regions. DNET and CR are stable regions and are shared with all processors. The three mutable regions SL, STACK and VAR are given to each processor, but CODE, on which terms are allocated, is shared with all processors so that terms can be shared with all processors. One of the processors, called the main processor, also has a rule compiler and a term compiler. Given a set of rewrite rules, with the rule compiler the main processor encodes the left-hand sides into a discrimination net, which is a kind of decision tree for pattern matching, and translates the right-hand sides into pairs of matching program templates and strategy list templates. The discrimination net is allocated on DNET, and the pairs on CR. After that, given a term, with the term compiler the main processor translates it into a pair of a matching program and a strategy list. The matching program is allocated on CODE, and the strategy list on the main processor's SL. And then, in order to rewrite the original term, the main processor interprets the matching program according to the strategy list with its interpreter, STACK and VAR, and DNET. If a processor meets a situation that some arguments may be evaluated in parallel and there are some available (idle) processors, it asks them to rewrite the arguments independent of it. If there is no idle processor at that time, the processor itself rewrites the arguments.

The strategy list [6] of a term is made from the matching program compiled from the term and the parallel local strategies of the subterms' top operations. It indicates the order in which the term is rewritten. The elements of the strategy list are basically addresses at which the matching programs of the subterms and some instructions are stored. There are three instructions for parallel rewriting: fork, join and exit. fork creates a new process for parallel rewriting, join delays its caller process until all of its child processes terminate, and exit terminates its caller process after reporting the termination to the parent process. exit does not have to explicitly return the term that its caller process has rewritten to the parent process because terms are allocated on the shared region CODE.

Strategy lists may be regarded as process queues. A chunk of consecutive elements of a strategy list represents a process. Each process has two possible states: active and pending. Each process queue contains zero or more pending
Fig. 2. Rough sketch of MPTRAM architecture processes, and zero or one active process that is on the top of the queue if exists. A process whose process queue contains an active process is busy, and otherwise idle. There is no special processor that manages processors, but a table for recording processors’ state that is shared with all processors. Each processor itself keeps a record of its own state in the table, and looks at the table so as to find idle processors.
3 Massively Parallel TRAM
We have altered PTRAM to reasonably implement it on massively parallel computers. The altered version is called Massively Parallel TRAM (MPTRAM). Fig. 2 shows a rough sketch of MPTRAM architecture. MPTRAM uses the message passing model as its parallel computational model, and the simple master/worker model as the basic parallel algorithm. The master manages all workers by using the two containers (IdleStack and WaitList) and the one table (ForkTable). Each worker has three possible states: idle, waiting and busy. Idle workers are pushed on the stack IdleStack, and waiting workers that are waiting for results from their child processes are added to the linked list WaitList because waiting workers may become busy at some time. Workers that are in neither IdleStack nor WaitList are busy. At first all the workers are idle. Each worker has the three processing units and the six regions so that the amount of messages passed between processors should not explode. That is why each worker can collect garbages in its own CODE independent of the others. ForkTable will be described later. Given a set of rewrite rules, the master receives and forwards it to all the workers. Each worker compiles it with its rule compiler in the same way as PTRAM. Given a term afterward, the master receives and forwards it to some worker, e.g. Worker1, and then the worker becomes busy and compiles it with its term compiler like PTRAM. Subsequently, the worker interprets the matching program according to the strategy list with its interpreter so as to rewrite the original term. If a worker meets the fork instruction, it asks the master if there is some available worker. The master searches IdleStack and then WaitList for an available worker, and changes the available worker to busy and sends its ID to the client worker if exists, or informs the client of no available worker otherwise.
Fig. 3. Experimental results of computing the 34th Fibonacci number (normalized speed based on TRAM versus the number of workers, for thresholds 24-27)
The client worker asks the child worker with the ID to rewrite some argument if it receives the ID from the master, or otherwise rewrites the argument by itself. If a worker meets the join instruction, it basically asks the master whether it can go on or must wait for its child workers, in which case it becomes waiting. The master judges this by looking at ForkTable, in which it keeps a record of how many child workers a worker is waiting for at each join point, that is, a place where the join instruction is referred to in the worker's strategy list. If a worker meets the exit instruction, it sends the result term to the parent worker, which is either waiting or busy, and notifies the master that it has finished a requested rewriting. After the notification, the master decreases the number of child workers the parent worker is waiting for at the corresponding join point in ForkTable, and makes the parent worker busy and has the parent worker resume if the number becomes zero and there are no other processes above the process corresponding to the join point in the parent worker's strategy list. Each worker must have some table in which it keeps a record of references to terms that it has asked other workers to rewrite. Since those references become necessary so that a worker can go on rewriting after executing the join instruction, the necessary references are put just after each point at which the join instruction is referred to in the strategy list.
4 Implementation of MPTRAM on T3E with MPI
Implementation. We have experimentally implemented MPTRAM on Cray T3E [1] with MPI [8]. The Cray T3E used carries 128 user processing elements (PEs), each of which has a DECchip 21164 64-bit super-scalar RISC processor (300 MHz DEC Alpha EV5) and 64 MB of local memory. The PEs are connected by a bidirectional three-dimensional torus system interconnect network. The MPTRAM system uses two intracommunicators called Master and Workers, and one intercommunicator called Section between Master and Workers. An idle worker waits, on the intercommunicator Section, for a message from the master informing it that some busy worker is going to ask it to rewrite some term. After it receives such a message, it waits, on the intracommunicator Workers, for a message that includes a term from a busy worker. When a waiting worker waits for
Experimental Implementation of Parallel TRAM
851
some messages from the master or workers that the waiting worker has asked to evaluate some subterms, it is necessary to use the nonblocking receive operation because the waiting worker cannot generally predict the order in which messages will arrive. Preliminary Experiment. A set of rewrite rules that compute Fibonacci numbers in parallel was used to assess the MPTRAM system on Cray T3E. The rewrite rules used primitive integers (DEC Alpha EV5’s integers). They also used a threshold to prevent excess number of processes. If the nth Fibonacci number, where n is less than the threshold, is computed, the rewrite rules compute it sequentially, not in parallel. We had the MPTRAM system compute (rewrite) the 34th Fibonacci number with the rewrite rules in varying the number of workers used from 1 to 127 and also the threshold from 24 to 27. Fig. 3 shows the normalized speed of the MPTRAM system based on TRAM. The result corresponding to each threshold except for 24 has a similar tendency. The rewriting speed drastically increases at some point: 90, 60 and 40 workers if the thresholds are 25, 26 and 27 respectively. This seems to relate to how many processes are created because the fork instruction is executed 88, 54 and 33 times if the thresholds are 25, 26 and 27 respectively. The rewriting speed drastically increases after the number of workers is over the number of forks executed in each case. If the threshold is 24, the fork instruction is executed 143 times that is more than actual PEs. That is why the result has the different tendency from the others. We used another set of rewrite rules, in which natural numbers were represented ´a la Peano, to compute the 25th Fibonacci number so as to investigate how the amount of messages passed affected the increase in rewriting speed. The threshold was 14. The speedup was not over twice even with 90 workers while the speedup was approximately nine times if primitive integers were used. Hence we have to reduce the overhead of message passing somehow, e.g. by using the shared globally addressable memory subsystem of Cray T3E, so that MPTRAM can be used to implement efficiently algebraic specification languages.
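As a concrete illustration of the nonblocking receives mentioned in the implementation paragraph above, a waiting worker could post one request per expected message and complete them in whatever order they arrive. This is only a sketch: the tag, buffer layout and constants used here are assumptions, not the actual MPTRAM message protocol, and real result messages would carry terms rather than doubles.

```c
#include <mpi.h>

#define MAX_CHILDREN 64     /* hypothetical bound on outstanding children */
#define RESULT_TAG   1      /* hypothetical tag for result messages       */

void wait_for_results(MPI_Comm workers, int nchildren, double *results)
{
    MPI_Request req[MAX_CHILDREN];
    MPI_Status  st;
    int i, done;

    for (i = 0; i < nchildren; i++)      /* one nonblocking receive per child */
        MPI_Irecv(&results[i], 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                  RESULT_TAG, workers, &req[i]);

    for (i = 0; i < nchildren; i++) {    /* complete them in arrival order    */
        MPI_Waitany(nchildren, req, &done, &st);
        /* result 'done' is now available; resume once all have arrived       */
    }
}
```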
References
1. Cray T3E: http://www.cray.com/products/systems/crayt3e
2. Futatsugi, K., Goguen, J. A., Jouannaud, J. P. and Meseguer, J.: Principles of OBJ2. Proc. of POPL'85. (1985) 52–66
3. Goguen, J., Kirchner, C. and Meseguer, J.: Concurrent Term Rewriting as a Model of Computation. LNCS 279 (Graph Reduction '86) Springer-Verlag. (1986) 53–93
4. Klop, J. W.: Term Rewriting Systems. In Handbook of Logic in Computer Science. 2. Oxford University Press. (1992) 1–116
5. Ogata, K., Ohhara, K. and Futatsugi, K.: TRAM: An Abstract Machine for Order-Sorted Conditional Term Rewriting Systems. LNCS 1232 (RTA'97) Springer-Verlag. (1997) 335–338
6. Ogata, K. and Futatsugi, K.: Implementation of Term Rewritings with the Evaluation Strategy. LNCS 1292 (PLILP'97) Springer-Verlag. (1997) 225–239
7. Ogata, K., Kondo, M., Ioroi, S. and Futatsugi, K.: Design and Implementation of Parallel TRAM. LNCS 1300 (Euro-Par'97) Springer-Verlag. (1997) 1209–1216
8. Snir, M., Otto, S. W., Lederman, S. H., Walker, D. W. and Dongarra, J.: MPI: The Complete Reference. The MIT Press. 1996
Parallel Temporal Tableaux
R. I. Scott1, M. D. Fisher2, and J. A. Keane1
1 Department of Computation, UMIST, Manchester, UK
2 Department of Computing & Mathematics, Manchester Metropolitan University, UK
Abstract. Temporal logic is useful for reasoning about time-varying properties. To achieve this it is important to validate temporal logic specifications. The decision problem for propositional temporal logic is PSPACE-complete and thus automated tableau-based theorem provers are both computationally and resource intensive. Here we analyse how parallelism can be applied to a standard temporal tableau method in an attempt both to reduce the amount of storage needed and to improve efficiency. Two shared memory parallel systems are presented which are based on the resulting decomposition of the algorithm but differ in the configuration of parallel processes.
1 Introduction
Temporal logic has been shown [4] to be a powerful tool for reasoning about properties that vary over time. It is important that we are able to test specifications expressed in temporal logic for validity. A decision procedure for determining the validity of a temporal logic formula is given in [6]. As the decision problem for propositional temporal logic (PTL) is PSPACE-complete, and because temporal specifications are typically large, an automated theorem prover based on this algorithm would be computationally intensive and would potentially require a large amount of memory. The aim of this work is to address ways in which we can minimise the amount of memory required and reduce the computational intensity by utilising parallel systems within temporal tableaux. Wolper’s algorithm [6] has two phases in which a graph (representing possible models of the temporal formula) is constructed and, when construction is complete, systematically reduced. Parallelising Wolper’s algorithm has two potential benefits: (1) the reduction of the amount of memory consumed by deleting nodes from the graph during the construction phase; and (2) improving overall efficiency by decreasing the overall time taken. The paper is structured as follows: in §2 we define PTL; in §3 we describe Wolper’s decision procedure for PTL; in §4 we analyse the algorithm in order to identify those operations which can be performed on the graph in parallel; from this, in §5, we outline a shared memory parallel program design and show
This work was partially supported by EPSRC grant GR/J48979
how, by grouping together different operations, the granularity of processes can be refined; in §6 we present results of the parallel temporal tableaux algorithm on a range of parallel systems; conclusions are presented in §7.
2 Propositional Temporal Logic (PTL)
Here we consider a discrete linear future-time temporal logic [1]. The alphabet of this logic is that of classical propositional logic augmented with the unary operators ✷ ("always"), ♦ ("sometime"), ◯ ("next-time") and the binary operators U ("until") and W ("unless"). Informally, ✷f means that f is true in all future states, ♦f means that f is true in some future state, ◯f means that f is true in the next state and f U g means that f is true in all future states until g becomes true. f W g is a weaker version of f U g where g is not guaranteed to occur. An interpretation for PTL is a pair ⟨σ, s0⟩ where σ provides an interpretation for the propositions of the language in each state and s0 is some initial state. As we are dealing with a linear model of time, each state has a unique successor state. Further discussion of temporal logics can be found in [1].
3 Wolper's Decision Procedure For PTL
Wolper's method for determining the validity of a temporal logic formula is an extension of the semantic tableau decision procedure for propositional logic [1]. As a refutation procedure, it attempts to construct a model for the negation of the formula to be tested. If a model cannot be found, then the negation is unsatisfiable and hence the initial formula is valid. The algorithm works on the observation that a PTL formula can be decomposed into subformulae that are true in the current state or the next state in time. Thus we can attempt to construct a model state by state and test for satisfiability. The algorithm consists of two phases: Graph Construction, and Graph Reduction. In the construction phase the logical structure of the formula is abstracted into the form of a graph, whose nodes are labelled with sets of formulae. Models are represented by infinite paths in the graph. The graph is built by applying certain rules to the formulae depending on whether they are disjunctive or conjunctive. Once construction is complete, nodes are deleted according to some reduction rules. If this reduction phase deletes the entire graph then the original formula is valid. We can classify temporal logic formulae into four types: α formulae (conjunctions); β formulae (disjunctions); next-time formulae (those whose leading connective is ◯) and literals (atomic propositions and their negations). Elementary formulae are defined to be either literals or next-time formulae. During construction, tableau expansion rules are applied to α and β formulae, these are:
– α → {{α1, α2}}, and
– β → {{β1}, {β2}}.
The α and β sub-formulae are given in Table 1.
Table 1. α and β formulae
α          α1    α2
A1 ∧ A2    A1    A2
✷A         A     ◯✷A

β          β1    β2
B1 ∨ B2    B1    B2
♦B         B     ◯♦B
B1 U B2    B2    B1 ∧ ◯(B1 U B2)
B1 W B2    B2    B1 ∧ ◯(B1 W B2)
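Read as code, Table 1 is a case analysis on the leading connective of a formula. The following C sketch illustrates that reading; the formula representation (a tagged node with subformulae l and r) and the function names are invented for the example and are not taken from the paper's implementations.

```c
#include <stdlib.h>

typedef enum { PROP, NOT, AND, OR, ALWAYS, SOMETIME, NEXT, UNTIL, UNLESS } op_t;

typedef struct formula {
    op_t op;
    struct formula *l, *r;    /* subformulae; r is unused for unary operators */
} formula;

static formula *mk(op_t op, formula *l, formula *r)
{
    formula *f = malloc(sizeof *f);
    f->op = op; f->l = l; f->r = r;
    return f;
}

/* Expand one non-elementary formula f according to Table 1:
 * an alpha rule yields one set {a1, a2}; a beta rule yields two sets {b1}, {b2}. */
void expand(formula *f, formula *out1[2], formula *out2[2], int *nsets)
{
    switch (f->op) {
    case AND:      out1[0] = f->l;  out1[1] = f->r;               *nsets = 1; break;
    case ALWAYS:   out1[0] = f->l;  out1[1] = mk(NEXT, f, NULL);  *nsets = 1; break;
    case OR:       out1[0] = f->l;  out1[1] = NULL;
                   out2[0] = f->r;  out2[1] = NULL;               *nsets = 2; break;
    case SOMETIME: out1[0] = f->l;  out1[1] = NULL;
                   out2[0] = mk(NEXT, f, NULL); out2[1] = NULL;   *nsets = 2; break;
    case UNTIL:                                        /* and UNLESS, identically */
    case UNLESS:   out1[0] = f->r;  out1[1] = NULL;
                   out2[0] = mk(AND, f->l, mk(NEXT, f, NULL));
                   out2[1] = NULL;                                *nsets = 2; break;
    default:       *nsets = 0;   /* literals and next-time formulae are elementary */
    }
}
```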
We now semi-formally outline the graph construction and graph reduction stages of the algorithm (for a more formal description, see [6]).
3.1 Graph Construction
The digraph G(N, E, L), where N is the set of nodes (which may be marked or unmarked) in the graph, E is the set of edges and L is a function which maps a node n to a set of formulae Ln, is constructed as follows:
1. Start with N := {n1}, where n1 is the initial node and is labelled with the set of formulae Ln1 := {f} where f, the initial formula, is the negation of the formula to be tested. E := ∅.
2. While there exists an unmarked node n ∈ N, labelled with the set of formulae Ln, repeatedly apply the following rules:
   (a) if f ∈ Ln where f is an unmarked non-elementary formula then for all Si in the tableau expansion rule (f → {Si}), create a new node n′ labelled with Ln′ := Ln − {f} ∪ {Si} ∪ {f*} where f* is f marked. N := N ∪ {n′} and E := E ∪ {(n, n′)}
   (b) if Ln contains only elementary and/or marked formulae, create a new node n′ where Ln′ := {f | ◯f ∈ Ln}. N := N ∪ {n′} and E := E ∪ {(n, n′)}
N.B. when we write N := N ∪ {n′}, we mean that n′ is added to N only if ∀ni ∈ N. Lni ≠ Ln′; otherwise an edge to the existing node is created. Nodes added by rule 2(b) are called prestates and the parent of a prestate is called a state.
3.2 Graph Reduction
The digraph G is reduced by repeatedly applying the following reduction rules:
1. Delete nodes containing complementary formulae.
   ∀n ∈ N. ∀f ∈ Ln. (¬f ∈ Ln ⇒ (N = N − {n}))
2. Delete nodes with no successors.
   ∀n ∈ N. (∀ni ∈ N. (n, ni) ∉ E) ⇒ (N = N − {n})
3. Delete prestates containing unsatisfiable eventualities.
   (a) ∀n ∈ N. prestate(n). ∃♦f ∈ Ln ∧ ∀ni ∈ N. reachablefrom(ni, n) ∧ f ∉ Lni ⇒ (N = N − {n})
   (b) ∀n ∈ N. prestate(n). ∃f U g ∈ Ln ∧ ∀ni ∈ N. reachablefrom(ni, n) ∧ g ∉ Lni ⇒ (N = N − {n})
N.B. We say that the formulae ♦g and f U g have associated eventuality g because both formulae imply that eventually there is a state in which g is satisfied. The predicate reachablefrom(ni, nj) is true iff there exists a directed path in the graph from nj to ni.
4 Opportunities for Parallelism
To identify possibilities for parallel activity, we divide the sequential algorithm into separate sub-problems. As described above, expansion rules are repeatedly applied to the sets of formulae labelling pre-states. A new node is then added to the graph for each of the new formula sets derived (if no node exists with an identical label) and then a prestate is derived from these states. These prestates can be expanded in parallel as can the two formula sets created upon the application of a β expansion rule. When a set of formulae is fully expanded, a new node is added to the graph only if there is no other node with the same label. Therefore, the graph has to be searched for duplicate labels. Here, two approaches were considered: (1) nodes are added to the graph and a marker process can be used to detect duplicate labels; or (2) the graph is searched for a label before any node is added to the graph. The advantage of the first approach is that it does not restrict the amount of parallel activity, but the graph may become unnecessarily large and the marker process could become overloaded. The second approach means that the graph size is kept to a minimum as no speculative work is performed by processes applying tableau expansion rules and adding nodes to the graph. Searching the graph before adding a node can be done in parallel, but synchronisation is required between a process searching the graph and a process adding a new node to the graph. The label of any node added to the graph must be visible to a process concurrently searching the graph for a label to be certain that no two nodes have the same label. This constraint may result in a process having to lock the entire graph while it searches for a particular label. The graph is reduced by deleting nodes with labels which contain contradictory formulae, with no successors, or whose labels contain unsatisfied eventualities.
The first of these rules can be applied during construction by simply discarding the contradictory formula set, as no global checking of the graph is required. The other two reduction rules need more consideration. If a process is to delete a node during construction because it has no successors, it must ensure that the node will never have any children, and similarly, to delete a node because it contains an unsatisfied eventuality a process must be certain that the eventuality will never be satisfied. Coordination is again needed between a process deleting nodes from the graph and a process simultaneously traversing the graph, for instance during search to prevent the actions of the latter from being corrupted. Therefore, it may be more efficient to concurrently mark nodes for deletion, yet actually delete them sequentially at some future point. As a result we can divide the algorithm into the following operations which may be performed in parallel (these operations are similar to those presented in [3]):
OP1 - applying tableau expansion rules to sets of formulae;
OP2 - searching the graph for duplicate node labels;
OP3 - adding a new node to the graph;
OP4 - identifying nodes which contain contradictory formulae;
OP5 - identifying nodes with no successors;
OP6 - identifying prestates which contain unsatisfied eventualities;
OP7 - deleting nodes identified by OP4, OP5 or OP6.

5 Program Design
The operations described above can be implemented as individual fine-grained processes or grouped together to form processes with a coarser granularity. Below, we describe a medium-grained shared memory program model. This design was then analysed and the configuration of the processes was modified in an attempt to improve efficiency.

5.1 Heterogeneous Processes - Model 1
This model consists of four processes: two constructors; a searcher; and a checker. Each process has read and write access to a shared structure - the graph. The constructors obtain and expand unmarked labels from the graph (OP1). Upon the application of a β rule, two sets of formulae are created, a constructor retains one set and pushes the other onto a shared queue (to which both constructors have access). Sets which contain complementary formulae are simply discarded by the constructor (OP4). Once a formula set has been obtained from the graph (or the shared queue), the constructor does not require further access to the graph and thus the amount of coordination required between constructors is small. Therefore, we allow multiple instances of the constructor process in our model. Fully expanded formula sets are passed to the searcher via queue 1.
[Figure: two constructor processes (operations 1, 4), a searcher (operations 2, 3) and a checker (operations 5, 6) sharing the graph and an eventuality look-up table, connected via queue 1 and queue 2.]
Fig. 1. Program Design with 4 Heterogeneous Processes
The searcher performs OP2 and OP3 and instructs the checker that the graph has been updated via queue 2. The checker is responsible for the identification of nodes which can be deleted while the graph is being constructed. The process of identifying nodes that can be deleted is similar to the problem of garbage collection in functional programming languages [5]. The two methods described below for OP5 and OP6 are similar to reference count garbage collection method [5]. To perform OP5 a process has to be sure that a node with no successor nodes can never have any more before marking it for deletion. This is achieved by maintaining a reference count for each node. When a node is added to the graph, its reference count is set to 1. The node’s count is incremented when a β rule is applied during its expansion, and is decremented when a formula set is discarded due to OP4 or when a node or edge is added from it. Thus, if a node’s reference count is equal to zero and it has no children, a process can be sure that no more successors will be added to the node. In addition to building up a picture of which eventualities are satisfied at which nodes, a process performing OP6 must also identify when an eventuality can never be satisfied, i.e. when a node cannot reach any more nodes as a result of a node or edge being added to the graph. When a node is added to the graph, this new label has to be checked to see whether it satisfies any of the eventualities that can reach it and when an edge (n1 , n2 ) is added to the graph, all the nodes that n2 can reach must be checked against all the eventualities that can reach n1 . Therefore a process has to identify the subgraph that can be reached by a particular node. The identification of subgraphs is complicated by the existence of loops in the graph and some measure has to be taken to detect this. As the process traverses the graph identifying nodes in a subgraph, it tags each node through which it passes, thus ensuring that it passes through a node once only. The system must record where eventualities are satisfied and so an Eventuality Look-Up Table is maintained which contains information on to which node
an eventuality is associated and which nodes satisfy it. In order to delete a node because it is labelled with an unsatisfied eventuality a process must be certain that the node can have no more descendants. This is achieved by maintaining a set, reachsetn, of node identifiers for each node n. This set contains the identifiers of the nodes in the subgraph reachable from n which could yet have more successors, i.e. their reference count is not zero. These reachsets can be maintained during the identification of subgraphs. As both OP5 and OP6 involve the identification of subgraphs, we combine them into one checker process. The checker is notified of any additions to the graph by the searcher. If a node, n, is added it obtains any eventualities associated with the subgraph that can reach n, updating the relevant reachsets as it does so, and checks them against the label of n. If an edge (n1, n2) is added it identifies the subgraph, SG1, that can reach n1 and the subgraph, SG2, reachable from n2, again updating the relevant reachsets and checks the labels in SG2 against any eventualities associated with SG1. The actions of a process deleting nodes from the graph (OP7) and those of a process traversing the graph have to be synchronised to ensure that the former process does not corrupt the latter. Locking each node upon access could remove this possibility but could prove unnecessarily restrictive, especially in an example where no nodes can be deleted concurrently. Here the approach taken for OP7 is to suspend the parallel processes and delete nodes sequentially when a predefined number of nodes can be deleted.
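The reference-count bookkeeping used for OP5 can be summarised in a few lines of C. This is a sketch under our own naming assumptions (node_t and the on_* event handlers are not from the paper); it only records the rule stated above: a node can never gain further successors once its count is zero, and is deletable by OP5 once it additionally has no children.

```c
/* Sketch of the per-node reference count for OP5 (hypothetical names). */
typedef struct node {
    int refcount;              /* pending expansions that may still add a successor */
    int nchildren;             /* successors added so far                           */
    int marked_for_deletion;
    /* label, edges, reachset, ... omitted                                          */
} node_t;

void on_node_added(node_t *n)         { n->refcount = 1; }
void on_beta_rule(node_t *parent)     { parent->refcount++; }   /* expansion forked   */
void on_set_discarded(node_t *parent) { parent->refcount--; }   /* OP4 dropped a set  */

void on_edge_added(node_t *parent, node_t *child)
{
    parent->refcount--;
    parent->nchildren++;
    (void)child;               /* the child's own count was set when it was created  */
}

int can_never_get_successors(const node_t *n) { return n->refcount == 0; }

int deletable_by_op5(const node_t *n)
{
    return n->refcount == 0 && n->nchildren == 0;
}
```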
5.2 Homogeneous Processes - Model 2
Model 1 was designed to minimise inter-process synchronisation by grouping similar operations into the same process. By statically determining the configuration of the processes in this way it is unlikely that the load across the four processes will be balanced. Neither is model 1 particularly scalable: we can allow more instances of the constructor process but we are restricted to single instances of the searcher and checker processes because of the need for sequential update. An alternative model has four homogeneous coarse-grained processes which sequentially perform OP1 to OP6. Here OP2, OP3, OP5 and OP6 are only performed once a formula set has been fully expanded and so no process is idle as long as there are enough sets of formulae needing expansion. We now have more than one process able to update the graph and so some synchronisation is required. Before OP3 is performed we must ensure that the graph contains no node with a label identical to the new label. Thus, we have to deal with the possibility that two processes may attempt to add nodes with identical labels at the same time. To facilitate this only one insertion point for new nodes is permitted. When a process has completed OP2, it attempts to lock the last node of a linked list (the point of insertion). If, when the lock is obtained, the locked node is no longer the last node of the linked list, the process knows that more nodes have been added by another process and OP2 is not complete and so the process again searches to the end of the linked-list and attempts to lock the last node. If the locked node is still the last node in the list, then the process can be sure that no
more nodes will be added until it releases the lock and thus the graph can never contain two nodes with identical labels. Similarly, OP5 and OP6 need access to the entire graph and so some synchronisation is required - a node is locked when its reference count or its reachset is updated. OP7 is again performed at some fixed points during construction.
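A minimal C sketch of this lock-and-recheck insertion protocol is given below. The list node type, label representation (a string) and helper routines are our own simplifications; the point of the sketch is only the retry loop: because the tail is locked and verified to still be the tail, no other process can append while the duplicate check and insertion take place.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct gnode {
    char           *label;
    struct gnode   *next;
    pthread_mutex_t lock;
} gnode;

static gnode *append(gnode *tail, char *label)        /* OP3: extend the list */
{
    gnode *n = malloc(sizeof *n);
    n->label = label; n->next = NULL;
    pthread_mutex_init(&n->lock, NULL);
    tail->next = n;
    return n;
}

static gnode *find_label(gnode *head, const char *label) /* duplicate check */
{
    for (gnode *p = head; p; p = p->next)
        if (strcmp(p->label, label) == 0) return p;
    return NULL;
}

gnode *insert_if_absent(gnode *head, char *label)
{
    for (;;) {
        gnode *tail = head;
        while (tail->next) tail = tail->next;          /* OP2: walk to the end     */
        pthread_mutex_lock(&tail->lock);
        if (tail->next == NULL) {                      /* still the tail: safe now */
            gnode *n = find_label(head, label);
            if (n == NULL) n = append(tail, label);
            pthread_mutex_unlock(&tail->lock);
            return n;
        }
        pthread_mutex_unlock(&tail->lock);             /* someone appended; retry  */
    }
}
```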
6 Results
The two models were implemented on a Silicon Graphics 4-processor shared memory machine. A sequential version was also implemented for comparison purposes. These programs were tested on four reasonably large formulae which have previously been used to test sequential temporal tableau systems [2]. The sizes of the four formulae are summarised in Table 2.
Table 2. Summary of Test Formulae

Formula   Length/chars   Number of Subformulae
1         876            213
2         422            142
3         116            99
4         418            130
Each program - sequential, model 1 and model 2 - requires some preprocessing: this involves converting the formula to post-fix form and building a formula table so that a formula can subsequently be referred to simply by an identifier. These preprocessing times are not included in Table 3, which contains the timings of the three programs on the four formulae. As an indicator of the nature of the computation involved, it is useful to compare the maximum size of the graph produced in the sequential case, where deletions are carried out once the graph is fully constructed, with that of the parallel case, where nodes are deleted during construction. These are also shown in Table 3.
Table 3. Timings for Each Model & Maximum Graph Size for Sequential and Parallel Models

                       TIME                                  MAX GRAPH SIZE
Formula   Sequential    Model 1      Model 2      Sequential    Parallel
1         0m41.08s      0m32.97s     0m16.57s     1             1
2         28m31.33s     19m27.28s    13m47.13s    197           84
3         0m31.29s      0m49.24s     0m19.87s     257           257
4         3m58.62s      4m51.68s     2m19.32s     134           130
As can be seen in Table 3, Model 2 exhibits some speed-up on all four formulas whereas Model 1 performs better than the sequential implementation on only two occasions. This may be explained by the dependencies between operations, for example OP2 and OP3 rely on OP1, and by the fact that it is difficult to predict the nature of a computation for a given example. Model 1 assigns operations to processors statically, whereas Model 2 is more dynamic: operations 2, 3, 5 and 6 are performed on demand. Model 2 does not exhibit ideal speed-up because of the amount of inter-process synchronisation required when accessing the graph and because of the extra work needed to delete nodes concurrently.
7 Conclusions and Further Work
Automated reasoning methods for temporal logics consume a large amount of memory and can be slow. This restricts the size of temporal theorem that can be validated. Our aim, in parallelising the decision procedure for these logics, was two-fold: to reduce the time taken to prove a theorem; and to reduce the size of graph produced in proving a theorem. The algorithm was divided into seven concurrent operations. Two parallel models were designed and implemented along with a sequential algorithm. The heterogeneous model proved to be slower than the sequential algorithm for two of the four formulas tested, whereas the second homogeneous model exhibited satisfactory speed-ups for all four formulas. This can be attributed to the more dynamic nature of the second model. We have also compared the maximum size of the graph produced for the sequential and parallel cases. The sequential algorithm deletes nodes from the graph only when construction is complete whereas the parallel models attempt to delete as many nodes as possible during construction. A considerable amount of deletion was possible for formula 2 (over 50%), but very little parallel deletion was achieved for the other formulas. If the amount of concurrent reduction possible could be predicted by some pre-analysis, efficiency may be improved by deciding not to test for concurrent reduction if it were deemed unlikely. Unfortunately, formulas 2 and 4 are similar in length and structure and yet formula 2 allows much more reduction. The parallel temporal logic theorem prover outlined above achieves reasonable speed-up on all the temporal formulas tested. We have also shown that significant parallel graph reduction is possible for some but not all formulas. Thus, for some cases, a parallel theorem prover may not be able to handle larger formulae by concurrently reducing the graph.
References
1. M. Ben-Ari. Mathematical Logic for Computer Science. Prentice Hall, 1993.
2. G. D. Gough. Decision Procedures for Temporal Logic. Technical Report UMCS-89-10-1, Department of Comp. Sci., Univ. of Manchester, 1984.
3. R. Johnson. A BlackBoard Approach to Parallel Temporal Tableaux. In Proc. of Artificial Intelligence, Methodologies, Systems, and Applications, World Scientific, 1994.
4. Z. Manna and A. Pnueli. Verification of Concurrent Programs: The Temporal Framework. In R. S. Boyer and J. S. Moore (Eds.), The Correctness Problem in Computer Science, Academic Press, 1982.
5. R. Plasmeijer and M. van Eekelen. Functional Programming and Parallel Graph Rewriting. Addison-Wesley, 1993.
6. P. Wolper. Temporal Logic Can Be More Expressive. Information and Control, 56, pp. 72–99, 1983.
Workshop 10+17+21+22
Theory and Algorithms for Parallel Computation
Bill McColl and David Walker
Co-chairmen
Theory and Algorithms for Parallel Computation
Bridging models such as BSP and LogP are playing an increasingly important role in scalable parallel computing. They provide a convenient architecture-independent framework for the design, analysis, implementation and comparison of parallel algorithms and data structures. The Theory and Algorithms sessions at Euro-Par'98 will cover a number of important issues concerning models of parallel computation. The topics covered will include: the BSP model and its relationship to LogP, BSP algorithms for PC clusters, distributed shared memory models, computational complexity of parallel algorithms, BSP algorithms for discrete event simulation, cost modelling of shared data types, systolic algorithms, cache optimisation, optimal scheduling of task graphs.
Parallel I/O
In recent years it has become increasingly apparent that I/O presents significant challenges to the goal of making parallel applications truly portable, while at the same time maintaining good performance. The need for a parallel I/O standard has become particularly acute as more and more large-scale scientific applications have migrated to parallel systems. Many of these applications produce hundreds of gigabytes of output which would result in a serious bottleneck if handled sequentially. Parallel I/O is also of central importance in very large-scale problems requiring out-of-core solution, and is closely related to the subject of distributed file systems. Hardware vendors have developed customized parallel I/O subsystems for their particular platforms, but at most this only addresses portability within a vendor's family of systems. Recently, the MPI Forum published a standard specification for parallel I/O called MPI-IO. This has not yet been widely implemented, and much research remains to be done before it can be adequately assessed. The Scalable I/O Initiative has also addressed related issues. Nevertheless, parallel I/O remains a wide open field, and a fertile research area. The parallel I/O session of Euro-Par'98 contains three papers that address different aspects of parallel I/O. The first, by Jensch, Lüling, and Sensen, is of interest to anyone who has ever been frustrated by the slow response time of a popular web site. They consider the problem of how the average response
time can be speeded up by mapping the site's files over multiple disks in such a way that the overall throughput from the disks to the outside world is maximised. This is done by assigning files to disks in such a way that collisions, i.e., quasi-simultaneous accesses to files on the same disk, are minimised. A graph theoretic approach is used to solve the optimisation problem, and is found to be effective when compared with a random assignment strategy. The second paper, by Schikuta, Fuerle, and Wanek, describes ViPIOS, the Vienna Parallel I/O System. ViPIOS uses a client-server approach to parallel I/O which seeks to achieve maximum I/O bandwidth by reacting dynamically to the behavior of the application. Applications are the clients whose I/O requests are serviced by one or more ViPIOS servers. ViPIOS can be used either as an independent sub-system, or as a runtime library. The last paper by Dickens and Thakur describes different approaches to two-phase I/O, and presents performance results for the Intel Paragon. The two-phase approach is a technique often used for performing parallel I/O in which processors coordinate their I/O requests so that a few requests for large contiguous blocks of data are made, rather than a larger number of requests for fragmented data. This reduces latency and improves I/O performance. One phase combines I/O requests, and the other phase actually performs the I/O. The three papers presented in this session give some idea of the diversity of ideas currently being pursued in the search for a solution to the parallel I/O problem. For those interested in finding out more about the subject the parallel I/O archive at Dartmouth University is an excellent place to begin: http://www.cs.dartmouth.edu/pario/.
BSP, LogP, and Oblivious Programs
Jörn Eisenbiegler, Welf Löwe, and Wolf Zimmermann
Institut für Programmstrukturen und Datenorganisation, Universität Karlsruhe, 76128 Karlsruhe, Germany
{eisen,loewe,zimmer}@ipd.info.uni-karlsruhe.de
Abstract. We compare the BSP and the LogP model from a practical point of view. Using compilation instead of interpretation improves the (best known) simulations of BSP programs on LogP machines by a factor of O(log P) for oblivious programs. We show that the runtime decreases for classes of oblivious BSP programs if they are compiled into LogP programs instead of executed directly using a BSP runtime library. Measurements support the statements above.
1 Introduction
Parallel programming suffers from the lack of a uniform and commonly accepted machine model providing abstract programming of parallel machines and describing their costs adequately. Two candidates, the BSP model (Valiant [9]) and the LogP model (Culler et al. [4]), have been considered in an increasing number of papers. The comparison of the two models in [2] determines the delays for a simulation of LogP programs on the BSP machine and vice versa. For our observations we make two additional assumptions: we only consider oblivious programs and message passing architectures. A program is oblivious if source and destination processors of communications are statically determined1. The target machines are processor-memory-nodes connected by a communication network. We explicitly exclude shared memory machines. Virtual shared memory architectures are covered since they implicitly require communication via the interconnection network for remote memory operations. We only mention send and receive communications and deliberately ignore remote store and load operations. The first part of our paper shows that oblivious BSP programs can be compiled to the LogP machine. We further show that compilation reduces the delay of the simulation of oblivious BSP programs on the LogP machine by a factor of O(log(P)) compared to the result in [2], i.e., there is asymptotically no delay for the compiled LogP program compared to the BSP program. Even better, it turns out that the compiled LogP program could outperform a direct execution of the BSP program on the same architecture. To sharpen this observation, we consider three classes of oblivious programs: first we study Multiple-Program-Multiple-Data (MPMD) solutions. Those programs are in general hard to partition into
1 Many algorithms, especially those in scientific computing, are oblivious, e.g. Matrix Multiplication, Discrete Simulation, Fast Fourier Transform, etc.
supersteps, i.e. in global phases of receive-, compute-, and send-operations. As an example, we discuss the optimal broadcast problem. The second and third classes of problems allow Single-Program-Multiple-Data (SPMD) solutions and there is a natural partition into supersteps. They differ in the data dependencies between the phases: in the second class there are sparse dependencies between the phases, in the third class these dependencies are dense. We call a data dependency sparse iff the BSP communication of each superstep lasts longer than the corresponding communication in the compiled LogP program. As a representative of the second class of problems, we discuss a numeric wave simulation. The fast Fourier transform is a representative of the third class. Measurements of the parameters and run time results for an optimal broadcast, the simulation and the FFT support our theoretical results.
2 The Machine Models
This section describes the two machine models in a little more detail. In order to distinguish between the parameters of both models, we use capital letters for the parameters of the LogP model and small letters for the parameters of the BSP model.

2.1 The LogP Model
The LogP model assumes a finite number P of processors with local memory, which are connected by a data network. It abstracts from the network topology, presuming that the position of the processor in the network has no effect on communication costs. Each processor has its own clock; synchronization and communication are done via message passing. All send and receive operations are initiated by the processor which sends or receives, respectively. From the programmer's point of view, the network has no direct connection to the local memory. All communication is done via the processor. In the LogP model, communication costs are determined by the parameters L, O, and G. Sending a message costs time O (overhead) on the processor. The time the network connection of this processor is busy with sending the message into the network is bounded by G (gap). A processor cannot send or receive two messages within time G, but if a processor returns from a send or receive routine, the difference between gap and overhead can be used for computation. The time between the end of sending a message and the start of receiving this message is defined as latency L. There are at most L/G messages in transit from any or to any processor at any time; otherwise, the communication stalls. We only consider programs satisfying this capacity constraint. If the sending processor is still busy with sending the last bytes of a message while the receiving processor is already busy with receiving, the send and the receive overhead for this message overlap. In this case the latency is negative. This happens on many systems, especially for long messages or if the communication protocol is too complicated. L, O, and G have been determined for quite a number of machines; all works confirmed
runtime predictions based on the parameters by measurements. In contrast to [1] and [5], we assume the LogP parameters to be constants (as proposed in the early LogP works [4]). This assumption is admissible if the message size does not vary in a single program.

2.2 The BSP Model
The BSP (bulk synchronous parallel) machine was defined by Valiant [9]. We refer to its modification by McColl in [8,6]. Like the LogP model, the BSP model assumes a finite number of processors P with local memory, local clock, and a network connection to an arbitrary network. It also abstracts from the network topology. In contrast to the LogP model, the BSP machine can explicitly (barrier) synchronize all processors. The synchronization barriers subdivide the calculation into supersteps. All send operations in a superstep i are guaranteed to be completed before superstep i + 1. In the BSP model as invented by Valiant, processors communicate via remote memory access. For oblivious programs we may focus on a message based communication: Consider the remote memory accesses in one superstep. If processor πi reads from processor πj, then, in the general case, πi would send a request to πj and πj would send its answer. However, for oblivious algorithms it is already known that πi reads from the memory of πj. Thus, the request can be saved in the case of oblivious algorithms: it is sufficient to send the result of the read request from πj to πi. A write of processor πi to the memory of processor πj is equivalent to sending a message from processor πi to πj containing the memory address and the value to be written. The cost model uses two parameters: the time for the barrier synchronization l, and the reciprocal of the network bandwidth g. With these parameters, the time for one superstep is bounded by l + 2h · g + w, where h is the maximal number of messages sent or received by one processor and w is the maximal computation time needed by one processor in this superstep. A BSP machine is able to route an l/g-relation in a superstep, which is a capacity constraint analogous to that of the LogP model. The total computation time is the sum of the time for all supersteps. Like for the LogP model, we assume the parameters to be constant.
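For concreteness, the BSP cost model can be turned into a tiny cost estimator. The sketch below is our own illustration (it is not part of the paper's tooling); the per-superstep work w[i] and message bound h[i] are assumed inputs, and the parameter values would be, for example, the machine figures reported in Table 1.

```c
typedef struct { double l, g; } bsp_params;   /* barrier time, reciprocal bandwidth */

/* Cost bound of one superstep: l + 2h·g + w (text above). */
double bsp_superstep(bsp_params p, double w, double h)
{
    return p.l + 2.0 * h * p.g + w;
}

/* Total cost of a BSP program: sum of the per-superstep bounds. */
double bsp_program(bsp_params p, int supersteps, const double *w, const double *h)
{
    double t = 0.0;
    for (int i = 0; i < supersteps; i++)
        t += bsp_superstep(p, w[i], h[i]);
    return t;
}
```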
3 BSP vs. LogP for Oblivious Algorithms
In this section, we discuss the compilation of oblivious BSP programs to the LogP machine. First, we discuss how the communication between subsequent supersteps can be mapped onto the LogP machine without exceeding the LogP capacity constraints. Second, we define the actual compilation and prove execution time bounds for the compiled LogP programs. Third, we compare the direct execution of a BSP program on a target machine with the execution of the (compiled) LogP program. Therefore, we conclude this section by determining lower bounds for the BSP and LogP parameters, respectively, for the same target machine.
3.1 Communication of a Superstep on the LogP Machine
For simplicity, we assume that an h-relation is implemented, i.e. each processor sends and receives exactly h messages. It is well known that a "pipelined" communication can be computed using edge coloring on a bipartite graph (U, V, E) where U and V are the sets of processors and (u, v) ∈ E iff u communicates with v. Each color, represented by an integer j ∈ {0, . . . , k − 1} where k is the number of required colors, defines a set of non-conflicting communications that can be started simultaneously. A send(v) on processor u is scheduled at time j · max(O, G) and recv(u) on processor v at time L + O + j · max(O, G). Since we consider oblivious BSP algorithms, the edge coloring of the communication graph for each superstep (and therefore the communication phase itself) can be computed prior to execution of the BSP algorithm. Thus, the time for edge coloring (O(|E| log(|V| + |U|)) due to [3]) can be ignored when considering the execution time of the BSP algorithm. It is easy to see that the schedule obtained from the above algorithm does not violate the LogP capacity constraints if L ≥ (h − 1) · max(O, G). It follows:
Lemma 1. If L ≥ (h − 1) max(O, G), then every fixed h-relation can be implemented on the LogP machine such that its execution time is L + 2O + (h − 1) max(O, G).
If L < (h − 1) max(O, G) the scheduling algorithm must be modified to avoid stalling. First we discuss the simplified model where each channel is allowed to contain an arbitrary number of messages. In this case, the first receive operation is performed after the last send operation. The same approach as above then yields execution time 2O + 2(h − 1) max(O, G) since the first receive operation can be performed at time O + (h − 1) max(O, G) instead of O + L < (h − 1) max(O, G). If the number of messages in transit is bounded, each message is received greedily, i.e. as soon as possible after the send operation the processor is busy with at the time the message arrives has finished. This does not increase the overall execution time of a communication phase since no new gaps are introduced. Thus, the following lemma holds:
Lemma 2. If L < (h − 1) max(O, G), then every fixed h-relation can be implemented on the LogP machine such that its execution time is 2O + 2(h − 1) max(O, G).
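The schedule itself is straightforward to compute once the edge coloring is available. The following C sketch assumes the coloring has been computed offline (as the text notes, using the bipartite edge-coloring algorithm of [3]); the comm_t type and function name are our own.

```c
/* Sketch: derive start times from the precomputed colouring (colours 0..k-1).
 * A send with colour j starts at j·max(O,G) on the sender; the matching
 * receive starts at L + O + j·max(O,G) on the receiver. */
typedef struct { int src, dst, colour; } comm_t;

void schedule_h_relation(const comm_t *msgs, int m,
                         double L, double O, double G,
                         double *send_time, double *recv_time)
{
    double step = (O > G) ? O : G;          /* max(O, G) */
    for (int i = 0; i < m; i++) {
        send_time[i] = msgs[i].colour * step;
        recv_time[i] = L + O + msgs[i].colour * step;
    }
}
```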
3.2 Execution Time Bounds for the Compiled LogP Programs
The actual compilation of an oblivious BSP algorithm to the LogP machine is now straightforward: Each BSP processor corresponds one to one to a LogP processor. Beginning with the first superstep, we map the tasks of each BSP processor to the corresponding LogP processor in the same order. Communication is mapped as described in the previous subsection. Since a processor can proceed with its execution when it has received all its messages of the preceding superstep, there is no need for a barrier synchronization. Together with Lemmas 1 and 2, this observation leads to the
Theorem 1 (Simulation of BSP on LogP). Every superstep of an oblivious BSP algorithm with work w and h remote memory accesses can be implemented on the LogP Machine in time w + 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G)).
If we choose g = max(O, G) and l = max(L, (h − 1) max(O, G)) + 2O − max(G − O, 0), then the execution time of a BSP algorithm in the BSP model and of the compiled BSP algorithm in the LogP model is the same. Especially, the bound for the simulation in [2] is improved by a factor of log P for oblivious BSP programs.
3.3 Interpretation vs. Compilation
For the comparison of a direct execution of a BSP program with the compiled LogP program, the parameters for the BSP machine and LogP machine, respectively, cannot be chosen arbitrarily. They are determined by the target architecture; for comparable runtime predictions we must choose the smallest admissible values for the respective model. A superstep implementing an h-relation costs w + l + h · g in the BSP model. According to Theorem 1, it can be executed in time w + 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G)) on the LogP machine. The speedup is the ratio of these two numbers. Easy calculations prove the following
Corollary 1 (Interpreting vs. Compiling). Let M be a parallel computer with BSP parameters l, g, and P and LogP parameters L, O, G, and P. Let A be an oblivious BSP algorithm where each superstep executes at most h remote memory accesses. If l ≥ O + max(L, (h − 1) max(O, G)) and g ≥ max(O, G) then the execution time of the compiled BSP algorithm on the LogP model is not slower than the execution time of the BSP algorithm on the BSP model. If one of these inequalities is strict, then the execution time of the compiled BSP algorithm is faster.
3.4 Lower Bounds for the Parameters
Let g* and l* (resp. max*(O, G) and L*) be the smallest values of the BSP parameters (resp. LogP parameters) admissible on a certain architecture. Our simulation implies that g* = O(max*(O, G)) and l* = O(L*). Together with the simulation result of [2] that proves a constant delay in simulation of LogP algorithms on the BSP machine, we obtain
Theorem 2 (BSP and LogP parameters). For the smallest values of the BSP parameters g* and l* (resp. LogP parameters max*(O, G) and L*) achievable on any given topology, it holds that g* = Θ(max(O*, G*)) and l* = Θ(L*), provided that both machines run oblivious programs.
So far we only considered the worst case behavior of the two models. They are equivalent for oblivious programs in the sense that bi-simulations are possible with constant delay. We now consider the time for implementing a barrier synchronization and a packet router on topology networks.
Theorem 3. Let d be the diameter of a point-to-point network with P processors. Let dg be the degree of the processors of the network. Each processor may route up to dg messages in a single time step but it receives and sends only one message at each time step2. On any topology, the time l* of its BSP model is l* = Ω(max(d(P), log P)).
4
Subclasses of Oblivious Programs
In general, there is a higher effort in designing LogP programs instead of BSP programs. E.g., the proof that the BSP capacity constraints are guaranteed reduces to simply counting the number of messages sent in a superstep. Due to the asynchronous execution model, the same guarantee is not easy to prove for the LogP machine. BSP programs cannot deadlock, LogP programs can. Hence, for all considered classes, we will have to evaluate whether: – the additional effort in programming directly a LogP program pays out in efficiency, or – compilation of BSP programs to LogP programs speeds up the program’s execution, or – such a compilation cannot guarantee a speedup. For our practical runtime comparisons, we determine the parameters of the two machine models for the IBM RS6000/SP with 16 processors3. 2 3
This behavior lead to the LogP model where only receiving and sending a message takes processor time while routing is done by the communication network. The IBM RS6000/SP is a IBM SP, but uses other processors. Therefore, we compiled the BSP tools without processor specific optimizations.
BSP, LogP, and Oblivious Programs
871
Table 1. LogP vs. BSP parameters (IBM RS6000/SP), message size < 16 Bytes. BSP l = 502µs — g = 30.1µs
4.1
LogP L = 17.1µs O = 9.0µs G = 9.8µs
The Parameters
We use the native message passing library for the LogP model and the Oxford BSP tools4 for the BSP model. The results are shown in Table 1. For all measurements in this and the following sections we used compiler option -O3 -qstrict and for the BSP library option -flibrary-level 2. 4.2
MPMD-Programs
If an efficient parallel solution for a problem is hard to partition in supersteps, the BSP model is not appropriate. For those programs the LogP model seems preferable. Due to its asynchronous execution model, it avoids unnecessary dependencies of processors by synchronization. However, if the problem is low level, as our example broadcast is, the solution may be hidden in a library. It is not hard to extend the BSP model by a set of such library functions. Their implementation could be tuned using the LogP model. Optimal Broadcast: A basic algorithm on distributed memory machines is the optimal broadcast of data from one processor to all others, which is used in many applications. For the LogP model, an optimal solution is given by Karp at al. in [7]. Each processor that received the item immediately initiates a repeated sending to processors which have not received the item until there is no such processors. We achieve the optimal BSP broadcast for our machine if the first processor sends messages to all others in one superstep. After synchronization the other processors receive their message. This is optimal for our target machine, since all other implementations would need at least two synchronization steps, which alone cost more than the algorithm given above. The measured runtime of broadcast for LogP and BSP on our machine can be seen in Table 2. Remark 2. In the measurements for the BSP parameters, we used taged messages where the tag identifies the sender. For an optimal broadcast, this identification of the sender is not necessary. This explains the difference between estimation and measurements. In contrast to the LogP model, it is not possible on the IBM RS6000/SP to receive a message and send a it immediately without a gap. The LogP estimation ignore this property and are therefore too small. However, even if we assume max(O, G) = g, L = l then the optimal LogP broadcast is a lower bound for the optimal BSP broadcast. The latter requires 4
see http://www.BSP–worldwide.org
872
J¨ orn Eisenbiegler et al.
Table 2. Predictions vs. runtimes of broadcast, wave simulation, and FFT. BSP LogP measurement estimation measurement estimation broadcast 822 µs 980 µs 101 µs 99.6 µs simulation (n=1,000) 6.84 s 6.35 s 1.50 s 1.56 s simulation (n=100,000) 20.6 s 19.4 s 14.9 s 14.24 s FFT (n=1024) 3.16 ms 3.51 ms 2.65 ms 2.67 ms FFT (n=16384) 34.8 ms 35.6 ms 39.3 ms 35.2 ms
global synchronization barriers between subsequent send and receive operations. These synchronizations lead to delays on other processors if the LogP broadcast tree is not balanced by chance. For each send and receive pair all processors have to be synchronized instead of two. 4.3
SPMD-Programs with Sparse Dependencies
Assume that a BSP process sends at most h messages to the subsequent supersteps. The dependencies in the BSP program are sparse for M if the for LogP and BSP parameters for M it holds that l + 2h · g > 2O + (h − 1) max(O, G) + max(L, (h − 1) max(O, G))
(1)
If a BSP programs communicates at most an h-relation from one superstep to the next, these communication costs are bounded by the right hand side of inequation (1) in the compiled LogP program (due to Theorem 1). Then we expect a speed up if the BSP program is compiled instead of directly executed. However, if the problem size n is large compared to P , computation costs could dominate communication costs. Hence, if the problem scales arbitrarily, the speed-up gained by compilation could approach zero for increasing n. Wave Simulation: For simulating a one-dimensional wave, a new value for every simulated point is recalculated in every time step according to the current value of this point(y0 ), its two neighbors (y−1 ,y+1 ), and the value of this point one time step before (y0 ). This update is performed by the function Φ(yo , y−1 , y+1 , yo ) = 2 · y0 − yo + ∆2t /∆y · 2 · (y+1 − 2 ∗ y0 + y−1 ). Since recalculation of one node needs the values of direct neighbors only, it is optimal to distribute the data balanced and block by block on the processors. Figure 1 sketches two successive computational steps and the required communication. The two programs for the LogP and BSP model only differ in the communication phase of each computation step. In both models, the processes communicate with their neighbors, i.e, the communication phase must route a 2relation. The BSP additionally performs a synchronization of all processors. For our target machine, the data dependencies are sparse i.e., each communications phase is faster on the LogP machine than on the BSP, cf Table 2.
BSP, LogP, and Oblivious Programs
P0
873
P1 P2 P3 P4 P5 P6 P7
Fig. 1. A communication phase of the Wave Simulation.
4.4
SPMD-Programs with Dense Dependencies
We call data dependencies of a BSP program dense for a target machine if they are not sparse. Obviously for those programs compilation does not pay as the following example show. Fast Fourier Transform: We consider an one-dimensional parallel FFT and assume the input vector v to be of size n = 2k , k ∈ N. Furthermore, let n ≥ 2×P . Then the following algorithm requires only one communication phase: v is initially distribute block-by-block and we may perform the first log (n/P ) operations locally. Then v is redistributed in a cyclic way, which requires an all-to-all communication. The remaining operations can be executed also locally. The computation is balanced and a barrier synchronization is done implicitly on the LogP machine with the communication. Hence, we expect no noteworthy differences in the runtimes. The measurements in Table 2 confirm this hypothesis. Remark 3. For the redistribution of data from block-wise to cyclic, we used a LogP–library routine, which gathers data by memcopy calls with variable data length. For the BSP, such a routine does not exist and was therefore implemented by hand. This implementation allowed the compiler to use more efficient copying routines and explains the difference of the LogP runtime to its estimation and to the BSP runtime.
5
Conclusions
We show how to compile of oblivious BSP algorithms to LogP machines. This approach improves the best known simulation delay of BSP programs on the LogP machine [2] by a factor of O(log(P )). It turns out, that the both models are asymptotically equivalent for oblivious programs. We identified a subclass of oblivious programs that are potentially more efficient if directly designed for the LogP machine. Due to a more comfortable programming, other programs are preferably designed for the BSP machine. However, among those we could identify another subclass that are more efficient if compiled to the LogP machine
874
J¨ orn Eisenbiegler et al.
instead of executed directly. Others are not. Our measurements determine the parameters for a LogP and a BSP abstraction of a IBM RS6000/SP with 16 processors. They compare broadcast, wave simulation, and FFT programs as representative of the described subclasses of oblivious programs. Predictions and measurements of the programs in both models confirm our observations. Further work could combine the best parts of both worlds. We could use the same classification to identify parts of BSP programs that are executed directly, namely the non-oblivious parts and those which do not profit from compilation, and compile the other parts. Low level programs from the first class, like the broadcast, could be designed and tuned for the LogP machine and inserted to BSP programs as black boxes.
References 1. Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. LogGP: Incorporating long messages into the LogP model. In 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–105, 1995. 867 2. Gianfranco Bilardi, Kieran T. Herley, Andrea Pietracaprina, Geppino Pucci, and Paul Spirakis. Bsp vs logp. In SPAA’96: 8th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 25–32. ACM, acm press, June 1996. 865, 869, 873 3. R. Cole and J. Hopcroft. On edge coloring bipartite graphs. SIAM Journal on Computing, 11(3):540–546, 1982. 868 4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP 93), pages 235–261, 1993. 865, 867 5. J¨ orn Eisenbiegler, Welf L¨ owe, and Andreas Wehrenpfennig. On the optimization by redundancy using an extended LogP model. In International Conference on Advances in Parallel and Distributed Computing (APDC’97), pages 149–155. IEEE Computer Society Press, March 1997. 867 6. Jonathan M. D. Hill, Paul I. Crumpton, and David A. Burgess. Theory, practice, and a tool for BSP performance prediction. In Luc Boug´e, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par’96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 697–705. Springer, August 1996. 867 7. R.M. Karp, A. Sahay, E.E. Santos, and K.E. Schauser. Optimal broadcast and summation in the logp model. ACM-Symposium on Parallel Algorithms and Architectures, 1993. 870, 871 8. W. F. McColl. Scalable computing. In Jan van Leeuwen, editor, Computer Science Today, number 1000 in Lecture Notes in Computer Science, pages 46–61. Springer, 1995. 867 9. Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), August 1990. 865, 867
Parallel Computation on Interval Graphs Using PC Clusters: Algorithms and Experiments A. Ferreira1, I. Gu´erin Lassous2, K. Marcus3 , and A. Rau-Chaplin4 1
CNRS, INRIA, Projet SLOOP, BP 93, 06902 Sophia Antipolis, France. [email protected]. 2 LIAFA – Universit´e Paris 7, Case 7014, 2, place Jussieu, F-75251 Paris Cedex 05. [email protected]. 3 Eurecom, 2229, route des Cretes - BP 193, 06901 Sophia Antipolis cedex - France. [email protected]. 4 Faculty of Computer Science, Dalhousie University, P.O. Box 1000, Halifax, NS, Canada B3J 2X4. [email protected]. Abstract. The use of PC clusters interconnected by high performance local networks is one of the major current trends in parallel/distributed computing. We give coarse-grained, BSP-like, parallel algorithms to solve many problems arising in the context of interval graphs, namely connected components, maximum weighted clique, BFS and DFS trees, minimum interval covering, maximum independent set and minimum dominating set. All of the described p-processor parallel algorithms require only constant or O(log p) number of communication rounds and are efficient in practice, as demonstrated by our experimental results obtained on a Fast Ethernet based PC cluster.
1
Introduction
The use of PC clusters interconnected by high performance local networks with raw throughput close to 1Gb/s and latency smaller than 10µs is one of the major current trends in parallel/distributed computing. The local networks are either realized with off-the-shelf hardware (e.g. Myrinet and Fast Ethernet), or application-driven devices, in which case additional functionalities are built-in, mainly at the memory access level. Such cluster-based machines (called henceforth PCC’s) typically utilize some flavour of Unix and any number of widely available software packages that support multi-threading, collective communication, automatic load-balance, and others. Note that such packages typically simplify the programmers task by both providing new functionality and by promoting a view of the cluster as a single virtual machine. Clusters based on off-the-shelf hardware can yield effective parallel systems for a fraction of the price of machines using special purpose hardware. This kind of progress may thus be the key to a much wider acceptance of parallel computing, that has been postponed so far, perhaps primarily due to issues of cost and complexity. Although a great deal of effort has been undertaken on system-level and programming environment issues as described above, little attention has been paid to methodologies for the design of algorithms for this kind of parallel systems. D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 875–886, 1998. c Springer-Verlag Berlin Heidelberg 1998
876
A. Ferreira et al.
Despite the availability of a large number of built-in and/or highly optimized procedures, algorithms are still designed at the machine level and claims to portability lay only on the fact that they are implemented using communication libraries such as PVM or MPI. In this paper we show that theoretical (BSP-like) coarse-grained models are well adapted to PCC’s. In particular, algorithms designed for such models are portable and their theoretical and practical performance are closely related. Furthermore, they allow a reduction on the costs associated with software development since the main design paradigm is the use of existing sequential algorithms and communication sub-routines, usually provided with the systems. Our approach will be to study a class of problems from the start of the algorithm design task until the implementation of the algorithms on a PCC. The class of problems will be those arising on a family of intervals on the real line which can model a number of applications in scheduling, circuit design, traffic control, genetics, and others [16]. Previous Work This class of problems has been studied extensively in the parallel setting and many work-optimal fine-grained PRAM algorithms have been described in the literature [3,14,15,16]. Their sequential complexity is Θ(n log n) in all cases. Whereas fine-grained PRAM algorithms are likely to be efficient on finegrained shared memory architectures, it is common knowledge that they tend to be impractical on PCC’s due to their failure to exploit locality. Therefore, there has been a recent growth of interest in coarse-grained computational models [4,5,18] and the design of coarse-grained algorithms [5,6,8,10,13]. The BSP model, described by Valiant [18], uses slackness in the number of processors and memory mapping via hash functions to hide communication latency and provide for the efficient execution of fine grained PRAM algorithms on coarse-grained hardware. Culler et. al. introduced the LogP model which, using Valiant’s BSP model as a starting point, focuses on the technological trend from fine grained parallel machines towards coarse-grained systems and advocates portable parallel algorithm design [4]. Other coarse grained models focus more on utilizing local computation and minimizing global operations. These include the Coarse-Grained Multicomputer (CGM(n, p)) model used in this paper [5], where n is the input size and p the number of processing elements. In this mixed sequential/parallel setting, there are three important measures of any coarse-grained algorithm, namely, the amount of local computation required, the number and type of global communication phases required and the scalability of the algorithm, that is, the range of values for the ratio np for which the algorithm is efficient and applicable. We refer to [5,9,10] for more details on this model. Recently, C´ aceres et al. [2] showed that many problems in general graphs, such as list ranking, connected components and others, can be solved in O(log p) communication rounds in BSP and CGM. However, unlike general graphs, interval graphs can be more easily partitioned and treated in the distributed memory setting. Since each interval is given by its two extreme points, they can be sorted
Parallel Computation on Interval Graphs Using PC Clusters
877
by left and/or right endpoints and distributed according to this ordering. This partitioning allows us to design less complex parallel algorithms; moreover, the derived algorithms are easier to implement and faster both in theory and in practice. The following result will be used in the remaining to achieve a constant number of communication rounds in the solution of many problems. Theorem 1. [13] Given a set S of n items stored O(n/p) per processor on a CGM(n, p), n/p ≥ p, sorting S takes a constant number of communication rounds. The algorithms proposed for the CGM are independent of the communication network. Moreover, it was proved that the main collective communication operations can be implemented by a constant number of calls to global sort ([5]). Hence, by Theorem 1, these operations take a constant number of communication rounds. However, in practice these operations will be implemented through built-in, optimized system-level routines. In the remainder, let TS (n, p) denote the time complexity of a global sort in the CGM. Our Work We describe constant communication round coarse-grained parallel algorithms to solve a set of the standard problems arising in the context of interval graphs [16], namely connected components [3], maximum weighted clique [15] and breadthfirst-search (BFS) and depth-first-search (DFS) trees [14]. We also propose O(log p) communication round algorithms for optimization problems as minimum interval covering, maximum independent set [15] and minimum dominating set [17]. In order to demonstrate the practicability of our approach, we implemented three of the above algorithms on a PCC interconnected by a Fast Ethernet backbone. Because of the paradigms used, the programs were easy to develop and are quite portable. The results presented in this paper show that high performance can be achieved with off-the-shelf PCC’s along with the right model for algorithm design. Interestingly, super-linear speedups were observed in some cases due to memory swapping effects. Using multiple processors allows us to effectively utilize more RAM and therefore allows computation on data sets that are simply too large to be effectively processed on single processor machines. In Section 2 the required basic operations are described. Then, in Section 3, chosen problems in interval family model are presented, and solutions are proposed using the basic operations from Section 2. In Section 4, we describe experiments on a Fast Ethernet based PCC. We close the paper with some conclusions and directions for further research.
2
Basic Operations
In the CGM model, any parallel prefix (suffix) associative function to be performed in an array of elements can be done in O(1) communication steps, since
878
A. Ferreira et al.
each processor can compute locally the function, and then with a total exchange all the processors get to know the partial result of all the other processors and can compute the final result for each element in the array. We will also use the pointer-jump operation, to identify the elements in a linked list. This operation can be easily done in O(log p) communication steps, at each step each processor keeps track of the pointers of its elements. 2.1
Interval Operations
In the following algorithms two functions will be widely used, the Ominright and Omaxright [1]. Given an interval I, Omaxright(I) (Ominright(I)) denotes, among all intervals that intersect I, the one whose right endpoint is the furthest right (left). The formal definition is the following. Ij , if bj = max{bk |ak ≤ bi < bk } Omaxright(Ii ) = nil, otherwise. The function Omaxright can be computed with time complexity O(TS (n, p)), as follows. 1. Sort the left endpoints of the interval in ascending order as a1 , a2 , . . . , an . 2. Compute the prefix maxima of the corresponding sequence b1 , b2 , . . . , bn of right endpoints and let the result be b1 , b2 , . . . , bn . (bk = max1≤i≤k {bi }.) 3. For every i (1 ≤ i ≤ n) compute the rank r(i) of bi with respect to a1 , a2 , . . . , an 4. For every i (1 ≤ i ≤ n), set Omaxright(Ii ) = Ij , such that bj = br(i) and = br(i) ; otherwise set Omaxright(Ii ) = nil. bi
We define also the parameter First(I) as the segment I which “ends first”, that is, whose right endpoint is the furthest left: First(I) = Ij , with bj = min{bi |1 ≤ i ≤ n}. To compute it, we need only to compute the minimum of the sequence of right endpoints of intervals in the family I. Finally, we will use the function next(I) : I → I defined as Ij , if bj = min{bk |bi < ak }, next(Ii ) = nil, otherwise. That is, next(Ii ) is the interval that ends farthest to the left among all the intervals beginning after the end of Ii . To compute next(Ii ), 1 ≤ i ≤ n, we use the same algorithm used for Omaxright(Ii ), with a new step 2. 1. Sort the left endpoints of the interval in ascending order as a1 , a2 , . . . , an .
Parallel Computation on Interval Graphs Using PC Clusters
879
2. Compute the suffix minima of the corresponding sequence b1 , b2 , . . . , bn of right endpoints and let the result be b1 , b2 , . . . , bn . (bk = mink≤i≤n {bi }.) 3. For every i (1 ≤ i ≤ n) compute the rank r(i) of bi with respect to a1 , a2 , . . . , an 4. For every i (1 ≤ i ≤ n), set Next(Ii ) = Ij , such that bj = br(i) and = br(i) ; otherwise set Next(Ii ) = nil. bi
It is easy to see that the above procedure implements the definition of next(Ii ), with the same complexity as for computing Omaxright(Ii ).
3
Interval Graph Problems and Algorithms
Formally, given a set n of intervals I = {I1 , I2 , . . . , In } on a line, the corresponding interval graph G = (V, E) has the set of nodes V = {v1 , . . . , vn }, and there is an edge in E between nodes vi , vj if and only if Ii ∩ Ij = ∅. In this section, solutions for some important problems in interval graphs are proposed for the CGM model. Some of these algorithms use techniques derived from their corresponding PRAM algorithms while others require different methods, e.g. to compute the connected components, as shown below. 3.1
Maximum Weighted Clique
A clique is a set of nodes that are mutually adjacent. In the maximum weighted clique problem for an interval graph, we want to know the maximum weight of such a set, given weights p(Ii ) ≥ 0 on the intervals, and identify a maximum weighted clique by marking its nodes. The CGM algorithm is as follows: 1. Sort the endpoints of the segments such that each processor 2n/p endpoints. 2. Assign to each endpoint ci a weight wi defined by
wi =
p(Ij ), −p(Ij ),
receives
if ci = aj , for some 1 ≤ j ≤ n, if ci = bj , for some 1 ≤ j ≤ n,
3. Compute the prefix sum of the resulting weighted sequence - the maximum obtained is the cardinality of a maximum clique; let d1 , . . . , d2n denote the resulting sequence. 4. Consider the sequence e1 , . . . , e2n obtained by replacing every dj corresponding to a right endpoint of an interval with -1 and compute the rightmost maximum of the resulting sequence; this occurs at ak . 5. Broadcast ak . Every interval Iu such that au ≤ ak < bu is marked to be in the final maximum weighted clique.
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7]. Theorem 2. The maximum weighted clique problem in an interval graph of size n can be solved on a CGM(n, p) in O(TS (n, p) + n/p) time, with a constant number of communication rounds.
880
3.2
A. Ferreira et al.
Connected Components
The connected components of a graph G are the maximal connected subgraphs of G. The connected components problem consists of assigning to each node the label of the connected component that contains it. For the CGM(n, p) we have the following algorithm: 1. Sort the intervals by left endpoints distributing n/p elements to each processor. 2. Each processor Pi computes the connected components for the subgraph corresponding to its n/p intervals, giving labels to the components and associating the labels to the nodes. 3. Each processor detects the farthest right segment amongst its n/p intervals - tail ti - and broadcasts it (with its label) to all other processors. 4. Each processor checks if any of the tails intersects its components, and updates its local labels using in each case the smallest such new label. 5. Each processor Pi records the pair (ti , new label) and sends it to processor P0 . 6. Processor P0 performs a connected components algorithm on the tails and updates the tail labels using the smallest such new labels and sends the tails and their new labels to all processors. 7. Each processor updates the labeling accordingly.
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7]. Theorem 3. The connected components problem in interval graphs can be solved on a CGM(n, p) in O(TS (n, p) + n/p) time, with a constant number of communication rounds.
3.3
BFS and DFS Tree
The problem of finding a Breadth First Search Tree in an interval graph reduces to the problem of computing the function Omaxright described earlier. The tree given by the edges (Ii ,Omaxright(Ii )) is a BFS tree [14]. And the tree formed by the edges (Ii ,Ominright(Ii )) is a DFS tree [14]. The algorithm is the following: 1. Compute Omaxright(Ii ), for 1 ≤ i ≤ n. 2. Let father(Ii ) = Omaxright(Ii ). 3. The edges (Ii , father(Ii )) form a BFS tree.
With the appropriate modifications, this algorithm may be used to find a DFS tree. The obtained BFS and DFS trees have their roots in the segments ending farthest to the right in each connected component. With respect to its complexity, the algorithm takes a constant number of communication steps and requires a total running time of O(TS (n, p) + n/p).
Parallel Computation on Interval Graphs Using PC Clusters
881
Theorem 4. Given an interval graph G, BFS and DFS trees can be found using a CGM(n, p) in O(TS (n, p) + n/p) time, with a constant number of communication rounds.
3.4
Minimum Interval Covering
Given a family I of intervals and a special interval J = (Ja , Jb ), the problem of the minimum interval covering is to find a subset J ⊆ I such that J ⊆ ∪(J ), and |J | is minimum; i.e., to find the minimum number of intervals in I needed to cover J. To solve this problem we may only consider the intervals Ii = (ai , bi ) ∈ (I) such that bi ≥ Ja and ai ≤ Jb . Let IJ be the family of the intervals in I satisfying this condition. An algorithm to solve this problem is as follows: 1. Compute Omaxright(I), I ∈ IJ . 2. Find the interval Iinit such that binit = max{bk |ak ≤ Ja }. 3. Mark Iinit and all the intervals in the path given by Omaxright pointers beginning at Iinit .
Due to space limitations, the correctness and the complexity of the algorithm can be found in [7]. Theorem 5. The minimum interval covering problem in interval graphs can be solved using a CGM(n, p) in O(TS (n, p) + log p) time, with O(log p) communication rounds.
3.5
Maximum Independent Set and Minimum Dominating Set
The first problem consists of finding a largest set of mutually non-overlapping intervals in the family I, called the maximum independent set. The second problem consists of finding a minimum dominating set, i.e., a minimum set of intervals which are adjacent to all remaining intervals in the family I. To solve these problems, we simply show a coarse-grained implementation of the algorithms proposed in [17]. In fact, it can be shown that both problems can be solved by building a linked list from First(I): Maximum independent set: 1. 2. 3. 4.
Compute First(I) Compute next(Ii ), for i, 1 ≤ i ≤ n Let father(Ii ) = next(Ii ), for i, 1 ≤ i ≤ n Using the pointer-jump operation, mark all the intervals in the linked list given by father and beginning at First(I)
Minimum dominating set:
882 1. 2. 3. 4. 5.
A. Ferreira et al. Compute First(I) Compute Omaxright(Ii ), for 1 ≤ i ≤ n Compute next(Ii ), for 1 ≤ i ≤ n Let father(Ii ) = Omaxright(next(Ii )), for 1 ≤ i ≤ n Using the pointer-jump operation, mark all the intervals in the linked list given by father and beginning at Omaxright(First(I))
The number of communication rounds in each of the algorithms is O(log p), giving us a total time complexity of O(TS (n, p) + log p) and O(log p) communication rounds. Their correctness stems from the arguments in [17]. Theorem 6. The maximum independent set and the minimum dominating set problems in interval graphs can be solved using a CGM(n, p) in O(TS (n, p)+log p) time, with O(log p) communication rounds.
4
Experimental Results
This section describes the implementations of three of the algorithms presented previously. Our aim here is to demonstrate that these algorithms are not only theoretically efficient but that they lead to simple fast codes in practice. They were implemented on a Fast Ethernet-PCC platform which consists of a set of 12 Pentium Pro 200 Mhz processors each with 64M of RAM that are linked by a 100Mb/s Fast Ethernet network. The processors run the Linux Operating System and the programs are written in C utilizing the PVM communication library [11] for all interprocessor communications. Since many of the algorithms rely on sorting, the choice of the sorting method was critical. In the following we first present the implemented sort and its performance before describing the implementation and performance of our algorithms. 4.1
Global Sort
The sorting algorithm implemented is described in [12]. The algorithm requires a constant number of communication steps and its single drawback is that data may not be equally distributed at the end of the sort. Nevertheless, a partial sum procedure and a routing can be used to redistribute the data with a constant number of communication rounds so that each processor stores np data in its memory. Figure 1 shows the execution time for the global sort on an array of integers with the data redistributed in comparison to the sequential performance of quicksort (from the standard C library). The results shown are the average of ten execution times over ten different inputs generated randomly. The abscissa represents n the size of the input array and the ordinate the execution time in seconds. For less than 7,000,000 integers, the achieved speedup is about 2.5 for four processors and 6 for twelve processors. Beyond the size of 7,000,000 integers, the
Parallel Computation on Interval Graphs Using PC Clusters
883
900 sequential quicksort CGM parallel sort with 4 PCs CGM parallel sort with 12 PCs
800 700 600 500 400 300 200 100 0 0
5e+06
1e+07 1.5e+07 2e+07 2.5e+07 3e+07 3.5e+07 4e+07 4.5e+07
Fig. 1. Sorting on the Fast Ethernet-PCC. memory swapping effects increase significantly the execution time on a single processor and super-linear speedup is obtained, with 10 for four processors and 35 for twelve processors. Sorting 40,000,000 integers takes less than one minute with twelve processors. 4.2
Maximum Weighted Clique
The algorithm requires a constant number of communication rounds and O( np log np ) local operations. Note that only the local computations involved in the sort require O( np log np ) operations, whereas all the other steps require only O( np ) operations. 700 sequential MWC CGM MWC with 4 PCs CGM MWC with 12 PCs
600
500
400
300
200
100
0 0
1e+06
2e+06
3e+06
4e+06
5e+06
6e+06
7e+06
Fig. 2. Maximum Weighted Clique on the Fast Ethernet-PCC. Figure 2 presents the execution time when the number of intervals increases. With a graph having less than 1,000,000 intervals, the speedup is 2.5 with four
884
A. Ferreira et al.
processors, whereas it is equal to 7 for twelve processors. Beyond 1,000,000 intervals, the speedup is 10 for four processors and 35 for twelve processors, these superlinear timings being due to memory swapping effects. Again note that with this algorithm larger data sets than in the sequential case can be handled in a reasonable time. 4.3
Connected Components
As in the maximum weighted clique algorithm above, here also only the sort requires O( np log np ) local operations, all the other steps being linear in np .
500 sequential CC CGM CC with 4 PCs CGM CC with 12 PCs
450 400 350 300 250 200 150 100 50 0 0
2e+06
4e+06
6e+06
8e+06
1e+07
1.2e+07
1.4e+07
Fig. 3. Connected Components on the Fast Ethernet-PCC. Figure 3 shows the execution time in seconds as the size of the input increases. The achieved speedup is approximately 2.5 for four processors and 7 for twelve processors for a graph having at most 2,000,000 intervals. With more intervals, the speedup is 8 for four processors and 30 for twelve processors. Also observe that with one processor, at most 2 million of data can be processed whereas twelve processors can process 12 million of data reasonably. Beyond 12 million data items the execution time increases more steeply due to memory swapping effects, even with twelve processors. 4.4
BFS Tree
The achieved speedup is 2 for four processors and 6 for twelve processors with at most 2 million of intervals. Beyond this size, the speedup becomes 7 for four processors and 20 for twelve processors. The measured times are slower than those obtained for the previous algorithms, because two steps of the function Omaxright require O( np log np ) operations, whereas for the previous problems only one step required this number of local operations.
Parallel Computation on Interval Graphs Using PC Clusters
885
400 sequential BFS CGM BFS with 4 PCs CGM BFS with 12 PCs
350 300 250 200 150 100 50 0 0
2e+06
4e+06
6e+06
8e+06
1e+07
1.2e+07
1.4e+07
Fig. 4. BFS Tree on the Fast Ethernet-PCC.
5
Conclusion
In this paper we have shown how to solve many important problems on interval graphs using a coarse-grained parallel computer such as a cluster of PC’s. The proposed algorithms were shown to be theoretically efficient, easy to implement and fast in practice. We believe this can largely be attributed to the use of the CGM model which accounts for distributed memory effects, mixes sequential and parallel coding, and encourages the use of a constant or very small number of communication rounds. Note that the use of the CGM model, which was primarily developed for algorithm design in the context of interconnection networks, has led to efficient implementations even in the context of a bus-based network like Ethernet. We speculate that this is due to several factors including: 1) the model focuses on sending a small number of large messages rather than a large number of small ones 2) it relies on standard, and typically well optimized, communications operations and 3) it focuses on reducing the number of communication rounds and therefore the number on interdependencies between rounds. Of course at some point such bus-based networks always become saturated and more attention must be paid to bandwidth and broadcast conflict concerns, particularly as one scales up. We are currently exploring how such concerns can best be dealt with within the context of a CGM-like model.
References 1. A.A. Bertossi and M.A. Bonuccelli. Some Parallel Algorithms on Interval Graphs. Discrete Applied Mathematics, 16:101–111, 1987. 878 2. E. Caceres, F. Dehne, A. Ferreira, P. Flocchini, I. Rieping, A. Roncato, N. Santoro, and S. Song. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In Proc. of ICALP’97, pages 131–143. Lecture Notes in Computer Science. Springer-Verlag, 1997. 876
886
A. Ferreira et al.
3. R. Cole and U. Vishkin. The accelerated centroid decomposition technique for optimal tree evaluation in logarithmic time. Algorithmica, 3:329–346, 1988. 876, 877 4. D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Shacuser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. 4th ACM SIGPLAN Symp. on Princ. and Practice of Parallel Programming, pages 1–12, 1993. 876 5. F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. 9th ACM Symp. on Computational Geometry, pages 298–307, 1993. 876, 877 6. M. Diallo, A. Ferreira, and A. Rau-Chaplin. Communication-efficient deterministic parallel algorithms for planar point location and 2d Voronoi diagram. In Proceedings of the 15th Symposium on Theoretical Aspects of Computer Science – STACS’98, Lecture Notes in Computer Science, Paris, France, February 1998. Springer Verlag. 876 7. A. Ferreira, I. Guerin-Lassous, K. Marcus, and A. Rau-Chaplin. Parallel computation of interval graphs on PC clusters: Algorithms and experiments. RR LIAFA 97-30, University of Paris 7, http://www.liafa.jussieu.fr/˜guerin/biblio.html, 1997. 879, 880, 881 8. A. Ferreira, C. Kenyon, A. Rau-Chaplin, and S. Ub´eda. d-dimensional range search on multicomputers. Algorithmica, in press. Special Issue on Coarse Grained Algorithms. 876 9. A. Ferreira and M. Morvan. Models for parallel algorithm design: An introduction. In A. Migdalas, P. Pardalos, and S. Storoy, editors, Parallel Computing in Optimization, pages 1–26. Kluwer Academic Publisher, Boston (USA), 1997. 876 10. A. Ferreira, A. Rau-Chaplin, and S. Ub´eda. Scalable 2d convex hull and triangulation for coarse grained multicomputers. In Proc. of the 6th IEEE Symposium on Parallel and Distributed Processing, San Antonio, USA, pages 561–569. IEEE Press, October 1995. 876 11. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R Manchek, and V. Sunderman. PVM: Parallel Virtual Machine - A Users’ Guide and Tutorial for Networked Parallel Computing, 1994. 882 12. A.V Gerbessiotis and L.G Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, pages 251–267, 1994. 882 13. M.T. Goodrich. Communication-efficient parallel sorting. In Proc. of 28th Symp. on Theory of Computing, 1996. 876, 877 14. S.K. Kim. Optimal Parallel Algorithms on Sorted Intervals. In Proc. 27th Annual Allerton Conference Communication, Control and Computing, volume 1, pages 766–775, 1990. 876, 877, 880 15. A. Moitra and R. Johnson. PT-Optimal Algorithms for Interval Graphs. In Proc. 26th Annual Allerton Conference Communication, Control and Computing, volume 1, pages 274–282, 1988. 876, 877 16. S. Olariu. Parallel graph algorithms. In A. Zomaya, editor, Handbook of Parallel and Distributed Computing, pages 355–403. McGraw-Hill, 1996. 876, 877 17. S. Olariu, J.L. Schwing, and J. Zhang. Optimal Parallel Algorithms for Problems Modelled by a Family of Intervals. IEEE Transactions on Parallel and Distributed Systems, 3(3):364–374, 1992. 877, 881, 882 18. L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103–111, 1990. 876
Adaptable Distributed Shared Memory: A Formal Definition Jordi Bataller and Jos´e M. Bernab´eu-Aub´an Dept. de Sistemes Informatics i Computaci´ o Universitat Polit`ecnica de Val`encia
Abstract. In this paper we introduce a new model of consistency for distributed shared memory called Mume. It offers only the essentials to be considered as a shared memory system. This allows an efficient implementation on a message passing system, and due to this, it can be used to emulate other memory models.
1
Introduction
Distributed shared memory (DSM) is a paradigm for programming parallel and distributed systems. It offers the agents of the system a shared space address that they use to communicate each other. The main problem of a DSM implementation is performance, specially when it is implemented on top of a message passing system. This, of course, depends on the consistency the DSM system offers. Many systems have been proposed, each one supporting a different level of consistency. Unfortunately, experience has shown that none is well suited for the whole range of problems. This is also true for different implementations of the same memory model, since no single one is able to efficiently support all of the possible data access patterns. In this paper, we introduce a novel model called Mume. Following the direction of the main techniques to improve shared memory systems, Mume tries to give the programmer means to tell the system how to carry out the work more efficiently. Mume is a low level layer close to the level of the message passing interface. Mume interface only offers the strictly necessary requirements to be considered as shared memory thus allowing an efficient implementation. The interface also includes three types of synchronization primitives, namely, total ordering, causal ordering and mutual exclusion. Because the Mume interface is low enough, it can be used either to develop final applications or to build new primitives and use them in turn to develop final programs. This latter possibility allows, for example, to emulate the main memory models proposed to date. This paper describes Mume from a formal point of view.
This work has been supported in part by research grants TIC93-0304 and TIC960729
D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 887–891, 1998. c Springer-Verlag Berlin Heidelberg 1998
888
2
Jordi Bataller and Jos´e M. Bernab´eu-Aub´ an
Formal Definition of Mume
The formalism we use was presented in [3]. It is a simple extension of the I/O automata model, [5]. We define a memory model by giving the set of allowable executions for it. An execution is an ordered sequence of actions. The basic orders we consider for an execution α are Process Order (<α P O ), Writes-to (→) and Causal Relation (<α CR ). Process Order is the order in which a processor issues actions. Writes-to is the order defined between a write and the read that gets the value left by the write. Causal Relation is the transitive closure of the sum of PO and →. We also consider the Mutual Exclusion (<α ME ) order for executions produced by models that include a synchronization primitive. Obviously, the order is defined only on the synchronized actions and they are ordered as they appear in the execution. An execution is admissible for a model if it can be rearranged in a way that the new sequence respects a given ordering and it is serial. An execution is serial if every read gets the value written by the immediately preceding write on the same variable. We express this concept by saying that a given execution is <α ? -serializable, this is, serializable respecting the ordering <α ?. In the case of Mume, the actions or primitives which compose executions are classified into ordinary ones (read and write) and synchronizations (sync). Mume considers that there exists a set of independent memory deposits (caches). Caches apply ordinary actions as they arrive, if they are not synchronized. Ordinary actions must be labelled to indicate to which cache they should be applied. Of course, read actions can only operate at one cache. Formally, the ordinary actions are: write(i, x, v)[A] and read(i, x, v)[a], where i ∈ P (set of processes), x ∈ VD (set of data variables), v is the value being read or written and A ⊆ C, a ∈ C (set of caches). Because Mume ordinary actions are independent of each other by default, it is necessary to synchronize them to set the order in which they are applied to caches. The synchronizations are the following ones: T Osync(i, sT O , t), CRsync(i, sCR , t) are M Esync(i, sME , t), where i ∈ P, sT O ∈ VST O , sCR ∈ VSCR , sME ∈ VSME (sets of synchronization variables) and t ∈ S = {acq, rel} (synchronization type : acquire or release). The actions issued by a process between an acquire and its following release are the ones that are synchronized. We say that they are X-synchronized being X the type of synchronization: TO, CR or ME. Actions of different processors are related if they are synchronized by synchronizations acting on the same variable. We say that they are X-related, being X TO, CR or ME. The meaning of each type of synchronization is the following. T Osync ensures that actions TO-related are performed in the same total order to any cache. It also ensures that if two actions of the same process are TO-related, then, Process Order is respected. CRsync maintains causal consistency: it ensures that actions CR-related are applied to any cache in an order consistent with the causal dependences among the CR-related actions. M Esync is the strictest synchronization. It behaves like a TO-synchronization and, in addition, it guarantees mutual exclusion. Furthermore, when a process obtains
Adaptable Distributed Shared Memory: A Formal Definition
889
the exclusion it is sure that it will see all the writes made by all processes which held the exclusion in the past. Formally, the Mume memory model is defined by requiring its executions to satisfy the properties TOsync, CRsync, and MEsync. Those properties are defined as follows: – TOsync: an execution α respects T Osync iff ∀c ∈ C α is (<α T Osync |c)serializable. – CRsync: an execution α respects CRsync iff ∀c ∈ C α is (<α CRsync |c)serializable. – MEsync: an execution α respects M Esync iff ∀c ∈ C, α is (<α MEsync |(c, sync))-serializable. In this definition, <α Xsync is the ordering that is defined considering only the X-related actions of an execution. |c denotes the restriction of to c.
3
Emulation of Known DSM Models
The primitives of Mume can be used to emulate other memory models including Sequential consistency, PRAM, Causal consistency, Processor consistency, Cache consistency, Slow consistency, Local consistency as well as synchronized models such as Release consistency. The references for these models can be found in [3]. We want to point out that each model can be emulated in different ways. For instance, any model can be trivially emulated by using a single cache for all the processes. But as many systems are implemented this way, we have chosen to give emulations in which one memory deposit is used per process, writes act on every cache, and reads act on the “local” cache. Now, as an example, we explain how Mume can emulate some of the cited models. The proofs can be found in [2]. Sequential. The executions of this model satisfy that they are
890
Jordi Bataller and Jos´e M. Bernab´eu-Aub´ an
Causal. The executions of this model satisfy that, for all process i,they are
4
Related Work and Concluding Remarks
In this paper we have defined a new memory model called Mume. Mume ordinary operations are close to the message passing interface since the programmer has to declare to which caches they have to be applied. In spite of this, ordinary operations still possess memory semantics because a write can be hidden by a subsequent one, this is, not all written values have to be read. Apart of ordinary operations, Mume also have three types of synchronizations in order to relate the ordinary ones. One of the choices followed to improve the performance of shared memory systems has been to give the programmer means to tell the system useful information so the system can know which is the best way to do things or when
Adaptable Distributed Shared Memory: A Formal Definition
891
actions should be performed. An example of this are synchronized models such as Release consistency. These models use synchronizations not only to achieve mutual exclusion but also to tell the system when to update the data. Another case, [4], is to allow the programmer to classify the variables of his programs according to the expected access pattern of each one. This permits the system to apply the best strategy to maintain the consistency of each variable. Compared to other synchronized models, Mume has two synchronizations more. They help to relate ordinary actions without the need of using mutual exclusion. Furthermore, because the ordinary actions of Mume are marked with the target cache, the programmer has more means to influence the performance than in other systems. For example, some systems have to use complex protocols (write-shared protocols) in order to eliminate the problem of false sharing. The overhead introduced is significant and may reduce the gains dramatically. In Mume, the programmer can eliminate false sharing by fine tuning concrete and particular points of his program, carefully choosing the caches to where ordinary actions are applied. But it is not mandatory to use Mume at the lowest level. Just because it is low level, libraries can be developed in order to emulate different memory models or to support different access patterns to variables. If libraries are used, the programmer still have the choice of fine tuning a program. The “high level” program can be translated to the level of the native primitives where such tuning can be done. Because we are concerned with formal issues, we have presented here the formal definition of Mume. You can find a more elaborate version of this work along with the correctness proofs of the emulations of other memory models in [2]. You can also see [1] to find a wider discussion on the practical issues of Mume. We are currently developing an implementation of Mume for a loosely coupled system in order to compare it with other memory systems and to evaluate its expected benefits.
References 1. J. Bataller and J. Bernabeu. Constructive and adaptable distributed shared memory. In Proc. of the 3rd Int’l Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 15–22, March 1998. 891 2. J. Bataller and J. Bernabeu-Auban. An adaptable dsm model: a formal description. Technical Report II-DSIC-17/97, Dep. de Sistemes Informatics i Computacio (DSIC), Universitat Politecnica de Valencia (Spain), 1997. 889, 891 3. J. Bataller and J. Bernabeu-Auban. Synchronized DSM models. In Europar’97. Third Intl. Conf. on Parallel Processing, Lecture notes on computer science, pages 468–475. Springer-Verlag, August 1997. 888, 889 4. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205–243, August 1995. 891 5. N. Lynch. I/O Automata: A model for discrete event system. Technical Report MIT/LCS/TM-351, Laboratory for Computer Science, Massachusetts Institute of Tecnology, March 1988. 888
Parameterized Parallel Complexity Marco Cesati1 and Miriam Di Ianni2 1
Dip. di Informatica, Sistemi e Produzione, Univ. “Tor Vergata” via di Tor Vergata, I-00133 Roma, Italy. [email protected] 2 Istituto di Elettronica, Universit` a di Perugia, via G. Duranti 1/A, I-06123 Perugia, Italy. [email protected]
Abstract. We introduce a framework to study the parallel complexity of parameterized problems, and we propose some analogs of NC.
1
Introduction
The theory of NP-completeness [8] is a theoretical framework to explain the apparent asymptotical intractability of many problems. Yet, while many natural problems are intractable in the limit, the way by which they arrive at the intractable behaviour can vary considerably. For instance, deciding if the nodes of a graph can be properly colored by k colors is NP-complete even for a constant k ≥ 3 [8], while there exists an algorithm deciding if the edges of a n nodes graph can be covered by k nodes in time O(n) for any fixed value of k. The parameterized complexity setting [6,7] has been introduced in order to overcome the intrinsic inability of the standard NP-completeness model to give insight into this variety of behaviours. In this paper we investigate the issue of which problems do admit efficient fixed parameter parallel algorithms. A first attempt to formalize the concept of efficiently fixed parameter parallelizable problems has been pursued by Bodlaender, Downey and Fellows: in a one-page abstract [3] they suggested the introduction of the class PNC as the parameterized analogue of NC. However, neither theoretical results nor applications to concrete natural problems were presented. We now want to give a deeper insight to such concept. According to the degree of efficiency we are interested in, several kinds of efficient parallelization for parameterized problems can be considered. In section 2, after reviewing some basic concepts of the parameterized complexity theory, we define the two classes of efficiently parallelizable parameterized problems, PNC and FPP, and we study their relationship with the class of sequentially tractable parameterized problems (FPT). We also present a non trivial tool for proving FPP-membership of parameterized graph problems based on the concept of treewidth and on the results in [1,4,2]. In section 3 we study the relationship between NC, PNC, and FPP. In section 4 we give two alternative characterizations of both FPP and D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 892–896, 1998. c Springer-Verlag Berlin Heidelberg 1998
Parameterized Parallel Complexity
893
PNC, and we use them to prove the PNC-completeness of two parameterized structural problems.
2
The Classes PNC and FPP
Let Σ be a finite alphabet. A parameterized problem is a set L ⊆ Σ ∗ × Σ ∗ . Typically, the second component represents a parameter k ∈ IN. The kth slice of the problem is defined as Lk = { x ∈ Σ ∗ : x, k ∈ L }. The class FPT of fixed parameter tractable problems contains all parameterized problems that have a α solving algorithm with running time bounded by f (k) |x| , where x, k is the instance of the problem, k is the parameter, f is an arbitrary function and α is a constant independent of x and k. From now on, with the term “parallel algorithm” we always refer to a PRAM algorithm. The first class of efficiently parallelizable parameterized problems, PNC, has been defined in [3]. PNC (parameterized analog of NC) contains all parameterized problems which have a parallel solving algorithm with at β most g(k) |x| processors and running time bounded by f (k)(log |x|)h(k) , where x, k is the instance of the problem, k is the parameter, f , g and h are arbitrary functions, and β is a constant independent of x and k. The definition of PNC is coherent with the definition of FPT; indeed we can prove that PNC is a subset of FPT. A drawback with the definition of PNC is the exponent in the logarithm which bounds the running time. We thus introduce the class of fixed-parameter parallelizable problems FPP that contains all parameterized problems having a β parallel solving algorithm with at most g(k) |x| processors and running time bounded by f (k)(log |x|)α , where x, k is the instance of the problem, k is the parameter, f and g are arbitrary functions, and α and β are constants independent of x and k. Observe that, by definition, FPP ⊆ PNC. Several parameterized parallel problems can be proved to be included in FPP by directly showing a PRAM algorithm for them. Among them we recall parameterized Vertex Cover and Max Leaf Spanning Tree. However, we can now present a different way for proving FPP-membership, based on the concept of treewidth. Treewidth was introduced by Robertson and Seymour [9], and it has proved to be a useful tool in the design of graph algorithms. Bodlaender and Hagerup [4] showed that there exists an optimal parallel algorithm on a EREW PRAM using O((log n)2 ) time and O(n) space which is able to construct a minimum-width tree decomposition of G or correctly decide that tw (G) > k. Therefore parameterized Treewidth belongs to FPP. The concept of treewidth turns out to be an useful tool for proving FPPmembership. In [1] it is shown that many problems that are (likely) intractable for general graphs are in NC when restricted to graphs of bounded treewidth. In particular, the authors prove that some graph properties (MS and EMS properties) are verifiable in NC for graphs of bounded treewidth. A careful study of the proofs in [1] reveals that such properties are also verifiable in FPP under the same hypothesis, since any optimal tree decomposition of width k can be transformed in O(log n) time and O(n) operations on a EREW PRAM into
894
Marco Cesati and Miriam Di Ianni
a binary tree decomposition of depth O(log n) and width at most 3k + 2 [4]. Therefore: all problems involving MS or EMS properties are in FPP when restricted to graphs of parameterized treewidth. Thus, for any (E)MS property P , the following Bounded Treewidth Graphs Verifying P problem belongs to FPP: given a graph G, is G satisfying P and such that tw (G) ≤ k (k being the parameter)? Lists of graph problems defined over (E)MS properties can be found in [2]. In a few cases it is possible to prove FPP-membership even without restrictions on the treewidth since some properties directly imply a bound on the treewidth. Therefore several natural problems (like Feedback Vertex Set) are in FPP, because (i) yes-instances have bounded treewidth, and (ii) the corresponding property is (E)MS.
3
Relationship Between FPP, PNC, and NC
Trivially, the slices of every parameterized problem belonging to FPP or PNC are included in NC. It is now worthwhile to verify whether the converse is true, that is, whether the parameterized versions of all problems whose slices belong to NC are included in FPP or PNC. Consider the Clique problem. Each slice Cliquek is trivially in NC. Indeed, Cliquek can be easily decided by a parallel algorithm that uses O(nk ) processors and requires constant time. Since Clique is likely not in FPT [7], it is also likely not in PNC. Thus, slice-membership to NC is not a sufficient condition for membership to PNC. Therefore, the classes FPP and PNC are somewhat “orthogonal” to NC, as much as the definition of FPT is orthogonal to the one of P. Nevertheless, there is a strong relation between P-completeness and FPT-completeness, where the completeness for FPT is defined with respect to parameterized reductions preserving membership to FPP or PNC. Indeed, consider the Weighted Circuit Value problem WCV: its instance consists of the description of a logical decision circuit C and of an input word w; the parameter k is the Hamming weight of w, and the question is whether C accepts w. We have proved the following. Lemma 1. WCV is in FPT, while each slice WCVk is P-complete. Previous lemma has an important consequence. Let L, L be parameterized problems. We say that L FPP-reduces (PNC-reduces) to L if there are an FPP-algorithm (PNC-algorithm) Φ and a function f such that Φ(x, k) computes x , k = f (k) and x, k ∈ L if and only if x , k ∈ L . It is not hard to prove the following. Corollary 1. Let ≤Φ be either the FPP-reduction or the PNC-reduction. If there exists an FPT-hard problem L with respect to ≤Φ such that all its slices Lk are included in NC, then P=NC. Corollary 1 implies that every FPT-hard parameterized problem must have at least one slice which is not in NC unless P = NC. Thus, the notion of FPTcompleteness does not add anything to the classical notion of P-completeness
Parameterized Parallel Complexity
895
to characterize hardly parallelizable parameterized problems. In particular, to distinguish between parallel intractability in the two contexts, we must find an hardly parallelizable parameterized problem with slices in NC. This is the aim of next section. However, we can hope to characterize “FPP’s intractability” by proving the PNC-completeness of some problem (in the hypothesis FPP = PNC), but we have no way to characterize “PNC’s intractability”.
4
Alternative Characterizations of FPP and PNC
NC is primarly defined as the class of problems that can be decided by uniform families of polynomial size, polylogarithmic depth circuits. Successively, it has been proved that uniform families of circuits and PRAM algorithms are polynomially related. In this paper, we have taken the inverse approach, by initially defining the classes FPP and PNC in terms of PRAM algorithms. In this section, we extend the relation between PRAM algorithms and uniform families of circuits to the parameterized setting. In this case, uniform means that the circuit Cnk , able to decide instances of size n when the value of the parameter is k, can be derived in space f (k)(log n)α , for some function f and constant α. It is possible to prove the following theorem by essentially using the same proof technique in [10]. Theorem 1. A uniform circuit family of size s(n, k) and depth d(n, k) can be simulated by a PRAM-PRIORITY algorithm using O(s(n, k)) active processors and running in time O(d(n, k)). Conversely, there are a constant c and a polynomial p such that, for any FPP (PNC) algorithm Φ operating in time T (n, k) with processor bound P (n, k), there exist a constant dΦ and, for any pair n, k, a circuit Cnk of size at most dΦ p(T (n, k), P (n, k), n) and depth c T (n, k) realizing the same input-output behaviour of Φ on inputs of size n. Furthermore, the family {Cnk } is uniform. The previous theorem allows us to show the first PNC-complete problem. It is denoted as Bounded Size-Bounded Depth-Circuit Value Problem (in short, BS-BD-CVP) and is defined as follows: given a constant α, three functions f , g and h, a boolean circuit C with n input lines, an input vector x and an integral parameter k, decide if the circuit C (having size g(k) nα and depth h(k)(log n)f (k) ) accepts x. We have proved the following. Corollary 2. BS-BD-CVP is PNC-complete with respect to FPP-reductions. A second characterization of the classes of parameterized parallel complexity is based on random access alternating Turing machines. There is a strong relation between the time required by a random access alternating Turing machine and the depth of a simulating circuit, and between the space required by a random access alternating Turing machine and the size of a simulating circuit [5]. Such relation also holds in the parameterized setting. For the sake of brevity, we do not state the corresponding theorem. We only want to remark that such a characterization allows us to define another PNC-complete problem,
896
Marco Cesati and Miriam Di Ianni
Random Access Alternating Turing Machine Computation (in short, RA-ATMC): given a random access alternating Turing machine AT , an input word x and an integral parameter k, does AT (x) halts within O(f (k)(log |x|)k ) steps with SPACE*(n) ∈ O(g(k) log n)? Corollary 3. RA-ATMC is PNC-complete with respect to FPP-reductions.
References 1. Stefan Arnborg, Jens Lagergren, and Detlef Seese. Easy problems for treedecomposable graphs. Journal of Algorithms, 12, 308–340, 1991. 892, 893 2. Hans L. Bodlaender. A partial k-arboretum of graphs with bounded treewidth. Technical Report, Department of Computer Science, Utrecht University, 1995. 892, 894 3. Hans L. Bodlaender, Rodney G. Downey, and Michael R. Fellows. Applications of parameterized complexity to problems of parallel and distributed computation, 1994. Unpublished extended abstract. 892, 893 4. Hans L. Bodlaender and Torben Hagerup. Parallel algorithms with optimal speedup for bounded treewidth. Technical Report UU-CS-1995-25, Department of Computer Science, Utrecht University, 1995. 892, 893, 894 5. A. K. Chandra, D. C. Kozen, and L. J. Stockmeyer. Alternation. J. of the ACM, 28, 114–133, 1981. 895 6. Rodney G. Downey and Michael R. Fellows. Parameterized computational feasibility. Feasible Mathematics II, 219–244. Birkh¨ auser, Boston, 1994. 892 7. Rodney G. Downey and Michael R. Fellows. Fixed-parameter tractability and completeness II: On completeness for W [1]. TCS, 141, 109–131, 1995. 892, 894 8. Michael R. Garey and Davis S. Johnson. Computers and Intractability — A Guide to the Theory of N P -Completeness. W. H. Freeman and Co., New York, 1979. 892 9. Neil Robertson and Paul D. Seymour. Graph Minors. II. Algorithmic aspects of tree-width. Journal of Algorithms, 7, 309–322, 1986. 893 10. Larry Stockmeyer and Uzi Vishkin. Simulation of parallel random access machines by circuits. SIAM J. Comput., 13(2), 409–422, 1984. 895
Asynchronous (Time-Warp) Versus Synchronous (Event-Horizon) Simulation Time Advance in BSP Mauricio Mar´ın Programming Research Group, Computing Laboratory University of Oxford [email protected] Abstract. This paper compares the very fundamental concepts behind two approaches to optimistic parallel discrete-event simulation on BSP computers [10]. We refer to (i) asynchronous simulation time advance as it is realised in the BSP implementation of Time Warp [3], and (ii) synchronous time advance as it is realised in the BSP implementation of Breathing Time Buckets [8]. Our results suggest that asynchronous time advance can potentially lead to more efficient and scalable simulations in BSP.
1
Introduction
As BSP is a (bulk) synchronous model of parallel computing [10], the obvious method of parallel discrete-event simulation [2,7] on BSP computers seems to be synchronisation protocols based on synchronous time advance, such as Breathing Time Buckets [8]. In this paper we show that using synchronous time advance on a BSP computer might not be a good idea for two reasons. Firstly, synchronous simulation methods force the execution of periodical parallel minreductions during which no interesting advance in simulation time can be made. These operations take some number of BSP supersteps to complete [10], and the simulation is stopped during their execution. The cumulative cost of these operations is relevant since the total number of min-reductions is equal to the total number of supersteps used to process the simulation events. Secondly and more importantly, synchronous methods impose barriers in simulation time which tend to reduce the rate of time advance per superstep of the simulation. This increases the total number of supersteps dedicated to parallel event-processing. Equivalently, time barriers tend to decrease the number of events that are processed in each superstep. In BSP, supersteps are delimited by the barrier synchronisation of all the processors. In turn this operation takes some amount of real time to complete and it can become comparatively expensive in simulation models where the amount of computation involved in the processing of events is small (e.g., queuing networks). As many real life simulations can easily demand the execution of hundreds of thousands of event-processing supersteps, a significant reduction of the total number of supersteps can also cause a significant reduction of the total D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 897–905, 1998. c Springer-Verlag Berlin Heidelberg 1998
898
Mauricio Mar´ın
running time of the simulation. In addition, the cost of the remaining barrier synchronisation of processors can be amortised by the processing of a comparatively larger number of events per superstep. On the other hand, approaches based on asynchronous time advance do not stop the simulation to perform periodical min-reductions, and simulation time barriers are not required (cf., [6]). For symmetric work-loads and excluding minreduction supersteps, we have observed that as the size of the simulation model scales up, the rate of simulation time advance per superstep decreases comparatively faster using synchronous time√advance. In particular, synchronous time advance requires on the average O( P /ln ln P ) times more event-processing supersteps than asynchronous time advance, with P being the total number of logical processes (LPs). Moreover, for any simulation model, we show that the number of supersteps for the asynchronous approach is at most the supersteps required by the synchronous approach, and in the end the total number of supersteps required by the asynchronous approach is optimal for any model. In addition, for the same symmetric work-loads, we have observed that load balance tends to be better under asynchronous time advance for moderate level of aggregation of LPs onto processors (here load balance refers to balance in computation and communication as it is understood in BSP), whereas load balance is optimal in both approaches when the aggregation of LPs is large enough (domain of current practical simulations). Obviously the nature of simulation work-loads is completely irregular and it is hard (perhaps impossible) to draw general conclusions about the overall running time achieved by protocols based on the two approaches analysed in this paper. We contribute with the analysis of an important performance indicator, namely total number of supersteps. In our view, the results presented in this paper suggest that the domain of simulation models and current BSP machines in which protocols based on synchronous time advance can be efficient should be expected to be limited. As to the development of general purpose BSP simulation environments, we conjecture that protocols based on asynchronous time advance offer better opportunities to achieve scalable performance.
2
Two Approaches to Time Advance
Approaches based on the event-horizon concept [9], like Breathing Time Buckets [8], advance in simulation time in a synchronous manner by consuming supersteps as in figure 1.a. Note that here we do not consider the additional supersteps required for the min-reductions which compute a new global event-horizon on each cycle. That is, each cycle of the pseudo-code shown in this figure should be followed by a min-reduction. Approaches like Time Warp [3], on the other hand, advance in simulation time in an asynchronous manner [6] by consuming supersteps as in figure 1.b. Note that only silent min-reductions are needed in these approaches, since values such as global virtual time (GVT) [3] are calculated without stopping the simulation. That is, these min-reductions are carried out using the same supersteps used to simulate events. We call these two styles of superstep advance SYNC and ASYNC, respectively.
Lemma 1. The total number of supersteps required by ASYNC is optimal. Proof: Dependencies among events form trees. If the occurrence of event e generates event e1, then e1 is a child of e. If e1 is scheduled to occur in a different processor and e takes place at superstep s, then e1 must be processed at the earliest in superstep s + 1. But the simulation of e1 may be delayed up to some superstep s′ > s + 1 if earlier events take place in e1's processor at superstep s′. In this case, each descendant of e1 may only take place at a superstep ≥ s′, and if a descendant of e1 takes place in another processor, it may occur at the earliest in superstep s′ + 1. In this sense we say that every event has an initial "energy" which indicates the minimum superstep at which it can take place. Figure 1.b helps us to realise that, given any set of events to be processed during the whole simulation, the superstep counters are only updated with the initial energy of chronological events that cannot be processed in earlier supersteps. This rule is inductively applied in each processor from the first to the last superstep. In this way the simulation, as mapped onto the processors, is completed in the minimum number of supersteps. ✷ Lemma 2. The total number of supersteps required by ASYNC is at most the number required by SYNC. Proof: A subset EH of all simulation events consists of the events that mark the event-horizon times. These events are not necessarily causally related and they may be generated in different processors. Also note that these events are the least-timestamped messages buffered during the respective SYNC supersteps. The size |EH| of this subset is the total number of supersteps required by SYNC, since by construction these events cannot occur in the same superstep. Consider the simulation with ASYNC. The first chronological event in EH is necessarily the first event processed by one of the processors in its second superstep. However, the remaining processors may advance farther in time during their first superstep since they are not barrier-synchronised by event-horizon times. This implies that more than one horizon event may be processed in the second superstep of ASYNC. A similar argument applies to the following supersteps, so that, in general, SYNC is not optimal in supersteps. Both approaches require an identical number of supersteps when, for example, all events in EH are causally related so that they must be simulated sequentially. ✷ Lemma 3. Under the assumption of unlimited memory, Time Warp as realised in BSP can approximate the supersteps of ASYNC within log P supersteps. Proof: The lemma follows trivially by letting Time Warp process every available event (with time within the simulation period), correct erroneous computations accordingly, and start a new GVT calculation in each superstep. ✷ Note that processing every available event per superstep may lead to large roll-back overheads. However, near-optimal supersteps can be achieved at low overheads by limiting the number of events processed in each superstep [6].
Generate N initial pending events;
TZ := ∞;      [ event horizon time ]
SZ ← Φ;       [ buffer ]
loop
   if TimeNextEvent() > TZ then
      SStep := SStep + 1; Schedule(SZ);
      TZ := ∞; SZ ← Φ;
   endif
   e := NextEvent();
   e.t := e.t + TimeIncrement();
   p := e.p;                      [ e occurs in processor p ]
   e.p := SelectProcessor();
   if e.p ≠ p then
      SZ ← SZ ∪ {e}; TZ := MinTime(SZ);
   else
      Schedule(e);
   endif
endloop
(a) SYNC
Generate N initial pending events;
[ e.s indicates the minimal superstep at which the event e may take place in processor e.p ]
loop
   e := NextEvent();
   p := e.p;                      [ e occurs in processor p ]
   if e.s > SStep[p] then SStep[p] := e.s; endif
   e.t := e.t + TimeIncrement();
   e.p := SelectProcessor();
   if p = e.p then
      e.s := SStep[p];
   else
      e.s := SStep[p] + 1;
   endif
   Schedule(e);
endloop
The total number of supersteps is the maximum of the P values in array SStep.
(b) ASYNC
Fig. 1. Sequential programs describing the rate of superstep advance in two approaches to parallel simulation. This program, called the hold-model [11], simulates work-loads of systems such as queuing networks. Schedule() stores events in the set of pending events. NextEvent() retrieves from this set the event with the least time. TimeIncrement() returns random time values. SelectProcessor() returns a number between 0 and P − 1 selected uniformly at random. The variable/array SStep maintains the current number of supersteps of the simulated BSP machine.
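As a concrete illustration, the following Python sketch re-creates both superstep counters of figure 1 for the hold model. It is only a sketch: the exponential initial timestamps, the seed and the parameter values are assumptions made for the example, not part of the programs above.

import heapq
import random

def sync_supersteps(N, P, hops, rng):
    """Figure 1.a: event-horizon (SYNC) superstep counter for the hold model."""
    pend = [(rng.expovariate(1.0), rng.randrange(P)) for _ in range(N)]
    heapq.heapify(pend)
    steps, horizon, buf = 0, float("inf"), []
    for _ in range(hops):
        if not pend or pend[0][0] > horizon:   # next event lies beyond the horizon:
            steps += 1                         # close the superstep and deliver the buffer
            for ev in buf:
                heapq.heappush(pend, ev)
            horizon, buf = float("inf"), []
        t, p = heapq.heappop(pend)
        t += rng.expovariate(1.0)
        q = rng.randrange(P)
        if q != p:                             # remote event: buffered until the next superstep
            buf.append((t, q))
            horizon = min(horizon, t)
        else:
            heapq.heappush(pend, (t, q))
    return steps

def async_supersteps(N, P, hops, rng):
    """Figure 1.b: ASYNC counter; each event carries the earliest superstep it may use."""
    pend = [(rng.expovariate(1.0), rng.randrange(P), 0) for _ in range(N)]
    heapq.heapify(pend)
    sstep = [0] * P
    for _ in range(hops):
        t, p, s = heapq.heappop(pend)
        sstep[p] = max(sstep[p], s)
        t += rng.expovariate(1.0)
        q = rng.randrange(P)
        heapq.heappush(pend, (t, q, sstep[p] if q == p else sstep[p] + 1))
    return max(sstep)

rng = random.Random(1)
print(sync_supersteps(1024, 32, 100_000, rng), async_supersteps(1024, 32, 100_000, rng))

For these parameters the SYNC count should come out a few times larger than the ASYNC count, in line with the ratio derived in the next section.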
3
Average Case Analysis
It is not difficult to see that the number of supersteps per unit simulation time, denoted by Sp, required by SYNC and ASYNC is the optimal one, Sp = 1, for a fully connected communication topology when the TimeIncrement function of figure 1 returns 1. However, when this function returns random values, say exponentially distributed, the Sp values increase noticeably, as shown in the following analysis (details in [5]). We assume one LP per processor. Suppose that the initial event-list has N pending events e with timestamps e.t. Let X be a continuous random variable with p.d.f. f(x) and c.d.f. F(x). A hold operation consists of (i) retrieving the event e* with the least timestamp from the event-list, (ii) creating an event e with timestamp e.t = e*.t + X, and (iii) storing e in the event-list. e* is discarded so that N remains constant throughout the whole sequence of hold operations. It is known [11] that after executing a long sequence of hold operations the probability distribution of the times e.t approaches a steady-state distribution with p.d.f. g(y) = (1 − F(y))/µ where µ = E[X]. We represent the e.t values with the random variable Y. The values of Y are measured relative to the timestamp of the last event e* retrieved by the last hold operation. Let G(y) be the c.d.f. of Y.
SYNC: We use the above defined f(x) and g(x) probability density functions to calculate SYNC's Sp, say Sps, as follows. Let A be the minimum of the N random variables Y. Since Prob[A > x] = (1 − G(x))^N, the c.d.f. of A is M_A(x) = 1 − (1 − G(x))^N. Let B be the minimum of the N random variables resulting from the sum of X and Y. The variable B represents the time of the event-horizon. The c.d.f. of X + Y, say F2(x), can be calculated using F2(x) = ∫_0^x F(x − t) g(t) dt or F2(x) = ∫_0^x G(x − t) f(t) dt. The c.d.f. of B is then M_B(x) = 1 − (1 − F2(x))^N. Note that E[B] − E[A] is the average time advance per superstep. Thus if the simulation ends at time T ≫ 1, then it will require an average of T/(E[B] − E[A]) supersteps to complete. Therefore Sps = 1/(E[B] − E[A]). Let us consider the negative exponential distribution with mean µ = 1 for the time increments X. In this case f(x) = g(x) = e^{−x}, therefore E[A] is given by

   E[A] = ∫_0^∞ (1 − M_A(t)) dt = ∫_0^∞ e^{−Nt} dt = 1/N,

and E[B] is

   E[B] = ∫_0^∞ (e^{−t} + t e^{−t})^N dt = Σ_{i=0}^{N} C(N,i) ∫_0^∞ t^i e^{−Nt} dt = (1/N) Σ_{i=0}^{N} C(N,i) · i!/N^i,

where C(N,i) denotes the binomial coefficient. Using the approximation given in [4] (pp. 112-117) we obtain

   E[B] ≈ (5/4)·(1/√N) + (3/4)·(1/N),

so that

   Sps ≈ (4/5)·√N.
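These closed forms are easy to sanity-check numerically. The short Monte Carlo sketch below is an added illustration (it is not part of [5]); the trial count and seed are arbitrary assumptions. It estimates E[A], E[B] and Sps for exponential increments and compares them with the approximations above.

import math
import random

def estimate(N, trials=5000, seed=0):
    """Monte Carlo estimates of E[A] = E[min Y] and E[B] = E[min(X+Y)]
    for X, Y ~ exp(1), compared with 1/N and (5/4)/sqrt(N) + (3/4)/N."""
    rng = random.Random(seed)
    ea = sum(min(rng.expovariate(1.0) for _ in range(N)) for _ in range(trials)) / trials
    eb = sum(min(rng.expovariate(1.0) + rng.expovariate(1.0) for _ in range(N))
             for _ in range(trials)) / trials
    print(f"N={N:5d}  E[A]={ea:.4f} (1/N={1/N:.4f})  "
          f"E[B]={eb:.4f} (approx={1.25/math.sqrt(N) + 0.75/N:.4f})  "
          f"Sps={1/(eb-ea):6.1f} (0.8*sqrt(N)={0.8*math.sqrt(N):6.1f})")

for n in (64, 256, 1024):
    estimate(n)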
Let us now consider SYNC's event-efficiency Ef, say Efs, which is defined as the ratio of the optimal number of events to the average maximum number of events processed in each processor per superstep. This is a measure of load balance in computation and communication for the studied symmetric work-load. On average, a total of m events take place in each superstep. Given a sensible distribution of LPs onto the P processors, by Valiant's theorem [10] we learn that the average maximum number of events per superstep should tend to the optimal m/P as m scales up. Thus for m large enough the efficiency is close to 1. For the exponential distribution, m ≈ (5/4)√N [5,9]. Define D = N/P. Regression analysis on simulation data from the program in figure 1.a (validated with numerical evaluation of the expected maximum of P binomial random variables) produces (asymptotically) for the exponential distribution [5],

   Efs = D^{1/4} / (P^{1/4} ln P + D^{1/4}).
ASYNC: In this case there is no global synchronisation in simulation time. In a given superstep each LP pi simulates events with time less than the minimum event time of all new events (messages) arriving at pi from other LPs by the next superstep. For a large number of LPs (which justifies parallel simulation itself), it is reasonable to assume that a significant number of LPs will work with their own statistically independent local event-horizons, the average instance of this local event-horizon being the average minimum of D independent random variables Z = X + Y. Thus, from the previous results for SYNC, these observations lead to a first view of ASYNC's Sp, say Spa ≈ O(√D), for a fixed number of LPs P. That is, we conjecture that for P ≫ 1 and D ≫ 1, Spa asymptotically tends to the Sp value of a SYNC simulation with just D events. In the following we provide evidence supporting this claim. The LPs advance the simulation by generally different amounts of time in each superstep. Note that the average time advance per superstep is by definition 1/Spa. Let us consider a lower bound Tp for this time advance. Consider all the order statistics associated with the set of N random variables Z = X + Y. Thus Tp = average among the first P values E[Zi], where the lower bound Tp comes from the assumption that the first P order statistics of Z are all located in a different LP. However, the actual average cannot be much larger than Tp since any difference Zi − Zk increases very slowly with N. For example, for the exponential distribution, the maximum ZN increases only logarithmically with N [1]. In this case we conjecture that the true average time advance per superstep is ≈ 1.25/√D. Let us then compare Tp with this quantity. Unfortunately calculating Tp is mathematically intractable. Thus we performed the following experiment. For large N, say N = 10^4, we generated I = 10^3 different instances of a set of N random values X + Y, with X and Y being exponentially distributed. For each instance i we calculated the partial sums Sum[i, P] = (1/P) Σ_{k=1}^{P} Zk with 1 ≤ P ≤ N, so that Tp for a particular D is given by Tp = (1/I) Σ_{i=1}^{I} Sum[i, N/D]. The results show that in the range D ≥ 10, Tp behaves like O(1/√D). Assuming Tp = β P^α, least squares regression on
the points (log Tp, log P) produced α = 0.47738 and β = 0.01194, which should be compared with the conjectured α = 0.5 and β = 0.0125 resulting from the curve 1.25/√D. We now consider the effect of P on Spa when D is fixed. Let χk be the sum of k ≥ 1 random variables with p.d.f. f(x), so that χk represents the time increments of events forming a thread of causally related events. Then, for any positive difference between LP time advances, say ∆ = Tj − Ti, there is some non-zero probability that Tj < Ti + εi + χk < Tj + εj for some εi, εj ≥ 0, such that Ti + εi + χk is the time of the next event ej in LP pj after superstep s. This event ej is processed in some superstep s′ with s + 1 < s′ ≤ s + k + 1. An increase in P implies an increase in the diversity of the Ti values and thereby an increase in the probability of events of type "ej". Thus, for fixed D, Spa may increase by about k supersteps with P. For the exponential distribution the k values are Poisson. The average k for the maximum time interval ∆ = ZN − Z1 is bounded from above by ZN = O(ln N). So a conservative upper bound for Spa is O(ln P). However, an experiment similar to the one described above tells us that ZP − Z1 < ln ln P in the range D ≥ 10. This leads to Spa ≈ (ln ln P)·√D. This expression was validated with simulation results obtained with the program in figure 1.b. Regression analysis on data from the same program produces (asymptotically) for the exponential distribution [5],
   Efa = D^{1/4} / ((ln ln P)^{1/2} ln P + D^{1/4}) · (1 − 1/(ln P ln D)).
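The Tp experiment described above is straightforward to reproduce. The sketch below is a re-rendering of it added for illustration; the instance count, seed and the particular D values are assumptions, not those of the original runs. It prints Tp next to the conjectured 1.25/√D.

import math
import random

def tp_experiment(N=10_000, I=200, seed=2):
    """I instances of N values Z = X + Y with X, Y ~ exp(1); Tp(D) is the mean of
    the smallest P = N/D order statistics, compared with the conjectured 1.25/sqrt(D)."""
    rng = random.Random(seed)
    Ds = (10, 40, 160, 640)
    acc = {D: 0.0 for D in Ds}
    for _ in range(I):
        z = sorted(rng.expovariate(1.0) + rng.expovariate(1.0) for _ in range(N))
        prefix, s = [], 0.0
        for k, zk in enumerate(z, 1):
            s += zk
            prefix.append(s / k)          # Sum[i, P] = (1/P) * sum of the P smallest values
        for D in Ds:
            acc[D] += prefix[N // D - 1]
    for D in Ds:
        print(f"D={D:4d}  Tp={acc[D] / I:.4f}   1.25/sqrt(D)={1.25 / math.sqrt(D):.4f}")

tp_experiment()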
Comments: Note that the ratio Sps/Spa behaves in practice as O(√P) for large systems. For fixed P and large D values, SYNC's efficiency is better than ASYNC's efficiency. For large P's, P^{1/4} goes to ∞ faster than (ln ln P)^{1/2} ln P. Thus for large P and small D (practical simulation models) the efficiency of ASYNC is better than the SYNC efficiency. However, the expression for Efa was obtained considering just one LP per processor. As more LPs are put on the processors, Efa improves noticeably. This is so because the variance of the cumulative sum of co-resident LP time advances is reduced as the number of LPs per processor increases. Consequently, the variance of the number of events simulated in each processor decreases [5]. Conversely, Efs is insensitive to the relation between Dlp and D = m/P, with Dlp being the number of LPs per processor. Another important point here is that we can always move ASYNC towards SYNC by imposing upper limits on the number of events processed in each processor and superstep (this filters peaks). If these limits are sufficiently small, ASYNC degenerates to SYNC [5]. To compare the two approaches on more practical grounds we performed experiments with programs similar to those shown in figure 1. First note that the work-load generated by these programs is equivalent to a fully connected queuing network where each node (LP) contains infinite servers. The service times are given by the time increments of events. In our experiments we considered
a more realistic queuing network: one non-preemptive server per node, exponential service times, fully connected topology, and unlimited queue capacities with a moderate number of jobs flowing throughout the network (closed system). Figure 2 shows the results. The comparison is made in terms of RS = Sps/Spa and RE = Efs/Efa. The plotted data clearly show that ASYNC outperforms SYNC in a wide range of parameters (the minimal value of RS in figure 2.a is 1.9). The data indicate that RS increases as √P and RE decreases when N and P scale up simultaneously. In other words, the results show that ASYNC has better scalability than SYNC. In addition, the level of aggregation of LPs onto processors contributes to further increasing RS and decreasing RE. In particular, for fixed P, the ratio RS increases with Dlp in figure 2.a.
[Plots: (a) RS and (b) RE versus √P, for Dlp = 2, 8, 32 and 128.]
Fig. 2. Fully-connected non-preemptive single-server queuing network with exponential service times. Figure a: RS = supersteps-SYNC/supersteps-ASYNC. Figure b: RE = AvgMaxEv-SYNC/AvgMaxEv-ASYNC where AvgMaxEv is the average maximum number of events per superstep. Dlp = number of LPs per processor, and P = number of processors. Initially D = 8 “jobs” are scheduled in each LP (server).
References 1. R. Felderman and L. Kleinrock. “An upper bound on the improvement of asynchronous versus synchronous distributed processing”. In SCS Multiconference on Distributed Simulation V.22, pages 131–136, Jan. 1990. 902 2. R.M. Fujimoto. “Parallel discrete event simulation”. Comm. ACM, 33(10):30–53, Oct. 1990. 897 3. D.R. Jefferson. “Virtual Time”. ACM Trans. Prog. Lang. and Syst., 7(3):404–425, July 1985. 897, 898 4. D. E. Knuth. “The Art of Computer Programming, Vol. 1, Fundamental algorithms”. Addison-Wesley, Reading, Mass., 1973. 901 5. M. Mar´ın. “Simulation time advance in BSP”. Technical Report Oxford University, May 1998. http://www.comlab.ox.ac.uk/oucl/groups/bsp/. 901, 902, 903
6. M. Mar´ın. “Time Warp On BSP Computers”. Technical Report Oxford University, Feb. 1998. To appear in 12th European Simulation Multiconference. 898, 899 7. D.M. Nicol and R. Fujimoto. “Parallel simulation today”. Annals of Operations Research, 53:249–285, 1994. 897 8. J.S. Steinman. “SPEEDES: A multiple-synchronization environment for parallel discrete event simulation”. International Journal in Computer Simulation, 2(3):251–286, 1992. 897, 898 9. J.S. Steinman. “Discrete-event simulation and the event-horizon”. In 8th Workshop on Parallel and Distributed Simulation (PADS’94), pages 39–49, 1994. 898, 902 10. L.G. Valiant. “A bridging model for parallel computation”. Comm. ACM, 33:103– 111, Aug. 1990. 897, 902 11. J.G. Vaucher. “On the distribution of event times for the notices in a simulation event list”. INFOR, 15(2):171–182, May 1977. 900, 901
Scalable Sharing Methods Can Support a Simple Performance Model
Jonathan Nash
Scalable Systems and Algorithms Group, School of Computer Studies, The University of Leeds, Leeds LS2 9JT, West Yorkshire, UK
Abstract. The Bulk Synchronous Parallelism (BSP) model provides a simple and elegant cost model, as a result of using supersteps to develop parallel software. This paper demonstrates how the cost model can be preserved when developing software for irregular problems, which typically require dynamic load balancing and introduce runtime task dependencies. The solution introduces shared data types within a superstep, which support weakened forms of shared data consistency for scalable performance. An example of a priority queue to support a solution of the travelling salesman problem is given, with predicted and observed performance results provided for 256 processors of a Cray T3D MPP.
1
Introduction
The Bulk Synchronous Parallelism (BSP) model [7,12] provides a simple and elegant cost model, as a result of using supersteps to develop parallel software. A key feature is the independent execution of processors in generating remote accesses, allowing a superstep cost to be characterised by an h-relation [7]. Specifically, given a machine with network performance g, barrier cost L and computational performance s, a superstep can be costed as gh + sw + L, where h and w are the maximum usage of each resource by any processor. The BSP model seems less suited to the efficient support of irregular problems, which require dynamic load balancing and introduce runtime task dependencies [9]. This paper describes work on supporting irregular problems with scalable high performance, while preserving the BSP-style cost model (based on earlier experiences with the WPRAM model [10]). The key idea has been to develop scalable data structures which support dynamic sharing patterns [8,9,3], using highly concurrent implementations provided by the use of weakened forms of shared data consistency [4,1]. This highly concurrent behaviour approximates the superstep operation, allowing the BSP-style cost model to be preserved. Section 2 describes the use of abstractions for sharing in parallel systems. Section 3 uses the example of a priority queue to describe the techniques to support scalable high performance. Section 4 describes the extension of the BSP cost model to characterise the priority queue performance, with Section 5 providing predicted and observed performance, both for the queue and its use in the travelling salesman problem, for 256 processors of a Cray T3D. Section 6 provides a summary and points to some current work.
[Diagram: (a) a generic SADT, layered as external interface, data access functions, global consistency protocols, local data and support libraries; (b) PriQueue#1, with Enqueue/Dequeue access to locally ordered datasets, dynamic data migration, and libraries for serial heap access and the migration strategy; (c) PriQueue#2, with Enqueue/Dequeue access to locally ordered datasets, no explicit consistency protocol, and libraries for serial heap access and global array access.]
Fig. 1. (a) Generic SADT; (b) PriQueue#1 SADT; (c) PriQueue#2 SADT
2
Abstractions for Sharing in Parallel Systems
The programming framework is based on the idea of shared abstract data types (SADTs) [5]. An abstract data type (ADT) encapsulates local data, and provides methods to change the state of the data (and return some result). An SADT augments an ADT to include concurrent method invocation, and must specify when changes in state are visible to the methods. SADTs support the key sharing patterns in a parallel system [1,2,6], allowing the encapsulation of possibly complex concurrency issues. This well defined layer of abstraction allows for optimised implementations on specific platforms, code portability and re-use. The diagrams in Figure 1 point to the four main features of an SADT, both in generic terms, and related specifically to the support of two alternative priority queue (PriQueue) SADT implementation strategies. A more detailed description can be found in [3]. Local data: Each processor maintains a local state. This data might be replicated or partitioned across the processors, depending on the access patterns which the SADT generates. The PriQueue maintains a partitioned segment of the data set on each processor. Data access functions: In order to support the external interface methods, the internal SADT data will have to be accessed. The first PriQueue implementation adopts a bulk synchronous approach in which the Enqueue and Dequeue operations will access local data items, with dynamic data migration occurring within regular superstep periods [3]. The second implementation supports a shared data structure across the processors, and so Enqueue and Dequeue operations may result in the access of remote data. Global consistency: This is typically only explicitly defined if the data access functions, defined above, make only local data accesses. In the case of the first PriQueue, the consistency is specified by the load balancing superstep frequency and the number of data items migrated. The second implementation does not contain any explicit consistency routines, since a shared data structure is used to directly store and retrieve the data items. Support libraries: Both the data access functions and the global consistency protocols will typically need to make use of serial and parallel libraries.
The first PriQueue requires libraries to support a serial heap for local data access, and various migration strategies are supported for the load balancing superstep. The second implementation also makes use of the heap library, and also of routines to support the access of a shared array data structure, so that data items on other processors may be accessed [3] (see Section 3.3). SADTs are utilised within a bulk synchronous superstep, to support dynamic data sharing patterns in a portable and scalable manner:

   Enqueue sequence #1; Barrier
   Enqueue sequence #2; Barrier
   Dequeue operations

If a FIFO Queue SADT is being accessed [8], with the FIFO ordering defined at the barrier points (and an arbitrary ordering between barriers), then the Dequeue operations would initially retrieve items from the first Enqueue sequence. If the SADT was the PriQueue defined earlier, the first barrier could be removed, since this will not affect the ordering of the data items (an explicit priority ordering is maintained). The second barrier guarantees that all Enqueue operations have completed, and the following Dequeues will begin to access the globally highest priority items. This barrier could also be removed if this strict guarantee is not required, allowing processors to access the currently highest priority tasks (or if the creation of the data items is depicted by runtime task dependencies [3]). While a serialised implementation of the PriQueue might be practical for a small-scale parallel machine, highly parallel computers consisting of tens or hundreds of processors require high degrees of concurrency if performance is to scale. The next section describes how this has been achieved for the PriQueue, based around the second PriQueue implementation strategy described above.
3
Techniques for Supporting Scalable Performance
The example of the travelling salesman problem (TSP) is given here, to illustrate the use of the PriQueue SADT (full details of the solution are presented in [3]). As shown in Figure 2(a), the solution of the TSP makes use of the PriQueue to store and retrieve new tasks, representing tours, which are dynamically created within a tree-based computation. The priority is related to the length of the (partially generated) tour. An Accumulator SADT is also used to note the current best tour, to which new tours can be compared (for reasons of brevity, further details of this SADT are excluded). A lower level lightweight Lock SADT is also used to support the operation of the PriQueue and Accumulator [3]. The next sections describe the use of weak data consistency, bulk synchronous parallelism 1 and fine-grain parallelism to support the PriQueue with scalable performance. Specific implementation details are provided for the Cray T3D. 1
The algorithm which solves the TSP in fact only uses a single superstep
Fig. 2. (a) Structure of the TSP solution; (b) PriQueue implementation

Table 1. The SHMEM operations
Bulk synchronous operations | Action
Get(buf, x, Buf, i)  | buf[0..x−1] = Buf_i[0..x−1]
Put(buf, x, Buf, i)  | Buf_i[0..x−1] = buf[0..x−1]
Quiet()              | suspend upon completion of Put(s)
Barrier()            | synchronise all processors
Fine-grain operations | Action
w = Swap(c, W, i)    | w = W_i; W_i = c
w = Inc(W, i)        | w = W_i; W_i = W_i + 1
Wait(w, c)           | suspend until location w = c
3.1
Weak Data Consistency
A key characteristic for scalable performance was to use weakened data consistency [4,1]. Sequential consistency can provide a global priority ordering on the PriQueue elements, but typically through the use of a serialising lock. This becomes the bottleneck in large systems, limiting performance. The implementation here removes this lock by removing the strict ordering guarantee. Each processor 0..p − 1 holds a locally ordered data segment, as shown in Figure 2(b). These are accessed cyclically by the processors when storing and retrieving data, distributing the priorities approximately evenly. A processor is then only guaranteed to remove one of the p highest priorities (Section 5 will show that this does not have any substantial effect on the operation of the algorithm). This approach provides for a highly concurrent implementation (see below), allowing scalable performance characteristics as the number of processors grow. 3.2
Bulk Synchronous Parallelism
For the implementation study on the Cray T3D, the SHMEM library of operations has been used, as given in Table 1. In the table, Vi denotes a variable V which is accessed on processor i, in contrast to v, which is a local variable. SHMEM allows the direct access of remote memory locations through the specification of a processor-address pair, supporting high bandwidth and low latency operations. The BSP programming model can be supported through the use of
the Put and Get operations, which write to and read from remote memories respectively, and a Barrier operation. Put supports the pipelining of multiple requests, to amortize the network latency cost (and allow overlapping of communication with subsequent local computation), with Quiet being used to suspend until these complete. The Put and Get operations can be used to maintain a priority ordering on a given remote data segment of the PriQueue, through a simple binary search method.

Fig. 3. (a) Modelling SHMEM; (b) SHMEM costing; (c) Cost values
(a) [Diagram: processor/memory pairs exchanging Get and Put traffic, costed g per word at each end, and Swap and Inc traffic with round-trip latency D.]
(b) Operation | Cost
    Swap/Inc  | D + 2g
    Get       | D + 4gx
    Put       | 2gx
    Quiet     | D
    Wait      | −
    Barrier   | L
(c) Cost | Value
    g | 0.04-0.06
    D | 0.75-1.27
    L | 2.00
    s | 0.0533
    t | 0.0664
Fine-Grain Parallelism
The PriQueue concurrency can be exploited by operations which support finegrain parallelism. Looking back to Figure 2(b), it was mentioned that the local data segments are accessed cyclically by the processors. The SHMEM library provides an atomic increment operation, Inc, which can be used to coordinate the global access of these segments, as shown in Table 1. The result in w can be taken modulus p to obtain the required PriQueue data segment. Figure 2(b) shows that each data segment also contains a number of locks and local flags, which are used to control the access to the data segments [3]. The implementation of these locks, through a lower level Lock SADT [3], can use the SHMEM atomic Swap operation, in which a word c is exchanged with the current contents of W on processor i. This operation allows for the construction of a scalable linked list [8]. In addition, Wait can be used to support fine-grain pointto-point synchronisation methods [10], which are typically used to coordinate the access of the shared linked lists [8]. The next section will show how the specific hardware support for these operations allows for the construction of high performance software.
4
A BSP-Style Performance Model
This section describes how a model of the performance of parallel applications can be constructed from a simple BSP-like cost model, as described in the introduction. The key to preserving this costing approach has been the scalable implementation of the PriQueue SADT, which preserves the independent superstep behaviour of the processors.
Scalable Sharing Methods Can Support a Simple Performance Model
911
Table 2. PriQueue SADT costing Operation Cost Enqueue (max,tot) Inc(max,tot) + Get(1,4*max,4*max)+Put(10,max,max) + 882*max*s + max*Add Heap(HEAP SIZE) + 2 { Acquire(max,tot/p) + Release(max,tot/p) } Dequeue (max,tot) Inc(max,tot) + Get(1,max,max)+Get(9,3*max,3*max)+Put(1,max,max) + 1107*max*s + max*Sub Heap(HEAP SIZE) + 2 { Acquire(max,tot/p) + Release(max,tot/p) }
4.1
Characterising Machine Performance
As described in the introduction, a superstep cost can be modelled as gh+sw+L, where g, s and L cost the machine operations (network access, local computation and barrier synchronisation respectively), with h and w specifying their maximum usage by any one processor within the superstep. In order to effectively model the operations for fine-grain parallelism, two additional machine costs, D and t, are required. The cost D measures the round-trip network latency, and is incurred when there exists a data dependency within a superstep (for example, when accessing the results of an atomic increment). The cost t represents the time to transfer words within the local memory, and is present due to the fact that irregular applications typically access complex local data structures. So the cost of a superstep can now be represented by gh + sw + tn + Dc + L, where n and c again reflect the maximum resource usage. 4.2
Characterising the Performance of the SHMEM Library
Figure 3(a) shows how the SHMEM operations which access the network are costed. For the Swap and Inc operations, performance is measured by the latency term D and the network access cost g. For the Put and Get operations, the time g is incurred at both the sending and receiving ends. Figure 3(b) provides a summary of all the costs. Put allows the pipelining of multiple remote accesses (totaling x words), with the processor suspending using Quiet, or Barrier, until they complete. Put and Get requests are split into packets of 4 words, with an acknowledgement packet returning for a Put, and the data returned for a Get. However, the effective network bandwidth is halved for the Get (hence the factor of 4 in the table), since each packet must return before the next packet can be sent. Figure 3(c) gives each of the cost parameters for the Cray T3D. The terms of g and D for each operation were derived from network performance results under randomised and deterministic traffic patterns. 4.3
Characterising the Performance of the PriQueue and TSP
As described in Section 4.1, the performance at the machine level can be modelled using the terms gh + sw + tn + Dc + L, where h, w, n and c represent the
912
Jonathan Nash
Table 3. Costs for maintaining an ordered heap segment Operation Cost Add Heap (n) Put(18,n,n) + 2*Get(9,n,n) Sub Heap (n) Put(18,n,n) + 3*Get(9,n,n)
Table 4. Lock SADT costing Operation Cost Acquire (max,tot) Swap(max,tot) + Swap(max,max) + 37*max*s Release (max,tot) Put(1,tot,tot) + 36*tot*s
maximum usage of each associated machine resource. When characterising the performance of parallel software, two terms are used to model the workload at a given level of abstraction. The term tot measures the total usage of a given resource across all processors, whereas max measures the maximum resource usage by any given processor. For example, the performance of the solution to the travelling salesman problem given in [5] can be described as: Enqueue (max,tot) + Dequeue (max,tot) + Accum Read (max,tot) + Accum Update (u) + Compute (max) In this case, tot and max refer to the number of generated tours. The costs Enqueue and Dequeue are for the PriQueue, and Accum Read and Accum U pdate the Accumulator (the u term has been found by experiment to be a small constant number). A Compute term measures the local work required for a given number of tours. In this example, SADT access occurs within a single superstep. Processors independently produce and consume tours through the access of the PriQueue. The only exception is when all items have been temporarily removed from the PriQueue and new items are being enqueued (causing any idle processors to momentarily suspend execution), and at the termination stage. The former case typically only occurs for a short time at the start of execution, when the first few PriQueue items are being generated. Thus, the independent superstep behaviour is preserved at the application level. The highly concurrent behaviour of the SADTs preserves this state of affairs (see below and Section 3.3), allowing the simple BSP cost model to be employed. Table 2 gives the costing of the Enqueue and Dequeue operations. These include access to the Inc, Put and Get SHMEM operations. The Inc cost represents a potential source of contention, including a term of tot, reflecting its use to coordinate the access of the local data segments which make up the PriQueue, as described in Section 3.3. The PriQueue costing also includes the need to maintain a local priority ordering within each local data segment after an Enqueue or Dequeue occurs, using Add Heap and Sub Heap respectively. These costs are parameterised by the size of the local segment (2HEAP SIZE items). Also included are costs to Acquire and Release a lock at a local segment. The term of tot/p in these costs reflects the scalable distributed implementation, in which the local data segments are accessed cyclically by the processors.
Scalable Sharing Methods Can Support a Simple Performance Model
913
Table 5. Parameterised SHMEM costs Operation Swap/Inc (max,tot) Get (x,max,tot) Put (x,max,tot)
Cost max ∗ D + tot ∗ 2 ∗ g max ∗ D + tot ∗ 4 ∗ g ∗ x max ∗ D + tot ∗ 2 ∗ g ∗ x
Performance of the PriQueue
Performance of the TSP
300
256 Enqueue Predicted Dequeue Predicted
260
128
20 cities predicted 21 cities predicted linear
64 Speedup
Time (microsecs)
280
240 220 200
32 16 8 4
180
2
160
1 1
2
4
8 16 32 Processors
64 128 256
1
2
4
8 16 32 64 128256 Processors
Fig. 4. (a) PriQueue performance (b) TSP performance
Tables 3 and 4 give the costs for maintaining the priority ordering of the data segments and the Lock SADT. The ordering uses a binary search to exchange data items within the segment, resulting in the HEAP SIZE term. The Lock SADT implementation is too involved to describe here, but further information can be found in [3]. Finally, Table 5 gives the cost of the SHMEM operations, when parameterised by max and tot. It can be seen, for example, that when a Swap operation is concurrently performed on a given shared word, the max term gives the usage of the network latency resource D (since max ∗ D gives the total time that any processor incurs the latency resource), and the tot term relates to the usage of the network resource g (since tot ∗ g provides the total contention at the memory module holding the word being accessed).
5
Predicted and Observed Performance
The performance of the PriQueue is given in Figure 4(a), in which all processors continuously generate Enqueue and Dequeue requests. It can be seen that the predicted and observed performance agree closely. Figure 4(b) shows the TSP performance for 20 and 21 city problems. Speedups of over 150 on 256 processors for a 21 city problem are typical, reducing the time from 53 seconds on one processor to around 13 second on 256 processors. It can be seen that the predicted times are again close, using the performance models derived for the SADTs. The values for tot and max, described in Section 4.3, were taken from actual runs of the TSP solution, on a given number of processors. Figure 5(a) shows the number of tasks generated for both problems. The total tasks, mean tasks (ie tot/p) and maximum tasks dealt with by any processor are
914
Jonathan Nash
Number of Tasks
SADT Performance 4096
16384
1024
4096
256
Time (milliseconds)
65536
Tasks
1024 256
Total (20) Mean Max Total (21) Mean Max
64 16
64
PriQueue 20 Accum 20 PriQueue 21 Accum 21
16 4 1
4 0.25 1 1
2
4
8 16 32 Processors
64 128 256
1
2
4
8 16 32 Processors
64 128 256
Fig. 5. (a) Number of generated tasks (b) SADT overheads given. The total number of tasks stays quite constant as the processors vary. Obviously, a larger sample set is required if specific conclusions are to be drawn on the change in task numbers. The maximum number of tasks is very close to the mean value, even for large numbers of processors, demonstrating the good dynamic load balancing properties of the PriQueue. Figure 5(b) presents the SADT overheads for both problem sizes. The 20 city problem overhead is chiefly due to the rising cost of Accumulator updates. The 21 city problem overheads increase slowly from 23% for 4 processors up to 37% for 256 processors. The PriQueue overheads essentially decreases linearly with the number of processors. This is due to the even load balance of tasks, together with the highly concurrent implementation of the PriQueue providing scalable high performance. The Accumulator overheads show an approximately linear increase, due to its serial implementation updating replicated copies on each processor in turn. This is not a serious limitation for the TSP, since the update frequency is very low, but for larger number of processors a tree-based solution may offer improved performance.
6
Summary and Current Work
This paper has described an approach to the development of scalable high performance parallel software for irregular problems, and how this can lead to the support of a simple analytical costing, derived from the BSP model. Further work is required to describe the assumptions under which the cost model can be used (in particular, the degree of concurrency within the sharing methods), and to expand the number of machines to demonstrate the generality of the approach. However, results for the Cray T3D have demonstrated both the scalable and high level of performance which can be achieved. Current work is centering around the support of an adaptive unstructured 3D mesh SADT, for solving computational fluid dynamic problems, from existing MPI codes [11]. Each processor maintains a local state which consists of both partitioned internal mesh elements and replicated shared halo elements, with global consistency points defining when these halos are updated.
Scalable Sharing Methods Can Support a Simple Performance Model
915
Acknowledgements Thanks to Professor Peter Dew and the members of the TallShiP and Pallas research projects for their helpful comments and the use of the serial code for the travelling salesman problem.
References 1. C. Clemencon, B. Mukherjee and K. Schwan, Distributed Shared Abstractions (DSA) on Multiprocessors, IEEE Transactions on Software Engineering, vol 22(2), pp 132-152, February 1996. 906, 907, 909 2. J. Darlington and H. W. To, Building Parallel Applications Without Programming, Abstract Machine Models for Highly Parallel Computers, (eds J.R. Davy and P.M. Dew) Oxford University Press, pp 140-154, 1995. 907 3. P. M. Dew and J. M. Nash, The High Performance Solution of Irregular Problems, MPPM’97: Massively Parallel Programming Models Workshop, Royal Society of Arts, London, November 1997 (to be published in IEEE Press). 906, 907, 908, 910, 913 4. K. Gharachorloo, S. V. Adve, A. Gupta and J. H. Hennessy, Programming for Different Memory Consistency Models, Journal of Parallel and Distributed Computing, vol 15, pp 399-407, 1992. 906, 909 5. D. M. Goodeve and S. A. Dobson, Programming with Shared Data Abstractions, Irregular’97, Paderborn, Germany, June 1997. 907, 912 6. L. V. Kale and A. B. Sinha, Information sharing mechanisms in parallel programs, Proceedings of the 8th International Parallel Processing Symposium, pp 461-468, April 1994. 907 7. W. F. McColl, An Architecture Independent Programming Model For Scalable Parallel Computing, Portability and Performance for Parallel Processing, J. Ferrante and A. J. G. Hey eds, John Wiley and Sons, 1993. 906 8. J. M. Nash, P. M. Dew and M. E. Dyer, A Scalable Concurrent Queue on a Message Passing Machine, The Computer Journal 39(6), pp 483-495, 1996. 906, 908, 910 9. J. M. Nash, P. M. Dew, J. R. Davy and M. E. Dyer, Scalable Dynamic Load Balancing using a Highly Concurrent Shared Data Type, The 2nd European School of Computer Science: Parallel Programming Environments for High Performance Computing, pp 123-128, April 1996. 906 10. J. M. Nash, P. M. Dew, J. R. Davy and M. E. Dyer, Implementation Issues Relating to the WPRAM Model for Scalable Computing, Euro-Par’96, Lyon, France, pp 319326, 1996. 906, 910 11. P. M. Selwood, M. Berzins and P. M. Dew, 3D Parallel Mesh Adaptivity: DataStructures and Algorithms, Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1997. 914 12. L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM 33, pp 103-111, 1990. 906
Long Operand Arithmetic on Instruction Systolic Computer Architectures and Its Application in RSA Cryptography Bertil Schmidt1 , Manfred Schimmler2 , and Heiko Schr¨ oder3 1
3
Lehrstuhl f¨ ur Informatik I, RWTH Aachen, 52056 Aachen, Germany, [email protected] 2 Inst. f. Datenverarbeitungsanlagen, TU Braunschweig, 38106 Braunschweig, Germany, [email protected] Department of Computer Studies, Loughborough University, Loughborough, LE11 3TU, England, [email protected]
Abstract. Instruction systolic arrays have been developed in order to combine the speed and simplicity of systolic arrays with the flexibility of MIMD parallel computer systems. Instruction systolic arrays are available as square arrays of small RISC processors capable of performing integer and floating point arithmetic. In this paper we show, that the systolic control flow can be used for an efficient implementation of arithmetic operations on long operands, e.g. 1024 bits. The demand for long operand arithmetic arises in the field of cryptography. It is shown how the new arithmetic leads to a high-speed implementation for RSA encryption and decryption.
1
Introduction
Instruction systolic arrays (ISAs) provide a programmable high performance hardware for specific computationally intensive applications [5]. Typically, such an array is connected to a sequential host, thus operating like a coprocessor which solves only the computationally intensive tasks within a global application. The ISA model is a mesh connected processor grid, which combines the advantages of special purpose systolic arrays with the flexible programmability of general purpose machines [3]. In this paper we illustrate how the capabilities of ISAs are exploited to derive efficient parallel algorithms of addition, subtraction, multiplication and division of long operands. The demand for long operand arithmetic arises in the field of cryptography, e.g. RSA encryption and decryption. Their implementations on Systola 1024 show that the concept of the ISA is very suitable for long operand arithmetic and results in significant run time savings. The ISA concept is explained in detail in Section 2. Section 3 gives an overview over the architecture of Systola 1024. It is documented how the ISA has been integrated on an low cost add-on board for commercial PCs. The new D. Pritchard, J. Reeve (Eds.): Euro-Par’98, LNCS 1470, pp. 916–922, 1998. c Springer-Verlag Berlin Heidelberg 1998
Long Operand Arithmetic on Instruction
917
ISA algorithms for long operand arithmetic are explained in Sections 4 to 6. The implementation of RSA based on these arithmetic routines is given in Section 7. Section 8 discusses its performance and concludes the paper.
2
Principle of the ISA
The basic architecture of the ISA is a quadratic n × n array of identical processors, each connected to its four direct neighbours by data wires. The array is synchronized by a global clock. The processors are controlled by instructions, row selectors and column selectors. The instructions are input in the upper left corner of the processor array, and from there they move step by step in horizontal and vertical direction through the array. This guarantees that within each diagonal of the array the same instruction is active during each clock cycle. In clock cycle k + 1 processor (i + 1, j) and (i, j + 1) execute the instruction that has been executed by processor (i, j) in clock cycle k. The selectors also move systolically through the array: row-selectors horizontally from left to right, column-selectors vertically from top to bottom (Fig. 1). The selectors mask the execution of the instructions within the processors, i.e. an instruction is executed if and only if both selector bits, currently in that processor, are equal to one. This construct leads to a very flexible structure which creates the possibility of very efficient solutions for a large variety of applications.
instructions * + 0 1 1 1 1 0 1 1
row selectors
1 1 1 1 1 1 0 1
column selectors
ISA
Fig. 1. Control flow in an ISA Every processor has read and write access to its own memory. Besides that, it has a designated communication register (C-register) that can also be read by the four neighbour processors. Within each clock phase reading access is always performed before writing access. Thus, two adjacent processors can exchange data within a single clock cycle in which both processors overwrite the contents of their own C-register with the contents of the C-register of their neighbour. This convention avoids read/write conflicts and also creates the possibility to broadcast information across a whole row or column with one single instruction: Row broadcast: Each processor reads the value from its left neighbour. Since the execution of this operation is pipelined along the row, the same value is propagated from one C-register to the next, until it finally arrives at the rightmost processor. Note that the row broadcast requires only a single instruction.
918
Bertil Schmidt et al.
The period of ISA programs is the number of their instructions, which is 2n − 2 clock cycles less than their execution time. The period describes the minimal time from the first input of an instruction of this program to the first input of an instruction of the next program. In the following the period of ISA programs is used to specify their time-complexity. This is appropriate because they will be used as subroutines of much larger ISA programs in Section 7.
3
Architecture of Systola 1024
The ISATEC Systola 1024 parallel computer is a low cost add-on board for standard PCs. The ISA on the board is a 4 × 4 array of processor chips. Each chip contains 64 processors, arranged as an 8 × 8 square. This provides 1024 processors on the board. In order to exploit the computation capabilities of this unit, it is necessary to provide data and control information at an extremely high speed. Therefore, a cascaded memory concept, consisting of interface processors and board RAM, is implemented on board that forms a fast input and output environment for the parallel processing unit (see Fig. 2). Interface ............
ISA-Prog.
RAM WEST
Controller
Interface ............
RAM NORTH
ISA
Host Computer Bus
Fig. 2. Data paths in the parallel computer Systola 1024
4
Addition and Subtraction of Long Operands
The difficulty of partitioning an addition in an array where only neighbours can talk to each other is the carry propagation. If it meets a sum value consisting of ones only, then an incoming carry produces a new carry in the most significant digit. In the worst case this can hold for all blocks of the partitioned addition, such that the carry-bit of the first block propagates through all other blocks. Therefore, we use an accumulation technique based on the generate and propagate signals of a carry-look-ahead-adder. If every processor knows, whether it has to propagate an incoming carry from the left to the right, it can set a flag (e.g. the zero-flag) to one if and only if it propagates an incoming carry. Furthermore every processor can store the carry generated by itself in its C-register. With the accumulation operation (*) if zeroflag then C:=C[WEST] all carry-bits travel to the processor one position left of their destinations. This is again because of the skewed instruction execution: Suppose, processors (i, j − 1), (i, j), and (i, j + 1) have set their zero-flag to one (in order to propagate a carry) and processor (i, j − 2) has generated a carry-bit one. The instruction (*)
Long Operand Arithmetic on Instruction
919
is executed in processor (i, j − 1) which replaces its carry-bit (0) with that of processor (i, j − 2). In the next clock cycle the same instruction is executed in processor (i, j), where again the own carry (0) is replaced by the value of the left neighbour (which has been set to one in the previous instruction cycle). In the following cycle, processor (i, j + 1) replaces its carry with the one from the left neighbour, etc. With the final assignment C:=C[WEST] all carries are at the place where they are needed. Note that the carry propagation requires only two instructions, although the carry-bits can travel over a distance of up to n − 1 processors. Of course, this technique only works if no processor propagating a carry has generated a carry-bit on its own. But this is obviously impossible. The operands to be added are distributed over the rows of the ISA. If the length of the operands is 512/1024 bits, every processor in a row of 32 processors gets 16/32 bits. The LSBs are stored in the leftmost and the MSBs in the rightmost processor of the row. Every processor has 16 bits of the operand A in register RA and 16 bits of B in register RB. The operations RSUM:=RA + RB; C:=carry; store the sum of RA and RB in RSUM and the carry-bit in C. Now every processor must find out whether it has to propagate an incoming carry. This is the case if and only if the value of RSUM is consisting of ones only. Thus, we can complete the long addition as pointed out above with if (RSUM + 1=0) then C:=C[WEST]; RSUM:=C[WEST] + RSUM; Table 1. Addition of two 24-bit numbers in a row with 6 processors Processor 1 2 3 4 5 6 A 0101 1100 1010 0010 1101 1000 Every pr. gets 4 bits of A and B.LSB/MSB B 0001 0011 0101 1010 1010 0100 in leftmost/rightmost bit in pr. 1/6. SUM 0100 1111 1111 1001 0000 1100 SUM:=A + B C 1 0 0 0 1 0 C-register stores the carry-bit Zeroflag 0 1 1 0 0 0 if (SUM+1=0) then zeroflag:=1 else :=0 C 1 1 1 0 1 0 C after prop.: if zeroflag then C:=C[WEST] SUM 0100 0000 0000 0101 0000 0010 SUM:=SUM + C[WEST]
The subtraction works in principle in the same way as the addition, using RDIFF:=RA-RB, the propagation condition if (RDIFF=0) and the carry-bit subtraction RDIFF:=RDIFF-C[WEST]. Due to the systolic control flow the n-bit addition/subtraction is possible using O(n) ISA processors with constant period. The complete 512/1024-bit addition/subtraction in one Systola 1024 processor-row requires only 5/8 instructions. However, having 32 processor-rows, 32 different additions/subtractions can be calculated in parallel in the same time.
920
5
Bertil Schmidt et al.
Multiplication of Long Operands
The idea behind the multiplication on the ISA is related to the school method for integer multiplication: One operand is cut into pieces and all pieces are multiplied with the other operand in parallel. The results are then shifted with respect to each other in an appropriate way and added to form the final result. The ISA-program for the m × n-bit (m = 16 · p, n = 16 · q) multiplication in each processor-row is explained below. Every processor stores 16 bits of the first operand in register RA, the least significant 16 bits in the leftmost and the most significant 16 bits in the rightmost processor. The complete second operand is stored in RB0 ,..,RBq−1 in each processor. At the end the most significant m bits of the result will be distributed over the rows in RPq . The least significant n bits will be stored in the first processor of each row in RP0 ,..,RPq−1. RP0 ..RPq := RA · RB0 ..RBq−1 // compute 16 × 16 · q-bit multiplication // load least significant result in C-register C:=RP0 for i:=1 to q do // add RP0 ..RPq along processor-row in q steps: begin add 16 bit more significant word in C[EAST] and RPi . Sum is stored in C and RPi RPi−1 :=C C:=C[EAST]+RPi+carry This addition is repeated up to the most end significant word // store most significant result RPq :=C C:=carry if RPq +1=0 then C:=C[WEST]// propagate the carry-bit of the last addition RPq :=C[WEST]+RPq as described in Section 4 The implementation on Systola 1024 requires 223/507 instructions for one single 512 × 512/1024 × 1024-bit multiplication in one row of 32 processors.
6
Division of Long Operands
Division can be efficiently reduced to multiplication and subtraction by using the Newton-Raphson-method. For a value B the reciprocal value is computed by an iteration that converges rapidly towards B1 . To achieve a fast convergence (2m bits precision in m iteration steps), 0.5 ≤ B < 1 must hold for B. The iteration is given by the equations x0 := 1
,
xi+1 := (2 − B · xi ) · xi A B
(1)
Obviously, any division Q = can be computed by multiplying A with B1 . For division on the ISA we assume B to be in the range between 0.5 and 1. If this is not the case, B is initially shifted in the ISA such that the most significant one is the MSB of the rightmost word of B. In order to understand the choice of intermediate operand lengths in our division method it is useful to consider the behavior of the convergence in the
Long Operand Arithmetic on Instruction
Newton-Raphson-algorithm. Let xi differ from the precise value of ε. Then the following holds for xi+1 : xi+1 = (2 − B · xi ) · xi = (2 − B · ( B1 − ε)) · ( B1 − ε) =
1 B
1 B
921
by some
− B · ε2
(2)
Since B < 1, this means that the precision of xi+1 (i.e. the number of correct leading digits) is at least double the precision of xi. This means that we can restrict the operand lengths to 2^i bits for B and xi−1 while computing the i-th iteration xi. E.g. the first four iterations can be performed in one processor, since the required precision of the operands is ≤ 16. Only the last iteration step needs the full precision of m bits. The subtractions and multiplications are performed efficiently on the ISA as explained above.
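A quick exact-arithmetic check of this doubling behaviour, provided here as a sketch only (it uses Python rationals instead of the ISA's truncated 2^i-bit operands, and the test value B = 3/4 is arbitrary):

from fractions import Fraction

def reciprocal(B, m):
    """Newton-Raphson reciprocal for 0.5 <= B < 1: x0 = 1, x_{i+1} = (2 - B*x_i)*x_i.
    The error follows eps_{i+1} = B*eps_i**2, so each step roughly doubles the number
    of correct bits and m steps give at least 2**m bits of precision."""
    x = Fraction(1)
    for i in range(1, m + 1):
        x = (2 - B * x) * x
        err = abs(Fraction(1) / B - x)
        bits = err.denominator.bit_length() - err.numerator.bit_length()
        print(f"step {i}: ~{bits} correct bits")
    return x

reciprocal(Fraction(3, 4), 6)   # 1/B = 4/3; the correct-bit count roughly doubles every step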
7
Parallel Implementation of RSA Encryption
For obtaining a high-speed implementation of RSA encryption a fast modular exponentiation (M^e mod N) is necessary. To achieve a sufficient degree of security the operand-lengths have to be relatively large (currently ranging from 512 up to 2048 bits). Modular exponentiation can be accomplished by iterating modular multiplications using the square-and-multiply algorithm [2]. The difficulty in implementing an efficient modular multiplication (A·B mod N) is the modular reduction step. In the case of RSA encryption the modulus N is known in advance of each modular multiplication. Thus, the precomputation of 1/N requires only a negligible amount of work. Now, the division x div N can be replaced by the more efficient multiplication x · (1/N). Of course, we cannot produce 1/N precisely. We instead produce a number µ such that x · µ is close enough to x div N; i.e. the result needs only a small correction by at most 4 subtractions of N. Thus, the modular multiplication for RSA can be implemented efficiently on the ISA by three multiplications and at most four subtractions. One single n-bit modular multiplication for n = 512/1024 requires 581/1709 instructions in one processor-row of Systola 1024.
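In plain Python the whole scheme looks as follows. This is only a sketch of the idea, not the Systola 1024 implementation: the fixed-point µ, the shift amount k and the toy modulus are assumptions, and the exact number of correction subtractions depends on how µ and the shift are chosen.

def modmul(a, b, N, mu, k):
    """Modular multiplication with a precomputed reciprocal: mu = floor(2**k / N)
    approximates 1/N, so (a*b*mu) >> k approximates (a*b) div N and a few trailing
    subtractions of N correct the result."""
    x = a * b
    q = (x * mu) >> k                 # approximate quotient x div N (never too large)
    r = x - q * N
    while r >= N:                     # small correction, a couple of subtractions at most
        r -= N
    return r

def modexp(M, e, N):
    """Left-to-right square-and-multiply built on modmul; mu is precomputed once per N."""
    k = 2 * N.bit_length()            # 2**k > N**2, so the quotient error stays tiny
    mu = (1 << k) // N
    result = 1
    for bit in bin(e)[2:]:
        result = modmul(result, result, N, mu, k)
        if bit == "1":
            result = modmul(result, M, N, mu, k)
    return result

N = (1 << 61) - 1                     # toy modulus; real RSA moduli are 512-2048 bits
M, e = 123456789, 65537
assert modexp(M, e, N) == pow(M, e, N)
print(modexp(M, e, N))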
8 Performance Evaluation and Conclusions
The performance of the parallel implementation is compared with an optimized sequential implementation on a PC (see Table 2). The current ISA prototype Systola 1024 is a machine based on technology far from being state-of-the-art. Extrapolating to the technology used for processors such as the Pentium II 266 MHz would lead to a speedup of the ISA by a factor of at least 80, resulting in encryption/decryption speeds of more than 500 Kbit/s for 1024-bit RSA (without taking advantage of the Chinese Remainder Theorem). Such performance can be expected from the already announced next-generation Systola board, Systola 4096.
Table 2. Performance of RSA encryption

                                              Systola 1024   Pentium II 266   Speedup
    512-bit RSA with full length exponent     128 KBit/s     18 KBit/s        7
    1024-bit RSA with full length exponent    56 KBit/s      6 KBit/s         9
We have presented ideas for fast implementations of addition, subtraction, multiplication and division on operands that are too long to be handled within one processor. We take advantage of the ability of the ISA to implement accumulation operations extremely efficiently. We have used the results for finding a high-speed instruction systolic implementation of RSA encryption and decryption. This leads to efficient software solutions with respect to performance and hardware cost. We also used the long operand arithmetic routines for the implementation of a prime number generator for RSA keys on Systola 1024[1].
References
1. Hahnel, T.: The Rabin-Miller Prime Number Test on Systola 1024 on the Background of Cryptography. Master Thesis, University of Karlsruhe (1998)
2. Knuth, D.E.: The Art of Computer Programming: Seminumerical Algorithms. Volume 2, Reading, Addison-Wesley, second edition (1981)
3. Kunde, M., et al.: The Instruction Systolic Array and its Relation to other Models of Parallel Computers. Parallel Computing 7 (1988) 25–39
4. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public key cryptosystems. Comm. of the ACM 21 (1978) 120–126
5. Schmidt, B., Schimmler, H., Schröder, H.: Morphological Hough Transform on the Instruction Systolic Array. Euro-Par'97, LNCS 1300, Springer (1997) 798–806
Hardware Cache Optimization for Parallel Multimedia Applications

C. Kulkarni, F. Catthoor, and H. De Man
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium

Abstract. In this paper, we present a methodology to improve hardware cache utilization by program transformations so as to achieve lower power requirements for real-time multimedia applications. Our methodology is targeted towards embedded parallel multimedia and DSP processors. This methodology takes into account many program parameters like the locality of data, size of data structures, access structures of large array variables, regularity of loop nests and the size and type of cache with the objective of improving cache performance for lower power. Experiments on real life demonstrators illustrate the fact that our methodology is able to achieve significant gain in power requirements while meeting all other system constraints. We also present some results about software controlled caches and give a comparison between both the types of caches and an insight about where the largest gains lie.
1 Introduction
Parallel machines were mainly, if not exclusively, used in scientific communities until recently. Lately, the rapid growth of real-time multimedia processing (RMP) applications has brought new challenges in terms of the required processing (computing) power, power requirements and a host of other issues. For these types of applications, especially video, the processing power of traditional uniprocessors is no longer sufficient, which has led to the introduction of small- and medium-scale parallelism in this field too, but then mostly oriented towards single chip systems for cost reasons. Today, many parallel video and multimedia processors are emerging ([20] and its references), increasing the importance of parallelization techniques. However, the cost functions to be used in these new application fields are no longer purely performance based. Power is also a crucial factor, and has to be optimized for a given throughput. RMP applications are usually memory intensive and most of the power consumption is due to the memory accesses for data transfers, i.e. in the memory hierarchy [22]. Power management and reduction is becoming a major issue in such applications [3]. Many of the current algorithm standards (including the ones which are still under development like MPEG-x and object-oriented coders [2]) are based on a hierarchy of subsystems. The main data-dependencies and irregularities are
Professor at the Katholieke Universiteit Leuven
situated in the higher layers. However, from there several submodules are called, such as motion estimation/compensation, wavelet schemes, DCT etc. These lend themselves for extensive compile-time analysis enabling more efficient use of caches1 , as will be shown in the paper. This is in contrast with much of the previous work in a more “general” application context. The performance of these systems is related to the critical path in the algorithm (requires only local analysis, as shown with bold line in figure 1 where segments represent loop nests), whereas power requirements are related to the global data flow over conditional and/or parallel paths (as shown with the other lines in figure 1). Also this is a significant difference with the previous work (see below). These observations are the motivations for us to come up with an optimization strategy to utilize especially the cache hierarchy efficiently for low power.
Fig. 1. Illustration of the difference between speed oriented critical path optimization which operates on the worst-case path in the global condition tree (with mutually exclusive branches) and power optimization which acts on the global flow. Our target architecture is based on (parallel) programmable video or multimedia processors (TMS320C6x, TriMedia, etc.) as this is a current trend to use in RMP applications [7] (see section 2). Most of the work related to efficient utilization of caches has been directed towards optimization of the throughput by means of (local) loop nest transformations to improve locality [1,15] and loop blocking transformation to improve cache utilization [14]. Some work [17] has been reported on the data organization for improved cache performance in embedded processors but also they do not take into account a power oriented model. Instead they try to reduce cache conflicts in order to increase the throughput. Similar work regarding reduction of cache conflicts and improvement of register allocation has been reported in [10,15]. Architectural techniques to reduce the overhead in cache related power are available in [23]. Recently a study has been presented in [8] which reports on the effect with respect to power requirements on the type of cache used in an architecture. But this work also does not address the basic issue of improving the cache utilization by means of program transformations. Program transformations for reducing memory related power and area overhead are discussed in [5,6]. These transformations are part of our ATOMIUM [16] global data transfer and storage oriented approach and have focussed upto now 1
in this paper unless specifically mentioned cache generally means a hardware controlled cache
on uni-processors and customized memory organizations. Program transformations for improving cache usage for software controlled caches are discussed in [12]. Some other complementary aspects of our work on system level memory management for parallel processor mapping have been discussed in [4]. The paper is organised as follows. In section 2, the correlation between cache and power is discussed. Section 3 gives a formalized methodology to explore system level optimization for efficient cache usage. This is followed by discussion of two real life demonstrators and their results in section 4. Section 5 presents conclusions from the above study and proposes some future directions.
2 Power Models for Hardware Controlled Caches
The relation between the extent of data caching and power is explored in this section. Consider the simple but representative architecture model that we are using in this paper as shown in figure 2 (extensions to this model are feasible e.g. as in [12]). The power model used in our work comprises a power function which is dependent upon access frequency, size of memory, number of read/write ports and the number of bits that can be accessed in every (data) fetch operation.
(Figure 2 sketch: processing elements PE 1 ... PE n, each with its own cache C1 ... Cn (dual port, fully associative, write back), attached to a distributed shared memory architecture; the PEs can be programmed individually.)
Fig. 2. The target architecture used, where each processing element (PE) has its own cache hierarchy (Ci) without prefetching or branch prediction.

Let P be the total power, N be the total number of accesses to a memory (can be SRAM or DRAM or any other type of memory)2 and S be the size of the memory. The total power is then given by:

    P = N × F(S)                                          (1)

where F is a polynomial function with the different coefficients representing a vendor specific energy model per access. F is completely technology dependent. For an architecture with no cache at all, i.e. only a CPU and (off chip) main memory, power can be approximated as below:

    Ptotal = Pmain^hi = Nmain × F(Smain)                  (2)
Power consumption varies for different types of SRAM or DRAM depending on the individual power models (vendor and architecture specific)
Let us introduce a level one cache as in figure 2; then we observe that the total number of accesses is distributed into a number of cache accesses (Ncache) and a number of main memory accesses (Nmain). The total power is now given by:

    Ptotal = Pmain^lo + Pcache                            (3)
    Pmain^lo = (Nmc + Nmain) × F(Smain)                   (4)
    Pcache   = (Nmc + Ncache) × F(Scache)                 (5)
Here the term Nmc represents the amount of traffic between the two levels of memory. It is a useful indicator of the power overhead due to various characteristics like block size, associativity, replacement policy etc. for an architecture with hierarchy. Let the factor of gain for an on-chip cache access compared to that of an off-chip main memory access be α. Here α is technology dependent. Then we have:

    F(Scache) = α × F(Smain)    where α < 1.              (6)

The constants in this formula are instantiated for a proprietary vendor model. In the literature several more detailed cache power models have been reported recently (e.g. [11]) but we do not need these because we only require relative comparisons.
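To make the relative nature of the model concrete, the following toy C sketch instantiates equations (1) to (6) with made-up numbers. F(), alpha and all access counts below are placeholders chosen by us (the paper instantiates them from a proprietary vendor model), so only the ratio between the two configurations is meaningful.

    #include <stdio.h>

    static double F(double size_in_words)        /* assumed polynomial cost per access */
    {
        return 1.0 + 0.001 * size_in_words;
    }

    int main(void)
    {
        double S_main = 1 << 20, S_cache = 64;   /* memory sizes (words)               */
        double N = 1e6;                          /* total accesses                     */
        double N_cache = 0.9e6, N_mc = 0.1e6;    /* cache hits vs. cache<->main traffic */
        double N_main  = N - N_cache;
        double alpha = 0.1;                      /* on-chip vs. off-chip access gain   */

        double P_no_cache = N * F(S_main);                            /* eq. (2)        */
        double P_cache    = (N_mc + N_main)  * F(S_main)              /* eq. (4)        */
                          + (N_mc + N_cache) * alpha * F(S_main);     /* eqs. (5), (6)  */

        printf("relative power with cache: %.2f\n", P_cache / P_no_cache);
        return 0;
    }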
3 Methodology
The steps comprising our methodology in this work are shown in figure 3. We assume that all the parallelization related steps (i.e data and/or task level partitioning) are done before applying this methodology. We only focus on the cache related issues here since other issues in our work have been addressed in [4,5].
Fig. 3. Proposed methodology (original source code → data flow and locality analysis and high level in-place estimate → global data and control flow transformations → data re-use decisions → advanced in-place mapping → native C compiler, under the given instruction and cycle budget)

The first step, Data Flow and Locality Analysis, is necessary to identify the parts and data structures in the program where there is the most potential gain or where the performance bottlenecks are located. Next, a high-level estimation is applied of the potential in-place optimization for data [6]. In step 3, Loop
and Control flow transformations are applied [4,5] but over a much more global scope than typically done in compilers [24,13], so as to improve the regularity and locality for power reduction. The constraints are the given cycle budget and the code size limitations and data dependencies. In the Data Re-use Decisions step, different data re-use trees [18] are explored and the best hierarchy mapping (from the power point of view) is obtained for a particular application. Finally, during Advanced In-place Mapping and Caching, both inter-signal and detailed intra-signal in-place mapping [6] are applied to obtain better cache utilization.
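The following constructed C example (ours, not taken from the paper's demonstrators) illustrates the flavor of such a global transformation: two loops that sweep the same array are merged so that each element is consumed while it is still cached, instead of being fetched from main memory twice, which reduces the Nmc traffic term of the power model above.

    #define N 1024

    /* Before: two separate sweeps over a[], so a[i] is fetched twice. */
    void before(int a[N], int b[N], int c[N])
    {
        for (int i = 0; i < N; i++) b[i] = a[i] * 2;
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    /* After: a single sweep; a[i] and b[i] are reused while still in the cache. */
    void after(int a[N], int b[N], int c[N])
    {
        for (int i = 0; i < N; i++) {
            b[i] = a[i] * 2;
            c[i] = a[i] + b[i];
        }
    }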
4 Experimental Results
In this section we illustrate the validity and effectiveness of the methodology on two real-life applications, namely a Quad-tree Structured DPCM video compression algorithm and a voice coder algorithm, using a DLX machine simulator and the Dinero III cache simulator [9]. It is important to emphasize that the results of the hardware controlled cache for both test vehicles are based on relatively small test data3 due to the small address space available in the DLX simulator. But we also observe that the relative gains for larger frame sizes will be almost similar for hardware controlled caches. Hence comparing relative gains for hardware and software controlled caches is justified.

4.1 QSDPCM Application
The Quadtree Structured Difference Pulse Code Modulation (QSDPCM) [21] technique is an interframe adaptive coding technique for video signals. The algorithm consists of two basic parts: the first part is motion estimation and the second is a wavelet-like quadtree mean decomposition of the motion compensated frame-to-frame difference signal. A global view of the QSDPCM coding algorithm is given in figure 4(a). It is worth noting that the algorithm spans 20 pages of complex C code. It should be stressed that the functional partitioning in the QSDPCM algorithm was not based on ad hoc algorithmic issues but based on our methodology proposed in [4] where the best possible one was selected for power. As motivated above, a small cache size of 64 words is chosen for all the functions of the QSDPCM algorithm throughout this paper (i.e. in figure 2, C1=C2=...=Cn= 64 words). The results of applying the above methodology for a hardware controlled cache are given in table 1. Here we give the power requirement of the most memory intensive parts of the QSDPCM algorithm for different levels of optimization. We observe that the initial specification consumes a relatively large amount of power. The gains due to the different steps of the methodology are quite clear 3
This means that e.g. for QSDPCM the frame size for software controlled cache was the real frame of 288 × 528, whereas for hardware controlled caching this was scaled down to 48 × 64
from the columns 3, 4 and 5. The functions V4 and V2 have almost similar gains due to the different steps in the methodology. This indicates that apart from gains due to improving locality by means of global transformations and data re-use (column 3), there are equal gains due to only applying the aggressive in-place mapping and caching step (column 4). The various sources of power consumption are described below. In general we observe that the gain in power for the above methodology is quite significant for hardware controlled caches.

Fig. 4. (a) The QSDPCM algorithm (input frame processed through SubSamp2/4, ComputeV4, ComputeV2, V1Diff, Quad.Dec., Quad.Constr., Ad.Up.Samp and Reconstruct to the output bitstream) (b) Gain in number of misses between the globally optimized version and the initial loop blocked version for the QSDPCM algorithm (relative number of misses per function).
Table 1. Normalized power requirements of the QSDPCM algorithm for hardware controlled cache

    Function      Initial        Modified (globally   Initial, loop blocked   In-placed caching on
                  (no transf)    transformed)         and cached              globally transformed
    Subsamp2/4    0.57           0.218                0.117                   0.0282
    V4            0.89           0.398                0.344                   0.0753
    V2            1.00           0.381                0.334                   0.0721
    V1            1.51           0.597                0.452                   0.1161
    QuadConstr.   0.28           0.140                0.073                   0.0230
Table 2 gives the power consumption for the various functions of QSDPCM algorithm for a software controlled cache. Note that initially the major power consumption is in functions V4,V2,V1 and QuadConstr. These are due to multiple reads from the frame memories as well as large number of intermediate signals. After the global transformations, the power consumption is reduced close to the absolute minimum for the given algorithm. All other sources of power consumption are reduced (eliminated) to a negligible amount. We observe a gain (on average) of a factor 9 globally. This illustrates the huge impact of global transformations combined with more aggressive cache exploitation at compile time, even on a fully predefined processor architecture.
Table 2. Normalized power requirements of the QSDPCM algorithm for software controlled cache

    Function      Initial        Modified (globally   Initial, in-placed locally   In-placed caching on
                  (no transf)    transformed)         transformed and cached       globally transformed
    Subsamp2/4    0.21           0.092                0.095                        0.0359
    V4            0.75           0.002                0.374                        0.0010
    V2            1.90           0.005                0.492                        0.0024
    V1            3.29           0.208                0.644                        0.0836
    QuadConstr.   0.71           0.18                 0.149                        0.074
Figure 4(b) shows the number of misses for each function in the QSDPCM algorithm for the initial and optimized versions. The initial version has been optimized for caching with loop blocking, whereas the entire methodology presented in section 3 is applied for the optimized version. We observe a gain of (on average) 26%. This shows that our methodology is not only beneficial for power but also for performance (speed). For the software controlled case, an even larger gain of about 40% was measured however. It is quite clear that the gain due to the software controlled cache is relatively large compared to that of the hardware controlled cache. The reasons for this lie in the two major parameters that govern a hardware controlled cache. First, if the updating policy of the cache is not taken into account in the code at compile time, a large amount of power is consumed in the write-backs (copy-backs). So a write through policy consumes more power than a write back one. The second issue is the block size. In general, increasing block size reduces the miss rate. But nevertheless it also increases cache pollution4 in case of algorithms with less than average spatial locality. It can be concluded that more software control over the cache has a very positive impact on power as well as on speed. This is an important consideration for the cache designers of (embedded) multi-media processors.

4.2 Voice Coder Application

We have used a Linear Predictive Coding (LPC) vocoder which provides an analysis-synthesis approach for speech signals [19]. In our algorithm, speech characteristics (energy, transfer function and pitch) are extracted from a frame of 240 consecutive samples of speech. Consecutive frames share 60 overlapping samples. All information to reproduce the speech signal with an acceptable quality is coded in 54 bits. The general characteristics of the algorithm are presented in figure 5(a). It is worth noting that the steps LPC analysis and Pitch detection use autocorrelation to obtain the best matching predictor values and pitch (using a Hamming window). The autocorrelation function has irregular access structures due to algebraic manipulations, and in-place mapping performs better on irregular access structures than just loop blocking [12].
Cache pollution is a phenomenon wherein data which are not accessed (or demanded) are brought in due to larger block sizes
930
C. Kulkarni et al. Speech 128 Kbps 8
Pre-emphasis and High Pass filter LPC Analysis Vector Quantization
6
Energy Index
Pitch Detection and Analysis
Original (loop blocked) Code In-placed Code
4
2
Coding 0
Coded Speech at 800bps
1
2
4
8
Block size (bytes)
Fig. 5. (a) The different parts of the voice coder algorithm and the associated global dependencies (b)Gain in terms of energy for a voice coder autocorrelation function for different block sizes on DLX simulator Figure 5(b) shows the power requirements of the initial version (with loop blocking) and the in-place mapped version. Here only the in-place mapping and caching steps of section 3 are used so as to demonstrate that for irregular access structures it acts much better from the point of view of cache utilization than just loop blocking. We have performed this experiments for varying block sizes and have observed that our approach indeed performs much better for all the block sizes. A gain of a factor 2 is observed.
5 Conclusion
The main contributions of this paper are: (1) a formalized methodology for power-oriented source-level transformations to improve cache utilization in a hardware controlled cache context (2) the effectiveness of this methodology is demonstrated on two real-life test vehicles: the QSDPCM video compression algorithm and a voice coder (3) insight is provided into the differences in power and speed gains between hardware and software controlled caches, with consequences for low-power cache design in embedded multi-media applications.
References 1. J.Anderson, S.Amarasinghe, M.Lam, “Data and computation transformations for multiprocessors”, in 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.39-50, August 1995. 924 2. V.Bhaskaran and K.Konstantinides, “Image and video compression standards: Algorithms and Architectures”, Kluwer Academic publishers, Boston, MA, 1995. 923 3. R.W.Broderson, “The network computer and its future”, Proc. IEEE Int. SolidState Circuits Conf.,San Francisco, CA, pp.32-36, Feb.1997. 923 4. K.Danckaert, F.Catthoor and H.De Man, “System-level memory management for weakly parallel image processing”, In proc. EUROPAR-96, Lecture notes in computer science series, vol. 1124, Lyon, Aug 1996. 925, 926, 927
5. E.De Greef, F.Catthoor, H.De Man, “Program transformation strategies for reduced power and memory size in pseudo-regular multimedia applications”, accepted for publication in IEEE Trans. on Circuits and Systems for Video Technology, 1998. 924, 926, 927 6. E.De Greef, F.Catthoor, H.De Man, “Memory Size Reduction through Storage Order Optimization for Embedded Parallel Multimedia Applications”, Intnl. Parallel Proc. Symp.(IPPS) in Proc. Workshop on “Parallel Processing and Multimedia”, Geneva, Switzerland, pp.84-98, April 1997. 924, 926, 927 7. T.Halfhill and J.Montgomery, “Chip fashion: multi-media chips”, Byte Magazine, pp.171-178, Nov 1995. 924 8. P.Hicks, M.Walnock and R.M.Owens, “Analysis of power consumption in memory hierarchies”, In Proc. Int’l symposium on low power electronics and design, pp.239242, Monterey, CA, Aug 1997. 924 9. Marck Hill, “Dinero III cache simulator”, Online document available via http://www.cs.wisc.edu/ markhill, 1989. 927 10. M.Jimenez, J.M.Liaberia, A.Fernandez and E.Morancho, “A unified transformation technique for multilevel blocking”, In proc. EUROPAR-96, Lecture notes in computer science series, vol. 1124, Lyon, Aug 1996. 924 11. Milind Kamble and Kanad Ghose, “Analytical Energy Dissipation Models for Low Power Caches”, In Proc. Int’l symposium on low power electronics and design, pp.143-148, Monterey, Ca., Aug 1997. 926 12. C.Kulkarni,F.Catthoor and H.De Man, “Code transformations for low power caching in embedded multimedia processors”, Proc. of Intnl. Parallel Processing Symposium (IPPS), pp.292-297,Orlando,FL,April 1998. 925, 929 13. D.Kulkarni and M.Stumm, “Linear loop transformations in optimizing compilers for parallel machines”, The Australian computer journal, pp.41-50, May 1995. 927 14. M.Lam, E.Rothberg and M.Wolf, “ The cache performance and optimizations of blocked algorithms”, In Proc. ASPLOS-IV, pp.63-74, Santa Clara, CA, 1991. 924 15. N.Manjikian and T.Abdelrahman, “Array data layout for reduction of cache conflicts”, In Proc. of 8th Int’l conference on parallel and distributed computing systems, September, 1995. 924 16. L.Nachtergaele, F.Catthoor, F.Balasa, F.Franssen, E.De Greef, H.Samsom and H.De Man, “Optimisation of memory organisation and hierarchy for decreased size and power in video and image processing systems”, Proc. Intnl. Workshop on Memory Technology, Design and Testing, San Jose CA, pp.82-87, Aug. 1995. 924 17. P.R.Panda, N.D.Dutt and A.Nicolau, “ Memory data organization for improved cache performance in embedded processor applications”, In Proc. ISSS-96, pp.9095, La Jolla, CA, Nov 1996. 924 18. J.P. Diguet, S. Wuytack, F.Catthoor and H.De Man, “Formalized methodology for data reuse exploration in hierarchical memory mappings”, In Proc. Int’l symposium on low power electronics and design, pp.30-36, Monterey, CA, Aug 1997. 927 19. L.R.Rabiner and R.W.Schafer, “Digital signal processing of speech signals”, Prentice hall Int’l Inc., Englewood cliffs, NJ, 1988. 929 20. K.Ronner, J.Kneip and P.Pirsch, “A highly parallel single chip video processor”, Proc. of the 3rd International Workshop, Algorithms and Parallel VLSI architectures III, Elsevier, 1995. 923 21. P.Strobach, “QSDPCM – A New Technique in Scene Adaptive Coding,” Proc. 4th Eur. Signal Processing Conf., EUSIPCO-88, Grenoble, France, Elsevier Publ., Amsterdam, pp.1141–1144, Sep. 1988. 927
22. V.Tiwari, S.Malik and A.Wolfe, “Instruction level power analysis and optimization of software”, Journal of VLSI signal processing systems, vol. 13, pp.223-238, 1996. 923 23. Uming Ko, P.T.Balsara and Ashmini K.Nanda, “Energy optimization of multi-level processor cache architectures”, In Proc. Int’l symposium on low power electronics and design, pp.45-49, Dana Point, CA, 1995. 924 24. M.E.Wolf and M.Lam, “A loop transformation theory and an algorithm to maximize parallelism”, IEEE Trans. on parallel and distributed systems, pp.452-471, Oct 1991. 927
Parallel Solutions of Simple Indexed Recurrence Equations Yosi Ben-Asher1 and Gady Haber2 1
Dep. of Math. and CS., Haifa University, 31905 Haifa, Israel, [email protected] 2 IBM Science and Technology, Haifa, Israel, [email protected]
Abstract. We consider a type of recurrence equations called “Simple Indexed Recurrences” (SIR) wherein ordinary recurrences of the form X[i] = opi(X[i − 1], X[i]) (i = 1 . . . n) are extended to X[g(i)] = opi(X[f(i)], X[g(i)]), such that opi is an associative binary operation, f, g : {1 . . . n} → {1 . . . m} and g is distinct.1 This extends our capabilities for parallelizing loops of the form: for i = 1 to n {X[i] = opi(X[i − 1], X[i])} to the form: for i = 1 to n {X[g(i)] = opi(X[f(i)], X[g(i)])}. An efficient solution is presented for the special case where we know how to compute the inverse of the opi operator. The algorithm requires O(log n) steps with O(n/ log n) processors. Furthermore, we present a practical and improved version of the non-optimal algorithm for SIR presented in [1] which uses repeated iterations of pointer jumping. A sequence of experiments was performed to test the effect of synchronous and asynchronous message-passing executions of the algorithm for p << n processors. This algorithm computes the final values of X[] in (n/p) · log p steps and n · log p work, with p processors. The experiments show that pointer jumping requires O(n) work in most practical cases of SIR loops, thus forming a more practical solution.
1 Introduction
Ordinary recurrence equations of the form Xi = opi (Xi−1 , Xi ) i = 1 . . . n can be generalized to what we refer to as Indexed Recurrence (IR) equations. In IR equations, general indexing functions of the form Xg(i) = opi (Xf (i) , Xg(i) ) replace the i, i−1 indexing of the ordinary recurrences. Efficient parallel solutions to such equations can be used to parallelize sequential loops of the form: for i = 1 to n { X[g(i)] = opi (X[f (i)], X[g(i)]) }
if the following conditions are met: 1. opi (x, y) is any segment of code that is equivalent to a binary associative operator and may be dependent on i. 2. the index functions f, g : {1..n} → {1..m} do not include references to elements of the X[] array. 1
This paper is a continuation of the work on IR equations presented in [1].
3. the index function g is distinct, i.e., for all indexes i, j: if i ≠ j then g(i) ≠ g(j).
In practice, such solutions can be used to parallelize sequential loops within given source code segments.
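As a concrete instance of a loop satisfying these conditions, the following C sketch (ours, using the g(i) = 2i, f(i) = i + 1 example discussed later in the paper with multiplication as opi) shows the sequential form that the parallel solutions replace:

    /* Sequential SIR loop: X[g(i)] = op_i(X[f(i)], X[g(i)]) with g(i) = 2*i,
     * f(i) = i+1 and op = multiplication.  g is distinct and neither index
     * function reads X[], so the loop qualifies as a SIR instance.
     * X is 1-based like in the paper and needs at least 2*n+1 entries. */
    void sir_sequential(double X[], int n)
    {
        for (int i = 1; i <= n; i++)
            X[2 * i] = X[i + 1] * X[2 * i];
    }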
2 An Efficient SIR Algorithm for the Case Where the Inverse of opi is Known
In this section we describe an efficient parallel algorithm for SIR loops, provided we know how to compute the inverse operation opi^-1. The algorithm requires O(log n) steps with O(n/ log n) processors. The first step in the algorithm is to translate the SIR loop into its upside-down tree representation T, in which every vertex i (1 ≤ i ≤ n) represents the g(i) index and every edge e = {i, j} represents the dependency opi(Xg(j), Xg(i)), where g(j) = f(i) and j < i. For example, consider the following SIR loop for i = 1 . . . 6, with X[1..n] initialized by A[1] . . . A[n], and its upside-down tree representation (vertex 1 is the root with children 2 and 3; vertex 3 has children 4, 5 and 6):

    X[1] := A[1]
    X[2] := X[1] · X[2]
    X[3] := X[1] · X[3]
    X[4] := X[3] · X[4]
    X[5] := X[3] · X[5]
    X[6] := X[3] · X[6]
We then apply the Euler Tour circuit on T in order to compute all the prefixes of the vertices in T. An Euler Tour in a graph is a path that traverses each edge exactly once and returns to its starting point. The Euler tour circuit which operates on a tree assumes that the tree is directed and that it is represented by an Adjacency List. In the directed tree version TD of T, each undirected edge {i, j} in T has two directed copies, one for (i, j) and one for (j, i). For example, consider the directed version TD of the tree for the above example and its Adjacency List representation:

    1 : (1, 3) → (1, 2) → Nil
    2 : (2, 1) → Nil
    3 : (3, 1) → (3, 4) → (3, 5) → (3, 6) → Nil
    4 : (4, 3) → Nil
    5 : (5, 3) → Nil
    6 : (6, 3) → Nil
Our main concern when coming to construct the adjacency list of TD is finding the list of all the children of each vertex i in TD efficiently; i.e., finding all vertices j < i which satisfy g(j) = f(i). Sorting all the pairs <f(1), g(1)>, . . . , <f(n), g(n)> using f(i) as the sorting key will automatically group together all vertices which have the same parent. The sorting process can be done efficiently using the Integer Sort algorithm [3] which operates on keys in the range [1..n]. After the adjacency list of TD is formed, we can construct the Euler tour of TD in a single step according to the following rule:

    EtourLink(v, u) = Next(u, v)   if Next(u, v) ≠ Nil
                      AdjList(u)   otherwise
by allocating a processor to each pair of edges (v, u), (u, v) in TD. Once the Euler tour of TD is constructed, in the form of a linked list, we can compute the prefixes of all the vertices of TD using the efficient parallel prefix algorithm on the linked list [2], after initializing all the edges of the tree with the following weights:
– The weight of each forward edge <f(i), g(i)> in TD is initialized by A_g(i), where A is the array of initial values of the X array of SIR.
– The weight of each backward edge <g(i), f(i)> is initialized by the inverse value A_g(i)^-1.
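A sequential C sketch of this prefix computation for the six-vertex example may help; it is our own illustration (the tour below is written out by hand for this example and the demo values in A[] are arbitrary), whereas the parallel algorithm produces the tour with the EtourLink rule and replaces the sequential walk by parallel list prefix.

    #include <stdio.h>

    int main(void)
    {
        double A[7] = {0, 3, 5, 7, 2, 4, 6};          /* A[1..6]: demo initial values   */
        /* Euler tour of T_D as (from,to) pairs, visiting vertex 2 first, then 3: */
        int tour_from[10] = {1, 2, 1, 3, 4, 3, 5, 3, 6, 3};
        int tour_to[10]   = {2, 1, 3, 4, 3, 5, 3, 6, 3, 1};
        double X[7];
        double prefix = A[1];                         /* weight contributed by the root */
        X[1] = prefix;
        for (int e = 0; e < 10; e++) {
            int u = tour_from[e], v = tour_to[e];
            if (u < v) {                              /* forward edge (parent->child in */
                prefix *= A[v];                       /* this example): weight A[g(i)]  */
                X[v] = prefix;                        /* first visit of v: its prefix   */
            } else {                                  /* backward edge: inverse weight  */
                prefix /= A[u];
            }
        }
        for (int i = 1; i <= 6; i++)
            printf("X[%d] = %g\n", i, X[i]);          /* matches the SIR loop results   */
        return 0;
    }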
1
A -1 3 -1
A 2
2
A 3
3
A 4
A -1
4
A 1
* A 2 /
A 2
*
A 3 *
A 4 /
A 4
A 4
A 5
5
*
-1 6
-1
A 5
A 5 /
A 6
A 5 *
Euler Tour of TD with initial weights
3
X[1] X[2] X[3] X[4] X[5] X[6]
A -1 1
A
A 2
= = = = = =
A[1] A[1] A[1] A[1] A[1] A[1]
· · · · ·
A[2] A[2]/A[2] · A[2]/A[2] · A[2]/A[2] · A[2]/A[2] ·
A[3] A[3] · A[4] A[3] · A[4]/A[4] · A[5] A[3] · A[4]/A[4] · A[5]/A[5] · A[6]
6
A 6 /
A 6 /
A 3
/
A 1
The Prefixes for each vertex
Parallel Solution for SIR by Using Pointer Jumping
Consider for example the SIR problem A[2i] := A[i + 1] · A[2i] (for i = 1, 2, .., n) where g(i) = 2i and f(i) = i + 1. The final value of each element A′[i] is a product of a varying number of items. We refer to this sequence of multiplications of every element i in A′[] as the trace of A′[i]. In general the trace of each element A′[g(i)] satisfies, for all i = 1 . . . n: A′[g(i)] = A[f(jk)] ⊕ . . . ⊕ A[f(j1)] ⊕ A[g(i)] such that:
– j1 = i.
– for t = 2 . . . k the indices jt satisfy jt < jt−1 and g(jt) = f(jt−1).
– jk is the last index for which g(jt) = f(jt−1), i.e., there is no 1 ≤ jk+1 < jk such that g(jk+1) = f(jk).
The above rule suggests a simple method for computing A′[g(i)] in parallel. Let A^-t[g(i)] denote the sub-trace with the t + 1 rightmost elements in the trace of A′[g(i)], i.e., A^-t[g(i)] = A[f(jk−t)] ⊕ . . . ⊕ A[f(i)] ⊕ A[g(i)], and now consider the concatenation (or multiplication) of two “successive” sub-traces: A^-(t1+t2)[g(i)] = A^-t1[g(j)] ⊕ A^-t2[g(i)] where g(j) = f(jk−t2) and A[f(jk−t2)] is the last element in A^-t2[g(i)]. The proposed algorithm (as shown in [1]) is a simple greedy algorithm that keeps iterating until all traces are completed. In each iteration all possible concatenations of successive sub-traces are computed in parallel, where the concatenation operation of sub-traces A^-t1[g(j)], A^-t2[g(i)] can be implemented as follows:
1. The value of a sub-trace A^-t[g(i)] is stored in its array element A[g(i)].
2. A pointer Next[g(i)] points to the sub-trace A^-t1[g(j)] to be concatenated to A^-t2[g(i)] (to form A^-(t1+t2)[g(i)]). Hence, A[Next[g(i)]] contains the value of the sub-trace A^-t1[g(j)].
Following is the code of an improved and a more practical version of the algorithm for which we ran our experiments with p << n processors. The new algorithm improves the above algorithm in the following points: – It works with compressed array of size n (the number of loop iterations) instead of the original array size, which is of size m > n, by mapping every A [g(i)] element to an auxiliary array X[i]. – The index function f (i) is transformed to a decreasing function which never refers to future indexes of X[]. This reduces the total number of iterations required by the algorithm from an order of log n to log p, improving the execution time to np log p. Suppose each processor computes np traces and uses a sequence of passes wherein it calculates the traces sequentially one after another. Since f () always points backwards to earlier elements in the input array, then after each step i we already have the final values of the first 2i elements, requiring a total of log p steps in order to complete the entire array of size n. The same argument however, does not hold for the case where f () can refer to future elements in the array. Input: Initialized Arrays A[1..m], g[1..m], f [1..m], ⊕.
(where ⊕ is a binary associative operator)
Output: Array X[1..n] where X[i] will hold the value of A[g[i]] at the end of the IR loop.

    Initialize Auxiliary Arrays X[1..n], G[1..m], Next[1..n] to zero.
    forall i ∈ {1..n} do in parallel
        G[g[i]] := i;        (The auxiliary array G represents the inverse of g, i.e. G[i] ≡ g^-1(i))
    end parallel-for
    forall i ∈ {1..n} do in parallel
        u := G[f[i]];        (The iteration where A′[f(i)] was modified by the loop.)
        if (u ≥ i or u = 0) {                      (The trace of A′[g(i)] is of size two.)
            X[i] := A[f[i]] ⊕ A[g[i]];             (i.e., A′[g(i)] = A[f(i)] ⊕ A[g(i)].)
            Next[i] := 0;
        } else {
            if (f[u] ≥ f[i] or f[u] = 0) {         (The trace of A′[g(i)] is of size three.)
                X[i] := A[f[u]] ⊕ A[f[i]] ⊕ A[g[i]];
                Next[i] := 0;
            } else {                               (The trace of A′[g(i)] is of size greater than three.)
                X[i] := A[f[i]] ⊕ A[g[i]];
                Next[i] := f[u];
            }
        }
    end parallel-for
Initialization Stage
    for t := 1 to log n do        (The algorithm performs, at most, log n iterations)
        forall i ∈ {(2^(t−1) + 1)..n} do in parallel
            if (Next[i] > 0) then {        (Apply the concatenation operation followed by the pointer updating)
                X[i] := X[Next[i]] ⊕ X[i];
                Next[i] := Next[Next[i]];
            }
The code for the iteration phases
4 Experimental Results for Message Passing Sync and Async SIR
For the message-passing version of SIR we assume that the initial values A[1..m] are partitioned between the processors, with each processor holding m/p elements.
Thus the first step is for each processor to fetch the O(n/p) elements of A[] that it will use during the initialization stage. Each processor will sort the requested elements according to their processor destination, and then fetch them using p “big-messages” (i.e., using message packaging). Two variants of this algorithm have been tested: Sync SIR: where all processors compute one iteration of the main loop synchronously. Async SIR(k): where each processor performs l iterations of the main loop before passing the control to the next processor. The number l is chosen randomly (every time) from the range 1 . . . k, where k is the parameter of the algorithm. Thus, by choosing different values of k we can change the amount of asynchronous deviation of the execution. We simulated different numbers of processors, p = 4, 8, 16, 32, 64, 128, 256, 512, 1024 with n = 500,000. The simulation of the algorithm was made on a sequential machine, since this allowed us an exact and simple measure of the total size of generated messages (communication), the work (in units of C instructions) and the execution time (also in units of C instructions). All diagrams contain an artificial curve (e.g., f(p) = (n/p) · log n) for comparison. We first chose to test a SIR loop where g(i) = i and f(i) = i − 1, as this maximizes the length of each trace and consequently the expected work. For Sync SIR, we already know the results: execution time approximately (n/p) · log p, work n · log p, total of p · log p big messages and communication (total number of data items that have been sent) of p · log p. The results of the execution time in fig. 1 verify this computation and show that increasing the parameter k of Async SIR(k) can reduce the speedups significantly. Another observation regarding Async SIR(k) is that for a relatively small number of processors the impact of k is much larger than with a large number of processors. According to fig. 1, the difference between the various Async SIR(k) results becomes constant (independent of p) for values greater than 32. This is because of the relatively small number of A[]'s elements that are allocated to each processor (n/p ≈ p). The same results are obtained for the work of the algorithms as described in fig. 1. The communication and number of big messages is exactly p · log p, as expected. These experiments were repeated for a random setting f(i) = random(1 . . . n). In general, we hoped to reach optimal performances of execution time n/p and work O(n). The execution time has improved, and Sync SIR (see fig. 2) is now between n/p and (n/p) · log p. In addition, the effect of k in Async SIR(k) on the execution time is reduced, and for k = 1, 5, √p we get an execution time that is less than (n/p) · log n. In particular, we can see that the work (fig. 2) behaves like O(n) rather than n log p. The communication (fig. 2) is below n. Unlike the case f(i) = i − 1, the effect of k in Async SIR(k) on the communication is negligible (all the results for k = 1, 5, √p are the same). In addition, asynchronous execution seems to improve the communication compared to Sync SIR. An asynchronous execution might, for example, cause the middle processor to advance its Next[i] pointers before other processors started to work. This might reduce the communication that would have occurred between
these processors if the middle processor hadn't advanced its pointers. The total number of messages was not affected by the random setting and (as is described in fig. 2) is around p · log p.

Fig. 1. Execution time and Work for f(i) = i − 1.
Fig. 2. Execution time, Work, Comm. and big-messages for f(i) = random(1..n).
References
1. Y. Ben-Asher and G. Haber. Parallel solutions of indexed recurrence equations. In Proceedings of the IPPS'97 conference, Geneva, 1997.
2. C. P. Kruskal, L. Rudolph, and M. Snir. The power of parallel prefix. IEEE Transactions on Computers, 34(C):965–968, 1985.
3. S. Rajasekaran and S. Sen. On parallel integer sorting. Technical Report (to appear in ACTA INFORMATICA), Department of Computer Science, Duke University, 1987.
Scheduling Fork Graphs under LogP with an Unbounded Number of Processors Iskander Kort and Denis Trystram LMC-IMAG BP53 Domaine Universitaire 38041 Grenoble Cedex 9, France {kort,trystram}@imag.fr
Abstract. This paper deals with the problem of scheduling a specific precedence task graph , namely the Fork graph, under the LogP model. LogP is a computational model more sophisticated than the usual ones which was introduced to be closer to actual machines. We present a scheduling algorithm for this kind of graphs. Our algorithm is optimal under some assumptions especially when the messages have the same size and when the gap is equal to the overhead.
1 Motivation
The last decade was characterized by the huge development of many kinds of parallel computing systems. It is well-known today that a universal computational model cannot unify all these varieties. PRAM is probably the most important theoretical computational model. It was introduced for shared-memory parallel computers. The main drawback of PRAM is that it does not allow one to take into account the communications through an interconnection network in a distributed-memory machine. Practical PRAM implementations often have bad performance. Many attempts to define standard computational models have been proposed. More realistic models such as BSP and LogP [3] appeared recently. They incorporate some critical parameters related to communications. The LogP model is getting more and more popular. A LogP machine is described by four parameters L, o, g and P. Parameter L is the interconnection network latency. Parameter o is the overhead on processors due to local management of communications. Parameter g represents the minimum duration between two consecutive communication events of the same type. Parameter P corresponds to the number of processors. Moreover, the model assumes that at most ⌈L/g⌉ messages from (resp. to) a processor may be in transit at any time. We present in this paper an optimal scheduling algorithm under the LogP model for a specific task graph, namely the Fork graph. We assume that the number of processors is unbounded. In the remainder of this section we give some definitions and notations concerning Fork graphs, then we briefly describe some related work. In Sect. 2, the scheduling algorithm is presented.
1.1 About the Fork Graph
A Fork graph is a tree of height one. It consists of a root task denoted by T0 preceding n leaf tasks T1, . . . , Tn. The root sends, after its completion, a message to each leaf task. In the remainder of this paper, symbol F will refer to a Fork graph with n leaves. In addition, we denote by wi the execution time of task Ti, i ∈ {0, . . . , n}. The processors are denoted by pi, i = 0, 1, . . .. We assume, without loss of generality, that task T0 is always scheduled on processor p0. Let S be a schedule of F; then dfi(S) denotes the completion time of processor pi in S where i ∈ {0, 1, . . .}. Furthermore, df(S) denotes the makespan (length of S) and df* the length of an optimal schedule of F. The contents of the messages sent by T0 must be considered when dealing with scheduling under LogP. Indeed, assume that T0 sends the same data to some tasks Ti and Tj. Then assigning these tasks to the same processor (say pi, i ≠ 0) saves a communication. Two extreme situations are generally considered. In the first one, it is assumed that T0 sends the same data to all the leaves. This is called a common data semantics. In the second situation, the messages sent by T0 are assumed to be pairwise different. This is called an independent data semantics [4].

1.2 Related Work
The problem of scheduling tree structures with communication delays and an unbounded number of processors has received much interest recently. The major part of the available results focused on the extended Rayward-Smith model. Chrétienne has proposed a polynomial-time algorithm for scheduling Fork graphs with arbitrary communication and computation delays [1]. He has also showed that finding an optimal schedule for a tree with a height of at least two is NP-hard [2]. More recently, some results about scheduling trees under LogP have been presented. In [6], Verriet has proved that finding optimal Fork schedules is NP-hard when a common data semantics is considered. In [5], the authors have presented a polynomial-time algorithm that determines optimal linear schedules for inverse trees under some assumptions.
2 An Optimal Scheduling Algorithm
We present in this section an optimal scheduling algorithm of F when the number of processors is unbounded (P ≥ n). Furthermore, we assume that:
– The messages sent by the root have the same size.
– The gap is equal to the overhead: g = o. This is the case for systems where the communication software is a bottleneck.
– An independent data semantics is considered.
– wi ≥ wi+1, ∀i ∈ {1, . . . , n − 1}.
We start by presenting some dominant properties related to Fork schedules.
Lemma 1. Each of the following properties is dominant: π1 Each leaf task Ti such that wi ≤ o is assigned to p0 . π2 Processor p0 executes T0 at first, then it sends the messages to tasks Ti which are assigned to other processors. These messages are sent according to decreasing values of wi . Finally, p0 executes the tasks that were assigned to it in an arbitrary order. π3 Each processor pi (i = 0) executes at most one task, as early as possible (just after receiving its message). Every schedule that satisfies properties π1 −π3 will be called a dominant schedule. Such schedules are completely determined when the subset A∗ of the leaves that should be assigned to p0 is known. We propose in the sequel an algorithm (called CLUSTERFORK) that computes this subset. Initially, each task is assigned to a distinct processor. The algorithm manages two variables b and B which are a lower bound and an upper bound on df ∗ respectively. More specifically, at any time we have b ≤ df ∗ < B. These bounds are refined as the algorithm proceeds. CLUSTERFORK explores the tasks from T1 to Tn . For each task Ti , i ∈ {1, . . . , n}, one of the following situations may be encountered. If the completion time of Ti in the current schedule is not less than B then Ti is assigned to p0 . If dfi is not greater than b, then the algorithm does not assign this task to p0 . Finally, if dfi is in the range ]b, B[, then the algorithm checks whether there is a schedule S of F such that df (S) < dfi . If such a schedule exists, then Ti is assigned to p0 and B is set to dfi , otherwise Ti is not assigned to p0 and b is set to dfi . Theorem 1. Let A∗ be a subset produced by CLUSTERFORK, then any dominant schedule S ∗ associated with A∗ is optimal. Finally, it is easy to see that CLUSTERFORK has a computational complexity of O(n2 ).
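To make the role of the dominant properties concrete, the following small C evaluator (our own sketch, not CLUSTERFORK itself) computes the makespan of a dominant schedule for a given assignment set, assuming g = o and the message timing used in the initial completion-time formula of Algorithm 1 (the i-th message sent costs o on p0, then L in transit, then o on the receiving processor). CLUSTERFORK searches for the assignment A* that minimizes this value; the evaluator below only scores one candidate assignment.

    #include <stdio.h>

    /* assign[i] = 1 keeps leaf T_i on p0; w[1..n] is assumed sorted decreasingly,
     * and remote leaves are served in that order (property pi_2). */
    double dominant_makespan(int n, double w0, const double w[],
                             const int assign[], double L, double o)
    {
        double makespan = 0.0;
        int sent = 0;
        for (int i = 1; i <= n; i++) {
            if (!assign[i]) {                        /* leaf executed remotely       */
                sent++;                              /* it receives the sent-th msg  */
                double df_i = w0 + sent * o + L + o + w[i];
                if (df_i > makespan) makespan = df_i;
            }
        }
        double df_0 = w0 + sent * o;                 /* p0: run T0, send all messages */
        for (int i = 1; i <= n; i++)
            if (assign[i]) df_0 += w[i];             /* then run its local leaves     */
        return df_0 > makespan ? df_0 : makespan;
    }

    int main(void)                                   /* tiny usage example            */
    {
        double w[] = {0, 5, 4, 3, 1};                /* w[1..4], decreasing           */
        int assign[] = {0, 0, 0, 0, 1};              /* keep only T4 on p0            */
        printf("makespan = %g\n", dominant_makespan(4, 2.0, w, assign, 3.0, 1.0));
        return 0;
    }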
3 Concluding Remarks
In this paper we presented an optimal polynomial time scheduling algorithm for Fork graphs under the LogP model and using an unbounded number of processors. This problem was solved under some assumptions which hold for a current parallel machine, namely the IBM-SP. We remark that a slight modification in the problem parameters (for instance when the messages are the same) leads to an NP-hardness result. Indeed, in this latter case, scheduling at most one task on each processor pi, i ≠ 0, is no longer a dominant property and gathering some tasks on the same processor may lead to better schedules.
Algorithm 1 CLUSTERFORK
    Begin
        {Initially: dfi = w0 + (i + 1) · o + L + wi, ∀ 1 ≤ i ≤ n}
        b := 0;           {lower bound on df*}
        B := MAXVAL;      {upper bound on df*}
        for i := 1 to n do
            if (dfi ≥ B) then
                xi := 1;            {Ti is assigned to p0}
                UPDATE_DF(i);
            else if (dfi > b) then
                {Look for a schedule such that df < dfi}
                j := i; k := 0; v := df0;
                while (j ≤ n) do
                    if (dfj − k · o ≥ dfi) then
                        {The assignment of Tj to p0 is necessary}
                        if (v + wj − o ≥ dfi) then
                            j := n + 1;
                        else
                            v := v + wj − o; k := k + 1;
                    j := j + 1;
                if (j = n + 2) then
                    {The looked-for schedule does not exist}
                    xi := 0; b := dfi;
                else
                    {df* < dfi}
                    xi := 1; B := dfi;
                    UPDATE_DF(i);   {Update processors' completion times}
            else {dfi ≤ b}
                xi := 0;
    End CLUSTERFORK
References
1. P. Chrétienne. Task Scheduling over Distributed Memory Machines. In North Holland, editor, International Workshop on Parallel and Distributed Algorithms, 1989.
2. P. Chrétienne. Complexity of Tree-scheduling with Interprocessor Communication Delays. Technical Report 90.5, MASI, Pierre and Marie Curie University, Paris, 1990.
3. D. E. Culler et al. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78–85, November 1996.
4. L. Finta and Z. Liu. Complexity of Task Graph Scheduling with Fixed Communication Capacity. International Journal of Foundations of Computer Science, 8(1):43–66, 1997.
5. W. Löwe, M. Middendorf, and W. Zimmermann. Scheduling Inverse Trees under the Communication Model of the LogP-Machine. Theoretical Computer Science, 1997. To appear.
6. J. Verriet. Scheduling Tree-structured Programs in the LogP Model. Technical Report UU-CS-1997-18, Department of Computer Science, Utrecht University, 1997.
A Data Layout Strategy for Parallel Web Servers

Jörg Jensch, Reinhard Lüling, and Norbert Sensen
Department of Mathematics and Computer Science, University of Paderborn, Germany
{jaglay,rl,sensen}@uni-paderborn.de
Abstract. In this paper a new mechanism for mapping data items onto the storage devices of a parallel web server is presented. The method is based on careful observation of the effects that limit the performance of parallel web servers, and by studying the access patterns for these servers. On the basis of these observations, a graph theoretic concept is developed, and partitioning algorithms are used to allocate the data items. The resulting strategy is investigated and compared to other methods using experiments based on typical access patterns from web servers that are in daily use.
1 Introduction
Because of the exponential growth in terms of number of users of the World Wide Web, the traffic on popular web sites increases dramatically. To resolve capacity requirements on the server systems hosting popular web sites, one way to strengthen the capability of a server system is to use parallel server systems that contain a number of disks attached to different processors which are connected by a single bus or a scalable communication network. The same solution can be applied for very large web sites which cannot be stored on one disk but use a number of disks connected to a single processor system. The problem that shows up if a number of disks are used to store the overall amount of data is to balance the number of requests issued to the disks as evenly as possible in order to achieve the largest overall throughput from the disks. The requests that have to be served by one disk are strictly dependent on the data layout of the server system, i.e. the mapping of the files stored on the server onto the different storage subsystems (disks). The investigation of a new concept for the data layout of parallel web servers is the focal point of this paper. The problem is of major relevance for the performance of web servers and can be assumed to become even more important if the current technological trends concerning the bandwidth offered by storage devices, processors and wide area
This work was partly supported by the MWF Project “Die Virtuelle Wissensfabrik”, the EU project SICMA, and the DFG Sonderforschungsbereich 1511 “Massive Parallelität: Algorithmen, Entwurfsmethoden, Anwendungen”.
communication networks are regarded: whereas the performance of processors and external communication networks (ATM, Gigabit Ethernet) has increased dramatically over the last years, the read/write performance of storage devices shows only little increase over the same period. Thus, the performance of a server system can only be increased by the use of a larger number of disks that deliver data elements in parallel to the external communication network. Different approaches can be found in the literature in order to increase the performance of web servers: NCSA [7] and SWEB [4] have built a multi-workstation HTTP server based on round-robin domain name resolution (DNS) to assign requests to a set of workstations. In [5] this DNS strategy is extended; the time-to-live (TTL) periods are used to distribute the load more evenly. In the past different researchers tried to identify the workload behavior with emphasis on the development of a caching strategy. NCSA developers analyzed user access patterns for system configurations to characterize the WWW traffic in terms of request count, request data volume, and requested resources [9]. Arlitt and Williamson [3] studied workloads of different Internet Web servers. Their emphasis was placed on finding universal invariants characterizing Web server workloads. Bestavros et al. [2] tried to characterize the degree of temporal and spatial locality in typical Web server reference streams. A general survey on file allocation is given in [13]. In [9] the problem of declustering is solved by transforming the problem onto a MAX-CUT problem. The data layout method that is presented in this paper is based on the following idea: the goal of the layout is to minimize the number of items that reside on the same disk and are often requested simultaneously. So we examine the access pattern of the past and place those items on different disks. Since the items which are often simultaneously requested are roughly the same on following days, this strategy leads to a good data layout. The model of our parallel web server is presented in section 2. In section 3 we show that typical access patterns remain constant over some time, and collisions are typical patterns of a web server. In section 4 the data layout strategy is presented and the strategy is studied in detail, compared with other methods and limits of the strategy are shown. The paper finishes on some concluding remarks in section 5.
2 Model
This section presents the model of the parallel web server that is used to describe the data layout algorithm in the next sections. Thus, in our model we concentrate on the aspects that are important for the presentation in the rest of the paper. A parallel web server is built by the following entities: A number of processing modules that are connected by some kind of network or bus architecture, a number of communication devices that connect the processing modules to the external clients accessing the server and requesting information, and a number of storage devices (disks) that are connected to the processing modules.
The parallel web server stores data items (files) on the disks and works as follows:
– The server is able to accept one or more requests per time step, arriving via the communication devices from the external clients.
– The server forwards a request to the disk holding the requested item. We assume here that each item is stored only once on the whole disk pool.
– Every disk can accept only one request per time step. If more than one request is sent to a disk per time step, these requests queue up.
– The processing time for each request on the disk is constant and takes one time unit, so one disk can process only one request per time unit.
– All disks are independent, so that the server can process a maximum of n requests per time step if n is the number of disks.
In case two requests are forwarded to the same disk in one time step, one request can be served and the other request has to wait for one time unit. This conflict resolution can be done arbitrarily, i.e. we do not assume a specific protocol here. If two (or more) requests access the same storage device (disk) we say that these requests are colliding, i.e. that they are in collision; two requests collide if they arrive at the web server within a time interval δ. We have chosen δ to be one second. The aim of our work is to develop a data layout strategy that maps the data items onto the disks in such a way that the requests arriving at the server lead to a minimal number of collisions and therefore to a minimal latency in answering the data requests issued by the external clients.
3 Monitoring the Access to Web Servers
In order to increase the overall performance of a parallel web server it is not so much the overall number of requests issued to each disk that has to be balanced; rather, it has to be avoided that a large number of requests is submitted to a single disk at the same time. This means that in each small time interval the load has to be distributed as evenly as possible, minimizing the latency time for a request this way. Thus, the aim is to minimize the number of collisions on the disks. In order to get an impression of the collisions that occur we have taken the log files from the web server of the University of Paderborn (www.uni-paderborn.de) for November and December 1997. We built all tuples (i, j) of files i and j stored on the web server in Paderborn and measured the number of collisions, i.e. the number of hits on files i and j within a time interval δ. Figure 1 shows the number of collisions for all pairs of files for two consecutive days, sorted by the collisions of the first day. The x-axis lists all tuples (i, j) of files according to the collisions on the first day, and the y-axis gives the number of collisions for each pair of files. It is interesting to observe that the collisions are very similar on the two consecutive days. So if we develop an algorithm that minimizes the collisions for a typical access pattern of the web server monitored on one day, these collisions will also
Fig. 1. Distribution of collisions. (Log–log plot; x-axis: collision no. — file pairs ranked by the collisions of the first day; y-axis: number of collisions; curves for 11/26/97 and 11/27/97.)
be eliminated on the next day. Thus, we can take the similarity of access patterns into account for the construction of our data layout strategy. This is the basic observation and the foundation of our data layout principle.
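To make the measurement concrete: since two requests collide when they arrive within δ = 1 second of each other, the collision counts can be obtained in a single pass over the time-sorted log with a sliding window. The sketch below is our own illustration of that computation, not the code used for the Paderborn measurements; the log representation (files already mapped to integer ids) and the dense collision matrix are simplifying assumptions.

#include <stdlib.h>

#define DELTA 1.0   /* collision window delta, in seconds */

typedef struct { double t; int file; } Request;   /* one log entry */

/* Count pairwise collisions c(i,j) from a log sorted by arrival time.
   Requests r and s collide if |t_r - t_s| <= DELTA.  The matrix c is
   dense (nfiles * nfiles) for simplicity; a real server with many files
   would use a hash map keyed by file pairs instead. */
void count_collisions(const Request *log, int nreq, int nfiles, long *c)
{
    int lo = 0, hi, k;
    for (hi = 0; hi < nreq; hi++) {
        while (log[hi].t - log[lo].t > DELTA)      /* shrink window */
            lo++;
        for (k = lo; k < hi; k++) {                /* pair entry hi with window */
            int i = log[k].file, j = log[hi].file;
            if (i != j) { c[i * nfiles + j]++; c[j * nfiles + i]++; }
        }
    }
}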
4 Data Layout Strategies
As described above the data layout strategy aims at minimizing the number of collisions for a typical access pattern. The mapping that is computed in this way can be assumed to also avoid a large number of collisions in the future as the distribution of collisions is very similar on consecutive days. In the following we will firstly explain the algorithm used to compute the data layout, and secondly evaluate the performance of this algorithm in detail.
4.1 Algorithm
Throughout the rest of the paper we define F to be the set of data items (files) and {(t1, f1), (t2, f2), ..., (tm, fm), ...} to be an access pattern for a web server, with ti being the time when the request for data item (file) fi ∈ F arrives at the server. Then c(i, j) = |{{(t, i), (t', j)} : |t − t'| ≤ δ}| is defined as the number of collisions of files i and j for the given access pattern. Our strategy is to distribute the objects stored on the web server onto the given disks in such a way that collisions are minimized. For a given access pattern this leads to the following algorithmic problem:
given: A set of data items F, the access pattern, and a given number of storage devices n.
question: Determine a mapping π = argmin_{l : F → {1,...,n}} Σ_{i,j ∈ F, l(i)=l(j)} c(i, j).
It is easy to map this problem to the MAX-CUT problem. The MAX-CUT problem is defined as follows:
given: A graph G = (V, E), weights w(e) ∈ N and a number of partitions n.
question: Determine a partition of V into n subsets such that the sum of the weights of the edges having their endpoints in different subsets is maximal.
The mapping is done in such a way that the data items (files) are the nodes of the graph that has to be partitioned, and the number of collisions c(i, j) determines the weight of edge {i, j}. The MAX-CUT problem is known to be NP-hard [6]. However, there are good polynomial-time approximation algorithms which deliver good solutions. In our experiments we make use of the PARTY library [12], which contains an efficient implementation of an extension of the partitioning algorithm described in [8].
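To illustrate the reduction, a minimal stand-in for the partitioning step is sketched below: files become graph nodes, the collision counts c(i,j) become edge weights, and each file is placed greedily on the disk where it adds the least intra-disk collision weight. This greedy heuristic is only our illustration; the experiments in the paper use the PARTY library's partitioner, not this code.

/* Greedy stand-in for the MAX-CUT based data layout: place each file on the
   disk where it currently collides least with the files already placed there.
   `c` is the dense collision matrix from the previous sketch; disk[i] receives
   the chosen disk for file i. */
void greedy_layout(const long *c, int nfiles, int ndisks, int *disk)
{
    int i, j, d;
    for (i = 0; i < nfiles; i++) {
        long best_cost = -1;
        int  best_d = 0;
        for (d = 0; d < ndisks; d++) {
            long cost = 0;
            for (j = 0; j < i; j++)              /* files placed so far */
                if (disk[j] == d)
                    cost += c[i * nfiles + j];
            if (best_cost < 0 || cost < best_cost) { best_cost = cost; best_d = d; }
        }
        disk[i] = best_d;                        /* minimise intra-disk weight */
    }
}

Minimising the intra-disk (uncut) weight is equivalent to maximising the cut weight, which is exactly the MAX-CUT formulation given above.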
4.2 Collision Resolving if Access Patterns Are Known in Advance
In a first step we examine the gain of our algorithm if the access pattern, and therefore the collisions, are known in advance. For this purpose we take the access statistics of one day and determine the number of collisions of each pair of data items. On the basis of these statistics we build the graph, partition the graph, and examine how many collisions have remained and how many have been resolved. In the following we compare the results of our algorithm with the random mapping strategy for a number of access patterns. Each access pattern exactly represents all requests that were issued to the server during one day. We compare the number of collisions induced by the access pattern (ka) with the number of remaining collisions that occur when applying the mapping algorithm described above (kr). The factor f describes the ratio between the number of remaining collisions for the random mapping (which is ka/n) and the mapping that is determined by the algorithm (kr).
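As a worked check (our arithmetic, using only the numbers already reported): a random mapping onto n disks is expected to leave about ka/n collisions, so f = (ka/n)/kr. For 11/23/97 with n = 8 this gives kr/ka = 3152/75334 ≈ 4.18% and f = (75334/8)/3152 ≈ 2.99, which are exactly the values in the first row of Table 1 below.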
Table 1. Results for a number of access patterns, each representing one day (results for n = 8 disks)

      day        nodes  edges    ka      kr     kr/ka [%]   f
sun   11/23/97   5533   15148    75334   3152   4.18        2.99
mon   11/24/97   8877   48136   239365  12100   5.06        2.50
tue   11/25/97   7932   43720   228870  11367   4.97        2.52
wed   11/26/97   8206   41172   215825  10800   5.00        2.50
thu   11/27/97   7464   44656   231364  12021   5.20        2.40
fri   11/28/97   6976   30919   174120   8671   4.98        2.51
sat   11/29/97   5065    9059    37702   1305   3.46        3.61
sun   11/30/97   4798    8657    42544   1579   3.71        3.37
In Table 1 the statistics and results of the partitioning for one week are shown. The table shows that all but 4–5 percent of the collisions can be resolved and that the algorithm achieves about 2 to 3 times the performance of the random mapping.
4.3 Realistic Optimization
In this section we examine how many collisions can be resolved if we use the access-statistics of the past. This approach can only be successful if the collisions
of successive days have some similarity. We already examined this similarity in section 3. Table 2 presents the average results for a number of days where the mapping was determined by the access pattern of day i − 1 and this mapping was used for collision avoidance on day i. The table shows the factor fpd comparing the performance of the random mapping method with the algorithm presented above, with respect to the number of disks n. It also shows the percentage of remaining collisions kr,pd/ka [%] that could not be resolved. It is shown that the number of remaining collisions with this mapping is clearly lower than with a random mapping. The advantage of our mapping increases with the number of available disks.
Table 2. Comparison of random placement with the algorithm using the access pattern of the previous day (pd) and with the algorithm using the access pattern of the same day (sd)

n   kr,pd/ka [%]   fpd    kr,sd/ka [%]   fsd    (ka/n − kr,pd)/(ka/n − kr,sd)
2       45.2       1.11       43.1       1.16   0.70
3       27.4       1.22       24.9       1.34   0.70
4       18.7       1.34       16.3       1.54   0.72
5       13.7       1.47       11.3       1.78   0.72
6       10.5       1.61        8.2       2.05   0.73
7        8.2       1.76        6.0       2.41   0.73
8        6.7       1.87        4.7       2.73   0.74
The table also shows the loss which results from the fact that the access patterns of successive days are not identical. To see this, the corresponding percentage of remaining collisions kr,sd/ka [%] and the factor fsd are given, which could be achieved if the access pattern of each day were known in advance. The value of the term (ka/n − kr,pd)/(ka/n − kr,sd) shows the relative performance difference
of the random mapping and the data layout determined by the algorithm presented above, for the case that the algorithm knows the exact access pattern or only knows the access pattern of the day before. The results show that the performance of our method decreases by about 30 percent if the data layout is computed on the basis of the access pattern from day i − 1 instead of day i and the access pattern of day i is applied. This loss is nearly independent of the number of disks, becoming only slightly smaller for larger numbers of disks. In general the results show that a data layout based on the access pattern of a previous day leads to a large reduction of the collisions later on.
4.4 Using a Number of Access Patterns to Determine the Data Layout
Up to now we only used the access pattern of one day to compute the data layout. Here we will present results for the case that a number of access patterns
Table 3. Results of using larger access patterns for determination of data layout

         2 partitions            8 partitions
days     avg    max    min       avg   max   min
1        44.65  44.99  43.95     6.58  7.08  5.96
2        44.49  44.92  43.60     6.40  6.90  5.80
3        44.35  44.91  43.26     6.33  6.66  5.71
4        44.38  44.85  43.43     6.35  6.93  5.78
5        44.48  45.21  43.47     6.28  6.83  5.63
6        44.48  45.15  43.66     6.26  6.85  5.64
7        44.48  45.24  43.51     6.30  6.88  5.53
8        44.42  44.94  43.73     6.26  6.88  5.68
9        44.46  45.37  43.63     6.24  6.86  5.58
10       44.39  45.22  43.32     6.25  6.71  5.58
11       44.46  45.45  43.36     6.27  6.80  5.63
12       44.38  45.07  43.40     6.28  6.75  5.64
13       44.40  45.30  43.47     6.33  6.83  5.73
14       44.37  44.91  43.43     6.29  7.00  5.67
were used to determine this layout. All collisions from these patterns were taken into account to determine the data layout. The results in Table 3 were obtained by taking the m days preceding a fixed date i and determining the data layout using all the collisions that occurred in these m previous days. The experiments were made for different start dates i, so that average, maximum and minimum could be studied. The table presents the percentage of the collisions that could not be resolved for the access pattern of date i using the different data layouts. It is shown that the best results were achieved with a data layout that is determined using only a few days of access patterns. A larger number of access patterns does not increase the overall performance of the data layout method.
4.5 Update Frequency for Data Layout
Using the results of the previous sections we know that the data layout method presented here provides reasonable performance improvements over randomization strategies when it considers the access pattern. The results also show that a typical access pattern used to determine the data layout should contain the access information of only a few days; no better results could be achieved with more information. The question is now: if the data layout is determined on a day i taking the access patterns of days i − 1, i − 2, ..., i − x into account (for small x), how will the performance be on a day i + d, i.e. d days in the future? This will answer the question of how often the data layout has to be updated to achieve the best results. The following experiment was done to answer this question: a data layout was determined taking the access pattern of day i into account. Then the access
Fig. 2. Influence of time since last update of data layout on performance. (x-axis: days after last partitioning, 0–20; y-axis: rate of rest-collisions [%], roughly 5.5–10; curves: avg, max, min.)
patterns of the days i + d were applied to this data layout and the number of collisions that had not been resolved was measured. The results for different values of d (x-axis) are shown in Figure 2. As the results were obtained for different starting days i, average values, maximum and minimum could be determined for every d. The results show that the number of unresolved collisions is steadily increasing; thus the data layout should be updated as often as possible (taking the latest access pattern into account) to achieve the best results.
5 Conclusions
This paper presents a new strategy for allocating data items onto the storage devices of a parallel or sequential web server that contains a number of disks to store this data. The method is based on the observation that requests that often arise in parallel on the web server should be forwarded to different storage devices. Using this observation a graph-theoretic formulation is developed and the problem is reduced to a MAX-CUT problem that is solved using heuristics. The results show that the performance improvements of the strategy are considerable compared to the randomization strategies that are usually used, and that the performance improvement scales up with the number of disks used in the parallel server. The various investigations presented in the paper show that it makes no sense to observe the access to the parallel web server for a long time in order to determine a typical access pattern as the basis for the data layout; they also show that the data layout has to be updated from time to time, as for a fixed data layout the number of resolved collisions decreases steadily. Future work will focus on the investigation of an extended method that determines the number of duplicates for data items, taking different restrictions (in terms of disk capacity or number of copies) into account.
References
1. M. Adler, S. Chakrabarti, M. Mitzenmacher, L. Rasmussen: Parallel Randomized Load Balancing. Proc. 27th Annual ACM Symp. on Theory of Computing (STOC '95), 1995.
2. Virgilio Almeida, Azer Bestavros, Mark Crovella and Adriana de Oliveira: Characterizing Reference Locality in the WWW. Proceedings of PDIS'96: The IEEE Conference on Parallel and Distributed Information Systems, December 1996.
3. Martin Arlitt and Carey Williamson: Web server workload characterization: The search for invariants. Proceedings of ACM SIGMETRICS'96, 1996.
4. D. Andresen, T. Yang, V. Holmedahl and O. Ibarra: SWEB: Towards a Scalable WWW Server on MultiComputers. Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), April 1996.
5. M. Colajanni and P.S. Yu: Adaptive TTL schemes for Load Balancing of Distributed Web Servers. ACM SIGMETRICS Performance Evaluation Review, volume 25(2), 1997.
6. R. M. Karp: Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, Complexity of Computer Computations, pages 85–103, 1972.
7. Eric Dean Katz, Michelle Butler and Robert McGrath: A scalable HTTP server: The NCSA prototype. Computer Networks and ISDN Systems, volume 27, pages 155–164, 1994.
8. B.W. Kernighan and S. Lin: An effective heuristic procedure for partitioning graphs. The Bell Systems Technical Journal, pages 192–308, Feb 1970.
9. T.T. Kwan, R.E. McGrath and D.A. Reed: NCSA World Wide Web Server: Design and Performance. IEEE Computer, 28(11), pages 68–74, November 1995.
10. Y.H. Liu, P. Dantzig, C.E. Wu, J. Challenger and L.M. Ni: A distributed Web server and its performance analysis on multiple platforms. Proceedings of the 16th International Conference on Distributed Computing Systems, IEEE, 1996.
11. D.R. Liu and S. Shekhar: Partitioning Similarity Graphs: A Framework for Declustering Problems. Information Systems, volume 21(6), pages 475–496, 1996.
12. R. Preis and R. Diekmann: The PARTY Partitioning Library – User Guide. Technical Report tr-rsfb-96-024, University of Paderborn.
13. Benjamin W. Wah: File placement on distributed computer systems. Computer Magazine of the Computer Group News of the IEEE Computer Group Society, 17(1), January 1984.
ViPIOS: The Vienna Parallel Input/Output System
Erich Schikuta, Thomas Fuerle, and Helmut Wanek
Institute for Applied Computer Science and Information Systems, Department of Data Engineering, University of Vienna, Rathausstr. 19/4, A-1010 Vienna, Austria
[email protected]
Abstract. In this paper we present the Vienna Parallel Input Output System (ViPIOS), a novel approach to enhance the I/O performance of high performance applications. It is a client-server based tool combining capabilities found in parallel I/O runtime libraries and parallel file systems.
1 Introduction
In the last few years the applications in high performance computing (Grand Challenges [1]) shifted from being CPU-bound to being I/O-bound. Performance can no longer be scaled up by increasing the number of CPUs, but only by increasing the bandwidth of the I/O subsystem. This situation is commonly known as the I/O bottleneck in high performance computing [5]. In reaction, all leading hardware vendors of multiprocessor systems provided powerful concurrent I/O subsystems. Accordingly, researchers focused on the design of appropriate programming tools to take advantage of the available hardware resources.
1.1 The ViPIOS Approach
Conventionally two different directions in developing programming support are distinguished: Runtime libraries for high-performance languages (e.g. Passion [6]) and parallel file systems, (e.g. IBM Vesta [4]). We see a solution to the parallel I/O problem in a combination of both approaches, which results in a dedicated, smart, concurrently executing runtime system, gathering all available information of the application process both during the compilation process and the runtime execution. Initially it can provide the optimal fitting data access profile for the application and may then react to the execution behavior dynamically, allowing to reach optimal performance by aiming for maximum I/O bandwidth.
This work was carried out as part of the research project ”Language, Compiler, and Advanced Data Structure Support for Parallel I/O Operations” supported by the Austrian Science Foundation (FWF Grant P11006-MAT)
This approach led to the design and development of the Vienna Input Output System, ViPIOS ([2,3]). ViPIOS is an I/O runtime system, which provides efficient access to persistent files, by optimizing the data layout on the disks and allowing parallel read/write operations. ViPIOS is targeted as a supporting I/O module for high performance languages (e.g. HPF).
2 System Architecture
The basic idea for solving the I/O bottleneck in ViPIOS is de-coupling: the disk access operations are de-coupled from the application and performed by an independent I/O subsystem, ViPIOS. This leads to the situation that an application just sends general I/O requests to ViPIOS, which performs the actual disk accesses in turn. This idea is depicted in figure 1.
Fig. 1. Disk access de-coupling. (Coupled I/O: the application processes perform the disk accesses themselves; de-coupled I/O: the application processes AP issue requests through the ViPIOS interface VI to the ViPIOS servers VS, which perform the data and disk accesses on the disk sub-system.)
Fig. 2. ViPIOS system architecture. (Application processes, ViPIOS servers, and the disk sub-system.)
Thus ViPIOS’s system architecture is built upon a set of cooperating server processes, which accomplish the requests of the application client processes. Each application process AP is linked by the ViPIOS interface VI to the ViPIOS servers VS (see figure 2). The server processes run independently on all or a number of dedicated processing nodes on the underlying MPP. It is also possible that an application client and a server share the same processor. Generally each application process is assigned exactly one ViPIOS server (which is called the buddy server to the application), but one ViPIOS server can serve a number of application processes, i.e. there exists a one-to-many relationship between the application and the servers (see figure 3). The other ViPIOS servers are called foe server to the application.
Fig. 3. “Buddy” and “foe” servers. (The application processes connect through the ViPIOS interface; server A is ‘buddy’ to the application, server B is ‘foe’.)
Fig. 4. ViPIOS server architecture. (Interface layer with HPF, MPI-IO and proprietary ViPIOS interface modules and the message manager; kernel layer with fragmenter, directory manager and memory manager; disk manager layer with ADIO, MPI-IO and hardware-specific modules over disks D1–D4.)
2.1 ViPIOS Server
A ViPIOS server process consists of several functional units as depicted by figure 4. Basically we differentiate between 3 layers: – The Interface layer provides the connection to the ”outside world”(i.e. applications, programmers, compilers, etc.). Different interfaces are supported by interface modules to allow flexibility and extendibility. Until now we implemented an HPF interface module (aiming for the VFC, the HPF derivative of Vienna FORTRAN) a (basic) MPI-IO interface module, and the specific ViPIOS interface which is also the interface for the specialized modules. – The Kernel layer is responsible for all server specific tasks. – The Disk Manager layer provides the access to the available and supported disk sub-systems. This layer too is modularized to allow extensibility and to simplify the porting of the system. At the moment ADIO [7], MPI-IO, and Unix style file systems are supported. The ViPIOS kernel layer is built up of four cooperating functional units: – The Message manager is responsible for the external (to the applications) and internal (to other ViPIOS servers) communication. – The Fragmenter can be seen as ”ViPIOS’s brain”. It represents a smart data administration tool, which models different distribution strategies and makes decisions on the effective data layout, administration, and ViPIOS actions. – The Directory Manager stores the meta information of the data. We designed 3 different modes of operation, centralized (one dedicated ViPIOS directory server), replicated (all servers store the whole directory information), and localized (each server knows the directory information of the data it is storing only) management. Until now only localized management is implemented. – The Memory Manager is responsible for prefetching, caching and buffer management.
Requests are issued by an application via a call to one of the functions of the ViPIOS interface, which in turn translates this call into a request message which is sent to the buddy server. The local directory of the buddy server holds all the information necessary to map a client's request to the physical files on the disks. The fragmenter uses this information to decompose (fragment) a request into sub-requests which can be resolved locally and sub-requests which have to be communicated to other ViPIOS servers (foe servers). The I/O subsystem actually performs the necessary disk accesses and the transmission of data to/from the AP.
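As a rough illustration of the fragmentation step (this is not the ViPIOS interface; the block-striped layout, the stripe size and the printing are assumptions made only for the example), a client's byte-range request can be split into per-server sub-requests as follows:

#include <stdio.h>

#define STRIPE 65536L   /* assumed stripe size in bytes */

/* Decompose a read of `len` bytes starting at `offset` into sub-requests,
   one per server that holds part of the range, assuming a plain block-striped
   layout over `nservers` servers.  The real fragmenter consults the directory
   manager and supports far more general data layouts. */
void fragment_request(long offset, long len, int nservers)
{
    while (len > 0) {
        long stripe_idx   = offset / STRIPE;
        int  server       = (int)(stripe_idx % nservers);    /* owner of this stripe */
        long stripe_start = stripe_idx * STRIPE;
        long chunk        = stripe_start + STRIPE - offset;  /* bytes left in stripe */
        if (chunk > len) chunk = len;
        printf("sub-request: server %d, offset %ld, length %ld\n",
               server, offset, chunk);
        offset += chunk;
        len    -= chunk;
    }
}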
2.2 System Modes
ViPIOS can be used in three different system modes: as a runtime library, as a dependent system, or as an independent system. These modes are depicted in figure 5.
Fig. 5. ViPIOS system modes (runtime library, dependent system, independent system).
Runtime Library. Application programs can be linked with a ViPIOS runtime module, which performs all disk I/O requests of the program. In this case ViPIOS is not running on independent servers, but as part of the application. The ViPIOS interface is therefore not only calling the requested data action, but also performing it itself. This mode provides only restricted functionality due to the missing independent I/O system. Parallelism can only be expressed by the application (i.e. the programmer). Dependent System. In this case ViPIOS is running as an independent module in parallel to the application, but is started together with the application. This
is inflicted by the MPI¹-specific characteristic that cooperating processes have to be started together in the same communication world; processes of different worlds cannot communicate yet. This mode allows smart parallel data administration but requires a preceding preparation phase.
Independent System. This is the mode of choice to achieve the highest possible I/O bandwidth by exploiting all available data administration possibilities. In this case ViPIOS runs similarly to a parallel file system or a database server, waiting for applications to connect via the ViPIOS interface. This connection is realized by a proprietary communication layer bypassing MPI. We implemented two different approaches, one using PVM, the other patching MPI. A third promising approach, currently under evaluation, employs PVMPI, an emerging standard under development for coupling MPI worlds by PVM layers.
Fig. 6. ViPIOS data abstraction. (Problem layer: view pointer of the application clients; file layer: global pointer of the persistent file; data layer: local pointer on the ViPIOS servers; the layers are connected by mapping functions.)
3 Data Abstraction in ViPIOS
ViPIOS provides a data independent view of the stored data to the application processes. Three independent layers in the ViPIOS architecture can be distinguished, which are represented by file pointer types in ViPIOS.
– Problem layer. Defines the problem specific data distribution among the cooperating parallel processes (View file pointer).
– File layer. Provides a composed view of the persistently stored data in the system (Global file pointer).
– Data layer. Defines the physical data distribution among the available disks (Local file pointer).
¹ The MPI standard is the underlying message passing tool of ViPIOS to ensure portability.
Thus data independence in ViPIOS separates these layers conceptually from each other, providing mapping functions between these layers. This allows logical data independence between the problem and the file layer, and physical data independence between the file and data layer. This concept is depicted in figure 6 showing a cyclic data distribution.
4 Conclusions and Future Work
In this paper we presented the Vienna Parallel Input Output System (ViPIOS), a novel approach to parallel I/O based on a client server concept which combines the advantages of existing parallel file systems and parallel I/O libraries. We described the underlying design principles of our approach and gave an in-depth presentation of the developed system.
References
1. A Report by the Committee on Physical, Math., and Eng. Sciences Federal Coordinating Council for Science, Eng. and Technology: High-Performance Computing and Communications, Grand Challenges 1993 Report, pages 41–64. Washington D.C., October 1993.
2. Peter Brezany, Thomas A. Mueck, and Erich Schikuta: Language, compiler and parallel database support for I/O intensive applications. In Proceedings of the International Conference on High Performance Computing and Networking, volume 919 of Lecture Notes in Computer Science, pages 14–20, Milan, Italy, May 1995. Springer-Verlag. Also available as Technical Report TR95-8, Inst. f. Software Technology and Parallel Systems, University of Vienna, 1995.
3. Peter Brezany, Thomas A. Mueck, and Erich Schikuta: A software architecture for massively parallel input-output. In Third International Workshop PARA'96 (Applied Parallel Computing – Industrial Computation and Optimization), volume 1186 of Lecture Notes in Computer Science, pages 85–96, Lyngby, Denmark, August 1996. Springer-Verlag. Also available as Technical Report TR 96202, Inst. f. Angewandte Informatik u. Informationssysteme, University of Vienna.
4. Peter F. Corbett and Dror G. Feitelson: The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225–264, August 1996.
5. Juan Miguel del Rosario and Alok Choudhary: High performance I/O for parallel computers: Problems and prospects. IEEE Computer, 27(3):59–68, March 1994.
6. Rajeev Thakur, Alok Choudhary, Rajesh Bordawekar, Sachin More, and Sivaramakrishna Kuditipudi: Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70–78, June 1996.
7. Rajeev Thakur, William Gropp, and Ewing Lusk: An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 180–187, October 1996.
A Performance Study of Two-Phase I/O
Phillip M. Dickens¹ and Rajeev Thakur²
¹ Department of Computer Science, Illinois Institute of Technology
² Mathematics and Computer Science Division, Argonne National Laboratory
Abstract. Massively parallel computers are increasingly being used to solve large, I/O intensive applications in many different fields. For such applications, the I/O subsystem represents a significant obstacle in the way of achieving good performance. While massively parallel architectures do, in general, provide parallel I/O hardware, this alone is not sufficient to guarantee good performance. The problem is that in many applications each processor initiates many small I/O requests rather than fewer larger requests, resulting in significant performance penalties due to the high latency associated with I/O. However, it is often the case that in the aggregate the I/O requests are significantly fewer and larger. Two-phase I/O is a technique that captures and exploits this aggregate information to recombine I/O requests such that fewer and larger requests are generated, reducing latency and improving performance. In this paper, we describe our efforts to obtain high performance using twophase I/O. In particular, we describe our first implementation which produced a sustained bandwidth of 78 MBytes per second, and discuss the steps taken to increase this bandwidth to 420 MBytes per second.
1 Introduction
Massively parallel computers are increasingly used to solve large, I/O-intensive applications in several different disciplines. However, in many such applications the I/O subsystem performs poorly, and represents a significant obstacle to achieving good performance. The problem is generally not with the hardware; many parallel I/O subsystems offer excellent performance. Rather, the problem arises from other factors, primarily the I/O patterns exhibited by many parallel scientific applications [1,5]. In particular, each processor tends to make a large number of small I/O requests, incurring the high cost of I/O on each such request. The technique of collective I/O has been developed to better utilize the parallel I/O subsystem [2,7,8]. In this approach, the processors exchange information about their individual I/O requests to develop a picture of the aggregate I/O request. Based on this global knowledge, I/O requests are combined and submitted in their proper order, making a much more efficient use of the I/O subsystem.
Two significant implementation techniques for collective I/O are two-phase I/O [2,7] and disk-directed I/O [4,6]. In the two-phase approach, the application processors collectively determine and carry out the optimized approach. In this paper, we deal only with the two-phase approach. Consider a collective read operation. If the data is distributed across the processors in a way that conforms to the way it is stored on disk, each processor can read its local array in one large I/O request. This distribution is termed the conforming distribution, and represents the optimal I/O performance. Assume the array is not distributed across the processors in a conforming manner. The processors can still perform the read operation assuming the conforming distribution, and then use interprocessor communication to redistribute the data to the desired distribution. Since interprocessor communication is orders of magnitude faster than I/O calls, it is possible to obtain performance that approaches that of the conforming distribution. The question then arises as to what is the best approach to implement the two-phase I/O algorithm. While much has been published regarding the performance gains using this technique, relatively little has been written about specific implementation issues, and how these issues affect performance. In this paper, we focus on the implementation issues that arise when implementing a two-phase I/O algorithm. We start by describing a very simple implementation, and go through a sequence of steps where we explore different optimizations to this basic implementation. At each step, we discuss the modification to the algorithm and discuss its impact on performance.
2 Experimental Design
We performed all experiments on the Intel Paragon located at the California Institute of Technology. This Paragon has 381 compute nodes and an I/O subsystem with 64 SCSI I/O processors, each of which controls a 4GB seagate drive. We define the maximum I/O performance as each processor writing its portion of the file assuming the conforming distribution. We define the naive approach as each processor performing its own independent file access without any global strategy, i.e. no collective I/O. We assume an SPMD computation, where each processor operates on the portion of the global array that is located in its local memory. We study the costs of writing a two-dimensional array to disk for various two-phase I/O implementations. Our metric of interest is the percentage of the maximum bandwidth achieved by each approach. The application does nothing except make repeated calls to the two-phase I/O routine. Our experiments involved a two dimensional array of integers with dimensions 4096 X 4096 (for a total file size of 64 megabytes). We used MPI for all communications. To conserve space, we present one graph which maps the performance of each of the various approaches.
3 Experimental Results
The initial implementation of the two-phase I/O algorithm is quite simple. First, the processors exchange information related to their individual I/O requirements to determine the collective I/O requirement. Next, each processor goes through a series of sending and receiving messages to redistribute the data into the conforming distribution. When a processor receives a portion of its data it performs a simple byte-by-byte copy into its write buffer. When a processor sends a portion of its data it performs a byte-by-byte copy from its local array into the send buffer. After all data has been exchanged, each processor performs its write operation in one large request. This initial implementation uses both blocking sends and blocking receives. The cost associated with the naive implementation (no collective I/O) is the latency incurred from issuing many small I/O requests. There are four primary costs associated with the simple implementation of two-phase I/O. First is the extra buffer space required for the write buffer and the communication buffers. Second is the cost of copying data into the communication buffers, copying data from the communication buffers to the write buffer, and copying data from the local array into the write buffer. Third is interprocessor communication, and fourth is the actual writing of the data to the disk. The results for the non-collective I/O implementation are depicted in the curve labeled the naive approach in Figure 1. The initial implementation of the two-phase I/O algorithm is depicted in the curve labeled Step1. As noted, the graph depicts the percentage of the maximum bandwidth achieved by each approach for 16, 64 and 256 processors. With 16 processors, each processor must allocate a four megabyte write buffer and the messages passed between the processors are one megabyte. With these values, the time to perform the many small I/O operations is virtually the same as the extra costs associated with the two-phase approach. When we move to 64 processors, however, each processor allocates a much smaller write buffer (one megabyte), and the messages between the processors are much smaller (131072 bytes). Thus its relative performance is improved. With the naive approach, however, the additional processors are all issuing many small I/O requests, resulting in significant contention for the IOPs and the communication network. With 64 processors, this very simple implementation of the two-phase I/O algorithm improves performance by a factor of 3. With 256 processors, it improves performance by a factor of 40.
3.1 Step2: Reducing Copy Costs
As noted above, two-phase I/O requires a significant amount of copying. For this reason, we would expect modifications that reduced the cost of copying data would have a significant impact on performance. The next step then is to change all of the copy routines to use the memcpy() library call whenever possible. This optimization, labeled as Step2 in Figure 1, has a tremendous impact on performance. As can be seen, with the optimized copying routine this implementation of the two-phase I/O algorithm outperforms
both the naive approach and the initial implementation. The improvement in performance over the initial implementation varies between a factor of 1.7 and a factor of 15. The reason for such a tremendous difference is that memcpy() uses a block move instruction, and requires only four assembly language instructions per block of memory. Performing a byte-for-byte copy requires 20 assembly language instructions per byte.
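The contrast is simply the inner copy loop. The fragment below is our illustration of the two variants, not the authors' source code:

#include <string.h>
#include <stddef.h>

/* Step1-style copy: one load/store (plus loop overhead) per byte. */
void copy_bytewise(char *dst, const char *src, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++)
        dst[k] = src[k];
}

/* Step2-style copy: memcpy() lets the library use block-move instructions,
   a handful of instructions per block of memory rather than per byte. */
void copy_block(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
}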
3.2 Step3: Asynchronous Communication
The next step we investigated was performing asynchronous rather than synchronous communication. In this implementation, a processor first posts all of its sends (using MPI_Isend) and then posts all of its receives (using MPI_Irecv). After posting its communications, the processor copies any of its own data into the write buffer. The processor then waits for all of the messages it needs to receive, and copies the data into the write buffer as they arrive. It then waits for all of the send operations to complete and performs the write. The trade-offs in this implementation are quite interesting. In previous versions of the algorithm, a processor would alternate between sending a message and waiting to receive a message, blocking until a specific message arrives before posting its next send. The advantage is that the processor frees the send and receive buffers immediately, thus maintaining only one communication buffer at any time. With reduced buffering requirements the cost of paging is decreased. There are two advantages to using asynchronous communication. First, the underlying system can complete any message that is ready without having to wait for a particular message. Second, the processor can perform the copying between its local array and its write buffer with the asynchronous communication going on in the background. The primary drawback of this approach is that each asynchronous request requires its own communication buffer, so the cost of buffering is considerably higher, which reduces performance. The results are shown in the curve labeled Step3 in Figure 1. With 16 processors, the asynchronous approach further improves performance by approximately 20%. With 64 processors, however, this improvement over the previous implementation is reduced to approximately 5%. There is no noticeable improvement with 256 processors. The reason for the declining benefit with 64 and 256 processors is that, as the number of processors increases, the time required to perform the actual write to disk begins to dominate the costs of the algorithm. Thus using asynchronous communication is less important than it is with a smaller number of processors.
3.3 Step4: Reversing the Order of the Asynchronous Communication
The next approach is to reverse the order of the asynchronous sends and receives. Thus a processor first posts all of its asynchronous receives and then posts its asynchronous sends. The idea behind this optimization is that MPI communications are generally much faster if the receive has been posted (and thus a buffer
in which to receive the message has been provided) before the message is sent [3]. Thus pre-allocating all of the receive buffers before the corresponding sends are initiated should reduce the interprocessor communication costs. There is, of course, the same issue discussed above: pre-allocating all of the communication buffers can result in increased paging activity. The results are given in the curve labeled Step4 in Figure 1. With 16 processors, reversing the order of sends and receives provides a 15% improvement over the previous approach. Again this improvement in performance decreases as the number of processors is increased. With 64 processors there is an improvement of approximately 5%, and the improvement disappears with 256 processors. This is again due to the fact that the write time begins to dominate the cost of the algorithm as the number of processors increases.
3.4 Step5: Combining Synchronous with Asynchronous Communication
The final optimization we pursued was to combine asynchronous receives with synchronous sends. The idea behind this optimization is that posting all of the receive buffers will improve communication performance, while releasing each send buffer after use will reduce the buffering costs. The results are shown in the curve labeled Step5. This implementation results in performance gains of up to 10%.
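A minimal sketch of this final exchange-and-write phase is given below. It is our reconstruction from the description above, not the code measured in the paper; the buffer layout, the offset and length arrays, and the file handling are assumptions made for illustration only.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Step5 exchange phase: asynchronous receives are posted first, remote pieces
   are then pushed with blocking sends (so each send buffer is released as soon
   as the call returns), the local contribution is copied while messages are in
   flight, and the conforming block is written with a single large request. */
void two_phase_write(int rank, int nprocs,
                     char *write_buf,        /* conforming block of this rank      */
                     char **send_buf,        /* send_buf[j]: data owed to rank j   */
                     const int *send_len,
                     const long *recv_off,   /* where rank j's piece lands locally */
                     const int *recv_len,
                     int fd, long file_offset)
{
    MPI_Request *req = malloc((size_t)(nprocs - 1) * sizeof *req);
    MPI_Status  *st  = malloc((size_t)(nprocs - 1) * sizeof *st);
    long total = 0;
    int j, r = 0;

    for (j = 0; j < nprocs; j++)
        total += recv_len[j];

    /* 1. Post all asynchronous receives first (the Step4/Step5 ordering). */
    for (j = 0; j < nprocs; j++)
        if (j != rank)
            MPI_Irecv(write_buf + recv_off[j], recv_len[j], MPI_BYTE,
                      j, 0, MPI_COMM_WORLD, &req[r++]);

    /* 2. Blocking sends: each send buffer is free for reuse on return,
          keeping the buffering requirement low. */
    for (j = 0; j < nprocs; j++)
        if (j != rank)
            MPI_Send(send_buf[j], send_len[j], MPI_BYTE, j, 0, MPI_COMM_WORLD);

    /* 3. Copy this processor's own contribution while remote data arrives. */
    memcpy(write_buf + recv_off[rank], send_buf[rank], (size_t)send_len[rank]);

    /* 4. Wait for all remote pieces, then issue one large write request. */
    MPI_Waitall(nprocs - 1, req, st);
    lseek(fd, file_offset, SEEK_SET);
    write(fd, write_buf, (size_t)total);

    free(req);
    free(st);
}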
Fig. 1. Improvement in performance as a function of the implementation and the number of processors. (4096 × 4096 array of integers; x-axis: number of processors, 0–300; y-axis: percentage of maximum bandwidth, 0–1.0; curves: Naive, Step1, Step2, Step3, Step4, Step5.)
3.5 Scalability of Two-Phase I/O
Fig. 2. The bandwidth achieved by each approach as the file size is increased and the number of processors is a constant 256. (Various file sizes with 256 processors; x-axis: total file size in megabytes, 0–1500; y-axis: bandwidth in megabytes per second, 0–500; curves: Naive, Step1, Step2, Step3, Step4, Step5, Max.)
For completeness, we examine the behavior of these two-phase I/O implementations as the amount of data being written increases and the number of processors remains a constant 256. We looked at total file sizes of 16 megabytes, 64 megabytes, 256 megabytes and 1 gigabyte. The results are shown in Figure 2. In this figure, the bandwidth achieved by each approach is given as a function of the total file size. For comparison, the maximum bandwidth is also shown. As can be seen, neither the naive approach nor the first two-phase implementation scales well. It is interesting to note that Step3, where all communication is asynchronous and the sends are posted before the receives, performs more poorly in the limit than does Step2, where all communication is synchronous. Also, in the limit there is a small improvement in performance between Step2 and Step4, and both approaches appear to scale relatively well. The combination of asynchronous receives and synchronous sends scales very well and, in the limit, approaches the optimal performance. It is important to note that in the limit the initial implementation of two-phase I/O resulted in a bandwidth of 78 megabytes per second. The final implementation resulted in a bandwidth of 420 megabytes per second.
4 Conclusions
In this paper, we investigated the impact on performance of various implementation techniques for two-phase I/O, and outlined the steps we followed to obtain high performance. We began with a very simple implementation of two-phase I/O that provided a bandwidth of 78 megabytes per second, and ended with an optimized implementation that provided 420 megabytes per second. During the course of this analysis, we provided a good look at the trade-offs involved in the implementation of a two-phase I/O algorithm. Current research is aimed at extending these results to other parallel architectures such as the IBM SP2.
References
1. Crandall, P., Aydt, R., Chien, A. and D. Reed: Input-Output Characteristics of Scalable Parallel Applications. In Proceedings of Supercomputing '95, ACM Press, December 1995.
2. DelRosario, J., Bordawekar, R. and Alok Choudhary: Improved parallel I/O via a two-phase run-time access strategy. In Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 56–70, Newport Beach, CA, 1993.
3. Gropp, W., Lusk, E. and A. Skjellum: Using MPI. Portable Parallel Programming with the Message-Passing Interface. The MIT Press, Cambridge, Massachusetts, 1996.
4. Kotz, D.: Disk-directed I/O for MIMD multiprocessors. ACM Transactions on Computer Systems, 15(1):41–74, February 1997.
5. Kotz, D. and N. Nieuwejaar: Dynamic file-access characteristics of a production parallel scientific workload. In Supercomputing '94, pages 640–649, November 1994.
6. Kotz, D.: Expanding the potential for disk-directed I/O. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 490–495, IEEE Computer Society Press.
7. Thakur, R. and A. Choudhary: An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays. Scientific Programming, 5(4):301–317, Winter 1996.
8. Thakur, R., Choudhary, A., More, S. and S. Kuditipudi: Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70–78, June 1996.
Workshop 13+14: Architectures and Networks
Kieran Herley and David Snelling (Co-chairmen)
This workshop embraces two broad themes: routing and communication networks and parallel computer architecture. The first theme is devoted to communication in parallel computers and covers all aspects of the theory and practice of the design, analysis and evaluation of interconnection networks and their routing algorithms. The continued vitality of this area is reflected by the strength and quality of the papers presented in Sessions 1 and 3 (Routing Networks and Algorithms I and II) of this workshop. Although recent years have seen computer architecture undergoing a period of contraction and a tendency toward uniformity in the form of shared memory systems composed of super scalar processors, there is nonetheless no lack of innovation in the field. The computer architecture theme, represented by the papers of Sessions 2 and 4 (Computer Architecture I and II), highlights not only innovation within the converging area of super scalar systems, but the continued presence of novel ideas as well. A total of 32 submissions were received (19 on the theme of routing and communication networks and 13 on the computer architecture theme) of which a total of 15 were accepted by the workshop committee for presentation at the conference. Session 1 is devoted to the issue of routing algorithms for unstructured or loosely structured communication on various networks. The paper by S. R. Donaldson, J. M. D. Hill and D. B. Skillicorn entitled “Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP” addresses the problem of supporting the well-known BSP model on networks of workstations running TCP/IP. Techniques that offer high throughput with low variance are presented and evaluated experimentally with encouraging results. N. Agrawal and C. P. Ravikumar consider novel techniques for efficient wormhole routing where deadlock is handled not by avoidance but by detection and recovery in their paper “Adaptive Routing Based on Deadlock Recovery”. An adaptive routing technique is presented that ensures that deadlocks are infrequent, and two schemes for deadlock recovery are described and evaluated. M. Ould-Khaoua’s paper “On the Optimal Topology for Multicomputers with Fully-Adaptive Wormhole Routing: Torus or Hypercube?” revisits the hypercube versus torus comparison for wormhole routing networks. Whereas previous studies, generally based on nonadaptive routing, indicated the superiority of torus topologies, the paper suggests that under certain conditions the hypercube may offer superior performance when adaptive techniques are employed. The fundamental h-relation routing problem is the focus of the paper “Constant Thinning Protocol for Routing in Complete Networks” by A. Kautonen, V. Leppänen
and M. Penttonen. They present and evaluate a randomized algorithm for this problem on the OCPC model with encouraging results. T. Grün and M. A. Hillebrand’s paper, “NAS Integer Sort on Multi-Threaded Shared Memory Machines”, proposes novel techniques for improving performance in the already novel environment of multi-threaded computer architecture. They use both the commercial Tera MTA and the experimental SB-PRAM machines as platforms for these techniques. In “Analysing a Multistreamed Superscalar Speculative Instruction Fetch Mechanism”, R. R. dos Santos and P. O. A. Navaux address the possibility of a more conservative step toward multithreading within the more traditional super scalar environment. Their primary target is the performance loss due to instruction fetch latency resulting from mispredicted branches. Not surprisingly we find detailed consideration of implementation costs critical in the design of special purpose processor arrays. In “Design of Processor Arrays for Real-time Applications”, D. Fimmel and R. Merker study various algorithms for optimizing the layout of uniform processes on processor arrays. Session 3 is largely devoted to routing algorithms for relatively structured communication such as gossiping or all-to-all scattering. The paper “Interval Routing and Layered Cross Product: Compact Routing Schemes for Butterflies, Mesh of Trees and Fat Trees” by T. Calamoneri and M. Di Ianni shows how shortest-path routing information may be maintained in a compact (space-efficient) way for a class of interconnections that includes the butterfly, the mesh of trees and certain fat-trees. The results rely on an elegant exploitation of interval routing techniques. The gossiping problem involves simultaneously broadcasting a set of packets where each node is the origin of a distinct packet and must receive a copy of the packet that originates at every other node. This problem is the subject of both the paper “Gossiping Large Packets on Full-Port Tori” by U. Meyer and J. F. Sibeyn and “Time-Optimal Gossip in Noncombining 2-D Tori with Constant Buffers” by M. Šoch and P. Tvrdík. The former presents a near-optimal algorithm for two- and higher-dimensional tori that is time-independent (each node repeats the same simple routing action for the duration of the algorithm’s execution), whereas the latter presents a time-optimal algorithm for two-dimensional tori that requires only a constant amount of buffer-space per node. The paper “Divide-and-Conquer Algorithms on Two-Dimensional Meshes” by M. Valero-García, A. González, L. Díaz de Cerio and D. Royo explores techniques for supporting divide-and-conquer computations on the two-dimensional mesh. The approach provides efficient solutions to this problem under both the wormhole and store-and-forward routing regimes. The all-to-all-scatter problem involves routing a set of distinct packets, one for each source-destination node-pair, where each packet is to be routed from its source to its destination. The paper “Average Distance and All-to-All Scatter in Kautz Networks” by P. Salinger and P. Tvrdík describes an asymptotically optimal algorithm for this problem for the Kautz network. Modern ccNUMA systems represent the merger of easy to build distributed memory systems and easy to program SMP systems. In their paper, “Reac-
tive Proxies: a Flexible Protocol Extension to Reduce ccNUMA Node Controller Contention”, S. A. M. Talbot and P. H. J. Kelly propose and evaluate a technique for distributing the contention that arises naturally in ccNUMA systems when processes share a data structure. The approach is novel in that the degree of distribution is driven by the degree of contention and that well behaved applications do not suffer because of the protocol. The flip side of performance is always reliability. T. Skeie presents techniques for “Handling Multiple Faults in Wormhole Mesh Networks”. The infrastructure for large distributed systems needs to provide both performance and reliability. The presence of multiple paths between nodes on a mesh provides both. In this case, fault tolerance is achieved without dramatic losses in performance. In another instance of novel approaches in modern computer architecture, “Shared Control Supporting Control Parallelism using a SIMD-like Architecture”, N. Abu-Ghazaleh and P. Wilsey provide mechanisms for supporting control parallelism in SIMD systems. The aim is to provide flexibility in conditional and similar computation that does not require global redundant computation necessary in traditional SIMD systems. The members of the workshop committees wish to extend their sincere thanks to those who acted as reviewers during the selection process for their invaluable help during that period and to all of those who by their submissions have contributed to the success of this workshop.
Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP
Stephen R. Donaldson¹, Jonathan M.D. Hill¹, and David B. Skillicorn²
¹ Oxford University Computing Laboratory, UK
² CISC, Queen's University, Canada
Abstract. The BSP cost model measures the cost of communication using a single architectural parameter, g, which measures permeability of the network to continuous traffic. Architectures, typically networks of workstations, pose particular problems for high-performance communication because it is hard to achieve high throughput, and even harder to do so predictably. Yet both of these are required for BSP to be effective. We present a technique for controlling applied communication load that achieves both. Traffic is presented to the communication network at a rate chosen to maximise throughput and minimise its variance. Performance improvements as large as a factor of two over MPI can be achieved.
1 Introduction
The BSP (Bulk Synchronous Parallel) model [10,8] views a parallel machine as a set of processor-memory pairs, with a global communication network and a mechanism for synchronising all processors. A BSP calculation consists of a sequence of supersteps. Each superstep involves all of the processors and consists of three phases: (1) processor-memory pairs perform a number of computations on data held locally at the start of a superstep; (2) processors communicate data into other processor’s memories; and (3) all processors synchronise. The BSP cost model treats communication as an aggregate operation of the entire executing architecture, and models the cost of delivery using a single architectural parameter, the permeability, g. This parameter can be intuitively understood as defining the time taken for a processor to communicate a single word to a remote processor, in the steady state where all processors are simultaneously communicating. The value of the g parameter will depend upon: (1) the bisection bandwidth of the communication network topology; (2) the protocols used to interface with and within the communication network; (3) buffer management by both the processors and the communication network; and (4) the routing strategy used in the communication network. However, the BSP runtime system also makes an important contribution to the performance by acting to improve the effective value of g by the way it uses the architectural facilities. For example, [6] shows how orders of magnitude
improvements in g can be obtained, for architectures using point-to-point connections, by packing messages before transmission, and by altering the order of transmission to avoid contention at receivers. In this paper, we address the problem raised by shared-media networks and protocols such as TCP/IP, where there is far greater potential to waste bandwidth. For example, if two processors try to send more or less simultaneously, collision in the ether means that neither succeeds, and transmission capacity is permanently lost. The problem is compounded because it is hard for each processor to learn anything of the global state of the network. Nevertheless, as we shall show, significant performance improvements are possible. We describe techniques, specific to our implementation of BSPlib [5], that ensure that the variation in g is minimised for programs running over bus-based Ethernet networks. Compared to alternative communications libraries such as Argonne’s implementation of MPI [2], these techniques have an absolute performance improvement over MPI in terms of the mean communication throughput, but also have a considerably smaller standard deviation. Good performance over such networks is of practical importance because networks of workstations are increasingly used as practical parallel computers.
2 Minimising g in Bus-based Ethernet Networks
Ethernet (IEEE 802.3) is a bus-based protocol in which the media access protocol, 1-persistent CSMA/CD (Carrier Sense Multiple Access with Collision Detection), proceeds as follows. A station wishing to send a frame listens to the medium for transmission activity by another station. If no activity is sensed, the station begins transmission and continues to listen on the channel for a collision. After twice the propagation delay, 2τ, of the medium, no collision can occur, as all stations sensing the medium would detect that it is in use and will not send data. However, a collision may occur during the 2τ window. On detection, the transmitting station broadcasts a jamming signal onto the network to ensure that all stations are notified of the collision. The station recovers from a collision by using a binary exponential back-off algorithm that re-attempts the transmission after t × 2τ, where t is a random variable chosen uniformly from the interval [0, 2^k] (where k is the number of collisions this attempted transmission has experienced). For Ethernet, the protocol allows k to reach ten, and then allows another six attempts at k = 10 (see, for example, King [7]). Analysis of this protocol (Tasaka [9]) shows that S → 0 as G → ∞ (where S is the rate of successful transmissions and G the rate at which messages are presented for delivery), whereas for a p-processor BSP computer one would expect that S → B as G → ∞, from which one could conclude that g = p/B, where B is a measure of the bandwidth. In the case of BSP computation over Ethernet, the effect of the exponential backoff is exaggerated (larger delays for the same amount of traffic) because the access to the medium is often synchronised by the previous barrier synchronisation and the subsequent computation phase.
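As a rough sketch of the back-off rule just described, the snippet below draws the retransmission delay after a given number of collisions; the cap at k = 10 and the limit of sixteen attempts follow the description above, while the default value of 2τ (in microseconds) and the function name are illustrative assumptions only.

import random

def backoff_delay(collisions, two_tau_us=51.3):
    """Delay (in microseconds) before the next retransmission attempt under
    binary exponential back-off: t * 2*tau with t drawn uniformly from
    [0, 2^k], where k is capped at ten and at most 16 attempts are made."""
    if collisions > 16:
        raise RuntimeError("transmission abandoned after 16 attempts")
    k = min(collisions, 10)
    t = random.uniform(0, 2 ** k)
    return t * two_tau_us

if __name__ == "__main__":
    for k in (1, 4, 10, 16):
        print(k, round(backoff_delay(k), 1), "us")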
Fig. 1. Schematic of aggregated protocol layers and associated applied loads
For perfectly-balanced computations and barrier synchronisations, all processors attempt to get their first message onto the Ethernet at the same time. All fail and back off. In the first phase of the exponential back-off algorithm, each of the p processors chooses a uniformly-distributed wait period in the interval [0, 4τ]. Thus the expected number of processors attempting to retransmit in the interval [0, 2τ] is p/2, making secondary collisions very likely. If the processors are not perfectly balanced, and a processor gains access to the medium after a short contention period, then that processor will hold the medium for the transmission of the packet, which will take just over 1000µs for 10Mbps Ethernet. With high probability, many of the other processors will be synchronised by this successful transmission due to the 1-persistence of this protocol. The remaining processors will then contend as in the perfectly-balanced scenario. In terms of the performance model, this corresponds to a high applied load, G, albeit for a short interval of time. If S were (at least) linear in G then this burstiness of the applied load would not be detrimental to the throughput and would average out.
Fig. 2. Plot of applied load (G) against successful transmissions (S)
Fig. 3. Contention expectation for a particular slot as a function of p
Fortunately, the BSP model allows assumptions to be made at the global level based on local data presented for transmission. At the end of a superstep and before any user data is communicated, BSPlib performs a reduction, in which all processors determine the amount of communication that each processor intends to send. From this, the number of processors involved in the communication is determined. For the rest of the communication the number of processors involved and the amount of data is used to regulate (at the transport level) the rate at which data is presented for transmission on the Ethernet. By using BSD Socket options (TCP NDELAY), the data presented at the transport layer are delivered immediately to the MAC layer (ignoring the depth of the protocol stack and the availability of a suitable window size). Thus, by pacing the transport layer, pacing can be achieved at the MAC or link layer. This has the effect of removing burstiness from the applied load. Most performance analyses give a model of random-access broadcast networks which provide an analytic, often approximate, result for the successful traffic, S, in terms of the offered load, G. Hammond and O’Reilly [3] present a model for slotted 1-persistent CSMA/CD in which the successful traffic, S, (or efficiency achieved) can be determined in terms of the offered load G, the end-to-end propagation delay, τ (bounded by 25.65µs, i.e., the time taken for a signal to propagate 2500m of cable and 4 repeaters), and E, the frame transmit time (which for 10Mbps Ethernet with a maximum frame size of 1500 bytes is approximately 1200µs). Figure 2 shows the predicted rate of successful traffic against applied load, assuming that the jamming time is equal to τ. Since both S, the rate of successful transmissions, and G are normalised with respect to E, S is also the channel efficiency achieved on the cable. T, shown in Figure 1, also normalised with respect to E, is the load applied by BSPlib on the transport layer. Our objective is to pace the injection of messages into the transport layer such that T, on average, is equal to a steady-state value of S without much variance. The value of T determines the position on the S–G curve of Figure 2 in a steady state; in particular, T can be chosen to maximise S. If the applied load is to the right of the maximum throughput in Figure 2, then small increases in the mean load lead to a decrease in channel efficiency which in turn increases the backlog in terms of retries and further increases the load. Working to the right of the maximum therefore exposes the system to these instabilities which manifest themselves in variances in the communication bandwidth—a metric we try to minimise [4,1]. In contrast, when working to the left of the maximum, small increases in the applied load are accompanied by increases in the channel efficiency which helps cope with the increased load and therefore instabilities are unlikely. As an aside, the Ethernet exponential backoff handles the instabilities towards the right by rescheduling failed transmissions further and further into the future, which decreases the applied load. In BSPlib, the mechanism of pacing the transport layer is achieved by using a form of statistical time-division multiplexing that works as follows. The frame size and the number of processors involved in the communication are known. As the processors’ clocks are not necessarily synchronised, it is not possible to allow
the processors access in accordance with some permutation, a technique applied successfully in more tightly-coupled architectures [6]. Thus the processors choose a slot, q, uniformly at random in the interval [0 . . . Q − 1] (where Q is the number of processors communicating at the end of a particular superstep), and schedule their transmission for this slot. The choice of a random slot is important if the clocks are not synchronised as it ensures that the processors do not repeatedly choose a bad communication schedule. Each processor waits for time qε after the start of the cycle, where ε is a slot time, before passing another packet to the transport layer. The length of the slot, ε, is chosen based on the maximum time that the slot can occupy the physical medium, and takes into account collisions that might occur when good throughput is being achieved. The mechanism is designed to allow the medium to operate at the steady state that achieves a high throughput. Since the burstiness of communication has been smoothed by this slotting protocol, the erratic behaviour of the low-level protocol is avoided, and a high utilisation of the medium is ensured. An alternative protocol, not considered here, would be to implement a deterministic token bus protocol in which each station can only send data whilst holding a “token”. This scheme was not considered viable as it is inefficient for small amounts of traffic due to the need for the explicit token pass when the token-holding processor has no message for the “next to-go” processor. In the worst case this would double the communication time. Also, token mechanisms protect shared resources such as a single Ethernet bus; however, a network may be partitioned into several independent segments, or processors may be connected via a switch. In this case, the token bus protocol would ensure only a single processor has access to the medium at any time, therefore wasting bandwidth. In contrast, the parameters used in the slotting mechanism can be trivially adjusted to take advantage of a switch-based medium. For example, for a full-duplex cross-bar switch, Q can be assumed to be 1 (Q = 2 for half-duplex), and ε encapsulates the rate at which the switch and protocol stacks of sender and receiver can absorb messages in a steady state. If the back-plane capacity of the switch is less than the capacity of the sum of the links, then Q and ε can be adjusted accordingly. Therefore, the randomised slotting mechanism is superior to a deterministic token bus scheme, as Q and ε can be used to model a large variety of LAN interconnects.
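A minimal sketch of this slotting discipline, assuming a caller-supplied send_packet function and a microsecond clock; the helper name and the busy-wait timing are illustrative, not BSPlib internals.

import random, time

def paced_send(packets, Q, eps_us, send_packet):
    """Pace packets onto the transport layer: in each cycle of Q slots, each
    eps_us microseconds long, pick a slot q uniformly at random and hand one
    packet to the transport layer at offset q * eps_us from the cycle start."""
    for pkt in packets:
        q = random.randrange(Q)                      # slot in [0 .. Q-1]
        cycle_start = time.monotonic()
        deadline = cycle_start + (q * eps_us) / 1e6
        while time.monotonic() < deadline:           # crude busy wait
            pass
        send_packet(pkt)
        end_of_cycle = cycle_start + (Q * eps_us) / 1e6
        while time.monotonic() < end_of_cycle:       # finish out the cycle
            pass

For a full-duplex cross-bar switch the same routine applies with Q = 1, as noted above.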
3 Determining the Value of ε
In any steady state, T = S because, if this were not the case, then either unbounded capacity for the protocol stacks would be required, or the stacks would dispense packets faster than they arrive, and hence contradict the steady-state assumption. Since ε is the slot time, packets are delivered with a mean rate of 1/ε packets per µs. Normalising this with respect to the frame size E gives a value for T = E/ε packets per unit frame-time. We therefore choose an S value from the curve and infer a value of the slot size ε = E/S as S = T in a steady state. Choosing a value of S = 80% and E = 1200µs gives a slot size of 1500µs.
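The slot-size calculation above reduces to ε = E/S; the one-line check below simply reproduces the numbers quoted in the text.

E = 1200.0    # frame transmit time in microseconds (1500-byte frame at 10 Mbps)
S = 0.80      # target channel efficiency chosen from the S-G curve
eps = E / S   # slot size in microseconds
print(eps)    # 1500.0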
Fig. 4. Delivery time as a function of slot time for a cyclic shift of 25,000 words per processor, p = 2, 4, 6, 8; data for mpich shown at 1500 although slots are not used
In practice, while the maximum possible value for τ is known, the end-to-end propagation delay of the particular network segment is not, and this influences the slot size via the contention interval modelled in Figure 2. The analytic model assumes a Poisson arrival process, whereas for a finite number of stations, the arrival process is defined by independent Bernoulli trials (the limiting case of this process, as the number of processors increases, is Poisson, and approximates the finite case after the number of processors reaches about 20 [3]). More complicated topologies could also be considered where more than one segment is used. The slot size ε can be determined empirically by running trials in which the slot size is varied and its effect on throughput measured. The experiments involved a 10Mbps Ethernet networked collection of workstations. Each workstation is a 266MHz Pentium Pro processor with 64MB of memory running Solaris 2. The experiments were carried out using the TCP/IP implementation of BSPlib. The machines and network were dedicated to the experiment, although the Ethernet segment was occasionally used for other traffic as it was a subset of a teaching facility. Figure 4(a) to Figure 4(d) plot the time it takes to realise a cyclic-shift communication pattern (each processor bsp_hpputs a 25,000 word message into the memory of the processor to its right) for various slot sizes (ε ∈ [0, 2000]) and for 2, 4, 6 and 8 processors. The figures show the delivery time as a function of slot size, oversampled 10 times.
(Figure 5 panels: (a) p = 4, mean per slot-size; (b) p = 4, standard deviation per slot-size; (c) p = 8, mean per slot-size; (d) p = 8, standard deviation per slot-size.)
Fig. 5. Mean and standard deviation of delivery times of data from Figures 4(a,b)
The horizontal line towards the bottom of each graph gives the minimum possible delivery time based on bits transmitted divided by theoretical bandwidth. Results for an MPI implementation of the same algorithm running on top of the Argonne implementation of MPI (mpich) [2] are also shown on these graphs. In this case, the data is presented at a slot size of 1500 µs (even though mpich does not slot), so only one (oversampled) measurement is shown. The dotted horizontal line in the centre of these figures is the mean delivery time of the MPI implementation. The BSP slot time should be chosen to minimise the mean delivery time. Choosing a small slot time gives some good delivery times, but the scatter is large. In practice, a good choice is the smallest slot time for which the scatter is small. For p = 2 this is 1200 µs, for p = 4 it is 1450 µs, for p = 6 it is 1650 µs, and for p = 8 it is 1700 µs. Notice that these points do not necessarily provide the minimum delivery times, but they provide the best combination of small delivery times and small variance in these times. Figure 4(b) shows a particularly interesting case at ε = 1500, as both the mean transfer rate and standard deviation of the BSPlib benchmark are much smaller than those of the corresponding mpich program. This slot-size can be clearly seen in Figure 5(c) and Figure 5(d), where the scatter caused by the oversampling at each slot size in Figure 4(b) has been removed by only displaying the mean and standard deviation of the oversampling.
Fig. 6. Delivery time as a function of slot time for a cyclic shift of 8,300 words per processor, p = 4, 8; data for mpich shown at 1500 although slots are not used
In contrast, the mean and largest outlier of the mpich program in Figure 4(d) are clearly lower than those of the corresponding BSPlib program when a slot size of 1500 is used. For larger configurations, the slot size that gives the best behaviour increases and the mean value of g for BSPlib quickly becomes worse than that for mpich. An increase in the best choice of slot size from Figure 4(a) to Figure 4(d) should be expected, as the probability P(n) of n processors choosing a particular slot is binomially distributed. Thus as p increases, so does the expectation E{X ≥ 2} of the amount of contention for the slot, where

P(n) = C(p, n) (1/p)^n (1 − 1/p)^(p−n)   and   E{X ≥ 2} = Σ_{i=2}^{p} i P(i).

Figure 3 shows that for p ≈ 20 and greater, the dependence on p is minimal, and therefore the increase in slot size reaches a fixed point. Below twenty processors the dependence varies by at most 26%. The limit as p → ∞ gives E{X ≥ 2} → 1 − 1/e ≈ 0.63, as shown in the figure. The same is true of the probability of contention, but the range is very small, from 0.25 at p = 2, and as p → ∞, P{X ≥ 2} → 1 − 2/e ≈ 0.26. In the mpich implementation [2] of MPI, large communications are presented to the socket layer in a single unit. However, in the BSPlib implementation all communications are split into packets containing at most 1418 bytes, so that we can pace the submission of packets using slotting. For this benchmark, each BSPlib process sends 71 small packets in contrast to mpich’s single large message. Therefore, when p is small we would expect BSPlib to perform worse than mpich due to the extra passes through the protocol stack, and for larger values of p we would expect that the benefits of slotting out-weigh the extra passes through the protocol stack. Figures 4(a) to 4(d) show an opposite trend. As can be seen from Figures 4(b)–(d), as p increases, there is a noticeable “hump” in the data as the slot size increases. This phenomenon is not explained by the discussion above.
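Before turning to the explanation in the next paragraph, a small numerical check of the binomial expressions above; it reproduces the quoted value 0.25 at p = 2 and the limits 1 − 1/e ≈ 0.63 and 1 − 2/e ≈ 0.26.

from math import comb, exp

def slot_contention(p):
    """X = number of the p processors that choose one particular slot, each
    independently with probability 1/p; returns (E{X >= 2}, P{X >= 2})."""
    P = [comb(p, n) * (1 / p) ** n * (1 - 1 / p) ** (p - n) for n in range(p + 1)]
    e_x_ge_2 = sum(i * P[i] for i in range(2, p + 1))
    p_x_ge_2 = sum(P[i] for i in range(2, p + 1))
    return e_x_ge_2, p_x_ge_2

for p in (2, 8, 20, 100):
    print(p, slot_contention(p))
print("limits:", 1 - 1 / exp(1), 1 - 2 / exp(1))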
Fig. 7. Mean and standard deviation of delivery times of data from Figure 6(b)
The problem arises because we are modelling the communication as though it were directly accessing the Ethernet, without taking into account the TCP/IP stack. What we are observing are the TCP acknowledgement packets, which interfere with data traffic as they are not controlled by our slotting mechanism. The effect of this is to increase the optimum slot size to a value that ensures that there is enough extra bandwidth on the medium such that the extra acknowledgement packets do not negatively impact the transmission of data packets. Implementations of TCP use a delayed acknowledgement scheme where multiple packets can be acknowledged by a single acknowledgement transmission. To minimise delays, a 200ms timeout timer is set when TCP receives data [11]. If during this 200ms period data is sent in the reverse direction then the pending acknowledgement is piggy-backed onto this data packet, acknowledging all data received since the timer was set. If the timer expires, the data received up to that point is acknowledged in a packet without a payload (a 512 bit packet). In the benchmark program that determines the optimal slot size, a cyclic shift communication pattern is used. When p > 2 there is no reverse traffic during the data exchange upon which to piggy-back acknowledgements. If the entire communication takes less than 200ms then only p acknowledgement packets will be generated for each superstep; as the total time exceeds 200ms, considerably more acknowledgement packets are generated. In Figure 4(a) the communication takes approximately 200ms and a minimal number of acknowledgements are generated, as can be seen by the lack of a hump. In Figures 4(b)–(d), the size of the humps increases in line with the increased number of acknowledgements. The mpich program does not suffer as severely from this artifact as BSPlib. When slotting is not used (for example in mpich) there is potential for a rapid injection of packets onto the network by a single processor for a single destination, which means that it is likely that more packets arrive at their destination before the delayed acknowledgement timer expires. This reduces the number of acknowledgement packets. When slotting is used, packets are paced onto the network with a mean inter-packet time between the same source-destination pair of pε. This drastically
decreases the possibility of accumulated delayed acknowledgements. For example, in Figure 4(c), as the total time for communication is approximately 800ms, and as the slot size steadily increases, the number of acknowledgements increases. This in turn steadily increases the standard deviation and mean of the communication time. From the figure it can be seen that this suddenly drops off when the slot size becomes large, as the probability of collision decreases due to the under-utilisation of the network. The global nature of BSP communication means that data acknowledgement and error recovery can be provided at the superstep level as opposed to the packet-by-packet basis of TCP/IP. By moving to UDP/IP, we can implement acknowledgements and error recovery within the framework of slotting. This lower-level communication scheme is under development, although the hypothesis that it is the acknowledgements limiting the scalability of slotting can be tested by performing a benchmark on a dataset size that requires a total communication time that is less than 200ms. Figures 6(a)–(d) show the slotting benchmark for an 8333-relation where there are no obvious humps. In all configurations the mean and standard deviations of the BSPlib results are considerably smaller than those of mpich. Also, as can be seen from Figure 7, the optimal slot size at p = 8 is approximately 1200µs.
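One way to mechanise the earlier rule of thumb, "the smallest slot time for which the scatter is small", is sketched below; the relative-scatter threshold is an assumption made for illustration, not a value taken from the experiments.

from statistics import mean, stdev

def pick_slot_size(measurements, rel_scatter=0.05):
    """measurements: dict mapping slot size (us) to a list of measured delivery
    times (ms).  Return the smallest slot size whose standard deviation is at
    most rel_scatter times its mean, i.e. small delivery times, small scatter."""
    for slot in sorted(measurements):
        times = measurements[slot]
        if len(times) > 1 and stdev(times) <= rel_scatter * mean(times):
            return slot
    return max(measurements)      # fall back to the largest slot size tried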
4 Conclusions
We have addressed the ability of the BSP runtime system to improve the performance of shared-media systems using TCP/IP. Using BSP’s global perspective on communication allows each processor to pace its transmission to maximise throughput of the system as a whole. We show a significant improvement over MPI on the same problem. The approach provides high throughput, but also stable throughput because the standard deviation of delivery times is small. This maintains the accuracy of the cost model, and ensures the scalability of systems.
Acknowledgements The work of Jonathan Hill was supported in part by the EPSRC Portable Software Tools for Parallel Architectures Initiative, as Research Grant GR/K40765 “A BSP Programming Environment”, October 1995-September 1998. David Skillicorn is supported in part by the Natural Science and Engineering Research Council of Canada.
References
1. S. R. Donaldson, J. M. D. Hill, and D. B. Skillicorn. Communication performance optimisation requires minimising variance. In High Performance Computing and Networking (HPCN’98), Amsterdam, April 1998. 973
2. W. Gropp and E. Lusk. A high-performance MPI implementation on a sharedmemory vector supercomputer. Parallel Computing, 22(11):1513–1526, Jan. 1997. 971, 976, 977 3. J. L. Hammond and P. J. P. O’Reilly. Performance Analysis of Local Computer Networks. Addison Wesley, 1987. 973, 975 4. J. M. D. Hill, S. Donaldson, and D. B. Skillicorn. Stability of communication performance in practice: from the Cray T3E to networks of workstations. Technical Report PRG-TR-33-97, Oxford University Computing Laboratory, October 1997. 973 5. J. M. D. Hill, B. McColl, D. C. Stefanescu, M. W. Goudreau, K. Lang, S. B. Rao, T. Suel, T. Tsantilas, and R. Bisseling. BSPlib: The BSP Programming Library. Parallel Computing, to appear 1998. see www.bsp-worldwide.org for more details. 971 6. J. M. D. Hill and D. B. Skillicorn. Lessons learned from implementing BSP. Journal of Future Generation Computer Systems, 13(4–5):327–335, April 1998. 970, 974 7. P. J. B. King. Computer and Communication Systems Performance Modelling. International series in Computer Science. Prentice Hall, 1990. 971 8. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and answers about BSP. Scientific Programming, 6(3):249–274, Fall 1997. 970 9. S. Tasaka. Performance Analysis of Multiple Access Protocols. Computer Systems Series. MIT Press, 1986. 971 10. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990. 970 11. G. R. Wright and W. R. Stephens. TCP/IP Illustrated, Volume 2. Addison-Wesley, 1995. 978
Adaptive Routing Based on Deadlock Recovery
Nidhi Agrawal1 and C.P. Ravikumar2
1 Hughes Software Systems, Sector 18, Electronic City, Gurgaon, INDIA
2 Deptt. of Elec. Engg., Indian Institute of Technology, New Delhi, INDIA
Abstract. This paper presents a deadlock recovery based fully adaptive routing for any interconnection network topology. The routing is simple, adaptive and is based on calculating the probabilities of routing at each node to neighbors, depending upon the static and dynamic conditions of the network. The probability of routing to the ith neighbor at any node is a function of the traffic and distance from the neighbor to the destination. Since with our routing algorithm deadlocks are rare, deadlock recovery is a better solution. We also propose here two deadlock recovery schemes. Since deadlocks occur due to cyclic dependencies, these cycles are broken by allowing one of the messages involved in deadlock to take an alternate path consisting of buffers reserved for such messages. These buffers can be centralized buffers accessible to all neighboring nodes or can be set of virtuals. The performance of our algorithm is compared with other recently proposed deadlock recovery schemes. The 2-Phase routing is found to be superior compared to the other schemes in terms of network throughput and mean delay.
1 Introduction
Interprocessor communication is the bottleneck in achieving high performance in Message Passing Processors (MPPs). Various routing algorithms exist in the literature for wormhole-routed [4] interconnection networks. Most routing algorithms achieve deadlock-freedom by restricting the routing; for example, the Turn Model [5] router for hypercubes and meshes ensures deadlock-freedom by prohibiting certain turns in the routing algorithm. Many routing techniques achieve deadlock-freedom and adaptivity through the use of virtual channels [3,4]. Not only do virtual channels increase hardware complexity, they also reduce the speed of the router. Chien [2] has shown that virtual channels adversely affect the router performance, and multiplexing more than 2 to 3 virtual channels over one physical channel results in poor router performance. From the discussion above, it is clear that achieving deadlock-freedom and adaptivity at the cost of increased router complexity and increased delays is not a good solution. In this paper we propose fully adaptive routing based on deadlock recovery. The algorithm breaks deadlocks without adding significant hardware. We introduce the notion of Routing Probability at each node. Pi(j, k) denotes the probability of routing a message at node i to a neighboring node k, destined for j. Entirely chaotic routing can be achieved by selecting Pi(j, k) = 1/di. On
the other hand, a deterministic routing algorithm can be modeled by selecting Pi(j, l) = 1 for some 1 ≤ l ≤ di, and Pi(j, k) = 0, ∀k ≠ l; here di is the degree of node i. In this paper we propose a simple heuristic to calculate the routing probabilities at any node depending on the distance from the destination and/or congestion at the neighboring node. Our algorithm dynamically computes the probabilities, thus providing adaptivity. A message is declared to be deadlocked at any node i if the message header has to wait at node i for more than a predefined time called TimeOut. Since deadlocks are infrequent, it is more logical to recover from deadlocks rather than avoid them. Various schemes for deadlock recovery are available in the literature [1,6]. Compressionless routing [6] is a deadlock recovery scheme based on killing the deadlocked packets and transmitting again. This scheme requires padding flits to be attached with the message when the message length is less than the network diameter. Moreover, messages that are killed and rerouted suffer large delays. A deadlock recovery scheme called Disha [1] uses a single flit buffer at each node. The disadvantage of this scheme is that at any time only one message can be routed onto deadlock-free buffers, thus we can recover from only one deadlock at a time; furthermore, the token-based mutual exclusion on the central virtual network is implemented by adding additional hardware. We propose here two schemes for deadlock recovery. Our first scheme makes use of multiple central buffers; on the other hand, our second scheme, also called 2-Phase Routing, divides the network into two virtual networks. Both the schemes eliminate the disadvantages associated with the previously proposed recovery schemes. The rest of the paper is organized as follows: Next section describes the routing algorithm. Section 3 describes the proposed deadlock recovery schemes. The results of simulation are given in Section 4. Section 5 concludes the paper.
2 Adaptive Routing
2.1 Algorithm
Various heuristics may be used to calculate the routing probabilities Pi(j, k) introduced in Section 1. In this section we propose a simple heuristic as shown below:

∀k ∈ Nbr(i),   Pi(j, k) = (1/Wk) / ( Σ_{k ∈ Nbr(i)} 1/Wk )   (1)
Here Nbr(i) is the set of immediate neighbors of node i and Wk is a weight assigned to the neighbor k which is a linear function of the distance δ(j, k) and the congestion at the neighboring node k, denoted as Qk. The weight Wk is assigned as follows:

Wk = α ∗ δ(j, k) + β ∗ Qk   (2)
Each node is assumed to have knowledge of the fault-status of its neighbors and the congestion at each neighboring node. The congestion at any node k,
also denoted by Qk, is the average amount of time a header has to spend at k normalized against a timeout period TimeOut. To measure the value of Qk, node k makes use of as many timers as there are neighbors. If a header waits for more than the TimeOut period, the message is declared to be deadlocked. For most regular interconnection networks, the normalized distance function δ() is easy to evaluate. The scaling factors α and β are non-negative real constants, which dictate the nature of the routing algorithm; by selecting β = 0, we obtain distance-optimal routing. Similarly, by selecting α = 0, we obtain a hot-potato routing algorithm. We found optimum values of the ratio α/β, by simulation, for uniform and bit-reversal traffic (see Section 4).
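A sketch of how the probabilities of Equations (1) and (2) can be computed at a node; the neighbour distances and congestion values below are illustrative inputs, and the default α/β = 10 follows the tuning for uniform traffic reported in Section 4.

def routing_probabilities(neighbors, delta, Q, alpha=10.0, beta=1.0):
    """neighbors: candidate next hops k; delta[k]: normalised distance from k to
    the destination; Q[k]: normalised congestion at k.  Returns Pi(j, k)
    proportional to 1/Wk with Wk = alpha * delta[k] + beta * Q[k]."""
    W = {k: alpha * delta[k] + beta * Q[k] for k in neighbors}
    inv = {k: 1.0 / W[k] for k in neighbors}
    total = sum(inv.values())
    return {k: inv[k] / total for k in neighbors}

# toy example: three neighbours, the third one heavily congested
print(routing_probabilities([0, 1, 2],
                            delta={0: 1, 1: 2, 2: 1},
                            Q={0: 0.1, 1: 0.1, 2: 3.0}))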
2.2 Detection of Deadlocks
The selection of the TimeOut interval greatly affects the router performance. If the TimeOut period is very large, the deadlocked messages remain in the network, blocking many other messages; on the other hand, if the TimeOut period is small, false deadlocks are declared. Whenever node i receives a message header H, a timer T(i, H) starts counting the number of clock cycles the header has to wait before being forwarded. If for any header H, T(i, H) exceeds TimeOut at node i, then the header is declared to be deadlocked. The optimum TimeOut interval can be found by simulations (see Section 4). We also measured the frequency of deadlocks due to the proposed routing algorithm. It is observed that under maximum load less than 5% of messages are deadlocked for a 64-node hypercube and TimeOut = 16 clocks.
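The per-header timers described above amount to a small piece of bookkeeping; the class and method names below are illustrative, and the clock-tick interface is an assumption.

class DeadlockDetector:
    """One timer per waiting header at a node: a header is declared deadlocked
    once it has waited more than TimeOut clock cycles to be forwarded."""

    def __init__(self, timeout=16):
        self.timeout = timeout
        self.waiting = {}                 # header id -> cycles waited so far

    def header_arrived(self, header_id):
        self.waiting[header_id] = 0

    def header_forwarded(self, header_id):
        self.waiting.pop(header_id, None)

    def tick(self):
        """Advance one clock cycle; return the headers now declared deadlocked."""
        deadlocked = []
        for h in list(self.waiting):
            self.waiting[h] += 1
            if self.waiting[h] > self.timeout:
                deadlocked.append(h)
                del self.waiting[h]
        return deadlocked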
3 Deadlock-Recovery
3.1 Recovery Scheme 1
After detecting a deadlock, the deadlock cycle is broken by switching one of the messages involved in the deadlock cycle to the central buffers kept aside at each node for routing deadlocked messages. Each node, say i, has K + 1 central buffers numbered 0, 1, 2, ..., K. These buffers are accessible by all neighboring nodes of i. The k-th central buffers of all the nodes form a central virtual network VNk, 0 ≤ k ≤ K. On any central virtual network, only one message is allowed to travel at a time. Thus, at most K deadlocks can be recovered simultaneously. The permission to break deadlocks is given in a ”round-robin” fashion to all the nodes. This may be implemented using K tokens corresponding to each central virtual network. Token i corresponds to the central virtual network VNi. These tokens are nothing but packets of one flit size, which rotate on VN0 along a cycle including all healthy nodes of the network. Each token carries the address of the next node in the cycle. Further details on token management and implementation are given below:
– Token synchronization: The tokens are synchronized with the router clock, i.e., each token remains at a node for one clock cycle while circulating. The worst
case waiting time to capture a token is (N ∗ (Avg.Dist.)/K) clock cycles, where N is the number of nodes and Avg.Dist. is the average distance of the network.
– Capturing a token: When a node detects a deadlock and concurrently receives a token j along the central buffer 0 carrying the address of the node, the token is captured by the node.
– Regenerating a token: In order to make sure that messages do not get deadlocked on central virtual networks, a captured token j at a node Ni is regenerated after δ(Ni, D) clock cycles, ensuring that there is at most one packet on the network, where D is the destination node.
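A quick check of the worst-case bound quoted above, N · (Avg.Dist.)/K clock cycles; the 64-node hypercube with an average distance of roughly 3 and K = 3 tokens is a hypothetical configuration chosen only for illustration.

def worst_case_token_wait(N, avg_dist, K):
    """Worst-case number of clock cycles to capture a token when K tokens
    circulate over N nodes with average distance avg_dist."""
    return N * avg_dist / K

print(worst_case_token_wait(64, 3, 3))   # 64.0 cycles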
3.2 Recovery Scheme-2
The recovery scheme-2, also called 2-phase routing and discussed in this section, provides simultaneous recovery from all the deadlocks. Each physical channel is divided into two virtual channels. The network can be viewed as two virtual networks: the Adaptive Virtual Network (AVN) and the Deadlock-free Virtual Network (DVN). All the messages are initiated in AVN. Routing is carried out without any restrictions, following the routing algorithm of Section 2. Whenever any message gets deadlocked, the routing enters phase 2 and the message is switched over to DVN. The routing procedure followed on DVN can be any deadlock-free routing, for example, e-cube routing for hypercubes, or X-Y routing and negative-first routing for meshes and k-ary n-cubes. We describe here briefly a deadlock-free, Hamiltonian-path-based routing to route on DVN, which is partially adaptive and works for various topologies like hypercubes, k-ary n-cubes, Star graphs, meshes, etc. The nodes of DVN are numbered according to a Hamiltonian number denoted as Hamilt(i). The DVN is further partitioned into two subnetworks, the Higher Virtual Subnetwork (HSN) and the Lower Virtual Subnetwork (LSN). In HSN, there is an edge from any node i to any other node j iff Hamilt(i) < Hamilt(j). There is an edge from i to j in the lower network iff Hamilt(i) > Hamilt(j). Each of these subnetworks is acyclic, thus routing in each of these networks is deadlock-free [7]. The routing function R1 (R2) of Equation 3 (4) corresponds to routing in LSN (HSN) from i to j, where EL (EH) is the edge set of LSN (HSN).

R1(i, j) = k : (i, k) ∈ EL, Hamilt(i) > Hamilt(k) ≥ Hamilt(j)   (3)
R2(i, j) = k : (i, k) ∈ EH, Hamilt(i) < Hamilt(k) ≤ Hamilt(j)   (4)
The routing functions defined by Equations 3 and 4 are deterministic. In a modification of the routing function of Equation 3, if we allow a message to switch from LSN to HSN, when a faulty or congested node is encountered, without permitting the message to return to LSN, the routing function R1 continues to remain deadlock-free (see Theorem 1). A similar argument holds for the function R2. The resulting partially adaptive routing is called Adapt-Hamilt.
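A minimal rendering of the routing functions R1 and R2 of Equations (3) and (4); neighbors and hamilt are assumed callables supplying the topology and the Hamiltonian numbering, and the 4-node ring at the end is only a toy check.

def r1_lsn(i, j, neighbors, hamilt):
    """R1 (Equation 3): legal LSN next hops k for a message at i bound for j."""
    return [k for k in neighbors(i) if hamilt(i) > hamilt(k) >= hamilt(j)]

def r2_hsn(i, j, neighbors, hamilt):
    """R2 (Equation 4): legal HSN next hops k for a message at i bound for j."""
    return [k for k in neighbors(i) if hamilt(i) < hamilt(k) <= hamilt(j)]

# toy check on a 4-node ring whose Hamiltonian numbering is 1-2-3-4
ring = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
print(r1_lsn(3, 1, ring.__getitem__, lambda v: v))   # [2]
print(r2_hsn(2, 4, ring.__getitem__, lambda v: v))   # [3]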
Fig. 1. (A) HSN and LSN for 2-dimensional Hypercube (B) CDG for R1 and R2
Fig. 2. CDG for adaptive routing in a 2-d Hypercube
Theorem 1. The adaptive routing Adapt-Hamilt is deadlock-free.
Proof. We have seen that the routing function R1 of Equation 3 is deadlock-free, implying that the channel dependency graph (CDG) induced by this function is acyclic [4] (see Figure 1). The modified routing introduces additional edges in the CDG. These additional edges are of the form (xh, yl) as shown in Figure 2, where xh is a node of the higher network CDG and yl is a node of the lower network CDG. If the adaptive routing function Adapt-Hamilt allows the additional edges of the form (xh, xl), then edges (xl, yh) are not permitted. Hence the theorem.
4 Performance Study
We conducted experiments to measure the performance of our adaptive routing with various deadlock recovery schemes on a 256-node hypercube. The performance metrics considered are the throughput and the mean delay. Throughput of the network is defined as the mean number of messages going out of the network per node per clock cycle. Mean delay is the mean number of clock cycles elapsed from the time the first flit of the message enters the network to the time the last flit leaves the network. It may be noted that throughput and latency are measured for various applied loads, thus throughput-latency graphs do not represent a function. Each node generates messages with a Poisson distribution. Two kinds of traffic patterns are generated: uniform traffic and bit-reversal traffic. In uniform traffic the destinations are generated uniformly with each node being equally probable; on the other hand, in the bit-reversal traffic pattern, the destination node address
is obtained by reversing the bits of the source node address; for example, source s1, s2, ..., sn sends a message to the destination node D = sn, sn−1, ..., s2, s1. Messages are 8 flits long. The results are taken for 4000 messages, out of which the information for the first 2000 messages is discarded (warm-up period). It takes one clock cycle to transfer a flit across a physical channel. The header is assumed to be one flit long. The experiment is repeated several times by varying the seed and the average value is plotted.
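The bit-reversal destination described above can be computed with a few shifts; the helper below assumes node addresses of num_bits bits (p a power of two) and is a straightforward reading of "reverse the bits of the source address".

def bit_reversal_dest(src, num_bits):
    """Destination for bit-reversal traffic: the source address with its
    num_bits-bit binary representation reversed."""
    dest = 0
    for _ in range(num_bits):
        dest = (dest << 1) | (src & 1)
        src >>= 1
    return dest

# 256-node hypercube, 8 address bits: 0b00000011 -> 0b11000000
print(bit_reversal_dest(0b00000011, 8))   # 192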
4.1 Finding Optimum TimeOut and α/β
To fine-tune the TimeOut period for each type of traffic, we conducted experiments. The network performance is measured for various TimeOut periods. Figure 3 shows the throughput versus latency curves for various TimeOut periods under the bit-reversal traffic pattern. It may be observed that latency increases slowly initially, up to a throughput value of 0.02 messages per node per clock cycle, and then there is a sharp increase. Since the highest throughput is achieved with TimeOut = 16, it is chosen as the optimum value. Similarly, for uniform traffic the optimum TimeOut value is also found to be 16. The plot is not shown due to the lack of space. For both types of traffic model, we found the optimum value of the ratio α/β. Figure 4 shows the performance of the routing algorithm for various values of α/β for uniform traffic. The optimum value of α/β is the value which gives the highest peak performance. The throughput reaches its maximum value up to a certain applied load and then starts reducing; this is the unstable condition of the network. From Figure 4 it can be noticed that the optimum value is α/β = 10, because for this value the highest throughput is achieved before saturation. Similarly, for bit-reversal traffic the optimum value of α/β is found to be 5, as shown in Figure 5.
4.2 Comparison of Various Schemes
We compared the performance of the proposed routing algorithm for various deadlock recovery schemes. Both the proposed recovery schemes are compared with the existing deadlock recovery scheme disha. Figure 6 shows a comparison of the three recovery schemes under uniform traffic conditions. The throughput value saturates at 10% with both disha and scheme-1 and at 21% with 2-phase routing. It is also observed that by adding more and more central virtual buffers (or tokens) the performance of scheme-1 improves. In Figure 6, the performance of scheme-1 is measured for 3 central virtual buffers at each node. We also analyzed the performance of the proposed routing in the presence of node faults. It is observed that there is a negligible degradation in the performance of the routing for various fault conditions. The plots are not shown here due to the lack of space. The performance is also compared for the three recovery schemes under bit-reversal traffic. Figure 7 shows the comparison. It may be observed that the recovery scheme-1 outperforms the recovery scheme disha, and the 2-phase routing performs far better than both scheme-1 and disha. Figure 7 shows that
after saturation the mean delay increases sharply; however, the throughput value does not increase. In this case, saturation occurs only at 6%.
5 Conclusion
We have presented a fully adaptive, restriction-free routing algorithm. Since deadlocks are rare, we considered deadlock recovery instead of traditional deadlock avoidance. We have presented two deadlock recovery schemes, the recovery scheme-1 and 2-phase routing. The recovery scheme-1 provides simultaneous recovery from K deadlocks, where K is the number of central buffers at each node; on the other hand, 2-phase routing provides simultaneous recovery from all the deadlocks. The performance of both the proposed recovery schemes is compared with the recovery scheme disha [1]. It is observed that the recovery scheme-1 outperforms the recovery procedure of [1] under both uniform and bit-reversal traffic. Also, the 2-phase routing performs far better than both disha and the recovery scheme-1.
References 1. K.V.Anjan and T.M.Pinkston, ”Disha: A deadlock Recovery Scheme for Fully Adaptive Routing”, Proc. of the 19th Int. Parallel Processing Symposium, 1995. 982, 987 2. A.A.Chien, ”A Cost and Speed Model for k-ary n-cube Wormhole Routers”, IEEE Transactions on Parallel and Distributed Systems, Vol.9, No.2, pages 150 − 162, 1998. 981 3. W.Dally and H.Aoki, ”Deadlock-free Adaptive Routing in Multicomputer Networks using Virtual Channels”, IEEE Transactions on Parallel and Distributed Systems, Vol.4, No.4,1993, pages 466 − 475. 981 4. W.Dally and C.Seitz, ”Deadlock-free message routing in multiprocessor interconnection networks”, IEEE Transactions on Computers, C-36(5), 1987, pages 547 − 553. 981, 985 5. C.J.Glass and L.M.Ni, ”The turn model for Adaptive routing”, Proc. of the 19th Int. Symp. on Comp. Architecture, 1992, pages 278 − 287. 981 6. J.Kim, Z.Liu and A.Chien, ”Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing”, IEEE Transactions on Parallel and Distributed Systems, Vol.8, No.3, pages 229 − 244, 1997. 982 7. X.Lin, P.K.McKinley and L.M.Ni, ”Deadlock-free multicast wormhole routing in 2D mesh multicomputers”, IEEE Trans. on Parallel and Distributed Systems, Vol.5, no.8, 1994, pages 793 − 804. 984
Fig. 3. Tuning TimeOut interval for bit-reversal traffic
Fig. 4. Tuning α/β for uniform traffic
Fig. 5. Tuning α/β for bit-reversal traffic
Fig. 6. Throughput Vs. mean delay for the three recovery schemes under uniform traffic
Fig. 7. Throughput Vs. mean delay for the three recovery schemes under bit-reversal traffic
On the Optimal Network for Multicomputers: Torus or Hypercube? Mohamed Ould-Khaoua Department of Computer Science, University of Strathclyde Glasgow G1 1XH, UK.
Abstract. This paper examines the relative performance merits of the torus and hypercube when adaptive routing is used. The comparative analysis takes into account channel bandwidth constraints imposed by VLSI and multiple-chip technology. This study concludes that it is the hypercube which exhibits the superior performance, and therefore is a better candidate as a high-performance network for future multicomputers with adaptive routing.
1 Introduction
The hypercube and torus are the most common instances of k-ary n-cubes [5]. The former has been used in early multicomputers [3,10] while the latter has become popular in recent systems [4,8]. This move towards the torus has been mainly influenced by Dally’s study [5]. When systems are laid out on a VLSI chip, Dally has shown that under the constant wiring density constraint, the 2 and 3-D torus outperform the hypercube due to their higher bandwidth channels. Abraham [1] and Agrawal [2] have argued that the wiring density argument is applicable where a network is implemented on a VLSI-chip, but not in situations where it is partitioned over many chips. In such circumstances, they have identified that the most critical bandwidth constraint is imposed by the chip pin-out. Both authors have concluded that it is the hypercube which exhibits better performance under this new constraint. Wormhole routing [11] has also promoted the use of high-diameter networks, like the torus, as it makes latency independent of the message distance in the absence of blocking. In wormhole routing, a message is broken into flits for transmission and flow control. The header flit governs the route, and the remaining data flits follow in a pipeline. If the header is blocked, the data flits are blocked in situ. Most previous comparative analyses of the torus and hypercube [1,2,5,6] have used deterministic routing, where a message always uses the same network path between a given pair of nodes. Deterministic routing has been widely adopted in practice [3,8,10] because it is simple and deadlock-free. However, messages cannot use alternative paths to avoid congested channels. Fully-adaptive routing has often been suggested to overcome this limitation by enabling messages to explore all the available paths in the network. Duato [6] has recently proposed a
fully-adaptive routing, which achieves deadlock-freedom with minimal hardware requirement. The Cray T3E [4] is an example of a recent machine that uses Duato’s routing algorithm. The torus continues to be a popular topology even in multicomputers which employ adaptive routing. However, before adaptive routing can be widely adopted in practical systems, it is necessary to determine which of the competing topologies are able to fully exploit its performance benefits. To this end, this paper re-assesses the relative performance merits of the torus and hypercube in the context of adaptive routing. The study compares the performance of the 2 and 3-D torus to that of the hypercube. The analysis uses Duato’s fully-adaptive routing [6]. The present study uses queueing models developed in [9] to examine network performance under uniform traffic. Results presented in the next section reveal that it is the hypercube which provides the optimal performance under both the constant wiring density and pin-out constraints, and thus is the best candidate as a high-performance network for future multicomputers with fully-adaptive routing.
2 Performance Comparison
The torus has higher bandwidth channels than its hypercube counterpart under the constant wiring density and pin-out constraints. The detailed derivation of the exact relationship between the channel width of the torus in terms of that of the hypercube under both constraints can be found in [1,2,5]. The router’s switch in the hypercube is larger than that in the torus due to its larger number of physical and virtual channels. As a consequence, the switching delay in the hypercube should be higher due to the additional complexity. Comparable switching delays in the two networks can be obtained if the routers have comparable switch sizes. This can be achieved by normalising the total number of channels in the routers of the two networks. When mapped in the 2D plane, the hypercube ends up with longer wires, and thus higher wire delays than the torus. However, delays due to long wires can be reduced by using pipelined channels as suggested in [12]. The performance of the networks is examined below when both the wire delay is taken into account and when it is ignored. For illustration, network sizes of N=64 and 1024 nodes are examined. The channel width in the hypercube is one bit. The channel width in the torus is normalised to that of the hypercube. The message length (M) is 128 flits. A physical channel in the hypercube has V=2 virtual channels. The total number of virtual channels per router in the torus is normalised to that in the hypercube. Figures 1(a) and (b) depict latency results in the torus and hypercube under the constant wiring density constraint and when the effects of wire delays are taken into account in the 64 and 1024 node systems respectively. The figures reveal that the torus is able to exploit its wider channels to provide a lower latency than the hypercube under light to moderate traffic. However, as traffic increases, its performance degrades as message blocking rises, offsetting any advantage of having wider channels.
Fig. 1. The performance of the torus and hypercube including the effects of wiring delays. (a) N=64, (b) N=1024.
Fig. 2. The performance of the torus and hypercube ignoring the effects of wiring delays. (a) N=64, (b) N=1024.
Figures 2(a) and (b) show latency when the effects of wire delays are ignored. The torus outperforms the hypercube under light to moderate traffic, but loses its edge to the hypercube under heavy traffic. The difference in performance between the two networks increases in favour of the hypercube for larger network sizes. Figures 1 and 2 together reveal an important finding about the torus, and that is even though wires are longer in the hypercube, the torus is more sensitive to the effects of wire delays. This is because a message in the torus crosses, on average, a larger number of routers, and therefore requires a longer service time to reach its destination. Since the ratio in channel width in the hypercube and torus decreases under the constant pin-out constraint, we can conclude that the hypercube is even more favourable when the networks are subjected to this condition.
3 Conclusion
This paper has compared the performance merits of the torus and hypercube in the context of adaptive routing. The results have revealed that the hypercube has superior performance characteristics to the torus, and therefore is a better candidate as a high-performance network for future multicomputers that use adaptive routing.
References 1. S. Abraham, Issues in the architecture of direct interconnection networks schemes for multiprocessors, Ph.D. thesis, Univ. of Illinois at Urbana-Champaign(1991). 989, 990 2. A. Agarwal., Limits on interconnection network performance, IEEE Trans. Parallel & Distributed Systems,Vol. 2(4), (1991) 398–412. 989, 990 3. R. Arlanskas, iPSC/2 system: A second generation hypercube, Proc. 3rd ACM Conf. Hypercube Concurrent Computers and Applications,Vol. 1 (1988). 989 4. Cray Research Inc., The Cray T3E scalable parallel processing system, on Cray’s web page at http://www.cray.com/PUBLIC/product-info/T3E/. 989, 990 5. W.J. Dally, Performance analysis of k-ary n-cubes interconnection networks, IEEE Trans. Computers,Vol. 39(6) (1990) 775-785. 989, 990 6. J. Duato, A New theory of deadlock-free adaptive routing in wormhole routing networks, IEEE Trans. Parallel & Distributed Systems, Vol. 4 (12) (1993) 320– 331. 989, 990 7. J. Duato, M.P. Malumbres, Optimal topology for distributed shared-memory multiprocessors: hypercubes again?, Proc. EuroPar’96, Lyon, France, (1996) 205–212. 8. R.E. Kessler, J.L. Schwarzmeier, CRAY T3D: A new dimension for Cray Research, in CompCon, Spring (1993) 176–182. 989 9. M. Ould-Khaoua, An Analytical Model of Duato’s Fully-Adaptive Routing Algorithm in k-Ary n-Cubes, to appear in the Proc. 27th Int. Conf. Parallel Processing, 1998. 990 10. C.L. Seitz, The Cosmic Cube, CACM, Vol. 28 (1985) 22–33.
11. C.L. Seitz, The hypercube communication chip, Dep. Comp. Sci., CalTech, Display File 5182:DF:85 (1985). 989
12. S.L. Scott, J.R. Goodman, The impact of pipelined channels on k-ary n-cube networks, IEEE Trans. Parallel & Distributed Systems, Vol. 5(1) (1994) 2-16. 990
Constant Thinning Protocol for Routing h-Relations in Complete Networks
Anssi Kautonen1, Ville Leppänen2, and Martti Penttonen1
1 Department of Computer Science, University of Joensuu, P.O.Box 111, 80101 Joensuu, Finland, {Anssi.Kautonen,Martti.Penttonen}@cs.joensuu.fi
2 Department of Computer Science and Turku Centre for Computer Science, University of Turku, Lemminkäisenkatu 14 A, 20520 Turku, Finland, [email protected]
Abstract. We propose a simple protocol, called constant thinning protocol, for routing in a complete network under OCPC assumption, analyze it, and compare it with some other routing protocols.
1 Introduction
Parallel programmers would welcome the “flat” shared memory, because it would make PRAM [15] style programming possible and thus make it easier to utilize the rich culture of parallel algorithms written for the PRAM model. As large memories with a large number of simultaneous accesses do not seem feasible, the only possible way to build a PRAM type parallel computer appears to be to build it from processors with local memories. The fine granularity of parallelism and global memory access, which makes the PRAM model so desirable for algorithm designers, sets very high demands for data communication. Fortunately, memory accesses at the rate of the processor clock are not necessary, but due to the parallel slackness principle [16], latency in access does not imply inefficiency. It is enough to route an h-relation efficiently. By definition, in an h-relation there is a complete graph, where each node (processor) has at most h packets to send, and it is the target of at most h packets. We assume the OCPC (Optical Communication Parallel Computer) or 1-collision assumption [1]: If two or more packets arrive at a node simultaneously, all fail. An implementation of an h-relation is work-optimal at cost c, if all packets arrive at targets in time ch. The first attempt to implement an h-relation is to use a greedy routing algorithm. By the greedy principle, one tries to send packets as fast as one can. The fatal drawback of the greedy algorithm is the livelock: packets can cause mutual failure of sending until eternity. In a situation when each of two processors has one packet targeted to the same processor, they always fail. Another routing algorithm was proposed by Anderson and Miller [1], and it was improved by Valiant [15]. They realize work-optimally an h-relation for
h ∈ Ω(log p), where p is the number of processors. Other algorithms with even lower latency were provided by [6,7,2,3,11]. Contrary to these, the h-relation algorithm of Geréb-Graus and Tsantilas [5], GGT for short, has the advantage of being direct, i.e. the packets go to their targets directly, without intermediate nodes. For other results related with direct or indirect routing, see also [4,14,8,9,12,13]. Actually our new algorithm was inspired by [13], although the latter deals with the continuous routing problem, not the h-relation.

proc GGT(h, ε, α)
  for i = 0 to log_{1/(1−ε)} h do
    Transmit((1−ε)^i h, ε, α)

proc Transmit(h, ε, α)
  for all processors P pardo
    for (e/(1−ε)) · (h + max{√(4αh ln p), 4α ln p}) times do
      choose an unsent packet x at random
      attempt to send x with probability (# unsent packets)/h
2
Thinning Protocols
The throughput of the greedy routing of randomly addressed packets is characterized by

(1 − 1/h)^{h−1} ≥ 1/e,

where 1/h is the probability that one of the h−1 competing processors is sending to the same processor at the same time. (For all x > 0, (1 − 1/x)^{x−1} ≥ e^{−1}.) This would be the throughput if all processors would create and send a new randomly addressed packet, which is not the case in routing an h-relation. When failed packets are resent again and again, the risk of livelock grows. A solution is to decrease the sending probability. In the GGT algorithm, the sending probability of packets varies between 1 and 1 − ε (0 < ε < 1). The transmission of packets is thus 'thinned' by a factor between 1 and 1/(1 − ε), preventing the livelock. We now propose a very simple routing protocol, where thinning is more explicit.

proc CT(h, h0, t, t0)    % 1 ≤ h0 ≤ h, 1 ≤ t0 ≤ t
  for all processors do
    while packets remain do
      Transmit h randomly chosen packets (if so many remain)
        within time window [1..th]
      h := max{(1 − e^{−1/t0}) h, h0}

A drawback of thinning is that a processor cannot successfully send a packet at those moments when it does not even try to send. Thinning by factor t would thus imply inefficiency by factor t. But this is somewhat balanced by the success probability, which increases from 1/e to

(1 − 1/(th))^{h−1} ≥ 1/e^{1/t}.
The expected throughput with thinning is thus characterized by the function t·e^{1/t}, whose growth is very modest: for t = 1, 1.2, 1.5, 2.0 and 3.0 the values of t·e^{1/t} are 2.7, 2.8, 2.9, 3.3, and 4.2. Still, even small thinning greatly increases the robustness against livelock. For very small numbers of packets constant thinning alone does not guarantee high success probability. For that reason, CT has a minimum size h0 ∈ Ω(log p) for the thinning window, to prevent repeated collisions.
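To make the protocol above concrete, the following C sketch simulates a single CT transmission round under the OCPC 1-collision rule. It is our own illustration, not code from the paper: all names (Packet, ct_round, P) are invented, and for simplicity it ignores the constraint that a single processor occupies distinct slots for its own packets.

/* One CT round under the 1-collision assumption (illustrative sketch).   */
#include <stdlib.h>

#define P 1024                     /* number of processors (example value) */

typedef struct { int target; int sent; } Packet;

static void ct_round(Packet *pkts, int npkts, int h, double t)
{
    int window = (int)(t * h);                 /* time window [1..th]      */
    int *slot = malloc(npkts * sizeof *slot);
    int *hits = calloc((size_t)P * (window + 1), sizeof *hits);

    for (int i = 0; i < npkts; i++) {          /* pick a random slot       */
        slot[i] = pkts[i].sent ? -1 : 1 + rand() % window;
        if (slot[i] > 0)
            hits[pkts[i].target * (window + 1) + slot[i]]++;
    }
    for (int i = 0; i < npkts; i++)            /* unique arrival succeeds  */
        if (slot[i] > 0 &&
            hits[pkts[i].target * (window + 1) + slot[i]] == 1)
            pkts[i].sent = 1;

    free(slot);
    free(hits);
}

After such a round the caller would shrink h to max{(1 − e^{−1/t0})h, h0}, exactly as in the CT pseudocode above.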
3
Analysis
The analyses for GGT and CT are rather similar. One can prove that

1. the number of unsent packets decreases geometrically from h to log p,
2. the rest of the packets can be routed in time O(log p log log p).

Theorem 1. For h ∈ Ω(log p log log p), CT routes any h-relation in time O(h) with high probability.

Proof. Consider a round of the while loop, where h' ≥ k0 log p = h0 for a suitable constant k0. We shall show that if in the beginning of a round the routing task is an h'-relation, it will be a ch'-relation after the round with high probability. To ease the calculation of probabilities, we worsen the routing task a little by completing the initial h'-relation to a full h'-relation. We do this by adding 'dummy packets' with proper destination to those processors not having initially h' packets. We assume that the dummy packets participate in routing as the other packets. In the analysis below, the dummy packets can collide with normal packets as well as with other dummy packets, but in the actual algorithm attempting to route a dummy packet corresponds to an unallocated time slot (unable to cause any collisions). Thus the number of successful normal packets is always better in the actual situation.

We first show that, with high probability, the number of packets decreases to ch' or below. As a packet has the probability 1/(th') of being sent at a given moment of time (from the interval [1..th']), and by the h'-relation assumption, at most h' packets have the same target, the probability of success for a packet at a given moment of time is at least

(1/(th')) · (1 − 1/(th'))^{h'−1} ≥ 1/(e^{1/t} · th'),

since

(1 − 1/(th'))^m ≥ (1 − 1/(th'))^{h'−1}

for any m ≤ h'−1 (consider m as the actual number of other packets with the same target) and

(1 − 1/(th'))^{h'−1} = ((1 − 1/(th'))^{th'−t})^{1/t} ≥ ((1 − 1/(th'))^{th'−1})^{1/t} ≥ e^{−1/t}
for h' > 1 and t ≥ 1. For each packet, the probability of success during the procedure call is at least

(1/(e^{1/t} · th')) · th' = e^{−1/t}.

Hence, the expected number of successful packets of a processor within these th' units of time is E_t = h' e^{−1/t}. By applying the Chernoff bound [10]

Pr(N < (1 − ε)E_t) ≤ e^{−ε² E_t / 2}

and choosing (1 − ε)E_t = E_{t0}, we get ε = 1 − E_{t0}/E_t ≈ 1/t0 − 1/t, and

Pr(N < h' e^{−1/t0}) ≤ e^{−0.5 ε² h' e^{−1/t}} ≤ 1/p^{2.5}

for h' > h0 = k0 log p for some constant k0. Hence, in a processor the number of outgoing packets decreases by compression factor c = 1 − 1/e^{1/t0} with probability 1 − 1/p^{2.5}, and in all p processors and in all log_{1/c} p < √p rounds with probability 1 − 1/p. Guaranteeing that the number of incoming packets is at most ch' for each processor is analyzed analogously. Clearly, the full h'-relation (completed with dummy packets) decreases to a ch'-relation. Removing the dummy packets from the system can only decrease the degree of the relation. Within log_{1/c} h geometrically converging phases, or O(ht) time, there remain no more than O(log p) packets in each processor. Observe that by choosing a larger h0 in the analysis above, we can easily show the same progression in the degree of the relation with probability 1 − p^{−α} for any positive constant α.

For the rest of the algorithm, when h' ≤ h0 packets remain, the success probability of a packet is

(h'/(th0)) · (1 − 1/(th0))^{h0−1} ≥ h'/(h0 · t e^{1/t}).

Thus the expectation of sending times for this packet is t e^{1/t} h0/h'. The sum of all these expectations, until all packets have been sent, is

t e^{1/t} h0 (1/h0 + 1/(h0−1) + ... + 1/2 + 1) = E ∈ O(h0 log h0).

By another form of the Chernoff bound [10], Pr(T > r) ≤ 1/2^r for r > 6E, we see that

Pr(T > k1 h0 log h0) ≤ 1/p^{2.5}

for some constant k1, and therefore all packets can be transmitted to their targets in time O(log p log log p) with high probability. By combining the two phases we see that all packets can be routed in time O(h + log p log log p) with high probability. By choosing h ∈ Ω(log p log log p) we complete the proof.
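As a numeric illustration of the geometric decrease (our own arithmetic, using the parameter values that also appear in the experiments below):

c = 1 − e^{−1/t0} = 1 − e^{−1/1.2} ≈ 0.565 for t0 = 1.2, and with h = 128, h0 = 16 we get c³·128 ≈ 23 > 16 but c⁴·128 ≈ 13 < 16,

so roughly four geometrically shrinking phases suffice before the remaining O(log p) packets are handled by the O(log p log log p) tail phase.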
4
Experiments
We ran some experiments to get practical experience of the new algorithm, see Table 1. The results are averages of 500 experiments.
Table 1. Routing cost of the CT algorithm with h = log₂² p and h0 = log₂ p. Thinning parameters t = 1.2, 1.5, 2.0 were tried. For comparison, results of GGT with h = log₂² p, ε = 0.25, α = 0.1 are presented.

  p      t = 1.2   t = 1.5   t = 2.0   GGT
  128     3.74      3.82      4.20     4.86
  256     3.73      3.77      4.16     4.92
  512     3.76      3.77      4.13     4.95
  1024    3.81      3.74      4.10     4.97
  2048    3.78      3.72      4.07     4.99
  4096    3.79      3.70      4.05     5.00
By the results in Table 1, the constant thinning algorithm CT achieves the cost 3.7 with slackness h = log₂² p. With a suitable combination of ε and α, GGT comes below 5 but is clearly slower than CT. Note that e ≈ 2.7 is the lower bound of the cost. In addition to mere numbers, our routing simulator shows the progress of routing graphically (see Figure 1). In CT the throughput of packets is constant for most of the time, while GGT follows a saw blade pattern. The "tail" of the graph is very critical. Too little thinning or too small an h0 lengthens the tail, as the algorithm approaches the greedy algorithm.
Fig. 1. Graphical output of a CT simulation in case p = 1024, h = 128, h0 = 16, t = t0 = 1.2. The picture is divided into three horizontal bands by ragged borderlines at 1/t = 83% and at 1/(t e^{1/t}) = 36% of all processors. The bottom band represents successful processors, the middle band failed processors, and the top band passive processors. The vertical lines at intervals of cth, c²th, ..., until c^i th < h0 = 16 (c = 1 − 1/e^{1/t0}) separate the phases of the algorithm. The total time in the picture is 456 = 3.56h.
References

1. R.J. Anderson and G.L. Miller. Optical communication for pointer based algorithms. Technical Report CRI-88-14, Computer Science Department, University of Southern California, LA, 1988.
2. A. Czumaj, F. Meyer auf der Heide, and V. Stemann. Shared memory simulations with triple-logarithmic delay. Proc. ESA'95, 46–59, 1995.
3. M. Dietzfelbinger and F. Meyer auf der Heide. Simple, efficient shared memory simulations. Proc. SPAA'93, 110–119.
4. J. Håstad, T. Leighton, and B. Rogoff. Analysis of backoff protocols for multiple access channels. SIAM J. Comput. 25:740–774, 1996.
5. M. Geréb-Graus and T. Tsantilas. Efficient optical communication in parallel computers. In Proc. SPAA'92, pp. 41–48, June 1992.
6. L.A. Goldberg, M. Jerrum, T. Leighton, and S. Rao. A doubly logarithmic communication algorithm for the completely connected optical communication parallel computer. In Proc. SPAA'93, pp. 300–309, June 1993.
7. L.A. Goldberg, Y. Matias, and S. Rao. An optical simulation of shared memory. In Proc. SPAA'94, pp. 257–267, June 1994.
8. L.A. Goldberg and P.D. MacKenzie. Analysis of Practical Backoff Protocols for Contention Resolution with Multiple Servers. Proc. SODA'96, pp. 554–563.
9. L.A. Goldberg and P.D. MacKenzie. Contention Resolution with Guaranteed Constant Expected Delay. Proc. FOCS'97, pp. 213–222.
10. T. Hagerup, C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters 33:305–308, 1989.
11. R.M. Karp, M. Luby, and F. Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. Proc. STOC'92, pp. 318–326.
12. A. Kautonen, V. Leppänen and M. Penttonen. Simulations of PRAM on Complete Optical Networks. In Proc. EuroPar'96, LNCS 1124:307–310.
13. M. Paterson and A. Srinivasan. Contention resolution with bounded delay. In Proc. FOCS'95, pp. 104–113.
14. P. Raghavan and E. Upfal. Stochastic Contention Resolution With Short Delays. In Proc. STOC'95, pp. 229–237.
15. L.G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Vol. A, 943–971, 1990.
16. L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103–111, 1990.
NAS Integer Sort on Multi-threaded Shared Memory Machines

Thomas Grün and Mark A. Hillebrand

Computer Science Department, University of the Saarland, Postfach 151150, Geb. 45, 66041 Saarbrücken, Germany
[email protected] [email protected]

Abstract. Multi-threaded shared memory machines, like the commercial Tera MTA or the experimental SB-PRAM, have an extremely good performance on the Integer Sort benchmark of the NAS Parallel Benchmark Suite and are expected to scale. The number of CPU cycles is an order of magnitude lower than the numbers reported for general purpose distributed memory or shared memory machines; even vector computers are slower. The reasons for this behavior are investigated. It turns out that both machines can take advantage of a fetch-and-add operation and that due to multi-threading no time is lost waiting for memory accesses to complete. Except for non-scalable vector computers, the Cray T3E, which supports fetch-and-add but not multi-threading, is the only parallel computer that could challenge these machines.
1
Introduction
One of the first programs that ran on a node processor of the Tera MTA (multi-threaded architecture) [2,3] was the Integer Sort (IS) from the Parallel Benchmark suite [8,14] of the Numerical Aerospace Simulation Facility (NAS) at NASA Ames Research Center. According to press releases of Tera Corporation [17], a single node processor of the Tera machine was able to beat a one-processor Cray T90, which hitherto held the record for one-processor machines, by more than 30 percent. The SB-PRAM [1,9,12,15] has many features in common with the Tera MTA, although it is inspired by a totally different idea, viz. realizing the PRAM model from theoretical computer science [18]. Both machines differ from most of today's shared memory parallel computers in two aspects: First, they implement the UMA (uniform memory access) model instead of the NUMA (non-uniform memory access) scheme found in today's scalable shared memory machines like SGI Origin, Sun Starfire or Sequent Numa-Q. In other words, they do not employ data caches with a cache coherence protocol, but hide memory latency by means of multi-threading. Second, they provide special hardware for fetch-and-add instructions in the memory system. Commercial microprocessors, which are employed in most of today's parallel computers, do not offer this kind
This work was partly supported by the German Science Foundation (DFG) under contract SFB 124, TP D4.
of hardware support for running efficient parallel programs. An exception is the Cray T3E, which is a UMA machine with fetch-and-add support but without multi-threading. The IS benchmark is part of the NAS (Numerical Aerospace Simulation Facility) Parallel Benchmark Suite (NPB) [8]. It models the ranking step of a counting sort (a kind of bucket sort) application, which occurs for instance in particle simulations. The IS benchmark takes a list L of small integers as an input and computes for every element x ∈ L the rank r(x) as its position in the sorted list. On this benchmark, the best program on a 4-processor SB-PRAM spends 3.3 cycles per element (CPE) on average, and the single-node Tera machine needs 2.4 cycles. Both machines are expected to scale without significant loss of efficiency. A Cray T90 node approaches 5 CPE, and the numbers reported for the MPI-based IS sample code on various distributed and shared memory computers (including the Cray T3E) are greater than 25 CPE. Besides, scalability is a problem on these machines. In this article we explain why these numbers differ in such a wide range and investigate whether there are better implementations than the sample code for CC-NUMA machines and the Cray T3E. The paper is organized as follows. Section 2 discusses the IS benchmark specification. Section 3 addresses multi-threaded shared memory machines and Section 4 discusses a fast algorithm for vector computers. Section 5 presents results of the MPI-based message-passing sample code and Section 6 investigates if CC-NUMA machines or the Cray T3E UMA machine could improve on these results. Section 7 concludes.
2
The NAS Integer Sort Benchmark
The Integer Sort (IS) program is part of the NAS Numerical Parallel Benchmark suite [8]. Despite its name it does not sort but rank an array of small integers. The first revision (NPB 1) was a "paper and pencil" benchmark specification, so as not to preclude any programming tricks for a particular machine. The benchmark specifies that an array keys[] of N keys is filled using a fully specified random number generator that produces Gaussian distributed random integers in the range [0, Bmax−1]. This key generation and a final verification step, which checks the results, are excluded from the timing. There are several "classes" of the benchmark, which define the parameter values N and Bmax. We concentrate on class A (N = 2^23, Bmax = 2^19); the parameters for class B and C are higher by a factor of 4 and 16, respectively. The timed part of the benchmark consists of 10 iterations of (a) a modification of two keys, (b) the actual ranking step, and (c) a partial verification that tests if the well-known rank of 5 keys has been computed correctly. A typical implementation of the inner loops that perform the actual ranking is listed in Table 1. The NPB 2 revision of the benchmark intends to supplement the original benchmark description with sample implementations that can be used as a starting point for machine-specific optimizations. There is a serial implementation (NPB 2.3-serial) and an implementation based on the message passing standard
Table 1. Inner loops of the NAS IS benchmark

(1) for i = 1 to B: count[i] = 0                         // clear count array
(2) for i = 1 to N: count[key[i]]++                      // count keys
(3) for i = 1 to B: pos[i] = Σ_{k=0}^{i−1} count[k]      // calc start position
(4) for i = 1 to N: rank[i] = pos[key[i]]++              // ONLY NPB 1
MPI (NPB 2.3b2). Unfortunately, the fourth loop of the benchmark, which was present in earlier sample implementations of NPB 1, has been omitted in NPB 2. However, the fourth loop is identical to the second loop except for storing the result, and does not require a new programming technique. Therefore, we use the NPB 2 variant of IS, class A, throughout this paper.

Performance measures. The original performance measure of NAS IS is the elapsed time for 10 iterations of the inner loops. The main performance measure in our investigation is the number of CPU cycles per key element (CPE), because it emphasizes the architectural aspect rather than technology. We will point out situations in which CPE is an unfair measure.
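For reference, the four loops of Table 1 can be written in plain sequential C as follows; this is our own rendering with 0-based indices, not the NPB sample code:

/* Sequential reference for the ranking loops of Table 1.                 */
void rank_keys(const int *key, int *rank, int *count, int *pos,
               int N, int B)
{
    for (int i = 0; i < B; i++)              /* (1) clear count array      */
        count[i] = 0;
    for (int i = 0; i < N; i++)              /* (2) count keys             */
        count[key[i]]++;
    pos[0] = 0;                              /* (3) prefix sum of counts   */
    for (int i = 1; i < B; i++)
        pos[i] = pos[i - 1] + count[i - 1];
    for (int i = 0; i < N; i++)              /* (4) rank keys (NPB 1 only) */
        rank[i] = pos[key[i]]++;
}

Loop (2) dominates the cycle count and is where the fetch-and-add hardware discussed in the next section pays off.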
3
Multi–threaded SMMs
First, we sketch the SB-PRAM design and highlight the Tera MTA differences. Then we describe the result of the best SB-PRAM implementation and indicate the algorithmic changes in the Tera MTA implementation.

3.1
Hardware Design
The SB–PRAM [1,9,12] is a shared memory computer with up to 128 processors and an equal number of memory modules, which are connected by a butterfly network. Processors send their requests to access a memory location via the butterfly network to the appropriate memory modules and receive an answer packet if the request was of type LOAD. There are no caches in the memory system. Three key concepts are employed to yield a uniform load distribution on the memory modules and to hide memory latency: (a) synchronous multi–threading of 32 virtual processors (VP) with a delayed load; (b) hashing of subsequent logical addresses to physical addresses that are spread over all memory modules; (c) butterfly network with input buffers and combining of packets. The combining facility is extended to do parallel prefix computations, which, due to synchronous execution, look to the programmer as if the VPs executed the operations one after another in a predefined order (sequential semantics). The design ensures that the VPs are never stalled in practice, neither by network congestion on the way to the memory modules nor by answer packets that arrive too late. Although this issue has been intensively investigated in [7]
and [19] using different and detailed simulators, the main criticism of the SB-PRAM is that these investigations were based on simulations only. Therefore, a 64 processor machine is being built as a proof of concept. At the time of this writing, the re-design of an earlier 4-processor prototype [4] has been completed. The 64 processor machine is expected to be running in summer 1998. In [10] a variant of the SB-PRAM, called High Performance PRAM (HPP), has been sketched. Due to modest architectural changes and using top 1995 technology, the HPP is expected to achieve the tenfold performance of the SB-PRAM on average compiler generated programs.

Tera MTA. The design goal of the Tera MTA [2,3,17] was to build a powerful multi-purpose parallel computer with special emphasis on numerical programs. In order to meet this goal, it was determined early in the design phase to employ GaAs technology and liquid cooling. A processor chip runs at ≥ 294 MHz and has a power dissipation of 6 KW per processor. Similar to the SB-PRAM, the Tera MTA hides memory latency by means of multi-threading. Unlike the SB-PRAM, it does not schedule a fixed number of VPs round robin, but it can support up to 128 threads, all of which can have up to 8 active memory requests. New threads can be created using a low overhead mechanism; the penalty for an instruction that attempts to use a register that is waiting for a memory request to arrive is only one cycle. The memory system of the Tera MTA is completely different from that of the SB-PRAM: the network topology is a kind of 3-dimensional torus. The messages are processed by a randomized routing scheme that may detour packets if the output link in the right direction is either overloaded or out of order. To our knowledge there is no detailed technical description or a correctness proof of the network available to the public. The machine supports fetch-and-add instructions; the requests are not combined in the network, but serviced sequentially at the memory modules. Because the Tera threads may become asynchronous due to race conditions in the network, the machine does not offer the sequential semantics of the SB-PRAM.

3.2
Benchmark Implementation
Due to some limitations of our gcc compiler port for the SB-PRAM, we have hand-coded the loop bodies in assembly language and manually unrolled the inner loops of the benchmark. For details we refer to [11]. We now present the minimum number of instructions necessary to implement the IS benchmark on the SB-PRAM. Hence, we hand-code the loop bodies in assembly language and neglect loop overhead. As on every other machine, the count array can be cleared with a single "store with auto-increment" instruction, if every VP is assigned a different but contiguous portion of the count array. The arrays count[] and pos[] are used one at a time and can be coalesced into a single cp[] array, which saves a few address computations. Because there are no data dependencies between different iterations of loop 2, all PROCS virtual processors are mapped round robin onto the keys and cp arrays and synchronously execute loops 2 and 3. Two successive
(a) elem1 = M(k_ptr1 += 2*PROCS);     elem2 = M(k_ptr2 += 2*PROCS);
    cp_ptr1 = elem1 + cp_base;        cp_ptr2 = elem2 + cp_base;
    syncadd(cp_ptr1, 1);              syncadd(cp_ptr2, 1);
(b) elem1 = M(cp_ptr1 += 2*PROCS);    elem2 = M(cp_ptr2 += 2*PROCS);
    elem1 = mpadd(ranksum, elem1);    elem2 = mpadd(ranksum, elem2);
    M(cp_ptr1) = elem1;               M(cp_ptr2) = elem2;

Fig. 1. Interleaved loop bodies: (a) Loop 2, and (b) Loop 3.
loop bodies are interleaved in order to fill all load delay slots (see assembly code in Figure 1.a). The first instruction combines the incrementing of a properly initialized, private key pointer k_ptr by 2·PROCS with the loading of the next key. In the second instruction of the loop body, the pointer cp_ptr is computed as cp[] indexed by key, and the syncadd operation (mpadd without result) increments this entry. Figure 1.b lists the assembly code for the third loop, which also employs the interleaving technique. In the first instruction, the pointer cp_ptr is adjusted and an entry of cp[] is read. Then, the accumulated rank is computed in the multi-prefix add on ranksum, and the result is written back to the cp array. This loop relies on synchronous execution, because the order of mpadd instructions among VPs is crucial.

Loop 2 contributes most instructions to the timed portion of the code, because it is executed N times whereas loops 1 and 3 are executed only Bmax = N/16 times. Thus, CPE(ranking) = 1/16 + 3 + 3/16 = 3.25 is optimal for the SB-PRAM. The benchmark implemented using assembly language macros and a 64-fold unrolled loop achieves 3.30 CPE. A similarly tuned C version is slower by roughly one cycle, because the compiler does not use "load with auto-add of 2*PROCS" but generates two separate instructions instead. We have run the benchmark on our instruction-level simulator as well as on the real 4 processor machine. The run times differ by less than 0.1%, a fact that validates our network simulations which predicted that memory stalls are extremely unlikely. The run time obtained by a 128 processor simulation is only 1% slower than 1/128th of the one processor run time, i.e. speedup is nearly linear. Because no caches are employed in the SB-PRAM, the run times of class B and C can be simply and accurately predicted; they are higher by a factor of 4, respectively 16. The High-Performance-PRAM HPP would gain a factor of 5.6 in absolute speed, but the CPE value would be worse, because the machine can issue memory requests at only a third of the instruction rate. For a fairer comparison, the CPU clock should be replaced by the memory request rate.

Tera MTA. The Tera MTA system has a sophisticated parallelizing compiler, which processes the sequential source code and automatically generates threads when needed. The assembler output of loop 2, which is listed in [2], shows a 2 instruction loop body that is 5-fold unrolled. The memory operations are "load next key" and "fetch & increment count". An integer addition plus the
loop overhead are hidden in the non-memory operations of the ten instructions of the loop body. Note that the CPU clock and the memory request rate are equal in the super-scalar Tera MTA design. Since the Tera MTA lacks synchronous execution, the third loop cannot be parallelized as on the SB-PRAM. Instead, every processor works on a contiguous block of the count array, computes the local prefix sums and outputs the global sum of its elements. In a second step, the prefix sum of the global sums is computed. Then, each processor adds the appropriate global prefix sum to its local elements. This procedure requires 3 instructions plus a small logarithmic term. For the class A benchmark it is less than 4 instructions. The Tera MTA requires less than 1/16 + 2 + 4/16 = 2.3125 CPE. The improvement compared to the SB-PRAM comes from the ability to execute one memory operation and two other (integer, jump, ...) operations per instruction.

We do not have access to a Tera MTA ourselves. The performance figures contained in this section are derived from [2,5] and personal communication with P. Briggs (Tera Computer Corporation) and L. Carter (SDSC). The numbers reported in [2] are 1.53 seconds and 5.25 < CPE < 5.3125 for NPB 1. Assuming a CPE value of 5.25, the 294 MHz machine has an overhead of at least (1.53 · 294·10^6)/(5.25 · 10 · 2^23) − 1 = 2.14%, i.e. hiding memory latency works well for the one processor prototype. The fourth inner loop of the benchmark, which is not present in NPB 2, accounts for 3 CPE. If we attribute all overhead to the three loops of IS (NPB 2) and conservatively assume CPE(NPB 2) = 2.3125, we end up with a run time of < 0.68 seconds. Carter [5], who ran the NPB 2 benchmark on a 145 MHz machine, achieved a run time of 2.05 seconds, which corresponds to an overhead of 50 percent. He believes that on the early machine flipped bits in the network packets induced retransmissions, because the memory overhead figures for other benchmarks dropped as the hardware became more stable.

(a) assume: V1 ← 0..63            (b)
    1: V2 ← key[V1 + k]               2.1: count[V2] ← V1
    2: V3 ← count[V2] + 1             2.2: V4 ← count[V2] − V1
    3: count[V2] ← V3                 2.3: if (V4 = 0) check

Fig. 2. Loop 2 vector code: (a) straight forward, (b) correction
4
Vector Computers
In [5], a Tera MTA node is compared to a Cray T90 vector processor node on the NPB 2–serial sample code of IS. Vector processors have instructions for computing prefix sums on vector registers. Thus, the third benchmark loop can be parallelized in the same way as on the Tera MTA. The difficult part is loop 2, where even a single vector processor encounters problems. Figure 2.a lists the straightforward, but wrong, implementation of loop 2. Successive, contiguous stripes of key[] are read into vector register V2.
Table 2. IS (class A) benchmark results

  Machine              CPU clock   minimum                   maximum
  Name                 [MHz]       #procs  time    CPE       #procs  time  CPE
  SB-PRAM              7           1       39.55   3.30      4       9.89  3.30
  SB-PRAM simulation   -           -       -       -         128     0.31  3.33
  Tera MTA             294         1       0.68    2.32
  Cray T90             440         1       1.11    5.82
  SB-PRAM (MPI)        7           2       160.8   26.85
  IBM SP2 WN           66          2       29.1    27.38     128     0.6   60.42
  SGI Origin 2000      195         2       20.5    95.30     32      1.6   119.02
  Cray T3E-900         450         2       12.7    136.26    256     0.4   258.15
Then, V3 is filled by reading the corresponding count[] entries and incrementing them. Afterwards V3 is written back. If a specific key value v is contained twice in V2, at positions k1 and k2, count[v] is incremented only once, because V3[k2], which finally ends up in count[v], contains count_old[v]+1 and not the incremented V3[k1] value. The Cray compiler can cure the situation using a patented algorithm [6] invented by Booth.¹ Before writing V3 back in line 3, the code listed in Figure 2.b tests whether duplicates occur, and, if necessary, processes them in the scalar unit. The run time of IS (NPB 2), as reported in [5], is 1.11 s, which corresponds to 5.82 CPE. The NPB 2.3-serial sample code contains an unnecessary store that possibly has not been detected by the compiler. Hence, the optimal CPE for a single vector computer could be lower by 1 CPE. The above code for loop 2 can be easily parallelized by maintaining a private copy count_i[] of the count array on each node computer. When p denotes the number of participating processors, the third loop becomes a parallel prefix computation pos[i] = Σ_{l=0}^{p−1} Σ_{k=0}^{i−1} count_l[k]. Memory consumption as well as the execution time of loop 3 scale with p. If p becomes very large, a two pass radix sort, like the one described by Zagha [20], can improve performance.

¹ Personal communication with Larry Carter and Max Dechantsreiter.
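The update-loss problem of Figure 2.a can be reproduced in scalar C by emulating one vector chunk; this is our own illustration (VL and the names are invented), not the Cray code:

/* Scalar emulation of one gather/add/scatter chunk of Fig. 2(a).         */
#define VL 64

void chunk_naive(const int *key, int *count, int k)
{
    int V2[VL], V3[VL];
    for (int e = 0; e < VL; e++) V2[e] = key[k + e];        /* gather keys */
    for (int e = 0; e < VL; e++) V3[e] = count[V2[e]] + 1;  /* old + 1     */
    for (int e = 0; e < VL; e++) count[V2[e]] = V3[e];      /* scatter     */
    /* if a key value occurs m > 1 times in the chunk, count[] grows by 1
       instead of by m, which is exactly the error the correction code of
       Fig. 2(b) detects and repairs in the scalar unit                    */
}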
5
Comparison with MPI–Based Sample Code
Table 2 lists the benchmark results discussed so far and compares them with run times of MPI–based sample code. We implemented the necessary MPI library functions on the SB–PRAM and optimized the sample code to the extent that an optimizing compiler could also achieve [11]. The times for the other machines have been taken from benchmark results published by NAS. The IBM SP2 is a distributed memory machine (DMM) with an extremely fast interconnection network, the SGI Origin 2000 a cache–coherent NUMA, and the Cray T3E a UMA shared memory machine (SMM) without cache coherence. 1
Personal Communication with Larry Carter and Max Dechantsreiter.
Sample code implementation. The keys are distributed on all participating p computers before the timing starts. In a first local step, each computer sorts its keys using only the b most significant bits into 2b buckets. Because the keys are put into a sorted order, this step alone requires more instructions than IS(NPB 1). Then, a mapping of buckets to processors, which balances the load, is computed. Afterwards, all local bucket parts are sent to their destinations in a single, global, all–to–all communication step. Now, every processor owns a distinct, contiguous key range and can count its keys locally. The pos[] entries can be computed from count[] like in the Tera MTA implementation. The sample code algorithm is guided by the idea of doing one, central all–to– all communication. The price for this procedure is to rearrange the keys locally, such that keys that are sent to the same destination are stored contiguously in memory. A plus of the implementation is the good cache efficiency: the key arrays are accessed sequentially, so that the initial cache miss time is amortized over the complete cache line, and the count arrays fit into second level cache. With less than p = 16 processors another strategy with fewer communication overhead would be possible: every processor could count its keys in a local count array and pos[] could be computed like on vector processors after transposing N data elements the p × Bmax count matrix. If p < 16, then only p · Bmax < 16 · 16 had to be sent over the network. However, the sample code is written for a large number of processors and slow communication networks. The CPE of the SB–PRAM is 26.85, of which 9 cycles belong to memory instructions operating on key elements. The SB–PRAM issues one instruction per cycle and encounters no memory waits. The other machines have super– scalar processors, but suffer from memory waits. The impact of wait states on the CPE is the higher, the higher the clock rate of the specific machine is. Table 2 highlights this fact, which limits the benefit of cache–based architectures from future technological improvements compared to vector computers or multi– threaded machines.
6
Other Algorithms
The MPI–based sample code is tailored for distributed memory machines. We now investigate, if there are better algorithms for CC–NUMA machines or the Cray T3E. CC–NUMA. Cache–coherent non–uniform–memory–access SMMs [13], like the SGI Origin, have overcome the argument “SMMs do not scale” by other means than the SB–PRAM or the Tera MTA. While multi–threading relies on having enough parallelism available to keep the processors busy, CC–NUMA machines try to minimize the average memory latency by caching, at the risk of running idle on cache misses. We now investigate, if the direct approach of locking the count[] elements for updating could improve performance so much, that the results of multi– threaded SMMs could be reached. Therefore, we calculate the average memory
NAS Integer Sort on Multi–threaded Shared Memory Machines
1007
latency introduced by accessing count[keys[i]] in loop 2. We assume that on a p–processor machine 1p of the count[] elements are cached on every processor and the miss penalty is 20 cycles (i.e. 100 ns on a 200 MHz computer). Thus, the average memory latency is t=
1 p−1 ·1 + · 20 p p cache hit cache miss
.
13 . In other On machines with more than p = 16 processors, t is greater than 18 16 words, the exchange of count[] entries between the coherent caches is rather expensive. If the remaining cycles for the program are taken into account, the CPE is at least an order of magnitude higher than on multi–threaded SMMs with fetch–and–add.
Cray T3E. A Cray T3E node computer consists of a DEC Alpha processor with up to 2 GByte local RAM. Additionally, the system logic provides a notion of a logically addressable shared memory through the E–registers [16]. Up to 512 E–registers can be accessed by the user to trigger shared memory accesses. In addition to load and store, E–registers also support fetch & add and an especially fast fetch & increment operation. Although the machine does not support multi–threading in hardware, memory wait conditions can be avoided by using multiple E–registers to simulate multi–threading in software. With this technique, alternate key load and count increment operations can be performed at the maximal network clock of 75 MHz. The third loop of IS can also be parallelized in the same way as on the Tera MTA. Thus, less than 3 network cycles per key element are possible.
7
Conclusion
We have investigated the performance of the IS benchmark on the SB–PRAM and the Tera MTA, two scalable SMMs which employ multi–threading to hide memory latency instead of caching like most of todays computers. Besides multi– threading, both machines take advantage of a fetch–and–add instruction to update shared data structures conflict–free. These two properties lead to a performance that is an order of magnitude higher compared to published results of other scalable parallel computers, both DMMs and SMMs. In particular, the experimental 7 MHz SB–PRAM prototype can compete with modern parallel computers on this benchmark. The Tera MTA makes use of expensive top technology, and achieves better performance figures than non– scalable top vector processors. Although CC–NUMA SMMs could possibly improve performance compared to DMMs, they are still an order of magnitude slower than multi–threaded machines. Alone the Cray T3E, which supports and atomic fetch & increment operation could challenge the performance of the Tera MTA by emulating multi–threading in software.
1008
Thomas Gr¨ un and Mark A. Hillebrand
Acknowledgments The authors would like to thank Larry Carter, Max Dechantsreiter, and Preston Briggs for helpful discussions and all people who contributed to the SB–PRAM project [15].
References 1. F. Abolhassan, R. Drefenstedt, J. Keller, W. J. Paul, and D. Scheerer. On the Physical Design of PRAMs. Computer Journal, 36(8):756–762, December 1993. 999, 1001 2. R. Alverson, P. Briggs, S. Coatney, S. Kahan, and R. Korry. Tera Hardware– Software Cooperation. In Proc. of Supercomputing ’97. San Jose, CA, November 1997. 999, 1002, 1003, 1004 3. R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The TERA Computer System. In Proc. of Intl. Conf. on Supercomputing, June 1990. 999, 1002 4. P. Bach, M. Braun, A. Formella, J. Friedrich, T. Gr¨ un, and C. Lichtenau. Building the 4 Processor SB–PRAM Prototype. In Proc. of the 30th Hawaii International Conference on System Sciences, pages 14–23, January 1997. 1002 5. J. Boisseau, L. Carter, K.S. Gatlin, A. Majumdar, and S. Snavely. NAS Benchmarks on the Tera MTA. In Proc. of Workshop on Multi-Threaded Execution, Architecture and Compilation (M-TEAC 98), Las Vegas, February 1998. 1004, 1005 6. M. Booth. US Patent 5247696. see http://www.patents.ibm.com, September 1993. 1005 7. C. Engelmann and J. Keller. Simulation–based comparison of hash functions for emulated shared memory. In Proc. PARLE (Parallel Architectures and Languages Europe), pages 1–11, 1993. 1001 8. D. Bailey et al. The NAS Parallel Benchmarks. RNR Technical Report RNR-94007, NASA Ames Research Center, March 1994. see also http://science.nas.nasa.gov/Software/NPB. 999, 1000 9. A. Formella, T. Gr¨ un, and C.W. Kessler. The SB–PRAM: Concept, Design and Construction. In Draft Proceedings of 3rd International Working Conference on Massively Parallel Programming Models (MPPM-97), November 1997. see also http:// www-wjp.cs.uni-sb.de/~formella/ mppm.ps.gz. 999, 1001 10. A. Formella, J. Keller, and T. Walle. HPP: A High–Performance–PRAM. In Proceedings of the 2nd Europar, volume II of LNCS 1124, pages 425–434. Springer, August 1996. 1002 11. T. Gr¨ un and M. A. Hillebrand. NAS Integersort on the SB–PRAM. Manuscript, available via http://www-wjp/~tgr/NASIS, May 1998. 1002, 1005 12. T. Gr¨ un, T. Rauber, and J. R¨ ohrig. Support for Efficient Programming on the SB–PRAM. International Journal of Parallel Programming, 26(3):209–240, June 1998. 999, 1001 13. D.E. Lenoski and W.-D. Weber. Scalable Shared–Memory Multiprocessing. Morgan Kaufmann Publishers, 1995. 1006 14. NAS Parallel Benchmarks Home Page. http:// science.nas.nasa.gov/Software/NPB/. 999 15. SB–PRAM Home Page. http:// www-wjp.cs.uni-sb.de/sbpram. 999, 1008
16. S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proc. of the VII ASPLOS (Architectural Support for Programming Languages and Operating Systems) Conference, pages 26–36. ACM, October 1996.
17. Tera Computer Corporation Home Page. http://www.tera.com/.
18. J. van Leeuwen, editor. Handbook of Theoretical Computer Science, volume A, pages 869–941. Elsevier, 1990.
19. T. Walle. Das Netzwerk der SB-PRAM. PhD thesis, University of the Saarland, 1997. In German.
20. M. Zagha and G.E. Belloch. Radix Sort for Vector Multiprocessors. In Proc. of Supercomputing '91, pages 712–721, New York, NY, November 1991.
Analysing a Multistreamed Superscalar Speculative Instruction Fetch Mechanism

Rafael R. dos Santos¹ and Philippe O. A. Navaux²

¹ UNISC - University of Santa Cruz do Sul, Santa Cruz - RS, Brazil, [email protected]
² Informatics Institute, CPGCC/Federal University of Rio Grande do Sul, P.O. Box 15064, 91501-970 Porto Alegre - RS, Brazil, [email protected]
Abstract. This work presents a new model for multistream speculative instruction fetch in superscalar architectures. The performance evaluation of a superscalar architecture with this feature is presented in order to validate the model and to compare its performance with a real superscalar architecture. This model intends to eliminate the instruction fetch latency introduced by branch instructions in superscalar pipelines. Finally, some considerations about the model are presented, as well as suggestions and remarks for future work.
1
Introduction
Even using accurate branch prediction mechanisms, current superscalar architectures offer less performance than an ideal architecture. When a branch is encountered, the branch predictor can predict it as taken or not taken. In the first case, contiguous instructions are already in the fetch stage when the prediction is made, and these instructions are not part of the predicted path. In the second case, nothing occurs if the prediction is correct. But for both cases, we are assuming that the branch prediction mechanism is efficient. The problem usually is that branches are predicted as taken, and each time this occurs a flow interruption also occurs. The time required to refill the fetch buffer and put the correct instructions into the instruction queue is greater than the time necessary to execute these instructions. If there are many functional units and flow interruptions occur frequently, it should be expected that the instruction queue will be empty for many cycles. This means that the instruction fetch mechanism must be designed to keep the instruction queue filled with instructions that can be scheduled for execution on free functional units. A superscalar architecture cannot execute instructions faster than it can fetch instructions from the instruction cache. So, designing an efficient fetch mechanism is more important than increasing the number of resources, because it is not
Analysing a Multistreamed Superscalar Speculative Instruction
1011
possible to increase the performance only by increasing the number of functional units. The question is how many instructions could be fetched per cycle? Where are these instructions comming from? An alternative approach to branch prediction is instead of predict whether the branch is taken or not taken, is execute speculatively both the taken and not-taken paths and cancel the execution of the incorrect path as soon as the branch result is known [3].
2
Fetching Instructions from Multiple Streams
Talcott [9] reffers a technique called Fetch Taken and not-Taken Paths to reduce the branch problem. In [1] was presented a new model based on this scheme. In this model, both paths of a conditional branch are fetched and put into the fetch buffer. When the branch prediction stage transfers instructions, from the fetch buffer to the instruction queue, and reaches a branch into the fetch buffer, both possible paths of this branch and its instructions are already into the fetch buffer. In this case, no delay is introduced when the branch is predicted as taken. The transfer is redirected to the predicted path whithout delay. To enable this operation, the fetch stage must detect the branch instruction just when it is fetched. In the next cycle, it must also start to fetch from both paths. When the branch is predicted, the instructions transfer to instruction queue is not interrupted. In the multistreamed superscalar architecture proposed in [6], the fetch stage was modified to enable to fetch both paths of a branch instruction. PC
St
PC
St
PC
Stream
C-List Stream 0
PC Instr
FBuffer 0 PC Stream
Stream 1
PC Instr
FBuffer 1
PC PC Instr
C-List
St
PC Stream
C-List
Stream n-1
FBuffer n-1
Fig. 1. Fetch Buffer Structure (e.g. fetch depth equal to 4 streams) The figure 1 suggests a new buffer structure in the multistream pipeline. The number of stream buffers defines the fetch depth. Each stream has four independent elements: Program Counter, Status Bit, Children List and Fetch Buffer. In the Fetch Buffer are stored the fetched instructions. The PC is used to point the instructions that are in the stream. The status bit indicates whether the stream structure is busy and the Children-List stores the identification of the children streams. The Children-List also stores the branch address and the identifier of its children streams allowing the predict stage to start the transfer of instructions of the new stream when a branch is predicted, without any delay.
1012
2.1
Rafael R. dos Santos and Philippe O. A. Navaux
Multistream Architecture Operation
The Fetch stage fetches instructions to put them into the fetch buffer. When conditional branch instruction is detected, a new stream is generated and initialized, as if there were available resources. The stream generation consist of Children-List updating operation with the branch address and the identification of a new stream structure that will store the instructions related to the new stream. For each branch detected there are two possible paths. The instructions that are in the not taken path are fetched and stored into the same stream structure where is stored the conditional branch. But, the taken path and their instructions will be stored in this new stream structure. The stream structure initialization consists of the PC initialization with the target address and the setting of the status bit, to indicate that the structure was allocated. The predict stage transfers instructions, from the fetch buffer to the instruction queue (like conventional superscalar architectures) looking for branch instructions. However, when it finds a branch instruction, it also makes a prediction. When the prediction is to be taken, this stage just concatenates the instructions which are in the children stream of this branch, discarding the closest instructions. When the prediction is not taken, the children stream is discarded and the neightbouring instructions continue to be transfered to the instruction queue. When a stream is discarded, all children streams originated by this stream are also discarded, through a recursive operation.
3
Experimental Framework
For this work, we used 4 benchmarks from the SPECint95 suite (compress, go, ijpeg, li). Also, we used a execution-driven simulator to generate traces. This simulators acomplish with the SPARCV7 instruction set and simulates the execution in a scalar pipelined fashion. This is the reference machine. For the superscalar simulations, we used 2 trace-driven simulators which executed the choosed benchmarks based on the traces generated by the first simulator. These 2 simulators are respectively called Real and Mulflux as we will describe below: – Real Simulator: Simulates the execution of programs using two-level branch prediction [4,5]. – Mulflux Simulator: Simulates the execution of programs using the proposed speculative instruction fetch mechanism [6].
4
Performance Analysis
In this section, we analize the performance of the multistreamed model based on the data extracted through several simulations. The experiments simulate
Analysing a Multistreamed Superscalar Speculative Instruction
1013
different configurations of the real machine and the multistreamed machine. The tests started with a fetchwidth equal to 2 up to 8 instructions per cycle. We can point out that in all situations the cycles with no dispatch are less than for the real machine. The no dispatch decreases with the increase of fetchwidth in the real machine. The percentage decreases from 40.50%, for fetchwidth equal to 2, to 39.30% in configurations with a fetchwidth equal to 8 instructions, as showed in the figure 2. In the Mulflux machine, this percentage increases join to the increase of fetchwidth. This occur because the mispredictions and resources conflict increase too. Other studies have been developed to research more accurated branch predictors and the ideal balancing of the architectures resources to support a new and more powerfull parallelism through the pipeline. The sum of the three components, which causes no dispatch in both machines, correspond to in the percentage of cycles with no dispatch.
Fig. 2. No Dispatch in Mulflux with 2 Streams and Real Architecture The divergency between the components is important in the Real machine. The occurency of an empty queue is the critical component that contributed to the existency of no dispatch cycles. This is not the case for the Mulflux machine. It is predictable that when there are more instructions ready to dispatch, the number of functional units becomes the main problem. This is true in the Mulflux architecture. The occurency of empty queue in the Real machine decrease from 30.05%, with fetchwidth equal to 2 to 21.79%, with fetchwith equal to 8 instructions per cycle. In the Mulflux, the results obtained are 7.73% and 7.80% for the same configurations perhaps applying multiple streams.
Fig. 3. Components that Causes Emptying Queue in Mulflux with 2 Streams and Real Architecture
1014
Rafael R. dos Santos and Philippe O. A. Navaux
The figure 3 presents the efficiency of the multistream model to reduce the empty queue occurrency. In the Mulflux, the stream interruptions are up to 5.24%, but in the Real architecure this percentage reach 27.07%. The multistreamed model enables a reduction around 74.24% of empty queue occurrency. The effect of stream interruptions were reduced by 80.64% in the Mulflux machine. Another important aspect is that resource conflict and speculation depth must be considered with more attention when the multistream model is used. 4.1
Impact Analysis of Multistream Implementation
Before the implementation of the multistream model, we must consider the impact in the close blocks, and also the fetchwidth, resource conflicts and speculation depth. In the previous results we did not consider the instruction cache misses. The use of the multistream model could generate more cache misses and the fetch latency can be important. So, the potential of the mechanism can be reduced. Then, we performed some experiments using a instruction cache with the same delay cycles as those of the Intel Pentium to observe the performance under real conditions. In this cache, the miss latency is equal to 3 cycles for the L1 cache, and 13 cycles when the miss causes an access to the L2 cache. Also, for the last results we consider that the fetchwidth was multiplied by the number of valid stream structures. If the fetchwidth is equal to 8 instructions and the number of valid streams (initialized structures) is equal to 4 then the total and the real fetchwidth is equal to 32 instructions per cycle. This was made in the last experiments. To avoid problems with the cache size and its configuration we have proposed a new strategy called Dynamic Split Fetch - (DSF) which consists in spliting the total fetchwidth between the valid stream structures [6]. In this case, if there are 4 valid streams and the fetchwidth is equal to 8 instructions per cycle, will be fetched 2 instructions for each valid stream and the fetchwidth is kept up to 8 instructions. 4.2
Overall Performance
In this section we discusse the overall performance delivered by the multistreamed mechanism and compare it with real architectures performance. Thus, we could observe the aspects discussed in the last section. Increasing the number of functional units delivers more speed up for the multistreamed architecture as show in the figure 4. However, we can observe that no dispatch cycles increases even when the number of functional units increases. This is due to the speculation depth that was kept for all configuration with a single branch unit. The machine considered has n generic functional units and only one branch unit that could stall the dispatch of instructions when saturated. This is showed in the graphic 5.
Analysing a Multistreamed Superscalar Speculative Instruction
1015
Fig. 4. Multistream Speed up when the Number of Functional Units Increase
Fig. 5. Components of No Dispatch when Functional Units Increases
We made several simulations variating the number of functional units for the multistreamed architecture. We wanted to show that the resource conflict is the main factor in the limitation of the performance in this machine like in an ideal machine. We performed experiments with 5 machines consisting of: a Multistream with perfect icache, a Multistream with normal icache, a Multistream with normal icache and DSF (Dynamic Split Fecth), a Real machine with perfect icache and a Real machine with normal icache. The results that will be presented comes from the same configurations for each machine [6]. We used 2 streams structures to the multistream machines. All machines have a fetchwidth equal to 8 instructions per cycle, and a fetch buffer and instruction queue with 16 and 32 entries respectively; a dispatchwidth equal to 8 instructions; 8 functional units, each one with 8 reservation stations; 1 branch unit with 8 reservation stations; 8 bus results and reorder buffer with 64 entries. The figure 6 shows the no dispatch cycles expended for each machine. The best case is the multistream machine with perfect icache and using a total fetchwidth that consists in multiplying the fetchwidth by the number of valid stream structures. We could observe that the Mulflux with normal icache has a percentage of no dispatch cycles similar to the Mulflux with normal icache using DSF. The division of the fetchwidth by each valid stream do not harm the performance of the considered cases. The Real machines expend more cycles with no dispatch. The figure 7 shows the components that cause no dispatch cycles in each machine. In the mulstistream machines, the worst component is the resource conflict (around 45.00% of the no dispatch cycles). In the Real machines the emptying queue result around 60% of the occurency of no dispatch. In the second
1016
Rafael R. dos Santos and Philippe O. A. Navaux
figure 7, we could notice the reduction of the stream interruptions in multistream machines.
Fig. 6. Percentage of No Dispatch
Fig. 7. No Dispatch Components and Emptying Queue Components
5
Conclusions and Future Works
The superscalar speed up is directly proportional to its IPC (Instructions per Cycle). It is desirable to obtain a speed up proportional to the number of functional units present in the architecture. However, the constant flow interruptions flush the instruction queue and decrease the number of ready instructions which could be dispatched. Thus, the IPC is reduced drastically because of the instruction queue flushing and a desirable speed up could not be achieved. The multistreamed model allow a reduction around 74.24% of empty queue occurrency. The effect of stream breaks was reduced by 80.64% in the Mulflux machine. Another important aspect is that resource conflict (structural hazards) and speculation depth must be considered carefully when the multistream model is used. In our experiments, the instruction cache performed similar performance in both cases: multistream and real architectures. The use of Dynamic Split Fetch brings a worthwhile strategy that allows to keep the number of icache buses. Even if we reduce the occurency of empty instruction queue, we have verified that the decrease of no dispatch cycles do not decrease as we wanted. In the Mulflux machine the no dispatch cycles is between 28.99% and 33.11%, while in the Real machine it is between 39.30% and 40.50%. The increase of instructions flow in the instruction queue generate a major resource conflict like in the ideal architecture. Increasing the number of functional units but keeping the speculation depth did not allow good results in our experiments. Because of this, we are looking for resource balancing and ideal speculation depth in multistreamed
Analysing a Multistreamed Superscalar Speculative Instruction
1017
architectures. The saturation of branch unit could come late the instructions execution. Remarks to the next step could be pointed here. We are looking to introduce new instruction cache mechanisms in our simulations. Such mechanisms have been proposed and we believe that they will be used in next generations of superscalar microprocessors. Also, after to get a good configuration of our architecture we plan to compare it with other alternatives like trace processors [8], simultaneous multithreading [2], multiscalar [7] and other alternatives which certainly will be suggested. Such comparison is very important to get an idea about the potential of such architecture and its complexity of implementation. The fourth generation of microprocessors can not be predicted at the moment but many new schemes have been proposed. The question is how to increase the performance of current microprocessors and how to obtain more machine parallelism using efficiently the increasingly chip density ?
References
1. Chaves Filho, Eliseu M. et al. A Superscalar Architecture with Multiple Instruction Streams. In: SBAC-PAD, VIII, Recife, August 1996. Proceedings. SBC/UFPE, 1996, pp. 67-77 (in Portuguese). 1011
2. Eggers, Susan J. et al. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, v.17, n.5, Sep/Oct 1997. 1017
3. Fromm, Richard. Branching on Superscalar Machines: Speculative Execution of Multiple Branch Paths. Project Final Report, December 11, 1995. CS 252: Graduate Computer Architecture. 1011
4. Yeh, Tse-Yu; Patt, Yale N. Two-Level Adaptive Training Branch Prediction. In: ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 24., 1991. Proceedings... New York: ACM, 1991. p. 51-61. 1012
5. Yeh, Tse-Yu; Patt, Yale N. A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. In: ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 20., 1993. Proceedings... New York: ACM, 1993. p. 257-266. 1012
6. Santos, Rafael R. dos. A Mechanism for Multistreamed Speculative Instruction Fetch. Porto Alegre: CPGCC/UFRGS, 1997 (M.Sc. Thesis - in Portuguese). 1011, 1012, 1014, 1015
7. Sohi, G. S.; Breach, S. E.; Vijaykumar, T. N. Multiscalar Processors. Computer Architecture News, New York, v.23, n.2, p. 414-425, 1995. 1017
8. Smith, James E.; Vajapeyam, Sriram. Trace Processors: Moving to Fourth-Generation. Computer, Los Alamitos, v.30, n.9, p. 68-74, Sep. 1997. 1017
9. Talcott, Adam R. Reducing the Impact of the Branch Problem in Superpipelined and Superscalar Processors. Santa Barbara: University of California, 1995 (Ph.D. Thesis). 1011
Design of Processor Arrays for Real-time Applications

Dirk Fimmel and Renate Merker

Department of Electrical Engineering, Dresden University of Technology
{fimmel,merker}@iee1.et.tu-dresden.de
Abstract. This paper covers the design of processor arrays for algorithms with uniform dependencies. The design constraint is a limited latency of the resulting processor array. As objective of the design the minimization of the costs for an implementation of the processor array in silicon is considered. Our approach starts with the determination of a set of proper linear allocation functions with respect to the number of processors. It follows the computation of a uniform affine scheduling function. Thereby, a module selection and the size of partitions of a following partitioning is determined. A proposed linearization of the arising optimization problems permits the application of integer linear programming.
1 Introduction
Processor arrays represent an appropriate kind of co-processors for time consuming algorithms. Especially in heterogeneous systems a dedicated hardware for parts of algorithms, e.g. for the motion estimation of MPEG, is required. An admissible latency Lg for the selected part of the algorithm can be derived from the time constraint of the entire algorithm. The objective of the design is a processor array with minimal hardware costs that matches the admissible latency Lg . In this paper, we consider algorithms with uniform dependencies. First, a set of linear allocation functions that lead to a small number of processors is computed. Then, for each allocation function a uniform affine scheduling function is determined. We assume that a LSGP (locally sequential and globally parallel) -partitioning can be applied to the processor array. The size of the partitions is derived with respect to the constraint that the latency of the resulting processor array is less than Lg . Furthermore, the processor functionality, i.e. kind and number of modules that have to be implemented in each processor, is determined. The objective of the design is a minimization of hardware costs. We measure hardware costs by the chip area needed to implement modules and registers in silicon.
The research was supported by the ”Deutsche Forschungsgemeinschaft”, in the project A1/SFB358.
Related works handling the allocation function cover (1) a limited enumeration process to determine a linear allocation function leading to a minimal number of processors of the processor array [15], (2) the determination of a set of linear allocation functions that match a given interconnection network of the processor array [16], (3) the inclusion of a limited radius of the interconnections into the determination of the allocation function [14] and (4) the minimization of the chip area of the processor array by consideration of the processor functionality [3]. An approach to compute a variety of linear allocation and scheduling functions is proposed in [11]. Some notes to the determination of unconstrained minimal scheduling functions for algorithms with uniform dependencies can be found in [1]. Resource constraint scheduling for a given processor functionality is presented in [13]. An approach to minimize the throughput by consideration of the chip area is proposed in [9]. In [2] the approach [13] is extended to determine additionally the processor functionality in order to minimize a chip area latency product. The paper is organized as follows. Basics of the design of processor arrays are given in section 2. In section 3 hardware constraints considered in this paper are introduced. A linear program to determine a set of linear allocation functions is presented in section 4. Section 5 covers the determination of scheduling functions. A linear programming approach is presented in detail. Finally, a short conclusion is given in section 6.
2 Design of Processor Arrays
In this paper we restrict our attention to the class of algorithms that can be described as systems of uniform recurrence equations (SURE) [5].

Definition 1 (System of uniform recurrence equations). A system of uniform recurrence equations is a set of equations S_i of the following form:

    S_i :  y_i[i] = F_i( ... , y_j[f_ij^k(i)], ... ),   i ∈ I,  1 ≤ i, j ≤ m,  1 ≤ k ≤ m_ij,     (1)

where i ∈ Z^n is an index vector, f_ij^k(i) = i − d_ij^k are index functions, the constant vectors d_ij^k ∈ Z^n are called dependence vectors, y_i are indexed variables and F_i are arbitrary single-valued operations. All equations are defined in the index space I, being a polytope I = {i | H_i i ≥ h_i0}, H_i ∈ Q^{m_i × n}, h_i0 ∈ Q^{m_i}.

We suppose that the SURE has a single assignment form (every instance of a variable y_i is defined only once in the algorithm) and that there exists a partial order of the instances of the equations that satisfies the data dependencies. Next, we introduce a graph representation of the data dependencies of the SURE.

Definition 2 (Reduced dependence graph (RDG)). The equations of the SURE build the m nodes v_i ∈ V of the reduced dependence graph (V, E). The directed edges e = (v_i, v_j) ∈ E are the data dependencies, weighted by the dependence vectors d(e) = d_ij^k. Source and sink of an edge e ∈ E are called σ(e) and δ(e) respectively.
The main task of the design of processor arrays is the determination of the time and the processor when and where each instance of the equations of the SURE has to be evaluated. In order to keep the regularity of the algorithm in the resulting processor array, we apply only uniform affine mappings [8] to the SURE.

Definition 3 (Uniform affine scheduling). A uniform affine scheduling function τ_i(i) assigns an evaluation time to each instance of the equations:

    τ_i : Z^n → Z :  τ_i(i) = τ^T i + t_i,   1 ≤ i ≤ m,     (2)

where τ ∈ Z^n, t_i ∈ Z.

Definition 4 (Linear processor allocation). A linear allocation function π(i) assigns an evaluation processor to each instance of the equations:

    π : Z^n → Z^{n−1} :  π(i) = S i,     (3)

where S ∈ Z^{(n−1)×n} is of full row rank.

Since S is of full row rank, the vector u ∈ Z^n which is coprime and satisfies S u = 0 and u ≠ 0 is uniquely defined and called projection vector. Because of the lack of space we refer to [2,3] for a treatment of uniform affine allocation functions π_i : Z^n → Z^{n−1} : π_i(i) = S i + p_i, 1 ≤ i ≤ m. The importance of the projection vector is due to the fact that those and only those index points of an index space lying on a line spanned by the projection vector u are mapped onto the same processor. Due to the regularity of the index space and the uniform affine scheduling function, the processor executes the operations associated with these index points one after each other if τ^T u ≠ 0, with a constant time distance λ = |τ^T u| which is called iteration interval. The application of a scheduling and an allocation function to a SURE results in a so-called fullsize array.
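To illustrate how the two mappings act on index points, here is a small self-contained sketch; the concrete values of τ, S and the index points are our own example choices, not taken from the paper.

import numpy as np

# Illustrative sketch (ours): apply a uniform affine schedule tau(i) = tau^T i + t
# and a linear allocation pi(i) = S i to a 2-D index space, and check that points
# along the projection vector u map to the same processor and are executed
# lambda = |tau^T u| steps apart.
tau = np.array([1, 1])        # example scheduling vector (assumed)
S = np.array([[1, 0]])        # allocation matrix of full row rank, so u = (0, 1)^T
u = np.array([0, 1])

assert (S @ u == 0).all() and np.any(u != 0)
lam = abs(tau @ u)            # iteration interval

i0 = np.array([3, 2])
i1 = i0 + u                   # next index point on the line spanned by u
assert np.array_equal(S @ i0, S @ i1)          # same processor
assert abs((tau @ i1) - (tau @ i0)) == lam     # executed lam steps apart
print("processor:", S @ i0, "iteration interval:", lam)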
3 Hardware Description
We consider a given set M of modules which are responsible to evaluate the operations of a processor. Instead of assuming given processors we want to determine modules which realize the operations of the processors. First, we introduce some measures needed to describe the modules. To each module ml ∈ M we assign an evaluation time dl ∈ Z in clock cycles needed to execute the operation of module ml , a necessary chip area cl ∈ Z needed to implement the module in silicon and the number nl ∈ Z of instances of that module which are implemented in one processor. If a module ml ∈ M has a pipeline architecture we assign a time offset ol to that module which determines the time delay after that the next computation can be started on this module, otherwise ol = dl . Some modules are able to compute different operations, i.e. a multiplication unit is likewise able to compute an addition. To such modules different delays dli and offsets oli depending on the operations Fi are assigned.
The assignment of a module ml ∈ M to an operation Fi is denoted as m(i), and the set of modules which are able to perform the operation Fi is Mi . The addressing of the instance of the module m(i) which performs the operation Fi is given by ui ∈ Z.
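The module model above can be captured in a few lines; the following sketch is our own illustration of it (field names are ours), pre-filled with the module set that appears later in Example 2 (Table 1).

from dataclasses import dataclass

# Minimal sketch (ours) of the module model: d is the evaluation time d_l in
# clock cycles, o the pipeline time offset o_l (o_l = d_l if not pipelined),
# c the chip area c_l, and n the number of instances n_l per processor.
@dataclass
class Module:
    name: str
    d: int
    o: int
    c: int
    n: int = 1

modules = [
    Module("m1", d=3, o=3, c=297),
    Module("m2", d=4, o=4, c=169),
    Module("m3", d=8, o=8, c=44),
]
REGISTER_AREA = 1   # chip area c_r of one register (module "mr" in Table 1)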
4 Determination of Allocation Functions
The allocation function maps each index vector i ∈ I to a processor p of the processor space P = {p | p = S i ∧ i ∈ I}. Our aim is the determination of a set of proper linear allocation functions that lead to processor spaces with a small number of processors. In our approach we approximate the number of processors of the processor space P by the number of processors of the enclosing constant bounded polytope (cb-polytope) Q = {p | p_min ≤ p ≤ p_max} of P, where P ⊆ Q, and each face of Q intersects P in at least one point. The consideration of the cb-polytope Q allows the formulation of the search for allocation functions as a linear optimization problem.

Program 1 (Determination of allocation functions)

for j = 1 to n − 1
    minimize v_j^1 − v_j^2 + 1, subject to
        v_j^2 ≤ s_j^T w_l ≤ v_j^1,          v_j^1, v_j^2 ∈ Q, s_j ∈ Z^n, 1 ≤ l ≤ |W|,
        s_j^T b_k + (1 − r_k) R ≥ 1,        r_k ∈ {0, 1}, 1 ≤ k ≤ n − j + 1,          (4.1)
        Σ_{k=1}^{n−j+1} r_k = 1,
end for                                                                                (4)

where W is the set of vertices w_l of I, R is a sufficiently large constant and the vectors b_k, 1 ≤ k ≤ n − j + 1, span the right null space of the matrix (s_1, · · · , s_{j−1})^T. Constraint (4.1) is replaced by s_1 ≠ 0 for j = 1. Constraint (4.1) ensures that the vectors s_j, 1 ≤ j ≤ n − 1, are linearly independent, i.e. that rank(S) = n − 1. Constant R has to fulfill R ≥ max_{1≤k≤n−j+1} {|s_j^T b_k|}. The number of processors of the cb-polytope Q is Π_{j=1}^{n−1} (v_j^1 − v_j^2 + 1).
A motivation of our approach is given in the following theorem.

Theorem 1 (Number of processors of an enclosing cb-polytope). If N (N′) is the number of processors of the enclosing cb-polytope of the processor space resulting after application of the allocation function defined in Program 1 (of another arbitrary linear allocation function), then N ≤ N′.

Since we are interested in several allocation functions, we replace constraint (4.1) in Program 1 for j = 1 by s_1^T u_l ≠ 0, 1 ≤ l < i, in order to determine the i-th allocation function, where the u_l are the projection vectors of the previously determined allocation functions.
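To make the quantity minimized by Program 1 concrete, the following brute-force sketch (our own, not the ILP itself) evaluates the side length of the enclosing cb-polytope for candidate vectors s over the vertex set used in Example 1 below.

import itertools
import numpy as np

# Brute-force sketch (ours): for a 2-D index space, search small integer row
# vectors s and evaluate the cb-polytope side length v^1 - v^2 + 1, i.e. the
# spread of s^T w over the vertices w of the index space I.
W = np.array([[1, 1], [8, 1], [1, 120], [8, 120]])   # vertices of Example 1

def cb_side_length(s):
    vals = W @ np.asarray(s)
    return int(vals.max() - vals.min() + 1)

candidates = [s for s in itertools.product(range(-2, 3), repeat=2) if any(s)]
best = min(candidates, key=cb_side_length)
# An optimal s (up to sign, e.g. (1, 0) or (-1, 0)) yields 8 processors,
# matching the allocation S1 = (1, 0) of Example 1.
print(best, cb_side_length(best))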
Example 1. We consider a part of the GSM speech codec system as example. The considered algorithm consists of two equations:

    I.   y_1[i, k] = y_1[i − 1, k] + r[i] · y_2[i − 1, k − 1],   (i, k)^T ∈ I,
    II.  y_2[i, k] = y_2[i − 1, k − 1] + r[i] · y_1[i − 1, k],   (i, k)^T ∈ I,

with I = {(i, k)^T | 1 ≤ i ≤ 8, 1 ≤ k ≤ 120}. The index space with the data dependencies as well as the reduced dependence graph are depicted in Fig. 1.
Fig. 1. Index spaces with data dependencies and reduced dependence graph.

The dependence vectors are: I → I : d_1 = (1, 0)^T, II → I : d_2 = (1, 1)^T, I → II : d_3 = (1, 0)^T, II → II : d_4 = (1, 1)^T. The set of vertices W of the index space I is W = {(1, 1), (8, 1), (1, 120), (8, 120)}. Application of Program 1 leads to the matrices S_1 = (1, 0) and S_2 = (0, 1), and hence to the projection vectors u_1 = (0, 1)^T and u_2 = (1, 0)^T respectively. The next section covers the determination of a scheduling function and the processor functionality with respect to each projection vector u_i computed in Program 1.
5 Determination of a Scheduling Function

5.1 General Optimization Problem
In this section we propose an approach to determine a scheduling function as well as a module selection and a conflict free assignment of the modules to the operations of the SURE. Our objective is the minimization of the costs for a hardware implementation of the processor array subject to the condition that a given latency Lg is satisfied. A scheduling function is determined for each linear allocation function resulting after application of program 1. First, we present a general description of the optimization problem and go into more detail in the next paragraphs.
Program 2 (Determination of a scheduling function)
    minimize the chip area C of the processor array
    subject to
        latency L of the processor array: L ≤ Lg,
        causal scheduling function,
        selection of modules,
        conflict-free access to the modules.

5.2 Objective Function
The number of processors Np is given exactly since the allocation function is fixed at this step. In each processor of the fullsize array we have to implement n_l instances of module m_l. The chip area needed to implement the modules of all processors of the fullsize array is therefore Cf = Np · Σ_{l=1}^{|M|} n_l c_l.
Additional chip area is needed to implement (1) registers to store intermediate results, (2) combinatorial logic to control the behaviour of the processors and (3) interconnections between the processors. The costs for the control logic are assumed to be negligible. The number of registers depends on the number of processors as well as the number of data produced in each processor. We approximate the chip area Cr needed to implement the registers in silicon by Cr = Np · m · c_r, where c_r is the chip area of one register and m is the number of equations of the SURE. Since the number of processors is independent of the scheduling function and the module selection, Cr can be treated as a constant. An assessment of the effort for the implementation of the interconnections is difficult. Instead of measuring the chip area of interconnections we propose a minimization of the length or a limitation of the radius of the interconnections respectively while determining the allocation function. As a consequence of the above discussion we conclude that it is sufficient to consider the chip area Cf needed to implement the modules as measure for the hardware costs. Now, we assume that the admissible latency Lg enables a slow down of the fullsize array. Hence, we can decrease the hardware costs by partitioning the fullsize array. We apply the LSGP-partitioning [4] (tiling) by a factor of K, i.e. each partition contains K processors of the fullsize array. Each partition represents a processor of the resulting processor array. We assume that the partitioning matches exactly the fullsize array and we neglect the additional hardware needed to control the partitions. Hence, we get the hardware costs of the resulting processor array Cp = Cf / K. Since the number of processors Np is fixed it is sufficient to consider the chip area C'_f of only one processor of the fullsize array. Our objective is to minimize C'_p = C'_f / K, where Cp = Np · C'_p and Cf = Np · C'_f.

5.3 Admissible Latency Lg
The latency Lf of the fullsize array is given by Lf = t_max − t_min, where

    t_min ≤ τ^T w + t_i ≤ t_max − d_{m(i),i},   ∀w ∈ W,  1 ≤ i ≤ m,
where t_min, t_max ∈ Z and W is the set of vertices w of the index space I. The consideration of only the vertices of I is justified by the fact that linear functions have their extreme values at extreme points of a convex space [10]. The relaxation to integer values has only a small effect for sufficiently large index spaces I. The assumption that Lg >> Lf enables a partitioning of the fullsize array by a factor K, where K·Lf ≤ Lg. This inequality is a sufficient condition since we can presume that a partitioning exists where the latency L of the resulting processor array satisfies L ≤ K·Lf. An intuitive justification yields the opposite case, where the distribution of a sequential program to K processors leads to a speedup less than or equal to K. Next, we linearize the constraints Cf = K·Cp and K·Lf ≤ Lg. The minimal latency Lmin of a fullsize array is easy to determine by assumption of unlimited resources and assignment of the fastest possible module to each operation of the SURE. Using Lmin we can limit K by 1 ≤ K ≤ Kmax = Lg / Lmin. The lower bound of K can be increased by consideration of the minimal hardware costs for one processor Cmin = min{ Σ_{l=1}^{|M|} n_l c_l } satisfying ∃ m_l ∈ M_i . n_l ≥ 1, 1 ≤ i ≤ m.
The application of a resource constraint scheduling [13] with the modules leading to Cmin yields Lmax and Kmin = max{1, Lg / Lmax}. The introduction of Kmax − Kmin + 1 binary variables γ_j ∈ {0, 1} enables an equivalent formulation of the constraints Cf = K·Cp and K·Lf ≤ Lg as follows:

    Cf ≤ j·Cp + (1 − γ_j)·Cmax,                     Kmin ≤ j ≤ Kmax,
    j·Lg ≥ j·Kmin·Lf + γ_j (j − Kmin)·Lg,           Kmin ≤ j ≤ Kmax,          (5)
    Σ_{j=Kmin}^{Kmax} γ_j = 1,

where Cmax = Σ_{l=1}^{|M|} n_l^max c_l, and n_l^max is the maximal number of instances of module m_l which can be implemented in one processor of the fullsize array. A reduction of the number of variables γ_j from Kmax − Kmin + 1 to (Kmax − Kmin + 1)/z, z ∈ Z, is possible by solving the optimization problem iteratively. In the first iteration only such j, Kmin ≤ j ≤ Kmax, are considered that satisfy j mod z = a, where a = Kmin mod z. The second iteration of the optimization problem is solved for j_max − z ≤ j ≤ j_max + z, where j_max is the solution of the first iteration.
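The role of constraint (5) is simply to let the solver pick one admissible partition factor K = j; a plain brute-force reference computation of the same choice, shown below with made-up numbers, may clarify what the binary variables γ_j encode (the function name and values are ours).

# Sketch (ours): constraint (5) selects one K = j via the binaries gamma_j.
# A brute-force reference for fixed module costs is simply:
def best_partition_factor(L_f, L_g, K_min, K_max):
    """Return the K minimizing the per-processor cost C_f / K with K*L_f <= L_g."""
    feasible = [K for K in range(K_min, K_max + 1) if K * L_f <= L_g]
    return max(feasible) if feasible else None   # larger K always reduces C_f / K

# Example with made-up numbers: L_f = 130 and L_g = 1200 allow K up to 9.
print(best_partition_factor(L_f=130, L_g=1200, K_min=1, K_max=10))   # -> 9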
5.4 Causality Constraint
In order to ensure a valid partial order of the equations preserving the data dependencies, the scheduling function has to satisfy the causality constraint.

Definition 5 (Causality constraint). A scheduling function τ_i(i) = τ^T i + t_i has to satisfy the following constraint:

    τ^T d(e) + t_{δ(e)} − t_{σ(e)} ≥ d_{m(σ(e)),σ(e)},   ∀e ∈ E.     (6)
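Checking (6) for a candidate schedule is straightforward; the sketch below is our own illustration, using the RDG edges of Example 1 together with an assumed module delay and an assumed schedule that are not taken from the paper.

import numpy as np

# Sketch (ours): each RDG edge e is (src, dst, d_e) with d_e the dependence
# vector; delay[src] stands for d_{m(src),src}, the evaluation time of the
# module assigned to the source operation.
def is_causal(tau, t, edges, delay):
    return all(
        tau @ np.asarray(d_e) + t[dst] - t[src] >= delay[src]
        for (src, dst, d_e) in edges
    )

# Edges d1..d4 of Example 1, with an assumed module delay of 8 cycles and an
# assumed schedule tau = (16, 8), t_I = 0, t_II = 8 (values are ours).
edges = [("I", "I", (1, 0)), ("II", "I", (1, 1)),
         ("I", "II", (1, 0)), ("II", "II", (1, 1))]
tau, t, delay = np.array([16, 8]), {"I": 0, "II": 8}, {"I": 8, "II": 8}
print(is_causal(tau, t, edges, delay))   # True for this particular choice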
5.5 Module Selection and Prevention of Access Conflicts
The assignment of a module to the operation F_j is described by |M_j| binary variables r_j^l ∈ {0, 1}, where Σ_{m_l ∈ M_j} r_j^l = 1 and r_j^k = 1 ↔ m(j) = m_k.
The following resource constraint prevents access conflicts to the modules.

Definition 6 (Resource constraint). For a given projection vector u, a uniform affine scheduling function τ_i(i) = τ^T i + t_i has to satisfy the following constraint:

    if (t_j mod λ) > (t_k mod λ):   (t_j mod λ) − (t_k mod λ) ≥ o_{m(j),k},
                                    λ − (t_j mod λ) + (t_k mod λ) ≥ o_{m(j),j},
    if (t_j mod λ) ≤ (t_k mod λ):   (t_k mod λ) − (t_j mod λ) ≥ o_{m(j),j},          (7)
                                    λ − (t_k mod λ) + (t_j mod λ) ≥ o_{m(j),k},

for all j ≠ k, 1 ≤ j, k ≤ m, with m(j) = m(k) and u_j = u_k, where λ = |τ^T u|. We refer to [2] for an explanation and a linearization of (7).

5.6 Result of the Optimization Problem
The result of the optimization problem in Program 2 is a set of modules that have to be implemented in each processor and the number of processors of the fullsize array building one partition, and hence one processor of the resulting processor array. Program 2 is solved for each projection vector computed in Program 1. Then we select the projection vector u_i leading to minimal hardware costs C = Cp + Cr = Np · ( (1/K) Σ_{l=1}^{|M|} n_l c_l + m·c_r ).
Example 2. (Continuation of example 1) The admissible latency Lg is measured in clock cycles and supposed to be Lg = 1200. The considered set of modules is listed in table 1. We solve the
Table 1. Set of modules

module | operation | evaluation time d_i (clock cycles) | time offset o_i (clock cycles) | chip area c_i (normalized)
m1     | add/mult  | 3                                   | 3                              | 297
m2     | add/mult  | 4                                   | 4                              | 169
m3     | add/mult  | 8                                   | 8                              | 44
mr     | reg       |                                     |                                | 1
optimization problem of program 2 separately for the projection vectors u1 and u2 determined in program 1. In order to justify our approach we present some interesting solutions in table 2 instead of only the best one given by the solver of the optimization problem.
Table 2. Results of the optimization problem

projection vector | Np  | Cr = Np·m·c_r | module selection | K  | Cp = (Np/K) Σ_{i=1}^{|M|} n_i c_i | C = Cr + Cp
u1 = (0, 1)^T     | 8   | 16            | 1 × m3           | 1  | 704                                | 720
"                 | 8   | 16            | 1 × m2           | 1  | 1352                               | 1368
"                 | 8   | 16            | 2 × m1           | 3  | 1584                               | 1600
u2 = (1, 0)^T     | 120 | 240           | 1 × m3           | 9  | 586                                | 826
"                 | 120 | 240           | 2 × m2           | 37 | 1096                               | 1336
"                 | 120 | 240           | 1 × m1           | 25 | 1425                               | 1665
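As a quick plausibility check of the cost formula C = Cr + Cp used in the table, two of the rows can be recomputed directly; the sketch below is ours and assumes that fractional values of (Np/K)·Σ n_i c_i are truncated to integers.

# Recompute Cp = (Np / K) * sum(n_i * c_i) and C = Cr + Cp for two rows of
# Table 2 (truncation of fractional values is our assumption).
def row_cost(N_p, K, selection, m=2, c_r=1):
    area = {"m1": 297, "m2": 169, "m3": 44}
    C_p = int(N_p * sum(n * area[mod] for mod, n in selection) / K)
    C_r = N_p * m * c_r
    return C_p, C_r + C_p

print(row_cost(8,   3,  [("m1", 2)]))   # (1584, 1600) -- the u1, 2 x m1 row
print(row_cost(120, 37, [("m2", 2)]))   # (1096, 1336) -- the u2, 2 x m2 row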
Minimal hardware costs occur when using projection vector u1 and implementing one instance of module m3 in each processor of the resulting processor array. Finally, we want to give some short notes on the computational effort of our example. The worst case is Program 2 with respect to the projection vector u1. The program consists of 168 constraints with 51 binary, 14 integer and one rational variable. The solution takes 11.3 seconds of CPU time on a SUN SPARC Station 10.

5.7 Selection of Projection Vectors
An alternative approach to the separate computation of a scheduling function for each projection vector consists in considering the projection vectors u_i as parameters in Program 2. We assume that the iteration interval satisfies λ ≤ λ_max. A linearization of the constraint λ = |τ^T u| is achievable using four inequalities and one binary variable υ:

    λ − 2υλ_max ≤ τ^T u ≤ λ,
    λ − 2(1 − υ)λ_max ≤ −τ^T u ≤ λ,          λ ∈ Z, υ ∈ {0, 1}.          (8)

Suppose that we have to select one of P projection vectors u_i, 1 ≤ i ≤ P. Using P binary variables α_i, 1 ≤ i ≤ P, we replace (8) by:

    λ − 2υλ_max − 2(1 − α_i)λ_max ≤ τ^T u_i ≤ λ + (1 − α_i)λ_max,        1 ≤ i ≤ P,
    λ − 2(1 − υ)λ_max − 2(1 − α_i)λ_max ≤ −τ^T u_i ≤ λ + (1 − α_i)λ_max,  1 ≤ i ≤ P,
    Σ_{i=1}^{P} α_i = 1,        α_i ∈ {0, 1}, 1 ≤ i ≤ P,  υ ∈ {0, 1}.
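That the four inequalities of (8) indeed force λ = |τ^T u| can be verified exhaustively for small values; the following check is our own sketch under the assumption 0 ≤ λ ≤ λ_max.

# Tiny brute-force check (ours) that (8) forces lambda = |tau_u|, where
# tau_u = tau^T u, assuming 0 <= lambda <= lambda_max.
def satisfies_8(lam, tau_u, upsilon, lam_max):
    return (lam - 2 * upsilon * lam_max <= tau_u <= lam and
            lam - 2 * (1 - upsilon) * lam_max <= -tau_u <= lam)

lam_max = 10
for tau_u in range(-lam_max, lam_max + 1):
    feasible = {lam for lam in range(lam_max + 1)
                for upsilon in (0, 1) if satisfies_8(lam, tau_u, upsilon, lam_max)}
    assert feasible == {abs(tau_u)}, (tau_u, feasible)
print("linearization (8) forces lambda = |tau^T u|")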
The projection vector u_i is selected if α_i = 1. Furthermore, we have to include the hardware costs for the registers. The objective function is changed to minimize the hardware costs C of the resulting processor array. New constraints have to be introduced to take the different numbers of processors N_p^i of the fullsize arrays with respect to the projection vectors u_i into account:

    C ≥ N_p^i (Cp + m·c_r) − (1 − α_i)·N_p^i (Cmax + m·c_r),   1 ≤ i ≤ P.
We do not recommend this approach since the condition of the integer linear program deteriorates strongly.
6 Conclusion
The presented approach is suitable to derive cost minimal processor arrays for algorithms with the requirement of an admissible latency. The arising optimization problems are given in a linearized form which permits the use of standard packages to solve the problems. Possible extensions to our approach are the inclusion of the power consumption and the inclusion of a rough approximation of the effort needed to implement the interconnections.
References
1. A. Darte, Y. Robert: "Constructive Methods for Scheduling Uniform Loop Nests", IEEE Trans. on Parallel and Distributed Systems, Vol. 5, No. 8, pp. 814-822, 1994. 1019
2. D. Fimmel, R. Merker: "Determination of the Processor Functionality in the Design of Processor Arrays", Proc. Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 199-208, Zürich, 1997. 1019, 1020, 1025
3. D. Fimmel, R. Merker: "Determination of an Optimal Processor Allocation in the Design of Massively Parallel Processor Arrays", Proc. Int. Conf. on Algorithms and Parallel Processing, pp. 309-322, Melbourne, 1997. 1019, 1020
4. K. Jainandunsing: "Optimal Partitioning Scheme for Wavefront/Systolic Array Processors", IEEE Proc. Symp. on Circuits and Systems, 1986. 1023
5. R.M. Karp, R.E. Miller, S. Winograd: "The organization of computations for uniform recurrence equations", J. of the ACM, vol. 14, pp. 563-590, 1967. 1019
6. D.I. Moldovan: "On the Design of Algorithms for VLSI Systolic Arrays", Proceedings of the IEEE, pp. 113-120, January 1983.
7. P. Quinton: "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations", IEEE 11th Int. Symp. on Computer Architecture, Ann Arbor, pp. 208-214, 1984.
8. S.K. Rao: "Regular Iterative Algorithms and their Implementations on Processor Arrays", PhD thesis, Stanford University, 1985. 1020
9. J. Rossel, F. Catthoor, H. De Man: "Extension to Linear Mapping for Regular Arrays with Complex Processing Elements", Proc. Int. Conf. on Application-Specific Systems, Architectures and Processors, pp. 156-167, Princeton, 1990. 1019
10. A. Schrijver: Theory of Linear and Integer Programming, John Wiley & Sons, New York, 1986. 1024
11. A. Schubert, R. Merker: "Systolization of Recursive Algorithms with DESA", in Proc. 5th Int. Workshop Parcella '90, Mathematical Research, G. Wolf, T. Legendi, U. Schendel (eds.), vol. 2, Akademie-Verlag Berlin, 1990, pp. 267-276. 1019
12. J. Teich: "A Compiler for Application-Specific Processor Arrays", PhD thesis, Univ. of Saarland, Verlag Shaker, Aachen, 1993.
13. L. Thiele: "Resource Constraint Scheduling of Uniform Algorithms", Int. Journal on VLSI and Signal Processing, Vol. 10, pp. 295-310, 1995. 1019, 1024
14. Y. Wong, J.M. Delosme: "Optimal Systolic Implementation of n-dimensional Recurrences", Proc. ICCD, pp. 618-621, 1985. 1019
15. Y. Wong, J.M. Delosme: "Optimization of Processor Count for Systolic Arrays", Research Report YALEU/DCS/RR-697, Yale Univ., 1989. 1019
16. X. Zhong, S. Rajopadhye, I. Wong: "Systematic Generation of Linear Allocation Functions in Systolic Array Design", J. of VLSI Signal Processing, Vol. 4, pp. 279-293, 1992. 1019
Interval Routing & Layered Cross Product: Compact Routing Schemes for Butterflies, Mesh of Trees and Fat Trees

Tiziana Calamoneri¹ and Miriam Di Ianni²

¹ Dip. di Scienze dell'Informazione, Università di Roma "la Sapienza", via Salaria 113, I-00198 Roma, Italy. [email protected]
² Istituto di Elettronica, Università di Perugia, via G. Duranti 1/A, I-06123 Perugia, Italy. [email protected]
Abstract. In this paper we propose compact routing schemes having space and time complexities comparable to a 2-Interval Routing Scheme for the class of networks decomposable as the Layered Cross Product (LCP) of rooted trees. As a consequence, we are able to design a 2-Interval Routing Scheme for butterflies, meshes of trees and fat trees using a fast local routing algorithm. Finally, we show that a compact routing scheme for networks which are the LCP of general graphs cannot, in general, be found using only shortest paths information on the factors.
1 Introduction
The information needed to route messages in parallel and distributed systems must be somehow stored in each node of the network. The simplest solution consists of a complete routing table – stored in each node u – that specifies for each destination v at least one link incident to u and lying on a path from u to v. The required space of such a solution is Θ(n log δ), where δ is the node degree and n the number of nodes in the network. Efficiency considerations lead to store shortest paths information. For better and fair use of network resources, storing, for each entry of the routing table, as many outgoing links as necessary to describe all shortest paths in the network should be aimed at. Due to limited storage space at each processor, a linear increase of the routing table size in n is not acceptable. Research has then focused on identifying classes of network topologies whose shortest paths information can be succinctly stored, assuming that some "short" labels can be assigned to nodes and links at preprocessing time. In the Interval Routing Scheme (in short, IRS) [7,12], node-labels belong to the set {1, . . . , n}, while link-labels are pairs of node-labels representing cyclic intervals of [1, . . . , n]. A message with destination v arriving at a node u is sent by u onto an incident link whose label [v1, v2] is such that v ∈ [v1, v2]. Such an approach allows one to achieve an efficient memory occupation. An IRS is said
optimum if the route traversed by each message is a shortest path from its source to its destination. It is said overall optimum if a message can be routed along any shortest path. In [6,12] optimum IRSs have been designed for particular network topologies. In [3,11,13] the existence of networks that do not admit any optimum IRS has been proved. Multi-label Interval Routing Schemes were introduced [7] to extend the model in order to allow more than one interval to be associated to each link: a k-IRS is a scheme associating at most k intervals to each link. A message whose destination is node v is sent onto a link labeled (I1, . . . , Ik) if v ∈ Ii for some 1 ≤ i ≤ k. In [2] a technique for proving lower bounds on the minimum k allowed was developed and in [4] it has been used to construct n-node networks for which any optimal k-IRS requires k = Θ(n). It was proved that for some well known interconnection networks, such as shuffle exchange, cube connected cycle, butterfly and star graph, each optimal k-IRS requires k = Ω(n^{1/2−ε}) – for proper values of ε – to store one shortest path for each pair [5]. Of course, this lower bound still holds to store any shortest path. In this paper, after providing the necessary preliminary definitions (Section 2), we propose overall optimum compact routing schemes (Section 3) based on the same leading idea as the Multi-label Interval Routing for all networks which are the Layered Cross Product (LCP) [1] of rooted trees (in short, T-networks). For many commonly used interconnection networks falling in this definition no overall optimum compact routing scheme was known. Among them we recall three widely studied topologies: butterflies, meshes of trees and fat trees. Our compact routing scheme requires as much space and time as those required by a 2-IRS. The achievement is particularly meaningful for butterflies because of the result in [5]. Finally, in Section 4, we give a negative result by proving that the knowledge of shortest paths on the factors of a network could be not enough to compute shortest paths on it.
2 Definitions and Preliminary Results
Point to point communication networks are usually represented by graphs, whose nodes stand for processors and edges for communication links. We always represent each edge {u, v} by the pair of (oriented) arcs, (u, v) and (v, u). An l-layered graph, G = (V 1 , V 2 , . . . , V l , E) consists of l layers of nodes; V i is the (non-empty) set of nodes in layer i, 1 ≤ i ≤ l; every edge in E connects vertices of two adjacent layers. In particular a rooted tree T of height h is a h-layered graph, layer i defined either as the set of nodes having distance i − 1 from the root or as the set of nodes having distance h − i from the root. From now on, we call T a root-tree or a leaf-tree according to whether the first or the second way of defining layers is chosen. In Fig. 1, T1 is a root-tree, while T2 is a leaf-tree. Let G1 = (V11 , V12 , . . . , V1l , E1 ) and G2 = (V21 , V22 , . . . , V2l , E2 ) be two l-layered graphs. Their Layered Cross Product (LCP for short) [1] G1 × G2 is an l-layered graph G = (V 1 , V 2 , . . . , V l , E) where V i is the cartesian product
of V1i and V2i , 1 ≤ i ≤ l, and a link ((a, α), (b, β)) belongs to E if and only if (a, b) ∈ E1 and (α, β) ∈ E2. Many common networks are LCP of trees [1]. Among them: the butterfly with N inputs and N outputs is the LCP of two N-leaves complete binary trees (Fig. 1.a), the mesh of trees of size 2N is the LCP of two N-leaves complete binary trees with paths of length log N attached to their leaves (Fig. 1.b), the fat tree of height h [10] is the LCP of a complete binary tree and a complete quaternary tree, both of height h (Fig. 1.c).
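To make the product construction concrete, the following sketch (ours) builds the LCP of two layered graphs given as per-layer node sets and edge sets; this representation is an assumption of the example, not prescribed by the paper.

from itertools import product

# Sketch (ours): a layered graph is (layers, edges), where layers[i] is the
# node set of layer i+1 and edges is a set of pairs of nodes of adjacent layers.
def layered_cross_product(g1, g2):
    layers1, edges1 = g1
    layers2, edges2 = g2
    layers = [set(product(l1, l2)) for l1, l2 in zip(layers1, layers2)]
    edges = {((a, x), (b, y))
             for (a, b) in edges1 for (x, y) in edges2
             if any(a in layers1[i] and x in layers2[i] and
                    b in layers1[i + 1] and y in layers2[i + 1]
                    for i in range(len(layers1) - 1))}
    return layers, edges

# Two 2-layer graphs consisting of a single edge give a single product edge.
g1 = ([{"a"}, {"b"}], {("a", "b")})
g2 = ([{"x"}, {"y"}], {("x", "y")})
print(layered_cross_product(g1, g2)[1])   # {(('a', 'x'), ('b', 'y'))}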
Fig. 1. Butterfly, Mesh of Trees and Fat-tree as LCP of rooted trees.
Fact 21 Let (a, α), (b, β) ∈ V (G1 × G2 ). Any shortest path from (a, α) to (b, β) is never shorter than a shortest path from a to b in G1 and a shortest path from α to β in G2 . Fact 22 If G is the LCP of either two root-trees or two leaf-trees then G is a tree. Observe that the LCP of two trees could be also not connected. This is not a restriction to our discussion, since we deal with connected networks that are the LCP of trees and not with any LCP of trees.
3 Designing Compact Routing Schemes for T-networks
Let G = T1 ×T2 , where both T1 and T2 are trees. In [12] an IRS for trees has been shown. Thus, consider the two IRSs for T 1 and T 2 and let L1 and I1 , L2 and I2 be the node- and link-labelings of an IRS for T 1 and T 2 , respectively. A node (u1 , u2 ) ∈ V (G) is labeled with a triple (a, α, l) if L1 (u1 ) = a, L2 (u2 ) = α and l is the layer of both u1 in T1 and u2 in T2 (and of (u1 , u2 ) in G). Similarly, a link ((u1 , u2 ), (v1 , v2 )) ∈ E is labeled with a triple (I1 , I2 , l), where I1 = I1 ((u1 , v1 )), I2 = I2 ((u2 , v2 )) and l is the layer of (v1 , v2 ). It is possible to rename all nodes
according to their node-labeling L; therefore, in the following we speak about labeling to mean link-labeling and we refer to nodes themselves to mean their node-labels. To complete the definition of the compact routing scheme for G, we must describe the algorithm A stored in each node (a, α, la) used to route a packet onto a shortest path connecting the current node (a, α, la) itself to the destination (t, τ, lt). Informally, at each step A tries to take the greedy choice: if both shortest path factors P1(a, t) and P2(α, τ) move towards the same level, then A moves in that direction, that is, it chooses link ((a, α, la), (b, β, lb)). Otherwise, if a = t (or α = τ), then P1 (or P2) is null, so the other path must be followed (when we say that A follows a path Pi, i = 1 or 2, we mean that A follows an edge of G whose i-th factor belongs to Pi). Finally, if P1(a, t) and P2(α, τ) go towards opposite levels, A follows the path going away from level lt. We will prove that in this way a shortest path on G is always used. Now, we are ready to describe algorithm A formally.

Algorithm A (input: t, τ, lt; output: one outgoing link labeled (I1, I2, l))
{ The labeling of the outgoing links, a, α and la are known constants in each node. }
if a = t and α = τ then extract the packet
else if there exists a link labeled (I1, I2, l) s.t. t ∈ I1 and τ ∈ I2 then choose it
else if there exists a link having label (I1, I2, l) s.t. t ∈ I1 and (α = τ or |la − lt| < |l − lt|) then choose it
else if there exists a link having label (I1, I2, l) s.t. τ ∈ I2 and (a = t or |lα − lτ| < |l − lτ|) then choose it;
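A direct transcription of A into executable form might look as follows; this is our own sketch, with a simple representation of the link labels (the intervals are plain sets here) that the paper does not prescribe.

# Sketch (ours) of algorithm A.  Each outgoing link is a tuple (I1, I2, l, link_id);
# I1 and I2 are the cyclic intervals, represented here as plain sets of node-labels.
# In a layered product the layer of alpha equals l_a and the layer of tau equals l_t,
# so the last test uses those values.
def route(node, dest, links):
    a, alpha, l_a = node
    t, tau, l_t = dest
    if a == t and alpha == tau:
        return None                                   # extract the packet
    for I1, I2, l, link in links:                     # both factors advance
        if t in I1 and tau in I2:
            return link
    for I1, I2, l, link in links:                     # follow P1, moving away from l_t
        if t in I1 and (alpha == tau or abs(l_a - l_t) < abs(l - l_t)):
            return link
    for I1, I2, l, link in links:                     # follow P2, moving away from l_t
        if tau in I2 and (a == t or abs(l_a - l_t) < abs(l - l_t)):
            return link
    return None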
Notice that algorithm A, stored in each node, does not increase the asymptotic space complexity with respect to a 2-IRS and runs in O(δ) time, δ being the maximum node degree. As a consequence, if A is able to route packets to their destinations, we have designed a compact routing scheme. In the following, we first show the optimality (Thm. 1) and then the overall optimality (Thm. 2).

Theorem 1. If (a, α, la) transmits a packet, whose destination is (t, τ, lt), to an adjacent node (b, β, lb), then (b, β, lb) belongs to a shortest path from (a, α, la) to (t, τ, lt).

Proof. If G is the LCP of either two root-trees or two leaf-trees the statement is trivially true because, from Fact 22, there exists a unique path between any pair of nodes. It must necessarily be the Layered Cross Product of the corresponding paths in the factors. Thus, links labeled (I1, I2, l) such that t ∈ I1 and τ ∈ I2 can always be used. Therefore, from now on we shall always suppose that T1 is a root-tree and T2 is a leaf-tree. The proof considers the truth of the if-conditions in A.
1. There exists a link (I1, I2, l) such that τ ∈ I2 and a = t. Let α, β, β1, . . . , βk, τ be the shortest path in T2. Then, by definition of LCP, there exist b, b1, . . . , bk
in T1 such that a, b, b1 , . . . , bk , a is a path (crossing the same links more than once) such that (a, α, la ), (b, β, lb ), (b1 , β1 , lb1 ), . . . , (bk , βk , lbk ), (a, τ, la ) is a path in G starting at (a, α, la ) and ending at (a, τ, la ). Fact 21 ensures that this is a shortest path in G. The case with t ∈ I1 and α = τ is symmetric. 2. There exists a link labeled (I1 , I2 , lb ) such that t ∈ I1 and τ ∈ I2 . W.l.o.g., suppose la < lt : b is a child of a and β is the father of α (Fig. 2). Moreover, since b belongs to a shortest path from a to t, t is a descendant of b. Two cases are possible: – τ is an ancestor of α (Fig. 2.a). Then, the shortest paths from a to t in T1 and from α to τ in T2 have the same length lt − la . It is easy to see that in G there exists a unique shortest path from (a, α, la ) to (t, τ, lt ). Furthermore, all links in such path are labeled (I1 , I2 , l) such that t ∈ I1 and τ ∈ I2 . Thus, (b, β, lb ) belongs to a path of length lt − la and for Fact 21 it is the shortest path. – α and τ have a common ancestor (Fig. 2.b). Let γ be the nearest common ancestor and let lc be its layer. By the definition of layers, and since T1 is a root-tree and T2 is a leaf-tree, it must be lt < lc .
Fig. 2. la < lt; a. τ is an ancestor of α; b. α and τ have a common ancestor.
Let c be one of the descendants of t at layer lc in T1 and δ be the node at layer lt belonging to the shortest path from α to γ in T2 . There exists a path from (a, α, la ) to (t, δ, lt ) in G whose links are always labeled (I1 , I2 , l) such that t ∈ I1 and τ ∈ I2 and having length equal to the length of the shortest path from α to δ in T2 (this is easily proved by induction on the length). The path in G from (t, δ, lt ) to (t, τ, lt ) chosen by A is a shortest one because of case 1. Since the path from (a, α, la ) to (t, τ, lt ) passing through (b, β, lb ) has the same length as the path from α to τ in T2 , it is a shortest one. 3. No outgoing link from (a, α, la ) is labeled (I1 , I2 , l) such that t ∈ I1 and τ ∈ I2 ; furthermore, neither a = t nor α = τ . Then, one shortest path factor moves towards increasing layers while the other shortest path factor moves towards decreasing layers. Suppose first lt > la , that is, t is closer than a to the leaves in T1 while τ is closer than α to the root in T2 .
Notice that it is not possible to have both shortest path factors moving towards the leaves. Thus, both shortest path factors move towards the roots. Then, t and a have a nearest common ancestor c at layer lc. On T2, either τ is an ancestor of α (see Fig. 3) or τ and α have a nearest common ancestor δ at layer ld (Fig. 4). Consider the case of Fig. 3 first. Let b be the father of a in T1 and let I1 be the label of (a, b) in T1. Since lt − lb > lt − la and t ∈ I1, A chooses one link labeled (I1, I2, lb) and ending in (b, β, lb). Let d be the node belonging to the shortest path from a to t at layer la in T1. Node (b, β, lb) lies on the path
from (a, α, la) to (d, α, la) – through a node (c, γ, lc) – of the same length as the shortest path from a to d in T1. From this last node, there is a single shortest path to (t, τ, lt) constituted by links always labeled (I1, I2, l) such that t ∈ I1 and τ ∈ I2. The length of such path from (a, α, la) to (t, τ, lt) through (b, β, lb) is equal to the length of the shortest path from a to t in T1.

Fig. 3. The two shortest path factors move towards the roots and τ is an ancestor of α while t and a have a common ancestor c.

Consider now the situation in Fig. 4: both shortest path factors move towards the roots, t and a have a nearest common ancestor c at layer lc, and τ and α have a nearest common ancestor δ at layer ld. To go from (a, α, la) to (t, τ, lt) through a shortest path it is necessary to use either first a shortest path in T1 till c or first a shortest path in T2 till δ. Indeed, consider a path that alternates links whose first factor is from a shortest path in T1 with links whose second factor is from a shortest path in T2: let (a, α, la), (a1, α1, l1), . . . , (ak, αk, lk) be the first fragment of such a path having the first factor as the shortest path in T1, with ai ≠ c, i = 1, . . . , k. If the next step is a link whose second factor belongs to a shortest path towards τ in T2, such link ends necessarily in (u, αk−1, lk−1), where u is a child of ak in T1. For the sake of symmetry, since u and ak−1 are at the same layer in the same subtree rooted at ak (with lk > lc), the distance of (u, αk−1, lk−1) from (t, τ, lt) is equal to the distance of (ak−1, αk−1, lk−1) from (t, τ, lt). Thus, in order to reach (t, τ, lt) from (a, α, la) at least two steps have been wasted. Hence, necessarily one of the following is a shortest path in G:
– (Fig. 4.a) a path from (a, α, la) to (b, β, lb) to (c, γ, lc) to (h, α, la) to (t, η, lt) to (f, δ, ld) to (t, τ, lt), where γ is a descendant of α at layer lc
in T2 , h and f are descendants of c at layer la and ld , respectively, in T1 and η is an ancestor of α at layer lt in T2 . The length of such a path is 2(la − lc ) + (lt − la ) + 2(ld − lt ) = 2(ld − lc ) + la − lt – (Fig. 4.b) a path from (a, α, la ) to (d, δ, ld ) to (k, τ, lt ) to (a, θ, la ) to (c, σ, lc ) to (h, θ, la ) to (t, τ, lt ), where d and k are descendant of a at layers ld and lt , respectively, in T1 and θ and σ are descendants of τ at layers la and lc , respectively, in T2 . The length of such a path is (ld − la ) + (ld − lt ) + 2(lt − lc ) = 2(ld − lc ) + lt − la Thus, the shortest path depends on the sign of lt −la . Since in our hypothesis lt > la , the shortest path is the first one, that is the path whose first step goes away from lt . Since node (b, β, lb ) is such that lt − lb > lt − la , then A chooses the first path. When lt < la , the same reasoning applies. Finally, the same discussion holds also if lt = la . However, notice that in this case any of the two choices is possible.
Fig. 4. The two shortest paths factors move towards the roots, t and a have a common ancestor c and τ and α have a common ancestor δ.
Theorem 2. If a packet must be transmitted from node (a, α, la) to node (t, τ, lt) and (a, α, la), (a1, α1, la1), (a2, α2, la2), . . . , (t, τ, lt) is any shortest path from (a, α, la) to (t, τ, lt), then algorithm A can possibly use it.

Proof. Again, thanks to Fact 22, we shall always suppose that T1 is a root-tree and T2 is a leaf-tree. The proof is divided according to the truth of the if-conditions in A, and most of the considerations made in the proof of Theorem 1 are used here.
1. There exists a link (I1, I2, l) such that τ ∈ I2 and a = t: then (cf. proof of Thm. 1) G contains a path from (a, α, la) to (t, τ, lt) of length equal to the path from α to τ in T2. Let k + 1 be such length. Suppose that a path (a, α, la), (a1, α1, la1), . . . , (ak, αk, lak), (t, τ, lt) exists in G that is not found by A. Thus, α1 does not belong to the shortest path from α to τ in T2, that
is, if link ((a, α, la), (a1, α1, la1)) is labeled (I1, I2, la1), then τ ∉ I2. Hence, since by definition of LCP p = α, α1, . . . , αk, τ must be a path in T2, some αi must be the same as some αj, with i ≠ j. But the distance between α and τ in T2 is k + 1, so p must be longer than k + 1, a contradiction. The case t ∈ I1 and α = τ is symmetric.
2. There exists a link labeled (I1, I2, l) such that t ∈ I1 and τ ∈ I2. W.l.o.g., suppose la < lt and therefore b is a child of a and β is the father of α. Since b belongs to a shortest path from a to t, t is a descendant of b. Two cases are possible:
– τ is an ancestor of α. This case is trivially proved, since there is in G a unique shortest path from (a, α, la) to (t, τ, lt) and A finds that path.
– α and τ have a nearest common ancestor γ at layer lc. Recall that lt ≤ lc and a shortest path in G must be as long as the path from α to τ in T2 (cf. proof of Thm. 1). The proof of this case is similar to the one of case 1 of this theorem and thus omitted.
3. No outgoing link from (a, α, la) is labeled (I1, I2, l) such that t ∈ I1 and τ ∈ I2. Suppose lt > la. We have already proved that one of the following cases must occur:
– both shortest path factors move towards the roots, τ is an ancestor of α while t and a have a nearest common ancestor c at layer lc: algorithm A chooses a link labeled (I1, I2, l) such that t ∈ I1, and such link belongs to a path from (a, α, la) to (t, τ, lt) having the same length as the path from a to t in T1. Again, a reasoning very similar to that of case 1 of this theorem applies.
– both shortest path factors move towards the roots, t and a have a nearest common ancestor c at layer lc, τ and α have a nearest common ancestor δ at layer ld. We have already proved that a shortest path from (a, α, la) to (t, τ, lt) necessarily crosses either first a shortest path in T1 to c or first a shortest path in T2 to δ. This implies (cf. proof of Thm. 1) that a shortest path in G must necessarily be a path from (a, α, la) to (c, γ, lc) to (h, α, la) to (t, η, lt) to (d, δ, ld) to (t, τ, lt), where γ is a descendant of α at layer lc in T2, h is a descendant of c at layer la in T1 and η is an ancestor of α at layer lt in T2 (Fig. 4.a). Since A can choose any link ((a, α, la), (b, β, lb)) labeled (I1, I2, lb) such that lt − lb > lt − la and t ∈ I1, it is able to choose any path of this sort, and this proves the assertion. The same reasoning applies when lt < la and when lt = la.
4 LCP of General Graphs: Driving some Conclusions
In the previous section we have proposed a method to compute a compact routing scheme for all T-networks. In the proofs of correctness, we have strongly used the properties of the factors and having a unique (shortest) path between any couple of nodes. It is immediate to wonder whether our technique can be
extended to networks which are the LCP of more general graphs. In this section we show a negative result in this direction. Namely, we prove a property of the LCP allowing one to deduce that the knowledge of node- and link-labels on the factors (and therefore the knowledge of their shortest paths) does not give enough information to find shortest paths in their LCP.

Theorem 3. There exist a layered graph G = (V, E), LCP of G1 = (V1, E1) and G2 = (V2, E2), a source node (s, σ, ls) ∈ V, a destination node (t, τ, lt) ∈ V and an edge e ∈ E having the following properties:
i. e is the layered cross product of two edges e1 ∈ E1 and e2 ∈ E2;
ii. e belongs to a shortest path from (s, σ, ls) to (t, τ, lt);
iii. neither e1 belongs to a shortest path from s to t in G1, nor e2 belongs to a shortest path from σ to τ in G2.

Proof. The graph in the assertion is shown in Fig. 5, in which the following convention is used: each edge in the drawing of G1 and G2 (and therefore of G) represents a simple chain whose length is determined by the difference of the layers where its extremes lie. Suppose the shortest path from s to t in G1 passes through m and the shortest path from σ to τ passes through ν. Then, the following relations must hold:

|ls − lt| + 2|lt − lm| < 2|ls − ld| + |ls − lt|   and   2|ls − ln| + |ls − lt| < |ls − lt| + 2|lt − lr|,

that is, |lm − lt| < |ls − ld| and |ls − ln| < |lt − lr|. Whenever one of the following inequalities holds,

|ls − lt| + 2|lt − lr| < 2|ls − ln| + |ls − lt| + 2|lt − lm|,
2|ls − ld| + |ls − lt| < 2|ls − ln| + |ls − lt| + 2|lt − lm|,

either the path through (m, µ1, lm) and (r, ρ, lr) (first inequality) or the path through (n2, ν, ln) and (d, δ, ld) (second inequality) is shorter than the path through (n2, ν, ln) and (m, µ2, lm). That is, the shortest path passes through an edge – either ((m, µ1, lm), (r, ρ, lr)) or ((n2, ν, ln), (d, δ, ld)) – whose factors are not on a shortest path.

As a consequence of the previous theorem, we can state the following fact:

Fact 41 Let G be a network that is the LCP of any two graphs G1 and G2. Knowledge of the compact routing schemes (i.e. a node- and link-labeling scheme) on G1 and G2 alone may not be sufficient to deduce a compact routing scheme for G.

However, the previous claim does not forbid one to find some special cases in which the particular structure either of the network itself or of its factors helps in defining a compact routing scheme.
Fig. 5. Edges on a shortest path in G whose factors do not lie on any shortest path in G1 and in G2 .
Acknowledgments

We thank Richard B. Tan for a careful reading of earlier drafts and for his helpful suggestions on improving the paper.
References
1. S. Even, A. Litman, "Layered Cross Product - A technique to construct interconnection networks", 4th ACM SPAA, 60-69, 1992. 1030, 1031
2. M. Flammini, J. van Leeuwen, A. Marchetti-Spaccamela, "The complexity of interval routing in random graphs", MFCS 1995, LNCS 969, 37-49, 1995. 1030
3. P. Fraigniaud, C. Gavoille, "Optimal interval routing", CONPAR, LNCS 854, 785-796, 1994. 1030
4. C. Gavoille, S. Perennes, "Lower bounds for shortest path interval routing", SIROCCO'96, 1996. 1030
5. R. Královič, P. Ružička, D. Štefankovič, "The complexity of shortest path and dilation bounded interval routing", Euro-Par'97, LNCS 1300, 258-265, 1997. 1030
6. J. van Leeuwen, R.B. Tan, "Computer networks with compact routing tables", in: G. Rozemberg and A. Salomaa (Eds.) The Book of L, Springer-Verlag, Berlin, 298-307, 1986. 1030
7. J. van Leeuwen, R.B. Tan, "Interval routing", Computer Journal, 30, 298-307, 1987. 1029, 1030
8. J. van Leeuwen, R.B. Tan, "Compact routing methods: a survey", Colloquium on Structural Information and Communication Complexity (SICC'94), 1994.
9. F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, Inc., 1992.
10. C.E. Leiserson, "Fat-Trees: universal networks for hardware-efficient supercomputing", IEEE Trans. on Comp., 34, 892-901, 1985. 1031
11. P. Ružička, "On efficiency of Interval Routing algorithms", in: M.P. Chytil, L. Janiga, V. Koubeck (Eds.), MFCS 1988, LNCS 324, 492-500, 1988. 1030
12. N. Santoro, R. Khatib, "Routing without routing tables", Tech. Rep. SCS-TR-6, School of Computer Science, Carleton University, 1982. Also as: "Labelling and implicit routing in networks", Computer Journal, 28, 5-8, 1985. 1029, 1030, 1031
13. S.S.H. Tse, F.C.M. Lau, "A lower bound for interval routing in general networks", Tech. Rep. 94-04, Dept. of Computer Science, The University of Hong Kong, 1994 (to appear in Networks). 1030
Gossiping Large Packets on Full-Port Tori

Ulrich Meyer and Jop F. Sibeyn

Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany.
{umeyer,jopsi}@mpi-sb.mpg.de
http://www.mpi-sb.mpg.de/∼umeyer,∼jopsi/
Abstract. Near-optimal gossiping algorithms are given for two- and higher-dimensional tori. It is assumed that the amount of data each PU is contributing is so large that start-up time may be neglected. For two-dimensional tori, an earlier algorithm achieved optimality in an intricate way, with a time-dependent routing pattern. In our algorithms, in all steps, the PUs forward the received packets in the same way.
1 Introduction
Meshes and Tori. One of the most thoroughly investigated interconnection schemes for parallel computation is the n × n mesh, in which n² processing units, PUs, are connected by a two-dimensional grid of communication links. Its immediate generalizations are d-dimensional n × · · · × n meshes. Despite their large diameter, meshes are of great importance due to their simple structure and efficient layout. Tori are the variant of meshes in which the PUs on the outside are connected with "wrap-around links" to the corresponding PUs at the other end of the mesh; thus tori are node symmetric. Furthermore, for tori the bisection width is twice as large as for meshes, and the diameter is halved. Numerous parallel machines with two- and three-dimensional mesh and torus topologies have been built.

Gossiping. Gossiping is a fundamental communication problem used as a subroutine in parallel sorting with splitters or when solving ordinary differential equations: Initially each of the P PUs holds one packet, which must be routed such that finally all PUs have received all packets (this problem is also called all-to-all broadcast).

Earlier Work. Recently Šoch and Tvrdík [5] have analyzed the gossiping problem under the following conditions: Packets of unit size can be transferred in one step between adjacent PUs, store-and-forward model. In each step a PU can exchange packets with all its neighbors, full-port model. The store-and-forward model gives a good approximation of the routing practice, if the packets are so large that the start-up times may be neglected. If an algorithm requires k > 1 packets per PU, then this effectively means nothing more than that the packets have to be divided in k chunks each.
In [5], it was shown that on a two-dimensional n1 × n2 torus, the given problem can be solved in (n1·n2 − 1)/4 steps if n1, n2 ≥ 3. This is optimal. The algorithm is based on time-arc-disjoint broadcast trees: The action of each PU is time dependent, and thus, for every routing decision, a PU has to perform some non-trivial computation (alternatively these actions can be precomputed, but for a d-dimensional torus with P PUs this requires Θ(P) storage per processor).

New Results. In this paper, we analyze the same problem as Šoch and Tvrdík. Clearly, we cannot improve their optimality. Instead, we try to determine the minimal concessions towards simple time-independent algorithms: In our gossiping algorithms, after some O(d) precomputation on a d-dimensional torus, a PU knows once and for all that packets coming from direction x_i have to be forwarded in direction x_j, 1 ≤ i, j ≤ d. Time-independence ensures that the routing can be performed with minimal delay; for a fixed-size network the pattern might even be built into the hardware. Also on a system on which the connections must be somehow switched, this is advantageous. By routing the packets along d edge-disjoint Hamiltonian paths, it is not hard to achieve the optimal number of steps if each PU initially stores k = d packets. Here optimal means a routing time of k·P/(2·d) steps on a torus with P PUs. For d = 2, the schedule presented in Section 2 is particularly simple, and might be implemented immediately. Our scheme remains applicable for higher dimensions: the proof that edge-disjoint Hamiltonian cycles exist [3,2,1] is constructive, but these constructions are fairly complicated. Unfortunately, it is not possible to achieve the optimal number of steps using d fixed edge-disjoint Hamiltonian paths and only one packet per PU which then circulates concurrently along all d paths. In [4] we show for the two-dimensional case that at least P/40 extra steps are needed due to multiple receives. In Section 3, we investigate a second approach that allows an additional o(P) routing steps. On a d-dimensional torus, it runs in P/(2·d) + o(P) steps, with only one packet per PU. For d = 2, this is almost optimal and time independent. Therefore, this algorithm might be preferable over the one from [5]. More importantly, for higher dimensions, particularly for the practically relevant case d = 3, this gives the first simple and explicit construction that comes close to optimal. In these algorithms, we construct partial Hamiltonian cycles: on a d-dimensional torus, we construct d cycles, each of which covers P/d + o(P) PUs. These are such that for every cycle, every PU is adjacent to a PU through which this cycle runs.
2 Optimal-Time Algorithms
In this section we describe a simple optimal gossiping algorithm for two-dimensional n1 × n2 tori assuming that initially each PU holds k = 2 packets.

Basic Case. The PU with index (i, j) lies in row i and column j, 0 ≤ i < n1, 0 ≤ j < n2. PU (0, 0) is located in the upper-left corner. First PU (i, j) determines whether j is odd or even and sets its routing rules as follows:
j < n2 − 1, j even : T ↔ R; B ↔ L.
j < n2 − 1, j odd : T ↔ L; B ↔ R.

Here T, B, L, R designate the directions "top", "bottom", "left" and "right", respectively. By T ↔ R, we mean that the packets coming from above should be routed on to the right, and vice-versa. The other ↔ symbols are to be interpreted analogously. Only in the special case j = n2 − 1 do we apply the rule T ↔ R; B ↔ L. The resulting routing scheme is illustrated in the left part of Figure 1.
Fig. 1. Left: A Hamiltonian cycle on a 4 × 4 torus, whose complement (drawn with thin lines) also gives a Hamiltonian cycle. Right: Partial Hamiltonian cycles on a 4 × 5 torus. The PUs in column 3 lie on only one cycle. Such a PU passes the packets on this cycle that are running forwards to the PU below it, and those that are running backwards to the PU above it.
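As an illustration of how simple these time-independent rules are, the following Python sketch (our own, not from the paper; the direction names and the dictionary representation are illustrative choices) computes the direction pairing used by a PU in column j of the basic case.

def routing_rules(j, n2):
    """Basic-case rules for a PU in column j: a dict mapping each
    incoming direction (T, B, L, R) to the outgoing direction."""
    if j == n2 - 1 or j % 2 == 0:        # last column and even columns: T <-> R, B <-> L
        pairs = [("T", "R"), ("B", "L")]
    else:                                 # odd columns (except the last): T <-> L, B <-> R
        pairs = [("T", "L"), ("B", "R")]
    rules = {}
    for a, b in pairs:                    # the <-> rules are symmetric
        rules[a] = b
        rules[b] = a
    return rules

# Example: rules for the columns of a 4 x 4 torus
for j in range(4):
    print(j, routing_rules(j, 4))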
Odd ni. Now we consider the case n1 even and n2 odd; the remaining cases can be treated similarly. Here we do not construct complete Hamiltonian cycles, but cycles that visit most PUs, and pass within distance one from the remaining PUs. Except for Column n2 − 2, the rules for passing on the packets are the same as in the basic case. In Column n2 − 2 we perform L ↔ R. PUs which do not lie on a given cycle, out-of-cycle-PUs, abbreviated OOC-PUs, are provided with the packets transferred along this cycle by their neighboring on-cycle-PUs, OC-PUs. With respect to different cycles, a PU can be both out-of-cycle and on-cycle. The packets received by an OOC-PU are not forwarded. This can be achieved in such a way that a connection has to transfer only one packet in every step. The resulting routing scheme is illustrated in the right part of Figure 1.

Changing Direction. Each OOC-PU C receives packets from two different OC-PUs, A and B. Let A transfer the packets that are walking forwards, let B provide the packets in backward direction. If m denotes the cycle length and A and B are not adjacent on the cycle, then m/2 circulation steps are not enough:
some packets are received from both directions, while others pass by without notice. We modify the algorithm as follows. Let l be the number of OC-PUs between A and B; then during the first l steps, A transfers to C the packets that are running forward, and during the last m/2 − l steps those that run backward. B operates oppositely. In our case l = 2 · n2 − 1 for all PUs in Column n2 − 2. Thus, each PU can still compute its routing decisions in constant time.

Theorem 1. If every PU of an n1 × n2 torus holds 2 packets, then gossiping can be performed in n1 · n2 /2 steps.

The considered schemes have the advantage that their precomputation can be performed in constant time. Alternatively, one might use the scheme in [3] for the construction of two edge-disjoint Hamiltonian cycles on any two-dimensional torus.
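The forward/backward switching admits an equally small rule. The sketch below is a hedged illustration (the function name, the step numbering and the returned dictionary are our own conventions, not the paper's): during the first l steps the supplying OC-PU A hands its OOC-PU the forward-running packet and B the backward-running one; afterwards the roles are swapped.

def supply_direction(step, l, m):
    """Which packet A and B hand to their out-of-cycle PU in a given
    1-based step of the m/2 circulation steps."""
    assert 1 <= step <= m // 2
    if step <= l:
        return {"A": "forward", "B": "backward"}
    return {"A": "backward", "B": "forward"}

# Example: l = 2*n2 - 1 for the PUs in Column n2 - 2 (here n2 = 5, cycle length m = 30)
n2, m = 5, 30
l = 2 * n2 - 1
print([supply_direction(t, l, m) for t in (1, l, l + 1, m // 2)])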
3 One-Packet Algorithms
In this section we give another practical alternative to the approach of [5]: we present algorithms which require only one packet per PU and are optimal to within o(P) steps. The idea is simple: we construct d edge-disjoint cycles of length P/d + o(P). Each of the cycles must have the special property that if a PU does not lie on such a cycle, this PU must be adjacent to two other PUs belonging to this cycle. Each of these two PUs will transmit the packets from one direction of the cycle to the out-of-cycle-PU.

Two-Dimensional Tori. The construction can be viewed as an extension of the approach described in Section 2. There we interrupted the regular zigzag pattern for one column in order to cope with an odd value for n2. Now we discard the zigzag in n2 − 2 consecutive columns; the only two remaining zigzag columns connect the long stretched row parts to form two partial cycles:

j = 0 : T ↔ R; B ↔ L.
j = 1 : T ↔ L; B ↔ R.
2 ≤ j < n2 : L ↔ R.

Binding the out-of-cycle-PUs is done exactly as in the algorithm of Section 2; the case of odd n1 and n2 can be treated by inserting one special row. Each of the partial cycles consists of n1 · n2 /2 + n1 PUs, so using both directions in parallel one needs n1 · n2 /4 + n1 /2 steps to spread all the packets within the cycle. For every OOC-PU the two supplying OC-PUs are separated by n2 + 1 other PUs on their cycle; thus we obtain a fully time-independent n1 · n2 /4 + O(n1 + n2) step algorithm. Applying the forward/backward switching as presented in Section 2 we get:

Theorem 2. If every PU of an n1 × n2 torus holds 1 packet, then gossiping can be performed in n1 · n2 /4 + n1 /2 + 1 steps.
Fig. 2. Three edge-disjoint cycles used in the one-packet algorithm on a three-dimensional torus.

Three-Dimensional Tori. For three-dimensional tori, we generalize the previous routing strategy, showing more abstractly the underlying approach. We assume that n2 and n3 are divisible by three. The constructed patterns are similar to the bundle of rods in a nuclear power-plant. We construct three cycles. These are identical, except that they start in different positions: Cycle j, 0 ≤ j ≤ 2, starts in PU (0, j, 0). The cycles for two-dimensional tori were composed of n1 /2 laps: sections of a cycle starting with a zigzag and ending with the traversal of a wrap-around connection. The zigzags were needed to bring us two positions further in order to connect to the next lap. Here, we have three zigzags, bringing us three positions further. The sequence of directions in the zigzag pattern is given by (2, 1, 2, 1, 2, 1). After these zigzags, moves in direction 1 are repeated until a wrap-around connection is traversed; then the next lap can begin. In this way we can fill up one plane (using n2 /3 laps), but in order to get to the next plane, there must be a second zigzag type consisting of the following direction pattern: (2, 1, 2, 1, 3, 1). This type is applied every n2 /3 laps. The total schedule is illustrated in Figure 2.

If PU P = (x, y, z), x ≥ 3, belongs to cycle i, then PUs Pr = (x, (y + 1) mod 3, z) and Pd = (x, y, (z + 1) mod 3) belong to cycle (i + 1) mod 3, Pl =
(x, (y − 1) mod 3, z) and Pu = (x, y, (z − 1) mod 3) belong to cycle (i − 1) mod 3. In this way, P on cycle i can be supplied with packets which are not lying on its own cycle: it receives packets running forward on cycle (i + 1) mod 3 from Pr, backward-packets from Pd. Pl provides packets running backward on cycle (i − 1) mod 3, forward-packets are sent by Pu. Each pair of supporting on-cycle-PUs is separated by (n1 /3 + 1) · n2 − 1 other PUs. In the upper part of our reactor, exactly two cycles pass through each PU P, and the shifts are such that the connections that are not used by cycle traffic lead to PUs that lie on the other cycle. At most 2 · (n1 /3 + 1) · n2 − 1 PUs lie between two supporting on-cycle-PUs. Multiple receives in the OOC-PUs can again be eliminated by the forward/backward switching of Section 2. Otherwise O(n1 · n2) extra steps are necessary.

Theorem 3. If every PU of an n1 × n2 × n3 torus (n2, n3 integer multiples of three) holds 1 packet, then gossiping can be performed in n1 · n2 · n3 /6 + n1 · n2 /2 + 1 steps.

Higher Dimensional Tori. The idea from the previous section can be generalized without problem for d-dimensional tori. Now, we construct d edge-disjoint cycles, each of them covering P/d + o(P) PUs. The cycles are numbered 0 through d − 1. Cycle j starts in position (0, j, 0, . . . , 0). As before, a cycle is composed of laps, starting with a zigzag and ending with moves in direction 1. Now there are d − 1 types of zigzags, which are used in increasingly exceptional cases. The general pattern telling along which axis a (positive) move is made is as follows:

zigzag(2, 1) = (2, 1, 2, 1).
zigzag(3, 1) = (2, 1, 2, 1, 2, 1), zigzag(3, 2) = (2, 1, 2, 1, 3, 1).
zigzag(i, 1) = (zigzag(i − 1, 1), 2, 1),
. . . ,
zigzag(i, i − 2) = (zigzag(i − 1, i − 2), 2, 1),
zigzag(i, i − 1) = (2, 1, 2, 1, 3, 1, 4, 1, 5, 1, . . . , i − 1, 1, i, 1).

Here zigzag(d, j) indicates the j-th zigzag pattern for d-dimensional tori. During lap i, 1 ≤ i ≤ n2 /d · n3 · · · · · nd, the highest numbered zigzag is applied for which the condition i mod (n2 /d · n3 · · · · · nj) = 0 is true. Namely, after precisely so many laps, the cycle has filled up the hyperspace spanned by the first j coordinate vectors. Using induction over the number of dimensions, we can prove:

Theorem 4. If every PU of a d-dimensional n1 × · · · × nd torus (n2, . . . , nd integer multiples of d) with P PUs holds 1 packet, then gossiping can be performed in P/(2 · d) + P/(2 · n1) + 1 steps.
4 Conclusion
We have completed the analysis of the gossiping problem on full-port store-and-forward tori. In [5] only one interesting aspect of this problem was considered. We have shown that almost equally good performance can be achieved by simpler time-independent algorithms, and given explicit schemes for higher-dimensional tori as well.
References

1. Alspach, B., J-C. Bermond, D. Sotteau, 'Decomposition into Cycles I: Hamilton Decompositions,' Proc. Workshop Cycles and Rays, Montreal, 1990.
2. Aubert, J., B. Schneider, 'Décomposition de la somme cartésienne d'un cycle et de l'union de deux cycles hamiltoniens en cycles hamiltoniens,' Discrete Mathematics, 38, pp. 7–16, 1982.
3. Foregger, M.F., 'Hamiltonian Decomposition of Products of Cycles,' Discrete Mathematics, 24, pp. 251–260, 1978.
4. Meyer, U., J. F. Sibeyn, 'Time-Independent Gossiping on Full-Port Tori,' Techn. Rep. MPI-98-1-014, Max-Planck-Institut für Informatik, Saarbrücken, Germany, 1998.
5. Šoch, M., P. Tvrdík, 'Optimal Gossip in Store-and-Forward Noncombining 2-D Tori,' Proc. 3rd International Euro-Par Conference, LNCS 1300, pp. 234–241, Springer-Verlag, 1997.
Time-Optimal Gossip in Noncombining 2-D Tori with Constant Buffers

Michal Šoch and Pavel Tvrdík

Department of Computer Science and Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague, Czech Republic
{soch,tvrdik}@sun.felk.cvut.cz
http://cs.felk.cvut.cz/pcg/
Abstract. This paper describes a time-, transmission-, and memory-optimal algorithm for gossiping in general 2-D tori in the noncombining full-duplex and all-port communication model.
1 Introduction
Gossiping, also called all-to-all broadcasting, is a fundamental collective communication problem: each node of a network holds a packet that must be delivered to every other node, and all nodes start simultaneously. In this paper, we consider gossiping in the noncombining store-and-forward all-port full-duplex model. In such a model, a gossip protocol consists of a sequence of rounds, and broadcast trees can be viewed as levels of arc sets, one level for one round. Given a broadcast tree BT(u) rooted in vertex u of a graph G, the arcs crossed by packets in round t are denoted by At(BT(u)). The height of a broadcast tree BT(u), denoted by h(BT(u)), is the number of levels of BT, i.e., the number of rounds of a broadcast using this tree. Given BT(u) and i < h(BT(u)), the i-th level subtree of BT(u), denoted by BT[i](u), is the subtree of BT(u) consisting of the first i levels.

Given a graph G, every node of G has to receive |V(G)| − 1 packets, and in one round it can receive δ(G) packets in the worst case, where |V(G)| is the size of G and δ(G) is the minimum degree of a node in G. The lower bound on the number of rounds of a gossip in G is therefore τg(G) = ⌈(|V(G)| − 1)/δ(G)⌉.

If G is a vertex-transitive network, the gossip problem can be solved by constructing a generic broadcast tree, denoted by BT(∗), whose root can be placed into any vertex of G. All broadcast trees are therefore isomorphic. The 2-D torus T(m, n) is the cross product of 2 cycles of size m and n. The vertex set of T(m, n) is the cartesian product {0, . . . , m − 1} × {0, . . . , n − 1}. Tori are vertex transitive and the isomorphic copies of the generic broadcast trees are made by translation.
This research was supported by the GAČR Agency under Grant 102/97/1055, the FRVŠ Agency under Grant 1251/98, and CTU Grant 3098102336.
Definition 1. Given nodes u = [ux, uy] and v = [vx, vy] of T(m, n), a translation from u to v, denoted by ψu→v, is induced by the node mapping [wx, wy] → [wx ⊕m vx ⊖m ux, wy ⊕n vy ⊖n uy], where the addition and subtraction are taken modulo m and n, respectively.

A gossip in T(m, n) is time- and transmission-optimal iff h(BT(∗)) = τg(T(m, n)) = ⌈(mn − 1)/4⌉ and all isomorphic copies of BT(∗) are pairwise time-arc-disjoint, i.e., two arcs at the same time-level of any two broadcast trees are never mapped on the same arc of T(m, n), so that the communication in every round is contention-free. It is known that in tori, time-arc-disjointness is equivalent to the distinctness of directions of arcs at every level of the generic tree. The set of directions of arcs At(BT(∗)) is denoted by dir(At(BT(∗))). In a 2-D torus there exist four directions, usually denoted by N, E, W, S. The N-S direction is vertical, the W-E direction is horizontal. Hence, a sufficient condition for BT(∗) to guarantee a time-optimal gossip is that for any t < τg(T(m, n)), every At(BT(∗)) is a set of 4 arcs of 4 distinct directions N, E, W, S.

In general, a gossip algorithm in T(m, n) may require additional buffers in routers for packets which must wait at least one round before they can be sent out and the routers can get rid of them. Let β(G) denote the maximum size of auxiliary buffers per router during gossiping in network G. In [2], we have presented a time-optimal gossip protocol for general T(m, n). However, this algorithm is not memory-optimal; it requires auxiliary buffers for Θ(max(n, m − n)) packets per router. The generic broadcast tree of T(m, n) in [2] is built in two steps: filling up the maximal square submesh of odd side + informing the rest of the nodes. Additional buffers for packets are required on the sides of the square submesh.

In this paper, we present a time- and memory-optimal gossip algorithm for T(m, n) which requires β(T(m, n)) = 3. To keep the size of auxiliary buffers constant, the generic broadcast tree is built from vertical stripes of width 2. Only a constant number of packets must be stored for dissemination in directions W and E, which allows us to concatenate vertical stripes horizontally.
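As a minimal sketch of the translation from Definition 1 (Python; the function name and tuple representation are ours), the image of a node w under ψu→v is obtained by adding v and subtracting u coordinate-wise, modulo m and n respectively:

def translate(w, u, v, m, n):
    """psi_{u->v} applied to node w = (wx, wy) of the torus T(m, n)."""
    wx, wy = w
    ux, uy = u
    vx, vy = v
    return ((wx + vx - ux) % m, (wy + vy - uy) % n)

# Example: translating a node of the generic broadcast tree from root u to root v
print(translate((3, 1), (0, 0), (2, 5), m=8, n=6))   # -> (5, 0)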
2 Generic Broadcast Trees for Optimal Gossip on T(m, n)
Theorem 1. For any m ≥ n ≥ 2, there exists a generic BT(∗) of T(m, n) such that h(BT(∗)) = ⌈(mn − 1)/4⌉ and |At(BT(∗))| = |dir(At(BT(∗)))| for all 1 ≤ t ≤ h(BT(∗)), and β(T(m, n)) ≤ 3. If n = 2 and m is even, one extra round is needed.

Proof. For m = n, the algorithm is trivial and β(T(m, m)) = 0 if m is odd and β(T(m, m)) = 1 otherwise. Assume without loss of generality m > n. The construction of time-arc-disjoint trees for time- and memory-optimal gossip depends slightly on the values of m and n; the number of different cases is 7, but the basic idea is the same in all of them, see [3] for details. The memory requirements of our algorithm are stated in the following table.
              n = 2   n = 3, m ≤ 5   n odd   n ≥ 4 even, m odd   n ≥ 4 even, m even
β(T(m, n))      0           0          2             2                    3

In this paper, we describe only one particular case, when m > n ≥ 5 are odd numbers. To make the construction of BT(∗) in T(m, n) as simple as possible, the same patterns of arc sets At(BT(∗)) are used repeatedly in various rounds t. The whole generic tree BT(∗) is then built using several arc patterns. Arc pattern i is depicted on Figures 1 and 2 as a quadruple of arcs labeled i. We associate with every pattern i a so-called expansion operator, denoted by Γi. The broadcast tree is then specified by a regular expression over expansion operators. If At(BT(∗)) has pattern i, the attachment of the arcs at level t to the (t − 1)-level generic subtree is described as an application of the corresponding operator: BT[t](∗) = BT[t−1](∗)Γi.
Fig. 1. The first phase of building BT(∗) in T(m, n), if m ≥ 11 and m > n ≥ 5 are odd. (a) Y1 = Γ1^{(n−1)/2} Γ2. (b) Y2 = Y1 Γ3^{(n−3)/2} Γ4. (c) Y2^2.
The generic tree is built in two phases. Figure 1(a) depicts the first phase of constructing BT(∗) if m ≥ 11. If m ≤ 9, the first phase is void. The black circle is the root of BT(∗). The square symbols ✷ denote the nodes which have to store packets needed in later steps. Repeated application of Γ1 in Figure 1(a) is followed by Γ2, which produces the ((n+1)/2)-level subtree Y1 = BT[(n+1)/2](∗) = Γ1^{(n−1)/2} Γ2. Note that the expansion by Γ1 proceeds in the N-S direction. Further (n−3)/2 levels are attached in the N-S direction by repeated application of Γ3, and after applying Γ4 we get the n-level subtree Y2 = BT[n](∗) = Y1 Γ3^{(n−3)/2} Γ4. If m ≥ 15, the whole process can be repeated by expanding Y2 in the W-E direction, see Figure 1(c).

The second phase depends on m mod 4. Let us describe the case of m ≡ 1 mod 4 (the other case is similar). Let Y3 = Y2^{(m−9)/4}. If m = 9, then Y3 shrinks to the root. Figure 2 depicts the solution. We interpret the use of expansion operators similarly as in the first phase. In (3n−1)/2 rounds, we expand Y3 to Y4 = Y3 Y1 Γ3^{(n−3)/2} Γ5 Γ3^{(n−3)/2} Γ4 (see Figure 2(a)). Y4 is then expanded in 3(n−1)/4 rounds using Γ6 and Γ7 (see Figure 2(b)). Pattern Γ6^2 Γ7 diffuses vertically until the
boundary is reached. Finally, if n ≡ 1 mod 4 (as shown in Figure 2(b)), one final round is needed, and if n ≡ 3 mod 4, after two more applications of Γ6, only two uninformed nodes surrounded by informed ones remain.

The memory requirements for this case of m and n follow easily. In the first phase, 2 packets must be stored to apply Γ2 (see the square symbols in Figure 1(a)). Then, these buffers can be reused to store packets for the application of Γ4 (see Figure 1(b)). The second phase is similar. In every round of the gossip, no more than two packets must be stored at a time. ✷

Fig. 2. The second phase of constructing BT(∗), if m > n ≥ 5 are odd and m ≡ 1 mod 4. (a) Construction of Y4. (b) Application of the Γ6 and Γ7 patterns.
3 Conclusions
The minimal-height time-arc-disjoint trees in the proof of Theorem 1 provide a time-, transmission-, and memory-optimal gossip algorithm in noncombining full-duplex all-port 2-D tori. The algorithm requires routers with buffers for at most 3 packets. An interesting problem is to find the exact lower bound on the size of additional buffers. Another open problem is to find a time- and memory-optimal gossip algorithm for 2-D meshes.
References

1. Bermond, J.-C., Kodate, T., Perennes, S.: Gossiping in Cayley graphs by packets. In: Deza, M., et al. (eds.): Combinatorics and Computer Science, LNCS 1120, pages 301–315. Springer, 1995.
2. Šoch, M., Tvrdík, P.: Optimal gossip in store-and-forward noncombining 2-D tori. In: Lengauer, C., et al. (eds.): Euro-Par'97 Parallel Processing, LNCS 1300, pages 234–241. Springer, 1997. Research report at http://cs.felk.cvut.cz/pcg/.
3. Šoch, M., Tvrdík, P.: Time-optimal gossip in noncombining 2-D tori with constant buffers. Manuscript at http://cs.felk.cvut.cz/pcg/.
Divide-and-Conquer Algorithms on Two-Dimensional Meshes

Miguel Valero-García, Antonio González, Luis Díaz de Cerio, and Dolors Royo

Dept. d'Arquitectura de Computadors - Universitat Politècnica de Catalunya, c/Jordi Girona 1-3, Campus Nord - D6, E-08034 Barcelona, Spain
{miguel,antonio,ldiaz,dolors}@ac.upc.es
Abstract. The Reflecting and Growing mappings have been proposed to map parallel divide-and-conquer algorithms onto two-dimensional meshes. The performance of these mappings has been previously analyzed under the assumption that the parallel algorithm is initiated always at the same fixed node of the mesh. In such a scenario, the Reflecting mapping is optimal for meshes with wormhole routing and the Growing mapping is very close to the optimal for meshes with store-and-forward routing. In this paper we consider a more general scenario in which the parallel divide-and-conquer algorithm can be started at an arbitrary node of the mesh. We propose an approach that is simpler than both the Reflecting and Growing mappings, is optimal for wormhole meshes and better than the Growing mapping for store-and-forward meshes.
1 Introduction
The problem of mapping divide-and-conquer algorithms onto two-dimensional meshes was addressed in [2]. First, a binomial tree was proposed to represent divide-and-conquer algorithms. Then, two different mappings (called the Reflecting and Growing mappings) were proposed to embed binomial trees onto two-dimensional meshes. It was shown that the Reflecting mapping is optimal for wormhole routing since the required communication can be carried out in the minimum number of steps (there are no conflicts in the use of links). On the other hand, the Growing mapping was shown to be very close to the optimal for the case of store-and-forward routing.

The communication performance of the Reflecting and the Growing mappings was analyzed in [2] under the assumption that the divide-and-conquer algorithm is always started at a fixed node of the mesh. In the following, we use the term fixed-root to refer to this particular scenario. In this paper, we consider a more general scenario in which a divide-and-conquer algorithm can be started at any arbitrary node of the mesh. The term arbitrary-root will be used to refer to this scenario. It will be shown that the Reflecting mapping is still optimal for wormhole routing in the arbitrary-root scenario but the performance of the Growing mapping for store-and-forward routing can be very poor in some common cases.
This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-429/95).
An alternative solution for the arbitrary-root scenario is proposed in this paper, which is inspired by previous work on embedding hypercubes onto meshes and tori [1]. The proposed scheme, which will be called the DC-cube embedding, has the following properties: a) it can be applied to meshes with either wormhole or store-and-forward routing, b) it is significantly simpler than the Reflecting and Growing mappings, c) it is optimal for wormhole routing, and d) it is significantly faster than the Growing mapping in some common cases, for store-and-forward routing. A more detailed explanation of the results presented in this paper can be found in [4].

The rest of this paper is organized as follows. Section 2 reviews the Reflecting and Growing mappings [2], which are the basis for our proposal. In Section 3, these mappings are extended to the arbitrary-root scenario. Section 4 presents our proposal. Finally, Section 5 presents a performance comparison of the different approaches and draws the main conclusions.
2 Background
Rajopadhye and Telle [2] propose the use of a binomial tree to represent a divide-and-conquer algorithm. Every node of the binomial tree represents a process that performs the following computations:

1. Receive a problem (of size x) from the parent (the host, if the node is the root of the tree).
2. Solve the problem locally (if "small enough"), or divide the problem into two subproblems, each of size αx, and spawn a child process to solve one of these parts. In parallel, start solving the other part, by repeating step 2.
3. Get the results from the children and combine them. Repeat until all children's results have been combined.
4. Send the results to the parent (the host, if the node is the root).

Step 2 is referred to as the division stage. It has a number of phases corresponding to the different repetitions of step 2 (the number of levels of the binomial tree). The division stage is followed by the combining stage, in which the results of the different subproblems are combined to produce the final result. For the sake of simplicity, as in [2], only the division stage will be considered from now on. Two different values for parameter α (see step 2) will be considered: α = 1 (the problem is replicated) and α = 1/2 (the problem is halved).

The execution of a binomial tree on a two-dimensional mesh can be specified in terms of the embedding of the tree onto the mesh. Two different embeddings were proposed in [2]: the Reflecting mapping and the Growing mapping. It was shown that the Reflecting mapping is optimal for wormhole routing since the communications are carried out without conflicts in the use of the mesh links. On the other hand, the Growing mapping is close to the optimal under store-and-forward routing. These performance properties were derived assuming a fixed-root scenario, that is, the root of the tree is always assigned to the same mesh node.
3 Extending the Reflecting and Growing Mappings to the Arbitrary-Root Scenario
In the following, we consider the case in which the divide-and-conquer algorithm can be started at an arbitrary node (a, b) of the mesh (this is called the arbitrary-root scenario). We have first considered two straightforward approaches to extend the Reflecting and Growing mappings to the arbitrary-root scenario. In approach A we build a binomial tree which is an isomorphism of the binomial tree used in [2] for the fixed-root scenario (every label of the new tree is obtained by a fixed permutation of the bits in the old label). The isomorphism is defined so that the root of the binomial tree is mapped (by the corresponding embedding, Reflecting or Growing) onto node (a, b). In approach B, the whole problem is first moved from node (a, b) to the starting node according to the original proposal (for the fixed-root scenario). These approaches will be compared with our proposal, which is described in the next section.
4 A New Approach for the Arbitrary-Root Scenario
Our approach to performing a divide-and-conquer computation on a mesh under the arbitrary-root scenario is inspired by the technique proposed in [1] to execute a certain type of parallel algorithms (which will be referred to as CC-cube algorithms) on multidimensional meshes and tori. A d-dimensional CC-cube algorithm consists of 2^d processes which cooperate to perform a certain computation and communicate using a d-dimensional hypercube topology (the dimensions of the hypercube are numbered from 0 to d − 1). The operations performed by every process in the CC-cube can be expressed as follows:

do i = 0, d-1
   compute
   exchange information with neighbour in dimension i
enddo

The above case corresponds to a CC-cube which uses the dimensions of the hypercube in increasing order. However, any other ordering in the use of the dimensions is also allowed in CC-cubes. A CC-cube can be executed on a mesh multicomputer by using an appropriate embedding. In particular, the standard and xor embeddings were proposed to map the CC-cube onto multidimensional meshes and tori, respectively. The properties of both embeddings were extensively analyzed in [1,3].

A divide-and-conquer algorithm can be regarded as a particular case of a CC-cube algorithm. This particular case will be referred to as a DC-cube (from Divide-and-Conquer) and has the following peculiarities with regard to CC-cubes:

– a) In every iteration, only a subset of the processes are active (all the processes are active in every iteration of a CC-cube).
– b) Communication between neighbour nodes is always unidirectional (instead of the bidirectional exchange used by CC-cubes).

Fig. 1. The four iterations of a 4-dimensional DC-cube starting at process 0 and using the hypercube dimensions in ascending order.

A particular DC-cube is characterized by: (a) a process which is responsible for starting the divide-and-conquer algorithm, and (b) a certain ordering of the hypercube dimensions, which determines the order in which the processes of the DC-cube are activated. Figure 1 shows an example of a 4-dimensional DC-cube starting at process 0 and using the dimensions in ascending order. In this figure, the arrows represent the communications that are carried out in every iteration of the DC-cube. Finally, it can be shown that a binomial tree with the root labelled l is equivalent to a d-dimensional DC-cube initiated at process l and using the hypercube dimensions in descending order.

The standard embedding has been proposed to map a 2k-dimensional hypercube onto a 2^k × 2^k mesh. The function S which maps a node i (i ∈ [0, 2^{2k} − 1]) of the hypercube onto a node (a, b) (a, b ∈ [0, 2^k − 1]) of the mesh is defined as:

   S(i) = (⌊i / 2^k⌋, i mod 2^k).    (1)

The properties of the standard embedding that are most relevant to this paper are: a) it is an embedding with constant distances (this property means that the neighbors in dimension i of the hypercube are found at a constant distance in the mesh, for any pair of neighbor nodes), and b) it has a minimal average distance. Property (a) is very attractive for the purpose of using the embedding in the arbitrary-root scenario. Property (b) is attractive from the performance point of view.

To start a divide-and-conquer algorithm from an arbitrary node (a, b) of the wormhole mesh, we use a DC-cube initiated at process S^{−1}((a, b)) = a · 2^k + b, and using the dimensions of the hypercube in descending order. It is easy to see that the standard embedding of such a DC-cube is conflict free, and therefore the approach is optimal. The formal proof of this property is very similar to the
The standard embedding can be easily extended to the general case of C-dimensional meshes. This extension is however out of the scope of this paper.
Table 1. Comparison of approaches

       ts              te (general α)                                                     te (α = 1)       te (α = 0.5)
GA   2^{k+1} − 2     (α + α²)·2^{k−1} + (α³ + α⁴)·((2α²)^{k−1} − 1)/(2α² − 1)             2^{k+1} − 2     (3/8)(2^k + 1 − 1/2^{k−1})
GB   3·2^{k−1} − 1   2^{k−1} − 1 + α + α² + (α³ + α⁴)·((2α²)^{k−1} − 1)/(2α² − 1)         3·2^{k−1} − 1   2^{k−1} + 1/8 − 3/2^{k+2}
S    2^{k+1} − 2     (α + α²)·((2α²)^k − 1)/(2α² − 1)                                     2^{k+1} − 2     (3/2)(1 − 1/2^k)
proof given in [2] for the case of the Reflecting mapping under the fixed-root scenario. To start a divide-and-conquer algorithm from an arbitrary node (a, b) of the store-and-forward mesh, we use a DC-cube initiated at process S^{−1}((a, b)) = a · 2^k + b, and using the dimensions of the hypercube in ascending order. In this way, the distances corresponding to the first iterations of the computation (involving larger messages) are smaller.
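To make the embedding concrete, the following Python sketch (our own illustration, assuming eq. (1) and the dimension orderings described above; the function names are not from the paper) computes S, its inverse, and, for a DC-cube started at an arbitrary mesh node, the hypercube dimension used in each iteration together with the mesh distance to the neighbour reached through that dimension.

def S(i, k):
    """Standard embedding of hypercube node i onto the 2^k x 2^k mesh, eq. (1)."""
    return (i // 2**k, i % 2**k)

def S_inv(a, b, k):
    """Inverse of the standard embedding: S^{-1}((a, b)) = a * 2^k + b."""
    return a * 2**k + b

def dc_cube_schedule(a, b, k, ascending=True):
    """For a DC-cube started at mesh node (a, b), list (dimension, mesh distance
    to the hypercube neighbour in that dimension) for every iteration."""
    start = S_inv(a, b, k)
    dims = range(2 * k) if ascending else range(2 * k - 1, -1, -1)
    sched = []
    for d in dims:
        neigh = start ^ (1 << d)                 # hypercube neighbour in dimension d
        (x0, y0), (x1, y1) = S(start, k), S(neigh, k)
        sched.append((d, abs(x0 - x1) + abs(y0 - y1)))
    return sched

# Example: 16-node mesh (k = 2); store-and-forward uses ascending dimensions,
# so the short distances come first, when the messages are largest.
print(dc_cube_schedule(1, 2, k=2, ascending=True))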
5 Comparison of Approaches and Conclusions
As a general consideration, it can be said that the standard embedding is significantly simpler than both the Reflecting and the Growing mapping. When considering the wormhole arbitrary-root scenario, both approach A for the Reflecting mapping and the standard embedding of DC-cubes are optimal. The comparison of approaches for the store-and-forward arbitrary-root scenario is summarized in Table 1, in terms of the average communication cost (GA for approach A, GB for approach B and S for the standard embedding of the DC-cube). The average cost GA is defined as:

   GA = (1/2^{2k}) · Σ_{a=0}^{2^k − 1} Σ_{b=0}^{2^k − 1} G^{a,b} ,    (2)
where G^{a,b} is the cost of the division stage when the computation is initiated at node (a, b). This cost is defined in terms of the startup cost incurred in every communication between neighbor nodes (denoted by ts) and the transmission time per message size unit (denoted by te). The average costs GB and S are defined in a similar way. Two particular cases of the term affecting te are distinguished, corresponding to α = 1 and α = 0.5.

The expressions in Table 1 can be compared in two different cases. In the "small volume" case, the communication cost is assumed to be dominated by the term affecting ts (this will happen when ts/te is large and/or the problem is small). In the "large volume" case, the cost is assumed to be dominated by the term affecting te (this will happen when ts/te is small and/or the problem is large). The conclusions drawn from Table 1 are:

– a) In the "small volume" case, approach B is the best, since: GA = S = (4/3)GB.
– b) In the "large volume" case and α = 1, the conclusion is exactly the same as in case (a).
– c) In the "large volume" case and α = 0.5, approach S is significantly better than the rest, since:

   GA = (2^{k−2} + 1/2) · S   and   GB = (2^k/3 + 5/12) · S .    (3)

Note that cases (a) and (c) are expected to be the most frequent since a parallel computer is targeted at solving large problems. Besides, they are also the most relevant since they may be very time consuming. Note that in case (c) the improvement of the standard embedding of the DC-cube over approaches A and B is proportional to the number of nodes and therefore it is very high for large systems.
References

1. González, A., Valero-García, M., Díaz de Cerio, L.: Executing Algorithms with Hypercube Topology on Torus Multicomputers. IEEE Transactions on Parallel and Distributed Systems 8 (1995) 803–814
2. Lo, V., Rajopadhye, S., Telle, J.A.: Parallel Divide and Conquer on Meshes. IEEE Transactions on Parallel and Distributed Systems 10 (1996) 1049–1057
3. Matic, S.: Emulation of Hypercube Architecture on Nearest-Neighbor Mesh-Connected Processing Elements. IEEE Transactions on Computers 5 (1990) 698–700
4. Valero-García, M., González, A., Díaz de Cerio, L., Royo, D.: Divide-and-Conquer Algorithms on Two-Dimensional Meshes. Research Report UPC-DAC-1997-30, http://www.ac.upc.es/recerca/reports/INDEX1997DAC.html
All-to-All Scatter in Kautz Networks
Petr Salinger and Pavel Tvrdík

Department of Computer Science and Engineering, Czech Technical University, Karlovo nám. 13, 121 35 Prague, Czech Republic
{salinger,tvrdik}@sun.felk.cvut.cz
Abstract. We give lower and upper bounds on the average distance of the Kautz digraph. Using this result, we give a memory-optimal and asymptotically time-optimal algorithm for All-to-All-Scatter in noncombining all-port store-&-forward Kautz networks.
1 Introduction
Shuffle-based interconnection networks, which include shuffle-exchange, de Bruijn, and Kautz networks, are alternatives to orthogonal (meshes, tori, hypercubes) and tree-based networks. Binary shuffle-based networks have been shown to have similar properties as fixed-degree hypercubic networks (butterflies, CCCs), but general de Bruijn and Kautz networks form a special family of interconnection topologies whose properties are still poorly understood. Recently, the interest in these networks has come from designers of optical interconnects [2,3]. In this paper, we focus on d-ary Kautz digraphs. We give lower and upper bounds on the average distance of the Kautz digraph. Using this result, we give a memory-optimal and asymptotically time-optimal algorithm for All-to-All-Scatter in noncombining all-port Kautz networks.

The vertex (arc) set of a graph G is denoted by V(G) (A(G), respectively). The alphabet of d letters, {0, 1, . . . , d − 1}, is denoted by Zd. For d ≥ 2 and D ≥ 2, the Kautz digraph of degree d and diameter D, K(d, D), has vertex set V(K(d, D)) = {x1 . . . xD ; xi ∈ Zd+1 ∧ xi ≠ xi+1 ∀i ∈ {1, . . . , D − 1}} and there is an arc from vertex x = x1 . . . xD to vertex y = y1 . . . yD, x, y ∈ V(K(d, D)), if xi+1 = yi ∀i ∈ {1, . . . , D − 1}. The arc from vertex x1 . . . xD to x2 . . . xD xD+1 is denoted by x1 . . . xD xD+1.
2 The Average Distance of Kautz Digraphs
Let N = |V(G)|. The distance between vertices x, y ∈ V(G) is denoted by d(x, y). The average distance of x ∈ V(G), d(x), is the average of the distances between x and all the vertices of G (including x): d(x) = (1/N) Σ_{y∈V(G)} d(x, y). The average
This research has been supported by GAČR grant No. 102/97/1055 and FRVŠ grant No. 1251/98.
distance of G is defined as the average of the average distances of all vertices of G: d(G) = (1/N) Σ_{x∈V(G)} d(x).

In K(d, D), given vertices x = x1 . . . xD and y = y1 . . . yD, there is a unique shortest dipath from x to y. To calculate d(x, y) and to construct the shortest path from x to y, we have to find the largest overlap of x and y, i.e., the smallest i such that xi+1 . . . xD = y1 . . . yD−i. Then d(x, y) = i and the shortest path is obtained by performing left shifts introducing successively yD+1−i, . . . , yD. This fact allows us to easily compute d(x) for any x and consequently d(K(d, D)). In this section, we give tight upper and lower bounds on d(K(d, D)). Our analysis is based on shortest-path spanning trees, similarly to de Bruijn networks [1]. The upper (lower) bound on the average distance is reached when all vertices are as far (close) as possible from the root.

Lemma 1. In K(d, D), for any vertex x,

   D − 1/((d − 1)(1 − 1/d²)) + (D/d + 1/(d − 1))/(d^{D−2}(d² − 1)) ≤ d(x) ≤ D − 1/(d − 1) + 2/((d² − 1)d^{D−1}).

There are (d + 1)d vertices matching the upper bound. These are all vertices formed by alternating a and b, a ≠ b ∈ Zd+1. There are (d + 1)d(d − 1)^{D−2} vertices matching the lower bound. These are all vertices x = x1 . . . xD where xi ≠ xD for all i, 1 ≤ i ≤ D − 1.
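The distance rule above is easy to turn into code. The following Python sketch (ours, not from the paper; brute force and only feasible for small d and D) enumerates V(K(d, D)), computes d(x, y) via the largest overlap, and evaluates the average distance of a vertex, which can be used to check the bounds of Lemma 1 on small instances.

from itertools import product

def kautz_vertices(d, D):
    """All words x1..xD over Z_{d+1} with consecutive letters distinct."""
    return [w for w in product(range(d + 1), repeat=D)
            if all(w[i] != w[i + 1] for i in range(D - 1))]

def kautz_distance(x, y):
    """d(x, y): the smallest i such that x_{i+1}..x_D = y_1..y_{D-i}."""
    D = len(x)
    for i in range(D + 1):
        if x[i:] == y[:D - i]:
            return i
    return D

def average_distance(x, vertices):
    return sum(kautz_distance(x, y) for y in vertices) / len(vertices)

# Example: K(2, 3); the alternating vertex (0,1,0) has the largest average
# distance, the vertex (1,2,0) (x_i != x_D for i < D) the smallest.
V = kautz_vertices(2, 3)
print(len(V), average_distance((0, 1, 0), V), average_distance((1, 2, 0), V))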
3 Memory-Optimal and Asymptotically Time-Optimal AAS in Kautz Digraphs
All-to-all scatter (AAS), also called complete exchange, is a fundamental collective communication pattern. Every vertex has a special packet for every other vertex and all vertices start scattering simultaneously. Most of the published AAS algorithms assume that in one communication step, vertices can combine individual packets arbitrarily into longer messages and can recombine incoming compound messages into new compounds. This assumption is not very realistic, since it implies substantial overhead and compound messages can become very large. Hence, we assume the all-port noncombining lock-step model. Every packet travels in the network as a separate entity. All vertices start the scattering simultaneously and transmit packets with the same speed in steps. A router receives packets from all input links in one step, stores those destined for the local processor into the local processor memory, and sends out the remaining ones, together with those ejected by the local processor, using all output links in the next step. We assume store-and-forward hardware: every input and output buffer of routers can store one packet. If a packet cannot be retransmitted in the next step, it must be stored in an auxiliary buffer of the router. An AAS algorithm is said to be memory-optimal if the routers require auxiliary buffers for O(1) packets.

The AAS algorithm is based on lexicographic labeling of input and output arcs of vertices and on greedy routing. To describe these notions, we need three auxiliary functions.
Definition 1. For any a ∈ Zd+1 and b ∈ Zd:
   σ(a, b) = b if a > b, and σ(a, b) = b + 1 otherwise (a ≤ b).
For a ≠ b ∈ Zd+1:
   ξ(b, a) = µ(a, b) = b if a > b, and ξ(b, a) = µ(a, b) = b − 1 if a < b.
A Kautz vertex x1 . . . xD is adjacent from d vertices σ(x1, i)x1 . . . xD−1, i = 0, . . . , d − 1. The arc σ(x1, i)x1 . . . xD, denoted by x↓i, has input label i. Similarly, vertex x1 . . . xD is adjacent to d vertices x2 . . . xD σ(xD, j), j = 0, . . . , d − 1. The arc x1 . . . xD σ(xD, j), denoted by x↑j, has output label j. Hence arc x1 . . . xD+1 of K(d, D) has input label ξ(x1, x2) and output label µ(xD, xD+1), and therefore x1 . . . xD+1 = (x2 . . . xD+1)↓ξ(x1, x2) = (x1 . . . xD)↑µ(xD, xD+1).

Every Kautz vertex x = x1 . . . xD is the root of a "virtual" complete d-ary tree of height D, whose level l, 1 ≤ l ≤ D − 1, consists of all Kautz vertices xl+1 . . . xD y1 . . . yl, constructed from x1 . . . xD by l left shift operations, and whose level-D vertices are all Kautz vertices y1 . . . yD, y1 ≠ xD. This tree is called the greedy routing virtual tree rooted in x, denoted by GRT(x). Obviously, some Kautz vertices appear more than once in GRT(x); this depends on the periodicity of x. All GRT(x) have the same output arc labeling.

The AAS algorithm consists of two similar phases. The first phase has d^{D−1} rounds. In each round r, 1 ≤ r ≤ d^{D−1}, of the first phase, every root x = x1 . . . xD inserts a d-tuple of packets into GRT(x), one packet per output arc, and the packets move down the tree using D-step greedy routing. Let y = y1 . . . yD be a destination vertex, y1 ≠ xD. The D-step greedy routing in GRT(x) sends packets from x to y via vertices x2 . . . xD y1, x3 . . . xD y1 y2, . . . , xD y1 . . . yD−1. Transmission of packets from vertices xi . . . xD y1 . . . yi−1 to vertices xi+1 . . . xD y1 . . . yi, 1 ≤ i ≤ D, is said to be the i-th step of round r. (We put xD+1 ≡ y1.) The core of the AAS algorithm is the pair of functions Φ1 and Φ2.

Definition 2. Let x = x1 . . . xD ∈ V(K(d, D)), 0 ≤ r ≤ d^{D−1} − 1, 0 ≤ i ≤ d − 1. Then Φ1(x, r, i) = y1^{r,i} . . . yD^{r,i}, where
   y1^{r,i} = σ(xD, i),
   y_{k+1}^{r,i} = σ(y_k^{r,i}, q_k^r), k = 1, . . . , D − 1,
   q_k^r = (ξ(xk, xk+1) + r div d^{k−1}) mod d, k = 1, . . . , D − 1.
Let 0 ≤ r ≤ d^{D−2} − 1. Then Φ2(x, r, i) = z1^{r,i} . . . zD^{r,i}, where
   z1^{r,i} = xD,
   z_{k+1}^{r,i} = y_k^{r,i}, k = 1, . . . , D − 1.

The router in Kautz vertex x = x1 . . . xD executes in the first phase of the AAS protocol the following algorithm:

for r = 0 to d^{D−1} − 1 do sequentially {
   for i = 0 to d − 1 do in parallel {
      take the packet destined for vertex Φ1(x, r, i) from the local memory;
      apply the first step of the D-step greedy routing };
   for j = 2 to D do sequentially
      for i = 0 to d − 1 do in parallel {
         receive a packet from arc x↓i;
         apply the j-th step of the D-step greedy routing };
   for i = 0 to d − 1 do in parallel {
      receive a packet from arc x↓i;
      store the packet into the local processor memory } };

The second phase has d^{D−2} rounds and uses (D − 1)-step greedy routing from x = x1 . . . xD to xD y1 . . . yD−1, which is defined similarly, using function Φ2 instead of Φ1.

Theorem 1. The AAS algorithm is memory-optimal and contention-free. Every arc of K(d, D) in every step of every round is used for transmitting one packet. The algorithm is asymptotically time- and transmission-optimal.

Proof. It follows from Definition 2 that for all i and for any two rounds r1 ≠ r2, Φ1(x, r1, i) ≠ Φ1(x, r2, i). Also Φ1(x, r1, i1) ≠ Φ1(x, r1, i2) for any i1 ≠ i2. We need just to show that the communication is contention-free. Consider round r of the first phase and root x. The packet destined for Φ1(x, r, i) = y1^{r,i} . . . yD^{r,i} is issued via the i-th output arc of x. The first step is contention-free and every vertex receives one packet per input arc. Vertex x2 . . . xD y1^{r,i}, where y1^{r,i} = σ(xD, i), gets the packet from its input arc with input label ξ(x1, x2), and greedy routing forwards this packet to vertex x3 . . . xD y1^{r,i} y2^{r,i}, where y2^{r,i} = σ(y1^{r,i}, (ξ(x1, x2) + r div d^0) mod d) = σ(y1^{r,i}, (ξ(x1, x2) + r) mod d). Hence, vertex x2 . . . xD y1^{r,i} will use output arc µ(y1^{r,i}, σ(y1^{r,i}, (ξ(x1, x2) + r) mod d)) = (ξ(x1, x2) + r) mod d. This is clearly a permutation of input arcs to output arcs. The same argument applies to any step j, j = 3, . . . , D, of round r. The same argument can be used to prove that the second-phase communication is also contention-free.

The total number of steps of the AAS algorithm is clearly τ(d, D) = D·d^{D−1} + (D − 1)·d^{D−2}. The lower bound on the number of steps of AAS in a digraph G is

   τAAS(G) = d(G)·|V(G)|² / |A(G)|.

Therefore

   τ(d, D)/τAAS(K(d, D)) = (D·d^{D−1} + (D − 1)·d^{D−2}) / (d(K(d, D))·(d + 1)·d^{D−2}) ≤ 1 + (2d − 1)/(D(d + 1)(d − 1)² − d²).

Let Dmin(d) be the smallest diameter such that the slowdown of the AAS algorithm in K(d, D) is at most 5 % for any D ≥ Dmin(d). Then Dmin(2) = 22, Dmin(3) = 7, Dmin(4) = 4, and for d ≥ 5, Dmin(d) = 2.
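For concreteness, a small Python sketch of the first-phase destination function is given below. It follows our reading of Definitions 1 and 2; the function names, case forms of σ and ξ, and the example instance are ours, so this is an illustrative sketch rather than a definitive implementation. It produces, for a vertex x and round r, the d destinations that x scatters to via its output arcs.

def sigma(a, b):
    """Definition 1: maps b in Z_d to a letter of Z_{d+1} different from a."""
    return b if a > b else b + 1

def xi(b, a):
    """Input-label inverse of sigma: sigma(a, xi(b, a)) == b for b != a."""
    return b if a > b else b - 1

def phi1(x, r, i, d):
    """Phi_1(x, r, i): destination scattered in round r via output arc i of x."""
    D = len(x)
    y = [sigma(x[D - 1], i)]                        # y_1 = sigma(x_D, i)
    for k in range(1, D):                           # k = 1 .. D-1
        q = (xi(x[k - 1], x[k]) + r // d**(k - 1)) % d
        y.append(sigma(y[-1], q))                   # y_{k+1} = sigma(y_k, q_k)
    return tuple(y)

# Example: K(2, 3); the two destinations of vertex (1, 2, 0) in round 0
x, d = (1, 2, 0), 2
print([phi1(x, 0, i, d) for i in range(d)])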
4 Conclusions
We have given tight lower and upper bounds on the average distance in K(d, D). They are very close to the diameter and can be approximated by D − 1/(d − 1). We have also designed a memory-optimal and asymptotically time- and
transmission-optimal all-to-all scatter algorithm in noncombining all-port K(d, D). It is based on greedy routing which treats the Kautz networks as if they were vertex-transitive. Every vertex becomes the root of a virtual complete d-ary tree. For d ≥ 5, the slowdown of the algorithm is within 5% for any diameter D.
References

1. Bermond, J.-C., Liu, Z., Syska, M.: Mean eccentricities of de Bruijn networks. Networks, 30:187–203, 1997.
2. Liu, G., Lee, K. Y., Jordan, H. F.: TDM and TWDM de Bruijn networks and ShuffleNets for optical communications. IEEE Transactions on Computers, 46:695–701, 1997.
3. Sivarajan, K., Ramaswami, R.: Lightwave networks based on de Bruijn graphs. IEEE/ACM Trans. Networking, 2(1):70–79, 1994.
Reactive Proxies: A Flexible Protocol Extension to Reduce ccNUMA Node Controller Contention

Sarah A. M. Talbot and Paul H. J. Kelly

Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, United Kingdom
{samt, phjk}@doc.ic.ac.uk
Abstract. Serialisation can occur when many simultaneous accesses are made to a single node in a distributed shared-memory multiprocessor. In this paper we investigate routing read requests via an intermediate proxy node (where combining is used to reduce contention) in the presence of finite message buffers. We present a reactive approach, which invokes proxying only when contention occurs, and does not require the programmer or compiler to mark widely-shared data. Simulation results show that the hot-spot contention which occurs in pathological examples can be dramatically reduced, while performance on well-behaved applications is unaffected.
1 Introduction
Unpredictable performance anomalies have hampered the acceptance of cache-coherent non-uniform memory access (ccNUMA) architectures. Our aim is to improve performance in certain pathological cases, without reducing performance on well-behaved applications, by reducing the bottlenecks associated with widely-shared data. This paper moves on from our initial work on proxy protocols [1], eliminating the need for application programmers to identify widely-shared data.

Each processor's memory and cache is managed by a node controller. In addition to local memory references, the controller must handle requests arriving via the network from other nodes. These requests concern cache lines currently owned by this node, cache line copies, and lines whose home is this node (i.e. the page holding the line was allocated to this node, by the operating system, when it was first accessed). In large configurations, unfortunate ownership migration or home allocations can lead to concentrations of requests at particular nodes. This leads to performance being limited by the service rate (occupancy) of an individual node controller, as demonstrated by Holt et al. [6].

Our proxy protocol, a technique for alleviating read contention, associates one or more proxies with each data block, i.e. nodes which act as intermediaries for reads [1]. In the basic scheme, when a processor suffers a read miss, instead
of directing its read request directly to the location's home node, it sends it to one of the location's proxies. If the proxy has the value, it replies. If not, it forwards the request to the home: when the reply arrives it can be forwarded to all the pending proxy readers and can be retained in the proxy's cache.

Fig. 1. Contention is reduced by routing reads via a proxy. (a) Without proxies. (b) With two proxy clusters (read Line l). (c) Read next data block (i.e. read Line l + 1).

The main contribution of this paper is to present a reactive version, which uses proxies only when contention occurs, and does not require the application programmer (or compiler) to identify widely-shared data. The rest of the paper is structured as follows: reactive proxies are introduced in Section 2. Our simulated architecture and experimental design are outlined in Section 3. In Section 4, we present the results of simulations of a set of standard benchmark programs. Related work is discussed in Section 5, and in Section 6 we summarise our conclusions and give pointers to further work.
2 Reactive Proxies
The severity of node controller contention is both application and architecture dependent [6]. Controllers can be designed so that there is multi-threading of requests (e.g. the Sun S3.mp is able to handle two simultaneous transactions [12]) which slightly alleviates the occupancy problem but does not eliminate it. Some contention is inevitable, and will increase the latency of transactions. The key problem is that queue lengths at controllers, and hence contention, are nonuniformly distributed around the machine. One way of reducing the queues is to distribute the workload to other node controllers, using them as proxies for read requests, as illustrated in Fig. 1. When a processor makes a read request, instead of going directly to the cache line’s home, it is routed first to another node. If the proxy node has the line, it replies directly. If not, it requests the value from the home itself, allocates it in its own cache, and replies. Any requests for a particular block which arrive at a proxy before it has obtained a copy from the home node, are added to a distributed chain of pending requests for that block, and the reply is forwarded down the pending chain, as illustrated in Fig. 2. It should be noted that write requests are
not affected by the use of proxies, except for the additional invalidations that may be needed to remove proxy copies (which will be handled as a matter of course by the underlying protocol).

Fig. 2. Combining of proxy requests. (a) First request to proxy has to be forwarded to the home node: 1. client makes read request (proxy_read_request); 2. request is sent on to the home (read_request). (b) Second client request, before data is returned, forms pending chain: 3. client makes read request (proxy_read_request); 4. client receives pointer to 2nd client (take_hole). (c) Data is passed to each client on the pending chain: 5. data supplied to the proxy (take_shared); 6. data supplied to Client 1 (take_shared); 7. data supplied to Client 2 (take_shared).

The choice of proxy node can be at random, or (as shown in Fig. 1) on the basis of locality. To describe how a client node decides which node to use as a proxy for a read request, we begin with some definitions:

– P: the number of processing nodes.
– H(l): the home node of location l. This is determined by the operating system's memory management policy.
– NPC: the number of proxy clusters, i.e. the number of clusters into which the nodes are partitioned for proxying (e.g. in Fig. 1, NPC = 2). The choice of NPC depends on the balance between the degree of combining and the length of the proxy pending chain. NPC = 1 will give the highest combining rate, because all proxy read requests for a particular data block will be directed to the same proxy node. As NPC increases, combining will reduce, but the number of clients for each proxy will also be reduced, which will lead to shorter proxy pending chains.
– PCS(C): the set of nodes which are in the cluster containing client node C. In this paper, PCS(C) is one of NPC disjoint clusters, each containing P/NPC nodes, with the grouping based on node number.
– PN(l, C): the proxy node chosen for a given client node C when reading location l. We use a simple hash function to choose the actual proxy from the proxy cluster PCS(C). If PN(l, C) = C, or PN(l, C) = H(l), then client C will send a read request directly to H(l).

The choice of proxy node is, therefore, a two-stage process. When the system is configured, the nodes are partitioned into NPC clusters. Then, whenever a client wants to issue a proxy read, it will use the hashing function PN(l, C) to select one proxy node from PCS(C). This mapping ensures that requests for a given location are routed via a proxy (so that combining occurs), and that reads for successive data blocks go to different proxies (as illustrated in Fig. 1(c)). This will reduce network contention [15] and balance the load more evenly across all the node controllers.

In the basic form of proxies, the application programmer uses program directives to mark data structures: all other shared data will be exempt from proxying [1]. If the application programmer makes a poor choice, then the overheads incurred by proxies may outweigh any benefits and degrade performance. These overheads include the extra work done by the proxy nodes handling the messages, proxy node cache pollution, and longer sharing lists. In addition, the programmer may fail to mark data structures that would benefit from proxying.

Reactive proxies overcome these problems by taking advantage of the finite buffering of real machines. When a remote read request reaches a full buffer, it will immediately be sent back across the network. With the reactive proxies protocol, the arrival of a buffer-bounced read request will trigger a proxy read (see Fig. 3). This is quite different to the basic proxies protocol, where the user has to decide whether all or selected parts of the shared data are proxied, and proxy reads are always used for data marked for proxying. Instead, proxies are only used when congestion occurs. As soon as the queue length at the destination node has reduced to below the limit, read requests will no longer be bounced and proxy reads will not be used.

Fig. 3. Bounced read requests are retried via proxies. (a) Input buffer full, some read requests bounce. (b) Reactive proxy reads.
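A minimal sketch of the two-stage choice of PN(l, C) is given below (Python; the modulo-based hash and the node numbering are our own illustrative assumptions — the paper only requires "a simple hash function" over the client's cluster).

def proxy_node(l, client, home, P, NPC):
    """Two-stage proxy choice: pick the client's cluster PCS(client), then
    hash the cache-line address l to one node inside it."""
    cluster_size = P // NPC
    cluster_start = (client // cluster_size) * cluster_size   # PCS(client)
    proxy = cluster_start + (l % cluster_size)                # PN(l, client)
    if proxy in (client, home):
        return home       # degenerate cases: read directly from the home node
    return proxy

# Example: 64 nodes, 2 proxy clusters; successive lines map to different proxies,
# so reads for consecutive data blocks are spread over the cluster.
P, NPC = 64, 2
print([proxy_node(line, client=5, home=40, P=P, NPC=NPC) for line in range(4)])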
The repeated bouncing of read requests which can occur with finite buffers leads to the possibility of deadlock: the underlying protocol has to detect the continuous re-sending of a remote read request, and eventually send a higher priority read request which is guaranteed service. Read requests from proxy nodes to home nodes will still be subject to buffer bouncing, but the combining and re-routing achieved by proxying reduce the chances of a full input buffer at the home node. The reactive proxy scheme has the twin virtues of simplicity and low overheads. No information needs to be held about past events, and no decision is involved in using a proxy: the protocol state machine is just set up to trigger a proxy read request in response to the receipt of a buffer-bounced read request.
3 Simulated Architecture and Experimental Design
In our execution-driven simulations, each node contains a processor with an integral first-level cache (FLC), a large second-level cache (SLC), memory (DRAM), and a node controller (see Fig. 4). The node controller receives messages from, and sends messages to, both the network and the processor. The SLC, DRAM, and the node controller are connected using two decoupled buses. This decoupled bus arrangement allows the processor to access the SLC at the same time as the node controller accesses the DRAM. Table 1 summarises the architecture.

We simulate a simplified interconnection network, which follows the LogP model [3]. We have parameterised the network and node controller as follows:

– L: the latency experienced in each communication event, 10 cycles for long messages (which include 64 bytes of data, i.e. one cache line), and 5 cycles for all other messages. This represents a fast network, comparable to the point-to-point latency used in [11].
– o: the occupancy of the node controller. Like Holt et al. [6], we have adapted the LogP model to recognise the importance of the occupancy of a node controller, rather than just the overhead of sending and receiving messages. The processes which cause occupancy are simulated in more detail (see Table 2).
– g: the gap between successive sends or receives by a processor, 5 cycles.
– P: the number of processor nodes, 64 processing nodes.

We limit our message buffers to eight for read requests. There can be more messages in an input buffer, but once the queue length has risen above eight, all read requests will be bounced back to the sender until the queue length has fallen below the limit. This is done because we are interested in the effect of finite buffering on read requests rather than all messages, and we wished to be certain that all transactions would complete in our protocol. The queue length of √P is an arbitrary but reasonable limit.

Each cache line has a home node (at page level) which: either holds a valid copy of the line (in SLC and/or DRAM), or knows the identity of a node which does have a valid copy (i.e. the owner); has guaranteed space in DRAM for the line; and holds directory information for the line (head and state of the sharing list).
Fig. 4. The architecture of a node. (Components shown: CPU, FLC, SLC bus, SLC, node controller, MEM bus, DRAM, network buffers, network controller, interconnect.)

Table 1. Details of the simulated architecture
CPU: CPI 1.0; instruction set based on DEC Alpha
Instruction cache: all instruction accesses assumed primary cache hits
First-level data cache: capacity 8 Kbytes; line size 64 bytes; direct mapped, write-through
Second-level cache: capacity 4 Mbytes; line size 64 bytes; direct mapped, write-back
DRAM: capacity infinite; page size 8 Kbytes
Node controller: non-pipelined; service time and occupancy, see Table 2; cycle time 10 ns
Interconnection network: topology full crossbar; incoming message queues, 8 read requests
Cache coherence protocol: invalidation-based, sequentially-consistent ccNUMA; home nodes assigned to the first node to reference each page (i.e. "first-touch-after-initialisation"); distributed directory, using a singly-linked sharing list; based on the Stanford Distributed-Directory Protocol, described by Thapar and Delagi [14]
Table 2. Latencies of the most important node actions
Operation               Time (cycles)
Acquire SLC bus         2
Release SLC bus         1
SLC lookup              6
SLC line access         18
Acquire MEM bus         3
Release MEM bus         2
DRAM lookup             20
DRAM line access        24
Initiate message send   5
Table 3. Benchmark applications
Application        Problem size        Shared data marked for basic proxying
Barnes             16K particles       all
CFD                64 x 64 grid        all
FFT                64K points          all
FMM                8K particles        f array (part of G Memory)
GE                 512 x 512 matrix    entire matrix
Ocean-Contig       258 x 258 ocean     q multi and rhs multi
Ocean-Non-Contig   258 x 258 ocean     fields, fields2, wrk, and frcng
Water-Nsq          512 molecules       VAR and PFORCES
The distributed directory holds the identities of nodes which have cached a particular line in a sharing chain, currently implemented as a singly-linked list. The directory entry for each data block provides the basis for maintaining the consistency of the shared data. Only one node at a time can remove entries from the sharing chain (achieved by locking the head of the sharing chain at the home node), and messages which prompt changes to the sharing chain are ordered by their arrival at the home node. This mechanism is not affected by the protocol additions needed to support proxies. Proxy nodes require a small amount of extra store to be added to the node controller. Specifically, we need to be able to identify which data lines have outstanding transactions (and the tags they refer to), and to record the identity of the head of the pending proxy chain (a possible layout for this state is sketched at the end of this section). In addition, the node controller has to handle the new proxy messages and state changes. We envisage implementing these in software on a programmable node controller, e.g. the magic node controller in Stanford's flash [9], or the sclic in the Sequent numa-q [10].
The benchmarks and their parameters are summarised in Table 3. ge is a simple Gaussian elimination program, similar to that used by Bianchini and LeBlanc in their study of eager combining [2]. We chose this benchmark because it is an example of widely-shared data. cfd is a computational fluid dynamics application, modelling laminar flow in a square cavity with a lid causing friction [13]. We selected six applications from the splash-2 suite, to give a cross-section of scientific shared-memory applications [16]. We used both Ocean benchmark applications, in order to study the effect of proxies on the "tuned for data locality" and "easy to understand" variants. Other work which refers to Ocean can be assumed to be using Ocean-Contig.
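The paper specifies what per-line and per-proxy state must be held (the head and state of the sharing chain, a lock on the chain head, the tags of lines with outstanding proxy transactions, and the head of the pending proxy chain) but not its layout; the following C sketch is one assumed layout, with invented field names and widths.

#include <stdint.h>
#include <stdbool.h>

/* Per-line directory entry at the home node (singly-linked sharing chain). */
struct dir_entry {
    uint8_t  state;        /* coherence state of the line                 */
    uint16_t sharing_head; /* node id at the head of the sharing chain    */
    bool     head_locked;  /* only one node may unlink entries at a time  */
};

/* Extra state per outstanding proxy transaction at a proxy node.          */
struct proxy_entry {
    bool     valid;        /* an outstanding proxy transaction exists     */
    uint64_t line_tag;     /* which data line the transaction refers to   */
    uint16_t pending_head; /* head of the pending proxy chain             */
};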
4 Experimental Results
In this work, we concentrate on reactive proxies, but compare the results with basic proxies (which have already been examined in [1]). The performance results for each application are presented in Table 4 in terms of relative speedup with no proxying (i.e. the ratio of the execution time on 1 processor to the execution time on 64 processing nodes), and percentage changes in execution time when proxies are used. The problem size is kept constant. The relative-changes results in Fig. 5 show three different metrics:
Table 4. Benchmark performance for 64 processing nodes
Relative speedup is with no proxies; the remaining columns give the % change in execution time (+ is better, - is worse) for N PC = 1 to 8.

Application        Speedup  Proxy type  NPC=1  NPC=2  NPC=3  NPC=4  NPC=5  NPC=6  NPC=7  NPC=8
Barnes             43.2     basic         0.0   -0.1    0.0    0.0   -0.2   +0.3   -0.2   -0.1
                            reactive     +0.3   +0.2   +0.3   +0.2   +0.1   -0.1    0.0   +0.3
CFD                30.6     basic        +6.6   +7.7   +6.4   +8.3   +6.7   +5.8   +6.0  +10.2
                            reactive     +5.6   +5.4   +4.7   +4.5   +5.6   +4.1   +4.1   +4.8
FFT                47.4     basic        +9.3   +9.0   +9.8   +9.4   +9.2   +8.8   +8.9   +8.9
                            reactive    +11.5    +11  +10.8  +10.8  +11.1  +11.6  +11.0  +10.6
FMM                36.1     basic        +0.2   +0.2   +0.2   +0.2   +0.2   +0.2   +0.2   +0.2
                            reactive     +0.4   +0.3   +0.3   +0.4   +0.3   +0.4   +0.3   +0.3
GE                 22.0     basic       +28.7  +28.7  +28.7  +28.7  +28.7  +28.8  +28.8  +28.7
                            reactive    +23.3  +22.9  +22.3  +21.4  +21.5  +21.4  +21.7  +21.5
Ocean-Contig       48.9     basic        -3.2   +2.6   +0.1   -0.2   -0.8   -2.0   -2.1   -1.8
                            reactive     -0.2    0.0   -0.1   -0.3   +0.2     +2   +1.3   +2.1
Ocean-Non-Contig   50.5     basic        -0.3   +1.6   -1.7   +4.1   +1.0   -0.4   +0.2   +4.2
                            reactive     +3.1   +1.4   +1.4   +0.8   +5.1   +1.8   +1.2     +5
Water-Nsq          55.5     basic        -0.7   -0.6   -0.6   -0.5   -0.5   -0.5   -0.7   -0.5
                            reactive     +0.2   +0.2   +0.2   +0.2   +0.2   +0.2   +0.1   +0.2
– messages: the ratio of the total number of messages to the total without proxies,
– execution time: the ratio of the execution time (excluding startup) to the execution time (also excluding startup) without proxies, and
– queueing delay: the ratio of the total time that messages spend waiting for service to the total without proxies.
The message ratios shown in Fig. 6 are:
– proxy hit rate: the ratio of the number of proxy read requests which are serviced directly by the proxy node, to the total number of proxy read requests (in contrast, a proxy miss would require the proxy to request the data from the home node),
– remote read delay: the ratio of the delay between issuing a read request and receiving the data, to the same delay when proxies are not used,
– buffer bounce ratio: the ratio of the total number of buffer bounce messages to read requests. This gives a measure of how much bouncing there is for an application. This ratio can go above one, since only the initial read request is counted in that total, i.e. the retries are excluded, and
– proxy read ratio: the ratio of the proxy read messages to read requests. This gives a measure of how much proxying is used in an application.
The first point to note from Table 4 is that there is no overall "winner" between basic and reactive proxies, in that neither policy improves the performance of all the applications for all values of proxy clusters.
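Read concretely, the ratios involving read requests are computed from raw message counts roughly as follows; the counter and function names are invented for illustration.

/* Ratios from raw message counts; names are illustrative only. */
struct run_counts {
    double read_requests;   /* initial read requests (retries excluded)   */
    double buffer_bounces;  /* read requests returned by a full buffer    */
    double proxy_reads;     /* read requests sent via a proxy node        */
    double proxy_hits;      /* proxy reads serviced directly by the proxy */
};

static double buffer_bounce_ratio(const struct run_counts *c)
{ return c->buffer_bounces / c->read_requests; }   /* may exceed 1.0 */

static double proxy_read_ratio(const struct run_counts *c)
{ return c->proxy_reads / c->read_requests; }

static double proxy_hit_rate(const struct run_counts *c)
{ return c->proxy_hits / c->proxy_reads; }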
Fig. 5. Relative changes for 64 processing nodes with reactive proxies. (One panel per benchmark: barnes, cfd, fft, fmm, ge, ocean-contig, ocean-non-contig, water-nsq; each plots the messages, execution time, and queueing delay ratios against the number of proxy clusters, 0 to 8.)
Looking at the results for different values of N PC, for basic proxies there is no value which has a positive effect on the performance of all the benchmarks. However, for reactive proxies, there are two proxy cluster values that improve the performance of all the benchmarks, i.e. N PC=5 and 8 achieve a balance between combining, queue distribution, and the length of the proxy pending chains. Reactive proxies may not always deliver the best performance improvement, but by providing stable points for N PC they are of more use to system designers. It should also be noted that, in general, using reactive proxies reduces the number of messages, because they break the cycle of re-sending read messages in response to a finite buffer bounce (see Fig. 5). Looking at the individual benchmarks:
Barnes. In general, this application benefits from the use of reactive proxies. However, changing the balance of processing by routing read requests via proxy nodes can have more impact than the direct effects of reducing home node congestion. Two examples illustrate this: when N PC=6 for basic proxies, load miss delay is the same as with no proxies, and store miss delay has increased slightly from the no-proxy case, yet a reduction of 0.4% in lock and barrier delays results in an overall performance improvement of 0.3%.
Fig. 6. Message ratios for 64 processing nodes with reactive proxies. (One panel per benchmark; each plots proxy hit rate, remote read delay, buffer bounce ratio, and proxy read ratio against the number of proxy clusters, 0 to 8.)
Conversely, for reactive proxies when N PC=6, the load and store miss delays are the same as when proxies are not used, but a slight 0.1% increase in lock delay gives an overall performance degradation of -0.1%.
CFD. This application benefits from the use of reactive proxies, with performance improvements in the range 4.1% to 5.6%. However, the improvements are not as great as those obtained with basic proxies. The difference is attributable to the delay in triggering each reactive proxy read: for this application it is better to use proxy reads straight away, rather than waiting for read requests to be bounced. It should also be noted that the proxy hit rate oscillates, with peaks at N PC=2,4,8 (see Fig. 6). This is due to a correspondence between the chosen proxy node and ownership of the cache line.
FFT. This shows a marked speedup when reactive proxies are used, of between 10.6% and 11.6%. The number of messages decreases with proxies because the buffer bounce ratio is cut from a severe 0.7 with no proxies. The mean queueing delay drops as the number of proxy clusters increases, reflecting the benefit of spreading the proxy read requests. However, this is balanced by a slow increase in the buffer bounce ratio, because as more nodes act as proxy there will be more read requests to the home node, and these read requests will start to bounce as the number of messages in the home node's input queue rises to √P and above.
FMM. There is a marginal speedup compared with no proxies (between 0.3% and 0.4%). This is as expected, given that only the f array (part of G Memory) is known to be widely-shared, which was why it was marked for basic proxies. However, the performance improvement is slightly better than that achieved using basic proxies, so the reactive method dynamically detects opportunities for read combining which were not found by code inspection and profiling tools.
GE. This application, which is known to exhibit a high level of sharing of the current pivot row, shows a large speedup in the range 21.4% to 23.2%. However, the improvement is not as good as that obtained using basic proxies. This was to be expected, because proxying is no longer targeted by marking widely-shared data structures. Instead proxying is triggered when a read is rejected because a buffer is full, and so there will be two messages (the read and buffer bounce) before a proxy read request is sent by the client. It should also be noted that the execution time increases as the number of proxy clusters increases. As the number of nodes acting as proxy goes up, there will be more read requests (from proxies) being sent to the home node, and the read requests are more likely to be bounced, as shown by the buffer bounce ratio for ge in Fig. 6. Finally, the queueing delay is much higher when proxies are in use. This is because without proxies there is a very high level of read messages being bounced (and thus not making it into the input queues). With proxies, the proxy read requests are allowed into the input queues, which increases the mean queue length.
Ocean-Contig. Reactive proxies can degrade the performance of this application (by up to -0.3% at N PC=4), but they achieve performance improvements for more values of N PC than the basic proxy scheme. Unlike basic proxies, reactive proxies reduce the remote read delay by targeting remote read requests that are bounced because of home node congestion. The performance degradation when N PC=1,3,4 is attributable to increased barrier delays caused by the redistribution of messages.
Ocean-Non-Contig. This has a high level of remote read requests. These remote read requests result in a high level of buffer bounces, which in turn invoke the reactive proxies protocol. Unfortunately the data is seldom widely-shared, so there is little combining at the proxy nodes, as is illustrated by the low proxy hit rates. With N PC=4, this results in a concentration of messages at a few nodes, overall latency increases, and the execution time suffers. For N PC=5, the queueing delay is reduced in comparison to the no-proxy case, and this has the best execution time. Given these results, we are carrying out further investigations into the hashing schemes suitable for the PN(l, C) function, and the partitioning strategy used to determine PCS(C), to obtain more reliable performance for applications such as Ocean-Non-Contig.
Water-Nsq. Using reactive proxies gives a small speedup compared to no proxies (around 0.2%). However, this is better than with basic proxies, where performance is always worse (in the range -0.5% to -0.7%, see Table 4). The extremely low proxy read ratios show that there is very little proxying, but the high proxy hit rates indicate that when proxy reads are invoked there is a high level of combining. It is encouraging to see that the proxy read ratio is kept low:
this shows that the overheads of proxying (extra messages, cache pollution) are only incurred when they are needed by an application. To summarise, the results show that for reactive proxies, when the number of proxy clusters (N PC) is set to five or eight, the performance of all the benchmarks improves, i.e. they achieve the best balance between combining, queue length distribution, and the length of the proxy pending chains in our simulated system. This is a very encouraging result, because without marking widely-shared data we have obtained sizeable performance improvements for three benchmarks (ge, fft, and cfd), and had no detrimental effect on the other well-behaved applications. By selecting a suitable N PC for an architecture, the system designers can provide a ccnuma system with more stable performance. This is in contrast to basic proxies, where although better performance can be obtained for some benchmarks, the strategy relies on judicious marking of widely-shared data for each application.
5 Related Work
A number of measures are available to alleviate the effects of contention for a node, such as improving the node controller service rate [11], and combining in the interconnection network for fetch-and-update operations [4]. Architectures based on clusters of bus-based multiprocessor nodes provide an element of read combining since caches in the same cluster snoop their shared bus. Caching extra copies of data to speed-up retrieval time for remote reads has been explored for hierarchical architectures, including [5]. The proxies approach is different because it does not use a fixed hierarchy: instead it allows requests for copies of successive data lines to be serviced by different proxies. Attempts have been made to identify widely-shared data for combining, including the glow extensions to the sci protocol [8,7]. glow intercepts requests for widely-shared data by providing agents at selected network switch nodes. In their dynamic detection schemes, which avoid the need for programmers to identify widely-shared data, agent detection achieves better results than the combining of [4] by using a sliding window history of recent read requests, but does not improve on the static marking of data. Their best results are with program-counter based prediction (which identifies load instructions that suffer very large miss latency) although this approach has the drawback of requiring customisation of the local node cpus. In Bianchini and LeBlanc’s “eager combining”, the programmer identifies specific memory regions for which a small set of server caches are pre-emptively updated [2]. Eager combining uses intermediate nodes which act like proxies for marked pages, i.e. their choice of server node is based on the page address rather than data block address, so their scheme does not spread the load of messages around the system in the fine-grained way of proxies. In addition, their scheme eagerly updates all proxies whenever a newly-updated value is read, unlike our protocol, where data is allocated in proxies on demand. Our less aggressive scheme reduces cache pollution at the proxies.
6 Conclusions
This paper has presented the reactive proxy technique, discussed the design and implementation of proxying cache coherence protocols, and examined the results of simulating eight benchmark applications. We have shown that proxies benefit some applications immensely, as expected, while other benchmarks with no obvious read contention still showed performance gains under the reactive proxies protocol. There is a tradeoff between the flexibility of reactive proxies and the precision (when used correctly) of basic proxies. However, reactive proxies have the further advantage that a stable value of N PC (number of proxy clusters) can be established for a given system configuration. This gives us the desired result of improving the performance of some applications, without affecting the performance of well-behaved applications. In addition, with reactive proxies, the application programmer does not have to worry about the architectural implementation of the shared-memory programming model. This is in the spirit of the shared-memory programming paradigm, as opposed to forcing the programmer to restructure algorithms to cater for performance bottlenecks, or marking data structures that are believed to be widely-shared. We are currently doing work based on the Ocean-Non-Contig application to refine our proxy node selection function (PN (l, C)). In addition, we are continuing our simulation work with different network latency (L) and finite buffer size values. We are also evaluating further variants of the proxy scheme: adaptive proxies, non-caching proxies, and using a separate proxy cache.
Acknowledgements This work was funded by the U.K. Engineering and Physical Sciences Research Council through the cramp project GR/J 99117, and a Research Studentship. We would also like to thank Andrew Bennett and Ashley Saulsbury for their work on the alite simulator and for porting some of the benchmark programs.
References
1. Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, and Sarah A. M. Talbot. Using proxies to reduce cache controller contention in large shared-memory multiprocessors. In Luc Bougé et al, editor, Euro-Par 96 European Conference on Parallel Architectures, Lyon, volume 1124 of Lecture Notes in Computer Science, pages 445–452. Springer-Verlag, August 1996.
2. Ricardo Bianchini and Thomas J. LeBlanc. Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors. In 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, pages 204–213, October 1994.
3. David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: a practical model of parallel computation. Communications of the ACM, 39(11):78–85, November 1996.
4. Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer – designing a MIMD shared memory parallel computer. IEEE Transactions on Computers, C-32(2):175–189, February 1983.
5. Seif Haridi and Erik Hagersten. The cache coherence protocol of the Data Diffusion Machine. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE 89 Parallel Architectures and Languages Europe, Eindhoven, volume 365 of Lecture Notes in Computer Science, pages 1–18. Springer-Verlag, June 1989.
6. Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
7. David V. James, Anthony T. Laundrie, Stein Gjessing, and Gurindar S. Sohi. Scalable Coherent Interface. IEEE Computer, 23(6):74–77, June 1990.
8. Stefanos Kaxiras, Stein Gjessing, and James R. Goodman. A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors. In (to appear) 12th ACM International Conference on Supercomputing, Melbourne, July 1998.
9. Jeffrey Kuskin. The FLASH Multiprocessor: designing a flexible and scalable system. PhD thesis, Computer Systems Laboratory, Stanford University, November 1997. Also available as a technical report, CSL-TR-97-744.
10. Tom Lovett and Russell Clapp. STiNG: a CC-NUMA computer system for the commercial marketplace. 23rd Annual International Symposium on Computer Architecture, Philadelphia, in Computer Architecture News, 24(2):308–317, May 1996.
11. Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. 24th Annual International Symposium on Computer Architecture, Denver, in Computer Architecture News, 25(2):219–228, June 1997.
12. Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The S3.mp scalable shared memory multiprocessor. In Proceedings of the International Conference on Parallel Processing Vol. 1, pages 1–10, August 1995.
13. B. A. Tanyi. Iterative Solution of the Incompressible Navier-Stokes Equations on a Distributed Memory Parallel Computer. PhD thesis, University of Manchester Institute of Science and Technology, 1993.
14. Manu Thapar and Bruce Delagi. Stanford distributed-directory protocol. IEEE Computer, 23(6):78–80, June 1990.
15. Leslie G. Valiant. Optimality of a two-phase strategy for routing in interconnection networks. IEEE Transactions on Computers, C-32(8):861–863, August 1983.
16. Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. Proceedings of the 22nd Annual International Symposium on Computer Architecture, in Computer Architecture News, 23(2):24–36, June 1995.
Handling Multiple Faults in Wormhole Mesh Networks
Tor Skeie
Department of Informatics, University of Oslo, Box 1080, Blindern, N-0316 Oslo, Norway
[email protected]
Abstract. We present a fault tolerant method tailored for n-dimensional mesh networks that is able to handle multiple faults, even for two-dimensional meshes. The method does not require the existence of virtual channels. The traditional way of achieving fault tolerance, based on adaptivity and the addition of virtual channels as the main mechanisms, has not shown the ability to handle multiple faults in wormhole mesh networks. In this paper we propose another strategy to provide a high degree of fault tolerance: we describe a technique which alters the routing function on the fly. The alteration action is always taken locally and distributed to a limited number of non-neighbor nodes.
1 Introduction
In recent years we have seen an increasing interest in multicomputers, both in industry and academia. As the number of components in such computers grows, the probability of failing components increases, and therefore the ability to adapt to errors becomes a significant issue in this field. In this paper we study wormhole-routed interconnect networks with respect to fault tolerance. The wormhole switching technique described by Dally and Seitz [5] is an improvement of the virtual cut-through technique of Kermani and Kleinrock [15]. The concept of wormhole routing has been taken up in many interconnection networks, in particular in the domain of multiprocessor interconnection, see e.g. [18,21,20]. A survey of wormhole routing techniques can be found in [19]. Fault tolerance in a communication network is defined as the ability of the network to effectively utilize its redundancy in the presence of faulty links. A failing link may destroy one path in use between two communicating nodes, but given that the failing link has not split the network into two unconnected parts, there will still be other paths over which the two ends may communicate. Much work on fault tolerance in wormhole-routed networks has been reported. Linder and Harden described an adaptive routing algorithm for k-ary n-cubes which requires up to 2n virtual channels per physical channel [16]. Boppana and Chalasani present an algorithm to handle block faults in meshes where faulty links and nodes form rectangular regions [1].
This work is supported by Esprit under the OMI-ARCHES
The method requires 2 virtual channels per physical channel, and is able to handle any number of faults as long as the fault regions do not overlap. The technique requires that a fault-free node knows the status of its own links and its neighbors' links. In [2] they generalize this method to also handle non-convex fault regions, requiring 4 virtual channels per physical channel in the non-overlapping case. The method requires fault-region shape-finding messages, and may in many cases mark more links as faulty than those that have actually failed. Several other methods rely on the virtual channel concept as well [4,9,6,8,11,3]. However, other approaches have been proposed. Glass and Ni have described a fault tolerant routing algorithm for meshes which does not require virtual channels [13]. Their method is based on the turn model for generating adaptive routing algorithms [10,12] and needs extra control lines to agree upon allowed turns. Lysne, Skeie and Waadeland [17] also describe fault tolerant routing without using virtual channels. Basically, their methods rely on alteration of the routing function. Unfortunately, both these methods tolerate very few faults for low-dimension meshes. In this paper we present a method for fault tolerance which differs from the above-mentioned work in that it can handle a significant number of faults and does not need virtual channels. The technique relies on a limited amount of non-local status information; the alteration action is taken locally and distributed to a limited number of nodes. The remainder of the paper is organized as follows. Section 2 contains some basic notations. In section 3 we describe the fault tolerant routing method, and finally in section 4 we conclude.
2 Preliminaries
The definitions used in this paper adhere to standard notation and definitions in wormhole routing [17]. The following theorem is due to Dally and Seitz [5] and Duato [7]. It will be used extensively within this paper. Theorem 1. A wormhole network is free from deadlocks if its channel dependency graph is acyclic. We shall in this paper consider bidirectional n-dimensional mesh networks, also called (k, n)-mesh networks, where k is the radix, n is the number of dimensions and N = k^n is the number of nodes. A (k, n)-mesh is a (k, n)-torus without connecting wraparound links. To simplify presentation of the fault tolerant method, we will describe it in detail for the two-dimensional case and afterwards present the generalization to three and higher dimensions. The four directions of a two-dimensional mesh are labeled north, south, east and west. Furthermore the channels of a horizontal (vertical) link l are denoted least/lwest (lsouth/lnorth), respectively.
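Theorem 1 suggests a simple mechanical check for a candidate routing function: build the channel dependency graph (one vertex per channel, an edge from c1 to c2 whenever the routing function can forward a packet holding c1 onto c2) and test it for cycles. A small DFS-based sketch follows; the adjacency representation is an assumption made for illustration.

#include <stdbool.h>

#define MAX_CHANNELS 1024
#define MAX_DEPS     8        /* assumed maximum out-degree per channel */

/* deps[c][i] lists the channels that routing may use directly after channel c. */
static int ndeps[MAX_CHANNELS];
static int deps[MAX_CHANNELS][MAX_DEPS];
static int colour[MAX_CHANNELS];            /* 0 = white, 1 = grey, 2 = black  */

static bool has_cycle_from(int c)
{
    colour[c] = 1;                          /* grey: on the current DFS path   */
    for (int i = 0; i < ndeps[c]; i++) {
        int d = deps[c][i];
        if (colour[d] == 1) return true;    /* back edge, so the graph cycles  */
        if (colour[d] == 0 && has_cycle_from(d)) return true;
    }
    colour[c] = 2;                          /* black: fully explored           */
    return false;
}

/* Deadlock-freedom check in the sense of Theorem 1. */
static bool dependency_graph_acyclic(int nchannels)
{
    for (int c = 0; c < nchannels; c++) colour[c] = 0;
    for (int c = 0; c < nchannels; c++)
        if (colour[c] == 0 && has_cycle_from(c)) return false;
    return true;
}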
2.1 The Fault Model
In this paper we will concentrate on link faults, and assume that if a channel fails then the entire link fails in both directions. The link failures are considered to be "hard" in the sense that faulty links will be down for quite a while (hours or even days). Furthermore, we assume that each node, in addition to knowing the status of its own links, also has status information regarding the east and north links of some of the nodes in the row (column) in which it is positioned. For a more thorough definition we refer to section 3.2. To provide nodes with the link status information stated above we assume that the network has a control line structure over which routers can send/receive specific control messages. The IEEE 1355 [14] STC104 [23] from SGS-Thomson is a router with such a control line structure.
3 A Distributed Fault Tolerant System for Positive-First Routing
In this section we present an algorithm that handles multiple faults for positive-first routing (PF) without using virtual channels. PF is a variant of the turn model [12], which is a method for designing adaptive routing algorithms. The idea in the turn model is to start with a system where every packet can take all possible paths in every router. Then one prohibits just enough turns to break cycles in the channel dependency graph. Positive-first routing is defined by the turns shown in figure 1. We make the additional restriction that packets should be routed minimally in a fault-free network, even if there are detours that do not introduce illegal turns.
Fig. 1. The six allowed turns (solid arrows) for positive-first routing in the fault-free case.

From the figure we observe that there is adaptivity in the south-west and north-east directions, because in each of these directions one is allowed to make use of two turns. Routing in the south-east and north-west directions is, however, deterministic. The name positive-first comes from the prohibition of turns from negative to positive directions.
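A compact way to express this routing function is as a procedure that, given the current and destination coordinates, returns the set of output directions a packet may take. The sketch below is one such rendering of minimal positive-first routing; the direction names and array interface are illustrative, not taken from the paper.

enum dir { NORTH, SOUTH, EAST, WEST, LOCAL };

/* Minimal positive-first routing in a 2D mesh (fault-free case).
 * Returns the number of permitted output directions written to out[].
 * Positive directions (east, north) are taken before negative ones, so
 * the only adaptivity is in the north-east and south-west quadrants.   */
static int pf_route(int cur_x, int cur_y, int dst_x, int dst_y, enum dir out[2])
{
    int dx = dst_x - cur_x, dy = dst_y - cur_y, n = 0;

    if (dx == 0 && dy == 0) { out[n++] = LOCAL; return n; }

    if (dx > 0 && dy > 0)      { out[n++] = EAST;  out[n++] = NORTH; } /* adaptive */
    else if (dx < 0 && dy < 0) { out[n++] = WEST;  out[n++] = SOUTH; } /* adaptive */
    else if (dx > 0)           { out[n++] = EAST;  }  /* east before any south hop */
    else if (dy > 0)           { out[n++] = NORTH; }  /* north before any west hop */
    else if (dy < 0)           { out[n++] = SOUTH; }  /* only south hops remain    */
    else                       { out[n++] = WEST;  }  /* only west hops remain     */
    return n;
}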
The fault tolerant system is based on altering the routing function in some nodes, so that new paths are defined that avoid the faulty links. For most of the failure situations it will be nodes in two rows and/or columns that need a reprogrammed routing function; however, in some specific situations a cascading update might be necessary (see section 3.3). It will always be the node where one or both of the east and north links recently failed that initiates the necessary updating step. This means that the updating action to be provided is locally determined. Such a node will be called nmaster throughout the remainder of the paper. Alteration messages are then distributed via the control line structure to nodes that need their routing function updated; hence the distributed view of the fault tolerant system. On the other hand, the node whose west and/or south links fail stays passive concerning alteration of the routing function until it receives an updating message. Such a node is denoted nslave. We assume the nodes to have some amount of programmable logic so they can implement the alteration algorithm to respond to the failure situation, and furthermore that nodes can reprogram their routing function on the fly upon reception of updating messages. Below we shall first present the necessary alteration of the routing function for different link failure situations in order to give all packets affected by a faulty link new paths. Next we show that the updated routing function will be deadlock free. Depending on the failure situation we divide the description in two parts:
1. The situation when only one of the east and north links of a node is faulty.
2. The more complex situation where both these links fail.
In the first situation our method does not use non-local link status information in order to perform the routing function alteration, while for the second case this becomes necessary. Recall that we denote the node whose east and/or north links have failed nmaster, and the nodes that receive an updating message from nmaster are called nslavei. Furthermore the east and north links of nmaster are designated le and ln, respectively. For clarity, we shall in the methods proposed below assume that rerouting channels (links) are not faulty at the moment the updating action is initialized; in section 3.3 we also handle faulty rerouting channels.
3.1 The East or the North Link of a Node Fails
For this case we show that we can recover without introducing the two prohibited turns (figure 1).
The East Link of a Node Fails. The following method is used to alter the routing function:
Alteration 1: The packets that have paths through leeast are rerouted northwards at nmaster (Alter 1 in figure 2(a)).
Alteration 2: The packets that have paths through lewest that arrive (or originate) at the nodes in the row east of lewest are rerouted northwards as well (Alter 2 in figure 2(a)).
Alteration 3: The packets that have paths through lewest that arrive at the nodes in the row above alteration 2 are rerouted westwards (Alter 3 in figure 2(a)).
Alterations 2 and 3 are initialized by nmaster distributing an update message to the two corresponding sets of nodes. By altering the routing function of the nodes east of nslave in the same row, we ensure that packets with original paths through lewest are not sent towards nslave (Alter 2). If they were sent towards nslave they would be forced to take the prohibited west-north turn. Notice also that packets going in the adaptive south-west direction, which could possibly use lewest, must for the same reason be stopped from going further south at the row right above the failing link (Alter 3).
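The failure handling is initiated entirely at nmaster; a sketch of the local reaction to an east-link failure follows, with the message format, helper names, and coordinate convention all invented for illustration (and with the rerouting understood to apply only to packets whose paths used a channel of the failed link).

#include <stdio.h>

/* Hypothetical control message asking a node to reroute, at itself, those
 * packets whose paths used a channel of the failed link.                  */
struct update_msg { int fail_x, fail_y; char new_dir; /* 'N' or 'W' */ };

static void send_control(int x, int y, struct update_msg m)   /* stub */
{ printf("update node (%d,%d): reroute affected packets %c\n", x, y, m.new_dir); }

/* nmaster at (mx, my) reacts locally to the failure of its east link in a
 * k x k mesh; +x is east and +y is north in this sketch.                  */
static void east_link_failed(int mx, int my, int k)
{
    /* Alter 1: nmaster itself sends affected packets northwards.          */
    printf("nmaster (%d,%d): reroute affected packets N\n", mx, my);

    for (int x = mx + 1; x < k; x++) {
        /* Alter 2: row east of the failed link reroutes northwards.       */
        send_control(x, my,     (struct update_msg){ mx, my, 'N' });
        /* Alter 3: the row above reroutes affected packets westwards.     */
        send_control(x, my + 1, (struct update_msg){ mx, my, 'W' });
    }
}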
Fig. 2. Identifying the nodes (visualized as darkened boxes) that will get their routing function altered according to our method. The thick arrows indicate the redirections of the affected packets through the failing link. (a) The east link of nmaster fails. (b) The north link of nmaster fails.
The North Link of a Node Fails. Handling the north link failure situation of a node is symmetric to the east link one. Due to this symmetry we here only present the alteration of the routing function in a figurative manner (figure 2(b)).
Lemma 1. If alteration of the routing function is done according to our methods when the east or the north link of a node fails, the new routing function is connected and deadlock free.
Proof. From the described alteration methods it is obvious that the new routing function is connected, since nodes in the vicinity of the failing link are updated with respect to the affected packets in both channel directions. It also follows trivially from the turn model that the new routing function is deadlock free, since none of the prohibited turns are used.
3.2 Both the East and North Links of a Node Fail
In this more complex failure situation the link status information has to be consulted in order for nmaster to decide the proper alteration action. In addition it becomes necessary to introduce the two illegal turns, the south-east and west-north turn, respectively.
Fig. 3. (a) A failure situation where the two node sets Na and Nb cannot communicate without introducing the prohibited double turn, the south-east and west-north dependencies, respectively. (b) Alteration of the routing function when both east and north links of nmaster fail. The thick arrows indicate redirections of affected packets through the failing links (channels).
Let us first give some motivation for the algorithm proposed below; consider the failure situation shown in figure 3(a). The east and north connecting links (marked le and ln, respectively) of node nmaster have recently failed; in addition we assume that before this situation arises the links l1 and l2 have failed. Notice that since both le and ln are now failing, packets originating in the node set Na destined for the node set Nb and vice versa are not able to reach their destinations without introducing the prohibited double turn, the south-east and west-north dependencies, respectively. Thus the key issue is to introduce the prohibited double turn in such a way that the two node sets can communicate with each other, and furthermore such that no cycles in the channel dependency graph are created as a consequence of the introduction of prohibited turns. We shall in the following introduce the prohibited double turn associated with the operating east and north links of a node ni, either positioned west of nmaster in the same
row or positioned south of nmaster in the same column. We shall show that this reestablishes connectedness and preserves freedom from deadlocks. If there are several candidate nodes ni in the same row (column) we pick the east (north) most of these; furthermore we denote this specific node nipt (Introduction of Prohibited Turns). Hence, for nmaster to be able to decide which node ni can serve as nipt, it must have knowledge about the status of the east and north links of all nodes in the same row (column) that are positioned west (south) of it. This defines the non-local information required for our method. Below we define the necessary alteration for this specific situation. We present it for the case when nipt is located west of nmaster; the case when a node south of nmaster is selected as nipt is symmetric.
Alteration 1: The packets that have paths through lewest that arrive at the nodes in the row east of le are rerouted northwards. Furthermore the packets that have paths through lewest that arrive at the nodes in the row above the previously specified one are rerouted eastwards (Alter 1 in figure 3(b)).
Alteration 2: The rerouting proposed here consists of new paths for those packets via lnsouth that can reach their destination without introducing the prohibited south-east turn. They could have been rerouted through nipt as well; however, we prefer eastwards rerouting in order to spread the traffic. If there are any nodes south of nmaster in the same column that have their east link operating, let ns be the uppermost of these candidates. Furthermore let Nwpt (reachable Without Prohibited Turn) be the node set defined by the nodes located at the same vertical position as ns or south of ns, at the same horizontal position as ns or west of ns, and in addition positioned east of nipt (figure 3(b)). The packets having paths through lnsouth that arrive at the nodes in the column north of nmaster and are destined for the node set Nwpt are rerouted eastwards. Furthermore the packets having paths through lnsouth that arrive at the nodes in the column east of the previously specified one and are destined for the node set Nwpt are rerouted southwards (Alter 2 in figure 3(b)).
Alteration 3: This alteration redirects those packets that have their detour through nipt. The packets having paths through leeast and lnnorth that arrive at nodes in the part of the row between nipt and nmaster (inclusive) are rerouted westwards. In addition the packets having paths through lewest and lnsouth that arrive at nodes in the row above the previously defined one are rerouted westwards as well. Furthermore the packets with paths through leeast and lnnorth are rerouted northwards at nipt, and the packets with paths through lewest and lnsouth are rerouted southwards in the node north of nipt (Alter 3 in figure 3(b)).
Lemma 2. If alteration of the routing function is done according to the method above for the situation when both the east and north links of a node fail, the new routing function is connected and deadlock free.
Proof. The proof is divided in two: 1. Connected: It follows trivially from the description of the methods that the new routing function is connected.
2. Deadlock free: Assume that a prohibited double-turn DTi is introduced as a consequence of east and north link failures at nmasteri. We first prove that DTi alone cannot form a cycle in the updated channel dependency graph. Secondly we prove that DTi cannot be involved in a cycle together with other prohibited double-turns either. We assume that DTi is introduced west of nmasteri; the case when DTi is south of nmasteri is similar. (i) If DTi is involved in a cycle, the cyclic dependencies must go through the channels connected to nmasteri. Since both channel directions of nmasteri's east and north links are failing, a potential cycle must come via the south link of nmasteri. This means that another prohibited south-east or west-north turn must be involved to form a cycle in one of the directions. Therefore, DTi cannot form a cycle if it is the only prohibited turn introduced. (ii) Now assume that there exists another prohibited double-turn DTj south of nmasteri. We know that DTj is introduced as a consequence of the failure of both the east and north links of nmasterj, and consider DTj to be located west of nmasterj. From point (i) above the cyclic dependencies must come through the south link of nmasterj, thus we need yet another prohibited double-turn to form a cycle, and so forth. The case when DTj is introduced south of nmasterj is treated analogously. This means that no cycle can be formed even by two or more prohibited double-turns when they are introduced according to our method.
Our system is able to handle a significant number of link failure situations. The situations that the proposed method cannot handle are those where both the east and north links of nmaster fail simultaneously and, in addition, one of the following two conditions applies:
1. There does not exist any node, either west of nmaster in the same row or south of nmaster in the same column, that has both its east and north links operating.
2. Any candidate node ni west (south) of node nmaster in the same row (column) with both its east and north links operating does not have a horizontal (vertical) path to nmaster.
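The non-local information nmaster needs is just the east/north link status of the nodes west of it in its row (and south of it in its column); selecting nipt is then a scan for the nearest node with both links operating and an intact path back to nmaster. The sketch below illustrates this, with the status arrays, mesh radix, and coordinate convention given as assumptions.

#include <stdbool.h>

#define K 16   /* mesh radix (assumed) */

/* east_up[x][y] / north_up[x][y]: status of node (x,y)'s east/north links,
 * gathered over the control lines; +x is east, +y is north.               */
static bool east_up[K][K], north_up[K][K];

/* Pick n_ipt for nmaster at (mx, my): the easternmost node west of nmaster
 * (same row) with both links operating and a fault-free horizontal path
 * back to nmaster; if the row yields nothing, scan the column below.      */
static bool choose_nipt(int mx, int my, int *ipt_x, int *ipt_y)
{
    for (int x = mx - 1; x >= 0; x--) {
        if (!east_up[x][my]) break;            /* row path to nmaster broken */
        if (north_up[x][my]) { *ipt_x = x; *ipt_y = my; return true; }
    }
    for (int y = my - 1; y >= 0; y--) {
        if (!north_up[mx][y]) break;           /* column path broken         */
        if (east_up[mx][y]) { *ipt_x = mx; *ipt_y = y; return true; }
    }
    return false;   /* conditions 1 or 2 above: the situation is not handled */
}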
3.3 Rerouting Channel is Faulty
In the above presentation of the alteration algorithms we assumed that the specified rerouting channels (links) were operating. In this section we identify what impact faulty rerouting channels will have on alteration of the routing function. When rerouting is requested along earlier failed channels, we know there already exist rerouting paths for these faults; thus we only need to reupdate the routing function along these redirections to also include the affected packets associated with the fault currently handled. Moreover, the additional nodes to be visited are exactly defined by the alteration methods specified in sections 3.1 and 3.2. To demonstrate this we give the three possible situations for requesting rerouting along a faulty channel, and refer them to the previously defined alterations: (i) Requesting rerouting along a faulty east or north channel when only
one of these has failed: According to Alteration 1 specified in section 3.1, reupdate the routing function in this node (which must be an nmasterold) by letting affected packets from the original fault keep their rerouting northwards if it is the east channel that is failing. With respect to rerouting requested for the north channel direction, let affected packets keep their rerouting eastwards (refer to Alter 1 in figure 2(b)). (ii) Requesting rerouting along an east or north channel when both of these have failed: From Alteration 3 in section 3.2, if niptold is located west of nmasterold, reupdate the routing function in the nodes between niptold and nmasterold (inclusive) to also include affected packets from the current fault (figure 3(b)). There is a symmetric action for the case when niptold is located south of nmasterold. (iii) Requesting rerouting along a faulty west or south channel: Recall from Alterations 2 and 3 (in section 3.1) and figure 2(a) that if rerouting is requested for a failing west channel direction, the nodes in the row west of this channel and the nodes in the row above it need to be updated with respect to the affected packets from the current fault. Symmetric updating is necessary when rerouting is requested for a failing south channel (refer to figure 2(b)). The three situations defined above require some additional updating to be initiated. Now, perhaps in this reupdating we encounter yet another faulty channel. Therefore, the natural question to ask is: will this repeated alteration process terminate successfully? Lemma 3. The alteration process initialized as a consequence of a broken rerouting channel will terminate successfully if it is done as specified above.
Proof. Since there is a limited number of earlier faults to visit, the only possible way to achieve a non-terminating alteration process is as a consequence of a "loop". Let us denote the failing link starting the updating lcurrent. The rest of the proof is divided in three: (i) First, when rerouting is requested along a faulty east or north channel, the additional updating is only done locally at the source node of this channel. It is therefore obvious that this situation cannot contribute to a "loop". (ii) Second, the only possibility to construct a reupdating "loop" (revisiting lcurrent) is when the following apply: 1. the node nmasterold and lcurrent are located in the same row (column), and 2. lcurrent is between nmasterold and niptold. However, the last requirement cannot hold, since it would mean that the path between nmasterold and niptold does not exist, which contradicts the assumption of the existence of such a path between these two nodes. (iii) Third, if rerouting is requested for a broken west channel, additional rerouting is only needed for channels located east and north of this one, which are even further east and north of lcurrent. Hence, lcurrent cannot be revisited. The same reasoning can be used for a broken south channel. Therefore, we can conclude that a repeated alteration process will converge.
3.4 Three- and Higher-Dimensional Meshes
This section illustrates how the positive-first fault tolerant method can be extended to three- and higher-dimensional meshes. The modification follows naturally from the method presented above.
Fig. 4. The prohibited double-turns for fault free routing in three-dimensional meshes. The referred directions denote the following: U - Up, D - Down, N North, S - South, E - East and W - West, respectively.
In figure 4 we define the prohibited turns (shown as double-turns) for fault-free routing in three-dimensional meshes. We have prohibited just enough turns to avoid cycles in the channel dependency graphs; thus it follows from [12] that the routing function is deadlock free. In the presence of faults we have two main routing function alteration actions, depending on the failure situation, in the three-dimensional case as well:
1. The situations where at least one of the east, north and up links of a node is operating. Affected packets of a newly failing link are redirected through one that is still present. This means that no prohibited turns need to be introduced. This is analogous to the situations handled in section 3.1.
2. The more complex situation when the east, north and up links are faulty simultaneously. This means that affected packets have to be redirected via a node (nipt) either down, south or west of nmaster. In this case it becomes necessary to introduce the prohibited double-turns, which corresponds to the situation defined in section 3.2.
Note that in the three-dimensional case each link participates in two planes; for example, an east link is visible in both the east-north and east-up planes. Hence, nodes in the two planes in which a failing link is located must have their routing function altered. For each plane, nodes in two rows/columns need to be updated, as in the two-dimensional case. Regarding point 2 above, the additional nodes along the detour through nipt (the node where the prohibited turns are introduced) need to be updated; refer to Alteration 3 in section 3.2.
Lemma 4. If alteration of the routing function is performed as stated above in the presence of faulty links in three-dimensional meshes, the new routing function is connected and deadlock free.
Proof. The proof can be fully elaborated along the same lines as the proofs in sections 3.1 and 3.2. Regarding the deadlock issue for the situation when all of the east, north and up links of a node are faulty simultaneously, the creation of any cycle requires several prohibited turns. Since we for each of these situations only associate the prohibited turns with one single node (nipt), this cannot form cyclic dependencies. Generalization to n dimensions follows the same pattern. For each new dimension one prohibits just enough turns to avoid cycles in the channel dependency graph in the fault-free case. Then two main routing function alteration processes are defined along the same lines as described above.
3.5 Simulation Experiments
In order to evaluate the performance of the fault tolerant system we have done a series of experiments. We have simulated a 16 × 16 mesh in the presence of 1%, 3% and 5% faulty links. The results show graceful degradation in performance, especially for non-uniform traffic and low numbers of faults, where the degradation is hardly noticeable. The results are not presented here due to space constraints, but can be found in [22].
4 Conclusion
We have presented a method for achieving fault tolerance in wormhole-routed (k, n)-mesh networks. Unlike other techniques, we do not require the existence of virtual channels, and we tolerate multiple faulty links, even for two-dimensional meshes. The proposed concept relies on the network being able to alter the routing function on the fly. The initiation of an alteration is always taken locally and distributed to a small number of non-neighbor nodes via control lines. The IEEE 1355 STC104 [23] router is an example of a router with such a control line structure. For most failure situations, the needed alteration of the routing function is fixed and simple. In order to handle more complex situations, such as when both the east and north links of a node (n) fail, we assume knowledge of link status information from some of the other nodes in the same row/column as node n. This non-local information is used by the local master node responding to a failure. The link status information is gathered via the control lines. The implementation cost of our method is basically that it requires each node to have an amount of programmable logic, making it possible to implement the proposed alteration algorithms. Secondly, we assume the network to have a control line structure over which nodes can send/receive specific control messages. Extra logic in the nodes and some form of control lines are, however, usual mechanisms for achieving fault tolerance [2,13]. Our investigations indicate that it is relatively easy to incorporate node faults into our method, where such faults would be modeled as the failure of all of a node's links. Another interesting direction for further work is to investigate whether our technique can also be extended to handle (k, n)-torus networks.
Acknowledgments I would like to thank Associate Professor Olav Lysne and Associate Professor Øystein Gran Larsen for their valuable comments and assistance in preparing this paper.
References
1. R. V. Boppana and S. Chalasani. Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Transactions on Computers, 44(7):848–864, 1995.
2. S. Chalasani and R. V. Boppana. Communication in multicomputers with nonconvex faults. IEEE Transactions on Computers, 46(5):616–622, 1997.
3. A. A. Chien and J. H. Kim. Planar-adaptive routing: Low-cost adaptive networks for multiprocessors. Journal of the Association for Computing Machinery, 42(1):91–123, 1995.
4. W. J. Dally and H. Aoki. Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466–475, 1993.
5. W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Transactions on Computers, C-36(5):547–553, 1987.
6. B. V. Dao, J. Duato, and S. Yalamanchili. Configurable flow control mechanisms for fault-tolerant routing. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 220–229. ACM Press, 1995.
7. J. Duato. A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. Int. Conf. on Parallel Processing, I:142–149, Aug. 1994.
8. J. Duato. A theory to increase the effective redundancy in wormhole networks. Parallel Processing Letters, 4:125–138, 1994.
9. P. T. Gaughan and S. Yalamanchili. A family of fault-tolerant routing protocols for direct multiprocessor networks. IEEE Transactions on Parallel and Distributed Systems, 6(5):482–497, 1995.
10. C. J. Glass and L. M. Ni. The turn model for adaptive routing. In Proceedings of the 19th International Symposium on Computer Architecture, pages 278–287. IEEE CS Press, California, 1992.
11. C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes. In Twenty-Third Annual Int. Symp. on Fault-Tolerant Computing, pages 240–249, 1993.
12. C. J. Glass and L. M. Ni. The turn model for adaptive routing. Journal of the Association for Computing Machinery, 41(5):874–902, 1994.
13. C. J. Glass and L. M. Ni. Fault-tolerant wormhole routing in meshes without virtual channels. IEEE Transactions on Parallel and Distributed Systems, 7(6):620–636, June 1996.
14. IEEE 1355-1995. IEEE standard for Heterogeneous InterConnect (HIC) (Low cost, low latency scalable serial interconnect for parallel system construction), 1995.
15. P. Kermani and L. Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3:267–286, 1979.
16. D. H. Linder and J. C. Harden. An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes. IEEE Transactions on Computers, 40(1):2–12, 1991.
17. O. Lysne, T. Skeie and T. Waadeland. One-Fault Tolerance and Beyond in Wormhole Routed Meshes. Microprocessors and Microsystems, Elsevier, 21(7-8):471–481, 1998.
18. M. D. May, P. W. Thompson, and P. H. Welch, editors. Networks, routers and transputers: function, performance and application. IOS Press, 1993.
19. L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26:62–76, 1993.
20. Paragon XP/S product overview. Intel Corp., Supercomputer Systems Div., 1991.
21. C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su. The architecture and programming of the Ametek series 2010 multicomputer. In Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Pasadena (California), volume I, pages 33–36, 1988.
22. T. Skeie. Topics in Interconnect Networking. PhD thesis, ISBN 82-7368-190-4, Dept. of Informatics, University of Oslo, 1998.
23. P. W. Thompson and J. D. Lewis. The STC104 asynchronous packet switch. VLSI Design, 2(4):305–314, 1995.
Shared Control — Supporting Control Parallelism Using a SIMD-like Architecture

Nael B. Abu-Ghazaleh¹ and Philip A. Wilsey²

¹ Computer Science Dept., State University of New York, Binghamton, NY 13902-6000
² Department of ECECS, PO Box 210030, University of Cincinnati, Cincinnati, Ohio 45221-0030
Abstract. SIMD machines are considered special purpose architectures chiefly because of their inability to support control parallelism. This restriction exists because there is a single control unit that is shared at the thread level; concurrent control threads must time-share the control unit. We present an alternative model for building centralized control architectures that better supports control parallelism. This model, called shared control, shares the control unit(s) at the instruction level: in each cycle the control signals for the supported instructions are broadcast to the PEs. In turn, a PE receives its control by synchronizing with the control unit responsible for its current instruction. There are a number of architectural issues that must be resolved. This paper identifies some of these issues and suggests solutions to them. An integrated shared-control/SIMD architecture design (SharC) is presented and used to demonstrate the performance relative to a SIMD architecture.
1 Introduction
Parallel architectures are classified according to their control organization as Multiple Instruction streams Multiple Data streams (MIMD), or Single Instruction stream Multiple Data streams (SIMD) machines. MIMD machines have a distributed control organization: each Processing Element (PE) has a control unit and is able to sequence a control thread (program segment) locally. Conversely, SIMD machines have a centralized control organization: the PEs share one control unit. A single thread executes on the control unit, broadcasting instructions to the PEs for execution. Because the control is shared and the operation is synchronous, SIMD PEs are small and inexpensive. In a centralized control organization (e.g., SIMD [11],[14] and MSIMD [5],[18] machines), an arbitrary number of PEs share a fixed number of control units. Traditionally, sharing of control has been implemented at the thread level; the PEs following the same thread concurrently share a control unit. The presence of application-level control-parallelism causes the performance of this model to drop (proportionately to the degree of control parallelism). This drop occurs
because the control units are time-shared among the threads, with only the set of PEs requiring the currently executing control thread actively engaged in computation. General parallel applications contain control parallelism [3],[7] and, therefore, perform poorly on SIMD machines. Accordingly, SIMD machines have fallen out of favor as a platform for general-purpose parallel processing [10],[15].

This paper presents Shared Control: a model for constructing centralized control architectures that better supports control-parallelism. Under shared control, the control units are shared at the instruction (or atomic function) level. Each PE is assigned a local program, and PEs executing the same instruction, but not necessarily the same thread, receive their control from the same control unit. A control unit is assigned to each instruction, or group of similar instructions, in the instruction set and thus repeatedly broadcasts the microinstruction sequences that implement that instruction to the PEs. Each PE receives its control by synchronizing with the control unit corresponding to its current instruction. Thus, all the PEs are able to advance their computation concurrently, regardless of the degree of control parallelism present in the application. The similarity of the hardware to the SIMD model allows the SIMD mode to be supported at little additional cost. With the ability to support control-parallelism efficiently, the major drawback of the SIMD model is overcome.

Shared control is a unique architectural paradigm; the classic association between control units and threads, present in all Von Neumann based architectures, does not exist in this model. Therefore, it introduces several architectural issues that are unique to it. This paper identifies some of these issues and discusses solutions to them. The feasibility of the solutions is demonstrated using a SIMD/shared-control architecture design, SharC. Using a detailed simulator of SharC, the performance of the model is studied for some irregular problems.

The remainder of this paper is organized as follows. Section 2 introduces the shared control model. Section 3 presents some architectural issues relating to a general shared control implementation. Section 4 presents a case study of a shared-control architecture. In Section 5, the performance of the architecture is studied. Finally, Section 6 presents some concluding remarks.
2 Shared Control
A shared control architecture is a centralized control architecture where the control units are shared at the operation level. PEs executing the same operation, but not necessarily the same thread, may share the use of the same control unit. An overview of a shared control machine is shown in Figure 1. The control program (microprogram) implementing the instruction set for the shared control mode is partitioned across a number of tightly coupled control units. This partitioning is static; it is carried out at architecture design time. Each control unit repeatedly broadcasts the microprogram sequence assigned to it to the PEs. A PE receives its control from the control unit associated with its current instruction. The PE synchronizes with the control unit by selecting the set
of control signals broadcast by that control unit. PEs are able to advance their computation concurrently; thus, MIMD execution is supported.

Fig. 1. Overview of a Shared Control Organization

We first consider the problem of implementing and optimizing shared control using a single control unit; this is a special case that is needed for the general solution. Figure 2 shows the single control unit implementation. The control unit must supply control for all the instructions in every cycle. More precisely, the control unit sequentially issues all the paths through the microprogram, and the PEs conditionally participate in the path corresponding to their current instruction. In the remainder of this section, some of the architectural issues involved in constructing a single-control-unit shared control multiprocessor are discussed.

Fig. 2. Implementation of Several Instructions on a Single Control Unit

Managing the Activity Status: Before every instruction execution stage, the activity bits for the PEs that are interested in this stage must be set (represented by the shaded circles in Figure 2). On SIMD machines, setting the activity status before a conditional region requires the following operations on the PEs: (i) save the current active set, (ii) evaluate the condition, and (iii) set the active bit if the condition is true. In addition, at the end of the conditional region the active set saved in
step (i) is restored. The active bit can be tracked by saving the bit directly to an activity bit stack [13], or by using an activity counter that is incremented to indicate a deeper context [12]. Both schemes have a significant overhead (register space, as well as several execution steps). Implementing activity management using the SIMD model adds unacceptable overhead to the shared control cycle time, since it is required several times per cycle in shared control. The condition for the activity of a PE in an instruction stage (called the immediate activity) is contained in the opcode field of the instruction register, allowing the following optimization to be made. The opcode field is decoded into a k-bit vector (as shown in Figure 3). At the beginning of every instruction segment, the bit vector is shifted into the immediate activity bit. Thus, instruction i is decoded into a bit vector with a 1 in the ith position and 0 elsewhere, mirroring the order of the execution regions. Only PEs with a high immediate activity bit participate in the current instruction segment. The register shift can be performed concurrently with the execution of each region at no additional cost; the activity management cost is eliminated.

Fig. 3. Activity Set Management

A compositional instruction set: Examining Figure 2, it can be observed that each PE receives the control for all k instructions, but uses only one. Compositional instruction sets relax this restriction by allowing the PEs to use any number of the k instruction segments in each cycle [4]. The output of each execution region is deposited in a temporary register that serves as an input to the next one. The output of the last stage is stored back to the final destination. Thus, an instruction is represented by the subset of the instruction segments that compose it. Composition can be easily incorporated into the activity management scheme in Figure 3.
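To make the scheme concrete, the following is a minimal software sketch of the activity management just described. The constants, names and bit ordering are our own assumptions, not taken from the SharC microcode: a one-hot decode of the opcode is shifted once per instruction segment, the shifted-out bit is the immediate activity, and setting several bits in the vector would correspond to a composed instruction.

/* Hedged sketch of the activity-management idea of Figure 3; all names,
 * sizes and the bit ordering are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define K 8  /* number of instruction segments issued per fundamental cycle */

typedef struct {
    uint8_t opcode;           /* current instruction (0 .. K-1)            */
    uint8_t global_active;    /* SIMD-style global activity bit            */
    uint8_t activity_vector;  /* K-bit vector, refreshed once per cycle    */
} PE;

/* Decode the opcode into a one-hot vector; bit i corresponds to the ith
 * execution region issued by the control unit.  A composed instruction
 * would simply set several bits here. */
static void decode_activity(PE *pe)
{
    pe->activity_vector = (uint8_t)(1u << pe->opcode);
}

/* One fundamental cycle: the control unit issues the K regions in order;
 * the PE shifts its vector and participates only when the shifted-out
 * ("immediate activity") bit is set.  The shift overlaps with execution,
 * so it costs no extra time. */
static void fundamental_cycle(PE *pe)
{
    decode_activity(pe);
    for (int segment = 0; segment < K; segment++) {
        int immediate = pe->activity_vector & 1u;
        pe->activity_vector >>= 1;
        if (pe->global_active && immediate)
            printf("PE participates in segment %d\n", segment);
        /* otherwise the PE idles through this segment */
    }
}

int main(void)
{
    PE pe = { .opcode = 3, .global_active = 1 };
    fundamental_cycle(&pe);   /* prints: PE participates in segment 3 */
    return 0;
}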
3 General Shared Control
A general implementation of the shared control model uses multiple control units to implement the microprogram for the instruction set. There are a number of architecture issues that are introduced by this model (in addition to the ones
present in the single control unit implementation). The discussion in this section will cover the range of possible solutions to each issue, rather than consider specific solutions in detail.

Control I/O pins: At first glance, this is the primary concern regarding the feasibility of shared control; because the model uses multiple control units, the width of the broadcast bus and the number of required pins at the PEs may become prohibitive. Fortunately, the increase in the number of pins is not linear in the number of control units because: (i) each control unit is responsible only for an instruction or a group of instructions; (ii) literal fields and register number fields are not broadcast; they are part of the instruction (they have to be broadcast in SIMD and MSIMD architectures); and (iii) pins that carry the same values on different control units throughout their lifetime are routed as a single physical pin.

The Control Broadcast Network: The control broadcast network is responsible for delivering the control signals from the control units to the PEs. Traditionally, control broadcast has been a bottleneck on centralized control machines; solutions to this problem include pipelining of the broadcast [3] and caching the control units closer to the PEs [16]. With advances in fabrication technology, there is a trend to move processing to memory [8]. For such systems, the control units may be replicated per chip, simplifying the control broadcast problem.

Control Unit Synchronization and Balance: The control stream required by each PE is not supplied on demand as in traditional computers. Rather, the control units have to be synchronized such that the control streams are available when a PE needs them (with minimal waiting time). A possible model to synchronize the control units is to force them to share a common cycle length, called the fundamental instruction cycle, determined by the slowest control unit. However, the instruction cycle time may vary widely across the instruction set, forcing long idle times for PEs with short instructions. Fortunately, there are a number of synchronization models that reduce PE idle time, including: (i) issuing the long instructions infrequently; (ii) allowing the short instructions to execute multiple times while a long instruction is executing; and (iii) breaking long instructions into a series of shorter instructions [2].

Support for Communication and I/O: Support of communication, I/O and other system functions poses the following problem: the system must provide support for both SIMD and MIMD operation at a cost that can be justified against the simple PEs. We focus this discussion on the communication subsystem. There are two options for supporting communication. The first restricts the support to the SIMD mode; the more difficult problem of supporting MIMD-mode communication is sidestepped. Restricting communication to SIMD mode is inefficient because: (i) all the PEs have to synchronize and switch back to SIMD if any of them needs to communicate, and (ii) because of SIMD semantics, all the PEs must wait for the PE with the worst communication path, a bottleneck of synchronous operation when irregular communication patterns are required [6]. Another alternative is to support MIMD operation using an inexpensive network [1].
4 A Case Study: SharC
In this section we present an overview of the architecture of SharC, a proof-of-concept shared control architecture [1]. We preface this discussion by noting that the SharC architecture was designed for possible fabrication within a very small budget (10,000 US dollars for a 64 PE prototype). The PE chip fits within the MOSIS small chip (a very small chip size with 100 I/O pads); much higher integration is possible with more aggressive technology. While SharC does not represent a realistic high-performance design using today's technology, it can still be used as an impartial model to investigate the shared-control performance relative to SIMD performance.
Fig. 4. System Overview
Figure 4 presents an overview of the system architecture. The shared control subsystem consists of 9 control units. The fundamental cycle for all the control units is 11 cycles, with the exception of the load/store control units, which require 22 cycles. There are 4 PEs per chip sharing a single memory port. Communication occurs using the same memory port as well: 4 cycles of the 11-cycle fundamental cycle are reserved for the exchange of two messages with the network processor (one each way). Most of the integer operations are mapped to the same control unit and implemented using composition. The fundamental cycle is 11 cycles long, constrained by the access time of the shared memory/communication port. If dedicated (non-shared) memory ports are supplied, this length can be reduced to 4 cycles for most instructions. As was expected, the number of control pins for the shared control mode exceeded the number required by the SIMD mode. However, the number of pins only increased from 112 pins for the SIMD mode to 118 pins for the shared control mode; the increase is small because of the reasons discussed in Section 3. The average number of Clocks Per Instruction (CPI) for the shared control mode is higher than the CPI for SIMD mode; most SIMD instructions require less than
11 cycles to implement. The reason for the higher CPI is the time required for the additional fetch needed for each instruction, as well as the idle cycles required to synchronize the control units. Thus, for pure data-parallel applications, a SIMD implementation yields better performance than shared control. However, as the degree of control parallelism increases, the performance of shared control remains constant, while SIMD performance degrades.

The SharC communication network is a packet-switched toroidal mesh network, with a chip (4 PEs) at each node sharing a network processor. The network processor is made simple by using an adaptive deflection routing strategy (no buffering is necessary) [9]. In many cases, deflection routing results in better performance than oblivious routing strategies (where the path taken by a packet is independent of other packets), because excess traffic is deflected away from congested areas, creating better network load balance.
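As an illustration of the deflection idea, the following is a minimal sketch of one routing step; the port numbering, the data layout and the preference rule are our own assumptions and do not describe the actual SharC network processor.

/* Hedged sketch of one cycle of deflection ("hot potato") routing: every
 * incoming packet must leave on some output port; it gets a port toward its
 * destination if one is free and is otherwise deflected to a remaining free
 * port.  All names and the port model are illustrative assumptions. */
#include <stdio.h>

#define NPORTS 4   /* e.g. +X, -X, +Y, -Y on a toroidal mesh (assumed) */

typedef struct { int id; int wants[NPORTS]; } Packet; /* wants[p]=1: productive */

/* Assign an output port to each of n packets (n <= NPORTS). */
void deflection_route(Packet *pkt, int n, int out_port[])
{
    int taken[NPORTS] = {0};

    for (int i = 0; i < n; i++) {            /* first pass: productive ports */
        out_port[i] = -1;
        for (int p = 0; p < NPORTS; p++)
            if (pkt[i].wants[p] && !taken[p]) { out_port[i] = p; taken[p] = 1; break; }
    }
    for (int i = 0; i < n; i++) {            /* second pass: deflect the rest */
        if (out_port[i] >= 0) continue;
        for (int p = 0; p < NPORTS; p++)
            if (!taken[p]) { out_port[i] = p; taken[p] = 1; break; }
    }
}

int main(void)
{
    /* both packets prefer port 0; one of them gets deflected */
    Packet pkt[2] = { { 1, {1,0,0,0} }, { 2, {1,0,0,0} } };
    int out[2];
    deflection_route(pkt, 2, out);
    for (int i = 0; i < 2; i++)
        printf("packet %d -> port %d\n", pkt[i].id, out[i]);
    return 0;
}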
5 Performance Analysis
A detailed simulator of a scalable configuration of the SharC design is used to study its performance. The model incorporates a structural description of the control subsystem at the microcode level; it also serves to verify the correctness of the microcode. The test programs were written in assembly language and assembled using a macro assembler. The assembler supports a SIMD/SPMD programming model; an algorithm is written as a SIMD/SPMD application, and any region can be executed in either mode (allowing mixed-mode programming). A transition from SIMD to MIMD operation is initiated explicitly by the SIMD program (using a special c_mimd instruction). A transition from MIMD back to SIMD occurs when all the PEs have halted (using a halt instruction), implementing a full barrier synchronization.
Fig. 5. Speedup for the N-Queen Problem
Our first example is the N-Queen problem: a classic combinatorial problem that is easy to express but difficult to solve. The N-Queen problem has recently
found applications in optical computing and VLSI routing [17]. We considered the exhaustive case (find all solutions, rather than find a solution). The N-Queen problem is highly irregular and not suited to SIMD implementation. Figure 5 shows the performance of the shared control implementation normalized to that of the SIMD implementation. The SIMD solution was worse than a sequential solution because it failed to extract any parallelism, but still incurred the activity management overhead. The speedup leveled off because the amount of parallelism present at the small scales of the problem that were studied is limited.

The second application is a database search competition algorithm, an algorithm characteristic of massively parallel database operations. Each PE is initialized with a segment of a database (sorted by keys), and the PEs search for keys in their database segment. As the number of PEs is scaled, the size of the database is scaled (the size of the segment at each PE is kept the same). For a small number of PEs, there is little control parallelism, and SIMD performs better than shared control. As the number of PEs is increased, the paths through the database followed by each PE diverge according to their respective data. The performance of the SIMD implementation drops with the increased number of PEs while shared control performance remains constant.

Fig. 6. Search Competition Algorithm Performance

Finally, we consider a parallel case statement with balanced cases (Figure 7). By varying the number of cases in the case statement, the degree of control parallelism is directly controlled. The instruction mix within the SIMD blocks was generated randomly, with a typical RISC profile comprised of 30% load/store instructions and with branches forced to the next address (allowing for the pipelined instruction). The blocks are chosen to be of balanced lengths (10% variation). The program was simulated in three different ways: a SIMD implementation, a shared control (SPMD) implementation, and a mixed mode implementation. The mixed mode implementation executes the switch statement in the shared control mode, but executes block A and block B in SIMD.
plural p = p_random(1, n);
[SIMD BLOCK A]
switch (p) {
    case 1: [SIMD BLOCK 1]; break;
    ...
    case n: [SIMD BLOCK n];
}
[SIMD BLOCK B]

Fig. 7. Parameterized Parallel Branching Region
Fig. 8. Simulation results for the architecture on a parameterized branching region (execution times for SIMD, SPMD and mixed mode)

Figure 8 shows execution times as a function of the number of cases. Not surprisingly, the performance of the SIMD mode degrades (linearly) with the increased control parallelism injected by increasing the number of cases. Both the mixed mode and MIMD execution times did not change significantly with the added control parallelism. The mixed mode implementation is more efficient than the full shared control implementation because of its superior performance on the leading and trailing SIMD blocks.
6 Conclusions
SIMD machines offer an excellent price to performance alternative for applications that fit their restricted control organization. Commercial SIMD machines with superior price to performance ratio continue to be built for specific applications. Unfortunately, centralized control architectures (like the SIMD model) cannot support control parallelism in applications; the control unit has to sequentially broadcast the different control sequences required by each of the control threads. In this paper, we present a model for building centralized control architectures that is capable of supporting control parallelism efficiently. The shared
control model is less efficient than the SIMD model on regular code sequences (it requires an additional instruction fetch, and memory space to hold the task instructions). When used in conjunction with the SIMD model, irregular regions of applications can be executed in shared control, extending the utility of the SIMD model to a wider range of applications. The shared control model is fundamentally different from the SIMD model. Therefore, there is a set of architectural and implementation issues that must be addressed before an efficient shared control implementation can be realized. The performance of SharC was studied using a detailed RTL simulator. Applications were implemented in the SIMD mode and in the shared control mode. The performance of the shared control implementation was compared to a pure SIMD implementation for several applications (the highly irregular N-queen and massively parallel database search algorithms were used). Even at the small scales of the problems considered, shared control resulted in significant improvement in performance over the pure SIMD implementation. Unfortunately, larger applications (or problem sizes of the studied applications) could not be studied because: (i) simulating a massively parallel machine on a uniprocessor is inefficient: the simulation run time for the 1024 PE 8-queen problem was in excess of 20 hours on a SunSparc 100 workstation, and (ii) we do not have a compiler for the machine; all the examples were written and coordinated in assembly language. Finally, we used a parameterized conditional region benchmark to demonstrate that shared control is beneficial for applications where even a small degree of control parallelism is present.
References

1. Abu-Ghazaleh, N. B. Shared Control: A Paradigm for Supporting Control Parallelism on SIMD-like Architectures. PhD thesis, University of Cincinnati, July 1997. (in press).
2. Abu-Ghazaleh, N. B., and Wilsey, P. A. Models for the synchronization of control units on shared control architectures. Journal of Parallel and Distributed Computing (1998). (in press).
3. Allen, J. D., and Schimmel, D. E. The impact of pipelining on SIMD architectures. In Proc. of the 9th International Parallel Processing Symp. (April 1995), IEEE Computer Society Press, pp. 380–387.
4. Bagley, R. A., Wilsey, P. A., and Abu-Ghazaleh, N. B. Composing functional unit blocks for efficient interpretation of MIMD code sequences on SIMD processors. In Parallel Processing: CONPAR 94 – VAPP VI (September 1994), B. Buchberger and J. Volkert, Eds., vol. 854 of Lecture Notes in Computer Science, Springer-Verlag, pp. 616–627.
5. Bridges, T. The GPA machine: A generally partitionable MSIMD architecture. In Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Architectures (1990), pp. 196–203.
6. Felderman, R., and Kleinrock, L. An upper bound on the improvement of asynchronous versus synchronous distributed processing. In Proceedings of the SCS Multiconference on Distributed Simulation (January 1990), vol. 22, pp. 131–136.
7. Fox, G. What have we learnt from using real parallel machines to solve real problems? Tech. Rep. C3P–522, Caltech Concurrent Computation Program, California Institute of Technology, Pasadena, CA 91125, March 1988.
8. Gokhale, M., Holmes, B., and Iobst, K. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer (April 1995), 23–31.
9. Greenberg, A., and Goodman, J. Sharp approximate models of deflection routing in mesh networks. IEEE Transactions on Communications 41 (Jan. 1993).
10. Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, Second Edition. Morgan Kaufmann Publishers Inc., San Mateo, CA, 1995.
11. Hillis, W. D. The Connection Machine. The MIT Press, Cambridge, MA, 1985.
12. Keryell, R., and Paris, N. Activity counter: New optimization for the dynamic scheduling of SIMD control flow. In 1993 International Conference on Parallel Processing (Aug. 1993), vol. 2, pp. 184–187.
13. MasPar Computer Corporation. MasPar Assembly Language (MPAS) Reference Manual. Sunnyvale, CA, July 1991.
14. Nickolls, J. The design of the MasPar MP-1. In Proceedings of the 35th IEEE Computer Society International Conference (1990), pp. 25–28.
15. Parhami, B. SIMD machines: Do they have a significant future? Computer Architecture News (September 1995), 19–22.
16. Rockoff, T. SIMD instruction caches. In Proceedings of the Symposium on Parallel Architectures and Algorithms '94 (May 1994), pp. 67–75.
17. Sosic, R., and Gu, J. Fast search algorithms for the N-queens problem. IEEE Transactions on Systems, Man, and Cybernetics (Nov. 1991), 1572–1576.
18. Weems, C., Riseman, E., and Hanson, A. Image understanding architecture: Exploiting potential parallelism in machine vision. Computer (Feb 1992), 65–68.
Workshop 23 ESPRIT Projects

Ron Perrott and Colin Upstill, Co-chairmen
The ESPRIT Programme has now been in existence for over a decade. One of the areas to benefit substantially from the injection of ESPRIT research and development funding has been the area of parallel computing or high performance computing. It is therefore appropriate that at this conference there should be two sessions dealing with HPC projects. A wide range of papers was submitted, and those selected represent state-of-the-art projects in various areas of interest. The projects break down into applications which can benefit from the use of HPC and those which deal with the software required for making high performance computers more efficient.

In the former category are papers like Parallel Crew Scheduling in PAROS. This is an application which considers the problem of scheduling airline crews using a network of workstations. It is based on the widely known Carmen system, which is used by most European airlines to schedule their crews among aeroplanes. The objective is to produce a parallel version of the Carmen system, execute it on a network of workstations, and demonstrate the improvements and benefits which can be obtained. Essentially there are two critical parts in the application, dealing with pairing generation and optimisation. The pairing generator distributes the enumeration of pairings over the processors, and the optimiser is based on an iterative Lagrangian heuristic. Initial results indicate the speed-up benefits which can be obtained with this application on a network of workstations.

The second paper deals with CORBA (Common Object Request Broker Architecture) and develops a compliant programming environment for HPC. This is part of an ESPRIT project known as PACHA. The main objective is to help the design of HPC applications using independent software components through the use of distributed objects. It proposes to harness the benefits of distributed and parallel programming using a combination of two standards, namely CORBA and MPI. CORBA provides transparent remote method invocations which are handled by an object request broker that provides a communication infrastructure independent of the underlying network. This project introduces a new kind of object which is referred to as a CORBA parallel object. It allows the aggregation of computing resources to speed up the execution of a software component. The interface is described using an extended IDL to manage data distribution among the objects of the collection. The environment is being implemented using Orbix from Iona Technologies and has already been tested using a signal processing application based on a client-server approach.

The OCEANS project deals with optimising compilers for embedded applications. The objective of OCEANS is to design and implement an optimising
compiler that utilises aggressive analysis techniques and integrates source level restructuring transformations with low level machine-dependent optimisations. This will then provide a prototype framework for iterative compilation, where feedback from the low level is used to guide the selection of a suitable sequence of source level transformations and vice versa. The OCEANS compiler is centred around two major components, a high level restructuring system known as MT1 and a low level system for supporting language transformations and optimisations known as Salto. In order to validate the compiler, four public domain multimedia codes have been selected. Initial results indicate that the results using the initial prototype are satisfactory in comparison with the production compiler, but the implementation still needs to be refined at both the high and low levels.

The fourth paper deals with industrial stochastic simulations on large-scale meta-computers, and reports the achievements of the PROMENVIR (Probabilistic Mechanical Design Environment) project. This project culminated in a pan-European meta-computer demonstration of the use of PROMENVIR running a multibody simulation application. PROMENVIR is an advanced computing tool for performing stochastic analysis of generic physical systems. The tool provides the user with a framework for running a stochastic Monte Carlo simulation using a preferred deterministic solver and then provides sophisticated means to analyse the results. The European meta-computer consisted of a total of 102 processors geographically distributed among the members of the consortium. The paper details the problems that were encountered in establishing the meta-computer and carrying out the execution of the chosen application. It makes useful contributions on the feasibility and reliability of such an approach.

HIPEC stands for High Performance Computing Visualisation System Supporting Network Electronic Commerce applications. This is an ESPRIT project whose main objective is to integrate advanced high performance technologies to form a generic electronic commerce application. The application is aimed at giving a large number of SMEs a configuration and visualisation tool to support the selling process within the show room and to enlarge their business using the same tool over the Internet. Computer graphics technology plays a primary role in the project, as realistic images are required in such commerce. The particular experiment was based on the generic technology and applied to a bathroom furniture application. The project was shown to be of benefit in this particular application area.

The final paper is the porting of the SEMC3D electromagnetics code to HPF. This was a project within the ESPRIT PHAROS project, in which four industrial simulation codes are ported to HPF. The electromagnetic simulation code was ported onto two machines, namely the Meiko CS2 and the IBM SP2. The paper describes the application and the numerical computational features which are important, and then analyses the porting of the code to a parallel machine. A series of phases such as code cleaning, translation to Fortran 90, and inserting HPF directives are described. This is followed by a series of measurements. The first stage in the port was the conversion of the Fortran 77 code to Fortran 90,
for which converted-code single-processor times on the IBM SP2 were obtained. This was then followed by conversion to HPF code, and performance times over a range of processors were documented. Finally, a comparison with message passing code was carried out. The speed-ups obtained in the benchmarking of this particular code are described as encouraging by the users, and give evidence of the benefits of using HPF.
Parallel Crew Scheduling in PAROS*

Panayiotis Alefragis¹, Christos Goumopoulos¹, Efthymios Housos¹, Peter Sanders², Tuomo Takkula³, and Dag Wedelin³

¹ University of Patras, Patras, Greece
² Max-Planck-Institut für Informatik, Saarbrücken, Germany
³ Chalmers University of Technology, Göteborg, Sweden

* This work has been supported by the ESPRIT HPCN program.
Abstract. We give an overview of the parallelization work done in PAROS. The specific parallelization objective has been to improve the speed of airline crew scheduling on a network of workstations. The work is based on the Carmen System, which is used by most European airlines for this task. We give a brief background to the problem. The two most time critical parts of this system are the pairing generator and the optimizer. We present a pairing generator which distributes the enumeration of pairings over the processors. This works efficiently on a large number of loosely coupled workstations. The optimizer can be described as an iterative Lagrangian heuristic, and allows only for rather fine-grained parallelization. On low-latency machines, parallelizing the two innermost loops at once works well. A new "active-set" strategy makes more coarse-grained communication possible and even improves the sequential algorithm.
1 The PAROS Project
The PAROS (Parallel large scale automatic scheduling) project is a joint effort between Carmen Systems AB of Göteborg, Lufthansa, the University of Patras and Chalmers University of Technology, and is supported under the European Union ESPRIT HPCN (High Performance Computing and Networking) research program. The aim of the project is to generally improve and extend the use and performance of automatic scheduling methods, stretching the limits of present technology. The starting point of PAROS is the already existing Carmen (Computer Aided Resource Management) System, used for crew scheduling by Lufthansa as well as by most other major European airlines. The crew costs are one of the main operating costs for any large airline, and the Carmen System has already significantly improved the economic efficiency of the schedules for all its users. For a detailed description of the Carmen system and airline crew scheduling in general, see [1]. See also [3,6,7] for other approaches to this problem. At present, a typical problem involving the scheduling of a medium size fleet at Lufthansa requires 10-15 hours of computing; for large problems, as much as 150 hours. The closer to the actual day of operation the scheduling can take
place, the more efficient it will be with respect to market needs and late changes. Increased speed can also be used to solve larger problems and to increase the solution quality. The speed and quality of the crew scheduling algorithms are therefore important for the overall efficiency of the airline. An important objective of PAROS is to improve the performance of the scheduling process through parallel processing. There exist attempts to parallelize similar systems on high end parallel hardware (see [7]), but our focus is primarily to better utilize a network of workstations and/or multiprocessor workstations, allowing airlines such as Lufthansa to use their already existing hardware more efficiently. In section 2 we give a general description of the crew scheduling process in the Carmen system. The following sections report the progress in parallelizing the main components of the scheduling process: the pairing generator in section 3, and the optimizer in section 4. Section 5 gives some conclusions and pointers for future work.
2 Solution Methodology and the Carmen System
For a given timetable of flight legs (non-stop flights), the task of crew scheduling is to determine the sequence of legs that every crew should follow during one or several workdays, beginning and ending at a home base. Such routes are known as pairings. The complexity of the problem is due to the fact that the crews cannot simply follow the individual aircraft, since the crews on duty have to rest in accordance with very complicated regulations, and there must always be new crews at appropriate locations which are prepared to take over. At Lufthansa, problems are usually solved as weekly problems with up to the order of 10^4 legs. For problems of this kind there is usually some top level heuristic which breaks down the problem into smaller subproblems, often to different kinds of daily problems. Due to the complicated rules, the subproblems are still difficult to handle directly, and are solved using some strategy involving the components of pairing generation and optimization. The main loop of the Carmen solution process is shown in Figure 1. Based on an existing initial or current best solution, a subproblem is created by opening up a part of the problem. The subproblem is defined by a connection matrix containing a number of legal leg connections, to be used for finding a better solution. The connection matrix is input to the pairing generator which enumeratively generates a huge number of pairings using these connections. Each pairing must be legal according to the contractual rules and airline regulations, and for each pairing a cost is calculated. The generated pairings are sent to the optimizer, which selects a small subset of the pairings in order to minimize the total crew costs, under the constraint that every flight leg must receive a crew. In both steps the main algorithmic difficulty is the combinatorial explosion of possibilities, typical for problems of this kind. Rules and costs are handled by a special rule language, which is translated into runnable code by a rule compiler; the compiled rules are then called by the pairing generator.
Fig. 1. Main loop in Carmen solution process.
A typical run of the Carmen system consists of 50-100 daily iterations, in each of which 10^4 to 10^6 pairings are generated. Most of the time is spent in the pairing generator and the optimizer. For some typical Lufthansa problems, profiling reveals that 5-10% of the time is consumed in the overall analysis and the subproblem selection, about 70-85% is spent in the pairing generator, and 10-20% in the optimizer. These figures can however vary considerably for different kinds of problems. The optimizer can sometimes take a much larger proportion of the time, primarily depending on the size and other characteristics of the problem, the complexity of the rules that are called in the inner loop of the pairing generator, and various parameter settings. As a general conclusion however, it is clear that the pairing generator and the optimizer are the main bottlenecks of the sequential system, and they have therefore been selected as the primary targets for parallelization.
3 Generator Parallelization
The pairing generation algorithm is a quite straightforward depth-first enumeration that starts from a crew base and builds a chain of legs by following possible connections as defined by the matrix of possible leg connections. The search is heuristically pruned by limiting the number of branches in each node, typically to between 5 and 8. The search is also interrupted if the chain created so far is not legal. The parallelization of the pairing generator is based on a manager/worker scheme, where each worker is given a start leg from which it enumerates a part of the tree. The manager dynamically distributes the start legs and the necessary additional problem information to the workers in a demand driven manner. Each worker generates all legal pairings of the subtree and returns them to the manager. The implementation has been made using PVM, and several efforts have been made to minimize the idle and communication times. Large messages are sent whenever possible. Load balancing is achieved by implementing a dynamic
workload distribution scheme that implicitly takes into account the speed and the current load of each machine. The number of start legs that are sent to each worker also changes dynamically with a fading algorithm. In the beginning, a large number of start legs is given, and in the end only one start leg per worker is assigned. This scheme reconciles the two goals of minimizing the number of messages and of good load balancing. Efficiency is also improved by pre-fetching: a worker requests the next set of start legs before they are needed, so it can perform computation while its request is being serviced by the manager. The parallel machine can also be extended dynamically. It is possible to add a new host at any time to the virtual parallel machine, and this will cause a new worker to be started automatically. Another feature of the implementation is the suspend/resume mechanism, which makes sure that only idle workstations are used for generation. Periodically the manager requests the CPU load of the worker, suspends its operation if the CPU load from other tasks is over a specified limit, and resumes its operation if the CPU load is below a specified limit. The performance overhead depends on the rate of the load checking operation, which in any case is small enough (≤ 1%).
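As a concrete illustration, the following is a minimal sketch of the depth-first enumeration described at the beginning of this section. The toy connection matrix, the stand-in legality rule and all names are our own assumptions, not Carmen data or code; in the real system only complete, legal pairings ending at a home base would be emitted, each with a computed cost.

/* Hedged sketch: depth-first pairing enumeration from a start leg, with a
 * heuristic branch limit and pruning of illegal chains.  Everything here is
 * an illustrative assumption. */
#include <stdio.h>

#define N_LEGS     5
#define MAX_DEPTH  4      /* longest chain considered                      */
#define MAX_BRANCH 3      /* heuristic branch limit (5-8 in the text)      */

/* connection[i][j] == 1 if leg j may follow leg i */
static const int connection[N_LEGS][N_LEGS] = {
    {0,1,1,0,0},
    {0,0,1,1,0},
    {0,0,0,1,1},
    {0,0,0,0,1},
    {0,0,0,0,0},
};

/* Stand-in for the compiled rule system: here, just a length limit. */
static int is_legal(const int *chain, int len) { (void)chain; return len <= MAX_DEPTH; }

static void emit_pairing(const int *chain, int len)
{
    printf("pairing:");
    for (int i = 0; i < len; i++) printf(" %d", chain[i]);
    printf("\n");
}

static void enumerate(int *chain, int len)
{
    if (!is_legal(chain, len))
        return;                       /* interrupt the search on illegal chains */
    emit_pairing(chain, len);

    int branches = 0, last = chain[len - 1];
    for (int next = 0; next < N_LEGS && branches < MAX_BRANCH; next++) {
        if (!connection[last][next]) continue;
        chain[len] = next;
        enumerate(chain, len + 1);
        branches++;
    }
}

int main(void)
{
    /* a worker expands the subtree rooted at one start leg handed out by
     * the manager in a demand-driven way */
    int chain[MAX_DEPTH + 1];
    chain[0] = 0;
    enumerate(chain, 1);
    return 0;
}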
3.1 Scalability
For the problems tested, the overhead for worker initialization depends on the size of the problem and consists mainly of the initialization of the legality system and the distribution of the connection matrix. This overhead is very small compared to the total running time. The data rate for sending generated pairings from a worker to the manager is typically about 6 KB/s, assuming a typical workstation and Lufthansa rules. This is to be compared with the standard Ethernet speed of 1 MB/s, so after its start-up phase the generator is CPU bound for as many as 200 workstations. Given the high granularity and the asynchronous execution pattern we therefore expect the generator to scale well up to the order of 100 workstations connected with standard Ethernet, of course provided that the network is not too loaded with other communication.
3.2 Fault Tolerance
Several fault tolerance features have been implemented. To be reliable when many workstations are involved, the parallel generator can recover from task and host failures. The notification mechanism of PVM [4] is used to provide application level fault tolerance to the generator. A worker failure is detected by the manager which also keeps the current computing state of the worker. In case of a failure the state is used for reassigning the unfinished part of the work to another worker. The failure of the manager can be detected by the coordinating process, and a new manager will be started. The recovery is achieved since the manager periodically saves its state to the filesystem (usually NFS) and thus can restart from the last checkpoint. The responsibility of the new manager is to reset the workers and to request only the
generation of the pairings that have not been generated, or have been generated partially. This behavior can save a lot of computing time, especially if the fault appears near the end of the generation work.
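A minimal sketch of how PVM's notification mechanism can be used for the worker-failure detection described above follows; the message tag and the helper routines reassign_start_legs() and handle_worker_message() are hypothetical names for illustration only, not part of the PAROS code.

/* Hedged sketch: manager-side failure detection via PVM 3 notifications.
 * Only the pvm_* calls are real library functions; everything else is an
 * illustrative assumption. */
#include <pvm3.h>

#define TAG_TASK_EXIT 900   /* assumed tag reserved for exit notifications */

extern void reassign_start_legs(int dead_tid);       /* hypothetical helper */
extern void handle_worker_message(int tag, int src); /* hypothetical helper */

void watch_workers(int *worker_tids, int nworkers)
{
    /* Ask PVM to send a message with tag TAG_TASK_EXIT whenever one of the
     * listed worker tasks exits (crashes, is killed, or its host fails). */
    pvm_notify(PvmTaskExit, TAG_TASK_EXIT, nworkers, worker_tids);
}

void manager_loop(void)
{
    for (;;) {
        int bufid = pvm_recv(-1, -1);          /* any message from anyone */
        int len, tag, src;
        pvm_bufinfo(bufid, &len, &tag, &src);

        if (tag == TAG_TASK_EXIT) {
            int dead_tid;
            pvm_upkint(&dead_tid, 1, 1);       /* tid of the failed worker */
            /* reassign the unfinished start legs recorded for this worker */
            reassign_start_legs(dead_tid);
        } else {
            handle_worker_message(tag, src);   /* pairings, work requests, ... */
        }
    }
}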
3.3 Experimental Results
We have measured the performance of the parallel generator experimentally. The experiments have been performed on a network of HP 715/100 workstations of roughly equivalent performance, connected by Ethernet, at the computing center of the University of Patras, during times when almost exclusive usage of the network and the workstations was possible. The results are shown in Table 1, which gives the elapsed time in seconds for a single iteration of the generator as a function of the number of CPUs. In all cases the generation time decreases almost linearly with the number of CPUs used, i.e. the speedup is close to linear.
Table 1. Parallel generator times for typical pairing problems.

problem name   legs  pairings  1 CPU  2 CPUs  4 CPUs  6 CPUs  8 CPUs  10 CPUs
lh dl gg        946  396908    26460  13771   7061    4536    3466    2818
lh dl splimp    946  318938    20760  10797   5448    3686    2797    2181
lh wk gg       6196  594560    31380  16834   8436    5338    4312    3288
sj dl kopt     1087  159073    10860   5563   2804    1892    1385    1112

4 Optimizer Parallelization
The problem solved by the optimizer is known as the set covering problem with additional base capacity constraints, and can be expressed as

    min{cx : Ax ≥ 1, Cx ≤ d, x binary}

Here A is a 0-1 matrix and C is arbitrary non-negative. In this formulation every variable corresponds to a generated pairing and every constraint corresponds to a leg. The size of these problems usually ranges from 50 × 10^4 to 10^4 × 10^6 (constraints × variables), usually with 5-10 non-zero A-elements per variable. In most cases only a few capacity (≤) constraints are present. This problem in its general form is NP-hard, and in the Carmen System it is solved by an algorithm [9] that can be interpreted as a Lagrangian-based heuristic. Very briefly, the algorithm attempts to perturb the cost function as little as possible to give an integer solution to the LP-relaxation. From the point of view of parallel processing it is worth pointing out that the character of the algorithm is very different, and for the problems we consider also much faster,
compared to the common branch and bound approaches to integer programming, and is rather more similar to an iterative equation solver. This also means that the approach to parallelization is completely different, and it is a challenge to achieve this on a network of workstations.

The sequential algorithm can be described in the following simplified way that highlights some overall aspects which are relevant for parallelization. On the top level, the sequential algorithm iteratively modifies a Lagrangian cost vector c¯ which is first initialized to c. The goal of the iteration is to modify the costs so that the sign pattern of the elements of c¯ corresponds to a feasible solution, where c¯j < 0 means that xj = 1, and where c¯j > 0 means that xj = 0. In the sequential algorithm the iteration proceeds by considering one constraint at a time, where the computation for this constraint modifies the c¯ values for the variables of that constraint. Note that this implies that values corresponding to variables not in the constraint are not changed by this update. The overall structure is summarized as

    c¯ = c
    reset all si to 0
    κ = 0
    repeat
        for every constraint i
            ri = c¯i − si
            si = function of ri, bi and some parameters
            c¯i = ri + si
        increase κ according to some rule
    until no sign changes in c¯

In this pseudocode, c¯i is the sub-vector of c¯ corresponding to non-zero elements of constraint i. The vector si represents the contribution of constraint i to c¯. This contribution is cancelled out before every iteration of constraint i, where a new contribution is computed. The vectors ri are temporaries. A property relevant to parallelization is that when an iteration of a constraint has updated the reduced costs of its variables, the following constraints have to use these new updated values. Another property is that the vector si is usually a function of the current c¯-values of at most two variables in the constraint, and these variables are referred to as the critical variables of the constraint.

In the following subsections we summarize the different approaches to parallelization that have been investigated. In addition to this work, an aggressive optimization of the inner loops was performed, leading to an additional speedup of roughly a factor of 3.
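For illustration, a structural sketch of how one pass of the iteration above might be organized around a sparse per-constraint representation. The data layout and names are ours, and update_contribution() is only a crude placeholder: the paper does not reproduce the actual update rule beyond saying it is a function of ri, bi and some parameters, driven by the critical variables.

/* Structural sketch only -- NOT the Carmen algorithm.  It shows the data
 * flow (cancel old contribution, compute new one, write back), with the
 * real update rule deliberately replaced by a placeholder. */
typedef struct {
    int     nvars;   /* number of variables appearing in constraint i        */
    int    *var;     /* their indices into cbar                              */
    double *s;       /* this constraint's current contribution s_i           */
    double  b;       /* right-hand-side / capacity data for the constraint   */
} Constraint;

/* Placeholder for "s_i = function of r_i, b_i and some parameters"; the
 * paper's actual rule is NOT given here. This crude stand-in merely forces
 * the smallest reduced cost of the row negative so the row is "covered". */
static void update_contribution(const double *r, int n, double b, double *s)
{
    (void)b;
    int argmin = 0;
    for (int k = 1; k < n; k++)
        if (r[k] < r[argmin]) argmin = k;
    for (int k = 0; k < n; k++) s[k] = 0.0;
    if (r[argmin] > 0.0)
        s[argmin] = -r[argmin] - 1e-3;
}

/* One pass over all constraints, using caller-provided scratch space r of
 * size rmax; returns 1 if any reduced cost changed sign during the pass. */
int iterate_once(double *cbar, Constraint *con, int ncon, double *r, int rmax)
{
    int sign_change = 0;
    for (int i = 0; i < ncon; i++) {
        Constraint *c = &con[i];
        if (c->nvars > rmax) continue;              /* scratch buffer too small */
        for (int k = 0; k < c->nvars; k++)          /* r_i = cbar_i - s_i       */
            r[k] = cbar[c->var[k]] - c->s[k];
        update_contribution(r, c->nvars, c->b, c->s);
        for (int k = 0; k < c->nvars; k++) {        /* cbar_i = r_i + s_i       */
            double v = r[k] + c->s[k];
            if ((v < 0.0) != (cbar[c->var[k]] < 0.0))
                sign_change = 1;
            cbar[c->var[k]] = v;
        }
    }
    return sign_change;
}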
4.1 Constraint Parallel Approach
A first observation is that if constraints have no common variables they may be computed independently of each other, and a graph coloring of the constraints
would reveal the constraint groups that could be independent. Unfortunately, the structure of typical set covering problems (with many variables and few long constraints) is such that this approach is not possible to apply in a straightforward way. The approach can however be modified to let every processor maintain a local copy of c¯ and a subset of the constraints. Each processor then iterates its own constraints and updates its local copy of c¯. The different copies of c¯ must then be kept consistent through communication. Although the nature of the algorithm is such that the c¯ communication can be considerably relaxed, the result for our type of problems did not allow a significant speedup on networks of workstations.
4.2 Variable Parallel Approach
Another way is to parallelize over the variables and the inner loops of the constraint calculation. The main parts of this calculation that concern the parallelization are: 1) collect c¯ for the variables in the constraint; 2) find the two least values of ri; 3) copy the result back to c¯. When the variables are distributed over the processors, each processor is then responsible for only one piece of c¯ and the corresponding part of the A-matrix. Some of the operations needed for the constraint update can be conveniently done locally, but the double minimum computation requires communication. This operation is associative, and it is therefore possible to get the global minimal elements through a reduction operation on the local critical minimum elements. To minimize communication it is also possible to group the constraints and perform the reduction operation for several independent constraints at a time. This strategy has been successfully implemented on an SGI Origin2000, with a speedup of 7 on 8 processors. However, it cannot be used directly on a network of workstations due to the high latency of the network.

We have therefore investigated a relaxation of this algorithm that does a lazy update of c¯. The approach is based on updating c¯ only for the variables required by the constraint group that will be iterated next. The reduced cost vector is then fully updated, based on the contributions of the previous constraint group, during the reduction operation of the current constraint group, and so overlaps computation with both communication and idle time. The computational part of the algorithm is slightly enlarged, and the relaxation requires that constraint groups with a reasonable number of common variables can be created. The graph colouring heuristics that have been used are based on [8], which gives 10-20% fewer groups than a greedy first-fit algorithm, although the computation times are slightly higher. For numerical convergence reasons random noise is added in the solution process, and even with minor changes in the parameters the running times can vary considerably; the solution quality can also be slightly different. Tests are shown in Table 2 for pure set covering problems from the Swedish Railways,
and the settings have been adjusted to ensure that a similar solution quality is preserved on the average (usually well within 1% of the optimum); the problems shown here are considered reasonably representative of the average case. Running times are given in seconds for the lazy variable parallel implementation on 1 to 4 processors. The hardware in this test has been 4 × HP 715/100 connected with 10 Mbps switched Ethernet. It can be seen that reasonably good speedup results are obtained even for medium sized problem instances. For networks of workstations, a possibility here would also be to use faster networks such as SCI [5] or Myrinet [2] and streamlined software interfaces.
Table 2. Results for the lazy variable parallel code.

problem name   rows  columns  1 CPU   2 CPUs  3 CPUs  4 CPUs
sj daily 17sc    58     1915    0.60    2.95    3.53    4.31
sj daily 14sc    94     7388   17.30    9.45   11.78   11.82
sj daily 04sc   429    38148  288.73  124.76   82.94   72.00
sj daily 34sc   419   156197  951.53  634.54  365.90  259.32
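The double minimum computation used in the variable parallel approach above is associative, which is what makes a reduction over the processors' local critical elements possible. The following is a minimal self-contained sketch of such a combine step; the names and data layout are our own, not Carmen code.

/* Hedged sketch of the associative "two smallest values" operator for the
 * reduction in Section 4.2; all names are illustrative assumptions. */
#include <float.h>
#include <stdio.h>

typedef struct { double min1, min2; } TwoMin;   /* invariant: min1 <= min2 */

static TwoMin twomin_identity(void)
{
    TwoMin t = { DBL_MAX, DBL_MAX };
    return t;
}

/* Insert one reduced cost into a TwoMin accumulator. */
static void twomin_add(TwoMin *t, double v)
{
    if (v < t->min1)      { t->min2 = t->min1; t->min1 = v; }
    else if (v < t->min2) { t->min2 = v; }
}

/* Associative combine of two partial results; this is the operator a
 * message-passing reduction would apply to the local critical elements. */
static TwoMin twomin_combine(TwoMin a, TwoMin b)
{
    twomin_add(&a, b.min1);
    twomin_add(&a, b.min2);
    return a;
}

int main(void)
{
    /* two "processors", each holding part of one constraint's reduced costs */
    double part0[] = { 3.5, -1.0, 2.0 };
    double part1[] = { 0.5, 4.0 };

    TwoMin local0 = twomin_identity(), local1 = twomin_identity();
    for (int i = 0; i < 3; i++) twomin_add(&local0, part0[i]);
    for (int i = 0; i < 2; i++) twomin_add(&local1, part1[i]);

    TwoMin global = twomin_combine(local0, local1);
    printf("two smallest: %g %g\n", global.min1, global.min2);  /* -1 0.5 */
    return 0;
}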
4.3 Parallel Active Set Approach
As previously mentioned, the constraint update is usually a function of at most two variables in the constraint. If the set of these critical variables were known in advance, it would in principle be possible to ignore all other variables and obtain the same result much faster. This is not possible, but it gives intuitive support for a variation of the sequential algorithm, where a smaller number of variables that are likely to be critical are selected in an active set. The idea is to make the original algorithm work only on the active set, and then sometimes do a scan over all variables to see if more variables should become active. For our problems this is especially appealing considering the close relation to the so-called column generation approach to the crew scheduling problem, see [3].

The approach was first implemented sequentially and turns out to work well on real problems. An additional and important performance benefit is that the scan can be implemented by a column-wise traversal of the constraint matrix, which is much more cache-friendly and gives an additional sequential speedup of about 3 for large instances.

This modified algorithm also gives a new possibility for parallelization. If the active set is small enough, then the scan will dominate the total execution time, even if the active set is iterated many times between every scan. In contrast to the original algorithm, all communication necessary for a global scan of all variables can be concentrated into a single vector-valued broadcast and reduction operation. Therefore, the parallel active set approach is successful even on a network of workstations connected over Ethernet. Some results for pure set covering problems from Lufthansa are shown in Table 3. The settings have been
adjusted to ensure that a similar solution quality is preserved on the average. Running times are given in seconds for the original production code Prob1, for the new code with the active set strategy disabled, and for the new code with the active set strategy on 1, 2 and 4 processors. The hardware in this test has been 4 × Sun Ultra 1/140 connected with a shared 100 Mbps Ethernet. From these results, and other more
Table 3. Results of parallel active set code. problem name lh dl26 02 lh dl26 04 lh dt1 11 lh dt58 02
rows 682 154 5287 5339
columns 642613 121714 266966 409350
Prob1 no active set 1 CPU 2 CPUs 4 CPUs 9962 4071 843 412 294 1256 373 216 105 60 1560 765 298 78 66 2655 2305 924 406 207
extensive tests, it can be concluded that in addition to the sequential speedup, we can obtain a parallel speedup of about a factor of 3 using a network of four workstations. It does not make sense however to increase the number of workstations significantly, since it is then no longer true that the scan dominates the total running time.
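To make the communication pattern of the scan concrete, the sketch below shows one global scan step: a single vector-valued broadcast of the reduced cost vector followed by a single vector-valued reduction of the candidates found by each worker. This is only an illustration of the idea; the function names, the use of MPI and the choice of MPI_MIN as the combining operator are assumptions, not the actual implementation.

#include <mpi.h>
#include <vector>

// Hypothetical global scan step of the parallel active set approach.
void global_scan_step(std::vector<double>& cbar, std::vector<double>& candidate) {
    // Broadcast the up-to-date reduced costs once per scan...
    MPI_Bcast(cbar.data(), static_cast<int>(cbar.size()), MPI_DOUBLE,
              0, MPI_COMM_WORLD);
    // ...each worker then traverses its own columns of the constraint matrix
    // and fills 'candidate'; all candidates are merged in one reduction.
    std::vector<double> merged(candidate.size());
    MPI_Allreduce(candidate.data(), merged.data(),
                  static_cast<int>(candidate.size()),
                  MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    candidate.swap(merged);
}

Because the active set iterations in between involve no communication at all, the cost of these two collective operations is amortised over many cheap inner iterations.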
5 Conclusions and Future Work
We have demonstrated a fully successful parallelization of the two most time-critical components in the Carmen crew scheduling process, on a network of workstations. For the generator, where a coarse-grained parallelism is possible, a good speedup can be obtained even for a large number of workstations, although issues of fault tolerance then become very important. Further work includes the management of multiple APC jobs, a refined task scheduling scheme, preprocessing inside the workers (e.g. elimination of identical columns) and enhanced management of the virtual machine (active time windows of machines). For the optimizer, which requires a much more fine-grained parallelism, Figure 2 shows a theoretical model of the expected parallel speedup of the different parallelizations, for a typical large problem. Although the variable parallel approach scales better, the total speedup is higher with the active set, since the sequential algorithm is then faster. On the other hand, the active set can be responsible for a lower solution quality for some instances. Further tuning of the parallel active set strategy and other parameters is therefore required to improve the stability of the code, both with respect to quality and speedup. It may be possible to combine the two approaches, but it is not clear whether this would give a significant benefit. Also, the general capacity constraints are not yet handled in any of the parallel implementations.
Fig. 2. Expected parallel speedup.

The parallel components have been integrated in a first PAROS prototype system that runs together with existing Carmen components. Given that the generator and the optimizer are now much faster, we consider parallelization also of some other components, such as the generation of the connection matrix, which should be comparatively simple. Other integration issues are the communication between the generator and the optimizer, which is now done through files, and related issues such as optimizer preprocessing.
References
1. E. Andersson, E. Housos, N. Kohl, and D. Wedelin. OR in the Airline Industry, chapter Crew Pairing Optimization. Kluwer Academic Publishers, Boston, London, Dordrecht, 1997.
2. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet: A gigabit-per-second Local Area Network. IEEE Micro, 15(1):29-36, Feb. 1995.
3. J. Desrosiers, Y. Dumas, M. Solomon, and F. Soumis. Handbooks in Operations Research and Management Science, chapter Time Constrained Routing and Scheduling. North-Holland, 1995.
4. A. Geist. Advanced programming in PVM. Proceedings of EuroPVM 96, pages 1-6, 1996.
5. IEEE. Standard for the Scalable Coherent Interface (SCI), 1993. IEEE Std 1596-1992.
6. S. Lavoie, M. Minoux, and E. Odier. A new approach for crew pairing problems by column generation with an application to air transportation. European Journal of Operational Research, 35:45-58, 1988.
7. R. Marsten. RALPH: Crew Planning at Delta Air Lines. Technical Report, Cutting Edge Optimization, 1997.
8. A. Mehrotra and M. Trick. A clique generation approach to graph coloring. INFORMS Journal on Computing, 8, 1996.
9. D. Wedelin. An algorithm for large scale 0-1 integer programming with application to airline crew scheduling. Annals of Operations Research, 57, 1995.
Cobra: A CORBA-Compliant Programming Environment for High-Performance Computing

Thierry Priol and Christophe René

IRISA - Campus de Beaulieu - 35042 Rennes, France
Abstract. In this paper, we introduce a new concept that we call a parallel CORBA object. It is the basis of the Cobra runtime system, which is compliant with the CORBA specification. Cobra is being developed within the PACHA Esprit project. It aims at helping the design of high-performance applications using independent software components through the use of distributed objects. It provides the benefits of distributed and parallel programming using a combination of two standards: CORBA and MPI. To support parallel CORBA objects, we propose to extend the IDL language to support object and data distribution. In this paper, we discuss the concept of the parallel CORBA object.
1 Introduction
Thanks to the rapid increase in the performance of today's computers, it can now be envisaged to couple several compute-intensive numerical codes to simulate complex physical phenomena more accurately. Due to both the increased complexity of these numerical codes and their future developments, a tight coupling of these codes cannot be envisaged. A loosely coupled approach based on the use of several components offers a much more attractive solution. With such an approach, each of these components implements a particular processing (pre-processing of data, mathematical solver, post-processing of data). Moreover, several solvers are required to increase the accuracy of simulation. For example, fluid-structure or thermal-structure interactions occur in many fields of engineering. Other components can be devoted to pre-processing (data format conversion) or post-processing of data (visualisation). Each of these components requires specific resources (computing power, graphic interface, specific I/O devices). A component which requires a huge amount of computing power can be parallelised so that it will be seen as a collection of processes to be run on a set of network nodes. Processes within a component have to exchange data and have to synchronise. Therefore, communication has to be performed at different levels: between components and within a component. However, requirements for communication between components or within a component are not the same. Within a component, since performance is critical, low-level message passing is required, whereas between components, although performance is still required, modularity/interoperability and reusability are necessary to develop cost-effective applications using generic components.
However, until now, low-level message-passing libraries, such as MPI or PVM, have been used to couple codes. Obviously, this approach does not contribute to the design of applications using independent software components. Such communication libraries were developed for parallel programming, so they do not offer the necessary support for designing components which can be reused by other applications. Solutions already exist to decrease the design complexity of applications. Distributed object-oriented technology is one of these solutions. A complex application can be seen as a collection of objects, which represent the components, running on different machines and interacting together using remote object invocations. An existing standard such as CORBA (Common Object Request Broker Architecture) aims at helping the design of applications using independent software components through the use of CORBA objects¹. CORBA is a distributed software platform which supports distributed object computing. However, exploitation of parallelism within such an object is restricted, in the sense that it is limited to a single node within a network. Therefore, both parallel and distributed programming environments have their own limitations which do not allow, alone, the design of high-performance applications using a set of reusable software components. This paper introduces a new approach that takes advantage of both parallel and distributed programming systems. It aims at helping programmers to design high-performance applications based on the assembling of generic software components. This environment relies on CORBA with extensions to support parallelism across several network nodes within a distributed system. Our contribution concerns extensions to support a new kind of object we call a parallel CORBA object (or parallel object), as well as the integration of message-passing paradigms, mainly MPI, within a parallel object. These extensions exploit as much as possible the functionality offered by CORBA and require few modifications to existing CORBA implementations. The paper is organised as follows. Section 2 gives a short introduction to CORBA. Section 3 describes our extensions to the CORBA specification to support parallelism within an object. Section 4 introduces briefly the Cobra runtime system for the execution of parallel objects. Section 5 describes some related works that share some similarities with our own work. Finally, Section 6 draws some conclusions and perspectives.
2 An Overview of CORBA
CORBA is a specification from the OMG (Object Management Group) [5] to support distributed object-oriented applications. Such applications can be seen as a collection of independent software components or CORBA objects. Objects have an interface that is used to describe the operations that can be remotely invoked. The object interface is specified using the Interface Definition Language (IDL). The following example shows a simple IDL interface:
¹ For the remainder of the paper, we will simply use "object" to name a CORBA object.
interface myservice {
    void put(in double a);
    double myop(inout long i, out long j);
};
An interface contains a list of operations. Operations may have parameters whose types are similar to C++ ones. A keyword added just before the type specifies whether the parameter is an input or an output parameter or both. IDL provides an interface inheritance mechanism so that services can be extended easily. Figure 1 provides a simplified view of the CORBA architecture.
Fig. 1. CORBA system architecture: a client node (IDL stubs) and a server node (IDL skeleton, object adaptor, object implementation) communicating through the Object Request Broker.
In this figure, an object located at the client side is bound to an implementation of an object located at the server side. When a client invokes an operation, communication between the client and the server is performed through the Object Request Broker (ORB) thanks to the IDL stub (client side) and the IDL skeleton (server side). Stub and skeleton are generated by an IDL compiler taking as input the IDL specification of the object. A CORBA compliant system offers several services for the execution of distributed object-oriented applications. For instance, it provides object registration and activation.
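As an illustration of the client side, the sketch below shows how the myservice interface above might be invoked through the generated stub. It assumes the standard C++ mapping and the single-argument _bind() form used later in this paper; the header file name and the host name are invented for the example.

#include "myservice.hh"   // stub header produced by the IDL compiler (name assumed)

int main() {
    // Obtain a reference to a (possibly remote) implementation of the object.
    myservice_var svc = myservice::_bind("server.irisa.fr");
    svc->put(3.14);                      // in parameter: value sent to the server
    CORBA::Long i = 5, j;
    CORBA::Double r = svc->myop(i, j);   // inout i is updated, out j is filled in
    return (r > 0.0) ? 0 : 1;
}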
3 Parallel CORBA Object
CORBA was not originally intended to support parallelism within an object. However, some CORBA implementations provide a multi-threading support for the implementation of objects. Such support is able to exploit simultaneously several processors sharing a physical memory within a single computer. Such level of parallelism does not require modification of the CORBA specification since it concerns only the object implementation at the server side. Instead of having one thread assigned to an operation, it can be implemented using several threads. However, the sharing of a single physical memory does not allow a large
number of processors since it could create memory contention. One objective of our work was to exploit the several dozen nodes available on a network to carry out a parallel execution of an object. To reach this objective, we introduce the concept of a parallel CORBA object.

3.1 Execution Model
Fig. 2. Parallel CORBA object service execution model: the parallel CORBA object is a collection of CORBA objects (Server 1 ... Server n), each with its own object implementation, IDL skeleton and object adapter; a client invocation reaches all of them through the Object Request Broker.
The concept of a parallel object relies on the SPMD (Single Program Multiple Data) execution model, which is now widely used for programming distributed memory parallel computers. A parallel object is a collection of identical objects, each having its own data, so that it complies with the SPMD execution model. Figure 2 illustrates the concept of a parallel object. From the client side, there is no difference between calling a parallel object and calling a standard object. Parallelism is thus hidden from the user. When a call to an operation is performed by a client, the operation is executed by all CORBA objects belonging to the collection. This parallel execution is handled by the stub that is generated by an Extended-IDL compiler, which is a modified version of the standard IDL compiler.

3.2 Extended-IDL
As for a standard object, a parallel object is associated with an interface that specifies which operations are available. However, this interface is described using an IDL we extended to support parallelism. Extensions to the standard IDL aim at both specifying that an interface corresponds to a parallel object and at distributing parameter values among the collection of objects. Extended-IDL is the name of these extensions.
Specifying the degree of parallelism. The first IDL extension corresponds to the specification of the number of objects of the collection that will implement the parallel object. The modification to the IDL language consists in adding two brackets to the IDL interface keyword. A parameter can be added within the two brackets to specify the number of objects belonging to the collection. This parameter can be a "*", which means that the number of objects belonging to the collection is not specified in the interface. The following example illustrates the proposed extension.

interface[*] ComputeFEM {
    typedef double dmat[100][100];
    void initFEM(in dmat mat, in double p);
    void doFEM(in long niter, out double err);
};
In this example, the number of objects will be fixed at runtime depending on the available resources (i.e. the number of network nodes, if we assume that each object of the collection is assigned to exactly one node). The implementation of a parallel object may require a given number of objects in the collection to be able to run correctly. Such a number may be inserted within the two brackets of the proposed extension. The following example shows a parallel object service which is made of 4 objects.

interface[4] ComputeFEM {
    ...
};
Instead of giving a fixed number of objects in the collection, a function may be added to specify a valid number of objects in the collection. The following example illustrates this possibility. In that case, the number of objects in the collection may only be a power of 2.

interface[2^n] ComputeFEM {
    ...
};
It is the responsibility of the runtime system, in our case Cobra, to check whether the number of network nodes allocated conforms to the specification of the parallel object. IDL allows a new interface to inherit from an existing one. A parallel interface can do the same, but with some restrictions. A parallel interface can inherit only from an existing parallel interface; inheritance from a standard interface is forbidden. Moreover, inheritance is allowed only between parallel interfaces whose specified numbers of objects in the collection are compatible. The following example illustrates this restriction.

interface[*] MatrixComponent {
    ...
};
interface[2^n] ComputeFEM : MatrixComponent {
    ...
};
In this example, interface ComputeFEM derives from interface MatrixComponent. The new interface has to be implemented using a collection with a power of 2 objects. In the following example, the Extended-IDL compiler will generate an error because the inheritance is not valid:

interface[3] MatrixComponent {
    ...
};
interface[2^n] ComputeFEM : MatrixComponent {
    ...
};
Specifying data distribution. Our second extension to the IDL language concerns data distribution. The execution of a method on the client side will provoke the execution of the method on every object of the collection. Since each object of the collection has its own separate address space, we must specify how to distribute parameter values for each operation. The attributes and types of operation parameters act on the data distribution. The proposed extension of IDL for data distribution is allowed only for parameters of operations defined in a parallel interface. When a standard IDL type is associated with a parameter with an in mode, each object of the collection will receive the same value. When a parameter of an operation has either an out or an inout mode, as a result of the execution of the operation, the stub generated by the Extended-IDL compiler will get a value from one of the objects of the collection. The IDL language provides multidimensional fixed-size arrays which contain elements of the same type. The size along each dimension has to be specified in the definition. We provide some extensions to allow the distribution of arrays among the objects of a collection. Data distribution specifications apply to the in, out and inout modes. They are similar to the ones already defined by HPF (High Performance Fortran). The following example gives a brief overview of the proposed extension.

interface[*] MatrixComponent {
    typedef double dmat[100][100];
    typedef double dvec[100];
    void matrix_vector_mult(in dist[BLOCK][*] dmat mat, in dvec v, out dist[CYCLIC] dvec u);
};
This extension consists in adding a new keyword (dist) which specifies how an array is distributed among the objects of the collection. For example, the 2D array mat is distributed by blocks of rows. The stubs generated by the Extended-IDL compiler do not perform the same work when the parameter is an input or an output parameter. With an input parameter, the stub must scatter the distributed array so that each object of the collection receives a subset of the whole array. With an output parameter, the stub must do the reverse operation: each object of the collection contains a subset of the array, so the stub is in charge of gathering data from the objects belonging to the collection.
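As a sketch of the index computation such a stub might perform for the BLOCK-distributed rows of dmat, the fragment below assumes the usual HPF rule of ceil(100/nobj) consecutive rows per object; the exact rule is not spelled out in the paper, so this is an illustration only.

#include <algorithm>

struct RowRange { int first; int count; };

// Rows of the dist[BLOCK][*] matrix assigned to object k of a collection of nobj objects.
RowRange block_rows(int k, int nobj, int nrows = 100) {
    int block = (nrows + nobj - 1) / nobj;                  // rows per object
    int first = k * block;
    int count = std::max(0, std::min(block, nrows - first));
    return { first, count };
}
// For an in parameter the stub scatters rows [first, first+count) to object k;
// for an out parameter such as the CYCLIC vector u it performs the reverse gather.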
Such gathering may include a redistribution of data if the client is itself a parallel object. In the previous example, the number of objects in the collection is not specified in the interface. Therefore, the number of elements assigned to a particular object can be known only at runtime. This is why a distributed array of a given IDL type is mapped to an unbounded sequence of this IDL type. An unbounded sequence offers the advantage that its length is set up at runtime. We propose to extend the sequence structure to store information related to the distribution.
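Purely as an illustration of the kind of information such an extended sequence would have to carry (the field names below are not taken from the Cobra implementation), one can imagine a C++ mapping along these lines:

#include <vector>

enum class DistKind { Block, Cyclic, Replicated };

// Hypothetical view of a distributed unbounded sequence: the local elements
// plus the distribution information needed for scattering, gathering and
// redistribution between parallel objects.
template <typename T>
struct DistSequence {
    std::vector<T> local;        // local elements; length fixed at runtime
    DistKind       kind;         // how the global array is split over the collection
    int            globalLength; // size of the whole (undistributed) array
    int            origin;       // global index of the first local element
};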
4 A Runtime System to Support Parallel CORBA Objects
The Cobra runtime system [2] aims at providing resource allocation for the execution of parallel objects. It is targeted to a network of PCs connected together using SCI [6]. Resource allocation consists in providing network nodes and shared virtual memory regions for the execution of parallel objects. Resource allocation services are performed by the resource management service (RmProcess) of Cobra. It is used when a parallel service must be bound to a client. We propose to extend the bind method provided by most of the CORBA implementations. Binding to a parallel object differs from the standard binding method: a reference to a virtual parallel machine (vpm) is given as an argument of the bind method instead of a single machine. The vpm reference is obtained through the Cobra resource allocator. The following example illustrates how to use a parallel object service within the Cobra runtime:

...
// Obtain a reference from the RmProcess service
cobra = RmProcess::_bind("cobra.irisa.fr");
// Create a VPM
cobra->mkvpm(vpmname, NumberOfNodes, NORES);
// Get a reference to the allocated vpm
pap_get_info_vpm(&vpm, vpmname);
// Obtain a reference from the parallel object service
MatrixComponent cs = MatrixComponent::_bind( &vpm );
...
// Invoke an operation provided by the MatrixComponent service
cs->matrix_vector_mult( a, b, &c);
...
The bind method may be called either by a single object, or by all objects belonging to a collection if the client is itself a parallel object.
5 Related Works
Several projects deal with environments for high-performance computing combining the benefits of distributed and parallel programming. The RCS [1], NetSolve [3] and Ninf [8] projects provide an easy way to access linear algebra
method libraries which run on remote supercomputers. Each method is described by a specific interface description language. Clients invoke methods through specific functions whose arguments specify the method name and the method arguments. These projects propose some mechanisms to manage load balancing across different supercomputers. One drawback of these environments is the difficulty for the user to add new functions to the libraries. Moreover, they are not compliant with a relevant standard such as CORBA. The Legion [4] project aims at creating a worldwide environment for high-performance computing. Many principles of CORBA (such as heterogeneity management and object location) are provided by the Legion runtime, although Legion is not CORBA-compliant. It manipulates parallel objects to obtain high performance. All these features are in common with our Cobra runtime. However, Legion provides other services, such as load balancing on different hosts, fault tolerance and security, which are not present in Cobra. The PARDIS [7] project proposes a solution very close to our approach because it extends the CORBA object model to a parallel object model. A new IDL type is added: dsequence (for distributed sequence). It is a generalisation of the CORBA sequence. This new sequence describes the data type, the data size, and how the data must be distributed among objects. In PARDIS, the distribution of objects is left to the programmer. This is the main difference with Cobra, for which a resource allocator is provided. Moreover, in Cobra, the Extended-IDL allows parallel object services to be described in more detail.
6 Conclusion and Perspectives
This paper introduced the parallel CORBA object concept. It is a collection of standard CORBA objects. Its interface is described using an Extended-IDL to manage data distribution among the objects of the collection. Cobra is being implemented using Orbix from Iona Technologies. It has already been tested for building a signal processing application using a client/server approach [2]. For this application, the most compute-intensive part of the application is encapsulated within a parallel CORBA object while the graphical interface is a Java applet. This applet, acting as a client, is connected to the server through the CORBA ORB. Current work focuses on experiments with the coupling of numerical codes. Particular attention will be paid to the performance of the ORB, which seems to be the most critical part of the software environment for obtaining the required performance. It is planned, within the PACHA project, to implement an ORB that fully exploits the performance of the SCI clustering technology while ensuring compatibility with existing ORBs through standard protocols such as TCP/IP.
References
1. P. Arbenz, W. Gander, and M. Oettli. The Remote Computation System. In HPCN Europe '96, volume 1067 of LNCS, pages 662-667, 1996.
2. P. Beaugendre, T. Priol, G. Alleon, and D. Delavaux. A client/server approach for HPC applications within a networking environment. In HPCN'98, pages 518-525, April 1998.
3. H. Casanova and J. Dongarra. NetSolve: A Network Server for Solving Computational Science Problems. The International Journal of Supercomputer Applications and High Performance Computing, 11(3):212-223, 1997.
4. A. S. Grimshaw, W. A. Wulf, and the Legion team. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 1(40):39-45, January 1997.
5. Object Management Group. The Common Object Request Broker: Architecture and Specification 2.1, August 1997.
6. Dolphin Interconnect. Clustar interconnect technology. White Paper, 1998.
7. K. Keahey and D. Gannon. PARDIS: A CORBA-based Architecture for Application-level Parallel Distributed Computation. In Proceedings of Supercomputing '97, November 1997.
8. M. Sato, H. Nakada, S. Sekiguchi, S. Matsuoka, U. Nagashima, and H. Takagi. Ninf: A Network Based Information Library for Global World-Wide Computing Infrastructure. In HPCN Europe '97, volume 1225 of LNCS, pages 491-502, 1997.
OCEANS: Optimising Compilers for Embedded ApplicatioNS

Michel Barreteau1, François Bodin2, Peter Brinkhaus3, Zbigniew Chamski4, Henri-Pierre Charles1, Christine Eisenbeis5, John Gurd6, Jan Hoogerbrugge4, Ping Hu5, William Jalby1, Peter M. W. Knijnenburg3, Michael O'Boyle7, Erven Rohou2, Rizos Sakellariou6, André Seznec2, Elena A. Stöhr6, Menno Treffers4, and Harry A. G. Wijshoff3
1 Laboratoire PRiSM, Université de Versailles, 78035 Versailles, France
2 IRISA, Campus Universitaire de Beaulieu, 35042 Rennes, France
3 Department of Computer Science, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
4 Philips Research, Information and Software Technology, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
5 INRIA, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France
6 Department of Computer Science, The University, Manchester M13 9PL, U.K.
7 Department of Computer Science, The University, Edinburgh EH9 3JZ, U.K.

Abstract. This paper presents an overview of the activities carried out within the ESPRIT project OCEANS, whose objective is to investigate and develop advanced compiler infrastructure for embedded VLIW processors. This combines high- and low-level optimisation approaches within an iterative framework for compilation.
1 Introduction
Embedded applications have become increasingly complex during the last few years. Although sophisticated hardware solutions, such as those exploiting instruction level parallelism, aim to provide improved performance, they also create a burden for application developers. The traditional task of optimising assembly code by hand becomes unrealistic due to the high complexity of hardware/software. Thus the need for sophisticated compiler technology is evident. Within the OCEANS project, the consortium intends to design and implement an optimising compiler that utilises aggressive analysis techniques and integrates source-level restructuring transformations with low-level, machine dependent, optimisations [1,14,16]. A major objective of the project is to provide a prototype framework for iterative compilation where feedback from the lowlevel is used to guide the selection of a suitable sequence of source-level transformations and vice versa. Currently, the Philips TriMedia (TM1000) VLIW processor [8] is used for the validation of the system. In this paper, we present the work that has been carried out during the first 15 months since the project started (September 1996). This has largely
concentrated on the development of the necessary compiler infrastructure. An overall description of the system is given in Section 2. Sections 3 and 4 present the high-level and the low-level subsystems respectively, while the steps that have been taken towards their integration are highlighted in Section 5. Finally, some results from the initial validation of the system are shown in Section 6, and the paper is concluded with Section 7. (This research is supported by the ESPRIT IV reactive LTR project OCEANS, under contract No. 22729.)

Fig. 1. The Compilation Process: MT1 and the code generator produce File.s and File.IL from File.f; Salto/Sea (with PiLo and LoRa) produce File_opt.s and Report.IL.
2 An Overview of the OCEANS Compiler System
The OCEANS compiler is centred around two major components: a high-level restructuring system, MT1, and a low-level system for supporting assembly language transformations and optimisations, Salto, which is coupled with Sea, a set of classes that provides an abstract view of the assembly code, and with tools for software pipelining (PiLo) and register allocation (LoRa). Their interaction is illustrated in Figure 1, which shows the overall organisation of the OCEANS compilation process. In particular, a program is compiled in three main steps:

– First, MT1 performs lexical, syntactical and semantic analysis of a source Fortran program (File.f). Also, a sequence of source program transformations can be applied.
– The restructured source program is then fed into the code generator, which generates sequential assembly code that is annotated with instruction identifiers used to identify common objects in MT1 and Salto, and a file written in an Interface Language (File.IL) that provides information on data dependences and control flow graphs.
– Finally, Salto (coupled with Sea) performs code scheduling and register allocation. At this step, guarded instructions are created and resource constraints are taken into account.

The above process is repeated iteratively until a certain level of performance is reached. Thus, different optimisations, both at the source level and the low level, are checked and evaluated. An important feature of the system is the existence of a client-server protocol that has been implemented in order to provide easy access to the compiler over the Internet for all members of the consortium. MT1 and the code generator are located at Leiden, and Salto, Sea, PiLo and LoRa are located at Rennes.
3 High-Level Transformations
Optimizing and restructuring compilers incorporate a number of program transformations that replace program fragments by semantically equivalent fragments to obtain more efficient code for a given target architecture. The problem of finding an optimum order for applying them is commonly known as the phase ordering problem. Within the MT1 compilation system [5] this problem is solved by providing a Transformation Definition Language (TDL) [3] and a Strategy Specification Language (SSL) [2]. Transformations and strategies specified in these languages can be loaded dynamically into the compiler.

3.1 Transformation Definition Language
The TDL is based on pattern matching. The user can specify an input pattern, a transformed output pattern and a condition when the transformation can be legally and/or beneficially applied. Patterns may contain expression and statement variables. When a pattern is matched against the code these variables are bound to actual expressions and code fragments, respectively. The expression and statement variables can be used in turn in the specification of the output pattern and the condition. This mechanism allows one to specify a large number of transformations, such as loop interchange, loop distribution or loop fusion. However, it is not powerful enough to express other important transformations, such as loop unrolling. Therefore, the TDL also allows for user-defined functions in the output pattern. User-defined functions are the interface to the internal data structures of the compiler. In this way, any algorithm for transforming and testing code can be implemented and made accessible to the TDL.

3.2 Strategy Specification Language
The order in which transformations have to be applied is specified using a Strategy Specification Language (SSL). It allows the specification of an optimising strategy at a more abstract level than the source code level. This language contains sequential composition of transformations, a choice construct and two repetitive constructs.
Fig. 2. Sea class hierarchy (seaBase with seaINST, seaCF, seaSCF, seaSBk, seaBBk and seaLoop).

Fig. 3. Typical usage of Sea classes: clone, apply, select, rebuild.
An if statement consists of a transformation that acts as a condition, a then part and an optional else part. The transformation in the condition can be applied successfully or not. If it is successful, the transformations in the then part are to be executed. Optionally, in the else part a list of transformations can be given which should be executed in case the transformation matched but was not applied successfully due to failing conditions. The two repetitive constructs consist of a transformation to be checked and a statement list to be executed if the condition is true or false, respectively. They consist of a while-endwhile and a repeat-until construct. Examples of how to specify strategies in SSL can be found in [2].
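The concrete SSL syntax is defined in [2]; the schematic sketch below only illustrates the intended semantics of these constructs, modelling a transformation as a callable that reports whether it was applied successfully. It is an illustration, not part of MT1.

#include <functional>
#include <vector>

using Transformation = std::function<bool()>;   // true = applied successfully

// if-construct: run the then part when the condition transformation succeeds,
// otherwise the (possibly empty) else part.
void ifConstruct(const Transformation& cond,
                 const std::vector<Transformation>& thenPart,
                 const std::vector<Transformation>& elsePart) {
    for (const auto& t : (cond() ? thenPart : elsePart)) t();
}

// while-construct: keep applying the body as long as the condition succeeds.
void whileConstruct(const Transformation& cond,
                    const std::vector<Transformation>& body) {
    while (cond())
        for (const auto& t : body) t();
}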
4 Low-Level Optimisations
Low-level optimisations are built on top of Salto, a retargetable system for assembly language transformation and optimisation [15]. To facilitate the implementation of optimisations, a set of classes has been designed, Sea (Salto Enhanced Abstraction), that provides an abstract view of the assembly code which is more pertinent to the code scheduling and register allocation problems. The most important features of Sea are that it allows the evaluation of various code transformations before producing the final code, and that it separates the implementation of the global low-level optimisation strategy from the implementation of individual optimisation sequences. The Sea model contains two kinds of objects:

code fragments. The following objects can be used: seaINST, an instruction object; seaCF, an unstructured set of code fragments; seaSCF, a structured subset of the control flow graph with a unique entry point; seaSBk, a superblock; seaBBk, a basic block; and finally, seaLoop, a structured piece of code that has loop properties. Figure 2 illustrates the corresponding class hierarchy.

transformations to be applied to subgraphs. All transformations are characterized by the following main methods: preCond() returns the set of control flow subgraphs that qualify for the transformation; apply() applies the transformation to a given subgraph; and finally, getStatus() returns
the status of the transformation after application (success or failure) and the reason for the failure. The usage of the Sea objects is shown in Figure 3: a transformation is tried on a cloned piece of code, then, according to performance or size criteria, one of the solutions found is chosen and propagated to the low-level program representation using the rebuild() method. The optimisations currently available within Sea are: register renaming, superblock construction [12], guard insertion [11], loop unrolling (also available at the high level), local/superblock scheduling [12], software pipelining, and register allocation. The implementation of software pipelining is based on the tools PiLo [17] and LoRa [9], which generate a modulo scheduling of the loop body.
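The following sketch restates the Figure 3 pattern in code, using the method names given in the text (clone, apply, rebuild); the simplified interfaces and the cost() criterion are assumptions made for illustration, not the actual Sea API.

#include <memory>
#include <vector>

struct CodeFragment {
    virtual std::unique_ptr<CodeFragment> clone() const = 0;
    virtual double cost() const = 0;           // e.g. schedule length or code size
    virtual void rebuild() = 0;                // propagate back to the program
    virtual ~CodeFragment() = default;
};

struct Transformation {
    virtual bool apply(CodeFragment& cf) = 0;  // success or failure (cf. getStatus)
    virtual ~Transformation() = default;
};

// Try each candidate transformation on a clone and commit only the best result.
void evaluateAndCommit(CodeFragment& original,
                       const std::vector<Transformation*>& candidates) {
    std::unique_ptr<CodeFragment> best = original.clone();
    for (Transformation* t : candidates) {
        std::unique_ptr<CodeFragment> trial = original.clone();
        if (t->apply(*trial) && trial->cost() < best->cost())
            best = std::move(trial);
    }
    best->rebuild();
}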
5 Integration

5.1 The Interface Language
In order to transmit information between the various components of the compiler, an Interface Language (IL) was designed. This allows the propagation of information, such as data dependences and loop control data, from MT1 to Salto, as well as feedback information from the scheduled code back to MT1. An IL description consists of three sections: a list of keywords that specifies the list of attributes that can apply to an object; a default level setting that indicates the type of code the objects belong to; and a list of object references which specify the nature, contents and attributes of an object. More details on the IL can be found in [7].

5.2 Information Forwarded and Feedback
Data dependence information is propagated from MT1 to Salto and is used for memory disambiguation. The feedback from Sea to MT1 (file Report.IL in Figure 1) contains information on the code structure, the basic blocks, as well as a record of the transformations that were applied. Data related to each basic block include the total number of assembly instructions, the critical path for scheduling the code, the number of cycles of the scheduled code, and a grouping depending on the nature of the instructions. Examples can be found in [7]. MT1 uses the feedback from Salto in order to build an internal data structure that can be accessed by the TDL and the SSL by means of user-defined functions in the condition that can check for the identity of a code fragment and suggestions made by Salto. When such a transformation is used as the condition for an if construct in the SSL, we are able to select the transformations we want to apply to this fragment. Initially, MT1 compiles the program without performing any restructuring and the compiled program is scheduled by Salto. Salto identifies the code fragments that can be improved. It reports its diagnostics to a cost model that makes a decision on what kind of restructuring could be performed next. Then, MT1
reads the suggestions for restructuring and performs these. It is intended that a transformation sequence for a given program fragment is selected by following a systematic approach for searching through a domain of possible transformations. First, each different transformation is applied once and then the same follows for each branch of the tree. The search space is minimised by using a threshold condition for terminating branches that are not likely to yield an optimum result in their descendants. Some preliminary experiments using this strategy can be found in [10]; further work is in progress.
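A hedged sketch of such a search is given below; the fixed branching over transformations, the cost estimate obtained from the low-level feedback, and the threshold rule are assumptions about how the pruning could be organised, not the actual OCEANS driver.

#include <vector>

struct Node {
    std::vector<int> sequence;   // transformations applied so far
    double cycles;               // cost reported by the low-level feedback
};

// Depth-first search over transformation sequences, pruning branches whose
// estimated cost is already worse than 'threshold' times the best found so far.
void search(const Node& node, int depth, int maxDepth, int numTransforms,
            double threshold, Node& best,
            double (*evaluate)(const std::vector<int>&)) {
    if (node.cycles < best.cycles) best = node;
    if (depth == maxDepth) return;
    for (int t = 0; t < numTransforms; ++t) {
        Node child = node;
        child.sequence.push_back(t);
        child.cycles = evaluate(child.sequence);   // schedule with Salto, read Report.IL
        if (child.cycles <= threshold * best.cycles)
            search(child, depth + 1, maxDepth, numTransforms, threshold, best, evaluate);
    }
}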
6 Validation of the Initial System
In order to validate the compiler, four public domain multimedia codes have been selected [4]. These are a low bit-rate encoder/decoder for H.263 bitstreams, an MPEG2 encoder/decoder, an implementation of the CCITT G.711, G.721 and G.723 voice compression standards, and the Persistence of Vision Ray-Tracer for creating 3D graphics. At the high level, initial experimentation aimed at identifying those transformations that appear to be the most crucial in optimising code scheduling. Inspection of the benchmarks revealed that they contain many imperfectly nested double or triple loops with much overhead due to branch delays. In order to deal with such loops, a transformation that converts the imperfectly nested loop into a single loop has been suggested [13]. At the low level, the initial validation of the system has been carried out by applying four different optimisation sequences:

– S0 is the simplest sequence. First, the code is scheduled locally and then register allocation is performed.
– S1(u) is based on unrolling the loop body u times. The unrolled body is transformed into a superblock by guarding instructions. As in S0, register allocation is performed after local scheduling.
– S2(u) is similar to S1(u) except that register allocation is performed before scheduling. This usually requires fewer registers, allowing this sequence to succeed when S1(u) fails due to a lack of registers.
– S3 consists in applying a software pipelining algorithm.

The above optimisation sequences were validated, and indicative results, using the most time-consuming loops of H263, are illustrated in Figure 4. Every optimisation sequence has been applied to each of the six selected loops, and the size of the resulting VLIW code and the speed of the loop (i.e. the number of cycles per iteration) were computed. From the table, a well-known result is observed: the more we unroll a loop, the faster it runs (cf. columns S1(2), S1(3), S1(4)), but at the expense of a larger code size. As expected, S2(2) yields poor performance and large code because of the presence of false dependences. Finally, software pipelining (S3) gives the best performance, but at the expense of a very large increase in code size. Note that this transformation failed on the last loop due to a lack of registers.
Speed (cycles per iteration) and code size of the six selected loops under each optimisation sequence:

Loop   speed/size    S0     S1(2)   S1(3)   S1(4)   S2(2)   S3
1      speed          8      6       5       5       7       3
       size           8     12      16      20      13      75
2      speed          9      7       6       6      10       5
       size           9     13      18      22      19      55
3      speed         12      8       8       7      12       6
       size          12     16      24      28      24     121
4      speed         15     10       9       9      16       6
       size          15     20      28      34      31     172
5      speed         15     10      10       8      17       7
       size          15     19      29      33      33     179
6      speed         19     13      12      11      30       -
       size          19     25      36      44      59       -

C code of the loops:

Loop 1: for (i=xa; i<xb; i++) { d[i] = s[i]*om[i]; }
Loop 2: for (i=xa; i<xb; i++) { d[i] += s[i]*om[i]; }
Loop 3: for (i=xa; i<xb; i++) { dp[i] += (((unsigned int)(sp[i]+sp2[i]+1))>>1)*om[i]; }
Loop 4: for (i=xa; i<xb; i++) { dp[i] += (((unsigned int)(sp[i]+sp[i+1]+1))>>1)*OM[c][j][i]; }
Loop 5: for (i=xa; i<xb; i++) { dp[i] += (((uint)(sp[i]+sp2[i]+sp[i+1]+sp2[i+1]+2))>>2)*om[i]; }
Loop 6: for (k=0; k<5; k++) { xint[k] = nx[k]>>1; xh[k] = nx[k] & 1; yint[k] = ny[k]>>1; yh[k] = ny[k] & 1; s[k] = src + lx2*(y+yint[k]) + x + xint[k]; }
Fig. 4. Time consuming loops extracted from H263.

In most embedded applications, it is necessary to answer global questions such as: "Given a maximum code size, what is the highest performance that can be achieved?", or "Given a performance goal, what is the smallest code size that can be achieved?". Within the OCEANS compiler this trade-off is evaluated quantitatively by applying a novel compiler strategy. Thus, the choice of the most suitable optimisation is made a posteriori, when the impact of each possible transformation is known. More details can be found in [6].
7 Conclusion and Future Work
The previous sections outlined the current status of the OCEANS compiler. Although the results obtained so far, using the initial prototype, are satisfactory (compared with a production compiler), the implementation work still continues on both the high and low levels. A major part of the work during the next months and until the end of the project is devoted to the integration of the two levels, the development of a prototype framework for iterative compilation, and its experimental evaluation and tuning. Finally, it is intended that the system be
made publicly available in due time (at the moment Salto is available on request).
References
1. B. Aarts, et al. OCEANS: Optimizing Compilers for Embedded Applications. In C. Lengauer, M. Griebl, S. Gorlatch (Eds.), Proceedings of Euro-Par'97, Lecture Notes in Computer Science 1300, Springer-Verlag, 1997, pp. 1351-1356.
2. R. A. M. Bakker, F. Bregt, P. M. W. Knijnenburg, P. Touber, and H. A. G. Wijshoff. Strategy Specification Language. OCEANS Deliverable D1.2a, 1997.
3. A. J. C. Bik, P. J. Brinkhaus, P. M. W. Knijnenburg, P. Touber, and H. A. G. Wijshoff. Transformation Definition Language. OCEANS Deliverable D1.1, 1997.
4. A. J. C. Bik, P. J. Brinkhaus, P. M. W. Knijnenburg, P. Touber, H. A. G. Wijshoff, W. Jalby, H.-P. Charles, M. Barreteau. Identification of Code Kernels and Validation of Initial System. OCEANS Deliverable D3.1c, 1997.
5. A. J. C. Bik and H. A. G. Wijshoff. MT1: A Prototype Restructuring Compiler. Technical Report 93-32, Department of Computer Science, Leiden University, 1993.
6. F. Bodin, Z. Chamski, C. Eisenbeis, E. Rohou, and A. Seznec. GCDS: A Compiler Strategy for Trading Code Size Against Performance in Embedded Applications. Research Report 1153, IRISA, 1997.
7. F. Bodin and E. Rohou. High-level Low-level Interface Language. OCEANS Deliverable D2.3a, 1997.
8. B. Case. Philips Hope to Displace DSPs with VLIW. Microprocessor Report, 8(16), 5 Dec. 1994, pp. 12-15. See also http://www.trimedia-philips.com/
9. C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: a new model for loop cyclic register allocation. Proceedings of PACT'95 (Cyprus, June 1995).
10. J. Gurd, A. Laffitte, R. Sakellariou, E. A. Stöhr, Y. T. Chu, and M. F. P. O'Boyle. On Compile-Time Cost Models. OCEANS Deliverable 1.2b, 1997.
11. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE, 83(12), Dec. 1995, pp. 1625-1639.
12. W. Hwu, et al. The Superblock: An Effective Technique for VLIW and Superscalar Compilation. The Journal of Supercomputing, 7(1), May 1993, pp. 229-248.
13. P. M. W. Knijnenburg. Flattening: VLIW Code Generation for Imperfectly Nested Loops. Proceedings CPC'98, 1998. To appear.
14. OCEANS Web Site at http://www.wi.leidenuniv.nl/Oceans/
15. E. Rohou, F. Bodin, A. Seznec, G. Le Fol, F. Charot, F. Raimbault. SALTO: System for Assembly-Language Transformation and Optimization. Technical Report 1032, IRISA, June 1996. See also http://www.irisa.fr/caps/Salto/
16. R. Sakellariou, E. A. Stöhr, and M. F. P. O'Boyle. Compiling Multimedia Applications on a VLIW Architecture. Proceedings of the 13th International Conference on Digital Signal Processing (DSP97) (Santorini, July 1997), vol. 2, IEEE Press, 1997, pp. 1007-1010.
17. J. Wang, C. Eisenbeis, M. Jourdan, and B. Su. Decomposed Software Pipelining: a New Perspective and a New Approach. International Journal on Parallel Processing, 22(3), 1994, pp. 357-379.
Industrial Stochastic Simulations on a European Meta-Computer

Ken Meacham, Nick Floros, and Mike Surridge

Parallel Applications Centre, 2 Venture Road, Chilworth, Southampton, SO17 7NP, UK.
{kem,nf,ms}@pac.soton.ac.uk
http://www.pac.soton.ac.uk/
Abstract. This paper outlines the experiences of running a large stochastic multi-body simulation across a pan-European meta-computer, to demonstrate the use of the PROMENVIR tool within such a large-scale WAN environment. We describe the meta-application management approach developed by PAC and discuss the technical issues raised by this experiment.
1 Introduction
PROMENVIR (PRObabilistic MEchanical desigN enVIRonment) is an advanced meta-computing tool for performing stochastic analysis of generic physical systems, which has been developed within the PROMENVIR ESPRIT project (No. 20189) [1,2]. The tool provides a user with the framework for running a stochastic Monte Carlo (MC) simulation, using any preferred deterministic solver (e.g. NASTRAN), and then provides sophisticated statistical analysis tools to analyse the results. PROMENVIR facilitates investigations into the reliability of components, by analysing how uncertainties in their manufacture, deployment or environment affect key mechanical properties such as stresses. PROMENVIR does this by generating many hundreds of analysis "shots", in which the uncertain parameters are generated from statistical distributions selected by the user. Key output values are then extracted from each analysis shot, and can be analysed to determine the sensitivity of the component with respect to uncertain parameters, correlations between physical properties and behaviour, and clustering characteristics (e.g. failure set distributions). PROMENVIR is an open environment, and can generate analyses for essentially any solver. Since solver runs are independent, the overall simulation is intrinsically parallel and is therefore an ideal candidate for exploiting parallel HPC resources, which may include heterogeneous clusters of workstations, MPP or SMP platforms. These resources may reside within a company's Local Area Network (LAN), or may be accessible over a geographically-distributed Wide Area Network (WAN). The only restriction on the use of resources within a PROMENVIR simulation is the availability of hosts and solver licenses. The implication of
this is that, by co-operating with partner sites across a WAN, a PROMENVIR stochastic simulation of many hundreds of solver runs may be executed within a few hours, rather than many days, hence providing a rapid turnaround time for analysis of new designs. One of the main challenges within the PROMENVIR project was therefore to demonstrate the effectiveness of the PROMENVIR package within a pan-European meta-computing (WAN) environment, to solve a large stochastic problem of industrial significance. The Parallel Applications Centre (PAC) has successfully set up and run a series of “WAN experiments”, involving partner sites within a European consortium. This paper outlines the experiments which took place, the technical issues which arose and summarizes the results.
2 Testcase for WAN Experiment
The most widely available solver within the PROMENVIR consortium was the Multi-Body Simulation code SIMAID, developed by CEIT in San Sebastian [3], so it was decided that a suitable demonstration would be set up, using this code as the basis for a large stochastic simulation. The testcase chosen was a simulation of a satellite antenna deployment, during which the antenna unfolds to give a planar rim (see Figs 1 - 3). The antenna was modelled using a set of beams, with springs and dampers at the joints. A single deterministic simulation had already been carried out, which predicted that the outer rim of the antenna would be essentially flat. However, the antenna simulation model was constructed using components with identical properties, and did not take into account the manufacturing tolerances inherent within the components and their assembly.
Fig. 1. SIMAID antenna model (folded)
Fig. 2. SIMAID antenna model (partially deployed)
Fig. 3. SIMAID antenna model (fully deployed)
The PROMENVIR Monte Carlo simulation consisted of 500-1000 shots of the SIMAID solver, for the antenna testcase (at least 500 shots were required for convergence of results). For each shot, values for the critical component parameters (e.g. spring stiffness) were chosen from a normal distribution, which modelled the known tolerances. The results of the stochastic simulation could then be analysed to see the effect of manufacturing tolerances on the planarity of the deployed antenna, and hence the effectiveness of the antenna.
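As a sketch of what setting up one shot involves, the fragment below draws a parameter vector from normal distributions; the nominal values, tolerances and number of joints are invented for illustration and are not those of the antenna model.

#include <random>
#include <vector>

// One Monte Carlo shot: sample the uncertain parameters that would be written
// into the SIMAID input for this run (all numbers below are illustrative).
std::vector<double> sample_shot(unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> stiffness(1000.0, 20.0);  // spring stiffness
    std::normal_distribution<double> damping(5.0, 0.25);       // joint damping
    std::vector<double> p;
    for (int joint = 0; joint < 30; ++joint) {                 // number of joints assumed
        p.push_back(stiffness(rng));
        p.push_back(damping(rng));
    }
    return p;
}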
3 The European Meta-Computer
A total of 102 CPUs were provided for the experiment, from within the consortium (see Fig. 4). These were restricted to SGI architectures, since SIMAID was only available for IRIX or IRIX64 operating systems. However, many different types of SGI machines were made available, ranging from older Indigo R3000 workstations up to a 64-cpu Origin (at UPC, Barcelona). The topology of this hardware configuration is shown geographically in Fig 4. The master host was running at PAC, while slave hosts were used in UK, Spain, Germany and Italy, at various partner sites. We refer to such a meta-computer as a Parallel Virtual Computer (PVC). SIMAID was installed and licensed on each machine, and access permissions set up to allow remote access from the master SGI Indy at PAC, on which the PROMENVIR front end and resource manager were running.
4 Technical Issues

4.1 Meta-Application Manager
A fundamental component of the package is the Advanced Parallel Scheduler (APS), which has been developed at the Parallel Applications Centre (PAC). This has been designed as a meta-application manager, which orchestrates the use of the PVC by an application such as PROMENVIR. The APS is capable of making intelligent resourcing decisions, based on performance models (developed at PAC) of the application and resources. These models are used to predict resource usage for a solver in terms of CPU, memory, disk space and I/O traffic, and hence allows capacity planning of the available resources, before submission of any jobs. A hosts (PVC) database is set up, using the System Definition tool (SDM), which automatically obtains characteristics for each required host, within the LAN or WAN. Performance characteristics for hosts may also be entered here, to be used by the APS. The APS creates daemons on remote hosts, which are responsible for submitting jobs, copying files and communicating load information back to the master workstation. It is capable of initiating UNIX processes directly, but can also be used to control and submit meta-applications via conventional load-sharing software, such as LSF from Platform Computing [4].
Industrial Stochastic Simulations on a European Meta-Computer
1135
Fig. 4. Meta-computer (PVC) topology for WAN experiment: 102 CPUs in total, with the master at PAC Southampton (NCPU=25) and slave nodes at RUS Stuttgart (NCPU=12), CEIT San Sebastian (NCPU=12), CASA Madrid (NCPU=15), UPC Barcelona (NCPU=16) and Italdesign/BLUE Torino (NCPU=22).
4.2 Remote Host Access and Security
A major part of the work involved in setting up the WAN experiment was ensuring remote host accessibility. For example, some partners were able to set up, quite quickly, remote access to their hosts via rsh (remote shell); these sites generally consisted of academic institutions, where access is not normally too restricted. However, several partners (including PAC) had security features including firewall machines. This was a major obstacle, since rsh access is normally restricted and, even where available, a firewall machine generally allows access in one direction only. This makes it impossible to set up two-way rsh access between two secure sites, through two firewalls (i.e. one at each end). A further requirement on the meta-computer was to make any file transfers between remote sites both robust and secure. For these reasons, we decided to explore the use of SSH/SCP protocols, which turned out to solve most of our problems and requirements. SSH (Secure Shell) and SCP (Secure Remote Copy), developed by Data Fellows [5,6], provide secure TCP/IP connections between trusted sites. Features include:

– strong authentication (via RSA keys)
– automatic data encryption and compression
– access through firewalls (via SSH's officially registered port 22)
We were also able to exploit an advanced SSH facility called port forwarding, which enables TCP traffic to be forwarded to a remote host via the secure connection. Various partner sites already had an SSH facility, while others needed to install it. By using SSH and SCP protocols for all TCP connections, and by configuring partner firewalls to allow SSH access, we were then able to submit tasks transparently to the PVC, via the APS, and receive back results from solver runs, for collection by the PROMENVIR application. This demonstrated sharing of resources between sites for a single meta-application, even between sites which both have firewall security.
5 Running the Experiment
Several smaller-scale tests of PROMENVIR were carried out, firstly using machines that were available and easily accessible (i.e. without firewalls). It was soon found that the unreliability of the network was the major factor in any lost or failed solver runs (shots). Due to the statistical nature of the PROMENVIR environment, it is not critical if some shots are lost, since results will still converge, as long as enough shots are completed in total. However, we were finding that up to 10-20% of the remote tasks could fail, due to slow and unreliable connections to certain sites (particularly to Italy). This figure was unacceptable, particularly if the WAN was intended for use with other (non-stochastic) types of application. We therefore made several improvements to the APS, to enhance robustness and optimise data transfer. For example, rather than attempting single file transfers, we set up the APS to perform multiple retries. If a host could not be contacted after this, it was assumed to be down, and no further tasks were sent to it. As more and more machines were made available, we were able to increase the number of CPUs used for the experiments. Our target was to reach 100 CPUs across the meta-computer.
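The retry policy can be pictured with the small sketch below; the command line, the use of scp and the retry count are illustrative assumptions, not the actual APS code.

#include <cstdlib>
#include <string>

// Copy a file to or from a remote host, retrying a few times before the host
// is declared down (after which no further tasks are sent to it).
bool copy_with_retries(const std::string& src, const std::string& dest,
                       int maxRetries = 3) {
    const std::string cmd = "scp -q " + src + " " + dest;
    for (int attempt = 0; attempt < maxRetries; ++attempt)
        if (std::system(cmd.c_str()) == 0)
            return true;
    return false;
}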
6 WAN Experiment Findings

6.1 Results
A summary of the most successful experiment is shown in Table 1. This shows a list of the partner sites, and their contributions to the meta-computer in terms of CPUs provided (as SGI workstations or SMP platforms). Of the 8 sites represented in this table, 5 had firewall security. 6.2
6.2 Availability and Reliability
Of course, the fact that remote hosts are set up and installed does not necessarily imply that they will be available at the time of running the experiment.
Table 1. PVC host usage statistics during WAN experiment

                     CPUs Availability              Shot Statistics
Partner              Nproc  In PVC  Access  Used    Failed  Successful  Total
PAC                  15     15      15      14      1       150         151
So'ton University    10     10      9       6       0       40          40
UPC                  16     16      16      16      1       275         276
RUS                  12     12      11      9       0       104         104
CASA                 15     15      15      14      15      184         199
CEIT                 12     12      12      8       0       98          98
ItalDesign           11     11      6       5       2       63          65
Blue                 11     11      11      7       6       61          67
Grand Totals         102    102     95      79      25      975         1000

Total CPUs installed:        102     Elapsed Execution Time:   4:39:16
Total CPUs defined in PVC:   102     Approx Single CPU Time:   250 hrs
Total CPUs available:        95
Total CPUs used in WAN:      79
Of the 102 CPUs defined in the PVC, only 95 could be contacted at the time of running the experiment. We found that this number fluctuated during the day; this was mainly due to the network load, but was sometimes due to machines being down, or turned off. Of the 95 CPUs available to PROMENVIR via the APS (i.e. with daemons running), 79 were actually used during the simulation to run solver tasks. The reason for this lies in how the APS decides to allocate tasks to hosts. The CPU load is measured on each remote host (via the UNIX uptime command) and sent back to the APS. If this is below a certain threshold, and the performance models indicate that it is advantageous to do so, a task will be submitted to that host; otherwise the task will be scheduled elsewhere. We found that jobs were already running on certain hosts (generally those which had not been provided exclusively for the WAN experiment), so these were not used by the APS to run PROMENVIR tasks. However, we also found that some machines which were apparently not running any jobs were still not used by the APS; these hosts still appeared overloaded. We eventually traced this problem to certain SGI workstations which were running screen lock programs that fully loaded the machine's CPU resources when no other (higher priority) jobs were running. A total of 1000 SIMAID solver shots were submitted to the meta-computer, of which 975 completed successfully, and the total elapsed time of the PROMENVIR simulation was only 4h:39m:16s. This compares with an approximate time of 250 hours on a typical SGI workstation (a single SIMAID run taking around 15 mins). The 25 shot failures were still mainly due to network problems, which will always be present to some extent within a large meta-computer,
unless very fast and reliable links are used (e.g. ATM). A handful of shots were lost due to solver problems: in certain extreme cases, combinations of random input parameters can cause the solver to crash, though this is rare.
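For comparison, the 250 hours of single-CPU work completed in an elapsed time of about 4.65 hours corresponds to an effective speedup of roughly 54 on the 79 CPUs actually used.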
6.3 Analysis of Results
The detailed analysis of the results is beyond the scope of this paper and will be published elsewhere, in conjunction with results from further simulations which are being carried out using differing ranges of manufacturing tolerances. However, preliminary results show that, when a stochastic approach is used, there is a broader variation in the deployed geometry, and hence a poorer planarity in the dish, than predicted by the deterministic simulation. The stochastic approach also yields insights that were not previously available. For example, by examining how the planarity varies with the standard deviation of the manufacturing tolerances, it is possible to work backwards and ask questions such as “what tolerances must be ensured in the manufacturing process to produce an antenna of a given planarity?”. Such questions are highly relevant to industry.
7 Conclusions
PROMENVIR has been demonstrated to be a highly useful tool for running industrial stochastic simulations across a European meta-computer, showing that corporate-scale meta-computing resources really can be used to solve large problems for industry. The APS, developed at PAC, has facilitated the management of meta-applications and meta-computing resources (including firewall-protected systems), and is also being used with other meta-computing applications [7,8]. Results of the WAN experiment have shown that, once a meta-computer has been set up, further experiments can be carried out routinely. However, the overheads of setting up remote sites can be quite large. Furthermore, it has proved difficult to obtain full (exclusive) access to machines during a large-scale experiment without the full cooperation of the contributing partners. Our experience suggests that, due to a combination of background loads, network performance and human factors (e.g. failure to log out of a machine), around 80% resource availability is the maximum that can realistically be expected for large-scale meta-applications. Finally, there will always be some inherent unreliability within a large-scale meta-computing resource, mainly due to network problems. These must be eliminated as far as possible and, where uncontrollable problems arise, the system and application must be made as robust as possible, to ensure that a simulation does not suffer fatally from individual task failures.
Acknowledgements
The work reported here was part of the ESPRIT PROMENVIR project. We are grateful to our partners (CASA, UPC, CEIT, RUS, Atos, ItalDesign and Blue Engineering) for a most fruitful collaboration.
References

1. Marczyk, J.: Meta-Computing and Computational Stochastic Mechanics. In: Marczyk, J. (ed.): Computational Stochastic Mechanics in a Meta-Computing Perspective (1997) 1-18
2. PROMENVIR product web page: http://www.atos-group.de/cae/index-promenvir.htm
3. CEIT home page: http://www.ceit.es/
4. Platform Computing (LSF) home page: http://www.platform.com
5. Data Fellows home page: http://www.datafellows.com/
6. Further information on SSH may be found at: http://www.cs.hut.fi/ssh/
7. Upstill, C.: Multinational Corporate Metacomputing. Proceedings of the ACM/IEEE SC97 Conference, San Jose, CA, November 15-21, 1997
8. Hey, A.J.G., Scott, C.J., Surridge, M., Upstill, C.: Integrating Computation and Information Resources - An MPP Perspective. 3rd Working Conference on Massively Parallel Programming Models (MPPM-97), London, November 12-14, 1997
Porting the SEMC3D Electromagnetics Code to HPF

Henri Luzet (1) and L.M. Delves (2)

(1) SEMCAP, 20 rue Sarinen, Silic 270, RUNGIS 94578, France
(2) N.A. Software Ltd, 62 Roscoe St, Liverpool L1 9DW, UK
[email protected]

Abstract. Within the Esprit Pharos project, four industrial simulation codes were ported to HPF1.1 to provide a test of the suitability of this language for handling realistic Fortran77 applications, and to estimate the effort involved. We describe here the port of one of these: the SEMC3D electromagnetic simulation code. An outline of the porting process, and results obtained on the Meiko CS2, are given.
1 The Pharos Project
Pharos was an Esprit-funded project aimed at:
– Bringing together an HPF-oriented toolset from three vendors;
– Using this toolset to develop HPF versions of four industrial-strength codes owned by project partners, to estimate the effectiveness of HPF1.1 and current compilers for handling such codes, and the effort involved in producing effective HPF ports.

The project ran from January 1996 to December 1997; partners included:

Tools Vendors:
– N.A. Software: HPF Compiler and source level debugger (HPFPlus);
– Simulog: Fortran77 to Fortran90 conversion tool (FORESYS);
– Pallas: Performance and communications monitor (VAMPIR).

Code Owners:
– SEMCAP: Electromagnetic Simulation code (SEMC3D);
– CISE: Finite Element Structural code (CANT-SD);
– Matra BAe Dynamics: Compressible Flow code (Aerolog);
– debis: Linear Elastostatics code (DBETSY3D).

HPF Experts:
– GMD: Developers of ADAPTOR, an early and highly efficient HPF research compiler;
– University of Vienna: Developers of Vienna Fortran, a precursor to HPF, and of VFCS, an interactive research compilation system implementing Vienna Fortran and later modified to accept standard HPF.

We describe here the HPF port of SEMC3D, and give benchmark results obtained with this code.
2 The SEMC3D Code

2.1 The Application
The SEMC3D package is an Electromagnetic code developed to simulate EMC (ElectroMagnetic Compatibility) phenomena. It has a wide range of applications. For example, in the automotive industry electric and electronic components and systems must comply with increasingly tougher regulations. They must resist hostile electromagnetic environments and show reduced emissions. The SEMC3D package computes the electromagnetic environment of complex structures made up of metal, dielectric, ferrite and multilayers or human tissues. These structures could be surrounded with ionised gas, laid on a realistic or perfect ground, or equipped with antennas or cables. The electromagnetic environment should be taken as the knowledge of electric and magnetic fields inside and outside the analysed structures. Many source models can be used: NEMP (Nuclear ElectroMagnetic Pulse), plane wave, current injection, lightning and all analytical and digital signals. The connection of the SEMC3D software with specialised Cable Networks codes (such as CRIPTE) allows many applications in the field of EMC problems (emission and/or susceptibility) as well as test cases where EMI (ElectroMagnetic Interference) phenomena occur.
2.2 Numerical and Computational Features
The original Fortran77 source code comprises approximately 3500 lines. The kernel of the SEMC3D code is built around the Leap-Frog scheme in time and space. The Leap-Frog scheme uses only the components located half a space-time step away from the component being updated. This characteristic gives a strong locality in the solution of Maxwell's equations; that is to say, the updating of variables is independent of the order of their appearance, and each variable depends only on its closest neighbours through another variable. For each cell 6 variables (Ex, Ey, Ez, Hx, Hy, Hz) are necessary to solve Maxwell's equations. Of course, the high locality of the finite-difference method leads to a domain decomposition of the work space.
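As a schematic illustration only (this is not the SEMC3D source; array shapes, material coefficients and boundary treatment are simplified), one such update written in Fortran 90 array syntax has the following shape:

      ! Hedged sketch of a leap-frog update for the Hx component on a uniform
      ! grid: each point uses only its nearest Ey/Ez neighbours, which is the
      ! locality exploited by the domain decomposition.
      subroutine update_hx(hx, ey, ez, dt, dy, dz, mu)
        real, dimension(:,:,:), intent(inout) :: hx
        real, dimension(:,:,:), intent(in)    :: ey, ez
        real, intent(in) :: dt, dy, dz, mu
        integer :: nx, ny, nz
        nx = size(hx,1); ny = size(hx,2); nz = size(hx,3)
        hx(1:nx,1:ny-1,1:nz-1) = hx(1:nx,1:ny-1,1:nz-1) + (dt/mu) *          &
             ( (ey(1:nx,1:ny-1,2:nz) - ey(1:nx,1:ny-1,1:nz-1)) / dz          &
             - (ez(1:nx,2:ny,1:nz-1) - ez(1:nx,1:ny-1,1:nz-1)) / dy )
      end subroutine update_hx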
2.3 Message Passing Version
The usage of electronics in vehicles has increased at a phenomenal rate in the last decade, and continues to do so, for example to enhance automation, driver comfort and vehicle safety. As a result, the computer power needs (CPU time and memory size) steadily increase as more complex and more accurate simulations are required to meet customer needs. The use of MPP systems is one way to satisfy these needs, and a message passing version of the code was developed in the early 90s and tested on many platforms, including the Paragon, CS2, CM5 and workstations, with the NX, MPI, PVM and PARMACS libraries.
The design of parallelisation is founded on a standard domain decomposition approach based on nodal variables, namely there is a correspondence between one calculation node and one 3D element array (IJK representation). This correspondence is used both in the HPF version and in the message passing version, but the implementations differ. The message passing version used a decomposition with overlap. At each step of the computation, the calculations are made in each domain and then the quantities on the boundaries of domains are updated. This is a fine grain parallelisation which requires very substantial effort in a message passing implementation. A major advantage of HPF is its intrinsic ability to provide a fine grained parallelisation with modest development effort; and this is what we sought to produce.
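For contrast, the boundary update that the message passing version must program explicitly looks roughly as follows (a hedged sketch with invented names, not the SEMC3D source), shown for one field component and a one-dimensional cut of the domain:

      ! Hedged sketch: after each local time step, every domain exchanges its
      ! boundary planes with its two neighbours along the cut dimension.
      subroutine exchange_z_halo(ex, nx, ny, nzloc, left, right)
        implicit none
        include 'mpif.h'
        integer, intent(in) :: nx, ny, nzloc, left, right
        real, dimension(nx,ny,0:nzloc+1), intent(inout) :: ex
        integer :: status(MPI_STATUS_SIZE), ierr
        ! send the top interior plane to the right neighbour, receive the left
        ! neighbour's top plane into the lower ghost plane
        call mpi_sendrecv(ex(:,:,nzloc), nx*ny, MPI_REAL, right, 0,          &
                          ex(:,:,0),     nx*ny, MPI_REAL, left,  0,          &
                          MPI_COMM_WORLD, status, ierr)
        ! and the symmetric exchange for the upper ghost plane
        call mpi_sendrecv(ex(:,:,1),       nx*ny, MPI_REAL, left,  1,        &
                          ex(:,:,nzloc+1), nx*ny, MPI_REAL, right, 1,        &
                          MPI_COMM_WORLD, status, ierr)
      end subroutine exchange_z_halo

Repeated for each field component and each cut dimension, this kind of bookkeeping is what the HPF version avoids.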
3 Outline Porting Procedure
The porting procedure had the same steps for each code; these steps are those foreseen as appropriate at project start:

Code Cleaning: The original Fortran77 code was tidied up to remove language features which were either:
– non-standard Fortran77 (some of these remain in many codes either as legacies from earlier Fortran versions or because they have been widely supported by compilers);
– or can be replaced even within Fortran77 by better constructs.
This stage proved to be well-supported by the FORESYS tool. However, some work was necessary by hand.

Translation to Fortran90: This provides a better base for HPF. The FOREST90 tool (part of the FORESYS suite) proved to give good support for this, providing an automatic translation which was then hand polished.

Improving the Fortran90 version: to make the source better suited to HPF by utilising array syntax where possible, and by removing further features (storage association, sequence association, assumed size and explicit shape array arguments) which cannot directly be parallelised in HPF. Static storage allocation was replaced by dynamic storage allocation as part of this process. This stage made use of the FORESYS vectorising tool; but much of the work was done manually.

Inserting HPF Directives: was the final stage. It involved also a small amount of code restructuring:
– loops were restructured to be INDEPENDENT where possible;
– procedures were marked as HPF SERIAL where appropriate;
– some I/O statements were moved to improve parallelisation.
4 Porting the SEMC3D Code

4.1 Translation to Fortran 90
As with all of the codes, FORESYS was used to generate a Fortran90 version semi-automatically. The main features of the Fortran 90 version are:
– New Fortran 90 syntax used.
– All variables are declared.
– Interface modules are provided for every procedure.
– Sequence association is not used; COMMON blocks have been replaced by modules.
– Introduction of the INTENT attribute for dummy arguments.
– Removal of obsolete features.

In the Fortran77 version, all arrays are defined in the main program with their exact dimensions specified via an include file. Fortran 90 is able to introduce dynamic arrays, so we have rewritten the main program using this feature. The next stage was the introduction of array syntax or array operations when possible. This work was very easy because the initial version of SEMC3D was written in Fortran 80 for the Alliant platform. In Fortran 77 the actual and the dummy arguments of subroutines can differ in rank. In addition, many "Fortran77" codes use vendor extensions which allow dummy and actual arguments to differ also in type or kind. The original SEMC3D code uses all these facilities. We used the capacity of Fortran 90 to create user-defined generic procedures to keep the same calls. After the initial translation was completed we carried out portability tests, and final tuning, on several platforms: IBM, Cray and HP with vendor-supplied compilers, and on the Sun and Meiko CS2 platforms at Vienna with the NASL FortranPlus compiler. Some minor parts were rewritten to bypass various compiler bugs.
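As an illustration only (the names below are invented and not taken from SEMC3D), a generic interface of the following shape lets the original call sites keep passing actual arguments of different rank to what used to be a single routine:

      ! Hedged sketch: one generic name resolves at compile time to the
      ! specific routine whose rank matches the actual argument, so the
      ! existing calls do not have to change.
      module clear_ops
        implicit none
        interface clear
          module procedure clear_rank1, clear_rank3
        end interface clear
      contains
        subroutine clear_rank1(a)
          real, dimension(:), intent(out) :: a
          a = 0.0
        end subroutine clear_rank1
        subroutine clear_rank3(a)
          real, dimension(:,:,:), intent(out) :: a
          a = 0.0
        end subroutine clear_rank3
      end module clear_ops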
4.2 HPF Port
As noted above, the HPF port was performed using fine grain parallelisation. No major difficulties were encountered; the code was originally designed to vectorise well, and the majority of loops have no data dependency.

Choosing the partition
SEMC3D works within a three dimensional rectangular envelope, which maps naturally onto a 3D processor grid:

!HPF$ PROCESSORS Grid_3d(number_of_processors(dim=1), &
!HPF$                    number_of_processors(dim=2), &
!HPF$                    number_of_processors(dim=3))

The user can specify the actual size and shape of the grid in the run command.

Using TEMPLATE
The simulation space is a box which can be defined by the following TEMPLATE:

!HPF$ TEMPLATE, DIMENSION(1:nx,1:ny,1:nz) :: BOX
where nx, ny and nz are the number of discretization points on each axis. Following our message passing experience, we chose to distribute this TEMPLATE with the (BLOCK,BLOCK,BLOCK) attribute; this was extended to BLOCK(mx), BLOCK(my), BLOCK(mz) during tuning, with mx, my, mz chosen to optimise load balancing. However, since all directives are set via a single INCLUDE file, the actual mapping used can be trivially changed at the cost of a recompilation.

Arrays: Alignment and Distribution
We distinguish two cases: the main program and the subroutines. In the main program, all the arrays are aligned with the BOX template, and in the subroutines they are distributed according to the inherited mapping. For example, in the main program:

!HPF$ ALIGN ex(I,J,K) WITH BOX(I,J,K)

and in the subroutines:

!HPF$ DISTRIBUTE ex *(BLOCK,BLOCK,BLOCK)

Major Parallel Constructs Used
The two major sources of parallelism in the HPF code are:

Use of Array Expressions: As noted, SEMC3D is highly vectorisable. Many of the Fortran77 DO loops translate naturally to Fortran90 array assignments (and FORESYS carries out many of these translations automatically). These are very efficiently handled by the NASL HPFPlus compiler.

Use of HPF SERIAL Procedures: Computationally expensive code sections which could not readily be expressed using array syntax were wrapped in HPF SERIAL procedures. These are typically called within a loop; the NASL HPFPlus compiler places the calls intelligently by analysing the processor residency of the actual arguments, and the resulting loops run in parallel with very low overheads. Note that this approach can be used for both fine grain and coarse grain parallelism; in this application the granularity is low.
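As an illustration only (invented names, not the SEMC3D source), the second construct has roughly the following shape; a one-dimensional distribution is used here so that each plane passed to the serial routine is wholly local to one processor:

      ! Hedged sketch: an expensive section that does not fit array syntax is
      ! wrapped in a serial procedure and called from an INDEPENDENT loop; the
      ! compiler places each call on the processor owning the actual argument.
      ! The EXTRINSIC(HPF_SERIAL) spelling follows the HPF approved extensions;
      ! the exact form accepted is compiler dependent.
      extrinsic(hpf_serial) subroutine fix_plane(p)
        real, dimension(:,:), intent(inout) :: p
        p(1,:) = 0.0                 ! placeholder for genuinely serial work
      end subroutine fix_plane

      subroutine fix_all_planes(e)
        real, dimension(:,:,:), intent(inout) :: e
!HPF$   DISTRIBUTE e *(*,*,BLOCK)
        interface
          extrinsic(hpf_serial) subroutine fix_plane(p)
            real, dimension(:,:), intent(inout) :: p
          end subroutine fix_plane
        end interface
        integer :: k
!HPF$   INDEPENDENT
        do k = 1, size(e,3)
          call fix_plane(e(:,:,k))
        end do
      end subroutine fix_all_planes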
4.3 Indirection
Some boundary conditions used in SEMC3D can only be handled using indirection. In the Fortran77 version, these conditions are located with an integer pointer array. In the first translation step we replicated these arrays on all processors, but at a later stage we recoded the one dimensional pointer array to reflect the 3-D nature of the problem, and used a masked WHERE to retain parallelism:

      integer iptstr(1:nstr)
      do i = 1, nstr
        e(iptstr(i)) = 0
      enddo

became

      real,    dimension(1:,1:,1:), intent(inout) :: e
      logical, dimension(1:,1:,1:), intent(in)    :: iptstr
!HPF$ DISTRIBUTE e *(BLOCK,BLOCK,BLOCK)
!HPF$ ALIGN iptstr(I,J,K) WITH *e(I,J,K)

      WHERE (iptstr(:,:,:))
        e(:,:,:) = 0
      ENDWHERE
5 Benchmarks
We have benchmarked the code using several datasets:

cube 07: a small (44*44*44) test cube used primarily for single processor testing; primary data requirements 3MBytes.
cube 08: a modest (54*54*54) test cube; primary data requirements around 6MB.
cube 09: an 80*80*80 test cube, with primary data requirements around 20MB. This provides a test problem small enough to be run on a single processor of the Meiko CS2, which has only limited storage (48MB user-accessible per processor), but large enough to provide a test of the parallelism.
CAR 1: an industrial model of a complete car, with primary data requirements around 100MB. This represents a typical production problem for the code.
5.1 Single Processor Benchmarks
The first stage in the port was the conversion of the Fortran77 code to Fortran90. This conversion itself has major advantages: the resulting code is cleaner, and written at a much higher level, so that it is easier to maintain and develop. Utilising these advantages in practice means relying on the Fortran90 compiler available on the chosen target system. The NASL HPFPlus compiler also has this reliance, since it targets Fortran90. Fortran90 compilers however still vary in their optimisation level for specific targets. Table 1 shows the single processor performance of the original Fortran77 code, of the converted Fortran90 code, and of the final HPF code running on one processor of the Vienna Meiko CS2. The compilers used were:
– Fortran77: F77 SC3.0.1
– Fortran90: NASL FortranPlus Release 2.0beta
– HPF: NASL HPFPlus Release 2.1
We see that the CS2 FortranPlus compiler has not been optimised to the same level as the Sun F77 compiler. Using HPF on one processor introduces a factor of between 2.8 and 3.4 in the single processor times. Since (surprisingly perhaps) the HPF times are better than the F90 times, all of this is due to the performance of the backend compiler.
Table 1. Single Processor Times (seconds) on the CS2

Compiler    Cube 07   Cube 08   Cube 09
F77 -O2     26.0      46.2      147
F77 -O5     16.6      31        101
F90 -fast   79.4      106       348
HPF -O      56.5      88        281
5.2 Parallel Performance of the HPF Code
We have run the converted HPF code on the Meiko CS2 at the University of Vienna; this is one of the two project platforms for Pharos. Table 2 shows the performance achieved. The times recorded are for four simulation time periods for cube 08 and cube 09, and one time period for CAR 1.
Table 2. Performance of the HPF version of SEMC3D on the Meiko CS2. Times are in seconds.

Processors   Cube 08   Cube 09   CAR 1
1            89        281       -
2            52        154       -
4            35        76        268
8            13        39.7      152
16           8.6       22.7      86
32           6.2       10.7      45
The speedups achieved are very good; the code clearly scales well: for cube 09, for example, the time falls from 281 s on one processor to 10.7 s on 32, a speedup of roughly 26. The mildly anomalous time measured for cube 08 on eight processors was repeatable but not really understood. CAR 1 could not be run on fewer than four processors.
5.3 Comparison with a Message-Passing MPI Port
As noted above, an MPI/Fortran77 version of the code had already been developed prior to the project start. Interest in HPF stemmed from the cost of producing and maintaining this port, and especially the difficulty of keeping it in step with ongoing development of the serial code. We ran the cube 09 benchmark with both MPI and HPF versions on the Meiko CS2, with results shown in Table 3. The single processor performance advantage of the Fortran77 compiler is maintained roughly constant over the range of processors used; that is, the speedups achieved by the HPF and MPI codes are essentially the same. The MPI figures are roughly a factor of 3 better than those achieved with HPF; comparison with Table 1 shows that this is attributable wholly to the lack of optimisation of the backend Fortran90 compiler.
Table 3. MPI (F77 -O5) vs HPF (HPFPlus -O) for cube 09. Times are in seconds.

Procs   MPI    HPF    Ratio
1       101    281    1:2.8
2       57.5   154    1:2.7
4       28.3   76     1:2.7
8       13.5   39.7   1:2.9
16      6.9    22.7   1:3.3
32      5.7    10.7   1:1.9

6 Discussion
The speedups obtained in the benchmarking are extremely encouraging; what do the results tell us about the use of HPF at this stage in compiler development?
6.1 Friendliness of the HPF Language
This has certainly been confirmed. The HPF version of the code is as easy to read as the Fortran90 version, and certainly much easier to maintain than either the Fortran77 or the message passing version. This last is particularly hard to maintain: the implementation of a fine grain message passing parallelisation obscures the structure of the original serial code, while this structure is cleanly maintained in the HPF version.
6.2 Requirements on the HPF Compiler
One important aspect of the initial porting process was to identify those requirements on the HPF compiler needed to produce a final scalable working version of the code. Here is a summary of the findings:

Fortran90 Compiler Optimisation: Improved runtime efficiency in the backend Fortran90 compilers would help the benchmarks. The NASL HPFPlus compiler targets Fortran90 directly, and hence can (and necessarily does) rely on the many optimisations available to such a compiler. In the medium term, Fortran90 compilers will generate better code for well written programs than the equivalent code passed through a Fortran77 compiler. This is already true for some Fortran90 compilers, but not yet on the CS2.

HPFPlus Efficiency: Release 2.0 of HPFPlus has substantial optimisations which have reduced the HPF overheads in many codes to between zero and 20%; indeed, here they prove to be negative overheads. HPF overheads are certainly not a significant issue with SEMC3D.
Language Support: The major language features used in this port were:
– Multidimensional BLOCK(m) distributions;
– ALIGN;
– PROCESSOR arrays with runtime determined size and shape;
– Efficient handling of distributed array arguments;
– Array expressions using distributed components;
– Autoparallelised DO loops with calls to HPF SERIAL procedures.
These features were sufficient to produce an extremely effective parallel code with parallel efficiency as good as the pre-existing MPI version; the HPF source is however very much easier to maintain, and further develop, than the message– passing MPI code.
Acknowledgements This work was supported in part via Esprit Project No P20162, Pharos. We are indebted to our colleagues in this project for many helpful discussions, and aid in running on a variety of systems. Especial thanks are due to Thomas Brandes, Junxian Liu, Ruth Lovely, John Merlin and Dave Watson.
HiPEC: High Performance Computing Visualization System Supporting Networked Electronic Commerce Applications

Reinhard Lüling and Olaf Schmidt

Department of Computer Science, University of Paderborn
Fürstenalle 11, D-33102 Paderborn, Germany
{rl,merlin}@uni-paderborn.de
Abstract. This paper presents the basic ideas and the technical background of the Esprit project HiPEC. The HiPEC Project aims at integrating HPCN technologies into an electronic commerce scenario to ease the interactive configuration of products and their photorealistic visualization by a high performance parallel rendering system.
1 Introduction
It is well accepted that electronic commerce systems will be of large benefit to European industry if these technologies are actively used by companies and integrated into their business as soon as possible. SMEs in particular can benefit, as market access and communication with suppliers and larger industries can be eased considerably using Internet technologies. The main objective of the HiPEC project is to integrate advanced HPCN technologies to form a generic electronic commerce application. This application is aimed at giving a large number of SMEs a configuration and visualization tool to support the selling process within the showroom and to enlarge their business using the same tool also on the Internet. The enabling technology behind this tool is a high performance computing system as well as advanced networking technology. The HPC system is used to run the configuration and visualization system, and networks give the end-users access to the HPC system so that they can run the service from their showroom. It is the special aim of this project to make the service as generally usable as possible. This means that open interfaces have to be used that allow a future integration of other components to customize the system to any special needs. The HPC equipment used for the electronic commerce application is provided as a service that the end-users can reach via suitable communication networks. Thus, the end-users can have cost-efficient access to HPC technologies. The potential customers of these technologies are retailers of industrial products. The application range selected here is the reselling of bathroom equipment.
This work is supported by the EU, ESPRIT project 27232 (Domain 6, High Performance Computing and Networking).
In a typical scenario, the customer can configure the bathroom equipment in an interactive configuration session. To ease the decision process of the customer, the configuration of the final product is supported by presenting the customer with a photorealistic animation of the composed bathroom. As such a presentation has to be computed in a short time and at the highest quality, it is necessary to use a high performance computing system for the realistic image synthesis. In section 2 the basic ideas of HiPEC will be discussed and a more detailed view of the technical background of the project is given.
2 Approach of the HiPEC Project
The HiPEC project aims at integrating HPCN technologies into an electronic commerce scenario to ease the interactive configuration of products. In the project scenario a retailer is equipped with a PC that is connected to the Internet. This PC holds a local database of product information from different manufacturers of bathroom equipment. Product information that was previously only available in books and brochures is digitized and loaded into the local database, which can be frequently updated by the manufacturer of the bathroom equipment via the Internet. A customer who wants to configure his bathroom can do this using a CAD program on the PC. During this selling process an employee of the reseller guides the selection of products and shows the different scenarios using the local CAD program on the PC. Using the local database, selections can easily be made and different scenarios can be presented to the customer. Configuration of products is only one aspect. The other and even more important aspect is the visualization of a product. Customers want to have a visual impression of their product. This is of special importance in areas where furniture and other consumer goods are configured and offered to customers. As a high quality animation of products can lead to increased attractiveness, the HiPEC project integrates a rendering service that generates photorealistic pictures of the selected product. As these animations can only be generated using HPC computing systems, a parallel computing system is used to generate these pictures. To generate them at the highest quality, the architecture of the bathroom is loaded to a remote parallel computer system where a database is installed that contains detailed graphical models of all products. The model of the complete bathroom is built here and forwarded to the parallel system, which generates a picture of the highest quality. The encoded picture is sent back to the PC to be printed and handed over to the customer. Thus, the HiPEC electronic commerce system contains the following components (see Figure 1).

User Interface and Configuration System
The user interface and configuration system is installed on a PC in the showroom of the seller of the product. Together with the seller, the customer configures the final product and has the possibility of using the remote high-end rendering service installed on the parallel computer system for a high-quality visualization.
[Figure 1 (schematic): a client system in the retailer's showroom (Internet PC with the user interface and configuration tools) is connected via ISDN and WWW interfaces to an HPC frontend at the service provider, which manages users and resources, holds a database of scenes and images, and passes scene data and rendering requests via an NFS interface to the rendering server frontend and the parallel HPC rendering system (ray tracing, radiosity) at the HPC centre.]

Fig. 1. Architecture of the HiPEC-System
This visualization might be initiated only at the end of the selling process, or also in between to guide the user. The user is enabled to store generated product visualizations and configured products in a database. This database is used to set up a virtual showroom on the World Wide Web, which can be visited by potential customers from their PCs at home.

Communication network
The configuration and visualization service will be offered in the showroom of the retailers. Thus, communication networks will be used for different purposes. We will use ISDN dial-up lines to connect the PCs installed in the end-users' showrooms to the parallel server. Using the database installed at the site of the parallel server, the bandwidth of these lines is large enough to support the application.

Service Broker
The electronic commerce application is realised by a service broker which is installed on frontend computers to the HPC system. The service broker is responsible for managing the user accounts and for performing efficient job-scheduling of rendering jobs on the available hardware resources.

Database of product components
To generate a high quality animation, the complete product that is configured on the local PC has to be delivered to the parallel rendering server. To save communication bandwidth, the most important information about the structure of the sub-components, the structure of the surface of the object and other information is stored in a database that is located on a frontend PC connected to the parallel system.
Using this database, the client system only has to communicate a small amount of information about the configuration. The complete scene that has to be animated is generated on the parallel computing system.

Parallel rendering system
Ray tracing and radiosity algorithms currently implemented in image synthesis systems provide the necessary rendering quality, but these methods suffer from extensive computational costs and enormous memory requirements. Parallel computing and graphics are two fields of computer science that have much to offer each other. Efficient use of parallel computers requires algorithms where the same computation is performed repeatedly or where separate tasks are performed with little or no coordination, and most computer graphics tasks fit these needs well. On the other hand, most graphics applications have an insatiable appetite for raw computational power, and parallel computers are a good way to provide this power. It can be concluded that there is an obvious synergy between computer graphics and parallel computation. Several sophisticated methods for efficient parallel simulation of illumination effects in complex environments have been developed in the past [1,2,4]. In the HiPEC approach the parallel rendering system is installed on a parallel computing platform located at the site of a service provider that offers the parallel computing service.

Overall, this application shows the benefits of networked HPC technologies in an electronic commerce application that goes beyond presenting products on the WWW. The technology is general, i.e. it can be applied to a large number of other applications, and is presented here for a complete industrial area, allowing the benefits of these technologies to be demonstrated.
3 Conclusion
The HiPEC project integrates advanced HPC technology into a special electronic commerce application supporting the selling process within the showrooms of retailers of bathroom furniture. In this way even SMEs can profit from the advantages of powerful HPC systems, which are too expensive for a local installation. The design of the HiPEC system is kept as general as possible in order to allow the integration of other components to customize the system to the demands of additional electronic commerce applications.
References

1. Baum, D.R.; Winget, J.M.: Real time radiosity through parallel processing and hardware acceleration. Proc. of SIGGRAPH 90, March 1990, pp. 67-75
2. Chalmers, A.G.; Paddon, D.J.: Parallel processing of progressive refinement radiosity methods. Proc. 2nd Eurographics Workshop on Rendering, May 1991
3. Schmidt, O.; Lange, B.: Interaktive photorealistische Bildgenerierung durch effiziente parallele Simulation der Lichtenergieverteilung. Proc. of Informatik97, 1997, pp. 476-485, Springer-Verlag
4. Schmidt, O.; Rathert, J.; Reeker, L.: Parallel Simulation of the Global Illumination. Accepted for PDPTA'98, Las Vegas, July 1998